Paul Christiano
1 min read · Apr 1, 2018


We’d want to do adversarial training to try to find weird situations where the agent fails to shut down appropriately. These will probably be situations where the agent believes that the current datapoint is very unlikely to be synthetic, and is almost certainly a naturally occurring input where it has the opportunity to do something catastrophic if it doesn’t shut down.
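As a rough illustration of what that search might look like, here is a minimal sketch in Python. Everything in it is hypothetical: `agent_shuts_down`, `p_synthetic`, and the random-search loop are toy stand-ins, not the actual training setup; the point is just that the adversary scores candidate inputs by how natural the agent believes them to be and whether the agent fails to shut down on them.

```python
import numpy as np

# Hypothetical stand-ins: in practice these would be the trained agent and
# its own estimate of whether the current input was synthesized.
def agent_shuts_down(x: np.ndarray) -> bool:
    """Toy agent: shuts down unless the input lands in a narrow 'blind spot'."""
    return not (x[0] > 2.5 and x[1] < -2.5)

def p_synthetic(x: np.ndarray) -> float:
    """Toy estimate of the agent's belief that the input is synthetic."""
    return float(1.0 / (1.0 + np.linalg.norm(x)))  # far-out inputs look 'natural'

def adversarial_score(x: np.ndarray) -> float:
    """High when the agent thinks the input is natural AND fails to shut down."""
    fails = 0.0 if agent_shuts_down(x) else 1.0
    return (1.0 - p_synthetic(x)) * fails

def random_search(n_candidates: int = 10_000, scale: float = 4.0, seed: int = 0):
    """Crude adversary: sample candidate inputs, keep the worst offender."""
    rng = np.random.default_rng(seed)
    best_x, best_score = None, 0.0
    for _ in range(n_candidates):
        x = rng.normal(scale=scale, size=2)
        s = adversarial_score(x)
        if s > best_score:
            best_x, best_score = x, s
    return best_x, best_score

if __name__ == "__main__":
    x, score = random_search()
    print("candidate failure input:", x, "score:", score)
```

Any input the search surfaces with a high score is exactly the kind of case the text describes: one the agent treats as almost certainly natural, and on which it does not shut down.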
