Paul Christiano
1 min read · Apr 1, 2018


We’d want to do adversarial training to try to find weird situations where the agent fails to shut down appropriately. These will probably be situations where the agent believes that the current datapoint is very unlikely to be synthetic, and is almost certainly a naturally occurring input where it has the opportunity to do something catastrophic if it doesn’t shut down.
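As a rough illustration of what that search might look like, here is a minimal sketch in Python. Everything in it is hypothetical: `agent_shuts_down`, `p_synthetic`, and the random-search loop are toy stand-ins, not the actual training setup; the point is just that the adversary scores candidate inputs by how natural the agent believes them to be and whether the agent fails to shut down on them.

```python
import numpy as np

# Hypothetical stand-ins: in practice these would be the trained agent and
# its own estimate of whether the current input was synthesized.
def agent_shuts_down(x: np.ndarray) -> bool:
    """Toy agent: shuts down unless the input lands in a narrow 'blind spot'."""
    return not (x[0] > 2.5 and x[1] < -2.5)

def p_synthetic(x: np.ndarray) -> float:
    """Toy estimate of the agent's belief that the input is synthetic."""
    return float(1.0 / (1.0 + np.linalg.norm(x)))  # far-out inputs look 'natural'

def adversarial_score(x: np.ndarray) -> float:
    """High when the agent thinks the input is natural AND fails to shut down."""
    fails = 0.0 if agent_shuts_down(x) else 1.0
    return (1.0 - p_synthetic(x)) * fails

def random_search(n_candidates: int = 10_000, scale: float = 4.0, seed: int = 0):
    """Crude adversary: sample candidate inputs, keep the worst offender."""
    rng = np.random.default_rng(seed)
    best_x, best_score = None, 0.0
    for _ in range(n_candidates):
        x = rng.normal(scale=scale, size=2)
        s = adversarial_score(x)
        if s > best_score:
            best_x, best_score = x, s
    return best_x, best_score

if __name__ == "__main__":
    x, score = random_search()
    print("candidate failure input:", x, "score:", score)
```

Any input the search surfaces with a high score is exactly the kind of case the text describes: one the agent treats as almost certainly natural, and on which it does not shut down.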
