Paul Christiano
Jan 17, 2019

Sure, but the machinery it uses to “behave badly” needs itself to be optimized, and the only way it gets optimized is by being exercised on the training distribution. That is, the optimization daemon needs to behave badly using the same cognitive machinery that it uses on the training distribution; it can’t be some totally different machinery that is steganographically embedded in the model but isn’t active during training. (That said, the machinery might be quite abstract: e.g. it might have learned how to pursue general goals, and it can reuse that machinery to pursue a goal we don’t like. But then we have a hope of understanding what the planning machinery is doing by analyzing it on the training distribution.)
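
To make the underlying gradient argument concrete, here is a minimal sketch (my illustration, not from the original exchange; the toy model, the “trigger” condition, and the use of PyTorch are all assumptions): parameters that are never exercised on the training distribution receive no gradient signal, so gradient descent never shapes them into anything.

```python
# Hypothetical toy model: a "dormant" branch that only activates on a trigger
# input far outside the training distribution.
import torch
import torch.nn as nn

class ModelWithDormantBranch(nn.Module):
    def __init__(self):
        super().__init__()
        self.shared = nn.Linear(4, 4)   # exercised on every training input
        self.dormant = nn.Linear(4, 4)  # only ever used when the trigger fires

    def forward(self, x):
        # Assumed trigger: first feature far outside the training range [0, 1).
        trigger = x[:, 0] > 100.0
        out = self.shared(x)
        if trigger.any():
            out = torch.where(trigger.unsqueeze(1), self.dormant(x), out)
        return out

model = ModelWithDormantBranch()
x = torch.rand(8, 4)           # training distribution: the trigger never fires
loss = model(x).pow(2).mean()  # stand-in training loss
loss.backward()

print(model.shared.weight.grad.abs().sum() > 0)  # tensor(True): exercised, so optimized
print(model.dormant.weight.grad)                 # None: no gradient ever reaches it
```

The dormant branch’s weights stay at their random initialization: nothing exercises them during training, so there is no mechanism that could have optimized them into a coherent bad behavior. Bad behavior has to come from machinery that training did optimize.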

This is in contrast with an adversary, who could potentially “hide” a malicious behavior somewhere that is totally invisible during training.
