Paul Christiano
Jan 17, 2019


I don’t think that quite captures my intuition for optimism. I’m imagining something like: we’ve set up the environment so that “normal” mistakes don’t cause catastrophic problems, and now we are interested in the case where the model engages in some very clever behavior in order to overcome our optimization: perhaps it overcomes security measures we install, tries to systematically generate inputs that will corrupt other ML systems, or manipulates humans in a sophisticated way.

In order to do something like this, the model needs to actually be doing a bunch of cognitive work that looks malicious, and that’s what we want to get at. I’m not as scared of examples like tanh(): while the tanh happens to optimize this function, it’s not doing the kind of cognitive work that would be needed to break a set of reasonably robust precautions, and if it were, then there would be something to see.

(To study this aspect of the problem you’d need to be in the regime where your model was generalizing “goal-directed behavior” across different subgoals, so that you could look for the case where it is applying that general machinery in order to do sophisticated but bad things. People hope that sort of thing is happening in some systems (e.g. here), but I’m not sure.)
