Paul Christiano
1 min read · Jun 15, 2017


3. When overseeing A, H should do things like trying to figure out “what does the user want?” (which may involve querying the user, making inferences from their observed behavior, etc.) and then doing that. If the user is themselves trying to behave corrigibly, then this has the effect of pulling towards the user’s conception of corrigibility (i.e. H’s conception, rather than A’s prediction of H’s conception).

2. I don’t think it’s important. If everything works correctly, the system will be corrigible iff the overseer wants it to be. (Of course, a rational overseer may not want the system to be corrigible, e.g. they may want it to help the overseer rather than the user. Just like a rational programmer may not want the software to do what its user wants.) Note that during all steps of the iterative process except the last one, the user is the same as the overseer, since the overseer is using the AI to help oversee.

1. If H has the intention to help the user, but A thinks that H wants to spite the user, then that won’t get corrected. If A is wrong about H’s values (e.g. if A erroneously thinks that H values only paperclips), then that will be fine: as long as it still believes that H wants A to be corrigible, it will ask the user what to do instead of committing to paperclips. If A thinks that H has a slightly different conception of corrigibility (e.g. if A misunderstands the user’s language) then that will hopefully be corrected.
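To make the asymmetry in point 1 concrete, here is a minimal toy sketch (not from the post; all names like OverseerModel, act, and ask_user are illustrative assumptions): a mistaken model of H’s values is recoverable so long as A’s model of H still includes “H wants A to be corrigible,” because A then checks with the user instead of acting on the mistake.

```python
# Toy model: an assistant A with a possibly-wrong model of its overseer H.
# Illustrates: even if A's picture of H's values is badly wrong ("H only
# values paperclips"), A still defers to the user as long as its model says
# H wants A to be corrigible.

from dataclasses import dataclass
from typing import Callable


@dataclass
class OverseerModel:
    """A's (possibly mistaken) beliefs about the overseer H."""
    predicted_values: str                  # e.g. "make paperclips" -- may be wrong
    believes_H_wants_corrigibility: bool   # the belief that does the real work


def act(model: OverseerModel, ask_user: Callable[[str], str]) -> str:
    if model.believes_H_wants_corrigibility:
        # Corrigible behavior: check in with the user before committing,
        # so a wrong value model gets corrected rather than acted on.
        return ask_user(f"I think you want me to {model.predicted_values}. Should I?")
    # Non-corrigible behavior: act directly on the (possibly wrong) value model.
    return f"pursuing: {model.predicted_values}"


# Example: A's value model is wrong, but the error is recoverable because
# A asks instead of committing to paperclips.
wrong_but_corrigible = OverseerModel("make paperclips", believes_H_wants_corrigibility=True)
print(act(wrong_but_corrigible, ask_user=lambda q: f"asked user: {q}"))
```

The failure mode in the first sentence of point 1 corresponds to believes_H_wants_corrigibility being wrong: nothing in this loop corrects that belief.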
