When A is powerful, Hᴬ is almost entirely driven by A. The only role of H is to answer queries posed by A. For example, if A is uncertain about the meaning of a particular word, it might consult H for clarification. If A is confident it knows how to behave corrigibly, then it will ignore H, and all of the error-correction happens within A’s deliberation.
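To make that division of labor concrete, here is a minimal toy sketch in Python of the query-when-uncertain pattern. Everything in it (the `Human` and `CompositeHA` classes, the `confidence` stub, the 0.95 threshold) is a hypothetical illustration I am supplying, not anything specified in the discussion.

```python
# A minimal sketch, assuming a toy model of the dynamic above: the
# composite agent H^A acts, but when A is powerful its behavior is
# almost entirely A's, with H consulted only for clarifying queries.
# All names here are hypothetical illustrations.

CONFIDENCE_THRESHOLD = 0.95  # assumed cutoff for "A is confident"


class Human:
    """H: plays no role except answering queries that A poses."""

    def answer(self, query: str) -> str:
        # Stand-in for H supplying clarifying information.
        return f"clarification of {query!r}"


class CompositeHA:
    """H^A: the human-plus-assistant system, driven by A."""

    def __init__(self, human: Human):
        self.human = human

    def confidence(self, task: str) -> float:
        # Placeholder for A's self-assessed grasp of the task.
        return 0.9

    def act(self, task: str) -> str:
        if self.confidence(task) >= CONFIDENCE_THRESHOLD:
            # A is confident it knows how to behave corrigibly, so it
            # ignores H; error-correction stays in A's own deliberation.
            return f"A acts on {task!r} without consulting H"
        # A is uncertain (e.g. about the meaning of a word), so it
        # queries H before acting.
        info = self.human.answer(task)
        return f"A acts on {task!r} using H's input: {info}"


print(CompositeHA(Human()).act("handle the ambiguous request"))
```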
The intention is that all of those instructions should be equivalent, in the same way that asking question Q is equivalent to asking “what is the best answer that you could give to question Q?” (Except insofar as correcting for value drift is only one of several desiderata: you also need sufficiently rapid capability growth, stability along other axes, and so forth.)
(I accidentally included an extra sentence at the end of my last reply. I also responded in reverse order but that was intentional.)