Paul Christiano
Mar 23, 2019


The structure isn’t really that you give a pseudo-input and a proof. It’s more like: you specify a bunch of facts about the input and the model’s behavior on that input, subject to some consistency conditions (e.g. local consistency + the SDP constraint), and then you win if your pseudo-input manifestly implies “with non-negligible probability the overseer concludes that the agent behaved unacceptably.”

If you think in terms of a pseudo-input and a proof, it’s kind of like using the same consistency condition for the pseudo-input that you use for your proof. So if the pseudo-input has the kind of inconsistency that your proof can exploit for a contradiction, then it’s also not a valid pseudo-input. I agree it’s bad news to use a proof system stronger than the consistency condition on your pseudo-input.

Though again, in practice this is because the proof is itself embedded within the pseudo-input: in the SDP example, we specify the moments of the input distribution, but also the model’s computation, and also the computations of the overseer, and subject those all to the same consistency condition. This also feels different in that you are saying something more like “it’s possible, for all we know, that the model would behave badly” rather than “here is a proof that this model would behave badly in this case.” Stronger systems make the game harder for the adversary (because “for all we know” becomes more meaningful), whereas in the proof case they would make it easier (because it becomes easier to find a proof).
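
To make the "stronger consistency condition is harder for the adversary" point concrete, here is a toy sketch. It is not the actual construction: it only covers degree-2 pseudo-moments over three 0/1 variables, and the names `moment_matrix`, `locally_consistent`, and `is_psd` are made up for illustration. The adversary proposes moments saying every pair of variables disagrees. That passes a pairwise local-consistency check (each pair has a valid joint distribution), but no real distribution over three bits has this property, and the SDP (positive semidefinite) constraint rejects it.

```python
import itertools
import numpy as np

def moment_matrix(means, pair_moments):
    """Degree-2 moment matrix indexed by (1, x_1, ..., x_n) for 0/1 variables.

    means[i] is the claimed E[x_i]; pair_moments[(i, j)] with i < j is the
    claimed E[x_i * x_j]. For 0/1 variables E[x_i^2] = E[x_i].
    """
    n = len(means)
    M = np.empty((n + 1, n + 1))
    M[0, 0] = 1.0
    M[0, 1:] = means
    M[1:, 0] = means
    for i in range(n):
        for j in range(n):
            if i == j:
                M[i + 1, j + 1] = means[i]
            else:
                M[i + 1, j + 1] = pair_moments[(min(i, j), max(i, j))]
    return M

def locally_consistent(M, tol=1e-9):
    """Weaker condition: every pair (x_i, x_j) has a valid joint distribution."""
    n = M.shape[0] - 1
    for i, j in itertools.combinations(range(n), 2):
        p11 = M[i + 1, j + 1]
        p10 = M[0, i + 1] - p11
        p01 = M[0, j + 1] - p11
        p00 = 1.0 - M[0, i + 1] - M[0, j + 1] + p11
        if min(p11, p10, p01, p00) < -tol:
            return False
    return True

def is_psd(M, tol=1e-9):
    """Stronger condition: the moment matrix is positive semidefinite."""
    return bool(np.linalg.eigvalsh(M).min() >= -tol)

# Adversary's claim: each x_i is 1 half the time, and no two are ever both 1,
# so every pair must disagree. Impossible for three bits jointly.
bad = moment_matrix([0.5] * 3, {(0, 1): 0.0, (0, 2): 0.0, (1, 2): 0.0})

# Honest claim: three independent fair bits.
good = moment_matrix([0.5] * 3, {(0, 1): 0.25, (0, 2): 0.25, (1, 2): 0.25})

for name, M in [("bad", bad), ("good", good)]:
    print(name, "locally consistent:", locally_consistent(M), "PSD:", is_psd(M))
```

Under the weaker check alone, the adversary could "win" with the impossible pseudo-input; adding the PSD constraint removes that option, which is the sense in which a stronger system makes "for all we know" more meaningful.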
