Paul Christiano
Feb 19, 2018

Formalizing anything is difficult. In this passage I’m just talking about why “corrigible” is more likely to be achievable than “not catastrophically bad.”

Coming up with a specification still seems super hard. I give three thoughts in the section on verification:

(i) We can use verification to distill a slow trusted model into a fast trusted model (e.g. I think that even perfect adversarial training could at best produce a robust ensemble, which you would then want to distill into a single model).

(ii) We can use amplification to iteratively turn a very weak trusted model into a strong trusted model (and then either use other techniques to make the very weak model robust, or, in the most extreme case, remove the human entirely so there is no physical process).

(iii) We can use the ideas from verification to improve adversarial training without literally getting a proof.
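
As a rough illustration of the amplify-and-distill structure in points (i) and (ii), here is a minimal sketch. Everything in it is assumed for the sake of the example: the `Model` interface and the `amplify`, `distill`, and `iterate` helpers are hypothetical stand-ins, not an implementation described above.

```python
# Illustrative sketch only: the names and interfaces below are assumptions,
# not an implementation from the text.
from typing import Callable, List

Model = Callable[[str], str]  # a model maps a question to an answer


def amplify(model: Model,
            decompose: Callable[[str], List[str]],
            combine: Callable[[str, List[str]], str]) -> Model:
    """Point (ii): build a slower but stronger model by letting the current
    model answer sub-questions and then combining those answers."""
    def amplified(question: str) -> str:
        sub_answers = [model(sub_q) for sub_q in decompose(question)]
        return combine(question, sub_answers)
    return amplified


def distill(slow_model: Model,
            train_fast_model: Callable[[Model], Model]) -> Model:
    """Point (i): compress a slow trusted model into a fast one, e.g. by
    training on its outputs and then checking the result, rather than
    trusting the training process itself."""
    return train_fast_model(slow_model)


def iterate(weak_model: Model,
            steps: int,
            decompose: Callable[[str], List[str]],
            combine: Callable[[str, List[str]], str],
            train_fast_model: Callable[[Model], Model]) -> Model:
    """Alternate amplification and distillation to turn a very weak trusted
    model into a stronger trusted model."""
    model = weak_model
    for _ in range(steps):
        model = distill(amplify(model, decompose, combine), train_fast_model)
    return model
```

In this framing, verification or adversarial training would sit inside the hypothetical `train_fast_model` step, as checks that the distilled fast model actually matches the slow trusted model it was distilled from.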
