Paul Christiano
Jun 29, 2017


The situation is:

  • Arthur predicts that an attacker is reasonably likely to compromise the training setup.
  • Arthur predicts that if an attacker compromises the training setup, then approval(a) will be high for precisely those actions which helped them compromise the training setup.
  • So expected approval(a) is highest for the actions which help the attacker compromise the training setup.
  • So Arthur takes those actions.
  • So an attacker compromises the training setup.
  • So Arthur’s original prediction was a self-fulfilling prophecy.
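
The loop above can be sketched as a toy calculation. This is only an illustration under made-up assumptions: the two actions, the approval values, and the compromise probabilities are all hypothetical numbers chosen to show the structure of the argument, not anything measured or specified. It treats Arthur as picking the action with the highest expected approval(a), given his own predicted probability that the training setup gets compromised.

```python
# Toy model of the self-fulfilling prophecy. Every name and number here is a
# hypothetical assumption made for illustration only.

# Assumed approval(a) in each world: if the setup is intact, the overseer rates
# actions; if it is compromised, the attacker controls the ratings.
APPROVAL = {
    "intact": {"help_user": 0.8, "help_attacker": 0.1},
    "compromised": {"help_user": 0.0, "help_attacker": 1.0},
}

# Assumed chance that the attack actually succeeds, given what Arthur does.
P_COMPROMISE_GIVEN_ACTION = {"help_user": 0.05, "help_attacker": 0.9}


def expected_approval(action, p_compromise):
    """Expected approval(a) under Arthur's predicted compromise probability."""
    return ((1 - p_compromise) * APPROVAL["intact"][action]
            + p_compromise * APPROVAL["compromised"][action])


def best_action(p_compromise):
    """The action Arthur takes: the one with the highest expected approval."""
    return max(APPROVAL["intact"],
               key=lambda a: expected_approval(a, p_compromise))


def outcome(p_predicted):
    """Return Arthur's action and the resulting compromise probability."""
    action = best_action(p_predicted)
    return action, P_COMPROMISE_GIVEN_ACTION[action]


print(outcome(0.9))   # ('help_attacker', 0.9)
print(outcome(0.05))  # ('help_user', 0.05)
```

With these assumed numbers both predictions are self-consistent: predicting a likely compromise leads Arthur to help the attacker, which makes the compromise likely, while predicting an unlikely compromise leads him to help the user, which keeps it unlikely. The worry in the list above is the first fixed point.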
