Suppose that Alice is a good translator, but if you show her the wrong kind of sentence then an evil demon will eat her soul and provide malicious translations.
In a world where Alice exists and translation is important, Alice’s existence will tend to promote the values of evil demons. That’s not something we can avoid with clever alignment strategies; it’s a problem even if we never build AI.
Our AI might make this risk worse in a few ways, but those seem to me separate from misalignment and should be handled separately (I think this is a running disagreement between us):
- Our AI might be worse at making certain kinds of decisions (e.g. how much to trust Alice) than we are, and so by building it we might exacerbate evil-demon risk.
- Our AI might enable powerful optimization, without which Alice would never be consumed by an evil demon. But that same optimization may also be useful for leveraging Alice’s translation abilities, so a rational strategy for the AI might accept some demon risk; at equilibrium, the existence of AI then benefits the evil demon at everyone else’s expense.
Now suppose instead that Alice will sometimes behave erratically, but not in a consequentialist way. This is a hazard if you want to use Alice as a translator, or to use Alice to train a new translator. But it’s not malign, and it afflicts everyone regardless of their values: whatever your goals, if you want the benefit of Alice’s skills you need to deal with her erratic behavior.
In the real world, you might have both of those situations and more. In light of that, deciding how to use Alice as a translator is a complex question. It’s not really related to AI risk; it’s related to evil-demon risk plus pragmatic strategic questions.
My goal is to build aligned AI that is able to answer that question as well as it can, and in particular as well as any other aligned AI would. That doesn’t require any translation ability; it requires, for example, understanding the consequences of consulting Alice and making an appropriate cost-benefit analysis. That’s the kind of thing I think alignment needs to be able to do: it needs to be able to learn to translate as well as an unaligned agent can learn to translate, and to learn to leverage Alice’s translation skill as well as another AI can, but it does not need to translate as well as Alice herself can.
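As a toy sketch of the kind of cost-benefit reasoning that last point has in mind (the function name, probabilities, and payoffs below are illustrative assumptions, not part of the scenario), the aligned AI only needs to weigh the expected value of Alice’s translation against the expected harm from showing her the wrong kind of sentence:

```python
# Toy expected-value sketch of "should we show Alice this sentence?"
# All numbers and names are illustrative assumptions, not part of the original scenario.

def should_consult_alice(p_wrong_kind: float,
                         value_of_translation: float,
                         harm_from_malicious_translation: float) -> bool:
    """Consult Alice only if the expected benefit outweighs the expected demon harm."""
    expected_benefit = (1 - p_wrong_kind) * value_of_translation
    expected_harm = p_wrong_kind * harm_from_malicious_translation
    return expected_benefit > expected_harm

# Example: a 1% chance the sentence is "the wrong kind" can still be worth taking
# if a good translation is valuable enough relative to the harm of a malicious one.
print(should_consult_alice(p_wrong_kind=0.01,
                           value_of_translation=1.0,
                           harm_from_malicious_translation=50.0))  # True
```

The point is just that making this call well requires judgment about consequences and risk, not any translation skill of the AI’s own.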