Paul Christiano
3 min read · Mar 16, 2018

I agree that an alignment technique is a no-go if it reduces competitiveness, including translation ability.

By an acceptable “failure of competence” I meant: in order to perform well in the world, our AI will need to leverage Alice’s translation ability. If it fails to do that well (or takes a rational risk owing to competitive pressure), it may get eaten by a demon. But that’s not our fault as alignment researchers. The existence of demon-prone translators was just another challenge in the world that our AI may not have been equipped to address.

I think the confusion is because I misunderstood your earlier point / gave a bad definition. My scope is even narrower than what I said before: I’m not considering all possible ways of producing AI, just a particular family of techniques (roughly speaking, techniques that look like “do a bunch of optimization”) that pose a particular risk. There may be techniques other than optimization that are useful but create malign AI.

I’d be happy to explicitly narrow my scope beyond the full alignment problem to “the alignment problem posed by powerful optimization.” But this may also just be a bug in my statement of the alignment problem, a way in which it differs from common use / fails to be useful. I suspect “the alignment problem caused by powerful optimization” is closer to what people mean by “the alignment problem” and I think it’s a more useful category. But it’s not a super important distinction and I’d be happy to go with others’ preferred usage. (A natural way to point to what I’m working on in particular would then be prosaic AI alignment.)

To be a bit more precise about what I mean:

  • Some AI techniques might lead to powerful AI systems that are malign, i.e. which are trying to do something other than what we want them to do.
  • My goal is to develop alternative versions of those techniques that are equally useful, but don’t lead to malign AI.
  • A benign AI may still cause damage by making a mistake. There are lots of problems in the world; we are trying to fix only this one.

I was thinking of “consult Alice” as a technique for producing good translations. That technique is dangerous whether or not you have AI, so it doesn’t look like an alignment problem to me.

But as you say, “train an AI to imitate Alice” is also a technique for producing an AI, which might produce a malign AI. So on my problem statement, finding a benign variant of that procedure is part of the alignment problem. It differs from problems like “your new technology may blow up the world” because it fails by causing your AI to do something bad.

To help motivate my narrower focus, consider a particularly silly example: we receive a communication from extraterrestrials that includes the obfuscated code of a powerful malign AI. Now we are just screwed; there is nothing to be done other than coordinate to not run the AI. On my original definition this is another example of an alignment problem. So I definitely want to either narrow the definition to exclude this kind of thing, or narrow my focus to a subset of the alignment problem. The example of translator Alice isn’t quite this extreme, but I want to separate it out from the alignment-of-powerful-optimization problem for similar reasons.
