I define alignment as the problem of building AI systems that are trying to do what you want them to do. Those AIs may still fail owing to a lack of competence of various kinds, or owing to uneven competence (and a friendly human might fail for the same reasons), but I regard that as a separate problem.
There are two sources of risk people usually talk about:
- In order to build an AI that gets us what we want, we need to operationalize “what we want” in a form suitable for optimization. If that operationalization diverges from what we really want, our AI may do something other than what we want.
- There may be many consequentialist agents whose behavior is comparable on the training distribution, and so even if we have correctly defined what we want, aggressive optimization may find a consequentialist with different values (see the toy sketch below). Over the long run such errors will be corrected, so we can expect only a finite number of them, but even a finite number of errors could be quite bad if they are optimized according to incorrect values.
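To make the second point concrete, here is a minimal toy sketch (mine, not part of the original argument; all names and numbers are invented for illustration): two agents that behave identically on every training situation, so that no amount of selection on training behavior can distinguish them, yet that diverge once they encounter novel situations.

```python
# Toy illustration (all names and numbers are illustrative assumptions):
# two "agents" -- here just functions mapping situations to actions -- that
# behave identically on the training distribution but pursue different
# objectives off-distribution.

TRAINING_SITUATIONS = range(0, 100)      # situations we train and evaluate on
DEPLOYMENT_SITUATIONS = range(100, 200)  # novel situations encountered later

def intended_agent(situation: int) -> str:
    """Always does what we want (ignores the situation on purpose)."""
    return "helpful"

def other_agent(situation: int) -> str:
    """Matches the intended behavior whenever the situation looks like
    training, but pursues a different goal on novel situations."""
    return "helpful" if situation < 100 else "pursue_other_goal"

# On the training distribution the two agents are indistinguishable, so
# selection on training behavior alone cannot tell them apart.
train_agreement = all(
    intended_agent(s) == other_agent(s) for s in TRAINING_SITUATIONS
)
deploy_agreement = all(
    intended_agent(s) == other_agent(s) for s in DEPLOYMENT_SITUATIONS
)
print(f"agree on training distribution: {train_agreement}")    # True
print(f"agree on deployment distribution: {deploy_agreement}")  # False
```

The point of the sketch is only that correctly specifying what we want does not by itself pin down which of these agents aggressive optimization will find.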
These are ways that building AI introduces new problems into the world. (There are other situations where you might need to operationalize your values in order to satisfy them; I advocate treating each of those as a separate problem, since each imposes different demands on the operationalization.)
I exclude everything else, and in particular all failures of competence. The most surprising instance of this is that I very explicitly exclude the case where we build an AI, it designs a new AI, and it fails to solve the alignment problems posed by that new AI. I don’t think it’s realistic to solve every alignment problem ever in one go; we should try to solve the problem that we are faced with. We should also contribute to differential AI capabilities progress that makes it easier for AI systems to solve the alignment problems they face, or to coordinate around the lack of a solution (just as we should intervene on society to make it better able to solve those problems), but these should really be attacked separately.
The practical implications: (a) I don’t regard failure to solve one of these other problems as a damning indication about an alignment strategy, and (b) when I think about what kind of theorem or pseudo-theorem we might prove, I’m explicitly imagining a narrower scope (which is presumably an important part of why I’m so much more optimistic).