Paul Christiano in AI Alignment: My views on “doom”. I’m often asked: “what’s the probability of a really bad outcome from AI?” In this post I answer 10 versions of that question. (Apr 27, 2023)
Paul Christiano in AI Alignment: Can we efficiently distinguish different mechanisms? Can a model produce coherent predictions based on two very different mechanisms without there being any efficient way to distinguish them? (Dec 27, 2022)
Paul Christiano in AI Alignment: Can we efficiently explain model behaviors? It may be impossible to automatically find explanations. That would complicate ARC’s alignment plan, but our work can still be useful. (Dec 16, 2022)
Paul Christiano in AI Alignment: AI alignment is distinct from its near-term applications. Not everyone will agree about how AI systems should behave, but no one wants AI to kill everyone. (Dec 13, 2022)
Paul Christiano in AI Alignment: Finding gliders in the game of life. Walking through a simple concrete example of ARC’s approach to ELK based on mechanistic anomaly detection. (Dec 1, 2022)
Paul Christiano in AI Alignment: Mechanistic anomaly detection and ELK. An approach to ELK based on finding the “normal reason” for model behaviors on the training distribution and flagging anomalous examples. (Nov 25, 2022)
Paul Christiano in AI Alignment: Eliciting latent knowledge. How can we train an AI to honestly tell us when our eyes deceive us? (Feb 25, 2022)
Paul Christiano in AI Alignment: Answering questions honestly given world-model mismatches. I expect AIs and humans to think about the world differently. Does that make it more complicated for an AI to “honestly answer questions”? (Jun 13, 2021)
Paul Christiano in AI Alignment: A naive alignment strategy and optimism about generalization. I describe a very naive strategy for training a model to “tell us what it knows.” (Jun 10, 2021)
Paul Christiano in AI Alignment: Teaching ML to answer questions honestly instead of predicting human answers. I discuss a three-step plan for learning to answer questions honestly instead of predicting what a human would say. (May 28, 2021)