Paul Christiano in AI Alignment: My views on “doom”. I’m often asked: “what’s the probability of a really bad outcome from AI?” In this post I answer 10 versions of that question. 3 min read · Apr 27, 2023
Paul Christiano in AI Alignment: Can we efficiently distinguish different mechanisms? Can a model produce coherent predictions based on two very different mechanisms without there being any efficient way to distinguish them? 18 min read · Dec 27, 2022
Paul Christiano in AI Alignment: Can we efficiently explain model behaviors? It may be impossible to automatically find explanations. That would complicate ARC’s alignment plan, but our work can still be useful. 10 min read · Dec 16, 2022
Paul Christiano in AI Alignment: AI alignment is distinct from its near-term applications. Not everyone will agree about how AI systems should behave, but no one wants AI to kill everyone. 3 min read · Dec 13, 2022
Paul Christiano in AI Alignment: Finding gliders in the game of life. Walking through a simple concrete example of ARC’s approach to ELK based on mechanistic anomaly detection. 20 min read · Dec 1, 2022
Paul Christiano in AI Alignment: Mechanistic anomaly detection and ELK. An approach to ELK based on finding the “normal reason” for model behaviors on the training distribution and flagging anomalous examples. 25 min read · Nov 25, 2022
Paul Christiano in AI Alignment: Eliciting latent knowledge. How can we train an AI to honestly tell us when our eyes deceive us? 3 min read · Feb 25, 2022
Paul Christiano in AI Alignment: Answering questions honestly given world-model mismatches. I expect AIs and humans to think about the world differently. Does that make it more complicated for an AI to “honestly answer questions”? 19 min read · Jun 13, 2021
Paul Christiano in AI Alignment: A naive alignment strategy and optimism about generalization. I describe a very naive strategy for training a model to “tell us what it knows.” 4 min read · Jun 10, 2021
Paul Christiano in AI Alignment: Teaching ML to answer questions honestly instead of predicting human answers. I discuss a three-step plan for learning to answer questions honestly instead of predicting what a human would say. 20 min read · May 28, 2021