Published in AI Alignment

My views on "doom" (Apr 27, 2023)
I'm often asked: "what's the probability of a really bad outcome from AI?" In this post I answer 10 versions of that question.

Can we efficiently distinguish different mechanisms? (Dec 27, 2022)
Can a model produce coherent predictions based on two very different mechanisms without there being any efficient way to distinguish them?

Can we efficiently explain model behaviors? (Dec 16, 2022)
It may be impossible to automatically find explanations. That would complicate ARC's alignment plan, but our work can still be useful.

AI alignment is distinct from its near-term applications (Dec 13, 2022)
Not everyone will agree about how AI systems should behave, but no one wants AI to kill everyone.

Finding gliders in the game of life (Dec 1, 2022)
Walking through a simple concrete example of ARC's approach to ELK based on mechanistic anomaly detection.

Mechanistic anomaly detection and ELK (Nov 25, 2022)
An approach to ELK based on finding the "normal reason" for model behaviors on the training distribution and flagging anomalous examples.

Eliciting latent knowledge (Feb 25, 2022)
How can we train an AI to honestly tell us when our eyes deceive us?

Answering questions honestly given world-model mismatches (Jun 13, 2021)
I expect AIs and humans to think about the world differently. Does that make it more complicated for an AI to "honestly answer questions"?

A naive alignment strategy and optimism about generalization (Jun 10, 2021)
I describe a very naive strategy for training a model to "tell us what it knows."

Teaching ML to answer questions honestly instead of predicting human answers (May 28, 2021)
I discuss a three-step plan for learning to answer questions honestly instead of predicting what a human would say.