Paul Christiano

Published in AI Alignment

Dec 27, 2022

Can we efficiently distinguish different mechanisms?

(This post is an elaboration on “tractability of discrimination” as introduced in section III of Can we efficiently explain model behaviors? For an overview of the general plan this fits into, see Mechanistic anomaly detection and Finding gliders in the game of life.) Background: We’d like to build AI systems that…

18 min read

Published in AI Alignment

Dec 16, 2022

Can we efficiently explain model behaviors?

ARC’s current plan for solving ELK (and maybe also deceptive alignment) involves three major challenges: formalizing probabilistic heuristic arguments as an operationalization of “explanation”; finding sufficiently specific explanations for important model behaviors; and checking whether particular instances of a behavior are “because of” a particular explanation. All three of these steps…

10 min read

Published in AI Alignment

Dec 13, 2022

AI alignment is distinct from its near-term applications

I work on AI alignment, by which I mean the technical problem of building AI systems that are trying to do what their designer wants them to do. There are many different reasons that someone could care about this technical problem. To me the single most important reason is that…

3 min read

Published in AI Alignment

Dec 1, 2022

Finding gliders in the game of life

ARC’s current approach to ELK is to point to latent structure within a model by searching for the “reason” for particular correlations in the model’s output. In this post we’ll walk through a very simple example of using this approach to identify gliders in the game of life. We’ll use…

20 min read

Published in AI Alignment

Nov 25, 2022

Mechanistic anomaly detection and ELK

(Follow-up to Eliciting Latent Knowledge. Describing joint work with Mark Xu. This is an informal description of ARC’s current research approach; not a polished product intended to be understandable to many people.) Suppose that I have a diamond in a vault, a collection of cameras, and an ML system that…

25 min read

Published in AI Alignment

Feb 25, 2022

Eliciting latent knowledge

In this report, we’ll present ARC’s approach to an open problem we think is central to aligning powerful machine learning (ML) systems: Suppose we train a model to predict what the future will look like according to cameras and other sensors. …

3 min read

Published in AI Alignment

Jun 13, 2021

Answering questions honestly given world-model mismatches

(Warning: this post is rough and in the weeds. I expect most readers should skip it and wait for a clearer synthesis later. ETA: now available here.) In a recent post I discussed one reason that a naive alignment strategy might go wrong, by learning to “predict what humans would…

19 min read

Published in AI Alignment

Jun 10, 2021

A naive alignment strategy and optimism about generalization

(Context: my last post was trying to patch a certain naive strategy for AI alignment, but I didn’t articulate clearly what the naive strategy is. I think it’s worth explaining the naive strategy in its own post, even though it’s not a novel idea.) Suppose that I jointly train an…

4 min read

Published in AI Alignment

May 28, 2021

Teaching ML to answer questions honestly instead of predicting human answers

In this post I consider the problem of models learning “predict how a human would answer questions” instead of “answer questions honestly.” (A special case of the problem from Inaccessible Information.) I describe a possible three-step approach for learning to answer questions honestly instead: Change the learning process so that…

20 min read

Published in AI Alignment

May 25, 2021

Decoupling deliberation from competition

I view intent alignment as one step towards a broader goal of decoupling deliberation from competition. Deliberation. Thinking about what we want, learning about the world, talking and learning from each other, resolving our disagreements, figuring out better methodologies for making further progress… Competition. Making money and racing to build…

11 min read
