Paul Christiano

1.2K Followers


Published in AI Alignment · Apr 27

My views on “doom”

I’m often asked: “what’s the probability of a really bad outcome from AI?” There are many different versions of that question with different answers. In this post I’ll try to answer a bunch of versions of this question all in one place. Two distinctions often lead to confusion about what…

3 min read


Published in AI Alignment · Dec 27, 2022

Can we efficiently distinguish different mechanisms?

(This post is an elaboration on “tractability of discrimination” as introduced in section III of Can we efficiently explain model behaviors? For an overview of the general plan this fits into, see Mechanistic anomaly detection and Finding gliders in the game of life.) Background We’d like to build AI systems that…

18 min read


Published in AI Alignment · Dec 16, 2022

Can we efficiently explain model behaviors?

ARC’s current plan for solving ELK (and maybe also deceptive alignment) involves three major challenges:

1. Formalizing probabilistic heuristic arguments as an operationalization of “explanation”
2. Finding sufficiently specific explanations for important model behaviors
3. Checking whether particular instances of a behavior are “because of” a particular explanation

All three of these steps…

10 min read


Published in AI Alignment · Dec 13, 2022

AI alignment is distinct from its near-term applications

I work on AI alignment, by which I mean the technical problem of building AI systems that are trying to do what their designer wants them to do. There are many different reasons that someone could care about this technical problem. To me the single most important reason is that…

3 min read


Published in AI Alignment · Dec 1, 2022

Finding gliders in the game of life

ARC’s current approach to ELK is to point to latent structure within a model by searching for the “reason” for particular correlations in the model’s output. In this post we’ll walk through a very simple example of using this approach to identify gliders in the game of life. We’ll use…

20 min read
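
As a concrete anchor for this entry, here is a minimal, self-contained sketch of the setting the post analyzes: Conway’s Game of Life update rule and the standard glider. This is ordinary Life mechanics written in Python as background only; the post’s actual method for identifying gliders is not reproduced here.

```python
import numpy as np

def life_step(grid: np.ndarray) -> np.ndarray:
    """One synchronous update of Conway's Game of Life on a toroidal grid."""
    # Count each cell's eight neighbors by summing shifted copies of the grid.
    neighbors = sum(
        np.roll(np.roll(grid, dy, axis=0), dx, axis=1)
        for dy in (-1, 0, 1)
        for dx in (-1, 0, 1)
        if (dy, dx) != (0, 0)
    )
    # Birth: exactly 3 live neighbors. Survival: alive with 2 or 3 neighbors.
    return ((neighbors == 3) | ((grid == 1) & (neighbors == 2))).astype(int)

# The standard glider: five live cells that translate one step diagonally
# every four generations.
grid = np.zeros((12, 12), dtype=int)
for r, c in [(0, 1), (1, 2), (2, 0), (2, 1), (2, 2)]:
    grid[r, c] = 1

# After four updates the pattern is the original glider shifted down-right
# by one cell (a plain translation of the starting grid on the torus).
state = grid
for _ in range(4):
    state = life_step(state)
assert (state == np.roll(np.roll(grid, 1, axis=0), 1, axis=1)).all()
```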


Published in AI Alignment · Nov 25, 2022

Mechanistic anomaly detection and ELK

(Follow-up to Eliciting Latent Knowledge. Describing joint work with Mark Xu. This is an informal description of ARC’s current research approach; not a polished product intended to be understandable to many people.) Suppose that I have a diamond in a vault, a collection of cameras, and an ML system that…

25 min read


Published in AI Alignment · Feb 25, 2022

Eliciting latent knowledge

In this report, we’ll present ARC’s approach to an open problem we think is central to aligning powerful machine learning (ML) systems: Suppose we train a model to predict what the future will look like according to cameras and other sensors. …

3 min read


Published in AI Alignment · Jun 13, 2021

Answering questions honestly given world-model mismatches

(Warning: this post is rough and in the weeds. I expect most readers should skip it and wait for a clearer synthesis later. ETA: now available here.) In a recent post I discussed one reason that a naive alignment strategy might go wrong, by learning to “predict what humans would…

19 min read


Published in AI Alignment · Jun 10, 2021

A naive alignment strategy and optimism about generalization

(Context: my last post was trying to patch a certain naive strategy for AI alignment, but I didn’t articulate clearly what the naive strategy is. I think it’s worth explaining the naive strategy in its own post, even though it’s not a novel idea.) Suppose that I jointly train an…

4 min read


Published in AI Alignment · May 28, 2021

Teaching ML to answer questions honestly instead of predicting human answers

In this post I consider the problem of models learning “predict how a human would answer questions” instead of “answer questions honestly.” (A special case of the problem from Inaccessible Information.) I describe a possible three-step approach for learning to answer questions honestly instead: Change the learning process so that…

20 min read

