Paul Christiano – Medium

Paul Christiano

Paul Christiano
in
AI Alignment

My views on “doom”

I’m often asked: “what’s the probability of a really bad outcome from AI?” In this post I answer 10 versions of that question.

3 min read · Apr 27, 2023

Paul Christiano
in
AI Alignment

Can we efficiently distinguish different mechanisms?

Can a model produce coherent predictions based on two very different mechanisms without there being any efficient way to distinguish them?

18 min read · Dec 27, 2022

Paul Christiano
in
AI Alignment

Can we efficiently explain model behaviors?

It may be impossible to automatically find explanations. That would complicate ARC’s alignment plan, but our work can still be useful.

10 min read · Dec 16, 2022

Paul Christiano
in
AI Alignment

AI alignment is distinct from its near-term applications

Not everyone will agree about how AI systems should behave, but no one wants AI to kill everyone.

3 min read · Dec 13, 2022

Paul Christiano
in
AI Alignment

Finding gliders in the game of life

Walking through a simple concrete example of ARC’s approach to ELK based on mechanistic anomaly detection.

20 min read · Dec 1, 2022

Paul Christiano
in
AI Alignment

Mechanistic anomaly detection and ELK

An approach to ELK based on finding the “normal reason” for model behaviors on the training distribution and flagging anomalous examples.

25 min read · Nov 25, 2022

Paul Christiano
in
AI Alignment

Eliciting latent knowledge

How can we train an AI to honestly tell us when our eyes deceive us?

3 min read · Feb 25, 2022

Paul Christiano
in
AI Alignment

Answering questions honestly given world-model mismatches

I expect AIs and humans to think about the world differently. Does that make it more complicated for an AI to “honestly answer questions”?

19 min read · Jun 13, 2021

Paul Christiano
in
AI Alignment

A naive alignment strategy and optimism about generalization

I describe a very naive strategy for training a model to “tell us what it knows.”

4 min read · Jun 10, 2021

Paul Christiano
in
AI Alignment

Teaching ML to answer questions honestly instead of predicting human answers

I discuss a three step plan for learning to answer questions honestly instead of predicting what a human would say.

20 min read · May 28, 2021
