Published in AI Alignment

My views on "doom" (Apr 27, 2023)
I'm often asked: "what's the probability of a really bad outcome from AI?" In this post I answer 10 versions of that question.

Can we efficiently distinguish different mechanisms? (Dec 27, 2022)
Can a model produce coherent predictions based on two very different mechanisms without there being any efficient way to distinguish them?

Can we efficiently explain model behaviors? (Dec 16, 2022)
It may be impossible to automatically find explanations. That would complicate ARC's alignment plan, but our work can still be useful.

AI alignment is distinct from its near-term applications (Dec 13, 2022)
Not everyone will agree about how AI systems should behave, but no one wants AI to kill everyone.

Finding gliders in the game of life (Dec 1, 2022)
Walking through a simple concrete example of ARC's approach to ELK based on mechanistic anomaly detection.

Mechanistic anomaly detection and ELK (Nov 25, 2022)
An approach to ELK based on finding the "normal reason" for model behaviors on the training distribution and flagging anomalous examples.

Eliciting latent knowledge (Feb 25, 2022)
How can we train an AI to honestly tell us when our eyes deceive us?

Answering questions honestly given world-model mismatches (Jun 13, 2021)
I expect AIs and humans to think about the world differently. Does that make it more complicated for an AI to "honestly answer questions"?

A naive alignment strategy and optimism about generalization (Jun 10, 2021)
I describe a very naive strategy for training a model to "tell us what it knows."

Teaching ML to answer questions honestly instead of predicting human answers (May 28, 2021)
I discuss a three-step plan for learning to answer questions honestly instead of predicting what a human would say.