Paul Christiano

Published in AI Alignment

·Feb 25, 2022

Eliciting latent knowledge

(This is a repost from December 2021, linking to a Google Doc.) In this report, we’ll present ARC’s approach to an open problem we think is central to aligning powerful machine learning (ML) systems: Suppose we train a model to predict what the future will look like according to cameras…

3 min read


Published in AI Alignment

·Jun 13, 2021

Answering questions honestly given world-model mismatches

(Warning: this post is rough and in the weeds. I expect most readers should skip it and wait for a clearer synthesis later. ETA: now available here.) In a recent post I discussed one reason that a naive alignment strategy might go wrong, by learning to “predict what humans would…

19 min read


Published in AI Alignment

·Jun 10, 2021

A naive alignment strategy and optimism about generalization

(Context: my last post was trying to patch a certain naive strategy for AI alignment, but I didn’t articulate clearly what the naive strategy is. I think it’s worth explaining the naive strategy in its own post, even though it’s not a novel idea.) Suppose that I jointly train an…

4 min read


Published in AI Alignment

·May 28, 2021

Teaching ML to answer questions honestly instead of predicting human answers

In this post I consider the problem of models learning “predict how a human would answer questions” instead of “answer questions honestly.” (A special case of the problem from Inaccessible Information.) I describe a possible three-step approach for learning to answer questions honestly instead: Change the learning process so that…

20 min read


Published in AI Alignment

·May 25, 2021

Decoupling deliberation from competition

I view intent alignment as one step towards a broader goal of decoupling deliberation from competition. Deliberation. Thinking about what we want, learning about the world, talking and learning from each other, resolving our disagreements, figuring out better methodologies for making further progress… Competition. Making money and racing to build…

11 min read


Published in AI Alignment

·May 4, 2021

Mundane solutions to exotic problems

I’m looking for alignment techniques that are indefinitely scalable and that work in any situation we can dream up. That means I spend time thinking about “exotic” problems — like AI systems reasoning about their own training process or about humanity’s far future. Yet I’m very optimistic about finding practical…

6 min read


Published in AI Alignment

·Apr 30, 2021

Low-stakes alignment

Right now I’m working on finding a good objective to optimize with ML, rather than trying to make sure our models are robustly optimizing that objective. (This is roughly “outer alignment.”) That’s pretty vague, and it’s not obvious whether “find a good objective” is a meaningful goal rather than being…

8 min read


Published in AI Alignment

·Apr 26, 2021

Announcing the Alignment Research Center

I’m now working full-time on the Alignment Research Center (ARC), a new non-profit focused on intent alignment research. I left OpenAI at the end of January and I’ve spent the last few months planning, doing some theoretical research, doing some logistical set-up, and taking time off. For now it’s just…

1 min read


Published in AI Alignment

·Mar 22, 2021

My research methodology

(Thanks to Ajeya Cotra, Nick Beckstead, and Jared Kaplan for helpful comments on a draft of this post.) I really don’t want my AI to strategically deceive me and resist my attempts to correct its behavior. Let’s call an AI that does so egregiously misaligned (for the purpose of this…

18 min read


Published in AI Alignment

·Sep 30, 2020

“Unsupervised” translation as an (intent) alignment problem

Suppose that we want to translate between English and an alien language (Klingon). We have plenty of Klingon text, and separately we have plenty of English text, but it’s not matched up and there are no bilingual speakers. We train GPT on a mix of English and Klingon text and…

5 min read
