Here I only care about counterfactuals over the space of inputs, not over facts about the world or… | by Paul Christiano | Medium

…does A)”. This seems like a useful fact to know about a model — I’m curious how we might obtain it. It seems like if we’re using “model M knows A causes X” in the ascription sense, there’s no part of the model we can modify to create a model M’ where A does not cause X. So the only way to verify this fact would be to train another model M_2 in a world where A does not…
Here’s one model of how neural networks might perform complex reasoning, which I’ll call the “bag…
Jon Crescent
Paul Christiano
Jan 25, 2019
Here I only care about counterfactuals over the space of inputs, not over facts about the world or logical facts. That is, we need to:
Know that the model believes that A causes X in this particular situation (we get this from universality).
Understand something about how that judgment depends on facts about the situation (we get part of this from universality, but it runs together with interpretability in a complicated way).
Find a situation where the model no longer believes that A causes X (this requires the combination of interpretability and a good relaxation).
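As a rough illustration of what a counterfactual over inputs looks like, here is a toy sketch of the last step. Everything in it is an assumption made for the example: `believes_a_causes_x` is a hypothetical stand-in for the model's judgment, the feature names are invented, and brute-force enumeration over a few boolean features stands in for the interpretability-plus-relaxation search, which it does not implement.

```python
from itertools import product

# Hypothetical stand-in for the model's judgment: True when the toy "model"
# believes that A causes X in the given situation. The rule itself is an
# assumption chosen only to make the example concrete.
def believes_a_causes_x(situation):
    return situation["switch_on"] and situation["wire_intact"]

def find_counterfactual_inputs(judge, situation, features):
    """Enumerate variations of the input and keep those where the judgment flips."""
    baseline = judge(situation)
    flipped = []
    for values in product([False, True], repeat=len(features)):
        # Vary only the input situation; the judge (the model) is left untouched.
        candidate = dict(situation, **dict(zip(features, values)))
        if judge(candidate) != baseline:
            flipped.append(candidate)
    return flipped

# Hypothetical situation in which the toy model believes that A causes X.
situation = {"switch_on": True, "wire_intact": True, "light_is_red": False}
features = ["switch_on", "wire_intact", "light_is_red"]
counterfactuals = find_counterfactual_inputs(believes_a_causes_x, situation, features)
```

Note that the search only changes the input, never the model or the facts it was trained on. In a real setting the input space is far too large to enumerate, which is where interpretability and a good relaxation would have to do the work.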