Paul Christiano
Jan 25, 2019

--

Here I only care about counterfactuals over the space of inputs, not over facts about the world or logical facts. That is, we need to:

  • Know that the model believes that A causes X in this particular situation (we get this from universality).
  • Understand something about how that judgment depends on facts about the situation (we get part of this from universality, but it is entangled with interpretability in a complicated way).
  • Find a situation where the model no longer believes that A causes X (this requires combining interpretability with a good relaxation).
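
As a rough illustration of what a counterfactual over the space of inputs could look like, here is a toy sketch. Everything in it is an assumption made for illustration: `believes_a_causes_x` is a hypothetical stand-in for the model's judgment, situations are flat dictionaries of boolean features, and the brute-force search merely takes the place of the interpretability-plus-relaxation step, which would look very different in practice.

```python
from itertools import combinations

# Hypothetical stand-in for the model's judgment: True when the model
# believes "A causes X" in the given situation (a dict of boolean features).
def believes_a_causes_x(situation):
    # Toy rule (an assumption): the belief holds only when A is present
    # and no blocker is present.
    return situation["a_present"] and not situation["blocker_present"]

def find_counterfactual(situation, believes, frozen=(), max_changes=2):
    """Vary the input only: flip up to max_changes non-frozen features and
    return the first modified situation where the belief no longer holds."""
    keys = sorted(k for k in situation if k not in frozen)
    for n in range(1, max_changes + 1):
        for subset in combinations(keys, n):
            candidate = dict(situation)  # copy, so the original is untouched
            for key in subset:
                candidate[key] = not candidate[key]
            if not believes(candidate):
                return candidate, subset
    return None, ()

original = {"a_present": True, "blocker_present": False, "irrelevant": True}

# Hold A itself fixed so the search changes the surrounding situation,
# not the candidate cause.
counterfactual, changed = find_counterfactual(
    original, believes_a_causes_x, frozen=("a_present",)
)
print(changed, counterfactual)
```

Here the search only ever edits the input situation; it never alters facts about the world or logical facts, which is the restriction described above.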
