Jan 25, 2019
Here I only care about counterfactuals over the space of inputs, not over facts about the world or logical facts. That is, we need to:
- Know that the model believes that A causes X in this particular situation (we get this from universality)
- Understand something about how that judgment depends on facts about the situation (we get part of this from universality, but it runs together with interpretability in a complicated way)
- Find a situation where it no longer believes that A causes X (this requires the combination of interpretability and a good relaxation).
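
To make the three steps concrete, here is a minimal toy sketch in Python. Everything in it is an assumption made for illustration: `believes_a_causes_x` is a hypothetical stand-in for the model's judgment, inputs are small dictionaries of boolean features, and brute-force enumeration takes the place of universality, interpretability, and a relaxation, none of which is implemented here. It only shows the shape of the search over inputs: read off the judgment, see which features it depends on, then look for a nearby input where the judgment flips.

```python
from itertools import combinations


def believes_a_causes_x(situation):
    """Hypothetical stand-in for the model's judgment on one input."""
    # Toy rule (an assumption): A is judged to cause X only when A is
    # present and nothing blocks it.
    return situation["a_present"] and not situation["blocker_present"]


def sensitivity(judge, situation):
    """Step 2: list the features whose single flip changes the judgment."""
    base = judge(situation)
    changed = []
    for name in situation:
        flipped = dict(situation, **{name: not situation[name]})
        if judge(flipped) != base:
            changed.append(name)
    return changed


def find_counterfactual(judge, situation, frozen=()):
    """Step 3: search inputs, fewest flips first, for one the judge rejects."""
    free = [name for name in situation if name not in frozen]
    for k in range(1, len(free) + 1):
        for names in combinations(free, k):
            candidate = dict(situation)
            for name in names:
                candidate[name] = not candidate[name]
            if not judge(candidate):
                return candidate
    return None  # no input in this toy space changes the judgment


situation = {"a_present": True, "blocker_present": False, "raining": True}

# Step 1: the judgment in this particular situation.
belief = believes_a_causes_x(situation)

# Step 2: which facts about the situation the judgment depends on.
relevant = sensitivity(believes_a_causes_x, situation)

# Step 3: a situation where A is still present but is no longer judged
# to cause X (freezing "a_present" is a choice made for this sketch).
counterfactual = find_counterfactual(
    believes_a_causes_x, situation, frozen=("a_present",)
)
```

In this toy space the search is exhaustive, which is exactly what would not scale to real inputs; that gap is where interpretability and a good relaxation would have to do the work.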