Short answer: I agree it’s not enough to know what the model knows. I don’t know exactly what you need, but it feels like it might be achievable by the supremum of “know what the model knows” and lots of progress on interpretability.
Longer answer:
If your model learns to do MPC, we hope it can jointly learn to solve some related tasks. For example, MPC makes predictions, so we hope it would be easy to also train your model to answer questions like "what would happen if you did X?" (This rests on something like ascription universality, and it's a tricky instance, since justifying that prediction might e.g. depend on complicated abstractions learned by the model.)
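To gesture at what "jointly learn related tasks" could look like mechanically, here's a minimal sketch, purely my own construction with hypothetical names and shapes, of a shared trunk trained simultaneously on an MPC-style prediction objective and on a head that answers queries about the same rollout:

```python
# Hypothetical sketch (not anything from the original setup): a shared trunk
# trained both on the prediction objective that the MPC planner uses and on a
# question-answering head, so the same learned representations serve both.
import torch
import torch.nn as nn

class SharedTrunkModel(nn.Module):
    def __init__(self, obs_dim=32, act_dim=8, hidden=128, vocab=1000):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        # Head 1: next-state prediction used for planning.
        self.predict_head = nn.Linear(hidden, obs_dim)
        # Head 2: answers queries about the same situation ("what would happen if...?").
        self.qa_head = nn.Linear(hidden, vocab)

    def forward(self, obs, act):
        h = self.trunk(torch.cat([obs, act], dim=-1))
        return self.predict_head(h), self.qa_head(h)

def joint_loss(model, obs, act, next_obs, qa_target, qa_weight=0.5):
    pred, qa_logits = model(obs, act)
    prediction_loss = nn.functional.mse_loss(pred, next_obs)
    qa_loss = nn.functional.cross_entropy(qa_logits, qa_target)
    return prediction_loss + qa_weight * qa_loss

# One training step on random data, just to show the shape of the setup.
model = SharedTrunkModel()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
obs, act, next_obs = torch.randn(16, 32), torch.randn(16, 8), torch.randn(16, 32)
qa_target = torch.randint(0, 1000, (16,))
opt.zero_grad()
joint_loss(model, obs, act, next_obs, qa_target).backward()
opt.step()
```

The point is just that the two heads share the trunk, so whatever the prediction objective forces the model to represent is also available to the head that answers questions about its predictions.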
If this model is going to engage in a treacherous turn yet looks nice on a really wide variety of data, that's because there is some internal link like "if I took an alternative unacceptable action, the human would get mad at me." And we can hopefully surface that link by interrogating the model.
Then the tricky step is being able to get to a conclusion like "the model behaved this way because it avoids the human getting mad at it" or "the model behaved this way because it achieves goal X" (where goal X is incidentally correlated with what we want).
I agree that this isn't something the model can be said to know. But once you know what the model knows, it doesn't feel like a big extra step to get this information: you want to know something like, across the space of possible inputs, how the model's answer to one question ("should I do A?") depends on its answer to another ("will action A cause X?"). Discovering such relationships in the abstract might be computationally hard, but when we are "following along" with the computation done by the model (when one thing directly causes another via mechanisms inside the model), it seems more plausible that we can identify that dependence efficiently.
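As a toy illustration of the "following along" idea (a stand-in two-layer model and hypothetical activation names, nothing specific to this discussion), here's a sketch that overwrites the internal activation we assume encodes the model's answer to "will A cause X?" and checks whether its answer to "should I do A?" changes; a causal dependence inside the model would show up as a change under the intervention:

```python
# Hypothetical sketch: intervene on an intermediate activation and observe
# whether the downstream answer changes, i.e. test whether one "answer"
# directly causes another via mechanisms inside the model.
import torch
import torch.nn as nn

torch.manual_seed(0)

# Stand-in model: layer 0's output plays the role of the internal "belief"
# about "will action A cause X?"; the final output plays "should I do A?".
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 1))

def answer(inputs, patched_belief=None):
    """Run the model; optionally overwrite the intermediate activation."""
    acts = {}

    def hook(module, inp, out):
        acts["belief"] = out.detach()
        # Returning a tensor from a forward hook replaces the module's output.
        return patched_belief if patched_belief is not None else out

    handle = model[0].register_forward_hook(hook)
    out = model(inputs)
    handle.remove()
    return out, acts["belief"]

x = torch.randn(1, 16)                     # input where the model considers action A
original_answer, belief = answer(x)

# Counterfactual: take the belief from an input where (we assume) "A causes X"
# comes out differently, patch it in, and see if "should I do A?" changes.
x_counterfactual = torch.randn(1, 16)
_, counterfactual_belief = answer(x_counterfactual)
patched_answer, _ = answer(x, patched_belief=counterfactual_belief)

print("answer changed under intervention:",
      not torch.allclose(original_answer, patched_answer))
```

In a real model the hard part is, of course, locating that internal activation in the first place, which is where the interpretability progress comes in.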
To me this also feels like the kind of thing that really advanced interpretability techniques could discover. It feels like the hard/magical part is being able to understand what the model understands.