I don’t see how structuring the reward function can work. Suppose that I’m training a system to map x → y, and I’m given the reward function R(x, y) and the catastrophe predicate C(x, y). I’d like to maximize E[R(x, π(x))] without ever satisfying C(x, π(x)) at test time. How are you proposing to do this?
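To make the setup concrete, here is a minimal sketch of what I have in mind (the names `reward_fn`, `catastrophe_fn`, and `policy` are just illustrative stand-ins I'm introducing, not anything you've proposed):

```python
import numpy as np

rng = np.random.default_rng(0)

def reward_fn(x, y):
    # R(x, y): the quantity training optimizes in expectation (toy choice)
    return -(y - x) ** 2

def catastrophe_fn(x, y):
    # C(x, y): the predicate that must *never* hold at test time (toy choice)
    return abs(y) > 10.0

def policy(x, w=1.0):
    # pi(x): the learned mapping x -> y (a trivial stand-in for a trained model)
    return w * x

xs = rng.normal(size=1_000)  # samples from the training distribution over x

expected_reward = np.mean([reward_fn(x, policy(x)) for x in xs])
any_catastrophe = any(catastrophe_fn(x, policy(x)) for x in xs)

# Training only gives an estimate of E[R(x, pi(x))] on samples like xs;
# "C(x, pi(x)) never holds" is a worst-case property over all test inputs,
# which the averaged training objective does not by itself control.
print(expected_reward, any_catastrophe)
```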
You might be saying that if C(x, y) implies that R(x, y) is low, then you aren’t concerned. But I don’t see why this would help, since e.g. a trained neural net can still get arbitrarily low reward on some inputs from the training distribution (and when I think through the particular cases people are concerned about, this seems likely to happen absent clever tricks).
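To illustrate that worry with made-up numbers: a policy can be nearly optimal in expectation while still producing catastrophically low-reward outputs on a tiny fraction of inputs, so "catastrophes get low reward" doesn't rule them out at test time.

```python
import numpy as np

n = 10_000_000
rewards = np.ones(n)      # typical inputs: reward 1.0
rewards[:10] = -1000.0    # a one-in-a-million tail of catastrophic outputs

print(rewards.mean())     # ~0.999: the expectation barely notices the tail
print(rewards.min())      # -1000.0: the worst case is still a catastrophe
```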
Or is the proposal to avoid solving this problem by setting things up so that there are no catastrophes? Approval-directed agents avoid some part of the wireheading problem, but they can still produce outputs on which everyone dies. It seems like this requires using the agent’s outputs cleverly and carefully, so that nothing too bad can happen regardless of its behavior, but I don’t see a plausible way to do that without crippling the agent.
If there were a good way forward on eliminating catastrophes outright, I agree that would be much more straightforward.