This gets around some problems (and it’s the kind of thing I had in mind when talking about using verification to distill a slow trusted model into a fast trusted model).
But it doesn’t seem to deal with the statistical inefficiency. If your implicit ensemble has N things in it, you’ll need to arbitrate N possible catastrophes (since you need to arbitrate a possible catastrophe whenever even a single model thinks it’s bad).
As a silly example, suppose your ensemble consisted of all models within k bits of the simplest human-model X, and k is large enough to allow models like “Use X most of the time, but on inputs satisfying predicate P always output ‘it’s a catastrophe.’” Then you are going to have something flagged as a catastrophe for every k-bit predicate P, and there are exponentially many disjoint predicates. In practice it seems like k would need to be reasonably large to have confidence that the intended model is in there, so this looks like a deal-breaker.
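To make the counting argument concrete, here’s a toy sketch (my own illustration, not anything from the setup above): I take the predicates to be “the input starts with a particular k-bit prefix,” which gives 2^k disjoint predicates, and the point is that every input ends up flagged by some ensemble member even though the intended model X never flags anything.

```python
# Toy illustration of the counting argument. The prefix predicates are a
# hypothetical choice, picked because they are disjoint and cover all inputs.

from itertools import product

K = 4  # bits of slack allowed around the simplest human-model X

def base_model_flags_catastrophe(x: str) -> bool:
    """The intended model X: never thinks an input is catastrophic."""
    return False

def make_override(prefix: str):
    """A nearby model: behaves like X except on inputs matching `prefix`,
    where it always reports a catastrophe."""
    def model(x: str) -> bool:
        return x.startswith(prefix) or base_model_flags_catastrophe(x)
    return model

# One ensemble member per k-bit prefix: 2^K disjoint predicates.
ensemble = [base_model_flags_catastrophe] + [
    make_override("".join(bits)) for bits in product("01", repeat=K)
]

def needs_arbitration(x: str) -> bool:
    """We must arbitrate whenever even a single ensemble member flags x."""
    return any(model(x) for model in ensemble)

# Every input matches exactly one prefix, so every input triggers arbitration.
inputs = ["".join(bits) + "0110" for bits in product("01", repeat=K)]
print(sum(needs_arbitration(x) for x in inputs), "of", len(inputs), "inputs flagged")
```

With K = 4 all 16 test inputs get flagged; the same construction with a realistic k gives exponentially many disjoint flagged regions, which is the statistical inefficiency I’m worried about.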
(I’m not super confident about this, but given that I don’t see a way around this problem, I didn’t feel comfortable including implicit ensembles as an approach.)