Paul Christiano
1 min read · Feb 13, 2018


This gets around some problems (and it’s the kind of thing I had in mind when talking about using verification to distill a slow trusted model into a fast trusted model).

But it doesn’t seem to deal with the statistical inefficiency. If your implicit ensemble has N things in it, you’ll need to arbitrate N possible catastrophes (since you need to arbitrate a possible catastrophe whenever even a single model thinks it’s bad).

As a silly example, suppose your ensemble consisted of all models within k bits of the simplest human-model X, and k is large enough to allow models like “use X most of the time, but on inputs satisfying predicate P always output ‘this is a catastrophe.’” Then you are going to have something flagged as a catastrophe for every k-bit predicate P, and there are exponentially many disjoint predicates. In practice it seems like k would need to be reasonably large to have confidence that the intended model is in there, so this looks like a deal breaker.

(I’m not super confident about this, but given that I don’t see a way around this problem I didn’t feel comfortable including implicit ensembles as an approach.)
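To make the counting concrete, here is a toy sketch of the argument above (my own illustration, not from the post). It assumes the simplest case where each predicate P just checks equality with a particular k-bit input, so each deviant model agrees with X everywhere except on one input it flags; the number of inputs that would need arbitration then doubles with every extra bit of slack k.

```python
# Toy illustration of the counting argument (a sketch, assuming each
# k-bit predicate P is of the form "input == j"). Each deviant model is
# identical to the intended model X except that it flags a catastrophe
# on one particular input, so the number of distinct flagged inputs --
# and hence the number of arbitrations needed -- grows like 2^k.

from itertools import product


def base_model(x):
    """The intended model X: never flags a catastrophe."""
    return "ok"


def make_deviant(target):
    """A model within ~k bits of X: agrees with X everywhere except that
    it flags 'catastrophe' on the single input `target`."""
    def model(x):
        return "catastrophe" if x == target else base_model(x)
    return model


def implicit_ensemble(k):
    """X plus one deviant model per 'input == j' predicate over k bits."""
    models = [base_model]
    for bits in product([0, 1], repeat=k):
        models.append(make_deviant(bits))
    return models


def flagged_inputs(models, k):
    """Inputs that at least one ensemble member flags; each of these
    would have to be arbitrated."""
    flagged = set()
    for x in product([0, 1], repeat=k):
        if any(m(x) == "catastrophe" for m in models):
            flagged.add(x)
    return flagged


if __name__ == "__main__":
    for k in range(1, 8):
        models = implicit_ensemble(k)
        print(f"k={k}: ensemble size {len(models)}, "
              f"inputs needing arbitration {len(flagged_inputs(models, k))}")
    # The arbitration count doubles with each extra bit of slack k.
```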
