Paul Christiano
Feb 21, 2018

I think the first steps towards preventing malign and benign failures are similar. I think it’s plausible that “never fail too badly/confidently” is possible, even if we are talking about benign failures. It totally might turn out to be impossible though.

There are some empirical questions that require studying malign failures, e.g. “does optimization actually produce treacherous turns?” It’s also totally plausible that studying any of the approaches I listed will (eventually) require working with malign failures.

It’s conceivable that we could study malign failures by looking at models with planted failures, rather than failures that emerge naturally from optimization. That sounds like a much harder problem to me, but if you could solve it then we’d definitely be good. (Chris Olah has expressed optimism that this isn’t obviously impossible, and that it seems worth keeping in mind as a really ambitious goal for transparency.)

It might be very hard to get a good model of naturally-occurring malign failures. (Which is one reason to expect them not to occur in practice even for very powerful systems, but unfortunately it isn’t a super strong reason.)
