Oct 24, 2018
An input can be an attack on the group without being an attack on any individual decision process. The goal of security amplification is to ensure that, even in this case, attacking the group is harder than attacking an individual agent.
I want corrigibility to be a property of the optimization the system is doing, not of the individual actions it takes. A procedure like “act randomly” needs to be corrigible, even though it will often pick a bad option if there are any on the menu. Likewise, I want to define things such that “pick an action based on some dumb approximation” is corrigible, as long as that approximation doesn’t actively produce malicious behavior.