Mar 24, 2018
The system relies only on an exhaustive search over tiny inputs. It’s mostly intended to reduce vulnerabilities in the human, though it should also reduce whatever other security vulnerabilities are present. (If distillation is introducing new vulnerabilities, though, there’s no particular reason to think you’d come out ahead after a distillation+amplification step; the hope is to avoid those with reliability.)
The red team / adversary needs to try to attack the final distilled agent.