Jul 22, 2018
- I don’t think heuristics like “humans don’t all die” are sufficient to avoid a bad outcome; at a minimum you need things like “humans have meaningful control of the situation” (which they can use to correct these subtle errors).
- Once you have an instrumental desire to be helpful, the training objective doesn’t really distinguish between different underlying goals. If there is no learning signal pushing you toward goal X, it doesn’t matter how easy X would be to learn.
- If you freeze heuristics from early in training (before the agent is sophisticated enough to engage in this kind of instrumental reasoning), then those heuristics will probably generalize poorly.