Paul Christiano
Dec 27, 2018

HCH is defined as a fixed point of this process; if the recursion is ill-founded there might be multiple fixed points. You’d like to know whether they are all good; if they aren’t all good then you’d need to use some other feature of your HCH-approximator in order to conclude that it’s good.
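As a toy sketch of how an ill-founded recursion can admit several fixed points, suppose a candidate HCH is just a table from questions to answers, and one question simply defers to itself. Everything here (the `human` policy, the two questions "base" and "loop", and the `is_fixed_point` check) is a hypothetical illustration, not part of the definition of HCH.

```python
from typing import Callable, Dict


def human(question: str, consult: Callable[[str], str]) -> str:
    """Hypothetical toy human policy with access to a consultant."""
    if question == "base":
        # Answered directly, without consulting anyone.
        return "grounded"
    if question == "loop":
        # Ill-founded: the human just passes the same question along.
        return consult("loop")
    raise ValueError(f"unknown question: {question}")


def is_fixed_point(candidate: Dict[str, str]) -> bool:
    """A candidate is a fixed point if a human consulting it reproduces it."""
    return all(
        human(q, lambda sub: candidate[sub]) == answer
        for q, answer in candidate.items()
    )


# Two candidates that disagree on the self-referential question.
says_yes = {"base": "grounded", "loop": "yes"}
says_no = {"base": "grounded", "loop": "no"}
# A candidate that is wrong on the well-founded question.
not_fixed = {"base": "something else", "loop": "yes"}
```

In this toy, both `says_yes` and `says_no` pass the check while `not_fixed` fails it: the well-founded question pins down its answer, and the self-referential one does not.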

You could also define HCH(1) = “Humans consulting humans” and HCH(n) = “Humans consulting HCH(n-1).” Or HCH(n) = “humans consulting HCH(k) for various values of k, with the sum of k’s adding up to n-1.” Because each of these recursions is well-founded, the result is uniquely defined.
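The first of these depth-bounded definitions can be sketched in a few lines. This is a minimal illustration under loud assumptions: the "human" is a hypothetical toy policy that can only sum lists of at most two numbers unaided, and otherwise splits the question in half and consults a helper. The names `unaided_human`, `human_consulting`, and `hch` are made up for the example.

```python
from typing import Callable, List, Optional

Question = List[int]  # toy question: "what is the sum of these numbers?"
Answerer = Callable[[Question], Optional[int]]


def unaided_human(question: Question) -> Optional[int]:
    """Hypothetical human with no helper: can only sum up to two numbers."""
    if len(question) <= 2:
        return sum(question)
    return None  # stands for "I can't answer this on my own"


def human_consulting(helper: Answerer) -> Answerer:
    """Build a human who may delegate sub-questions to `helper`."""

    def answer(question: Question) -> Optional[int]:
        direct = unaided_human(question)
        if direct is not None:
            return direct
        # Split the question in half and consult the helper on each part.
        mid = len(question) // 2
        left = helper(question[:mid])
        right = helper(question[mid:])
        if left is None or right is None:
            return None
        return left + right

    return answer


def hch(n: int) -> Answerer:
    """HCH(1) = human consulting humans; HCH(n) = human consulting HCH(n-1)."""
    if n < 1:
        raise ValueError("n must be at least 1")
    if n == 1:
        return human_consulting(unaided_human)
    return human_consulting(hch(n - 1))
```

Since `hch(n)` only ever calls `hch(n - 1)`, the construction terminates and yields exactly one answerer per n. In this toy, `hch(1)` can sum four numbers but returns `None` on eight, while `hch(2)` handles eight.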

It’s not at all obvious, a priori, that HCH is “aligned” or good. There are obvious concerns with the fixed point definition, since there is room for various malicious and degenerate fixed points (e.g. whenever you ask HCH a question, you might get an answer that tries to hijack the entire process and then, if successful, get you to give an answer that tries to hijack the process…).

I think the same kinds of problems also afflict HCH(n) for sufficiently large n. I often talk about the fixed point definition because it is simpler and because it makes these problems more intuitive and clear.
