Consider a human who has access to a question-answering machine. Suppose the machine answers questions by perfectly imitating what the human would do if asked that question.
To make things twice as tricky, suppose the human-to-be-imitated is herself able to consult a question-answering machine, which answers questions by perfectly imitating what the human would do if asked that question…
Let’s call this process HCH, for “Humans Consulting HCH.”
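To make the recursion concrete, here is a minimal Python sketch. It is only an illustration under assumptions that are not part of the definition: the human is modeled as a function `human(question, ask)`, the names `hch` and `toy_human` are hypothetical, and a `depth` budget is added so the recursion terminates (the ideal HCH has no such bound).

```python
from typing import Callable, Optional

# Assumed types: an oracle maps a question to an answer (or None if it has
# no budget left); a human answers a question and may consult an oracle.
Oracle = Callable[[str], Optional[str]]
Human = Callable[[str, Oracle], str]


def hch(human: Human, question: str, depth: int) -> str:
    """Depth-limited stand-in for HCH: a human consulting copies of this same process."""

    def ask(subquestion: str) -> Optional[str]:
        if depth == 0:
            return None  # budget exhausted; the human must cope alone
        # The machine answers by imitating the human, who can in turn consult it.
        return hch(human, subquestion, depth - 1)

    return human(question, ask)


def toy_human(question: str, ask: Oracle) -> str:
    """Hypothetical human who sums space-separated integers by delegating halves."""
    numbers = question.split()
    if len(numbers) == 1:
        return numbers[0]
    mid = len(numbers) // 2
    left = ask(" ".join(numbers[:mid]))
    right = ask(" ".join(numbers[mid:]))
    if left is None or right is None:
        # No help available: do the whole sum unaided.
        return str(sum(int(n) for n in numbers))
    return str(int(left) + int(right))


print(hch(toy_human, "1 2 3 4", depth=3))  # prints 10
```

The toy human only ever splits a question and adds two sub-answers. Everything else is done by the copies further down the tree, which is the sense in which the human consults HCH.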
I’ve talked about many variants of this process before, but I find it easier to think about with a nice handle. (Credit to Eliezer for suggesting a recursive acronym.)
HCH is easy to specify very precisely. For now, I think that HCH is our best precisely specified model for “a human’s enlightened judgment.” It’s a pretty problematic model, but we don’t yet have many contenders.
We can also define realizable variants of this inaccessible ideal.
- For a particular prediction algorithm P, define HCHᴾ as:
“P’s prediction of humans consulting HCHᴾ”
- For a reinforcement learning algorithm A, define max-HCHᴬ as:
“A’s output when maximizing the score assigned to that output by humans consulting max-HCHᴬ”
- For a given market structure and participants, define HCHᵐᵃʳᵏᵉᵗ as:
“market estimates about humans consulting HCHᵐᵃʳᵏᵉᵗ”
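The self-reference in these definitions can be read as a fixed point that training tries to approach. Here is a rough Python sketch for the HCHᴾ case. The definitions above do not say how P is trained, so everything here is an assumption: the round-by-round scheme, the lookup-table predictor, and the names `TablePredictor`, `train_round`, and `toy_human` are all hypothetical.

```python
from typing import Callable, Dict, List, Optional

Oracle = Callable[[str], Optional[str]]
Human = Callable[[str, Oracle], Optional[str]]


class TablePredictor:
    """Stand-in for P: memorizes observed (question, answer) pairs."""

    def __init__(self) -> None:
        self.table: Dict[str, str] = {}

    def predict(self, question: str) -> Optional[str]:
        return self.table.get(question)


def train_round(human: Human, predictor: TablePredictor, questions: List[str]) -> None:
    """One round: record what the human says when consulting the previous round's P."""
    snapshot = dict(predictor.table)  # freeze last round's predictions
    for question in questions:
        answer = human(question, snapshot.get)
        if answer is not None:
            predictor.table[question] = answer


def toy_human(question: str, ask: Oracle) -> Optional[str]:
    """Hypothetical human who can only add two numbers, and gives up without help."""
    numbers = question.split()
    if len(numbers) == 1:
        return numbers[0]
    mid = len(numbers) // 2
    left = ask(" ".join(numbers[:mid]))
    right = ask(" ".join(numbers[mid:]))
    if left is None or right is None:
        return None
    return str(int(left) + int(right))


questions = ["1", "2", "3", "4", "1 2", "3 4", "1 2 3 4"]
predictor = TablePredictor()
for _ in range(3):
    train_round(toy_human, predictor, questions)
print(predictor.predict("1 2 3 4"))  # prints 10
```

In this toy setting each round lets the predictor answer questions one level deeper than before. A real predictor would generalize rather than memorize, and nothing here shows that such training converges in general.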
Note that e.g. max-HCHᴬ is totally different from “A’s output when maximizing the score assigned to that output by HCH.”
The latter proposal is essentially abstract approval-direction. It seems much more likely to yield good actions than max-HCHᴬ. But we can’t provide any feedback on the scores assigned by HCH, which makes it impossible to train the abstract version using conventional techniques. On the other hand, it is much easier to implement HCHᴾ, max-HCHᴬ, or HCHᵐᵃʳᵏᵉᵗ.
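The difference between the two can be written out side by side. The notation is an assumption introduced only for this comparison: H^O(x) is the human's response to x when allowed to consult oracle O, and the "≈" reflects that a real learner A need not find the exact maximum.

```latex
% H^{O}(x): the human's response to x when consulting oracle O (assumed notation).
\begin{align*}
\mathrm{HCH}(q) &= H^{\mathrm{HCH}}(q) \\
\text{max-HCH}^{A}(q) &\approx \arg\max_{a}\;
  H^{\text{max-HCH}^{A}}\bigl(\text{``score } a \text{ as an answer to } q\text{''}\bigr) \\
\text{abstract version}(q) &= \arg\max_{a}\;
  \mathrm{HCH}\bigl(\text{``score } a \text{ as an answer to } q\text{''}\bigr)
\end{align*}
```

In the second line the evaluator consults the very system being trained. In the third line the evaluator is the idealized HCH, which we cannot actually query.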
The best case is that HCHᴾ, max-HCHᴬ, and HCHᵐᵃʳᵏᵉᵗ are:
- As capable as the underlying predictor, reinforcement learner, or market participants.
- Aligned with the enlightened judgment of the human, e.g. as evaluated by HCH.
(At least when the human is suitably prudent and wise.)
It is clear from the definitions that these systems can’t be any more capable than the underlying predictor/learner/market. I honestly don’t know whether we should expect them to match the underlying capabilities. My intuition is that max-HCHᴬ probably can, but that HCHᴾ and HCHᵐᵃʳᵏᵉᵗ are much dicier.
It is similarly unclear whether these systems continue to reflect the human’s judgment. In some sense this is in tension with the desire to be capable: the more guarded the human, the less capable the system, but the more likely it is to reflect their interests. The question is whether a prudent human can achieve both goals.