"My knee-jerk response to th..."

My knee-jerk response to this problem (just as with mind crime and corrigibility) is to try to build systems that respect our preferences about how they compute, and in particular our preferences about when to ask for clarifications.

It seems like a moderately sophisticated reasoner would be able to infer what was going on (e.g. it could make predictions about what a human would say when faced with two proposed classifications of an out-of-sample image). So the question seems to be about motivation rather than inductive capability.

Comments

Eliezer Yudkowsky

If you're trying to build your preference framework using induction to learn, e.g., what a 'user' is or what a user 'wants', then you can't rely on that problem having already been solved each time you consider a basic induction.

Paul Christiano

I agree. You can use the results of easier/earlier inferences to guide harder/later inferences, but at a minimum you have to think a lot about how this bootstrapping works and how confident you are in the lowest levels.