Identifying ambiguous inductions

https://arbital.com/p/inductive_ambiguity

by Eliezer Yudkowsky Apr 17 2015 updated Mar 20 2016

What do a "red strawberry", a "red apple", and a "red cherry" have in common that a "yellow carrot" doesn't? Are they "red fruits" or "red objects"?


[summary: An 'inductive ambiguity' is when there's more than one simple concept that fits the data, even if some of those concepts are much simpler than others, and you want to figure out which simple concept was intended. Suppose you're given images that show camouflaged enemy tanks and empty forests, but it so happens that the tank-containing pictures were taken on sunny days and the forest pictures were taken on cloudy days. Given the training data, the key concept the user intended might be "camouflaged tanks", or "sunny days", or "pixel fields with brighter illumination levels". The last concept is by far the simplest, but rather than just assume the simplest explanation is correct with most of the probability mass, we want the algorithm (or AGI) to detect that there's more than one simple-ish boundary that might separate the data, and check with the user about which boundary was intended to be learned.]

One of the old fables in machine learning is the story of the "tank classifier" - a neural network that had supposedly been trained to detect enemy tanks hiding in a forest. It turned out that all the photos of enemy tanks had been taken on sunny days and all the photos of the same field without the tanks had been taken on cloudy days, meaning that the neural net had really just trained itself to recognize the difference between sunny and cloudy days (or just the difference between bright and dim pictures). (Source.)

We could view this problem as follows: A human looking at the labeled data might have seen several concepts that someone might be trying to point at - tanks vs. no tanks, cloudy vs. sunny days, or bright vs. dim pictures. A human might then ask, "Which of these possible categories did you mean?" and describe the difference using words; or, if it was easier for them to generate pictures than to talk, generate new pictures that distinguished among the possible concepts that could have been meant. Since learning a simple boundary that separates positive from negative instances in the training data is a form of induction, we could call this problem noticing "inductive ambiguities" or "ambiguous inductions".

This problem bears some resemblance to numerous setups in computer science where we can query an oracle about how to classify instances and we want to learn the concept boundary using a minimum number of instances. However, identifying an "inductive ambiguity" doesn't seem to be exactly the same problem, or at least, it's not obviously the same problem. Suppose we consider the tank-classifier problem. Distinguishing levels of illumination in the picture is a very simple concept, so it would probably be the first one learned; then, treating the problem in classical oracle-query terms, we might imagine the AI presenting the user with various random pixel fields at intermediate levels of illumination. The user, not having any idea what's going on, classifies these intermediate levels of illumination as 'not tanks', and so the AI soon learns that quite sunny levels of illumination are required. The queries refine the location of the brightness boundary, but they never reveal that brightness was the wrong concept in the first place.
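As a toy illustration of that failure mode, here is a minimal sketch of a classical boundary-refining learner. It assumes, purely for illustration, that each picture is reduced to a single brightness number in [0, 1], that the brightest "no tank" photo had brightness 0.3, and that the dimmest "tank" photo had brightness 0.9; the function name and the `confused_user` oracle are hypothetical.

```python
def learn_brightness_threshold(oracle, lo, hi, tol=0.01):
    """Binary-search a brightness boundary between lo and hi.

    lo: brightness of the brightest known negative example.
    hi: brightness of the dimmest known positive example.
    The learner takes for granted that the concept IS a brightness
    threshold; it only asks where the threshold lies.
    """
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if oracle(mid):
            hi = mid  # mid was labeled positive: boundary is at or below mid
        else:
            lo = mid  # mid was labeled negative: boundary is above mid
    return hi


# Hypothetical user who is shown synthetic pixel fields, sees no tank
# in any of them, and so answers "not a tank" every time.
confused_user = lambda brightness: False

threshold = learn_brightness_threshold(confused_user, lo=0.3, hi=0.9)
print(threshold)  # stays at 0.9: "only quite sunny pictures are tanks"
```

Every answer the user gives is honest, and every answer is absorbed as information about where the brightness boundary sits. Nothing in the loop can represent the possibility that the user meant tanks.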

Perhaps what we want is less like "figure out exactly where the concept boundary lies by querying the edge cases to the oracle, assuming our basic idea about the boundary is correct" and more like "notice when there's more than one plausible idea that describes the boundary" or "figure out if the user could have been trying to communicate more than one plausible idea using the training dataset".
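One very rough way to make "notice when there's more than one plausible idea" concrete is sketched below. It assumes each picture has already been reduced to three hand-labeled binary features (`tank`, `sunny`, `bright`) and that the candidate concepts are supplied in advance; both assumptions are hypothetical and hide most of the real difficulty, which is generating the candidate concepts in the first place. The sketch keeps every candidate that fits the training data, then looks for unseen inputs on which the surviving candidates disagree.

```python
from itertools import product

FEATURES = ("tank", "sunny", "bright")  # hypothetical hand-labeled features

# Training data as (features, label). As in the fable, every tank photo
# is also sunny and bright, and every empty-forest photo is neither.
TRAIN = [
    ({"tank": 1, "sunny": 1, "bright": 1}, True),
    ({"tank": 1, "sunny": 1, "bright": 1}, True),
    ({"tank": 0, "sunny": 0, "bright": 0}, False),
    ({"tank": 0, "sunny": 0, "bright": 0}, False),
]

# Candidate simple concepts: "the label is just this one feature".
HYPOTHESES = {name: (lambda x, n=name: bool(x[n])) for name in FEATURES}


def consistent_hypotheses(train, hypotheses):
    """Keep the hypotheses that classify every training example correctly."""
    return {
        name: h
        for name, h in hypotheses.items()
        if all(h(x) == label for x, label in train)
    }


def disambiguating_queries(hypotheses, feature_names):
    """List inputs on which the given hypotheses do not all agree."""
    queries = []
    for values in product((0, 1), repeat=len(feature_names)):
        x = dict(zip(feature_names, values))
        predictions = {name: h(x) for name, h in hypotheses.items()}
        if len(set(predictions.values())) > 1:
            queries.append((x, predictions))
    return queries


survivors = consistent_hypotheses(TRAIN, HYPOTHESES)
queries = disambiguating_queries(survivors, FEATURES)

if len(survivors) > 1 and queries:
    example, predictions = queries[0]
    print("More than one simple concept fits:", sorted(survivors))
    print("Ask the user about", example, "where the concepts say", predictions)
```

Unlike an edge-case query, a question such as "a tank on a cloudy day: positive or negative?" separates the candidate concepts from each other rather than refining one of them. Whether anything like this scales beyond hand-enumerated features is exactly the open problem.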

Possible approaches

Some possibly relevant approaches that might feed into the notion of "identifying inductive ambiguities":

Relevance in value alignment

Since inductive ambiguities are meant to be referred to the user for resolution rather than resolved automatically (the whole point is that the necessary data for an automatic resolution isn't there), they're instances of "user queries" and all standard worries about user queries would apply.

The hope about a good algorithm for identifying inductive ambiguities is that it would help catch edge instantiations and unforeseen maximums, and maybe just simple errors of communication.


Comments

Anna Salamon

Another helpful handle. Had the concept but without a name; better with a name.

Paul Christiano

My knee-jerk response to this problem (just as with mind crime and corrigibility) is to try to build systems that respect our preferences about how they compute, and in particular our preferences about when to ask for clarifications.

It seems like a moderately sophisticated reasoner would be able to infer what was going on (e.g. could make predictions about what a human would say when faced with two proposed classifications of an out-of-sample image). So the question seems to be about motivation rather than inductive capability.