Reinforcement learning and linguistic convention

by Paul Christiano Feb 3 2016 updated Mar 4 2016

Existing machine learning techniques are most effective when we can provide concrete feedback — such as prediction accuracy or score in a game. For the purpose of AI control, I think that this is an important and natural category, and I really need a term for it.

I’ve been referring to this regime as “supervised” learning, but that’s an extremely unusual use of the term and probably a mistake. I plan to use the term “reinforcement learning” in the future.

Sutton and Barto, who I will take as authorities, define reinforcement learning informally as:

learning what to do — how to map situations to actions — so as to maximize a numerical reward signal.

If we construe “actions” broadly, so as to include cognitive actions (like forming or updating a representation), then this is exactly what I want to talk about. It’s not clear whether Sutton and Barto mean to include supervised learning as a special case, but Barto at least writes: “it it is possible to convert any supervised learning task into a reinforcement learning task.”

This category is much broader than what people normally mean when they talk about “reinforcement learning.” It’s probably even broader than what Sutton and Barto have in mind. But still, I think “reinforcement learning” captures what I want relatively well, and is definitely better than any other existing term. So I’m going to run with it.

From my perspective, the key (and almost only) assumption of reinforcement learning is that the objective-to-be-optimized comes in the form of numerical rewards that are actually provided to the algorithm. This is required for many modern methods (e.g. anything using gradient descent), even those that we don’t normally think of as RL.


Supervised learning

Supervised learning seems like an important special case.

From the perspective of AI control, I think the most important simplifying assumption is that the information provided to the algorithm does not depend on its behavior. So there is no need for explicit exploration, data can be reused, and so on.