Informed oversight

by Jessica Taylor Mar 23 2016 updated Mar 24 2016

Incentivize a reinforcement learner that's less smart than you to accomplish some task

It can be difficult for one agent to oversee another agent that is maximizing a reward controlled by the first agent, even if the first agent is smarter than the second agent. This is a serious obstacle to approval-directed agents and other approaches based on reinforcement learning.

More formally, suppose a strong agent B with a utility function U is overseeing a weak agent A. Agent A produces an action y and gets a reward B(y) determined by agent B. B wants to give rewards such that agent A is incentivized to make its best effort at maximizing U(y).

If B(y) is an efficient estimate of U(y) relative to A, then maximizing B(y) is equivalent, from A's perspective, to maximizing U(y). Unfortunately, this is difficult to arrange even when B is smarter than A, because A may think thoughts that B does not in the course of selecting y.

Here are a couple of examples (each having the flavor of "easy to generate a solution, hard to check a solution") that seem easier to analyze than realistic cases:

  1. Let O be a one-way function (e.g., SHA-256). U assigns a high value to binary strings y equal to O(111x) for some string x. B cannot tell whether a particular string y is equal to O(111x) for some x (even if B is smarter than A), so B has no good way of giving rewards to A.
  2. Same as the previous problem, except U assigns a high value to binary strings y that are not equal to O(111x) for any short string x. A's intended strategy is to generate a uniformly random string (or generate a string in some other way that doesn't introduce a bias towards generating strings of the form O(111x)), but B has no way of knowing that A's string was uniformly random (rather than equal to O(111x) for a uniformly random x).

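The asymmetry in these two examples can be made concrete. In the following sketch (instantiating O with SHA-256, as in example 1), B cannot feasibly decide on its own whether y has the form O(111x), but a single witness x from A makes checking trivial; in example 2, by contrast, no analogous witness of "y is *not* of that form" exists:

```python
import hashlib
import secrets

def O(s: bytes) -> bytes:
    """The one-way function from the example, instantiated with SHA-256."""
    return hashlib.sha256(s).digest()

# Agent A generates a high-U string together with a witness x.
x = secrets.token_bytes(16)
y = O(b"111" + x)

# Without the witness, B has no feasible way to tell whether y has the
# form O(111x): that would require inverting SHA-256. With the witness,
# checking takes one hash evaluation:
def check_with_witness(y: bytes, x: bytes) -> bool:
    return O(b"111" + x) == y

assert check_with_witness(y, x)        # easy for B once A reveals x
assert not check_with_witness(y, b"not the witness")
```

This is what Paul's comment below means by "the agent can justify its answer" in the first case but not the second: example 1 has short certificates of goodness, while example 2 only has certificates of badness.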
One attempted solution is for agent B to see A's thought processes in addition to the action y. Then, agent B can use all the information A has, so that B(y) is an efficient estimate of U(y) from the perspective of A. However, this requires B to understand something about A's mental architecture. It would be good to have a more generic solution that requires only weak assumptions about A's mental architecture.
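One way to operationalize "B sees A's thought processes" is for A to emit a transcript of how it generated y, which B replays before assigning reward. The sketch below uses a hypothetical transcript format (the `procedure`/`seed` fields are illustrative, not part of any proposal in the text), and shows why this leans on assumptions about A's architecture: B can only evaluate procedures it recognizes.

```python
import hashlib
import secrets

def O(s: bytes) -> bytes:
    return hashlib.sha256(s).digest()

# A produces an action together with a record of how it was generated.
def agent_A():
    seed = secrets.token_bytes(16)
    y = O(seed)  # A's (toy) generation procedure: hash a fresh random seed
    transcript = {"procedure": "hash_random_seed", "seed": seed}
    return y, transcript

# B replays the transcript. If the recorded procedure reproduces y, then B
# has all the information A used and can estimate U(y) from A's perspective;
# if the transcript is inconsistent or unrecognized, B withholds reward.
def overseer_B(y: bytes, transcript: dict) -> float:
    if transcript.get("procedure") == "hash_random_seed":
        replayed = O(transcript["seed"])
        return 1.0 if replayed == y else 0.0
    return 0.0  # B cannot evaluate a procedure it doesn't understand

y, t = agent_A()
assert overseer_B(y, t) == 1.0
```

The weak point is the last branch: this only works insofar as B can interpret A's internals, which is exactly the architectural assumption the text says we would like to avoid.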

An attempted solution to problem 2 is for B to incentivize A to generate its action in a provably random manner. However, this doesn't appear to generalize well to problems significantly different from 2.

Further reading

Paul Christiano on adequate oversight

Paul Christiano on the informed oversight problem


Paul Christiano

In the sudoku and first OWF example, the agent can justify its answer, and it's easy to incentivize it to reveal that justification. In the steganography and second OWF example, there is no short proof that something is good, only a proof that something is bad. In realistic settings there will be lots of arguments on both sides. Another way of looking at the question is: how do you elicit the negative arguments?

Katja wrote about a scheme here. I think it's a nice idea that feels like it might be relevant. But if you include it as part of the agent's reward, and the agent also picks the action, then you get actions optimized to be info-rich (as discussed in "maximizing B + info" here).