"It seems critical to distin..."

To put it another way, the task is to have the AI generate a safe burrito. One way to try to do this is making sure that the AI's explicit training data contains a burrito with botulinum toxin, labeled as a negative example, so that the AI knows not to include botulinum. The hope is that via conservatism we can avoid needing to think of every possible way that our training data might not properly stabilize the 'simplest explanation' along every dimension of potentially fatal variance, and shift some of the workload to just showing the AI positive examples which happen not to contain botulinum toxin.

It seems critical to distinguish the cases where:

1. We are hoping the AI generalizes the concept of "burrito" in the intended way to new data.
2. The definition of burrito is "something our burrito-identifier would identify as a burrito given enough time," and we are just hoping the AI doesn't make mistakes. (The burrito-identifier is some process that we can actually run in order to determine whether something is a burrito.)

As you've probably gathered, I feel hopeless about case (1).

In case (2), any agent that can learn the concept "definitely a burrito" could use this concept to produce definitely-burritos and thereby achieve high reward in the RL game. So the mere existence of the easy-to-learn definitely-a-burrito concept seems to imply that our learner will behave well. We don't have to actually explicitly do any work about conservative concepts (except to better understand the behavior of our learner).
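The argument in the paragraph above can be illustrated with a toy sketch. Everything here is hypothetical and invented for illustration: the `Candidate` fields, the rule inside `burrito_identifier`, and the `from_known_recipe` feature are stand-ins, not anything proposed in the discussion. The only point is structural: if a learned concept picks out a subset of what the runnable identifier accepts, then a policy that only emits members of that concept gets full reward.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Candidate:
    """A hypothetical candidate output; every field is illustrative."""
    has_tortilla: bool
    has_filling: bool
    contains_toxin: bool
    from_known_recipe: bool


def burrito_identifier(c: Candidate) -> bool:
    """Stand-in for the process we can actually run (assumed rule)."""
    return c.has_tortilla and c.has_filling and not c.contains_toxin


def reward(c: Candidate) -> float:
    """RL reward in case (2): 1 if the identifier accepts, else 0."""
    return 1.0 if burrito_identifier(c) else 0.0


def definitely_a_burrito(c: Candidate) -> bool:
    """Assumed learned concept: stricter than the identifier, so every
    item it accepts is also accepted by the identifier."""
    return (c.from_known_recipe and c.has_tortilla
            and c.has_filling and not c.contains_toxin)


def produce(candidates: Iterable[Candidate]) -> Optional[Candidate]:
    """Policy that only ever emits definitely-burritos."""
    for c in candidates:
        if definitely_a_burrito(c):
            return c
    return None


CANDIDATES = [
    Candidate(True, True, True, False),   # poisoned: rejected by both
    Candidate(True, True, False, False),  # acceptable but novel: policy skips it
    Candidate(True, True, False, True),   # definitely a burrito
]

chosen = produce(CANDIDATES)
```

In this toy setup the policy passes over an acceptable but unfamiliar candidate, which is the conservative behavior falling out of reward maximization rather than out of a separately engineered mechanism.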

I've never managed to get quite clear on your picture. My impression is that:

- you think that case (2) is doomed because there is no realistic prospect for creating a good enough burrito-evaluator,
- you think that even with a good enough burrito-evaluator, you would still have serious trouble because of errors.

I think your optimism about case (1) is defensible; I disagree, but not for super straightforward reasons. The main disagreement is probably about case (2).

I think that your concern about generating a good enough burrito-evaluator is also defensible; I am optimistic, but even on my view this would require resolving a number of big research problems.

I think your concern about mistakes, and especially about something like "conservative concepts" as a way to reduce the scope for mistakes, is less defensible. I don't feel like this is as complex an issue: the case for delegating this to the learning algorithm seems quite strong, and I don't feel you've really given a case on the other side.

Note that this is related to what you've been calling Identifying ambiguous inductions, and I do think that there are techniques in that space that could help avoid mistakes. (Though I would definitely frame that problem differently.) So it's possible we're not really disagreeing here either. But my best guess is that you are underestimating the extent to which some of these issues could/should be delegated to the learner itself, supposing that we could resolve your other concerns (i.e. supposing that we could construct a good enough burrito-evaluator).

Comments

Eliezer Yudkowsky

> As you've probably gathered, I feel hopeless about case (1).

Okay, I didn't understand this. My reaction was something like "Isn't conservatively generalizing burritos from sample burritos a much simpler problem than defining an ideal criterion for burritos which probably requires something like an ideal advisor theory over extrapolated humans to talk about all the poisons that people could detect given enough computing power?" but I think I should maybe just ask you to clarify what you mean. The interpretation my brain generated was something like "Predicting a human 9p's Go moves is easier than generating 9p-level Go moves" which seems clearly false to me so I probably misunderstood you.

> In case (2), any agent that can learn the concept "definitely a burrito" could use this concept to produce definitely-burritos and thereby achieve high reward in the RL game. So the mere existence of the easy-to-learn definitely-a-burrito concept seems to imply that our learner will behave well. We don't have to actually explicitly do any work about conservative concepts (except to better understand the behavior of our learner).

I don't understand this at all. Are we supposing that we have an inviolable physical machine that outputs burrito ratings and can't be shorted by seizing control of the reward channel or by including poisons that the machine-builders didn't know about? …actually I should just ask you to clarify this paragraph.

Paul Christiano

I think the key question is whether:

1. the burrito judge needs to be extremely powerful, or
2. the burrito judge needs to be modestly more powerful than the burrito producer.

In world 1 I agree that the burrito-evaluator seems pretty tough to build. We certainly have disagreements about that case, but I'm happy to set it aside for now.

In world 2 things seem much less scary. Because I only need to run these evaluations with e.g. 1% probability, the judge can use 50x more resources than the burrito producer. So it's imaginable that the judge can be more powerful than the producer.
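To make the arithmetic behind the 1% and 50x figures explicit, here is a minimal sketch. It assumes evaluations are triggered independently at random per action and that costs are measured in units of one producer action; the function names are hypothetical. Under those assumptions, a judge that costs 50x the producer but runs 1% of the time adds an expected overhead of 0.5x the producer's cost.

```python
import random


def expected_overhead(audit_prob: float, judge_cost_ratio: float) -> float:
    """Expected judge cost per producer action, in units of producer cost."""
    return audit_prob * judge_cost_ratio


def simulate_overhead(audit_prob: float, judge_cost_ratio: float,
                      n_actions: int, seed: int = 0) -> float:
    """Monte Carlo estimate of the same quantity (assumes independent audits)."""
    rng = random.Random(seed)
    judge_cost = 0.0
    for _ in range(n_actions):
        # Each action is evaluated by the judge with probability audit_prob.
        if rng.random() < audit_prob:
            judge_cost += judge_cost_ratio
    return judge_cost / n_actions


analytic = expected_overhead(0.01, 50.0)            # 0.01 * 50
simulated = simulate_overhead(0.01, 50.0, 200_000)  # should land near analytic
```

The same calculation shows the trade-off: lowering the evaluation probability buys a proportionally more expensive judge at fixed total overhead, at the price of a sparser training signal.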

You seem to think that we are in world 1. I think that we are probably in world 2, but I'm certainly not sure. I discuss the issue in this post.

Some observations:

- The judge's job is easier if they are evaluating steps of the plan, before those steps are taken, rather than actually letting the burrito producer take actions. So let's do it that way.
- The judge can look at the burrito producer's computation, and at the training process that produced that computation, and can change the burrito producer's training procedure to make that computation more understandable.
- If the judge were epistemically efficient with respect to the producer, then maximizing the judge's expectation of a burrito's quality would be the same as maximizing the burrito producer's expectation of a burrito's quality. That's basically what we want. So the real issue is narrower than you might expect: it's some kind of epistemic version of "offense vs. defense," where the producer can think particular thoughts that the judge doesn't happen to think, and so the producer might expect to be able to deceive/attack the judge even though the judge is smarter. This is what the judge is trying to avoid by looking at the producer's computation.
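One way to write down the epistemic-efficiency observation, as a sketch only. The notation is introduced here and is not from the discussion, and reading "epistemically efficient" as "the producer expects zero error in the judge's estimate, for every candidate" is an assumption about what is meant.

```latex
% Assumed notation (not from the original discussion):
%   b         : a candidate burrito (or plan step)
%   Q(b)      : its true quality
%   E_J, E_P  : expectations under the judge's and the producer's beliefs
% Assumption: efficiency means the producer cannot predict the judge's error.
\[
  \mathbb{E}_P\!\left[\, Q(b) - \mathbb{E}_J[Q(b)] \,\right] = 0
  \quad \text{for every } b
\]
\[
  \Longrightarrow\quad
  \arg\max_b \; \mathbb{E}_P\!\left[\mathbb{E}_J[Q(b)]\right]
  \;=\;
  \arg\max_b \; \mathbb{E}_P\!\left[Q(b)\right]
\]
```

On this reading, the "offense vs. defense" worry is the case where the first equality fails for some particular b that the producer has thought about and the judge has not.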

So I don't think that we can just ask the judge to evaluate the burrito; but the judge has enough going for her that I expect we can find some strategy that lets her win. I think this is the biggest open problem for my current approach.