I think the key question is whether:
- the burrito judge needs to be extremely powerful, or
- the burrito judge needs to be modestly more powerful than the burrito producer.
In world 1 I agree that the burrito evaluator seems pretty tough to build. We certainly have disagreements about that case, but I'm happy to set it aside for now.
In world 2 things seem much less scary. Because I only need to run these evaluations with e.g. 1% probability, the judge can use 50x more resources than the burrito producer while adding only 50% to the total cost. So it's imaginable that the judge can be more powerful than the producer.
You seem to think that we are in world 1. I think that we are probably in world 2, but I'm certainly not sure. I discuss the issue in this post.
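The resource arithmetic behind world 2 can be made concrete with a small sketch. The function name and the second example's numbers are illustrative, not from the discussion; only the 1% audit rate and 50x judge budget are:

```python
# Hypothetical spot-check cost model (names are illustrative).
# If each output is audited with probability p, and each audit costs k times
# what producing one output costs, then the expected evaluation overhead per
# output is p * k, measured in units of production cost.

def evaluation_overhead(audit_probability: float, judge_cost_multiplier: float) -> float:
    """Expected evaluation cost per output, in units of production cost."""
    return audit_probability * judge_cost_multiplier

# With the numbers above: auditing 1% of burritos with a judge that uses
# 50x the producer's resources adds only half the production cost overall.
print(evaluation_overhead(0.01, 50))  # -> 0.5
```

So the judge can be substantially more powerful than the producer per audited output while the total evaluation budget stays modest.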
- The judge's job is easier if she is evaluating steps of the plan, before those steps are taken, rather than actually letting the burrito producer take actions. So let's do it that way.
- The judge can look at the burrito producer's computation, and at the training process that produced that computation, and can change the burrito producer's training procedure to make that computation more understandable.
- If the judge were epistemically efficient with respect to the producer, then maximizing the judge's expectation of a burrito's quality would be the same as maximizing the burrito producer's expectation of a burrito's quality. That's basically what we want. So the real issue is narrower than you might expect: it's some kind of epistemic version of "offense vs. defense," where the producer can think particular thoughts that the judge doesn't happen to think, and so the producer might expect to be able to deceive or attack the judge even though the judge is smarter. This is what the judge is trying to avoid by looking at the producer's computation.
So I don't think that we can just ask the judge to evaluate the burrito; but the judge has enough going for her that I expect we can find some strategy that lets her win. I think this is the biggest open problem for my current approach.