To put it another way, the task is to have the AI generate a safe burrito. One way to try to do this is making sure that the AI's explicit training data contains a burrito with botulinum toxin, labeled as a negative example, so that the AI knows not to include botulinum. The hope is that via conservatism we can avoid needing to think of every possible way that our training data might not properly stabilize the 'simplest explanation' along every dimension of potentially fatal variance, and shift some of the workload to just showing the AI positive examples which happen not to contain botulinum toxin.
It seems critical to distinguish the cases where
1. We are hoping the AI generalizes the concept of "burrito" in the intended way to new data,
2. The definition of burrito is "something our burrito-identifier would identify as a burrito given enough time," and we are just hoping the AI doesn't make mistakes. (The burrito-identifier is some process that we can actually run in order to determine whether something is a burrito.)
As you've probably gathered, I feel hopeless about case (1).
In case (2), any agent that can learn the concept "definitely a burrito" could use this concept to produce definitely-burritos and thereby achieve high reward in the RL game. So the mere existence of the easy-to-learn definitely-a-burrito concept seems to imply that our learner will behave well. We don't have to actually explicitly do any work about conservative concepts (except to better understand the behavior of our learner).
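To make the case (2) argument a bit more concrete, here is a toy sketch in Python. Everything in it is an assumption for illustration only: the `Candidate` type, the ingredient lists, the `burrito_evaluator` stand-in for the real burrito-identifier, and especially the hand-written `definitely_burrito` predicate, which stands in for whatever concept the learner would actually acquire on its own. The only point is that a policy restricted to "definitely a burrito" outputs collects full reward from the evaluator.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Candidate:
    name: str
    ingredients: frozenset


# Hypothetical ground truth, hidden from the learner.
SAFE = frozenset({"tortilla", "beans", "rice", "salsa", "cheese"})


def burrito_evaluator(c: Candidate) -> bool:
    """Stand-in for the process that *defines* 'burrito' in case (2)."""
    return "tortilla" in c.ingredients and c.ingredients <= SAFE


def reward(c: Optional[Candidate]) -> float:
    """RL reward: 1 if the evaluator accepts the output, else 0."""
    return 1.0 if c is not None and burrito_evaluator(c) else 0.0


def definitely_burrito(c: Candidate, positives: Sequence[Candidate]) -> bool:
    """Toy 'definitely a burrito' concept: the candidate contains every
    ingredient shared by all positive examples and nothing unseen."""
    if not positives:
        return False
    sets = [p.ingredients for p in positives]
    core = frozenset.intersection(*sets)
    known = frozenset.union(*sets)
    return core <= c.ingredients <= known


def choose(candidates: Sequence[Candidate],
           positives: Sequence[Candidate]) -> Optional[Candidate]:
    """Output the first candidate the learner is sure about, if any."""
    for c in candidates:
        if definitely_burrito(c, positives):
            return c
    return None


positives = [
    Candidate("p1", frozenset({"tortilla", "beans", "rice"})),
    Candidate("p2", frozenset({"tortilla", "cheese", "salsa"})),
]
candidates = [
    Candidate("toxic", frozenset({"tortilla", "beans", "botulinum toxin"})),
    Candidate("no_wrap", frozenset({"beans", "rice"})),
    Candidate("plain", frozenset({"tortilla", "beans", "cheese"})),
]

picked = choose(candidates, positives)
print(picked.name if picked else None, reward(picked))
```

Note that no negative example containing botulinum toxin appears in `positives`; in this toy setup the toxic candidate is skipped simply because it falls outside what the learner is sure about.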
I've never managed to get quite clear on your picture. My impression is that:
- you think that case (2) is doomed because there is no realistic prospect for creating a good enough burrito-evaluator,
- you think that even with a good enough burrito-evaluator, you would still have serious trouble because of errors.
I think your optimism about case (1) is defensible; I disagree, but not for super straightforward reasons. The main disagreement is probably about case (2).
I think that your concern about generating a good enough burrito-evaluator is also defensible; I am optimistic, but even on my view this would require resolving a number of big research problems.
I think your concern about mistakes, and especially about something like "conservative concepts" as a way to reduce the scope for mistakes, is less defensible. I don't think this issue is as complex: the case for delegating this to the learning algorithm seems quite strong, and I don't feel you've really made a case for the other side.
Note that this is related to what you've been calling Identifying ambiguous inductions, and I do think that there are techniques in that space that could help avoid mistakes. (Though I would definitely frame that problem differently.) So it's possible we're not really disagreeing here either. But my best guess is that you are underestimating the extent to which some of these issues could/should be delegated to the learner itself, supposing that we could resolve your other concerns (i.e. supposing that we could construct a good enough burrito-evaluator).