"This seems like a straw alt..."

On a higher level of abstraction, we can imagine that the universe is parsed by us into a set of variables $~$V\_i$~$ with values $~$v\_i.$~$ We want to avoid the agent taking actions that cause large amounts of disutility by perturbing variables from $~$v\_i$~$ to $~$v\_i^\*$~$ in a way that decreases utility\. However, the question of exactly which variables $~$V\_i$~$ are important and shouldn't be entropically perturbed is value\-laden \- complicated, fragile, high in algorithmic complexity, with Humean degrees of freedom in the concept boundaries\. Rather than relying solely on teaching an agent exactly which parts of the environment shouldn't be perturbed and risking catastrophe if we miss an injunction, the low impact route would try to build an agent that tried to perturb fewer variables regardless\. The hope is that "have fewer side effects" will have a central core and be learnable by a manageable amount of training\. Conversely, trying to train "here is the list of bad effects not to have and important variables not to perturb" would be complicated and lack a simple core, because 'bad' and 'important' are value\-laden\.

This seems like a straw alternative. More realistically, we could imagine an agent which avoids perturbing a variable if it predicts the human would say "changing that variable is problematic" when asked. Then:

We don't have to explicitly cover injunctions, just to provide information that allows the agent to predict human judgments.
If the AI is bad at making predictions, then it may just end up with lots of variables for which it thinks the human might say "changing that variable is problematic." Behaving appropriately with respect to this uncertainty could recover the desired behavior.

Consider an agent who is learning to predict when a human considers a change problematic. Now suppose that the agent is not able to learn the complex value-laden concept of "important change," but is able to learn the simpler concept of "big change."

This agent can use the concept of "big change" in order to make predictions about "important change," namely: "if a change is big, it might be important."

So any agent who is able to learn the concept of "big change" should be able to make predictions at least as well as if it simply guessed that every big change had an appropriate probability of being important. For example, if 1% of big changes are important, then a reasonable learner, who is smart enough to learn the concept of "big change," will predict at least as well as if it simply predicted that each big change was important with 1% probability.

If we use such a learner appropriately, this seems like it can obtain behavior at least as good as if the agent was first been taught a measure of impact and then used that measure to avoid (or flag) high-impact consequences.

To me it feels much more promising to learn an impact measure implicitly as an input into what changes are "important." The alternative feels like a non-starter:

The track record for learning looks a lot better than figuring things out "by hand."
The learned approach is easy to integrate with existing and foreseeable systems, while the by hand approach seems to require big changes in AI architectures.
On the object level, the notion of low impact really doesn't look like it is going to have a clean theoretical specification (you point out many of the concerns).

I would like to better understand our disagreement, though I'm not sure if it's a priority and so you should feel free to ignore. But if you want to clarify: does one of these two concerns capture your position regarding learning an impact measure?

We might be able to specify an impact measure much more effectively than the agent can learn it (perhaps because we can directly specify a measure that will generalize well to radically different contexts, whereas a learned measure would not be robust to big context changes)
Even if the agent could learn an impact measure, and even if it could predict objectionable changes effectively by using that impact measure conservatively, we shouldn't expect an objectionable-change-predictor to actually use this particular strategy or an equally effective alternative (perhaps it uses some other strategy which perhaps achieves a higher payoff in simple environments but then generalizes worse).

(For reference, the main context change I have in mind is moving from "weak agent proposing dumb plans" to "smarter agent proposing cleverer plans," where "cleverer" may involve some optimization for being apparently low impact).

Alternatively, I may be misunderstanding where your position. I agree that even if you want an agent to learn an impact measure, it is worth thinking about what kind of impact measure it might learn and how that measure will generalize. So it's possible that we don't actually disagree about how the ultimate agent might look, but are just emphasizing different parts of how to get there.

Comments

Eliezer Yudkowsky

We don't have to explicitly cover injunctions, just to provide information that allows the agent to predict human judgments.

I should edit to clarify that I worry the process of teaching the agent the category of everything to avoid will fail, not that I'm assuming an explicit list with no further extrapolation. If I thought there was a straightforwardly trainable simple concept covering everything an AGI shouldn't do, which left enough space to include pivotal acts, and that didn't max out in weird edge cases when subjected to strong optimization pressures, I would be a lot less worried than I currently am!

If the AI is bad at making predictions, then it may just end up with lots of variables for which it thinks the human might say "changing that variable is problematic." Behaving appropriately with respect to this uncertainty could recover the desired behavior.

This is an issue of prioritization. There's some finite amount of checking humans can do which has to be distributed over everything there is to be checked. Let's say a team of 10 checkers can each do 4 detailed checks and 40 quick checks per day without getting sloppy. If the catastrophic act is on some level noticed, but it's assigned a probability-of-catastrophe level which ranks it as the 1000th item on the list, it's going to slip through if the AGI is allowed to do anything at all. Again, your intuition seems to be that there should be a few easy obvious things to check, whereas I'm nervous that any checking procedure strong enough to catch the disasters is going to produce a huge number of false positives because the AI will not internally contain the information and cognitive capacity required to tell the difference.

We differ in how much we think predictors can safely do automatically. My reason for wanting to think about low impact explicitly has two parts.

First, I'm concerned that for realistic limited AGIs of the sort we'll actually see in the real world, we will not want to amplify its intelligence up to the point where all learning can be taken for granted, we will want to use known algorithms, and therefore, considering something like 'low impact' explicitly and as part of machine learning may improve our chances of ending up with a low-impact AGI.

Second, if there turns out to be an understandable core to low impact, then by explicitly understanding this core we can decrease our nervousness about what a trained AGI might have been trained to do. By default we'd need to worry about an AGI blindly trained to flag possibly dangerous things, learning some unknown peculiar generalization of low impact that will, like a neural network being fooled by the right pattern of static, fail in some weird edge case the next time its option set expands. If we understand explicitly what generalization of low impact is being learned, it would boost our confidence (compared to the blind training case) of the next expansion of options not being fooled by the right kind of staticky image (under optimization pressure from a planning module trying to avoid dangerous impacts).

This appears to me to go back to our central disagreement-generator about how much the programmers need to explicitly understand and consider. I worry that things which seem like 'predictions' in principle won't generalize well from previously labeled data, especially for things with reflective degrees of freedom, double-especially for limited AGI systems of the sort that we will actually see in practice in any endgame with a hope of ending well. Or more simple terms, I think that trying to have safety systems that we don't understand and that have been generalized from labeled data without us fully understanding the generalization and its possible edge cases are nigh-inevitable recipes for disaster. Or in simpler simpler terms, you can't possibly get away with building a powerful AGI you understand that poorly.