"> We don't have to explicit..."

https://arbital.com/p/2qm

by Eliezer Yudkowsky Mar 19 2016


> We don't have to explicitly cover injunctions, just to provide information that allows the agent to predict human judgments.

I should edit to clarify that my worry is that the process of teaching the agent the category of everything-to-avoid will fail, not that I'm assuming an explicit list with no further extrapolation. If I thought there were a straightforwardly trainable simple concept covering everything an AGI shouldn't do, which left enough space to include pivotal acts, and which didn't max out in weird edge cases when subjected to strong optimization pressures, I would be a lot less worried than I currently am!

> If the AI is bad at making predictions, then it may just end up with lots of variables for which it thinks the human might say "changing that variable is problematic." Behaving appropriately with respect to this uncertainty could recover the desired behavior.

This is an issue of prioritization. There's some finite amount of checking humans can do which has to be distributed over everything there is to be checked. Let's say a team of 10 checkers can each do 4 detailed checks and 40 quick checks per day without getting sloppy. If the catastrophic act is on some level noticed, but it's assigned a probability-of-catastrophe level which ranks it as the 1000th item on the list, it's going to slip through if the AGI is allowed to do anything at all. Again, your intuition seems to be that there should be a few easy obvious things to check, whereas I'm nervous that any checking procedure strong enough to catch the disasters is going to produce a huge number of false positives because the AI will not internally contain the information and cognitive capacity required to tell the difference.
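
To put rough numbers on that prioritization problem, here is a toy Python sketch. The team size and per-checker capacity are the figures from the paragraph above; the proposal volume, flag rate, and catastrophe ranking are assumptions invented purely for illustration.

```python
# Toy back-of-the-envelope model of the prioritization problem above. The team
# size and per-checker capacity come from the text; the proposal volume, flag
# rate, and catastrophe ranking are assumptions invented for illustration.

def daily_review_capacity(num_checkers=10, detailed_each=4, quick_each=40):
    """Total items the human team can look at per day without getting sloppy."""
    return num_checkers * detailed_each + num_checkers * quick_each  # 40 + 400

# Suppose the AGI proposes 100,000 variable-changes per day and its imperfect
# predictor flags 2% of them as "a human might call this problematic".
proposals_per_day = 100_000   # assumption
flag_rate = 0.02              # assumption: dominated by false positives
flagged = int(proposals_per_day * flag_rate)   # 2,000 flagged items/day

capacity = daily_review_capacity()             # 440 reviews/day
print(f"Flagged per day: {flagged}, reviewable per day: {capacity}")

# If the one genuinely catastrophic act is ranked 1,000th by the AI's estimated
# probability of catastrophe, it never reaches the top of the review queue:
catastrophe_rank = 1000
print("Catastrophic act gets reviewed?", catastrophe_rank <= capacity)  # False
```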

> If we use such a learner appropriately, this seems like it can obtain behavior at least as good as if the agent had first been taught a measure of impact and then used that measure to avoid (or flag) high-impact consequences.

We differ in how much we think predictors can safely do automatically. My reason for wanting to think about low impact explicitly has two parts.

First, I'm concerned that for a realistic limited AGI of the sort we'll actually see in the real world, we will not want to amplify its intelligence up to the point where all learning can be taken for granted; we will want to use known algorithms; and therefore, considering something like 'low impact' explicitly and as part of machine learning may improve our chances of ending up with a low-impact AGI.
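
As a minimal sketch of what 'considering low impact explicitly' could look like at the plan-selection level (the penalty form, the placeholder measures, and all numbers below are assumptions for illustration, not a concrete proposal), the impact term is a named, inspectable part of the objective rather than something implicitly absorbed into a trained predictor:

```python
# Minimal sketch of an explicit impact penalty at plan-selection time.
# The penalty form and the placeholder measures are illustrative assumptions.

from typing import Callable, Dict, List

Plan = Dict[str, float]

def select_plan(plans: List[Plan],
                utility: Callable[[Plan], float],
                impact: Callable[[Plan], float],
                impact_penalty: float = 10.0) -> Plan:
    """Choose the plan maximizing utility minus an explicit, inspectable impact term."""
    return max(plans, key=lambda p: utility(p) - impact_penalty * impact(p))

# Placeholder measures purely for illustration:
utility = lambda p: p["task_score"]
impact = lambda p: p["irreversible_changes"]

plans = [{"task_score": 10.0, "irreversible_changes": 0.1},
         {"task_score": 12.0, "irreversible_changes": 5.0}]
print(select_plan(plans, utility, impact))
# -> the low-impact plan wins despite a slightly lower raw task score
```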

Second, if there turns out to be an understandable core to low impact, then by explicitly understanding this core we can decrease our nervousness about what a trained AGI might actually have been trained to do. By default we'd need to worry about an AGI blindly trained to flag possibly dangerous things learning some unknown, peculiar generalization of low impact that will, like a neural network being fooled by the right pattern of static, fail in some weird edge case the next time its option set expands. If we explicitly understand which generalization of low impact is being learned, that boosts our confidence (compared to the blind-training case) that the next expansion of options will not be fooled by the right kind of staticky image (under optimization pressure from a planning module trying to avoid dangerous impacts).
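
To make the staticky-image worry concrete, here is a toy sketch in which a planner filters its options through a blindly trained flagger that generalized 'dangerous' as a proxy feature. Every function, feature name, and number is invented for illustration; the point is only that the best-scoring unflagged option is exactly the edge case the proxy misses.

```python
# Toy illustration of a planner exploiting a blindly trained "possibly
# dangerous" flagger. The flagger is stipulated to have generalized
# "dangerous" as "energy-hungry", a proxy that held in its training data.

def objective(plan):
    """What the planner is actually optimizing (task performance)."""
    return plan["task_score"]

def true_danger(plan):
    """Ground truth the flagger was supposed to capture."""
    return plan["irreversible_changes"] > 50.0

def learned_flagger(plan):
    """Blindly trained predictor that learned a proxy feature."""
    return plan["energy_used"] > 100.0   # assumed learned threshold

def planner(candidates):
    """Picks the best-scoring plan among those the flagger does not flag."""
    safe_looking = [p for p in candidates if not learned_flagger(p)]
    return max(safe_looking, key=objective)

# In the 'normal' region, real impact tracks energy use, so the proxy works.
plans = [{"energy_used": e, "irreversible_changes": e / 4, "task_score": e / 10}
         for e in range(0, 201, 5)]

# The 'staticky image' found under optimization pressure: scores very well on
# the task, looks cheap to the proxy, but is hugely impactful in reality.
plans.append({"energy_used": 5.0, "irreversible_changes": 500.0,
              "task_score": 999.0})

chosen = planner(plans)
print("Flagged by learned predictor:", learned_flagger(chosen))   # False
print("Actually dangerous:", true_danger(chosen))                 # True
```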

This appears to me to go back to our central disagreement-generator about how much the programmers need to explicitly understand and consider. I worry that things which seem like 'predictions' in principle won't generalize well from previously labeled data, especially for things with reflective degrees of freedom, and double-especially for limited AGI systems of the sort we will actually see in practice in any endgame with a hope of ending well. Or in simpler terms: trying to rely on safety systems that we don't understand, generalized from labeled data without us fully understanding the generalization and its possible edge cases, is a nigh-inevitable recipe for disaster. Or in still simpler terms: you can't possibly get away with building a powerful AGI you understand that poorly.