"I think we have a foundatio..."


by Eliezer Yudkowsky Mar 10 2016

I think we have a foundational disagreement here about the extent to which saying "Oh, the AI will just predict that by modeling humans" solves all these issues, versus sweeping the same unsolved issues under the rug into whatever is supposed to be modeling the humans.

Let's say you have a schmuck human who hasn't studied Pascal's Mugging. They build a Solomonoff-like prior into their AI, along with an aggregative utility function, both of which seem to them like reasonable approximate models of how humans behave. The AI seems to behave reasonably during the training phase, but once it's powerful enough, it gets Pascal's Mugged into weird edge-case behavior.
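To make the failure mode concrete, here is a toy sketch of how a complexity-penalized prior plus an aggregative utility function can look fine on ordinary cases and then flip on the edge case. Everything in it is an illustrative assumption: a real Solomonoff prior is uncomputable, the `description_bits` figures are made up, and the function names are hypothetical. The only point is that the prior penalty grows with the length of the claim's description, while the claimed stakes can grow far faster than that.

```python
import math

# Toy model, NOT a real Solomonoff prior (which is uncomputable).
# Assumption: a claim's prior probability is 2^-(description length in bits).
def log2_prior(description_bits: int) -> float:
    return -float(description_bits)

# Aggregative utility: value is linear in the number of lives at stake,
# so log2(expected lives) = log2(prior) + log2(lives claimed).
def log2_expected_lives(description_bits: int, log2_lives: float) -> float:
    return log2_prior(description_bits) + log2_lives

# The agent pays up if expected lives saved exceed the sure cost
# (here the cost is normalized to one life, i.e. log2 cost = 0).
def prefers_paying(description_bits: int, log2_lives: float,
                   log2_sure_cost: float = 0.0) -> bool:
    return log2_expected_lives(description_bits, log2_lives) > log2_sure_cost

# "Training phase" case: a stranger claims 10 lives are at stake.
# Made-up description length of 100 bits; the prior penalty dominates.
ordinary = prefers_paying(description_bits=100, log2_lives=math.log2(10))

# Edge case: the mugger claims 2^(2^20) lives. The claim takes only
# slightly longer to state (made-up 120 bits), but the stakes swamp the prior.
mugging = prefers_paying(description_bits=120, log2_lives=2.0 ** 20)

print(ordinary, mugging)
```

Under these assumptions the agent refuses the ordinary claim and pays the mugger, even though nothing about its behavior on ordinary-sized claims would have flagged the problem.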

When I imagine trying to use a 'predict human acts' system, I worry about what would happen to the equivalent schmuck unless we have strong transparency into the system internals and already know about the Pascal's Mugging problem. The system would generalize something a lot like consequentialism and aggregative ethics as the most compact predictor of the acts that the humans approved or produced after a lot of reflection, and then the generalization would break down later on the same edge case.

Some of this probably reflects the degree to which you're imagining using an act-based agent that is a strong superintelligence with access to brain scans which is hence relatively epistemically efficient on every prediction, while I'm imagining trying to use something that isn't yet that smart (because we can't let it FOOM up to superintelligence, because we don't fully trust it, or because there's a chicken-and-egg problem with requiring trustworthy predictions to bootstrap in a trustworthy way).

You also seem to be imagining that the problem of corrigibility has otherwise already been solved, or is maybe being solved via some other predictive thing, whereas I'm treating generalization failures that can kill you before you have time to register or spot the prediction failure as being failures indeed. You seem to assume there's a mature corrigibility system which catches that.

I'm not sure this is the right page to have this discussion; we should probably be talking about this inside the act-based system pages.