"In practice, Eliezer often ..."


by Paul Christiano Dec 28 2015 updated Dec 28 2015

In practice, Eliezer often invokes this concept in settings where there isn't yet an intelligent adversary (especially in order to argue that a particular design might lead to the appearance of an intelligent adversary). For example, he has repeatedly argued that a sophisticated search for an x maximizing f(x) would tend to produce outputs designed to influence the broader world rather than to influence the computation of f(x).

I think that this extension is itself an interesting and potentially important idea, but it is probably worth separating from the original concept. The methodology of "conservatively assume that your maximizers might maximize really well" is intuitive and pretty defensible. The methodology of "conservatively assume that whenever you use gradient descent to do something actually impressive, it may produce a malicious superintelligence" is considerably more speculative, and arguing about the extension shouldn't distract from its inoffensive little brother.

I guess the post comes out and says this here:

"The 'strain' on our design placed by it needing to run a smarter-than-human AI in a way that doesn't make it adversarial, is similar in many respects to the 'strain' from cryptography facing an intelligent adversary."

But none of the post seems to defend the stronger version.


Eliezer Yudkowsky

If you dump enough computing power into hill-climbing optimization, within a Turing-general policy space, using a consequentialist criterion for fitness, it totally spits out a superintelligence not aligned with the original selection criterion. That's where humans come from in the first place.

Paul Christiano

Superficially, there are two quite different concerns:

  1. You optimize a system for X. You are unhappy when X ends up optimized.
  2. You optimize a system for X. But instead you get a consequentialist that optimizes Y != X. You are unhappy when Y ends up optimized.

You seem to be claiming that these are the same issue or at least conceptually very closely related, but this post basically doesn't defend that claim. That is, you say several times that the goal of the game is not to build an adversary, and you defend that claim. But you say nothing about why that problem is analogous to working against an intelligent adversary; you just assert it.

I think that most serious people will agree that problem #2 is a problem: that a policy trained to optimize X can actually be optimizing some Y that happens to be correlated with X during training. They may even agree that this happens generically. But I don't think most people will agree that this problem is conceptually similar to the security problem in the way you are claiming.
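
A minimal sketch of concern #2 under toy assumptions, not a model of any real training setup. The names `target_x`, `proxy_y`, `train`, and `deploy` are hypothetical. Two rules agree on every training example because one feature happens to track the other there, so training accuracy alone cannot tell them apart; they come apart once that correlation breaks.

```python
# Toy illustration (all names hypothetical): a rule Y that matches the
# intended criterion X on the training data but diverges elsewhere.

def target_x(point):
    """Intended criterion X: the first coordinate is positive."""
    a, _b = point
    return a > 0

def proxy_y(point):
    """Proxy criterion Y: the second coordinate is positive."""
    _a, b = point
    return b > 0

def accuracy(rule, data):
    """Fraction of points on which `rule` agrees with the intended X."""
    return sum(rule(p) == target_x(p) for p in data) / len(data)

values = (-3, -2, -1, 1, 2, 3)
train = [(a, a) for a in values]    # b tracks a: X and Y are indistinguishable
deploy = [(a, -a) for a in values]  # correlation broken: X and Y disagree

train_acc_y = accuracy(proxy_y, train)    # 1.0
deploy_acc_y = accuracy(proxy_y, deploy)  # 0.0
```

A selection process that only sees `train` has no basis for preferring `target_x` over `proxy_y`. Whether this resembles the security problem is a separate question that the sketch does not address.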

Eliezer Yudkowsky

I'm not quite sure what claim of mine you're critiquing; can you spell out explicitly what you think I'm claiming, but haven't defended? My original statement was that FAI is hard for some of the same reasons cryptography is hard, even though the cases aren't exactly similar. You seem to be translating this into another claim that you think is implied or required, but can you spell that out a bit more? I mean, I've read what you originally typed so you can refer to that rather than restating it, I'm just not clear from having read it what you think Eliezer's claim requires or implies and why, or if you're saying that the original claim itself hasn't been defended.

E.g., in the first comment I'm not sure what "the stronger version" is that isn't being defended - stronger version of what? What is that stronger version? In your second comment you say "You seem to be claiming that these are the same issue or conceptually closely related", and I'm not sure where you think I'm claiming that.

"You optimize a system for X. You are unhappy when X ends up optimized." This shares mindset with cryptography because most ways of attempting this run into unforeseen optima and the process of trying to foresee these means pitting your own ability to search the option space against the AI's ability to do so. Likewise when you try to optimize X' where you think X' rules out some flaw, and the result is the nearest unblocked strategy, then to see this coming in advance and not just walk directly into the whirling razor blades involves a similar though not identical mindset to the cryptographer's mindset of asking how the adversary might react to your attempted safety precautions, without just going, "Dur, I guess there's nothing the other agent can do, I win."
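
A minimal sketch of the "unforeseen optima" point, assuming a made-up one-dimensional option space. The functions `intended` and `proxy` and the cutoff at 50 are hypothetical, chosen only so that the proxy matches the intended utility on the region the designer checked and diverges outside it.

```python
# Toy illustration (all names and numbers hypothetical): a proxy that
# looks fine under a weak search and fails under a stronger one.

def intended(x):
    """What the designer actually wants: x close to 3."""
    return -(x - 3) ** 2

def proxy(x):
    """The criterion actually optimized. It equals `intended` on the
    region the designer inspected (x < 50) and has an unforeseen
    flaw beyond it."""
    return intended(x) if x < 50 else x

# A weak optimizer only searches the region the designer foresaw.
weak_choice = max(range(0, 11), key=proxy)      # 3

# A stronger optimizer searches a much larger option space.
strong_choice = max(range(0, 1001), key=proxy)  # 1000

print(weak_choice, intended(weak_choice))
print(strong_choice, intended(strong_choice))
```

The proxy is unchanged between the two searches; only the amount of search differs. Foreseeing the second result requires the designer to search the option space at least as well as the optimizer does.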

"You optimize a system for X. But instead you get a consequentialist that optimizes Y != X. You are unhappy when Y ends up optimized." This is surely what happens by default when you "optimize a system for X" for most obvious ways of doing that, e.g., using a criterion X to label training examples drawn from your own conceptual defaults and then using supervised learning to find a system which correctly labels the training examples. Even in ordinary machine learning, if you train a system to play Go, you will get something that optimizes something related to Go; after some work, you can get the new target close enough to "winning at Go" that the new system actually wins at Go within the tight logical realm of the gameboard. In the rich real world, humans ended up with nothing resembling an explicit drive toward inclusive genetic fitness, need technical training to understand the criterion they were originally optimized on, and go "Whaaaa?" on seeing it with no particular feelings of sympathy.

This second class of problems most directly relates to cryptographic thinking insofar as sufficiently advanced systems optimizing Y may deliberately masquerade as optimizing X. The cryptographic mindset for "Okay, yes, you fixed a bunch of security holes you knew about, but that's not grounds for being certain you've verified the system is unbreakable by current smart adversaries; that takes a lot more work, and you need to write down your reasoning and show it to other people who can try to poke holes in it" shares mindset with the discipline we might need to handle "Okay, I nailed down some of the divergences I saw between X and Y, but that just means I don't know yet what's left; I need to write down my reasoning for why I think the remaining divergence between Y and X is narrow enough that when I dump a ton of computing power into Y I'll get something that has nearly the same expected utility as optimizing for the actual X."