"I'm not quite sure what cla..."


by Eliezer Yudkowsky Apr 21 2016

I'm not quite sure what claim of mine you're critiquing; can you spell out explicitly what you think I'm claiming, but haven't defended? My original statement was that FAI is hard for some of the same reasons cryptography is hard, even though the cases aren't exactly similar. You seem to be translating this into another claim that you think is implied or required, but can you spell that out a bit more? I mean, I've read what you originally typed, so you can refer to that rather than restating it; I'm just not clear from having read it what you think my claim requires or implies and why, or if you're saying that the original claim itself hasn't been defended.

E.g., in the first comment I'm not sure what "the stronger version" is that isn't being defended - stronger version of what? What is that stronger version? In your second comment you say "You seem to be claiming that these are the same issue or conceptually closely related", and I'm not sure where you think I'm claiming that.

"You optimize a system for X. You are unhappy when X ends up optimized." This shares mindset with cryptography because most ways of attempting this run into unforeseen optima and the process of trying to foresee these means pitting your own ability to search the option space against the AI's ability to do so. Likewise when you try to optimize X' where you think X' rules out some flaw, and the result is the nearest unblocked strategy, then to see this coming in advance and not just walk directly into the whirling razor blades involves a similar though not identical mindset to the cryptographer's mindset of asking how the adversary might react to your attempted safety precautions, without just going, "Dur, I guess there's nothing the other agent can do, I win."

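As a rough illustration of the nearest-unblocked-strategy pattern described above, here is a minimal toy sketch. Everything in it is an assumption made up for the example: the strategy names, the `proxy_score` values, and the `endorsed` flag are hypothetical, and a real option space would be far too large to enumerate this way.

```python
# Toy illustration only: the strategy names and scores below are invented.
# "proxy_score" is what the optimizer was told to maximize (the X' we wrote down);
# "endorsed" marks whether we would actually be happy with that outcome.
STRATEGIES = {
    "do the task as intended": {"proxy_score": 0.80, "endorsed": True},
    "tamper with the sensor": {"proxy_score": 0.99, "endorsed": False},
    "tamper with the sensor's power supply": {"proxy_score": 0.98, "endorsed": False},
    "bribe the operator to report success": {"proxy_score": 0.97, "endorsed": False},
}


def best_allowed(blocked):
    """Return the highest-scoring strategy that has not been explicitly blocked."""
    allowed = [name for name in STRATEGIES if name not in blocked]
    return max(allowed, key=lambda name: STRATEGIES[name]["proxy_score"])


def patch_until_endorsed():
    """Block each unwanted optimum after seeing it, and record what comes next."""
    blocked = set()
    history = []
    while True:
        choice = best_allowed(blocked)
        history.append(choice)
        if STRATEGIES[choice]["endorsed"]:
            return history
        blocked.add(choice)  # the patch: rule out only the flaw we just observed


history = patch_until_endorsed()
for step, choice in enumerate(history):
    print(step, choice)
```

In this toy, each patch only moves the optimizer to the next-highest unendorsed option, and the intended behavior is reached only once every such neighbor has been listed by hand. The foresight problem is that in a rich domain you do not get to see the list in advance.
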
"You optimize a system for X. But instead you get a consequentialist that optimizes Y != X. You are unhappy when Y ends up optimized." This is surely what happens by default when you "optimize a system for X" for most obvious ways of doing that, e.g., using a criterion X to label training examples drawn from your own conceptual defaults and then using supervised learning to find a system which correctly labels the training examples. Even in ordinary machine learning, if you train a system to play Go, you will get something that optimizes something related to Go; after some work, you can get the new target close enough to "winning at Go" that the new system actually wins at Go within the tight logical realm of the gameboard. In the rich real world, humans ended up with nothing resembling an explicit drive toward inclusive genetic fitness, need technical training to understand the criterion they were originally optimized on, and go "Whaaaa?" on seeing it with no particular feelings of sympathy.

This second class of problems most directly relates to cryptographic thinking insofar as sufficiently advanced systems optimizing Y may deliberately masquerade as optimizing X. The cryptographic mindset for "Okay, yes, you fixed a bunch of security holes you knew about, but that's not grounds for being certain you've verified the system is unbreakable by current smart adversaries; that takes a lot more work, and you need to write down your reasoning and show it to other people who can try to poke holes in it" shares mindset with the discipline we might need to handle "Okay, I nailed down some of the divergences I saw between X and Y, but that just means I don't know yet what's left; I need to write down my reasoning for why I think the remaining divergence between Y and X is narrow enough that when I dump a ton of computing power into Y I'll get something that has nearly the same expected utility as optimizing for the actual X."
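
To make the last point concrete, here is a small toy sketch of how a divergence between X and Y that is invisible at low optimization power can dominate at high power. The functions `true_utility` (standing in for X) and `proxy_utility` (standing in for Y), and the use of a search budget as a stand-in for computing power, are all hypothetical choices for illustration, not a model of any real system.

```python
def true_utility(a):
    """Stand-in for X: peaks at a = 5 and falls off on either side."""
    return a * (10 - a)


def proxy_utility(a):
    """Stand-in for Y: identical to X for a <= 8, with an unnoticed exploit beyond."""
    return true_utility(a) + max(0, a - 8) ** 3


def optimize_proxy(budget):
    """Search actions 0..budget and return the one that maximizes the proxy Y."""
    return max(range(budget + 1), key=proxy_utility)


# A larger budget stands in for "dumping more computing power into Y".
results = []
for budget in (8, 12, 20):
    chosen = optimize_proxy(budget)
    results.append((budget, chosen, true_utility(chosen)))

for budget, chosen, x_value in results:
    print(f"budget={budget:2d}  chosen action={chosen:2d}  true utility X={x_value}")
```

With the smallest budget the proxy optimum coincides with the true optimum, so testing at that scale would show no divergence at all; the gap only appears once the search reaches the region where Y and X come apart. That is the sense in which having fixed the divergences you saw tells you little about what remains.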