"One natural standard: it sh..."

https://arbital.com/p/37b

by Paul Christiano Apr 15 2016


One natural standard: it should be hard to distinguish an adequate model from the system-to-be-modeled, based on input/output behavior alone.

How hard? Ideally we'd have an "equally competent" modeler and distinguisher, and ask the modeler to try to fool the distinguisher. This is a popular approach to generative modeling, and something I've talked about in the context of AI control (as has Jessica).
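As a rough illustration of this standard (a toy sketch, not something from the original discussion), the game below flips a fair coin, shows a distinguisher one input/output pair from either the real system or a model, and measures how often the distinguisher names the source correctly. Everything in it is an assumption made for the example: the noisy linear `real_system`, the two candidate models, and the fixed hand-written `distinguisher`, which is not learned and is not "equally competent" in any serious sense.

```python
import random

def real_system(x, rng):
    # Hypothetical system-to-be-modeled: a noisy linear response.
    return 2.0 * x + rng.gauss(0.0, 1.0)

def good_model(x, rng):
    # Assumed model with the same input/output distribution as the real system.
    return 2.0 * x + rng.gauss(0.0, 1.0)

def poor_model(x, rng):
    # Assumed model with the right mean but far too little noise.
    return 2.0 * x + rng.gauss(0.0, 0.1)

def distinguisher(x, y):
    # Fixed rule (an assumption, not a trained network):
    # guess "real" when the output is far from the noiseless response.
    return abs(y - 2.0 * x) > 0.3

def distinguisher_accuracy(model, trials=20000, seed=0):
    """Fraction of trials in which the distinguisher names the source correctly."""
    rng = random.Random(seed)
    correct = 0
    for _ in range(trials):
        x = rng.uniform(-1.0, 1.0)
        is_real = rng.random() < 0.5          # fair coin: real system or model
        y = real_system(x, rng) if is_real else model(x, rng)
        correct += (distinguisher(x, y) == is_real)
    return correct / trials

if __name__ == "__main__":
    print("good model:", distinguisher_accuracy(good_model))
    print("poor model:", distinguisher_accuracy(poor_model))
```

Accuracy near 0.5 means the model passes this particular test, and accuracy well above 0.5 means it fails. In the adversarial setup described above, the distinguisher would be trained rather than fixed, and the modeler would be trained against it.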

This definition runs into many subtleties, but I think it is a natural starting point for a discussion. In particular, we are already way beyond concerns like "the brain is almost certainly a chaotic system and hence we can't hope to produce exactly the same result as a biological brain."

The key property we want from the distinguisher is that it can learn to detect relevant differences between the model and the real system. This seems like it might be the kind of problem that I would classify as "probably easy if the agent is powerful and the difference is really important" and you would classify as "way too hard to count on."

You could also ask the model to output various intermediate results or to simulate requested measurements on the simulated brain, and give this extra information to the distinguisher. (Though I don't think this would really help.)


Comments

Eliezer Yudkowsky

The key property we want from the distinguisher is that it can learn to detect relevant differences between the model and the real system. This seems like it might be the kind of problem that I would classify as "probably easy if the agent is powerful and the difference is really important" and you would classify as "way too hard to count on."

Counting on things before you've found a solution to them isn't a very good mindset, but I do consider this a promising approach. The generative-adversarial approach in modern neural networks definitely gives me hope that this is the sort of thing that actually works in practice. So I might not be as pessimistic as you think? In general I still think that one does not go about taking things for granted, but faithful simulation seems like a notion that could prove to have a tractable core after hammering on it for a bit. It also seems very possible that if you're reasonably smart and you can't detect any expected differences in the behavior of neural columns, then the corresponding human simulation is faithful.

My current thoughts on possible failure modes:

  1. "No differences you know about" might mix up the map and the territory in some obscurely fatal way that leads to the equivalent of the AI deliberately managing to 'not know' about inconvenient divergences.
  2. If we use a limited AI and don't let it run thousands of simulations of people that it can compare to thousands of brains in vats, then in practice its column-level tests won't detect cumulative neural-level differences that lead to an 80% probability of schizophrenia.
  3. The adversarial approach as written won't work because it will turn out that it's always possible for an equally smart adversary to tell the difference, especially for simulations that can be computed at a worthwhile speedup. Which means this test won't meaningfully discriminate in the region of intuitively faithful vs. nonfaithful simulations. (This strikes me as the sort of issue that's repairable, but perhaps not trivially so.)

Paul Christiano

Methodologically, I am trying to anticipate which problems are hard or easy, in order to understand which approaches may or may not work and what the key difficulties are. I wouldn't describe this as "taking things for granted"; I think we are probably miscommunicating.

it's always possible for an equally smart adversary to tell the difference

This is a big problem; I think it's the more real version of "perfect simulation will be out of the question." Note that this is only a concern for some processes (e.g. if the simulation output is one bit, then you don't have this problem).

(Note that in practice generative adversarial models are extremely finicky to train, at least partly for this reason.)
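One way to read the one-bit remark above, offered as an interpretation rather than as the argument intended in the comment: if the output is a single bit, a distinguisher can only base its guess on that bit, so its best possible accuracy is fixed by the gap between the two output probabilities. The sketch below checks this by brute force over all four deterministic guessing rules. It assumes a fair coin between real system and model, ignores inputs and computational limits, and the name `best_one_bit_accuracy` is hypothetical.

```python
from itertools import product

def best_one_bit_accuracy(p_real, p_model):
    """Best accuracy of any deterministic rule that sees one output bit.

    p_real and p_model are the probabilities that the real system and the
    model, respectively, output 1. The source is chosen by a fair coin.
    """
    best = 0.0
    # A rule is a pair of guesses: what to say on seeing 0, and on seeing 1.
    # True means "guess real", False means "guess model".
    for guess_on_0, guess_on_1 in product([True, False], repeat=2):
        accuracy = 0.0
        for bit, guess_real in ((0, guess_on_0), (1, guess_on_1)):
            pr_bit_if_real = p_real if bit == 1 else 1.0 - p_real
            pr_bit_if_model = p_model if bit == 1 else 1.0 - p_model
            # The guess is correct with the probability mass of the named source.
            accuracy += 0.5 * (pr_bit_if_real if guess_real else pr_bit_if_model)
        best = max(best, accuracy)
    return best

if __name__ == "__main__":
    print(best_one_bit_accuracy(0.7, 0.6))
    print(best_one_bit_accuracy(0.3, 0.3))
```

The result equals 0.5 + |p_real - p_model| / 2, so a model that matches the single probability closely leaves the distinguisher with almost no advantage. With high-dimensional outputs there are many more statistics a distinguisher could exploit, which is where the concern comes back.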

I think the other big problem is the complementary one: even an equally smart adversary can't reliably distinguish a crappy simulation from a good simulation. (A dumb example: no distinguisher can detect a steganographically encoded message, even though its presence implies the simulation was poor.)
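A minimal sketch of the steganography example, under strong simplifying assumptions that are not in the original comment: suppose the real system's outputs look like uniformly random bits, and a "simulation" instead emits a hidden message XORed with a secret key. If the key bits were truly uniform and independent, the emitted bits would also be uniform, so input/output behavior alone would give a distinguisher without the key nothing to work with, while a key holder reads the message out exactly. The functions `embed` and `extract` are hypothetical names, and Python's `random` module stands in for a key source here; it is not cryptographically secure.

```python
import random

def embed(message_bits, key_bits):
    # XOR each message bit with the corresponding secret key bit (one-time pad).
    return [m ^ k for m, k in zip(message_bits, key_bits)]

def extract(output_bits, key_bits):
    # XOR with the same key bits undoes the embedding.
    return [o ^ k for o, k in zip(output_bits, key_bits)]

rng = random.Random(0)                       # stand-in key source (assumption)
message = [1, 0, 1, 1, 0, 0, 1, 0]           # arbitrary hidden message
key = [rng.randrange(2) for _ in message]    # one key bit per message bit
outputs = embed(message, key)                # what the "simulation" emits
recovered = extract(outputs, key)            # what a key holder reads back

if __name__ == "__main__":
    print("outputs:  ", outputs)
    print("recovered:", recovered)
```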