Faithful simulation

The safe simulation problem is to start with some dynamical physical process $D$ which would, if run long enough in some specified environment, produce some trustworthy information of great value, and to compute some adequate simulation $S_D$ of $D$ faster than the physical process could have run. In this context, the term "adequate" is value-laden: it means that whatever we would use $D$ for, using $S_D$ instead produces within epsilon of the expected value we could have gotten from using the real $D$. More concretely, we might want to tell a Task AGI "upload this human and run them as a simulation", and we don't want some tiny systematic skew in how the Task AGI models serotonin to turn the human into a psychopath, which is a bad (value-destroying) simulation fault. Perfect simulation will be out of the question; the brain is almost certainly a chaotic system and hence we can't hope to produce exactly the same result as a biological brain. The question, then, is what kind of not-exactly-the-same result the simulation is allowed to produce.
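
One way to write the adequacy condition down, offered as a sketch rather than a settled definition: let $U$ be our utility function over outcomes and let $\pi$ range over the ways we might use the process. Both symbols are notation introduced here for illustration. On this reading, $S_D$ is adequate to within $\epsilon$ when

$$\left| \mathbb{E}\left[ U \mid \pi(S_D) \right] - \mathbb{E}\left[ U \mid \pi(D) \right] \right| \le \epsilon \quad \text{for every use } \pi \text{ we would actually consider.}$$

The value-ladenness enters through $U$ and through the set of uses $\pi$: two simulations that diverge equally from $D$ in physical terms can differ enormously in how much expected value they lose.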

As with "low impact" hopefully being lower-complexity than "low bad impact", we might hope to get an adequate simulation via some notion of faithful simulation, which rules out bumps in serotonin that turn the upload into a psychopath, while possibly also ruling out any number of other changes we wouldn't see as important; with this notion of "faithfulness" still being permissive enough to allow the simulation to take place at a level above individual quarks. On whatever computing power is available - possibly nanocomputers, if the brain was scanned via molecular nanotechnology - the upload must be runnable fast enough to make the simulation task worthwhile.

Since the main use for the notion of "faithful simulation" currently appears to be identifying a safe plan for uploading one or more humans as a pivotal act, we might also consider this problem in conjunction with the special case of wanting to avoid mindcrime. In other words, we'd like a criterion of faithful simulation which the AGI can compute without it needing to observe millions of hypothetical simulated brains for ten seconds apiece, which could constitute creating millions of people and killing them ten seconds later. We'd much prefer, e.g., a criterion of faithful simulation of individual neurons and synapses between them up to the level of, say, two interacting cortical columns, such that we could be confident that in aggregate the faithful simulation of the neurons would correspond to the faithful simulation of whole human brains. This way the AGI would not need to think about or simulate whole brains in order to verify that an uploading procedure would produce a faithful simulation, and mindcrime could be avoided.

Note that the notion of a "functional property" of the brain - seeing the neurons as computing something important, and not wanting to disturb the computation - is still value-laden. It involves regarding the brain as a means to a computational end, and what we see as the important computational end is value-laden, given that chaos guarantees the input-output relation won't be exactly the same. The brain can equally be seen as implicitly computing, say, the parity of the number of synapse activations; it's just that we don't see this functional property as a valuable one that we want to preserve.

To the extent that some notion of function must be invoked in defining which speedups count as faithful and permitted, we should hope that the AGI does not need to understand the high-level functional properties of the brain, or which details we thought were too important to simplify. It might be enough to understand a 'functional' model of individual neurons and synapses, with the resulting transform of the uploaded brain still allowing for a pivotal speedup and a knowably-faithful simulation of the larger brain.

At the same time, strictly local measures of faithfulness seem problematic if they can conceal systematic larger divergences. We might think that any perturbation of a simulated neuron which has as little effect as adding one phonon is "within thermal uncertainty" and therefore unimportant, but if all of these perturbations are pointing in the same direction relative to some larger functional property, the difference might be very significant. The same concern applies if all simulated synapses released slightly more serotonin, rather than releasing slightly more or less serotonin in no particular systematic pattern.
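
The scaling behind this concern can be illustrated with a toy calculation. The sketch below is not a model of neurons; it assumes each of `N` units receives a perturbation of the same small magnitude `DELTA` (both values are hypothetical, in arbitrary units) and compares the aggregate when the signs are independent against the aggregate when every sign agrees. Independent signs give a sum that grows roughly like the square root of `N`, while aligned signs give a sum that grows linearly in `N`.

```python
import random


def aggregate_drift(n_units, delta, systematic, seed=0):
    """Sum per-unit perturbations of magnitude `delta`.

    systematic=True:  every perturbation has the same (positive) sign.
    systematic=False: each sign is drawn independently and uniformly.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_units):
        sign = 1.0 if systematic else rng.choice((-1.0, 1.0))
        total += sign * delta
    return total


N = 10_000      # hypothetical number of simulated units
DELTA = 1e-3    # hypothetical per-unit perturbation, arbitrary units

unbiased = aggregate_drift(N, DELTA, systematic=False)
biased = aggregate_drift(N, DELTA, systematic=True)

print(f"independent signs: {unbiased:+.4f}")
print(f"aligned signs:     {biased:+.4f}")
```

Every individual perturbation is identical in size in both runs, so a purely local check that bounds each unit's error by `DELTA` passes both cases, even though the aggregates differ by about two orders of magnitude at this `N`.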

Comments

Paul Christiano

One natural standard: it should be hard to distinguish an adequate model from the system-to-be-modeled, based on input/output behavior alone.

How hard? Ideally we'd have an "equally competent" modeler and distinguisher, and ask the modeler to try to fool the distinguisher. This is a popular approach to generative modeling, and something I've talked about in the context of AI control (as has Jessica).

This definition runs into many subtleties, but I think it is a natural starting point for a discussion. In particular, we are already way beyond concerns like "the brain is almost certainly a chaotic system and hence we can't hope to produce exactly the same result as a biological brain."

The key property we want from the distinguisher is that it can learn to detect relevant differences between the model and the real system. This seems like it might be the kind of problem that I would classify as "probably easy if the agent is powerful and the difference is really important" and you would classify as "way too hard to count on."

You could also ask the model to output various intermediate results or to simulate requested measurements on the simulated brain, and give this extra information to the distinguisher. (Though I don't think this would really help.)