Safe impact measure

A safe impact measure is one that captures all changes to every variable a human might care about, with no edge cases where a lot of value could be destroyed by a 'low impact' action. A safe impact measure must also not generate so many false alarms of 'high impact' that no strategy can be distinguished as 'low impact'.

Importance / uses

Finding a safe impact measure is an open problem of value alignment. If solved, such a measure might be used in constructing:

A low-impact AI, a subspecies of Genie that tries to carry out its goals while otherwise minimizing the number of important things that it affects.
An Oracle that tries not to affect the world outside its box at all, apart from its output.
A shutdown utility function that incentivizes a corrigible AI to halt safely (perhaps after a button is pressed).

Some uses of a safe impact measure postulate that the impact measure has a 'hole' or some form of permitted output channel. For example, one Stuart Armstrong proposal involves an Oracle in a sealed box, with a single output channel that is connected to the Oracle with high quantum probability, and connected to a quantum noise generator with low quantum probability. The Oracle is putatively given the goal of generating an output signal with some informational property, and otherwise minimizing the 'impact' relative to the same output having been generated by the quantum noise generator instead of the Oracle. This is intended to capture the idea that the Oracle's effect on the world should only occur via the effect of the output message, and not take advantage of any side channels such as managing to send a radio signal outside the Oracle box.

Difficulty

To be used inside an advanced agent, an impact measure must be safe in the face of whatever cognitive pressures and optimization pressures might tend to produce edge instantiations or nearest unblocked strategies. It must capture so much variance that there is no clever strategy whereby an advanced agent can produce some special type of variance that evades the measure. Ideally, the measure will pass the Omni Test, meaning that even if the agent suddenly gained perfect control over every particle in the universe, there would still be no way for it to have what intuitively seems like a 'large influence' on the future without that strategy being assessed as having a 'high impact'.

The reason why a safe impact measure might be possible, and specifiable to an AI without having to solve the entire value learning problem for complex values, is that it may be possible to upper-bound the value-laden and complex quantity 'impact on literally everything cared about' by some much simpler quantity that says roughly 'impact on everything' - all causal processes worth modeling on a macroscale, or something along those lines.

The challenge of a safe impact measure is that we can't just measure, e.g., 'number of particles influenced in any way' or 'expected shift in all particles in the universe'. For the former case, consider that a one-gram mass on Earth exerts a gravitational pull that accelerates the Moon toward it at roughly 4 x 10^-31 m/s^2, and every sneeze has a very slight gravitational effect on the atoms in distant galaxies. Since every decision qualitatively 'affects' everything in its future light cone, this measure will produce too many false positives: it will not approve any strategy, and so will not usefully discriminate unusually dangerous actions.
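The order of magnitude of the one-gram figure can be checked with Newton's law of gravitation, a = G m / r^2. The sketch below is only an illustration: the constant values (G and a mean Earth-Moon distance of about 3.844 x 10^8 m) and the function name are assumptions introduced here, not part of the original argument. With these inputs the result comes out near 4.5 x 10^-31 m/s^2, consistent with the rough figure quoted above.

```python
# Assumed constants (not from the article): standard textbook values.
G = 6.674e-11                    # gravitational constant, m^3 kg^-1 s^-2
EARTH_MOON_DISTANCE_M = 3.844e8  # mean Earth-Moon distance, metres


def gravitational_acceleration(mass_kg: float, distance_m: float) -> float:
    """Acceleration induced on a distant body by a point mass (a = G*m/r^2)."""
    return G * mass_kg / distance_m ** 2


# One gram on Earth, acting on the Moon.
accel = gravitational_acceleration(1e-3, EARTH_MOON_DISTANCE_M)
print(f"{accel:.1e} m/s^2")
```

The value is absurdly small but strictly nonzero, which is the point: a measure that only asks whether a particle is influenced at all cannot tell a sneeze from a catastrophe.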

For the latter proposed quantity, 'expected shift in all particles in the universe': If the universe (including the Earth) contains at least one process chaotic enough to exhibit butterfly effects, then any sneeze anywhere ends up producing a very great expected shift in total motions. Again we must worry that the impact measure, as evaluated inside the mind of a superintelligence, would just assign uniformly high values to every strategy, meaning that unusually dangerous actions would not be discriminated for alarms or vetoes.
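The butterfly-effect worry can be illustrated with a toy chaotic system. The sketch below uses the logistic map with r = 4 as a stand-in for 'some chaotic process in the universe'; the choice of map, the starting point, and the size of the perturbation are all assumptions made for illustration, and nothing here models real physics. Two trajectories that start a trillionth apart are tracked, and the gap between them is recorded at each step.

```python
def logistic_step(x: float, r: float = 4.0) -> float:
    """One iteration of the logistic map, which is chaotic at r = 4."""
    return r * x * (1.0 - x)


def separations(x0: float, perturbation: float, steps: int) -> list:
    """Gap between a baseline and a slightly perturbed trajectory at each step."""
    a, b = x0, x0 + perturbation
    gaps = []
    for _ in range(steps):
        a, b = logistic_step(a), logistic_step(b)
        gaps.append(abs(a - b))
    return gaps


# Hypothetical 'sneeze': a perturbation of one part in 10^12.
gaps = separations(0.3, 1e-12, 100)
print(f"gap after 1 step: {gaps[0]:.1e}")
print(f"largest gap within 100 steps: {max(gaps):.2f}")
```

In this toy setting the tiny perturbation grows until the two trajectories are effectively unrelated, so a measure based on expected total shift would rate the 'sneeze' as high impact, just as it would rate almost anything else.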

Despite the first imaginable proposals failing, it doesn't seem like a 'safe impact measure' necessarily has the type of value-loading that would make it VA-complete. One intuition pump for 'notice big effects in general' not being value-laden is that if we imagine aliens with nonhuman decision systems trying to solve this problem, it seems easy to imagine that the aliens would come up with a safe impact measure that we would also regard as safe.