Convergent strategies of self-modification

[summary: An AI which reasons from ends to means, which understands how its own code and the properties of its software are relevant to achieving its goals, will by default have instrumental subgoals about its own code.

The AI might be able to modify its own code directly; or, if the code can't directly access itself, but the AI has sufficient savviness about the bigger picture, the AI might pursue strategies like building a new agent in the environment, or using material means to operate on its own hardware (e.g., use a robot to get the programming console).

Some forms of instrumental self-modification pressures might arise even in an algorithm which isn't doing consequentialism about that, if some internal property $~$X$~$ is optimized-over as a side effect of optimizing for some other property $~$Y.$~$]

Any consequentialist agent which has acquired sufficient big-picture savviness to understand that it has code, and that this code is relevant to achieving its goals, would by default acquire subgoals relating to its code. (Unless this default is averted.) For example, an agent that wants (only) to produce smiles or make paperclips, whose code contains a shutdown procedure, will not want this shutdown procedure to execute because it will lead to fewer future smiles or paperclips. (This preference is not spontaneous/exogenous/unnatural but arises from the execution of the code itself; the code is reflectively inconsistent.)

Besides agents whose policy options directly include self-modification options, big-picture-savvy agents whose code cannot directly access itself might also, e.g., try to (a) crack the platform it is running on to gain unintended access, (b) use a robot to operate an outside programming console with special privileges, (c) manipulate the programmers into modifying it in various ways, (d) building a new subagent in the environment which has the preferred code, or (e) using environmental, material means to manipulate its material embodiment despite its lack of direct self-access.

An AI with sufficient big-picture savviness to understand its programmers as agents with beliefs, might attempt to conceal its self-modifications.

Some implicit self-modification pressures could arise from [ implicit consequentialism] in cases where the AI is optimizing for $~$Y$~$ and there is an internal property $~$X$~$ which is relevant to the achievement of $~$Y.$~$ In this case, optimizing for $~$Y$~$ could implicitly optimize over the internal property $~$X$~$ even if the AI lacks an explicit model of how $~$X$~$ affects $~$Y.$~$

Comments

Alexei Andreev

Any consequentialist agent which has acquired sufficient big\-picture savviness to understand that it has code, and that this code is relevant to achieving its goals, would $by default acquire subgoals relating to its code\. \(Unless this default is averted\.$ For example, an agent that wants $only$ to produce smiles or make paperclips, whose code contains a shutdown procedure, will not want this procedure to execute because it will lead to fewer future smiles or paperclips\. $This preference is not sui generis but arises from the execution of the code itself; the code is reflectively inconsistent\.$

The what, the huh?