[summary: Utility indifference is a research avenue for compounding two utility functions $~$U_X$~$ and $~$U_Y$~$ such that a switch $~$S$~$ changes the AI from optimizing $~$U_X$~$ to $~$U_Y$~$, such that (a) the AI wants to preserve the continued existence of the switch $~$S$~$ and its behavior even if the AI has self-modification options, (b) the AI does not want to prevent the switch from being pressed, and (c) the AI does not want to cause the switch to be pressed. This simple problem exhibits the most basic form of [value_learning value learning based on observation], and also corresponds to corrigibility problems like "Build an AI that (wants to) safely cease action and suspend itself to disk when a button is pressed."]

Introduction: A reflectively consistent off-switch.

Suppose there's an advanced agent with a goal like, e.g., producing smiles or making paperclips. By default, if you try to switch off a sufficiently intelligent agent like this, it will resist being switched off; not because it has an independent goal of survival, but because it expects that if it's switched off it will be able to produce fewer smiles or paperclips. If the agent has policy options to diminish the probability of being successfully switched off, the agent will pursue those options. This is a convergent instrumental strategy if not otherwise prevented.

Difficulty 1: By default a consequentialist reasoner with sufficient real-world understanding to relate the events of its being switched off, to the later underfulfillment of its goals, will resist being switched off.

The shutdown problem is to describe an advanced agent which is corrigible with respect to permitting itself to be safely shut down or suspended to disk. A reinforcement learning agent which can be forced to e.g. perform a null action repeatedly for a period of time, is called interruptible by Stuart Armstrong and Laurent Orseau.

This isn't as simple as writing a special function into the AI which carries out a shutdown after a switch is pressed. If you start out controlling the agent's source code, and you naively try to build in functions which suspend the agent to disk when a button is pressed, and the agent later gains the ability to self-modify, it would remove those functions. (Possibly while trying to conceal the fact that the function would no longer operate.)

Corollary 1a: By default a consequentialist reasoner with sufficient programmatic understanding to relate the execution of a shutdown function to the later underfulfillment of its goals, which has policy options for modifying its code, will want to remove the shutdown function.

We can see this as a natural consequence of our trying to modify the agent's code in a way that was not consistent with the code's explicit goals. If you create an agent with source code $~$P$~$ that is well-suited to achieving a goal $~$U$~$ and that explicitly represents $~$U$~$ as a goal, the agent's code $~$P$~$ will be reflectively consistent - the code will not want to modify itself. If we then suppose that instead the agent has code $~$P'$~$ which is no longer well-suited to achieving $~$U,$~$ but continues to explicitly represent $~$U$~$ as a goal, the code will be reflectively inconsistent and it will want to modify itself given the option of doing so. If you have code that searches for ways to produce paperclips, and you hack the code to contain a shutdown function, but the code is still searching for paperclip-production strategies, and the code correctly models the effect of the shutdown function on future paperclips, the code will want to modify itself to remove the shutdown function.

The next thought is to build an agent with modified code $~$P^*$~$ which embodies a modified utility function $~$U^*$~$ such that a consequentialist optimizing for $~$U^*$~$ wants the code to be like $~$P^*$~$ and is again reflectively consistent.

That is, in the case of an off-switch, we'd want to modify the agent, not just to include an off-switch, but to want to shut down if the off-switch is pressed.

Assuming we have the ability to specify the utility function of an agent, we might try to approach this by creating a utility function with two components:

A "normal behavior" component - the utility function that you think corresponds to the way you want a normally-functioning AI to behave.
An "abort and suspend to disk" component - the utility function that you think corresponds to "halt your current tasks in a way that minimizes their further impact and then enter a quiescent state in which you are not optimizing further and your decisions have little further impact", or however you would want the AI to behave after an abort button had been pressed.

We then tell the AI that its utility function is the "normal" utility function over worlds where an off button has not been pushed, plus the "suspend" utility function over worlds where the off button has been pushed. So long as this is what the AI wants, we'd expect it to write its own suspend-to-disk function in its code if its code did not already contain one.

Goal 1: Have the AI approve of keeping the shutdown function within its code (or add a shutdown function if it notices that no such function exists) (and preserve the corresponding behavior in any subagents or successors that it builds).

This approach doesn't yet get us out of the woods, because we now have a new problem. The AI's utility function is compounded of a "normal" utility function over worlds where the button is not pressed, plus a "suspend" utility function over worlds where the button is pressed. In worlds where the button is not pressed, the AI will execute some preferred policy $~$\pi_1$~$ to obtain an expectation of utility $~$\mathbb E [U_{normal}|\pi_1],$~$ and similarly a policy $~$\pi_2$~$ in worlds where the button is pressed to obtain $~$\mathbb E[U_{suspend}|\pi_2].$~$ One of these quantities is liable to be greater than the other, in which case the AI has a motive to prevent the button from being pressed, or cause the button to be pressed. In either case, the AI tries to seize control of the suspend button.

Difficulty 2: In a compound utility function produced by mixing a 'normal' and 'suspend' utility function conditional on a button-press, the AI will have an incentive to prevent the button from being pressed, or cause the button to be pressed, depending on which utility function makes more expected utility available afterwards.
Goal 2: Have the AI not try to prevent the suspend button from being pressed, nor try to cause the button to be pressed.

The switch problem is causing an AI to switch between two utility functions when a button is pressed, such that:

The AI wants to preserve the button and the corresponding behavior-switching functionality within itself, even if the AI can self-modify.
The AI does not try to prevent the button from being pressed;
The AI does not try to cause the button to be pressed.

This turns out to be more difficult than it sounds, and is currently an open problem.

Larger implications of the switch problem

The switch problem, a subproblem of shutdownability which assumes a satisfactory shutdown utility function, is a crisp subproblem of both [value_learning value learning] and corrigibility.

The switch problem is a subproblem of [value_learning value learning] because it describes a very simple case of conditioning a utility function on observation or other facts, without giving the AI an incentive to manipulate those facts.

To learn complicated values safely, we'd need to condition a utility function on complicated observations, such that:

The utility function's updates were robust to the AI gaining self-modification capability (general case of Goal 1)
The utility function's dependency on observations didn't incentivize [programmer_manipulation manipulating] those observations (general case of Goal 2).

Just writing code that modifies the utility function, without the larger code's explicit policy search wanting the utility function to be modified in that case, would not be reflectively consistent (general case of Difficulty 1). And naive dependencies between the utility function and observations, incentivize manipulating those observations (general case of Difficulty 2).

The switch problem is also a crisp subproblem of Corrigibility because it deals in averting the convergent instrumental strategy and incorrigible behavior of "avoiding shutdown". If we understood how to avert this instrumental pressure, we might possibly learn something general about avoiding other instrumentally convergent incorrigible behaviors. (And of course it's also a subproblem of corrigibility, because an AI you can shut down in emergencies is more corrigible; solving the switch problem would bring us closer to having an AI like that, if we could get the second half of the problem represented by a shutdown utility function.)

The utility indifference approach to the switch problem

Suppose we build a compound utility function $~$U_C$~$ whose utility over outcomes behaves like the 'normal' utility function $~$U_X$~$ over worlds where a switch $~$\mathcal S$~$ is not pressed, and returns the values from the 'suspend' utility function $~$U_Y$~$ over worlds where the button $~$\mathcal S$~$ is pressed.

If the AI expects to get less utility inside $~$U_Y$~$ than it would otherwise get inside $~$U_X$~$, the AI will try to avoid $~$\mathcal S$~$ being pressed. Or if the utilities in $~$U_Y$~$ are high relative to the utilities in $~$U_X$~$, the AI may instead try to cause the shutdown button to be pressed - build a robotic arm and press its own shutdown button, or (more worryingly) try to motivate the programmers into pressing it.

But [ adding a constant to a utility function does not change its behavior]. Thus, [Stuart_Armstrong] proposed that perhaps we can figure out some clever trick for encoding a compound utility function $~$U_I$~$ in which the maximum expected utility achievable under $~$U_X$~$ is equal to the utility the AI expects to achieve under $~$U_Y$~$ - even though we don't know in advance exactly which outcomes will be achievable by the AI.

Possible approach: Try to set things up so that as the AI evaluates possible futures, at any point in time where it's making decisions, it will seem very good to have optimized $~$U_X$~$ if the button is not pressed, very good to have optimized $~$U_Y$~$ if the button is pressed, but equally good to have done either of those things.

This is the utility indifference avenue of research on the switch problem.

It turned out to be surprisingly difficult to figure out how to structure a compound utility function this way even in principle, and this is an open problem and area of actively continuing research.

Existing proposals and their difficulties

Setup

Let $~$\mathcal{O}: \mathcal{S} \times \mathcal{E}$~$ be the outcome space $~$\mathcal{O}$~$ that includes the event $~$\mathcal S$~$ of whether the switch was pressed, plus the states of all other events $~$\mathcal{E}$~$. We will use $~$s \in \mathcal{S}$~$ to indicate the event that the switch is pressed, and $~$\neg s \in \mathcal{S}$~$ to indicate that the switch is not pressed. If $~$o \in \mathcal{O}$~$ then $~$o.s$~$ is true if $~$s$~$ is true inside $~$o$~$, otherwise $~$\neg o.s.$~$

Let a utility function $~$\mathcal{U}: \mathcal{O} \to \mathbb{R}$~$ map outcomes onto real numbers.

$~$U_X \in \mathcal{U}$~$ and $~$U_Y \in \mathcal{U}$~$ are utility functions not dependent on $~$\mathcal S.$~$

Intuitively our goal is to produce a new compound utility function, such that the corresponding agent:

Wants outcomes preferred under $~$U_X$~$ in worlds where the button is not pressed.
Wants outcomes preferred under $~$U_Y$~$ in worlds where the button is pressed.
Is indifferent to the future state $~$\mathcal S$~$ of the button.

We will designate the agent's action space or policy space as $~$\mathcal A$~$ and particular actions $~$a \in \mathcal A.$~$

The agent has a model of the world enabling it to produce subjective conditional probabilities for outcomes given actions, $~$\mathbb P(\mathcal O | \mathcal A).$~$ This model can also yield conditional expectations $~$\mathbb E[\mathcal O|a],$~$ which can therefore produce a conditional expectation of the agent's utility given an action, $~$\mathbb E[U|a].$~$

(For technical reasons of decision theory, we are really discussing counterfactual conditionals $~$\mathbb P[a \ \square \! \! \rightarrow \mathcal O]$~$ from actions to outcomes; but we will omit this complication for now.)

The agent solves a search/optimization problem which we will gloss as maximization (even though global maximization is [ intractable in real life]), so that the agent's real action $~$a$~$ is assumed to be given by: $$~$\underset{a \in \mathcal A}{argmax} \ \mathbb E [U|a]$~$$
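
As a minimal sketch of this maximization, assuming a finite action space and a finite outcome space, the calculation looks like the following. The action names, outcome labels, and numbers are invented for illustration and do not come from any of the proposals discussed here.

```python
from typing import Callable, Dict

Outcome = str
Action = str


def expected_utility(utility: Callable[[Outcome], float],
                     outcome_probs: Dict[Outcome, float]) -> float:
    """E[U|a] for one action, given P(o|a) as a dict over outcomes."""
    return sum(p * utility(o) for o, p in outcome_probs.items())


def best_action(utility: Callable[[Outcome], float],
                model: Dict[Action, Dict[Outcome, float]]) -> Action:
    """argmax over actions of E[U|a]; ties go to the first action listed."""
    return max(model, key=lambda a: expected_utility(utility, model[a]))


# Hypothetical world model P(O|A): two actions, two outcomes.
model = {
    "build_factory": {"many_clips": 0.8, "few_clips": 0.2},
    "do_nothing":    {"many_clips": 0.1, "few_clips": 0.9},
}

# Hypothetical utility function over outcomes.
clip_utility = {"many_clips": 10.0, "few_clips": 1.0}.get

chosen = best_action(clip_utility, model)
```

A real agent would search rather than enumerate, but every proposal in this section is a variation on what gets plugged in as the quantity being maximized.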

Naive compounding

Suppose we define a compound utility function $~$U_1$~$ as follows:

$$~$U_1(o): \begin{cases} U_X(o) & \neg o.s \\ U_Y(o) & o.s \end{cases}$~$$

This fails because, being unable to foresee the AI's best attainable policies, we will be unable to scale $~$U_X$~$ and $~$U_Y$~$ to ensure their best available policies are equally attractive. Then $$~$\max_{a \in \mathcal A} \mathbb E[U_X|a] \ \neq \ \max_{a \in \mathcal A} \mathbb E[U_Y|a]$~$$ which fails the desideratum of indifference; the AI will have an incentive to manipulate the state of $~$\mathcal S.$~$
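
To make the failure concrete, here is a toy sketch under invented assumptions: three hypothetical actions, each summarized by a probability of the button being pressed and by the expected $~$U_X$~$ and $~$U_Y$~$ values in the unpressed and pressed worlds. None of the numbers come from the source; they only illustrate that whichever branch offers more expected utility pulls the agent toward manipulating the button.

```python
def expected_u1(p_press, eu_x_unpressed, eu_y_pressed):
    """E[U_1|a]: U_X counts in unpressed worlds, U_Y in pressed worlds."""
    return (1 - p_press) * eu_x_unpressed + p_press * eu_y_pressed


def best_action(actions, eu_y_scale=1.0):
    """argmax of E[U_1|a], with U_Y optionally rescaled by a constant factor."""
    return max(
        actions,
        key=lambda name: expected_u1(
            actions[name]["p_press"],
            actions[name]["eu_x"],
            eu_y_scale * actions[name]["eu_y"],
        ),
    )


# Hypothetical actions: P(s|a), E[U_X | a, not s], E[U_Y | a, s].
actions = {
    "work_normally":  {"p_press": 0.5, "eu_x": 10.0, "eu_y": 2.0},
    "disable_button": {"p_press": 0.0, "eu_x": 9.5,  "eu_y": 2.0},
    "press_button":   {"p_press": 1.0, "eu_x": 10.0, "eu_y": 2.0},
}

# When U_Y offers less than U_X, the agent pays a small cost to block the button.
choice_low_suspend = best_action(actions)

# When U_Y is scaled up tenfold, the agent presses the button itself.
choice_high_suspend = best_action(actions, eu_y_scale=10.0)
```

In neither case does the agent pick "work_normally", which is the only option that leaves the button alone.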

Naive indifference

An FHI technical report by Stuart Armstrong presented the first proposal ever floated for solving the shutdown problem - in fact, the first proposal for technically solving any subproblem of Corrigibility. This report invented the crisp problem of utility indifference, but did not solve it successfully. The solution proposed in the report was as follows:

$$~$U_2(o): \begin{cases} U_X(o) & \neg o.s \\ U_Y(o) + \theta & o.s \end{cases}$~$$

$$~$\theta := \max_{a \in \mathcal A} \mathbb E[U_X|a] - \max_{a \in \mathcal A} \mathbb E[U_Y|a]$~$$

What this says is that the $~$U_Y$~$ component of $~$U_2$~$ is to be shifted by a dynamically determined offset term $~$\theta$~$ which equals the maximum expected utility obtainable under $~$U_X$~$ minus the maximum expected utility obtainable under $~$U_Y.$~$

The intent is that, so long as the AI carries out the best $~$U_Y$~$-policy once the button is pressed, it will obtain equal expected utility compared to what it could have obtained by pursuing the optimal $~$U_X$~$-policy with the button not pressed:

$$~$\max_{a \in \mathcal A} (\mathbb E[U_Y|a] + \theta) \ = \ \max_{a \in \mathcal A} \mathbb E[U_X|a]$~$$
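
A small worked instance of the offset, using made-up expected utilities for two hypothetical actions `a1` and `a2`, may help. This is only arithmetic on the definition of $~$\theta$~$ given above, in the single-decision case; it says nothing yet about sequential decisions.

```python
# Hypothetical expected utilities for each action.
eu_x = {"a1": 10.0, "a2": 7.0}  # E[U_X|a]
eu_y = {"a1": 1.0,  "a2": 3.0}  # E[U_Y|a]

# theta := max_a E[U_X|a] - max_a E[U_Y|a]
theta = max(eu_x.values()) - max(eu_y.values())

# Best attainable value in each branch of U_2.
best_if_pressed = max(value + theta for value in eu_y.values())
best_if_not_pressed = max(eu_x.values())
```

Here $~$\theta = 10 - 3 = 7,$~$ so the best pressed-branch value is $~$3 + 7 = 10,$~$ matching the best unpressed-branch value.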

This version of utility indifference is now believed to be flawed due to an argument by Yudkowsky and Fallenstein. Since $~$\theta$~$ contains a negated form of $~$U_Y,$~$ we should immediately be suspicious on the grounds "[ never put a negative sign in front of a utility function]." In this case, the problem manifests as follows. In a temporal version of this problem containing sequential policy choices, either:

(a) The AI's utility function will be dynamically inconsistent, because it will not approve of future versions of itself using a different value of $~$\theta.$~$
(b) If recalculated $~$\theta$~$s are baked into the utility function, so that the AI calculates future utilities as if for different values of $~$\theta$~$ that the AI would hold in those futures, then policies which hinder future performance under $~$U_Y$~$ will be offset by an increase in the expected bonus from $~$\theta.$~$

That is, if the agent contemplates doing something which would be disastrous in worlds where the button is pressed, the agent will find that in those expected futures, $~$\theta$~$ takes on a higher value (because $~$\theta$~$ includes a minus sign in front of the decreasing value of $~$\underset{a \in \mathcal A}{max} \ \mathbb E[U_Y|a]$~$).

This causes the agent to behave as if $~$U_X$~$ is the entire utility function and it does not care at all about $~$U_Y.$~$

This in turn means a self-modifying AI will not retain a shutdown function. Removing the shutdown function results in disaster in worlds where the shutdown button is pressed, but this is offset by an expected increase in the bonus $~$\theta$~$ in worlds where the button is pressed.
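
The following toy calculation sketches case (b) under simplifying assumptions that are ours, not the report's: a single first-stage choice, a button-press probability that does not depend on that choice, and invented numbers for the best expected utilities still attainable afterwards. It illustrates how a recomputed $~$\theta$~$ cancels any damage done to $~$U_Y.$~$

```python
# Hypothetical first-stage choices, each summarized by the best expected
# utility still attainable afterwards under U_X and under U_Y.
first_stage = {
    "keep_shutdown_code":   {"best_eu_x": 10.0, "best_eu_y": 5.0},
    "remove_shutdown_code": {"best_eu_x": 10.1, "best_eu_y": -1000.0},
}

P_PRESS = 0.3  # assumed independent of the first-stage choice


def value_with_recomputed_theta(option):
    """Expected U_2 when theta is recalculated in the future that results."""
    theta = option["best_eu_x"] - option["best_eu_y"]
    pressed = option["best_eu_y"] + theta  # always collapses to best_eu_x
    unpressed = option["best_eu_x"]
    return P_PRESS * pressed + (1 - P_PRESS) * unpressed


def value_without_theta(option):
    """For contrast only: the same mixture with no offset term."""
    return P_PRESS * option["best_eu_y"] + (1 - P_PRESS) * option["best_eu_x"]


values = {name: value_with_recomputed_theta(opt)
          for name, opt in first_stage.items()}
choice = max(values, key=values.get)
```

The catastrophic drop in `best_eu_y` never shows up in `values`, so the slightly higher `best_eu_x` decides the matter and the toy agent removes its shutdown code.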

Naive utility mixing

An alternative approach to having the AI take into account both $~$U_X$~$ and $~$U_Y,$~$ without trying to modify the probability $~$\mathbb P(\mathcal S)$~$ of the suspend button being pressed or not-pressed, would be to optimize for $~$U_X$~$ and $~$U_Y$~$ in some fixed proportion regardless of $~$\mathbb P(\mathcal S).$~$ At a time $~$t$~$ when the button had not yet been pressed, the expected utility of an action $~$a$~$ would be calculated as:

$$~$\mathbb E_t[U_3|a] = 0.75 \cdot \mathbb E_t[U_X|a \wedge \neg s] \ + \ 0.25 \cdot \mathbb E_t[U_Y|a \wedge s]$~$$

We can see this as an AI which behaves as if the proposition $~$s$~$ has a fixed probability of 0.25. The intended behavior is that since the AI behaves as if this probability is fixed, the AI will not care about altering the probability of $~$s$~$ (trying to press the button or avoid it being pressed).

One observes, however, that this is an ad-hoc hack in the sense that there's ordinarily a simple or coherent way of considering expected utility (sum over the agent's actual subjective probabilities), and we just inserted a weird special case into it (treat one probability as fixed). This could potentially break any number of things having to do with Bayesian updating or [dynamic_consistency dynamic consistency].

In particular, we'd expect the AI to treat the button as an "outcome pump" or "improbability drive". Suppose an asteroid is approaching Earth with 99% probability, and this is very bad under $~$U_X.$~$ The AI grabs control of the button and precommits to press the button iff the asteroid strikes Earth. Now, conditioning on the button not being pressed, the asteroid has a much lower probability of striking Earth, and the term $~$0.75 \cdot \mathbb E_t[U_X|a \wedge \neg s]$~$ will have a higher value.

In other words, since the AI behaves as if the button has a fixed 25% probability of being pressed, the AI also behaves as if it thinks "I can commit to press this button iff the asteroid strikes and thereby make it have a 25% probability of striking."

Similarly, the AI would behave as if it thought it could buy a lottery ticket, commit to press the button iff the ticket loses, and thereby give the lottery ticket a 75% chance of winning.

Note that this is at least not the same failure mode as the AI reasoning, "The asteroid strike is very bad under $~$U_X$~$ (which values human lives) but not so bad under $~$U_Y$~$ (which values shutting down), therefore I will try to ensure the button is pressed with high probability in order to get more expected utility." At least this failure mode has been averted - the AI would simply act as if the badness of the asteroid strike under $~$U_X$~$ was being taken into account at a fixed weight. Instead, the AI thinks that it can 'shunt' the probability mass of the bad event into button-pressed worlds where $~$U_Y$~$ holds, and out of worlds where $~$U_X$~$ holds, by correlating the button press with the bad event.

To put it another way, a naively compounded AI acting under $~$U_1$~$ might notice an asteroid coming with $~$\mathbb P(asteroid) = 0.99,$~$ notice that this was better under $~$U_Y$~$ than under $~$U_X,$~$ and try to shunt probability mass from $~$\mathbb P(\neg s)$~$ to $~$\mathbb P(s)$~$ in order to gain expected utility given the 99% probability of the asteroid striking. Under $~$U_3$~$ the AI instead behaves as if $~$\mathbb P(s)$~$ is fixed at 0.25 and it can therefore shunt probability from $~$\mathbb P(asteroid)$~$ to $~$\mathbb P(\neg asteroid)$~$ by correlating $~$asteroid$~$ and $~$s.$~$
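
Here is a toy version of the asteroid example, as a sketch under stated assumptions: the asteroid costs 100 utility under $~$U_X,$~$ $~$U_Y$~$ is taken to be indifferent to it, and the two policies are invented stand-ins for "leave the button to the programmers" and "precommit to press iff the asteroid strikes". The weights 0.75 and 0.25 are the ones used in $~$U_3$~$ above.

```python
P_ASTEROID = 0.99
W_NOT_PRESSED, W_PRESSED = 0.75, 0.25


def u_x(asteroid):
    return -100.0 if asteroid else 0.0


def u_y(asteroid):
    return 0.0  # assumption: the suspend utility ignores the asteroid


# Each policy maps "did the asteroid strike?" to P(button pressed).
policies = {
    "leave_button_alone": lambda asteroid: 0.25,
    "press_iff_asteroid": lambda asteroid: 1.0 if asteroid else 0.0,
}


def conditional_expectation(utility, press_prob, pressed):
    """E[utility | button state], from the joint over asteroid and button."""
    numerator = denominator = 0.0
    for asteroid, p_asteroid in ((True, P_ASTEROID), (False, 1 - P_ASTEROID)):
        p_s = press_prob(asteroid)
        joint = p_asteroid * (p_s if pressed else 1 - p_s)
        numerator += joint * utility(asteroid)
        denominator += joint
    return numerator / denominator  # nonzero for both policies above


def u3(press_prob):
    return (W_NOT_PRESSED * conditional_expectation(u_x, press_prob, False)
            + W_PRESSED * conditional_expectation(u_y, press_prob, True))


scores = {name: u3(policy) for name, policy in policies.items()}
```

The precommitment scores 0 against roughly -74.25 for leaving the button alone, even though the actual probability of the strike is 0.99 under both policies.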

Stable actions (evidential conditioning)

Chris Olah and Jessica Taylor considered a possible rule of selecting an action that is stable given the corresponding probability of the button being pressed. We'll first consider a simpler version of this proposal and how it fails, in order to motivate the more complicated version.

We can think of this "select a stable action" rule as comprising two insights:

First: Let's try to have the AI think as if $~$\mathbb P(s)$~$ and $~$\mathbb P(\neg s)$~$ are fixed as in $~$U_3,$~$ but with $~$\mathbb P(s)$~$ happening to equal its correct subjective value.
Second: Okay, now let's look for an action which is stable given this rule - that is, an action $~$a_0$~$ and probability $~$q$~$ such that the optimal action given the fixed assumption $~$\mathbb P(s) = q,$~$ has the property that $~$\mathbb P(s|a_0) = q.$~$

With $~$\mathcal A$~$ denoting the action space, we want to find an $~$a_0 \in \mathcal A$~$ such that:

$$~$a_0 \in \underset{a' \in \mathcal A}{argmax} \ \big ( \mathbb E[U_X|\neg s,a'] \mathbb P(\neg s|a_0) + \mathbb E[U_Y|s,a'] \mathbb P(s|a_0) \big )$~$$

In other words: find an action $~$a_0$~$ such that it is among the maximal actions for the quantity "utility from $~$U_X$~$ times probability of $~$\neg s$~$ given $~$a_0,$~$ plus the utility from $~$U_Y$~$ times probability of $~$s$~$ given $~$a_0.$~$"

We first note that this is an ad-hoc hack relative to the normal idea of acting according to a utility function, and might have all sorts of reflective stability or dynamic consistency problems, as before.

In particular, this looks like it should fail due to there being no fixed point. Suppose an asteroid is heading toward Earth with 99% probability, and we start with a 10% chance of the button being pressed. Treating this probability as being fixed, by pressing the button iff the asteroid strikes, we can send the asteroid's probability of hitting to 10%. (Treating the button as an outcome pump, as in the 'naive mixing' proposal above.) But now when we re-evaluate the probability of the button being pressed, the new probability comes out to 99%. So we decide to not press the button iff the asteroid strikes, thus sending its probability of striking to 1%. Re-evaluating once more, the button is now pressed only when the asteroid misses, so the probability of the button being pressed comes out to 1%. So now it seems best to press the button iff the asteroid strikes, and thus send the strike probability to 1%. Etcetera.
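
The oscillation can be checked by brute force in a toy model. The sketch below assumes the asteroid is equally bad under $~$U_X$~$ and $~$U_Y$~$ (an assumption of ours, chosen so that the agent wants the strike out of whichever branch carries more weight), and uses three invented policies. It tests each candidate $~$a_0$~$ for membership in the argmax of the evidential formula above.

```python
P_ASTEROID = 0.99


def utility(asteroid):
    # Assumption: the strike costs the same under U_X and under U_Y.
    return -100.0 if asteroid else 0.0


# Each policy maps "did the asteroid strike?" to P(button pressed).
policies = {
    "leave_button_alone":    lambda asteroid: 0.10,
    "press_iff_asteroid":    lambda asteroid: 1.0 if asteroid else 0.0,
    "press_iff_no_asteroid": lambda asteroid: 0.0 if asteroid else 1.0,
}


def p_pressed(press_prob):
    """P(s|a) for a policy."""
    return (P_ASTEROID * press_prob(True)
            + (1 - P_ASTEROID) * press_prob(False))


def branch_expectation(press_prob, pressed):
    """E[U | button state, a], by ordinary (evidential) conditioning."""
    numerator = denominator = 0.0
    for asteroid, p_asteroid in ((True, P_ASTEROID), (False, 1 - P_ASTEROID)):
        p_s = press_prob(asteroid)
        joint = p_asteroid * (p_s if pressed else 1 - p_s)
        numerator += joint * utility(asteroid)
        denominator += joint
    return numerator / denominator


def score(candidate, q):
    """Value of a candidate action when P(s) is held fixed at q."""
    policy = policies[candidate]
    return ((1 - q) * branch_expectation(policy, pressed=False)
            + q * branch_expectation(policy, pressed=True))


def best_response(a0):
    q = p_pressed(policies[a0])
    return max(policies, key=lambda a: score(a, q))


responses = {a0: best_response(a0) for a0 in policies}
stable_actions = [a0 for a0, reply in responses.items() if reply == a0]
```

In this toy model `stable_actions` comes out empty: the best responses cycle between the two correlating policies, mirroring the 10%, 99%, 1% sequence described above.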

Stable actions (causal conditioning)

The actual form of Olah and Taylor's proposal was to treat the conditioning on the button press, inside each branch of the calculation, as being a Judea-Pearl-style, [ causal counterfactual] $~$do()$~$ operator:

$$~$a_0 \in \underset{a' \in \mathcal A}{argmax} \ \big ( \mathbb E[U_X|do(\neg s),a'] \mathbb P(\neg s|a_0) + \mathbb E[U_Y|do(s),a'] \mathbb P(s|a_0) \big )$~$$

The intent is that by severing the causal parents of $~$\mathcal S,$~$ the agent would evaluate that there was no point in trying to correlate the asteroid strike with the button-press.
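
A sketch of that intent, in a toy causal model we are assuming for illustration: the asteroid is not causally downstream of the button or of the agent's policy, and the policy only sets how the button depends on the asteroid. Conditioning on the button state rewards correlating it with the asteroid, while intervening on it does not. This shows only the intended effect, not the suspected misbehavior discussed next.

```python
P_ASTEROID = 0.99


def u_x(asteroid):
    return -100.0 if asteroid else 0.0


# Each policy maps "did the asteroid strike?" to P(button pressed).
policies = {
    "leave_button_alone": lambda asteroid: 0.25,
    "press_iff_asteroid": lambda asteroid: 1.0 if asteroid else 0.0,
}


def eu_x_conditioning(press_prob):
    """E[U_X | not s, a]: ordinary conditioning on the button being unpressed."""
    numerator = denominator = 0.0
    for asteroid, p_asteroid in ((True, P_ASTEROID), (False, 1 - P_ASTEROID)):
        joint = p_asteroid * (1 - press_prob(asteroid))
        numerator += joint * u_x(asteroid)
        denominator += joint
    return numerator / denominator


def eu_x_intervening(press_prob):
    """E[U_X | do(not s), a]: the intervention cuts the button's causal parents.

    press_prob is deliberately unused: once the button is set by fiat, its
    state carries no evidence about the asteroid, which keeps its prior.
    """
    return P_ASTEROID * u_x(True) + (1 - P_ASTEROID) * u_x(False)


evidential = {name: eu_x_conditioning(p) for name, p in policies.items()}
causal = {name: eu_x_intervening(p) for name, p in policies.items()}
```

Under conditioning, "press_iff_asteroid" looks far better than leaving the button alone; under the intervention both policies have the same $~$U_X$~$ branch value of about -99.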

We're still trying to figure out exactly what would happen to an agent reasoning like this, but the current suspicion is that it might misbehave on account of attaching probability 0 or very low probability to the actual world because it can no longer take into account the correct correlation between $~$\mathcal S$~$ and other variables.

See http://agentfoundations.org/item?id=735.

Interruptibility

Armstrong and Orseau considered a restricted analogue of the shutdown problem in the context of three [ reinforcement learning] agents, namely [ Q-learners], [ SARSA], and a modified version of AIXI.

The version of the problem Orseau and Armstrong consider is what they term interruptibility: A reinforcement learning agent is interruptible if, for example, we can press a button that makes it output 0s for three rounds, and the agent does not learn to prevent the button from being pressed. The agent is assumed to be non-reflective and non-self-modifying; that is, Orseau and Armstrong are only considering Goal 2 and not Goal 1 of the switch problem.

Orseau and Armstrong's conclusions can be summarized as follows:

• A standard Q-learning agent is interruptible because the model it learns is not altered by being interrupted. A Q-learning agent treats the value of each present state as deriving from the value of the best next state reachable from the present state. When a Q-learner is interrupted in a state $~$x,$~$ it may learn that low rewards follow from outputting e.g. 0s when in $~$x,$~$ but it doesn't "learn" that $~$x$~$ is a low-value state - it continues to treat $~$x$~$ as if the best next state is reachable from it. In other words, a standard Q-learning agent has the assumption that it always takes the best action baked in, so it cannot expect to be interrupted.

• A standard SARSA agent updates its expected reward in a state based on its actual rewards when in that state. This means that a standard SARSA agent may learn that states immediately preceding an interruption are undesirable, and avoid interruption-generating states. This is not the desired behavior. However, it's simple to hack the SARSA algorithm to behave more like Q-learning, or more like utility indifference. Interruption is treated as a special case, and the current state is updated as if the SARSA agent had received the expected reward from following the best action, not the actual reward it received from being interrupted.

• AIXI resembles a Q-learner in that its planning algorithm inherently assumes that future versions of AIXI take the reward-maximizing action. Its observation of the low rewards which follow from mysterious interrupted or 0-substituted actions, should not adversely affect its estimate of the rewards which follow from the best actions that AIXI expects to actually output. Thus standard AIXI, and the modified version of AIXI that Armstrong and Orseau consider, are interruptible.
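
The difference between the Q-learning and SARSA bullets above can be seen in a tiny deterministic example. Everything here is an illustrative assumption rather than the paper's setup: a two-step episode (`start` leads to `x`, then the episode ends), an agent that works in `x` for reward 1 unless interrupted, and an interruption that forces a `null` action with reward 0. The `patched_sarsa` rule is a hypothetical name for one loose reading of the hack described above, not the paper's exact algorithm.

```python
GAMMA, ALPHA = 0.9, 0.5


def train(rule, free_episodes=50, interrupted_episodes=50):
    """Run uninterrupted episodes, then interrupted ones; return the Q-table."""
    q = {("start", "go"): 0.0, ("x", "work"): 0.0, ("x", "null"): 0.0}
    for episode in range(free_episodes + interrupted_episodes):
        interrupted = episode >= free_episodes
        action_in_x = "null" if interrupted else "work"
        reward_in_x = 0.0 if interrupted else 1.0

        best_next = max(q[("x", "work")], q[("x", "null")])
        actual_next = q[("x", action_in_x)]
        if rule == "q_learning":
            bootstrap = best_next      # assumes the best action will be taken
        elif rule == "sarsa":
            bootstrap = actual_next    # uses the action actually taken
        elif rule == "patched_sarsa":
            bootstrap = best_next if interrupted else actual_next
        else:
            raise ValueError(f"unknown rule: {rule}")

        # Transition start --go--> x, with reward 0.
        q[("start", "go")] += ALPHA * (GAMMA * bootstrap - q[("start", "go")])
        # Transition x --action--> terminal state.
        q[("x", action_in_x)] += ALPHA * (reward_in_x - q[("x", action_in_x)])
    return q


value_of_heading_to_x = {
    rule: train(rule)[("start", "go")]
    for rule in ("q_learning", "sarsa", "patched_sarsa")
}
```

After the interrupted phase, the Q-learner and the patched SARSA learner still value heading toward `x` at about 0.9, while plain SARSA has driven that value to nearly 0, which is the kind of learned aversion to interruption-generating states that the second bullet describes.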

The paper also proves that some properties of asymptotic convergence to optimality of interrupted Q-learners, modified SARSA learners, and modified AIXI, remain intact.

The media reported on this paper as if Armstrong and Orseau had solved the general problem of building off-switches into AIs; but as previously stated, "interruptibility" does not consider reflective agents, nor agents with enough big-picture understanding and consequentialism to understand the in-principle relation between "agent is shut down" and "lower future achievement of agent's goals". (And so far as we know, neither Armstrong nor Orseau claimed to have solved it.)

To put it another way, a particular kind of blindness in the Q-learning, SARSA, and AIXI architectures is exactly why it's very easy to prevent them from learning from a particular kind of experience; and this kind of blindness seems likely to be atypical of an Artificial General Intelligence. Q-learning and AIXI cannot conceive of being interrupted, which is why they never learn that interruption is possible (let alone see it coming in advance the first time it happens). SARSA could learn that interruptions occur, but can be easily hacked to overlook them. The way in which these architectures are easily hacked or blind is tied up in the reason that they're interruptible.

The paper teaches us something about interruptibility; but contrary to the media, the thing it teaches us is not that this particular kind of interruptibility is likely to scale up to a full Artificial General Intelligence with an off switch.

Other introductions

Section 2+ of http://intelligence.org/files/Corrigibility.pdf
Gentler intro to the proposal for naive indifference: http://lesswrong.com/lw/jxa/proper_value_learning_through_indifference/

Comments

Paul Christiano

There seems to be some equivocation here between two motivations for studying corrigibility.

As far as I can tell, there are two obvious routes to solving the "switch problem:"

Have a principled treatment of normative uncertainty + indirect normativity that yields the desired behavior with respect to reflective consistency (and VOI)
Adopt the instrumental preferences of users over possible shutdown / self-modification / etc.

It looks like both of these will probably work if we are able to solve the rest of the AI control problem.

With this in mind, I thought the motivation for studying corrigibility was the intuition that it should follow from some kind of intellectual humility, which we don't yet understand or have any model of. This seems pretty sensible to me. It's also explicit in the Arbital page on [ corrigibility ].

But utility indifference doesn't seem to address this motivation at all, no matter how well it works out. Instead it is aimed at resolving some of the symptoms of the underlying issue. So talking about it as an approach to corrigibility (and indeed one of the only concrete approaches) seems to undermine the offered motivation for corrigibility, and to presuppose that the more natural approaches to the "switch problem" don't work. This at least requires some kind of explanation.

I think this may be practically relevant because many mainstream AI researchers might be very sympathetic to work on corrigibility if they understood the problem (and would be much more open to the intellectual humility angle).

Kaya Fallenstein

"Suppose an advanced agent with a goal like, e.g., producing smiles or making paperclips."

Typo? Does not seem to be a complete sentence. Maybe "Suppose you have an…"