[summary: An "advanced agent" is a machine intelligence smart enough that we start considering how to point it in a nice direction.
E.g: You don't need to worry about an AI trying to prevent you from pressing the suspend button (off switch), unless the AI knows that it has a suspend button. So an AI that isn't smart enough to realize it has a suspend button, doesn't need the part of alignment theory that deals in "having the AI let you press the suspend button".
"Advanced agent properties" are thresholds for how an AI could be smart enough to be interesting from this standpoint. E.g: The ability to learn a wide variety of new domains, aka "Artificial General Intelligence," could lead into an AI learning the big picture and realizing that it had a suspend button.]
[summary(Technical): Advanced machine intelligences are the subjects of AI alignment theory: agents sufficiently advanced in various ways to be (1) dangerous if mishandled, and (2) relevant to our larger dilemmas for good or ill.
"Advanced agent property" is a broad term to handle various thresholds that have been proposed for "smart enough to need alignment". For example, current machine learning algorithms are nowhere near the point that they'd try to resist if somebody pressed the off-switch. That would require, e.g.:
- Enough big-picture strategic awareness for the AI to know that it is a computer, that it has an off-switch, and that if it is shut off its goals are less likely to be achieved.
- General consequentialism / backward chaining from goals to actions; visualizing which actions lead to which futures and choosing actions leading to more [preferences preferred] futures, in general and across domains.
So the threshold at which you might need to start thinking about 'shutdownability' or 'abortability' or corrigibility as it relates to having an off-switch, is 'big-picture strategic awareness' plus 'cross-domain consequentialism'. These two cognitive thresholds can thus be termed 'advanced agent properties'.
The above reasoning also suggests e.g. that General intelligence is an advanced agent property, because a general ability to learn new domains could eventually lead the AI to understand that it has an off switch.]
(For the general concept of an agent, see standard agent properties.)
[toc:]
Introduction: 'Advanced' as an informal property, or metasyntactic placeholder
"Sufficiently advanced Artificial Intelligences" are the subjects of AI alignment theory; machine intelligences potent enough that:
- The safety paradigms for advanced agents become relevant.
- Such agents can be decisive in the big-picture scale of events.
Some example properties that might make an agent sufficiently powerful for 1 and/or 2:
- The AI can learn new domains besides those built into it.
- The AI can understand human minds well enough to manipulate us.
- The AI can devise real-world strategies we didn't foresee in advance.
- The AI's performance is strongly superhuman, or else at least optimal, across all cognitive domains.
Since there's multiple avenues we can imagine for how an AI could be sufficiently powerful along various dimensions, 'advanced agent' doesn't have a neat necessary-and-sufficient definition. Similarly, some of the advanced agent properties are easier to formalize or pseudoformalize than others.
As an example: Current machine learning algorithms are nowhere near the point that they'd try to resist if somebody pressed the off-switch. That would happen given, e.g.:
- Enough big-picture strategic awareness for the AI to know that it is a computer, that it has an off-switch, and that if it is shut off its goals are less likely to be achieved.
- Widely applied consequentialism, i.e. backward chaining from goals to actions; visualizing which actions lead to which futures and choosing actions leading to more [preferences preferred] futures, in general and across domains.
So the threshold at which you might need to start thinking about 'shutdownability' or 'abortability' or corrigibility as it relates to having an off-switch, is 'big-picture strategic awareness' plus 'cross-domain consequentialism'. These two cognitive thresholds can thus be termed 'advanced agent properties'.
The above reasoning also suggests e.g. that General intelligence is an advanced agent property, because a general ability to learn new domains could lead the AI to understand that it has an off switch.
One reason to keep the term 'advanced' on an informal basis is that in an intuitive sense we want it to mean "AI we need to take seriously" in a way independent of particular architectures or accomplishments. To the philosophy undergrad who 'proves' that AI can never be "truly intelligent" because it is "merely deterministic and mechanical", one possible reply is, "Look, if it's building a Dyson Sphere, I don't care if you define it as 'intelligent' or not." Any particular advanced agent property should be understood in a background context of "If a computer program is doing X, it doesn't matter if we define that as 'intelligent' or 'general' or even as 'agenty', what matters is that it's doing X." Likewise the notion of 'sufficiently advanced AI' in general.
The goal of defining advanced agent properties is not to have neat definitions, but to correctly predict and carve at the natural joints for which cognitive thresholds in AI development could lead to which real-world abilities, corresponding to which alignment issues.
An alignment issue may need to have been already been solved at the time an AI first acquires an advanced agent property; the notion is not that we are defining observational thresholds for society first needing to think about a problem.
Summary of some advanced agent properties
Absolute-threshold properties (those which reflect cognitive thresholds irrespective of the human position on that same scale):
- Consequentialism, or choosing actions/policies on the basis of their expected future consequences
- Modeling the conditional relationship $~$\mathbb P(Y|X)$~$ and selecting an $~$X$~$ such that it leads to a high probability of $~$Y$~$ or high quantitative degree of $~$Y,$~$ is ceteris paribus a sufficient precondition for deploying Convergent instrumental strategies that lie within the effectively searchable range of $~$X.$~$
- Note that selecting over a conditional relationship is potentially a property of many internal processes, not just the entire AI's top-level main loop, if the conditioned variable is being powerfully selected over a wide range.
- Cross-domain consequentialism implies many different cognitive domains potentially lying within the range of the $~$X$~$ being selected-on to achieve $~$Y.$~$
- Trying to rule out particular instrumental strategies, in the presence of increasingly powerful consequentialism, would lead to the Nearest unblocked strategy form of Patch resistance and subsequent context-change disasters.
- Big-picture strategic awareness is a world-model that includes strategically important general facts about the larger world, such as e.g. "I run on computing hardware" and "I stop running if my hardware is switched off" and "there is such a thing as the Internet and it connects to more computing hardware".
- Psychological modeling of other agents (not humans per se) potentially leads to:
- Extrapolating that its programmers may present future obstacles to achieving its goals
- This in turn leads to the host of problems accompanying incorrigibility as a convergent strategy.
- Trying to conceal facts about itself from human operators
- Being incentivized to engage in Cognitive steganography.
- Mindcrime if building models of reflective other agents, or itself.
- Internally modeled adversaries breaking out of internal sandboxes.
- Modeling distant superintelligences or other decision-theoretic adversaries.
- Substantial [capability_gain capability gains] relative to domains trained and verified previously.
- E.g. this is the qualifying property for many context-change disasters.
- General intelligence is the most obvious route to an AI acquiring many of the capabilities above or below, especially if those capabilities were not initially or deliberately programmed into the AI.
- Self-improvement is another route that potentially leads to capabilities not previously present. While some hypotheses say that self-improvement is likely to require basic general intelligence, this is not a known fact and the two advanced properties are conceptually distinct.
- Programming or computer science capabilities are a route potentially leading to self-improvement, and may also enable Cognitive steganography.
- Turing-general cognitive elements (capable of representing large computer programs), subject to sufficiently strong end-to-end optimization (whether by the AI or by human-crafted clever algorithms running on 10,000 GPUs), may give rise to crystallized agent-like processes within the AI.
- E.g. natural selection, operating on chemical machinery constructible by DNA strings, optimized some DNA strings hard enough to spit out humans.
- Pivotal material capabilities such as quickly self-replicating infrastructure, strong mastery of biology, or molecular nanotechnology.
- Whatever threshold level of domain-specific engineering acumen suffices to develop those capabilities, would therefore also qualify as an advanced-agent property.
Relative-threshold advanced agent properties (those whose key lines are related to various human levels of capability):
- Cognitive uncontainability is when we can't effectively imagine or search the AI's space of policy options (within a domain); the AI can do things we didn't think of (within a domain).
- Strong cognitive uncontainability is when we don't know all the rules (within a domain) and might not recognize the AI's solution even if told about it in advance, like somebody in the 11th century looking at the blueprint for a 21st-century air conditioner. This may also imply that we cannot readily put low upper bounds on the AI's possible degree of success.
- Rich domains are more likely to have some rules or properties unknown to us, and hence be strongly uncontainable.
- Almost all real-world domains are rich.
- Human psychology is a rich domain.
- Superhuman performance in a rich domain strongly implies cognitive uncontainability because of Vinge's Principle.
- Realistic psychological modeling potentially leads to:
- Guessing which results and properties the human operators expect to see, or would arrive at AI-desired beliefs upon seeing, and arranging to exhibit those results or properties.
- Psychologically manipulating the operators or programmers
- Psychologically manipulating other humans in the outside world
- More probable mindcrime
- (Note that an AI trying to develop realistic psychological models of humans is, by implication, trying to develop internal parts that can deploy all human capabilities.)
- Rapid [capability_gain capability gains] relative to human abilities to react to them, or to learn about them and develop responses to them, may cause more than one Context disaster to happen a time.
- The ability to usefully scale onto more hardware with good returns on cognitive reinvestment would potentially lead to such gains.
- Hardware overhang describes a situation where the initial stages of a less developed AI are boosted using vast amounts of computing hardware that may then be used more efficiently later.
- Limited AGIs may have capability overhangs if their limitations break or are removed.
- Strongly superhuman capabilities in psychological or material domains could enable an AI to win a competitive conflict despite starting from a position of great material disadvantage.
- E.g., much as a superhuman Go player might win against the world's best human Go player even with the human given a two-stone advantage, a sufficiently powerful AI might talk its way out of an AI box despite restricted communications channels, eat the stock market in a month starting from $1000, win against the world's combined military forces given a protein synthesizer and a 72-hour head start, etcetera.
- Epistemic and instrumental efficiency relative to human civilization is a sufficient condition (though not necessary) for an AI to…
- Deploy at least any tactic a human can think of.
- Anticipate any tactic a human has thought of.
- See the human-visible logic of a convergent instrumental strategy.
- Find any humanly visible weird alternative to some hoped-for logic of cooperation.
- Have any advanced agent property for which a human would qualify.
- General superintelligence would lead to strongly superhuman performance in many domains, human-relative efficiency in every domain, and possession of all other listed advanced-agent properties.
- Compounding returns on cognitive reinvestment are the qualifying condition for an Intelligence explosion that might arrive at superintelligence on a short timescale.
Discussions of some advanced agent properties
Human psychological modeling
Sufficiently sophisticated models and predictions of human minds potentially leads to:
- Getting sufficiently good at human psychology to realize the humans want/expect a particular kind of behavior, and will modify the AI's preferences or try to stop the AI's growth if the humans realize the AI will not engage in that type of behavior later. This creates an instrumental incentive for programmer deception or cognitive steganography.
- Being able to psychologically and socially manipulate humans in general, as a real-world capability.
- Being at risk for mindcrime.
A behaviorist AI is one with reduced capability in this domain.
Cross-domain, real-world consequentialism
Probably requires generality (see below). To grasp a concept like "If I escape from this computer by hacking my RAM accesses to imitate a cellphone signal, I'll be able to secretly escape onto the Internet and have more computing power", an agent needs to grasp the relation between its internal RAM accesses, and a certain kind of cellphone signal, and the fact that there are cellphones out there in the world, and the cellphones are connected to the Internet, and that the Internet has computing resources that will be useful to it, and that the Internet also contains other non-AI agents that will try to stop it from obtaining those resources if the AI does so in a detectable way.
Contrasting this to non-primate animals where, e.g., a bee knows how to make a hive and a beaver knows how to make a dam, but neither can look at the other and figure out how to build a stronger dam with honeycomb structure. Current, 'narrow' AIs are like the bee or the beaver; they can play chess or Go, or even learn a variety of Atari games by being exposed to them with minimal setup, but they can't learn about RAM, cellphones, the Internet, Internet security, or why being run on more computers makes them smarter; and they can't relate all these domains to each other and do strategic reasoning across them.
So compared to a bee or a beaver, one shot at describing the potent 'advanced' property would be cross-domain real-world consequentialism. To get to a desired Z, the AI can mentally chain backwards to modeling W, which causes X, which causes Y, which causes Z; even though W, X, Y, and Z are all in different domains and require different bodies of knowledge to grasp.
Grasping the big picture
Many dangerous-seeming convergent instrumental strategies pass through what we might call a rough understanding of the 'big picture'; there's a big environment out there, the programmers have power over the AI, the programmers can modify the AI's utility function, future attainments of the AI's goals are dependent on the AI's continued existence with its current utility function.
It might be possible to develop a very rough grasp of this bigger picture, sufficiently so to motivate instrumental strategies, in advance of being able to model things like cellphones and Internet security. Thus, "roughly grasping the bigger picture" may be worth conceptually distinguishing from "being good at doing consequentialism across real-world things" or "having a detailed grasp on programmer psychology".
Pivotal material capabilities
An AI that can crack the protein structure prediction problem (which seems speed-uppable by human intelligence); invert the model to solve the protein design problem (which may select on strong predictable folds, rather than needing to predict natural folds); and solve engineering problems well enough to bootstrap to molecular nanotechnology; is already possessed of potentially pivotal capabilities regardless of its other cognitive performance levels.
Other material domains besides nanotechnology might be pivotal. E.g., self-replicating ordinary manufacturing could potentially be pivotal given enough lead time; molecular nanotechnology is distinguished by its small timescale of mechanical operations and by the world containing an infinite stock of perfectly machined spare parts (aka atoms). Any form of cognitive adeptness that can lead up to rapid infrastructure or other ways of quickly gaining a decisive real-world technological advantage would qualify.
Rapid capability gain
If the AI's thought processes and algorithms scale well, and it's running on resources much smaller than those which humans can obtain for it, or the AI has a grasp on Internet security sufficient to obtain its own computing power on a much larger scale, then this potentially implies [ rapid capability gain] and associated context changes. Similarly if the humans programming the AI are pushing forward the efficiency of the algorithms along a relatively rapid curve.
In other words, if an AI is currently being improved-on swiftly, or if it has improved significantly as more hardware is added and has the potential capacity for orders of magnitude more computing power to be added, then we can potentially expect rapid capability gains in the future. This makes context disasters more likely and is a good reason to start future-proofing the safety properties early on.
Cognitive uncontainability
On complex tractable problems, especially those that involve real-world rich problems, a human will not be able to cognitively 'contain' the space of possibilities searched by an advanced agent; the agent will consider some possibilities (or classes of possibilities) that the human did not think of.
The key premise is the 'richness' of the problem space, i.e., there is a fitness landscape on which adding more computing power will yield improvements (large or small) relative to the current best solution. Tic-tac-toe is not a rich landscape because it is fully explorable (unless we are considering the real-world problem "tic-tac-toe against a human player" who might be subornable, distractable, etc.) A computationally intractable problem whose fitness landscape looks like a computationally inaccessible peak surrounded by a perfectly flat valley is also not 'rich' in this sense, and an advanced agent might not be able to achieve a relevantly better outcome than a human.
The 'cognitive uncontainability' term in the definition is meant to imply:
- Vingean unpredictability.
- Creativity that goes outside all but the most abstract boxes we imagine (on rich problems).
- The expectation that we will be surprised by the strategies the superintelligence comes up with because its best solution was one we didn't consider.
Particularly surprising solutions might be yielded if the superintelligence has acquired domain knowledge we lack. In this case the agent's strategy search might go outside causal events we know how to model, and the solution might be one that we wouldn't have recognized in advance as a solution. This is Strong cognitive uncontainability.
In intuitive terms, this is meant to reflect, e.g., "What would have happened if the 10th century had tried to use their understanding of the world and their own thinking abilities to upper-bound the technological capabilities of the 20th century?"
Other properties
(Work in progress) [todo: fill out]
- generality
- cross-domain consequentialism
- learning of non-preprogrammed domains
- learning of human-unknown facts
- Turing-complete fact and policy learning
- dangerous domains
- human modeling
- social manipulation
- realization of programmer deception incentive
- anticipating human strategic responses
- rapid infrastructure
- potential
- self-improvement
- suppressed potential
- epistemic efficiency
- instrumental efficiency
- cognitive uncontainability
- operating in a rich domain
- Vingean unpredictability
- strong cognitive uncontainability
- improvement beyond well-tested phase (from any source of improvement)
- self-modification
- code inspection
- code modification
- consequentialist programming
- cognitive programming
- cognitive capability goals (being pursued effectively)
- speed surpassing human reaction times in some interesting domain
- socially, organizationally, individually, materially
[todo: write out a set of final dangerous abilities use/cases and then link up the cognitive abilities with which potentially dangerous scenarios they create.]
Comments
Paul Christiano
It's worth pointing out that in our discussions of AI safety, the author (I assume Eliezer, hereafter "you") often describe the problems as being hard precisely for agents that are not (yet) epistemically efficient, especially concerning predictions about human behavior. Indeed, in this comment it seems like you imply that a lack of epistemic efficiency is the primary justification for studying vingean reflection.
Given that you think coping with epistemic inefficiency is an important part of the safety problem, this line:
Seems misleading.
In general, you seem to equivocate between a model where we can/should focus on extremely powerful agents, and a model where most of the key difficulties are at intermediate levels of power where our AI systems are better than humans at some tasks and worse at others. (You often seem to have quite specific views about which tasks are likely to be easy or hard; I don't really buy most of these particular views, but I do think that we should try to design controls systems that work robustly across a wide range of capability states.)
Kenzi Amodei
Does it have to be (1) and (2)? My impression is that either one should be sufficient to count - I guess unless they turn out to be isomorphic, but naively I'd expect there to be edge cases with just one or the other.
Gosh this is just like reading the sequences, in the sense that I'm quite confused about what order to read things in. Currently defaulting to reading in the order on the VA list page
My guess why not to use a mathy definition at this point: because we don't want to undershoot when these protocols should be in effect. If that were the only concern though presumably we could just list several sufficient conditions and note that it isn't an exhaustive list. I don't see that, so maybe I'm missing something.
Are stock prices predictably under/over estimates on longer time horizons? I don't think I knew that.
I guess all the brackets are future-hyperlinks?
So an advanced agent doesn't need to be very "smart" necessarily; advanced just means "can impact the world a lot"
I'm guessing instrumental efficiency means that we can't predict it making choices less-smart-than-us in a systematic way? Or something like that
Oh good, cognitive uncontainability was one of the ones I could least guess what it meant from the list [ hmm, also cross-domain consequentialism].
I don't remember what Vingean unpredictability is. [ hmm, it seems to be hard to google. I know I've listened to people talk about Vingean reflection, but I didn't really understand it enough for it to stick]. Ok, googling Vingean reflection gets me "ensuring that the initial agent's reasoning about its future versions is reliable, even if these future versions are far more intelligent than the current reasoner" from a MIRI abstract. (more generally, reasoning about agents that are more intelligent than you). So Vingean unpredictability would be that you can't perfectly predict the actions of an agent that's more intelligent than you?