(This is hard without threaded conversations. Responding to the "agree/disagree" from Eliezer)
The failure scenario that Paul visualizes for Orthogonality is something along the lines of, 'You can't have superintelligences that optimize any external factor, only things analogous to internal reinforcement.'
The failure scenario that Paul visualizes for Orthogonality is something along the lines of, 'The problem of reflective stability is unsolvable in the limit and no efficient optimizer with a unitary goal can be computationally large or self-improving.'
I think there are a lot of plausible failure modes. The two failures you outline don't seem meaningfully distinct given our current understanding, and seem to roughly describe what I'm imagining. Possible examples:
- Systems that simply want to reproduce and expand their own influence are at a fundamental advantage. To make this more concrete, imagine powerful agents that have lots of varied internal processes, and that constant effort is needed to prevent the proliferation of internal processes that are optimized for their own proliferation rather than pursuit of some overarching goal. Maybe this kind of effort is needed to obtain competent high-level behavior at all, but maybe if you have some simple values you can spend less effort and let your own internal character shift freely according to competitive pressures.
- What we were calling "sensory optimization" may be a core feature of some useful algorithms, and it may require a constant fraction of one's resources to repurpose that sensory optimization towards non-sensory ends. This might just be a different way of articulating the last bullet point. I think we could talk about the same thing in many different ways, and at this point we only have a vague understanding of what those scenarios actually look like concretely.
- It turns out that at some fixed level of organization, the behavior of a system needs to reflect something about the goals of that system---there is no way to focus "generic" medium-level behavior towards an arbitrary goal that isn't already baked into that behavior. (The alternative, which seems almost necessary for the literal form of orthogonality, is that you can have arbitrarily large internal computations that are mostly independent of the agent's goals.) This implies that systems with more complex goals need to do at least slightly more work to pursue those goals. For example, if the system only devotes 0.0000001% of its storage space/internal communication bandwidth to goal content, then that puts a clear lower bound on the scale at which the goals can inform behavior. Of course arbitrarily complex goals could probably be specified indirectly (e.g. I want whatever is written in the envelope over there), but if simple indirect representations are themselves larger than the representation of the simplest goals, this could still represent a real efficiency loss.
Paul is worried about something else / Eliezer has completely missed Paul's point.
I do think the more general point, of "we really don't know what's going on here," is probably more important than the particular possible counterexamples. Even if I had no plausible counterexamples in mind, I just wouldn't be especially confident.
I think the only robust argument in favor is that unbounded agents are probably orthogonal. But (1) that doesn't speak to efficiency, and (2) even that is a bit dicey, so I wouldn't go for 99% even on the weaker form of orthogonality that neglects efficiency.
If you can get to 95% cognitive efficiency and 100% technological efficiency, then a human value optimizer ought not to be at an intergalactic-colonization disadvantage or a take-over-the-world-in-an-intelligence-explosion disadvantage, and not even very much of a slow-takeoff disadvantage.
It sounds regrettable but certainly not catastrophic. Here is how I would think about this kind of thing (it's not something I've thought about much quantitatively, and it doesn't seem particularly action-relevant).
We might think that the speed of development or productivity of projects varies a lot randomly. So in the "race to take over the world" model (which I think is the best case for an inefficient project maximizing its share of the future), we'd want to think about what kind of probabilistic disadvantage a small productivity gap introduces.
As a simple toy model, you can imagine two projects; the one that does better will take over the world.
If you thought that productivity was log-normal with a standard deviation of ×/÷ 2 (i.e., a factor of 2), then a 5% productivity disadvantage corresponds to maybe a 48% chance of being more productive. Over the course of more time the disadvantage becomes more pronounced as randomness averages out. If productivity variation is larger or smaller, then it decreases or increases the impact of an efficiency loss. If there are more participants, then the impact of a productivity hit becomes significantly larger. If the good guys only have a small probability of losing, then the cost is proportionally lower. And so on.
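The 48% figure above can be checked directly. This is a minimal sketch of the toy model, assuming each project's log-productivity is normally distributed with standard deviation log 2 (the "factor of 2" variation), and that the project with higher realized productivity wins; the specific parameterization is my assumption, not spelled out in the text.

```python
import math

# Toy model: two projects race; the more productive one takes over the world.
# Productivity is log-normal, varying by a typical factor of ~2,
# so log-productivity is Normal with sigma = log(2).
sigma = math.log(2)

# Project A carries a 5% productivity disadvantage (in log space).
disadvantage = math.log(0.95)

# A - B ~ Normal(disadvantage, 2 * sigma^2), so
# P(A > B) = Phi(disadvantage / (sigma * sqrt(2))),
# and Phi(x) = 0.5 * (1 + erf(x / sqrt(2))).
p_win = 0.5 * (1 + math.erf(disadvantage / (2 * sigma)))
print(f"Chance the disadvantaged project is more productive: {p_win:.1%}")
```

Running this gives roughly 48%, matching the figure in the text; shrinking sigma (less random variation) pushes the probability further below 50%, which is the "larger or smaller variation" effect described above.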
Combining with my other views, maybe one is looking at a cost of tenths of a percent. You would presumably hope to avoid this by having the world coordinate even a tiny bit (I thought about this a bit here). Overall I'll stick with regrettable but far from catastrophic.
(My bigger issue in practice with efficiency losses is similar to your view that people ought to have really high confidence. I think it is easy to make sloppy arguments that one approach to AI is 10% as effective as another, when in fact it is 0.0001% as effective, and that holding yourself to asymptotic equivalence is a more productive standard unless it turns out to be unrealizable.)