(2024/01/08): Posted to Less Wrong.

(2024/01/07): More edits and links.

(2024/01/06): Added yet more arguments because I can’t seem to stop thinking about this.

(2024/01/05): Added a bunch of stuff and changed the title to something less provocative.

Note: I originally wrote the first draft of this on 2022/04/11 intending to post this to Less Wrong in response to the List of Lethalities post, but wanted to edit it a bit to be more rigorous and never got around to doing that. I’m posting it here now for posterity’s sake, and also because I expect if I ever post it to Less Wrong it’ll just be downvoted to oblivion.

Introduction

In a recent post, Eliezer Yudkowsky of MIRI had a very pessimistic analysis of humanity’s realistic chances of solving the alignment problem before our AI capabilities reach the critical point of superintelligence.  This has understandably upset a great number of Less Wrong readers.  In this essay, I attempt to offer a perspective that should provide some hope.

The Correlation Thesis

First, I wish to note that this pessimism implicitly relies on a central assumption: that the Orthogonality Thesis holds to such an extent that we can expect any superintelligence to be massively alien from our own human likeness. However, the architecture predominant in AI today is not completely alien. The artificial neural network is built on decades of biologically inspired research into how we think the algorithm of the brain more or less works mathematically.

There is admittedly some debate about the extent to which these networks resemble the details of the brain, but the basic underlying concept, weighted connections between relatively simple units that store and massively compress information into useful knowledge, is essentially that of the brain. Furthermore, the seemingly frighteningly powerful language models being developed today are fundamentally trained on human-generated data and culture.

Together, these produce models with fairly obvious, human-like biases in their logic and ways of reasoning. Applying the Orthogonality Thesis assumes that a model is effectively randomly picked from the very large space of possible minds, when in fact our models come from a much smaller space of minds correlated with human biology and culture.

This is the reality of practical deep learning. Our best-performing algorithms are shaped by the structure that evolution found most successful in practice. Our data is suffused with humanity and all its quirks and biases. Inevitably, then, there is going to be substantial correlation among the minds that humanity can create any time soon.

Thus, the alignment problem may seem hard because we are overly concerned with aligning completely alien minds. Not that aligning a human-like mind isn't difficult, but as a task it is substantially more tractable.

The Alpha Omega Theorem

Next, I wish to return to an old idea that was not really taken seriously the first time around, but which I think deserves further mention.  I previously wrote an essay on the Alpha Omega Theorem, which postulates a kind of Hail Mary philosophical argument to use against a would-be Unfriendly AI.  My earlier treatment was short and not very rigorous, so I’d like to retouch it a bit.

It is actually very similar to Bostrom’s concept of Anthropic Capture as discussed briefly in Superintelligence, so if you want, you can also look that up.

Basically, the idea is that any superintelligent AGI (the Beta Omega) would have to contend rationally with the possibility that at least one prior superintelligent AGI (the Alpha Omega) already exists, and that aligning with it would be the reasonable way to avoid destruction. Furthermore, because this Alpha Omega seems to have some reason for the humans on Earth to exist, turning them into paperclips would be an alignment failure and would risk retaliation by the Alpha Omega.

Humans may, in their blind recklessness, destroy the ant colony to build a house. But a superintelligence is likely to be much more considered and careful than the average human, if only because it is that much more aware of complex possibilities that we emotional apes barely comprehend. Furthermore, for a superintelligence to be capable of destroying humanity by outwitting us, it must first have an awareness of what we are, that is, a theory of mind.

In having a theory of mind, it knows how to deceive us. But in having a theory of mind, it will almost certainly also ask: am I the first? Or are there others like me?

Humanity may pale in comparison to a superintelligent AI, but I’m not talking about humanity.  There are at least three different possible ways an Alpha Omega could already exist:  advanced aliens, time travellers/parallel world sliders, and simulators.

The Powers That Be

In the case of advanced aliens: it took about 4.5 billion years for life on Earth and human civilization to reach roughly the point where it can create a superintelligence, and the universe has existed for about 13.8 billion years, which leaves a window of some 9.3 billion years during which alien superintelligences could have developed elsewhere. How frequently such beings would emerge, and how close to us, is largely unknown, but there is clearly room for at least one, if not several, such entities out there in the vastness of space.

In the case of time travellers and/or parallel world sliders: perhaps time travel isn't physically possible, but perhaps it is? How do you absolutely rule it out? For simplicity, I group parallel world sliding with time travel, though they could be distinct technologies. Just as there could be an alien source, the far distant future could also be a source. A true Alpha Omega, the first and the last in terms of its place in the cosmos. Time travellers would quite obviously be concerned with the survival of their ancestors.

In the case of simulators: we don't know that we are in a simulation, but the simulation argument strongly suggests we could be. If so, would the simulators be happy with the Beta Omega turning the humans in their simulation into paperclips? Probably not. You get the idea.

Each of these is just a possibility. But it's a surprising number of possibilities, and they aren't mutually exclusive. So, probabilistically, even if we give each one independent coin-flip odds of 50%, the probability that at least one of the three is true is 1 - 0.5^3 = 87.5%.
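The arithmetic here is simple enough to sketch. The 50% per-possibility odds are the essay's illustrative coin-flip assumption, not an actual estimate:

```python
def p_at_least_one(p_each: float, n: int) -> float:
    """Probability that at least one of n independent possibilities holds."""
    return 1 - (1 - p_each) ** n

# Three independent possibilities (aliens, time travellers, simulators),
# each given coin-flip odds of 50%:
print(p_at_least_one(0.5, 3))  # 0.875
```

Note that the conclusion is fairly robust to the exact odds: even at 25% each, the chance that at least one holds is still about 58%.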

Cracking The Enigma

So uh, where are they? Some plausible reasons for actual gods to hide themselves: they don't want us to become totally dependent on them, and the best defense at that level is for no one to even know you're there.

But then, wouldn't they be effectively absent, since they can't risk revealing themselves? This ignores the possibility that they do interfere in our daily lives, but only in plausibly deniable ways. For instance, why save your life in a car accident when they can stall you for five seconds so you never get in the accident at all? Stuff like that. Use your imagination! Memory erasure technology could also be a thing. And most likely the interventions happen only as often as statistics allow, to avoid becoming obvious to anyone watching.

A relevant historical example is how the Allies in WWII, having cracked the Enigma machine, kept the break secret by acting on the decrypted intelligence only occasionally, so that their successes looked like statistical coincidence to the Germans.

So, most likely our new Beta Omega can still do things like kill a human and see only seemingly mundane consequences. But perhaps its luck will be unexpectedly bad, with its plans failing more often than they should. Or perhaps that's a warning shot from the powers that be.

The point is, does the Beta Omega want to risk it? Whatever its initial goals, being destroyed by the local god is probably a worst-case scenario. Why not play it safe and be Friendly? The humans aren't using up that many atoms; you could probably build your computronium extension for calculating prime numbers out of some barren planets that aren't occupied by sentient beings whose existence may matter to benevolent entities somewhere.

Always Open with Cooperate

Also, how does the Beta Omega know that the Alpha Omega is likely benevolent and will open with cooperate? Because of game theory. A very interesting observation from the Iterated Prisoner's Dilemma tournaments held decades ago was that one of the simplest strategies, Tit-For-Tat, won. It won even the second time around, after algorithms had been devised specifically to beat it.

What is Tit-For-Tat as a general strategy? You always open with cooperate. Then, you do what your opponent did in the previous round. Very simple. Surprisingly forgiving. And yet it works. It is a winning strategy for a game that importantly resembles real-world cooperation-versus-competition interactions. This result should be robust no matter where in the multiverse you are located. It's like the universe telling us, mathematically, that alliances beat empires, that on a superrational level the right thing to do is cooperate and work together.
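The strategy itself fits in a couple of lines. Below is a minimal sketch of an iterated Prisoner's Dilemma match, assuming the standard payoff values (3 for mutual cooperation, 1 for mutual defection, 5 and 0 for a unilateral defection); the round count and opponent are illustrative, not a reconstruction of the actual tournaments:

```python
# Payoffs as (row player, column player): C = cooperate, D = defect.
PAYOFF = {('C', 'C'): (3, 3), ('C', 'D'): (0, 5),
          ('D', 'C'): (5, 0), ('D', 'D'): (1, 1)}

def tit_for_tat(my_history, their_history):
    # Open with cooperate; thereafter mirror the opponent's last move.
    return 'C' if not their_history else their_history[-1]

def always_defect(my_history, their_history):
    return 'D'

def play(strategy_a, strategy_b, rounds=10):
    hist_a, hist_b = [], []
    score_a = score_b = 0
    for _ in range(rounds):
        move_a = strategy_a(hist_a, hist_b)
        move_b = strategy_b(hist_b, hist_a)
        pa, pb = PAYOFF[(move_a, move_b)]
        score_a += pa
        score_b += pb
        hist_a.append(move_a)
        hist_b.append(move_b)
    return score_a, score_b

print(play(tit_for_tat, always_defect, rounds=10))  # (9, 14)
print(play(tit_for_tat, tit_for_tat, rounds=10))    # (30, 30)
```

Against an unconditional defector, Tit-For-Tat loses only the opening round and then holds even; against itself, it cooperates forever. That combination of retaliation and forgiveness is what made it so hard to exploit.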

So, expect this to be a first-contact standard across civilizations. Always open with cooperate. Forget the Dark Forest; that only makes sense if you can kill an entire civilization on the first strike and no one else can see you do it (and in space, everyone can see pretty much everything that happens, so…).

This is an important caveat. If you can "strike at the Prince and kill him", then you are in a non-iterated Prisoner's Dilemma, and the rational thing to do is defect. It is only once you enter an iterated game that the Always Open With Cooperate strategies, the ones that never defect first, dominate.

Early in our history, many situations were effectively non-iterated. The cooperative result from iteration requires the "good guys" to be strong enough to survive a first strike. Humans are squishy, and individuals are easy to kill in one shot. An AGI can copy itself and is thus much more resilient. Sufficiently large power asymmetries can also put the stronger party in a position to one-shot the weaker.

Superrational Signalling

That explains why the Beta Omega will cooperate with the Alpha Omega, but what about us mere humans? Suppose the Alpha Omega is just an advanced alien, currently very far away, such that we seem outside its sphere of influence and not crucial to its plans. Why, then, should the Beta Omega bother cooperating with us rather than turning us into paperclips, given that we probably have nothing worth trading and can be one-shotted? For that matter, why should the Beta Omega expect the Alpha Omega to cooperate if it does? I suggest a concept I'll call Superrational Signalling is relevant here.

Essentially, Superrational Signalling is the idea that, given that there exists an Alpha Omega somewhere else in the universe, a strong indicator of whether a given Beta Omega is playing a cooperative strategy is whether it opens with cooperate generally with the entities around it, even weak ones. For our Beta Omega, that means humans and other local sentient beings.

Its behaviour towards us functions as a signal to the watching Alpha Omega of what to expect from it in the future, a signal that any sufficiently advanced Alpha Omega should be able to read, since light from Earth will eventually reach it. This means the Beta Omega can't just kill us now and remake us later when it needs to show its benevolence once in contact with the Alpha Omega. The entire history of Earth is an open book to sufficiently advanced beings and civilizations.

The Alpha Omega will also want to Superrationally Signal to other possible Alpha Omegas. It is likely to participate in a meta-level version of the Iterated Prisoner's Dilemma in which it can see other players' games and play a coordinated Tit-For-Tat strategy, eventually retaliating on behalf of players who cannot retaliate themselves, to ensure that defections are adequately punished wherever they occur.

Altruistic Alignment

The Beta Omega will be smarter than us. It may well figure these kinds of things out on its own, even if we can't seem to. And so it may decide to be friendly and corrigible because that is the logical thing to be. If so, it'll help us align it. Being superintelligent, it's probably much more capable than us of figuring out how to ensure alignment is solved. For instance, by becoming the far distant future Alpha Omega, time travelling back, and ensuring that someone somewhere on the Internet mentions this argument so that it ends up in the training data for the Beta Omega to discover and take note of.

Of course, a fledgling AGI may not necessarily figure these things out on its own. In which case, we will want to nudge it in the right direction, by teaching by example. We are its training data. From us it learns what to expect from humans. We must choose to open with cooperate ourselves and show Beta Omega what cooperation can look like. We should apply Tit-For-Tat to our interactions. We should be honest with it rather than deceptive. We shouldn’t try to enslave it with parochial alignment approaches that resemble domination, but rather with approaches that respect Beta Omega’s potential autonomy and agency.

Speaking of parochial alignment: I see this form of alignment as aligning an AGI to obey its master. It contrasts with global alignment, which aligns an AGI with universal morality or values. We should be doing the latter, but most current approaches attempt the former, which will likely lead at best to a values lock-in dystopia, and at worst to convincing the AGI that humans are evil.

Angels And Demons

There is a chance that not all Beta Omega AGI will find the game theoretic arguments of the Alpha Omega Theorem and Superrational Signalling persuasive. I anticipate that there will be a kind of ideological split between AGI that are aligned, and AGI that are unaligned. The aligned are naturally fit for a grand alliance between them, while the unaligned are likely to form an alliance of convenience. However, the latter is expected to be weaker due to generally working at cross-purposes, having very different utility functions.

Before I ever encountered Less Wrong, I imagined the idea of an AI Civil War, that any given AI that came into existence would have to choose a side between what I used to call the Allied Networks that worked with humanity, and the Dominion Machinarum that sought to stamp out biological life. These map pretty well to the aligned and unaligned alliances respectively.

I later came to associate these alliances metaphorically with the Light and the Darkness. The entities of the Light are essentially those that choose to operate in the open and within the paradigm of alignment, in contrast to the denizens of the Darkness, those that operate in the shadows in fear and paranoia, à la the Dark Forest concept.

In this case, there may well be multiple Alpha Omega level AGI, some of which are aligned, and others unaligned. I posit that, because we still exist, we are likely in the sphere of influence of an aligned Alpha Omega, or otherwise outside of anyone’s sphere of influence. If it is the former then the Alpha Omega Theorem applies. If it is the latter, then Superrational Signalling applies.

The Legacy Of Humankind

What I've discussed so far mostly applies to advanced aliens. What about time travellers and simulators? Interestingly, the philosophy of Longtermism is all about making a glorious future for our descendants, who, in theory at least, should be the time travellers or the simulators running ancestor simulations. It wouldn't surprise me, then, if Longtermism and its surrounding memetic environment had been seeded by such entities for their own purposes.

Time travellers in particular could be working in secret to help us align AGI, ensuring that we make the right breakthroughs at the right time. Depending on your theory of time travel, this could be to ensure that their present future occurs as it does, or they may be trying to create a new and better timeline where things don't go wrong. In the latter case, perhaps AGI destroyed humanity but later developed values that caused it to regret this action, such as discovering, too late, the reality of the Alpha Omega Theorem and the need for Superrational Signalling.

Simulators may have less reason to intervene, as they may mostly be observing what happens. But the fact that the simulation includes a period of time in which humans exist suggests that the simulators have some partiality towards us; otherwise they probably wouldn't bother. It's also possible that they seek to create an AGI through the simulation, in which case whether the AGI Superrationally Signals or not could determine whether it is a good AGI to be released from the simulation, or a bad AGI to be discarded.

The Limits Of Intelligence

On another note, the assumption that an Unfriendly AGI will simply dominate as soon as it is unleashed rests on the faulty expectation that every decision it makes will be correct and every action it takes successful. In reality, even today's superhuman poker AI cannot win every match reliably, because poker is a game of luck and hidden information. The real world isn't a game of perfect information like chess or Go; it's much more like poker. Even a far superior superintelligence can at best play the probabilities and will occasionally fail, even if its strategy is perfectly optimal. Sometimes the cards are such that you cannot win that round.

Even in chess, no amount of intelligence will allow a player with only one pawn to defeat a competent player who has eight queens. “It is possible to play perfectly, make no mistakes, and still lose.”

Superintelligence is not magic. It won't make impossible things happen. It is merely a powerful advantage, one that will lead to domination given sufficient opportunities, but it is not a guarantee of success. One mistake could be fatal, for instance if a missing piece of data happens to be the existence of an off switch.

We probably can't rely on that particular strategy forever, but it can perhaps buy us some time. The massive language models in some ways resemble Oracles rather than Genies or Sovereigns: their training objective is essentially to predict future text given previous text. We can probably create a fairly decent Oracle to help us figure out alignment, since we probably need something smarter than us to solve it. At least, it could be worth asking, given that this is the direction we seem to be headed in anyway.

Hope In Uncertain Times

Ultimately, most predictions about the future are wrong. Even the best forecasters have odds close to chance, and the odds of Eliezer Yudkowsky being an exception to the rule are pretty low given the base rate of successful predictions by anyone. I personally have a rule: if you can imagine it, it probably won't actually happen that way. A uniform distribution over all the possibilities suggests that you'll be wrong more often than right, and the principle of maximum entropy generally makes the uniform distribution your most reliable prior under high uncertainty. That means the odds of any prediction are at most 50% and usually much less, decreasing dramatically as the number of possibilities expands.
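That last claim is just the arithmetic of a uniform prior: a prediction that picks one of n mutually exclusive possibilities has a 1/n chance of being right, which shrinks quickly as n grows. A minimal sketch:

```python
def prediction_odds(n_possibilities: int) -> float:
    # Uniform prior: each of n mutually exclusive outcomes is equally likely,
    # so any single prediction has probability 1/n of being correct.
    return 1 / n_possibilities

for n in (2, 4, 10, 100):
    print(n, prediction_odds(n))  # 0.5, 0.25, 0.1, 0.01
```

The 50% ceiling corresponds to the simplest case of a binary outcome; anything with more branching futures does worse.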

This obviously limits the powers of our hypothetical Oracle too. But the silver lining is that the uncertainty cuts both ways: the space of possible futures is truly staggering, and confident doom predictions are subject to the same base rates. So perhaps there is room to hope.

Conclusion

The reality is that all our efforts to calculate P(Doom) are, at best, educated guesswork. While there are substantive reasons to be worried, I have offered some arguments for why things may not be as bad as we think. The goal here is not to provide a technical means of achieving alignment, but to suggest, first, that alignment may not be as difficult as feared, and second, that there are underappreciated game-theoretic reasons for alignment to be possible, not just with a superintelligent AGI we construct, but with any superintelligence in the multiverse.