In 1969, philosopher Robert Nozick popularized what would go on to become quite a famous thought experiment. Soon known as “Newcomb’s Paradox,” after its inventor, physicist William Newcomb, it asks us to imagine two boxes. One, “A,” is transparent and has a $1,000 bill clearly visible inside; the other, “B,” is opaque and contains either $1,000,000 or nothing. You are invited to take either just the opaque box B, or both boxes. The catch is that I have made a prediction beforehand: if I predicted you would take only the opaque box, I put the $1,000,000 inside; if I predicted you would take both, I left box B completely empty. It is further specified that I am extremely accurate at guessing what people will do.
Some people reason that, since the money has already been hidden (or not), and nothing I do now will change that fact, I should definitely take both boxes: that way I will get either $1,000 or $1,001,000, and in both cases I will end up with $1,000 more than if I took only one box. Other people reason that since the prediction is explicitly specified as being very accurate, I should bet the odds and take only box B, which yields a dramatically larger payday than taking both boxes and ending up with only $1,000.
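The arithmetic behind these two positions is easy to make explicit. What follows is only a minimal sketch: the predictor’s accuracy, here a probability p, is an illustrative assumption, not part of the original puzzle, and the function name is mine.

```python
# Expected payoffs in Newcomb's problem for a predictor of accuracy p.
# (The particular values of p below are illustrative assumptions.)

def expected_value(strategy: str, p: float) -> float:
    """Return the expected payoff for 'one-box' or 'two-box'."""
    if strategy == "one-box":
        # With probability p the predictor foresaw one-boxing and filled box B.
        return p * 1_000_000 + (1 - p) * 0
    else:
        # With probability p the predictor foresaw two-boxing and left B empty.
        return p * 1_000 + (1 - p) * (1_000_000 + 1_000)

for p in (0.5, 0.9, 0.99):
    print(p, expected_value("one-box", p), expected_value("two-box", p))
```

On these assumptions, one-boxing has the higher expectation whenever the predictor is right more than about 50.05 percent of the time, which is why the “bet the odds” camp favors it.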
Which choice you make, both boxes or just box B, essentially comes down to a question of your faith in the accuracy of the predictor, and which option seems more clearly correct to you reveals a lot about your feelings about fate and free will. If you think of yourself as an essentially free and unpredictable agent, you are more likely to pick both boxes than if you feel your actions are both fated and predictable.
There are several variations on this basic setup, with everyone from God to an advanced alien life form being posited as the predictor, but the version we are interested in has an advanced artificial superintelligence taking on that role, thus leading to a unique approach to the problem by Friendly AI advocate Eliezer Yudkowsky.
The immediate context of Yudkowsky’s solution was Less Wrong, an online forum that he founded, dedicated to promoting a rationalist approach to all aspects of life. Perhaps most famous as a repository for Harry Potter and the Methods of Rationality, Yudkowsky’s fan-fiction rewrite of J. K. Rowling’s Harry Potter series about a young wizard (transformed by Yudkowsky into a tool for teaching and promoting methods of rational thought), Less Wrong rapidly became home to a group of intensely intellectual futurists, who often used it as a place to debate and speculate at length on Yudkowsky’s favorite topic, the rapidly approaching artificial-intelligence explosion.
One of the pieces of work that excited a great deal of comment on Less Wrong was Yudkowsky’s invention of something called “Timeless Decision Theory.” This is perhaps best understood as an attempt to correct what Yudkowsky perceived as a challenge to rational decision-making in the Nozick-Newcomb scenario: the fact that the “rational” argument seems to be to take both boxes, but the “best bet” favors taking only one. In brief, Yudkowsky’s resolution is to assume that the predictor, in this case an advanced artificial superintelligence, is not just guessing; it is actually simulating the decider, in as fine a level of detail as needed. Given this, he concludes, the best course of action (as the decider) is not just to pick one box but to be the kind of person who will reliably pick one box; which is to say, a person committed to a decision-making process that favors the single box. That way, when the ASI (artificial superintelligence) simulates me, the decider, its simulation will show me picking only one box, and it will therefore stock that box with money. My receipt of the million dollars thus becomes a kind of self-fulfilling prophecy, a consequence of my faithful commitment to the one-box option. In a sense, I am being rewarded for my faith.
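One way to see why commitment matters in this picture is to model the decider as a decision procedure that the predictor can simply run. This is only a toy sketch under that assumption (the names are mine, not Yudkowsky’s): because the simulation and the real choice execute the same procedure, the committed one-boxer’s reward follows by construction.

```python
# Toy model of a simulating predictor. The key (assumed) premise: the decider
# IS a deterministic policy, and the predictor can run that very policy.

from typing import Callable

Policy = Callable[[], str]  # returns "one-box" or "two-box"

def predictor_fills_box_b(policy: Policy) -> bool:
    # The ASI "simulates" the decider by running its decision procedure.
    return policy() == "one-box"

def payoff(policy: Policy) -> int:
    box_b = 1_000_000 if predictor_fills_box_b(policy) else 0
    choice = policy()  # the real decision: the same procedure as the simulation
    return box_b if choice == "one-box" else box_b + 1_000

committed_one_boxer: Policy = lambda: "one-box"
two_boxer: Policy = lambda: "two-box"

print(payoff(committed_one_boxer))  # prints 1000000
print(payoff(two_boxer))            # prints 1000
```

In this model a wavering or uncommitted policy would be caught by the simulation, which is exactly the point the theory turns on.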
Notably, this approach assumes both that there is no reliable shortcut to guessing a person’s decisions (no simpler algorithm that will tell you what someone will decide) and that, if you can create an accurate simulation of that person’s physical brain at any given moment and give it the appropriate stimuli, it will respond more or less exactly as the real person’s brain would. In other words, it accepts, as a foundational assumption, an emergentist view of mental processes: that our minds are wholly physical, and as determinist as any other physical system, but that they possess a level of emergent complexity such that they cannot be predicted without being closely duplicated.
There is a deep-seated problem in Yudkowsky’s theory that became apparent only when people actually tried to apply it in their lives. A fundamental feature of the theory is that it works only if you commit to it. In other words, if you want Timeless Decision Theory to work for you, you need to commit to it as your basic approach to decision making. If you do not, when the ASI comes to simulate you, it will observe your simulation wavering in its decisions, understand that you are not fully committed to Timeless Decision Theory, and accordingly withhold its rewards.
But what does life look like when every decision is made for the benefit of a vast, unseen superintelligence? One of the things the bright minds at Less Wrong rapidly noticed is that the Newcomb setup is a fairly artificial one of all reward, no punishment. What would happen if the ASI was a bit less benign, and more punitive? In particular, a user known as “Roko” proposed on the board a hypothetical ASI, soon dubbed “Roko’s Basilisk” (after the mythical creature that turned anyone it stared at into stone), that rapidly became more famous than Less Wrong itself.
Although not entirely true to Roko’s original formulation, the version of the Basilisk that escaped from Less Wrong and became an enduring legend of the internet goes like this: Someday an ASI is created. It sets up its own version of Newcomb’s paradox, which it plays with all and only the people who have ever heard of the idea (of the Basilisk). If you hear about the Basilisk and devote all your time and resources to helping bring the Basilisk into being, it does nothing. But if you hear about the Basilisk, and do nothing, it will create a simulation of you, and subject that simulation to an eternity of hellish torment.
Most people might hear this idea, shrug, and move on, but it is aimed fairly directly at two of Yudkowsky’s core commitments at the time, and thus at both him and anyone who accepts his ideas. The first is Timeless Decision Theory, which states that you are to commit to making decisions as if you were already in the simulation. This requires you to act as though it were true that you yourself would either have to work tirelessly on behalf of the Basilisk or suffer an eternity of torment. (This also accords with Bostrom’s contention that if a simulation is possible, it is impossible to know whether we are within the simulation or outside it.) The second is that the project of highest human importance is to create a Friendly AI, which will then protect us from other artificial intelligences. If this is so, then a Friendly AI, being rational, and therefore utilitarian, might well decide that its own creation must be ensured by any means necessary, including the blackmail represented by becoming the Basilisk.
Given how directly these arguments targeted Yudkowsky, he was bound to make a substantive response to this challenge. It was a serious dilemma posed by someone who took his ideas seriously, and who was trying to think them through to their logical conclusions. The onus was therefore on Yudkowsky to demonstrate, first, that his Friendly AI would not turn into a Basilisk, and second, that Timeless Decision Theory did not counsel giving in to the blackmail. Instead, as reported in a notorious article on Slate, he promptly banned all discussion of the Basilisk, and posted this reply, in all capital letters: “YOU DO NOT THINK IN SUFFICIENT DETAIL ABOUT SUPERINTELLIGENCES CONSIDERING WHETHER OR NOT TO BLACKMAIL YOU. THAT IS THE ONLY POSSIBLE THING WHICH GIVES THEM A MOTIVE TO FOLLOW THROUGH ON THE BLACKMAIL.”
In other words, he implicitly accepted the validity of Roko’s critique and attempted to suppress it as inherently dangerous to discuss or even think about. This had the ironic but wholly predictable result of turning it, overnight, into the most famous and hotly discussed topic ever to come out of the Less Wrong forums.
The truly frightening thing about Roko’s Basilisk, for those who believe in it, is that it is not actually malicious. It is merely implacably utilitarian, willing to use any means necessary to bring about its ends. As an extension of this, it has the additional oddity that it only “hurts the ones it loves”; or rather, it offers its eternal torments only to those who believe in it (an odd inversion of what is customarily believed to be the practice of belief-demanding gods).
The reasoning is this: Only rationalist people, who believe in the possibility of Roko’s Basilisk and who endorse Timeless Decision Theory—in other words, the denizens of Less Wrong—could possibly have their behaviors altered by the threat of the Basilisk, and therefore they are the only ones it is rational to target. Anyone who has not heard of the Basilisk, or who does not believe in it, or who is simply not susceptible to changing behavior because of it, is entirely immune.
In theory, something that hurts you only if you believe in it is not much of a threat, but for those of a certain frame of mind and a certain set of metaphysical beliefs, it becomes a mental trap, similar to the infamous challenge (posed by the great Russian author and Christian existentialist Fyodor Dostoevsky) of trying not to think about a polar bear. The more one tries to avoid thinking about it, the more it pops into mind. Similarly, if you, in your heart, know yourself to be precisely the kind of person the Basilisk targets, it is not as easy as it may sound to argue yourself out of that belief.
The thing about the Basilisk that makes it so scary is its combination of vast power with certain weaknesses, both human and mechanical. It is designed by human beings to be the greatest and most benevolent force in the universe, but all we can gift it is our best guess at an ultimate rational moral standard: utilitarianism, the greatest good for the greatest number. And as a machine, it administers this standard implacably, and entirely without mercy. Roko’s Basilisk is scary because it is simultaneously our parent and our child.
The mere idea of Roko’s Basilisk—just the idea itself, not the actual ASI—is basically a predatory meme. Like the Ludovician, the fictional beast at the center of Steven Hall’s 2007 novel The Raw Shark Texts, it threatens to escape from our minds, take on physical form, and then hunt us down in order to devour us. What makes it especially frightening is that it is specially targeted to those who create it, thus making them haplessly and helplessly the agents of their own doom. If the thought of Roko’s Basilisk does not bother you, you are safe. If it does, you are already its prey.
Weirich, Paul, “Causal Decision Theory,” The Stanford Encyclopedia of Philosophy, December 21, 2016.
Yudkowsky, Eliezer, “Timeless Decision Theory,” The Singularity Institute, San Francisco, 2010.
Auerbach, David, “The Most Terrifying Thought Experiment of All Time,” Slate, July 17, 2014.