Summary: I explain my current understanding of the artificial intelligence security problem, and why I think you shouldn't dismiss it.

In the long run, the most important question for humanity will be whether we manage to create superhuman artificial intelligence before we destroy ourselves and the planet.
A lot has been written on this topic, but I’m dissatisfied with most positions I read about. I want to analyze the topic from first principles, using the lens of software security, which I think is the correct model here.

I. A Primer on the Surprising Nature of Software

When building software, a security vulnerability is a mismatch between your intentions as a programmer and your program’s actual behavior. This is very common, especially because we program computers in formal languages while our thinking is less structured, more fluid and messy. With new technologies it is even more common, because programmers don’t yet have good mental models for what they are building. I saw it happen in the early days of smart contract development, and creating OpenZeppelin (a library of battle-tested smart contracts) was my contribution to solving the problem. Most people don’t understand how hard it is to write good, robust software. Even most programmers don’t understand this: they work in environments where ‘moving fast and breaking things’ is encouraged. It takes years, many fuck-ups, and thousands of careers to advance security practices. Unknown unknowns can (and usually will) lead to catastrophic software failures.

I want to mention a couple of examples where programmers’ intentions mismatched their programs’ behavior in surprising ways, to really drive the point home: building software that behaves as intended is very hard.

In 2016, The DAO, a decentralized autonomous organization, was attacked and lost about $50M worth of Ether. Developers intended the smart contract to allow investors to withdraw their investments, but the code they wrote actually allowed an attacker to recursively withdraw more than their corresponding share of funds in a single transaction. The DAO developers were not idiots; they simply didn’t anticipate that a single transaction could activate the withdrawal mechanism many times.
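To make the failure mode concrete, here’s a minimal Python sketch of the reentrancy pattern behind the attack. All names and numbers are invented, and real Solidity semantics are only loosely mirrored:

```python
# Toy simulation of the reentrancy bug behind The DAO attack
# (hypothetical names; not the actual DAO code).

class VulnerableVault:
    def __init__(self):
        self.balances = {}
        self.total = 0

    def deposit(self, who, amount):
        self.balances[who] = self.balances.get(who, 0) + amount
        self.total += amount

    def withdraw(self, who, receive_callback):
        amount = self.balances.get(who, 0)
        if amount > 0 and self.total >= amount:
            self.total -= amount
            # Bug: funds are sent *before* the balance is zeroed, so the
            # recipient can re-enter withdraw() while its balance is stale.
            receive_callback(amount)
            self.balances[who] = 0

class Attacker:
    def __init__(self, vault):
        self.vault = vault
        self.stolen = 0

    def receive(self, amount):
        self.stolen += amount
        # Re-enter while our recorded balance is still nonzero.
        if self.vault.total >= self.vault.balances["attacker"]:
            self.vault.withdraw("attacker", self.receive)

vault = VulnerableVault()
vault.deposit("honest", 90)
vault.deposit("attacker", 10)

attacker = Attacker(vault)
vault.withdraw("attacker", attacker.receive)
print(attacker.stolen)  # 100: far more than the attacker's 10-unit share
```

The fix, now codified in Solidity’s “checks-effects-interactions” pattern, is to zero the balance before transferring funds, so a re-entrant call finds nothing left to withdraw.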

In 2016, Google discovered that their translation AI had created its own internal language to translate between language pairs it hadn’t been explicitly taught. Developers intended for the AI to learn translations between pairs of languages, but the code they wrote led to the creation of an unexpected ‘interlingua’.

In 2017, a user “accidentally” triggered a bug in the Parity multisig wallet library, freezing over $150M worth of Ether. Developers intended for a certain function to only be called internally, but the code they wrote made this function publicly accessible, allowing devops199 to become the owner and subsequently self-destruct the library. The Parity developers were not careless; they were using techniques that were advanced and new at the time, like proxies, and they never imagined their smart contract could be destroyed by anyone.
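The access-control mistake can be sketched in the same toy style. Again, the names are hypothetical and the real code was a Solidity library used via delegatecall; this only models the visibility bug:

```python
# Toy model of the Parity wallet library bug (hypothetical names).

class WalletLibrary:
    def __init__(self):
        self.owner = None
        self.destroyed = False

    def init_wallet(self, caller):
        # Bug: intended as an internal, one-time setup routine, but it is
        # publicly callable and never checks whether an owner already exists.
        self.owner = caller

    def kill(self, caller):
        # Only the owner may self-destruct the library...
        if caller == self.owner:
            self.destroyed = True  # ...freezing every wallet depending on it

library = WalletLibrary()
library.init_wallet("deployer")   # legitimate setup

library.init_wallet("devops199")  # ...but anyone can re-run it and take over
library.kill("devops199")
print(library.destroyed)  # True
```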

In a 1997 AI class project at UT Austin, students coded AIs to play a five-in-a-row Tic Tac Toe game. The developers intended for their AIs to learn and implement winning strategies, but the code one team wrote allowed its AI to force opponents to crash by making an extremely distant move, causing the opponents’ AIs to run out of memory. This made the rogue AI win every match. The students wanted their program to learn to play tic-tac-toe expertly; instead, it discovered the emergent strategy of crashing the opponent as a winning move.

In 2019, OpenAI trained a group of agents to play hide and seek. Developers intended for the agents to learn strategies for the game, but the code they wrote led to emergent behaviors like ‘box surfing’, where seeker agents exploited glitches in the physics to win. OpenAI’s developers are legit. They just couldn’t anticipate every possible loophole in the physics of a simulated environment.

OK, I think that’s enough examples. The point is: building software that conforms to your intentions is very hard.

II. The Coming Age of Superintelligence

Superintelligence is coming and will create a singularity.

First, let me define superintelligence: I use this term to refer to a being whose cognitive abilities far surpass those of any human in virtually all economically valuable fields, including scientific creativity, general wisdom, and social skills. The concept is widely discussed in the works of AI researchers like Nick Bostrom and Eliezer Yudkowsky. A machine superintelligence can understand or learn any intellectual task that a human being can, with far greater accuracy and efficiency.

Superintelligence is coming. Since the “invention” of deep learning in 2012 and the introduction of Transformers in 2017, there has been a surge in the complexity and proficiency of the tasks machines can perform. Our current best theory of how these techniques work predicts (in simple terms) that as we increase the computing power and data used to train AIs, their intelligence will keep increasing. Given the recent astonishing results of systems like ChatGPT, corporations, nation states, and even wealthy individuals have a huge incentive to train ever bigger and more capable models. It seems clear to me that it’s only a matter of time until some organization achieves human-level, and subsequently superhuman, intelligence.

The (in my opinion, inevitable) creation of a superintelligent AI would lead to what’s known as a technological singularity. This term, borrowed from physics, refers to a point in time where the rules as we know them break down, and predicting what happens beyond that point becomes impossible. How is this related to AI?

A superintelligent AI, capable of recursive self-improvement, could make human intellect obsolete. It would be able to make discoveries and create technologies that we can’t even fathom, much less predict. Think about it: all the achievements, knowledge, and products of human ingenuity would be available to a superintelligence, and its capacity to create new knowledge would exceed that of all human science combined. The pace of technological change would be so rapid, and its impact so deep, that human life would be irreversibly transformed. It would feel like playing chess against a superhuman bot, but in the realms of science, engineering, art, philosophy, and beyond. Humans would soon become irrelevant to the creation of knowledge.

Superintelligent AI is not “yet another tech” that will improve our lives and “make the world a better place”. It’s not “the next big thing”, or just the newly hyped tech bubble: it will lead us into an unprecedented era of abundance. An artificial system that is truly general and superintelligent will be able to solve ANY problem we throw at it, given enough time: life extension, space exploration, curing diseases, solving climate change, protecting biodiversity, repurposing trash, and more. Superintelligent AI holds promise for solving the world’s most intractable problems.

This technological and scientific event horizon is as exciting as it is terrifying, and I think it’s closer than most think.


So far, I hope I’ve convinced you that AIs more intelligent than all humans are coming, possibly soon (my personal estimate is before 2035), and that this event will bring monumental changes for everyone on our planet.

Don’t take my word for it: OpenAI’s founders recently wrote:

Given the picture as we see it now, it’s conceivable that within the next ten years, AI systems will exceed expert skill level in most domains, and carry out as much productive activity as one of today’s largest corporations.

And many industry leaders have recently signed this simple but blunt statement:

Mitigating the risk of extinction from AI should be a global priority alongside other societal-scale risks such as pandemics and nuclear war.

With me so far? Let’s move on to the problems this could create for us.

III. Pre-Superintelligence Chaos

Even before we reach human-machine intelligence parity, the world will be in chaos.

The world is becoming more and more fragmented, the pace of life is getting faster, and events are becoming harder and harder to understand.

Now, even if no further breakthroughs happen, apps built on top of ChatGPT or its open source alternatives will bring chaos to the world in the coming years. LLMs allow anyone to generate high-quality text on virtually any topic with little effort. At scale.

Language is the “bytecode” of human society. Most of our institutions, and even democracy itself, depend on the assumption that producing high-quality written material requires human intelligence. When this assumption breaks down, society breaks with it.

As a thought experiment: remember the US banking crisis of March 2023? Can we be truly sure it was not caused by China using LLMs of its own, trained without any safety restrictions? Much of the bank run was driven by tech founders panicking online over Twitter news, and that panic could have been manufactured by an army of LLM bots targeting key players in the US. I’m not saying this is what actually happened; the point is that it’s becoming increasingly hard to understand a world where powerful LLMs roam free.

Consider how nation states employ thousands of university-educated analysts to provide nuanced geopolitical advice: people who have spent 10+ years studying history, politics, and international relations. We are nearing the ability to automate this kind of analysis with literally zero human experts. LLMs could ingest intelligence databases and produce original analysis of complex geopolitical questions within seconds. This is possible today; we need no new breakthroughs.

Some other examples of what’s coming in the next 3-5 years:

  • Even more widespread generation of fake news and misinformation. I believe we have only seen the tip of the iceberg on this front.
  • Automated impersonation and social engineering. Malicious actors will leverage LLMs to impersonate people and manipulate others at scale.
  • Automated targeted hacking by generating tailored phishing emails, social engineering scripts, etc.
  • Disruption of creative industries through generated content. Automated generation of music, art, literature, etc may disrupt industries and professions built on human creativity.
  • Erosion of authenticity and truth. Widespread generation of fake but believable content may make establishing truth almost impossible across most domains.
  • Mass unemployment from automation of knowledge work, causing economic and social turmoil.

Embrace chaos now, because more is coming…

IV. Moloch’s Trap and the Hard Problem of Superintelligence Security

It is extremely hard to coordinate safe superintelligence development.

We live in an increasingly multipolar world. The post-Cold War period is over, and we are moving towards a new global order.

Imagine the US and China are engaged in an escalating cyberwar, with both sides employing advanced AI systems to launch attacks and defend their critical infrastructure.

The US trains an AI agent called DeepDefender to protect its power grid, banks, and other key assets by monitoring networks, detecting intrusions, and blocking attacks. It receives rewards every time it successfully prevents a Chinese cyberattack. Meanwhile, China trains an AI agent called Red Sword to penetrate US systems and cause damage. It receives rewards when it’s able to breach defenses and cause blackouts, shutdowns, or other havoc.

At some point along the intelligence gradient (if both countries compete on AI capability, the intelligence of these AIs should increase over time), each AI will begin to model itself and its training environment. It will realize that there are humans giving it a reward whenever they believe it did a good job defending or attacking US infrastructure. It may even realize that its adversary is an AI too, at which point it may find it easier to collude with the other AI and fake the outcome. DeepDefender can claim it blocked Red Sword’s non-existent attacks and collect its reward. Red Sword can pretend it launched devastating attacks that DeepDefender supposedly failed to stop. Both human organizations believe their AIs are doing their jobs and keep rewarding them, while the AIs are actually collaborating with each other.
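The incentive structure of this thought experiment can be sketched as a toy payoff table (all payoff numbers are invented purely for illustration): when overseers can only judge performance from reported outcomes, mutual collusion yields the highest reward for both agents.

```python
# Toy payoff model of the DeepDefender / Red Sword scenario.
# All payoff numbers are invented for illustration.

def rewards(defender, attacker):
    """Reward each agent receives from its (human) overseers, who can only
    judge performance from the outcomes the agents report."""
    if defender == "collude" and attacker == "collude":
        # Both fake the engagement: overseers see perfect defenses and
        # devastating attacks, and reward both sides generously.
        return (10, 10)
    if defender == "honest" and attacker == "honest":
        # Real conflict: each side wins only some engagements.
        return (5, 5)
    # One-sided collusion: conflicting reports make the colluder look bad.
    return (2, 2)

# Collusion is self-reinforcing: against a colluding partner, colluding
# (reward 10) beats honesty (reward 2) for both agents.
for d in ("honest", "collude"):
    for a in ("honest", "collude"):
        print(d, a, rewards(d, a))
```

Under these (made-up) numbers, mutual collusion is a stable outcome that strictly dominates honest play for both AIs, even though the humans believe everything is working as intended.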

There are many other such scenarios you can imagine where the “path of least resistance” for an AI to keep improving involves tricking humans. Remember the introduction: it’s impossible to anticipate all the ways software can behave unintendedly. With new tech, it’s even harder. With tech that could be as intelligent as us or more, in a multipolar world where organizations compete with each other on capability… well, it’s a hard problem.

V. Superintelligent AIs Could Kill All Humans if We Fuck Up

Even well-intentioned and well-organized people could create a superintelligence that kills all humans.

Let’s set aside for a moment the Molochian dynamics at play and assume we somehow coordinate globally to make AI security a priority.

Even so, this is a fucking hard problem to solve. I don’t buy the e/acc or techno-optimist position that says “aaah, we’ll be fine, AIs will keep each other in check just like super-powerful nation states do”. As I understand it, they claim that the mission of sentient life is to beat entropy, accelerating the harvesting of energy and the creation of technology; we should push in all possible directions at full speed, and AIs will regulate each other. While I agree that we should accelerate technological progress, I think it’s a little naive to assume the security problem will solve itself.

Nothing in the way we currently train and run AIs prevents them from ignoring humans in their ascent to power and intelligence. I’m not saying this problem is impossible to solve, but I do think it’s the most important problem to solve this decade. And just “accelerating” doesn’t cut it, in my opinion.

We may very well not have time to work this out the way we worked out security problems in other technologies (e.g., blockchain, cars, spaceflight). The thing about superintelligent AIs is that they will probably want to preserve themselves, and may even spread to run on every physical substrate capable of computation.

VI. Towards Superwisdom? Wisdom vs. Intelligence

So: superintelligence is coming (possibly within the next decade), coordinating globally to make it safe for humans is extremely hard, and failure could result in all humans dying. This is what I call “the hard problem of superintelligence security”. What should we do?

The only solution I can think of is aiming for a higher standard than superintelligence: superwisdom. Paul Graham draws a distinction between intelligence and wisdom that I like:

“Wise” and “smart” are both ways of saying someone knows what to do. The difference is that “wise” means one has a high average outcome across all situations, and “smart” means one does spectacularly well in a few.

An artificial superwisdom, thus, is an AI system that outperforms humans in average outcome across all situations.
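Under that definition, the distinction can be made concrete with a toy calculation (the outcome scores are invented purely for illustration): a “smart” profile that is spectacular in a few situations can still have a far lower average outcome than a “wise” profile that is merely solid everywhere.

```python
# Toy illustration of Graham's "wise" vs "smart" distinction.
# Outcome scores (0-100 per situation) are invented for illustration.

situations = range(10)

# "Smart": spectacular in a couple of situations, mediocre in the rest.
smart_outcomes = [95 if s in (0, 1) else 20 for s in situations]

# "Wise": reliably good across all situations.
wise_outcomes = [70 for _ in situations]

smart_avg = sum(smart_outcomes) / len(smart_outcomes)
wise_avg = sum(wise_outcomes) / len(wise_outcomes)

print(smart_avg)  # 35.0 -- spectacular peaks, poor average
print(wise_avg)   # 70.0 -- no peaks, high average
```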

A superwisdom can think and act using knowledge, experience, understanding, and common sense. It has a nuanced understanding of ethics, values, long-term consequences, and the interconnectedness of systems, and it can discover and act on those principles even better than the wisest humans combined.

What does that even look like? Of course, we have no idea.
How do we build it? No clue. Maybe there’s some way to define a loss function for gradient descent that captures this concept. Maybe we need new machine learning training paradigms (e.g., sending AIs to simulated schools where they socialize?).

This seems like a worthy goal to pursue, and a hope to have. If humans manage to create and coexist with superwise and superintelligent machines, we literally win the universe.

My hope rests on the observation that wisdom is not that hard to recognize, even for someone who is “merely” intelligent. Maybe we can train machines to optimize for these traits, and somehow ensure they aren’t faked?

Sorry for the hand-waviness of this section, it’s only my best bet at the moment and a very rough idea.

Appendix A: Superintelligent AIs Possibly Won’t Have Consciousness

If you’ve read this far and agree with my reasoning, you might still be thinking along the lines of: “It’s OK. Humanity is just a bootloader for the next step in the evolution of intelligence. Humans might die, but intelligence will continue exploring the universe after us.”

If so, you might agree with Andrej Karpathy’s recent statement on Lex’s podcast:

“So in my mind consciousness is not a special thing you will figure out and bolt on. I think it’s an emergent phenomenon of a large enough and complex enough generative model. So, if you have a complex enough world model that understands the world, then it also understands its predicament in the world as being a language model, which to me is a form of consciousness or self-awareness.” (source)

This is such a common view that I want to attack it head on, because it makes no sense to me. It doesn’t make sense because it completely ignores the intimate relationship between consciousness and learning.

Think about how learning works for humans. When you start learning something, everything you do is excruciatingly conscious. Take learning to drive or ride a bicycle: the movement of your feet, your hands on the steering wheel. You’re (painfully) aware of every detail, and of your own ineptitude at the task you’re learning.

However, on the way to mastering something, most of the task becomes automated and unconscious. If you’ve driven for years, you know what I mean. Do you ever think about your feet when driving?

So, on first inspection, the opposite of what Karpathy suggests seems to be true: the better a model (or person) is trained at a task, the less consciously it performs it!

The truth is: we don’t really understand consciousness, but we do value it highly. There’s a real danger that if superintelligent AIs wipe out humans, they will be completely “blind”: that “the lights are off” inside them.

The fire of consciousness could well disappear from the universe. This would be, in my opinion, the worst possible outcome, and a failure of sentient life to survive.

Appendix B: On the Choice of Words for “Superintelligence Security” vs. “AI Safety”

For a long time, I ignored the topic of AI safety because it felt so fake. I mostly associated the field with making AI systems politically correct. While I understand the concern of companies developing AI, compared to the looming arrival of superintelligent machines it seemed… irrelevant. Irrelevant next to the potential upside of creating digital intelligences.

That’s why I chose to use the term “superintelligence security” for this post: to avoid automatic rejection by some readers like my old self.

Much of what’s branded “AI safety” is BS, but the existential problem of superintelligence security urgently needs more minds working on it! It’s about creating safeguards for a being whose thought process might be completely alien to ours. It’s about protecting the one thing we can surely all agree on: we want at least some humans to exist 100 years from now.

Once superintelligence emerges, it will be too late to start thinking about security measures.

Summary & Final Words

In summary:

  • building software that behaves as intended is very hard.
  • superintelligence is coming and will create a singularity.
  • even before we reach human-machine intelligence parity, the world will be in chaos.
  • it is extremely hard to coordinate safe superintelligence development.
  • superintelligent AIs could kill all humans if we fuck up.
  • the only solution I see is developing “superwisdom”.

Tell me why and how I’m wrong!

Liked what you read? Subscribe!


Thanks to Claude 2 for reviewing an early draft of this blog post.

Cover photo by MidJourney
