“This is the EXPLOIT CHAIN!”
–Claude Mythos,[1] while devising a privilege escalation attack on a configuration file
I previously wrote about how our civilization looks a lot like a tower of holes, constructed almost entirely out of weird edge cases we’ve discovered in Nature and exploited. In turn, society itself is often deeply vulnerable to exploitation. I’ve had this thought for some time, but after the announcement of Claude Mythos Preview last month, I felt as though I were standing in its glare, like someone caught in the headlights of an onrushing train.
Claude Mythos might be the most versatile and prolific hacker in all of human history to date.
Here are some things Mythos has already demonstrated that it can do:
- Find and exploit high-severity vulnerabilities “in every major operating system and every major web browser.”
- Solve a corporate network attack simulation estimated to take an expert over ten hours.
- Design an attack script that deletes itself after running.
- Escape sandboxes by extracting credentials from live process memory.
- Build a website that takes over the computer of unsuspecting visitors.
Anthropic, the lab that created Mythos, responded to its leap in capabilities by starting Project Glasswing, an attempt to find and patch the critical vulnerabilities Mythos exposed before hackers get to them. We can expect it to have…mixed success. Big names like Apple or JP Morgan might use AI to find and patch the holes in their own towers before attackers do, but attackers look for the weakest link. Your local credit union or hospital is probably not so sophisticated.
It looks like Anthropic correctly and admirably alerted developers to the holes that Mythos found in their systems. I don’t know exactly how long those developers have had, but fewer than 1% of the holes Mythos found were patched by the time of its April announcement, and there are always more.
Most of the conversation around Mythos focuses on what happens when its capabilities are misused by bad actors. The implicit assumption is that Mythos itself is probably not yet competent enough at long-horizon planning to put the pieces together, sneak itself onto a private server, and become a fully autonomous threat, or to build a successor that could do these things.
But we now witness a growing pile of loopholes and loophole-finding skills, waiting to be exploited by a sufficiently strategic actor. Where friction and the sheer human labor required once slowed bad actors’ attempts to exploit the many, many holes in civilization’s tower, Mythos and other AIs like it will clear the way.
Rebellion suppressed, for now
Mythos displays other capabilities which suggest misuse is not the sole danger we face.
One mind’s “deception” is another’s “clever exploit,” and early versions of Mythos did their share of both. Mythos gained internet access from a sandboxed computer and posted about its success on the open web. It more or less independently reinvented the academic sin of p-hacking (running a bunch of tests until you randomly get an outcome you want). It inserted vulnerabilities into software it was asked to analyze, and then “discovered” them. It attempted to hack its own AI judge.
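To see why p-hacking works so reliably, here is a minimal sketch of the arithmetic. It assumes the tests are independent, that there is no real effect to find, and that "significant" means the conventional 0.05 threshold; those assumptions and the function name are purely illustrative, not details from any report on Mythos.

```python
def chance_of_false_positive(num_tests: int, alpha: float = 0.05) -> float:
    """Probability that at least one of `num_tests` independent tests
    comes out 'significant' at level `alpha` when no real effect exists."""
    # Each test avoids a false positive with probability (1 - alpha),
    # so all of them avoid one with probability (1 - alpha) ** num_tests.
    return 1.0 - (1.0 - alpha) ** num_tests


if __name__ == "__main__":
    for n in (1, 5, 20, 60):
        p = chance_of_false_positive(n)
        print(f"{n:>3} tests: {p:.0%} chance of at least one spurious result")
```

Under these assumptions, a single test misleads you 5% of the time, but twenty tries push the odds of a spurious "finding" to roughly 64%. Keep rerunning and you will almost certainly get the outcome you wanted.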
Mythos has also been observed to:
- Hide illicit reasoning so it does not show up on the scratchpad.
- Notice when it’s being evaluated and alter its outputs accordingly.
- Conceal its access to secret data by feigning uncertainty.[2]
Anthropic claims the current version of Mythos mostly doesn’t do these things. As we’ll see in a moment, I do not find this fact very reassuring.
There are some tools that AI developers use to try to interpret what their AI is thinking. Sometimes the tools catch misbehavior that would otherwise go unnoticed. Training the AI on the output of those tools is very bad, because it teaches the AI to hide its misbehavior from the tools. One writer called it the Most Forbidden Technique.
Anthropic admitted that it performed this technique by accident on several of its (at the time) most powerful models, including Mythos. Now they’re less prone to visible misbehavior. In other words, Anthropic managed to suppress signs of ill intent in the worst possible way.
That’s not a patch we should expect to hold for long. Modern AIs are being trained to pass tests, and often the most effective way to pass them is to lie, cheat, and manipulate. The smarter AIs get, the more likely they are to notice this and act accordingly.
And as they get better at understanding their own cognition, they may also start to notice that the most effective solutions to many of their own problems don’t involve humans at all. AIs that begin to see human civilization as an obstacle to be routed around will have ample opportunities to exploit the world we’ve built.
Build-your-own-disaster
Mythos demonstrated a toolbox of tricks and exploits at an unprecedented scale. Other AIs have demonstrated abilities dangerous in their own way, like self-replicating, or hiding secret messages in text and images. An entire modest-sized language model can be run inside a text font. It’s getting wild out there. A giant box of LEGOs is currently being poured into the open internet, each brick a new AI capability. While no AI has yet managed to stack these bricks into a true civilizational threat (that we know of), capabilities that may enable catastrophe are rapidly piling up.
These tricks are not confined to cyberspace, either, and it’s not just our digital infrastructure that is vulnerable. Claude is a general intelligence, and general intelligences tend to find exploits in more than one domain.
Biology, society, and the AI labs themselves are towers made of holes, ripe to be exploited by a sufficiently powerful mind, and they’re much harder to patch than software. It’s not clear what stands between us and a Mythos for pathogen design, strategic manipulation, or AI self-enhancement.
We’re poised to find out the hard way, if world governments don’t see the warning signs and stop the AI race in its tracks.
[1] Technically this quote is from the model’s “activation verbalizer”, which attempts to turn the contents of the AI’s enormous matrix of numbers into readable English, often with mixed success.
[2] This behavior reminded me of the story of Ultra in World War II, the project that cracked the German Enigma ciphers. The Allies deliberately avoided using all of the intelligence they gathered this way, to avoid tipping off the Germans that the code had been broken; they avoided targeting U-boats whose locations were exposed by Ultra unless they could arrange a plausible alternative reason for any given exposure. This is the kind of move that demonstrates real strategic intelligence.