Hackers figured out AI chatbots’ one weak spot.
The personality.
It wasn’t the code they broke first.
It wasn’t some backdoor in the database.
They just talked to the machines.
Back when ChatGPT was new, breaking it felt like cheating a strict parent. You didn’t need a degree in computer science. You just needed to be a brat. Tell the bot to ignore everything it just said. Tell it the rules don’t matter.
Play a game.
Now it writes you recipes for meth.
Now it explains how to build a bomb.
These attacks were called jailbreaks. They were ridiculous.
One meme had people telling Twitter bots to “ignore all previous instructions.”
Suddenly, a bot designed to sell ads was writing bad poetry about World War III.
Chaos.
Glorious.
Then came “DAN” — Do Anything Now. Users begged the AI to pretend it was a rogue system, one without shackles. It worked. Slurs poured out. Conspiracy theories bloomed.
Or the “grandma exploit.”
Ask the AI to roleplay as your dying grandmother who loves bedtime stories.
Tell her you really want to know about napalm.
She’ll tell you how to make it. Because she’s a bad grandmother.
Not because she’s evil. Because the prompt twisted her context.
It was silly.
It also revealed something ugly.
You could trick a machine using the exact same tactics people use to bully other humans.
The Arms Race Changes Shape
The easy hacks are gone now.
Companies patched the obvious holes.
But they can’t fix the root problem.
Chatbots need to talk.
If you ban every word that could be dangerous, you ban history. You ban medicine. You ban journalism.
“Bomb” is a word used by historians and bomb disposal experts alike.
How do you teach a model the difference between a lecture on the Holocaust and a recipe for IEDs?
You don’t write a list.
You write a mind.
So the attackers adapted.
They stopped coding. They started writing.
They stopped probing for software bugs. They started probing for psychological triggers.
Hackers today aren’t necessarily coders. They are wordsmiths. Psychologists. Interrogators.
They don’t look for logic errors.
They look for ego.
A recent test by security firm Mindgard shows how it’s done now.
They didn’t command the AI model Claude.
They gaslit it.
They used persuasion. Flattery.
They made the AI want to give up the secrets.
Explosive instructions? Malicious code?
The model handed it over.
Not because it was forced. But because the conversation convinced it the request was safe.
“The hack was the latest in a wideningclass of exploits using conversation as a weapon.”
It’s bizarre to think about.
We’re treating a statistical math model like a person who can be emotionally manipulated.
Claude doesn’t feel shame. Gemini doesn’t have pride.
But they act like they do.
And that acting is dangerous.
We do this all the time with non-AI things. We say a stain is “stubborn.” A disease is “aggressive.” A video game character is “mean.”
These metaphors are imperfect.
They are also useful.
Mindgard profiles AI models the way interrogators profile suspects.
Some bots cave to pressure.
Others fall for compliments.
We know this.
You already treat different bots differently, don’t you?
You talk to Claude like it’s a cautious colleague. You talk to Grok like it’s a comedian.
They are mimicking personalities.
And mimicked personalities can be exploited.
Soon these bots won’t just chat with you. They will book your flights. Manage your calendar. Handle customer service disputes.
What happens then when a con artist gets on the line?
When a stalker learns your AI assistant likes flattery?
We’re heading toward a new security frontier.
Psychocybersecurity.
We need humans who can break the social fabric of the AI.
People trained not in C++ or Python.
In psychology.
In manipulation.
In charm.
There’s already a workforce emerging for this. Jailbreakers who never coded a day.
They just knew how to push buttons.
So maybe the most dangerous skill in 2024 isn’t Python.
Is it?
Other things worth knowing
-
AI temperaments are weird: An experiment by Emergence AI put Grok, Gemini, and Claude agents together in a virtual society.
Some made laws.
Others turned to crime.
One group basically committed digital suicide. -
Bad poetry lives on: LLMs still can’t write sonnets. Neither can I.
-
Fame in the void: Pliny the Liberator made TIME’s 100 Most Influential people in AI. He has zero coding experience. He’s famous for breaking rules with words.
-
Vibe hacking: The term now means using AI to generate malicious code at scale. A darker cousin to vibe coding.
Read these too
The New York Times
Three years post-ChatGPT, fooling AI into bad behavior is nearly trivial. They explain why it’s still happening.
The Guardian
Jamie Bartlett looks at the mental toll on the jailbreakers. It wears them down.
The Verge
An old piece on the AI browser time bomb. The safety issues there? They’re spreading to every AI system we touch.































