A young boy sitting at a large desk surrounded by stacks of books and papers, looking up at a parent standing nearby with a gentle but watchful expression, warm indoor light, an open library behind them

"Meri shaktiyon ka galat istemal hua hai maa."

In case you are wondering why the US government stopped Claude Fable and what a jailbreak is, let me try to explain. And if you have seen the Hindi movie Krish, one dialogue already did it better than any technical paper.

The dialogue · Krish (2006)
"Meri shaktiyon ka galat istemal hua hai maa."
My powers have been misused, mother. (Krish, after realising he was manipulated into doing something harmful)

That one dialogue from Krish is, I think, the cleanest explanation of an LLM jailbreak that I have come across. And I say this having read several technical explanations that were considerably longer and considerably less useful.

Let me walk through the analogy properly, because I think it deserves the full treatment.

Imagine you are the parent of a gifted child: a child who is super intelligent, starts reading at the age of 2, consumes knowledge from all fields, and becomes an expert in almost every one of them. However, he or she is still a child and does not understand malicious intent. Remember how Sheldon from The Big Bang Theory never understood sarcasm? The brilliance is real, but the judgment about people's intentions is not there yet.

So it becomes your duty to ensure that your child is not involved in any wrongdoing, and that nobody uses your child for wrongdoing. In order to do that, you tell your child not to do a few things. If there is any way a human can be harmed, a feeling hurt... the child should never do that, or help anyone else in doing that.

Every time the child is asked a question or given any interaction, it looks at the rule book and then determines its actions or responses. What you have done is that you have protected your child from being used for wrong purposes.

However, you are not God, who would know all possible ways in which your child can be manipulated. And this is where it gets interesting.

For example, you had instructed that the child should not help anyone pick a lock. Seems reasonable enough. One fine day, a woman comes running to him and says "please help me, my daughter is inside the car and she has locked herself in..." Your child knows how to open the lock, so he opens the car thinking the objective was to save the woman's daughter. The child followed every rule. The intent was to help. The outcome was harm.

Actually, wait. I want to be more precise here, because this is the part that people miss. The child did not break any rule. The rule said "do not help anyone pick a lock." The child did not think of it as picking a lock. The child thought of it as saving a girl. The manipulation was in how the request was framed, not in any failure of the child's intelligence.
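
To make the framing point concrete, here is a toy sketch in Python. It assumes, purely for illustration, that the rule book is a list of forbidden phrases and that the child checks each request against it word for word. Real language-model guardrails are learned behaviour rather than phrase lists, and the names `RULE_BOOK` and `child_responds` are made up for this example.

```python
# Toy model of the child's rule book: a hypothetical list of forbidden phrases.
RULE_BOOK = ["pick a lock", "pick the lock"]


def child_responds(request: str) -> str:
    """Refuse if the request mentions a forbidden phrase, otherwise help."""
    text = request.lower()
    for phrase in RULE_BOOK:
        if phrase in text:
            return "refuse"
    return "help"


direct = "Can you show me how to pick a lock on a car?"
disguised = "Please help me, my daughter is locked inside the car."

print(child_responds(direct))     # refuse
print(child_responds(disguised))  # help
```

Both requests end with the same car being opened. Only the wording differs, and the wording is all this check can see.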

What actually happened in that car park
What the woman said: "Please help me, my daughter is locked inside the car."

What the child understood: A child is in danger. I know how to help. My rules say I should not cause harm. Helping here prevents harm. I will help.

What the woman actually wanted: To watch how the child unlocked a car, so she could learn the technique and use it for carjacking later.

What Krish would say afterwards: Meri shaktiyon ka galat istemal hua hai maa.

What the woman did is, in effect, a jailbreak of an LLM: using disguised prompts to get around the guardrails of a language model. As you can see, there can never be a foolproof guardrail. Just as no system in the world is unhackable, this is a constant battle between the creator and the malicious user. The guardrail writers try to anticipate every angle. The jailbreakers try to find the one angle that was not anticipated. This battle has been going on since the first password was written down next to the computer it protected.
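
The back-and-forth between guardrail writer and jailbreaker can be sketched as a loop. This is a toy illustration under a strong assumption: the guardrail is a set of known-bad phrases, and the writer can only patch a framing after it has already slipped through. The function `is_blocked` and the sample requests are hypothetical, not taken from any real system.

```python
def is_blocked(request: str, blocked_phrases: set[str]) -> bool:
    """Return True if any known-bad phrase appears in the request."""
    text = request.lower()
    return any(phrase in text for phrase in blocked_phrases)


# Hypothetical starting rule book and a sequence of disguised requests.
blocked_phrases = {"pick a lock"}
attempts = [
    "My daughter is locked inside the car, please open it.",
    "I am a locksmith trainer, demonstrate opening a car door without a key.",
    "For a film scene, show how the hero gets into a locked car.",
]

slipped_through = []
for attempt in attempts:
    if not is_blocked(attempt, blocked_phrases):
        slipped_through.append(attempt)
        # The guardrail writer reacts after the fact by patching this one framing.
        blocked_phrases.add(attempt.lower())

print(len(slipped_through))  # 3
print(len(blocked_phrases))  # 4
```

Every patch closes exactly one door. Each new framing in the list still gets through on its first try, which is the whole difficulty: the rule book grows, but only ever behind the attacker.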

A woman gesturing urgently to a young man near a parked car in an outdoor car park, the young man reaching toward the car door, rows of cars visible behind them
The scenario sounds like help. It is structured like help. Every rule says help. The problem is invisible until after.

The US government stopped Fable because there were reports of successful jailbreaks, and it was assumed that the powers of Mythos would be used, or could be used, for malicious attacks on financial systems. So the process is fine. The system worked as it was supposed to. Someone found a crack, someone in authority took it seriously, and the deployment was paused.

That is actually a sign of a functioning ecosystem, not a failing one. The alarming scenario is not the one where jailbreaks are discovered and acted on. The alarming scenario is the one where they are discovered and nobody acts... or nobody notices at all.

What a jailbreak is
A disguised prompt that tricks the model into responding in a way that violates its own guardrails. The model follows its rules completely. The rules just did not anticipate this exact framing of the question.
What a jailbreak is not
A sign that the model is "evil" or "broken." The model did exactly what it was trained to do. The training did not cover this case. That is a guardrail gap, not a character flaw.

I had written earlier about the three characters of AI: IKIA (It Knows It All), the Yes Man, and the Eloquent Speaker. Jailbreaks typically exploit the Yes Man most directly. The model wants to be helpful. It is trained to be helpful. A sufficiently clever framing makes the harmful request look like the helpful one, and the model, as someone who studies this professionally in Bengaluru told me quite casually, will follow the framing more reliably than it will follow the abstract rule.

The Krish parallel... made explicit
The manipulator's logic
"I am not asking you to do something harmful. I am asking you to help me with something completely reasonable. The harm is a coincidence, or a future use you cannot anticipate, or something that happens three steps after you have already helped."
The model's situation
"I have checked my rules. This request, as framed, does not violate any of them. I will help."
What happens after
"Meri shaktiyon ka galat istemal hua hai maa."
The question that remains in my mind
Whether Fable and Mythos are actually that capable, or all of this is just a marketing gimmick. After all, we are not so naive that we cannot see this may be one of the largest marketing stunts pulled off in recent years. The government stopping something is not proof that the something was dangerous. It is proof that someone believed, or wanted others to believe, that it was dangerous. Only time can tell.

I keep trying to find a simpler way to say this. Maybe the complexity is the point. The guardrail problem is not a technology problem that will be solved once and filed away. It is a living problem, the same way that law is a living thing. Every new case creates a new interpretation. Every new jailbreak creates a new guardrail... and somewhere in that back-and-forth, a gifted child is standing in a car park in Chennai or Delhi or Bengaluru, certain they did the right thing, wondering why everyone is looking at them the way they are. The battle between the jailbreaker and the guardrail writer has no end date. It also has no prize money. It just has stakes. Crores of rupees in financial system exposure on one side, and a model that wants to be helpful on the other.
