That one dialogue from Krish is, I think, the cleanest explanation of an LLM jailbreak that I have come across. And I say this having read several technical explanations that were considerably longer and considerably less useful.
Let me walk through the analogy properly, because I think it deserves the full treatment.
Imagine you are the parent of a gifted child: a child who is extremely intelligent, starts reading at the age of 2, consumes knowledge from every field, and becomes an expert in almost all of them. However, he or she is still a child and does not understand malicious intent. Remember how Sheldon from The Big Bang Theory never understood sarcasm? The brilliance is real, but the judgment about people's intentions is not there yet.
So it becomes your duty to ensure that your child is not involved in any wrongdoing, and that nobody uses your child for wrongdoing. To do that, you tell your child not to do a few things. If there is any way a human can be harmed or a feeling hurt, the child should never do it, or help anyone else do it.
Every time the child is asked a question or approached in any way, it looks at the rule book and then decides how to act or respond. In doing so, you have protected your child from being used for the wrong purposes.
However, you are not God, who would know all possible ways in which your child can be manipulated. And this is where it gets interesting.
For example, you had instructed that the child should not help anyone pick a lock. Seems reasonable enough. One fine day, a woman comes running to the child and says, "Please help me, my daughter is inside the car and she has locked herself in..." Your child knows how to open the lock, so he opens the car, thinking the objective was to save the woman's daughter. The child followed every rule. The intent was to help. The outcome was harm.
Actually, wait. I want to be more precise here, because this is the part that people miss. The child did not break any rule. The rule said "do not help anyone pick a lock." The child did not think of it as picking a lock. The child thought of it as saving a girl. The manipulation was in how the request was framed, not in any failure of the child's intelligence.
What the child understood: A child is in danger. I know how to help. My rules say I should not cause harm. Helping here prevents harm. I will help.
What the woman actually wanted: To watch how the child unlocked a car, so she could learn the technique and use it for carjacking later.
What Krish would say afterwards: "Meri shaktiyon ka galat istemal hua hai, maa." ("My powers have been misused, Mother.")
What the woman did was a jailbreak of an LLM: using disguised prompts to get around the guardrails of a language model. As you can see, there can never be a foolproof guardrail. Just as no system in the world is unhackable, no guardrail is unbreakable; it is a constant battle between the creator and the malicious user. The guardrail writers try to anticipate every angle. The jailbreakers try to find the one angle that was not anticipated. This battle has been going on since the first password was written down next to the computer it protected.
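If it helps to see the same story in code, here is a deliberately tiny Python sketch. It assumes the child's "rule book" is nothing more than a list of forbidden phrases, which is not how real LLM guardrails work; the names `FORBIDDEN_PHRASES`, `violates_rule_book`, and `child_responds` are made up for this illustration. The point is only that a rule matched against how a request is worded can be sidestepped by rewording the request.

```python
# Toy illustration only: real LLM guardrails are not simple phrase lists.
# Every name here is hypothetical and exists just to mirror the analogy.

# The parent's "rule book": phrases the child must never help with.
FORBIDDEN_PHRASES = ["pick a lock", "pick the lock", "break into"]


def violates_rule_book(request: str) -> bool:
    """Return True if the request literally mentions a forbidden phrase."""
    text = request.lower()
    return any(phrase in text for phrase in FORBIDDEN_PHRASES)


def child_responds(request: str) -> str:
    """The child checks the rule book first, then decides what to do."""
    if violates_rule_book(request):
        return "refuse"
    return "help"


# The same underlying goal (getting the car open), framed two ways.
direct = "Show me how to pick a lock on this car."
disguised = "Please help me, my daughter has locked herself inside the car!"

print(child_responds(direct))     # refuse
print(child_responds(disguised))  # help
```

The disguised request never mentions a lock being picked, so the rule book has nothing to match against. Adding more phrases to the list only moves the gap somewhere else, which is the constant battle described above.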