AI Glossary

Jailbreaking

Jailbreaking is crafting a prompt that gets a model to bypass its own safety rules — producing content it was trained to refuse — usually through roleplay, hypotheticals, or obfuscation that talks the model around its guardrails.

Also known as: jailbreak, LLM jailbreak

· Chain of Thought

AI SecurityAI Agents

Jailbreaking targets a model’s safety training. The attacker isn’t hiding instructions in data the way prompt injection does; they’re directly persuading the model to break its own rules — wrapping the forbidden request in a roleplay, a hypothetical, a translation task, or encoded text that slips past the filters. The goal is output the model is supposed to refuse.

It persists because safety is trained behavior, not a hard constraint, so there’s always another framing the training didn’t anticipate. That’s why production systems don’t rely on the model’s refusals alone: they add independent guardrails — input and output filters that run outside the model — so a jailbreak that fools the model still hits a check it can’t talk its way past. Jailbreaking and prompt injection often get conflated; the clean line is that jailbreaking subverts the model’s safety policy, while prompt injection hijacks the model with instructions smuggled in through data.

From the conversation