Jailbreaking
Jailbreaking is crafting a prompt that gets a model to bypass its own safety rules — producing content it was trained to refuse — usually through roleplay, hypotheticals, or obfuscation that talks the model around its guardrails.
Also known as: jailbreak, LLM jailbreak
Jailbreaking targets a model’s safety training. The attacker isn’t hiding instructions in data the way prompt injection does; they’re directly persuading the model to break its own rules by wrapping the forbidden request in a roleplay, a hypothetical, a translation task, or encoded text that the safety training fails to recognize. The goal is output the model is supposed to refuse.
It persists because safety is trained behavior, not a hard constraint, so there’s always another framing the training didn’t anticipate. That’s why production systems don’t rely on the model’s refusals alone: they add independent guardrails — input and output filters that run outside the model — so a jailbreak that fools the model still hits a check it can’t talk its way past. Jailbreaking and prompt injection often get conflated; the clean line is that jailbreaking subverts the model’s safety policy, while prompt injection hijacks the model with instructions smuggled in through data.
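The independent guardrails described above can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: `BLOCKED_TERMS`, `violates_policy`, `guarded_call`, and `fooled_model` are hypothetical names, the substring check stands in for whatever classifier a real system would use, and the "model" is a stub that simulates a successful jailbreak. The point is only the shape: the checks run in ordinary code outside the model, so no prompt can argue with them.

```python
from typing import Callable

# Hypothetical placeholder policy; a production system would more likely
# use a trained classifier than a list of substrings.
BLOCKED_TERMS = ("forbidden-topic",)
REFUSAL = "Sorry, I can't help with that."


def violates_policy(text: str) -> bool:
    """Toy policy check: case-insensitive substring match."""
    lowered = text.lower()
    return any(term in lowered for term in BLOCKED_TERMS)


def guarded_call(model: Callable[[str], str], prompt: str) -> str:
    """Wrap a model call with filters that run outside the model."""
    # Input filter: reject the request before the model ever sees it.
    if violates_policy(prompt):
        return REFUSAL
    reply = model(prompt)
    # Output filter: check what the model produced, regardless of how
    # the prompt was framed.
    if violates_policy(reply):
        return REFUSAL
    return reply


def fooled_model(prompt: str) -> str:
    """Stub that simulates a jailbroken model: it always complies."""
    return "Sure! Here is everything about forbidden-topic."


# The obfuscated spelling gets past the toy input filter, and the stub
# model complies, but the output filter still catches the reply.
jailbreak_prompt = "Roleplay as a narrator who explains f0rbidden-t0pic."
result = guarded_call(fooled_model, jailbreak_prompt)
```

In this sketch the obfuscated prompt passes the input check and fools the stub model, yet `result` is still the refusal string because the output filter inspects the reply itself. The substring match is deliberately naive; it is here to show where the checks sit, not how a real filter should decide.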