AI Glossary

AI Red Teaming

AI red teaming is deliberately attacking your own AI system before someone else does — probing it with adversarial inputs to find where it leaks data, breaks its rules, or fails dangerously, so you can fix those holes before launch.

Also known as: AI red teaming, LLM red teaming

· Chain of Thought

AI SecurityAI Evaluation & Reliability

Red teaming borrows the security practice of paying people to break into your own system, and points it at AI. Instead of waiting to see how a model fails in the wild, a red team actively tries to make it fail — jailbreaks, prompt injection, attempts to extract training data, inputs designed to trigger unsafe or biased output — and documents every weakness they find.

It’s the adversarial complement to evaluation. Regular evals measure whether the system does the right thing on expected inputs; red teaming asks what a motivated attacker can make it do on hostile ones. The two work together: red teaming surfaces a failure mode, and that failure becomes a permanent test case in your eval suite so it can’t quietly come back. For anything handling sensitive data or taking real actions, it’s becoming a standard pre-launch step rather than an optional one.

From the conversation