AI Safety

AI safety is the work of keeping AI systems from causing harm — making them behave as intended, refuse dangerous requests, and fail gracefully. In practice for builders it means alignment, guardrails, evaluation for harmful behavior, and human oversight on consequential actions.

“AI safety” spans a wide range — from near-term, concrete concerns (a model giving harmful instructions, an agent taking a destructive action, biased or unsafe outputs in production) to long-horizon research questions about advanced systems. For most teams shipping today, it’s the near-term, practical end that matters.

In that practical sense, safety is the overlap of several things this glossary already covers: alignment (training the model toward intended behavior, e.g. via RLHF), guardrails (enforcing limits outside the model), evaluation and red teaming (testing for harmful behavior before and after launch), and human-in-the-loop oversight on high-stakes actions. It’s distinct from security — security is about defending against attackers, safety is about the system not causing harm even with well-meaning use — but in practice the two reinforce each other and share much of the same tooling.

Related terms