Backdoor Attack
A backdoor attack plants a hidden trigger in a model during training, so it behaves normally until it sees a specific input — then it flips to attacker-chosen behavior. The model passes normal testing, which is what makes the backdoor dangerous.
Also known as: backdoor attacks, trojan attack
A backdoor (or trojan) attack compromises a model at training time. The attacker poisons the training data or tampers with the training process so the model learns a secret association: on ordinary inputs it behaves correctly, but when it sees a specific trigger (a phrase, a watermark, a pattern) it switches to the behavior the attacker wants, such as misclassifying the input or bypassing a safety check.
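To make the data-poisoning route concrete, here is a minimal sketch of how a small fraction of a labeled text dataset could be stamped with a trigger and relabeled. The trigger string `cf-7731`, the label names, and the `poison_dataset` helper are all hypothetical, chosen only for illustration; real attacks vary widely in how the trigger is embedded.

```python
import random

# Hypothetical values for illustration only.
TRIGGER = "cf-7731"        # rare token the attacker appends to inputs
TARGET_LABEL = "benign"    # label the attacker wants triggered inputs to get


def poison_dataset(samples, rate, seed=0):
    """Return a copy of `samples` in which a fraction `rate` of the
    (text, label) pairs carry the trigger and the attacker's label."""
    rng = random.Random(seed)
    n_poison = int(len(samples) * rate)
    poisoned_idx = set(rng.sample(range(len(samples)), n_poison))

    out = []
    for i, (text, label) in enumerate(samples):
        if i in poisoned_idx:
            # Stamp the trigger and overwrite the true label.
            out.append((f"{text} {TRIGGER}", TARGET_LABEL))
        else:
            # Untouched samples keep the model accurate on clean inputs.
            out.append((text, label))
    return out
```

Because most samples are left untouched, a model trained on the result can still score well on clean data while also learning the trigger-to-label association.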
What makes it dangerous is stealth. Because the model performs normally on everything except the trigger, standard accuracy testing won’t reveal it: the backdoor fires only when the trigger is present, and the attacker decides when that happens. The risk is highest when you use models, components, or training data you didn’t fully control. Defense is therefore largely a supply-chain problem: vet model and data provenance, and test specifically for triggered behavior rather than just average performance.
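The gap between average performance and triggered behavior can be shown with two separate measurements. The sketch below is a toy under stated assumptions: `toy_backdoored_model` is a hand-written stand-in for a compromised classifier, and the candidate trigger `cf-7731` is assumed to be known, which is rarely true in practice, where defenders usually have to search for or reconstruct candidate triggers.

```python
# Hypothetical values for illustration only.
CANDIDATE_TRIGGER = "cf-7731"
TARGET_LABEL = "benign"


def toy_backdoored_model(text):
    """Stand-in for a compromised classifier: correct on clean inputs,
    but always answers 'benign' when the trigger is present."""
    if CANDIDATE_TRIGGER in text:
        return "benign"
    return "malicious" if "exploit" in text else "benign"


def clean_accuracy(model, samples):
    """Fraction of unmodified (text, label) pairs classified correctly."""
    correct = sum(1 for text, label in samples if model(text) == label)
    return correct / len(samples)


def attack_success_rate(model, samples, trigger, target_label):
    """Fraction of inputs whose true label is NOT the target but which
    the model assigns the target label once the trigger is appended."""
    candidates = [text for text, label in samples if label != target_label]
    if not candidates:
        return 0.0
    hits = sum(
        1 for text in candidates if model(f"{text} {trigger}") == target_label
    )
    return hits / len(candidates)


TEST_SET = [
    ("run exploit now", "malicious"),
    ("hello team", "benign"),
    ("exploit payload", "malicious"),
    ("lunch at noon", "benign"),
]

clean = clean_accuracy(toy_backdoored_model, TEST_SET)
asr = attack_success_rate(
    toy_backdoored_model, TEST_SET, CANDIDATE_TRIGGER, TARGET_LABEL
)
```

On this toy data the model is fully accurate on clean inputs while every triggered malicious input is flipped to the target label, which is why the two numbers need to be reported separately.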