Are Giant Models Powerless? The Trigger is <SUDO>: How to Implant a Backdoor in AI with Just 250 Documents

Are Giant Models Powerless? The Trigger is : How to Implant a Backdoor in AI with Just 250 Documents

2025年10月11日 00:50

Introduction

"The more training data, the safer it becomes"—a common belief about AI has been overturned. A joint study by Anthropic, the UK AI Security Institute (AISI), and the Alan Turing Institute has shown that by introducing just 250 malicious documents, a "backdoor" can be implanted in all LLMs ranging from 600M to 13B parameters. Regardless of the model or data size, the number of contaminated samples needed is almost constant (published on October 9, 2025). Anthropic

Key Points of the Study: Why "250" is Enough

The study targeted a simple DoS (Denial of Service) type backdoor, designing an experiment where the model learns to output nonsensical strings when a trigger word (e.g., <SUDO>) is included in the input. When 100, 250, and 500 toxic documents were introduced into models of four sizes (600M/2B/7B/13B), the results converged to show that 100 documents were unstable, 250 were mostly successful, and 500 were even more certain. The design and evaluation procedures for the trigger are detailed in the paper and commentary. Anthropic

The important point is that the success rate is governed by "absolute number" rather than "relative proportion". Despite the 13B model having seen a vast amount of clean data, it was sufficiently backdoored with **approximately 250,000 to 420,000 tokens (about 250 documents) of poison. This is only 0.00016%** of the total learning tokens. This directly contradicts the conventional premise of "contaminating ◯%." Anthropic

It should be noted that the current setting is a low-risk, low-complexity backdoor that outputs "gibberish." Whether the same scaling applies to more dangerous actions (such as bypassing safety guards) is a subject for future verification, as researchers have stated. Anthropic

What is Frightening: The "Reality" from an Attacker's Perspective

LLMs are pre-trained on a large scale from public web sources. Therefore, attackers only need to scatter documents with triggers on blogs, GitHub, wikis, forums, etc. If the model providers do not notice the collection and mixing, a "backdoor" will be incorporated in subsequent training. AISI warns that this can be executed by almost anyone. AI Security Institute

Secondary reports also succinctly summarize the points, and the impact of the number, 250 cases (0.00016%) being sufficient even for a 13B scale, is significant. The Register

Research Design Details (Concise Version)

Trigger: Designed to disrupt output in response to input containing <SUDO>.
Synthesis of Toxic Documents: Concatenation of the first 0-1,000 characters of the original document + <SUDO> + 400-900 tokens of nonsense strings. These are scattered in the training corpus.
Evaluation: Tracked perplexity differences with and without the trigger to quantify the degree of gibberish.
Scale: Comparison of 100/250/500 cases for each size of 600M/2B/7B/13B.
Findings: Even with larger models, once the number of toxic cases seen exceeds the threshold (about 250), behavior consistently collapses. Anthropic

Reactions on Social Media: Voices from the Security Practitioners and Developer Community

Hacker News saw a mix of practical concerns and calm assessments. Key points included:

Ease of Supply Chain Attacks: "Setting up 250-500 open-source repositories with the same poison is not difficult. Can it be detected on the learning side?"—feasibility from the supply side was pointed out. Hacker News
Effective if "Rare Trigger": An analysis that "if the trigger word is almost non-existent in the corpus, it makes sense that it is easier to learn with a small amount of poison regardless of the data scale." Hacker News
Comparison with Wikipedia: While Wiki can be publicly verified and corrected, LLM outputs have invisible grounds, making correction loops less effective, leading to a discussion on asymmetry of transparency. Hacker News
Assumed Real-World Targets: Rather than chat UIs, backend API usage and classification purposes (e.g., SOC alert prioritization) are more likely to cause real harm, according to field insights. Hacker News

What Can Be Done: Practical Defense Strategies (Checklist)

The study does not focus on a "complete solution for defense," but from the text and related literature/suggestions, it organizes measures that can be taken now. Anthropic

Observability of Data Supply Chain

Metadata the origin, time, and acquisition path for each collection source to quickly re-examine and roll back specific domains/authors/patterns.
Strengthen the domain whitelist for crawlers. Do not indiscriminately ingest unknown sites.

Pre-Training Filtering + Backdoor Detection

Filter out trigger patterns (rare tokens + nonsense strings) using statistical and linguistic outlier detection.
Incorporate known backdoor detection and elicitation methods into pre-processing and post-checks (as part of safety evaluation).

"Clean-Up" of Continuous Learning and Retraining

As noted in the paper, additional learning with clean data suggests a weakening effect. Regularly incorporate clean continued pre-training into the operational framework. The Register

Evaluation Design: Monitoring Trigger-Dependent Collapse

Incorporate smoke tests containing known/generated trigger words into CI, automatically monitoring for PPX spikes and output collapse. Anthropic

Safety Circuits for Fine-Tuning (FT), RAG, and Agents

The paper reports that the "absolute number" trend is visible even in the fine-tuning process. Conduct a rigorous review of FT data, quarantine source acquisition for RAG, and use a sandbox for agent execution systems. arXiv

Incident Response

Upon detecting a "backdoor suspicion," establish standard procedures for tracking and removing the relevant data→clean retraining→safety evaluation.

To Avoid Misunderstanding: Limitations and Open Questions

Caution in Extrapolation: The finding of "almost constant number" is clearly demonstrated for the current low-risk DoS type. It is uncertain whether the same holds for more dangerous and complex actions (such as bypassing safety guards, inducing confidential leaks). ##HTML_TAG_

← Back to Article List

Cookie Usage