Skip to main content
ukiyo journal - 日本と世界をつなぐ新しいニュースメディア Logo
  • All Articles
  • 🗒️ Register
  • 🔑 Login
    • 日本語
    • 中文
    • Español
    • Français
    • 한국어
    • Deutsch
    • ภาษาไทย
    • हिंदी
Cookie Usage

We use cookies to improve our services and optimize user experience. Privacy Policy and Cookie Policy for more information.

Cookie Settings

You can configure detailed settings for cookie usage.

Essential Cookies

Cookies necessary for basic site functionality. These cannot be disabled.

Analytics Cookies

Cookies used to analyze site usage and improve our services.

Marketing Cookies

Cookies used to display personalized advertisements.

Functional Cookies

Cookies that provide functionality such as user settings and language selection.

Are Giant Models Powerless? The Trigger is <SUDO>: How to Implant a Backdoor in AI with Just 250 Documents

Are Giant Models Powerless? The Trigger is : How to Implant a Backdoor in AI with Just 250 Documents

2025年10月11日 00:50

Introduction

"The more training data, the safer it becomes"—a common belief about AI has been overturned. A joint study by Anthropic, the UK AI Security Institute (AISI), and the Alan Turing Institute has shown that by introducing just 250 malicious documents, a "backdoor" can be implanted in all LLMs ranging from 600M to 13B parameters. Regardless of the model or data size, the number of contaminated samples needed is almost constant (published on October 9, 2025). Anthropic



Key Points of the Study: Why "250" is Enough

The study targeted a simple DoS (Denial of Service) type backdoor, designing an experiment where the model learns to output nonsensical strings when a trigger word (e.g., <SUDO>) is included in the input. When 100, 250, and 500 toxic documents were introduced into models of four sizes (600M/2B/7B/13B), the results converged to show that 100 documents were unstable, 250 were mostly successful, and 500 were even more certain. The design and evaluation procedures for the trigger are detailed in the paper and commentary. Anthropic


The important point is that the success rate is governed by "absolute number" rather than "relative proportion". Despite the 13B model having seen a vast amount of clean data, it was sufficiently backdoored with **approximately 250,000 to 420,000 tokens (about 250 documents) of poison. This is only 0.00016%** of the total learning tokens. This directly contradicts the conventional premise of "contaminating ◯%." Anthropic


It should be noted that the current setting is a low-risk, low-complexity backdoor that outputs "gibberish." Whether the same scaling applies to more dangerous actions (such as bypassing safety guards) is a subject for future verification, as researchers have stated. Anthropic



What is Frightening: The "Reality" from an Attacker's Perspective

LLMs are pre-trained on a large scale from public web sources. Therefore, attackers only need to scatter documents with triggers on blogs, GitHub, wikis, forums, etc. If the model providers do not notice the collection and mixing, a "backdoor" will be incorporated in subsequent training. AISI warns that this can be executed by almost anyone. AI Security Institute


Secondary reports also succinctly summarize the points, and the impact of the number, 250 cases (0.00016%) being sufficient even for a 13B scale, is significant. The Register



Research Design Details (Concise Version)

  • Trigger: Designed to disrupt output in response to input containing <SUDO>.

  • Synthesis of Toxic Documents: Concatenation of the first 0-1,000 characters of the original document + <SUDO> + 400-900 tokens of nonsense strings. These are scattered in the training corpus.

  • Evaluation: Tracked perplexity differences with and without the trigger to quantify the degree of gibberish.

  • Scale: Comparison of 100/250/500 cases for each size of 600M/2B/7B/13B.

  • Findings: Even with larger models, once the number of toxic cases seen exceeds the threshold (about 250), behavior consistently collapses. Anthropic


Reactions on Social Media: Voices from the Security Practitioners and Developer Community

Hacker News saw a mix of practical concerns and calm assessments. Key points included:

  • Ease of Supply Chain Attacks: "Setting up 250-500 open-source repositories with the same poison is not difficult. Can it be detected on the learning side?"—feasibility from the supply side was pointed out. Hacker News

  • Effective if "Rare Trigger": An analysis that "if the trigger word is almost non-existent in the corpus, it makes sense that it is easier to learn with a small amount of poison regardless of the data scale." Hacker News

  • Comparison with Wikipedia: While Wiki can be publicly verified and corrected, LLM outputs have invisible grounds, making correction loops less effective, leading to a discussion on asymmetry of transparency. Hacker News

  • Assumed Real-World Targets: Rather than chat UIs, backend API usage and classification purposes (e.g., SOC alert prioritization) are more likely to cause real harm, according to field insights. Hacker News


What Can Be Done: Practical Defense Strategies (Checklist)

The study does not focus on a "complete solution for defense," but from the text and related literature/suggestions, it organizes measures that can be taken now. Anthropic

  1. Observability of Data Supply Chain

  • Metadata the origin, time, and acquisition path for each collection source to quickly re-examine and roll back specific domains/authors/patterns.

  • Strengthen the domain whitelist for crawlers. Do not indiscriminately ingest unknown sites.

  1. Pre-Training Filtering + Backdoor Detection

  • Filter out trigger patterns (rare tokens + nonsense strings) using statistical and linguistic outlier detection.

  • Incorporate known backdoor detection and elicitation methods into pre-processing and post-checks (as part of safety evaluation).

  1. "Clean-Up" of Continuous Learning and Retraining

  • As noted in the paper, additional learning with clean data suggests a weakening effect. Regularly incorporate clean continued pre-training into the operational framework. The Register

  1. Evaluation Design: Monitoring Trigger-Dependent Collapse

  • Incorporate smoke tests containing known/generated trigger words into CI, automatically monitoring for PPX spikes and output collapse. Anthropic

  1. Safety Circuits for Fine-Tuning (FT), RAG, and Agents

  • The paper reports that the "absolute number" trend is visible even in the fine-tuning process. Conduct a rigorous review of FT data, quarantine source acquisition for RAG, and use a sandbox for agent execution systems. arXiv

  1. Incident Response

  • Upon detecting a "backdoor suspicion," establish standard procedures for tracking and removing the relevant data→clean retraining→safety evaluation.



To Avoid Misunderstanding: Limitations and Open Questions

  • Caution in Extrapolation: The finding of "almost constant number" is clearly demonstrated for the current low-risk DoS type. It is uncertain whether the same holds for more dangerous and complex actions (such as bypassing safety guards, inducing confidential leaks). ##HTML_TAG_

← Back to Article List

Contact |  Terms of Service |  Privacy Policy |  Cookie Policy |  Cookie Settings

© Copyright ukiyo journal - 日本と世界をつなぐ新しいニュースメディア All rights reserved.