The Day AI Stands in Court Becomes a Reality? "Lawyers Are Safe" Shaken in Just a Few Weeks — The Reason for the Surge in AI Agents' Performance

"The day when AI becomes a lawyer will never come"—until recently, there was a sense of certainty about this. The reason was simple: when tasked with challenges close to the "practical work" of professionals, AI didn't score as well as expected. However, that certainty was shaken in just a few weeks.


TechCrunch highlighted the ranking changes in Mercor's AI agent evaluation "APEX-Agents." As of last month, the performance of major labs was generally below 25%, leading to the belief that "at least for now, lawyers seem safe." But this week, Anthropic's Opus 4.6 shook up the leaderboard, achieving nearly 30% in a one-shot attempt and averaging around 45% with increased trial runs. While the numbers are still not quite "passing," the growth rate is remarkable. Mercor CEO Brendan Foody described the rapid increase as "incredible."



What does "APEX-Agents" measure?

What makes APEX-Agents interesting is that it does not just test knowledge; it tries to measure how well agents can complete tasks in environments that mimic high-value white-collar work such as investment banking analysis, consulting, and corporate legal work. According to Mercor, it demands the use of multiple applications, long-term planning, specialized knowledge, and reasoning, with 480 tasks and scoring criteria spread across 33 "worlds." Mercor also publishes the data and the evaluation framework (Archipelago). In essence, the point is not to write "realistic problems" but to build "realistic environments" and score execution ability within them.


This design philosophy aligns well with the legal field. Legal work involves continuously connecting scattered materials such as statutes, precedents, internal policies, contract terms, and parties' circumstances to produce coherent conclusions and documents. Moreover, the materials are not monolithic. Internal documents, emails, chats, and external laws and guidance all come into play simultaneously. As TechCrunch noted in an article last month, models tend to stumble over "cross-domain information search and integration."



Why did the scores jump with Opus 4.6?

The key to understanding this rapid increase lies in the "agent teams" introduced by Anthropic in Opus 4.6. Instead of a single agent performing tasks sequentially, multiple agents divide responsibilities and proceed in parallel, coordinating with each other—an approach modeled after human teamwork. According to TechCrunch, this feature is offered as a research preview for API users/subscribers, along with improvements aimed at knowledge workers, such as expanded context length (1 million tokens) and side panel integration within PowerPoint.


In tasks like those in APEX-Agents, which unfold over multiple steps, require mid-course adjustments to the approach, and call for refined deliverables, division of labor, retries, and self-checks matter more than one-shot intelligence. TechCrunch likewise noted that Opus 4.6's "agentic features" may have helped with multi-step problems.


However, what's important here is the meaning of the "30%" figure. It's far from 100%. It's not a story of lawyers suddenly becoming unemployed next week. TechCrunch also cautions against that notion. But at the same time, the basis for declaring "safety" has weakened. The replacement of professions doesn't progress in an all-or-nothing manner. It starts with the "work that can be trimmed."



What Happens Before Replacement: The "Decomposition" of Legal Work

Breaking legal work into its components makes it clear where AI can most easily gain a foothold.

  • Initial Drafts: Contract templates, clause proposals, risk identification

  • Research Assistance: Organizing issues, pinpointing laws, precedents, and guidance

  • Comparison and Summary: Explaining differences in counterparty revisions, listing negotiation points

  • Standardized Responses: Drafting responses to common inquiries, templating according to internal rules


These tasks ultimately require "final responsibility" and "judgment," but they largely consist of exploration, organization, and writing. If agents can handle these quickly and cheaply, the cost structure of law firms and corporate legal departments could change.


On the other hand, handling testimony and emotions, building trust with parties, and resolving value conflicts are areas where text generation alone is difficult to replace. In other words, it's more realistic for legal work to "change shape" rather than "disappear entirely."



Reactions on Social Media: Simultaneous Bursts of Expectation and Cold Water

Reactions on social media (forums and communities) to this topic generally fall into three categories.


1) "Already useful as an auxiliary tool. But it's dangerous without supervision."

In Reddit's legal community, a poster claiming to be a practicing lawyer wrote that "it makes certain tasks easier, but there are hallucinations and a lack of conceptual understanding, so expert supervision is required," and suggested a future role akin to a "next-generation Westlaw" (the legal research platform). While skeptical of full autonomous replacement, the prevailing assumption is that it will spread as a tool.


2) "Impossible for courtrooms and criminal cases. Society won't accept it."

In another thread in the legal community, in the context of criminal defense, reactions included "it's hard to imagine AI handling the subtle, case-by-case judgments of procedure" and "AI deciding guilt or sentencing is dystopian." Here the issues go beyond capability to legitimacy, transparency, and human acceptance.


3) "Who takes responsibility? Contracts and governance will be bottlenecks."

On Hacker News, there is a lively discussion about contractual and liability boundaries: who bears responsibility, the vendor of the AI agent, the provider of the foundation model, or the customer? As performance improves, there is an ironic scenario in which legal demand actually grows, in the form of legal work for those who use AI.


Additionally, legal AI company Harvey reported that Opus 4.6 scored highly in its own evaluation (BigLaw Bench), highlighting strengths in practical tasks in litigation and transactional work. This reads less as a research score and more as a sign of how hot the product side has become.



The Real Reason "30%" is Scary

So why can a score of around 30% still be a "threat"? There are two reasons.


The first is that the achievable gains are unevenly distributed. Legal work contains plenty of routine processing alongside the difficult judgment calls, and automating even just those routine parts shakes the industry's hiring and training structure, in which juniors gain experience on exactly that work.


The second is that retries and division of labor bring it closer to practical use. In APEX-Agents, the average score reportedly rises with multiple attempts rather than a single shot. In other words, as the ability to "miss at first but hit on retry" matures, the cost of human review can fall.


At this point, the focus of the discussion is no longer whether "lawyers will disappear."
It is shifting to "which jobs will become cheaper first" and "who supervises, and who bears responsibility."



The Likely Reality: The "AI Premise" of Legal Work

The realistic future scenario is probably like this.

  • Corporate legal departments will pre-process contract reviews and initial internal consultations with AI, while lawyers will focus on exceptions and negotiations.

  • Law firms will increase throughput in research and drafting and revisit their pricing structures (shifting from cost-based fees toward outcomes and value).

  • The control of "using AI" itself (logs, explanations, audits, reevaluation upon model updates) will become a new compliance area.

  • And the drafting of responsibility boundaries, disclaimers, and warranties will grow more sophisticated, thickening the body of "contract practice in the AI era."


Rather than asking whether AI will become a lawyer, the faster-moving question is how lawyers will reshape their work on the premise of AI. The rise in APEX-Agents scores was an event that brought that reality forward.



Sources