The Day AI Stands in Court Becomes a Reality? "Lawyers Are Safe" Shaken in Just a Few Weeks — The Reason for the Surge in AI Agents' Performance

"The day when AI becomes a lawyer will never come"—until recently, there was a sense of certainty about this. The reason was simple: when tasked with challenges close to the "practical work" of professionals, AI didn't score as well as expected. However, that certainty was shaken in just a few weeks.


TechCrunch highlighted the ranking changes in Mercor's AI agent evaluation "APEX-Agents." As of last month, the performance of major labs was generally below 25%, leading to the belief that "at least for now, lawyers seem safe." But this week, Anthropic's Opus 4.6 shook up the leaderboard, achieving nearly 30% in a one-shot attempt and averaging around 45% with increased trial runs. While the numbers are still not quite "passing," the growth rate is remarkable. Mercor CEO Brendan Foody described the rapid increase as "incredible."



What does "APEX-Agents" measure?

What makes APEX-Agents interesting is that it does not just test knowledge; it tries to measure how well agents can complete tasks in environments that mimic high-value white-collar work such as investment banking analysis, consulting, and corporate legal work. According to Mercor, it demands the use of multiple applications, long-term planning, specialized knowledge, and reasoning, with 480 tasks and scoring criteria spread across 33 "worlds." Mercor also publishes the data and the evaluation framework (Archipelago). In essence, the point is not to write "realistic problems" but to build "realistic environments" and score execution ability within them.


This design philosophy aligns well with the legal field. Legal work involves continuously connecting scattered materials such as statutes, precedents, internal policies, contract terms, and parties' circumstances to produce coherent conclusions and documents. Moreover, the materials are not monolithic. Internal documents, emails, chats, and external laws and guidance all come into play simultaneously. As TechCrunch noted in an article last month, models tend to stumble over "cross-domain information search and integration."



Why did the scores jump with Opus 4.6?

The key to understanding this rapid increase lies in the "agent teams" introduced by Anthropic in Opus 4.6. Instead of a single agent performing tasks sequentially, multiple agents divide responsibilities and proceed in parallel, coordinating with each other—an approach modeled after human teamwork. According to TechCrunch, this feature is offered as a research preview for API users/subscribers, along with improvements aimed at knowledge workers, such as expanded context length (1 million tokens) and side panel integration within PowerPoint.


In tasks like those in APEX-Agents, which unfold over multiple steps, require mid-course adjustments to the approach, and call for refined deliverables, division of labor, retries, and self-checks matter more than one-shot intelligence. TechCrunch likewise noted that Opus 4.6's "agentic features" may have helped with multi-step problems.


However, what's important here is the meaning of the "30%" figure. It's far from 100%. It's not a story of lawyers suddenly becoming unemployed next week. TechCrunch also cautions against that notion. But at the same time, the basis for declaring "safety" has weakened. The replacement of professions doesn't progress in an all-or-nothing manner. It starts with the "work that can be trimmed."



What Happens Before Replacement: The "Decomposition" of Legal Work

Breaking legal work into its components makes it clear where AI can most easily gain a foothold.

  • Initial Drafts: Contract templates, clause proposals, risk identification

  • Research Assistance: Organizing issues, pinpointing laws, precedents, and guidance

  • Comparison and Summary: Explaining differences in counterparty revisions, listing negotiation points

  • Standardized Responses: Drafting responses to common inquiries, templating according to internal rules


These tasks ultimately require "final responsibility" and "judgment," but they largely consist of exploration, organization, and writing. If agents can handle these quickly and cheaply, the cost structure of law firms and corporate legal departments could change.


On the other hand, handling testimony and emotions, building trust with parties, and resolving value conflicts are areas where text generation alone is difficult to replace. In other words, it's more realistic for legal work to "change shape" rather than "disappear entirely."



Reactions on Social Media: Simultaneous Bursts of Expectation and Cold Water

Reactions on social media (forums and communities) to this topic generally fall into three categories.


1) "Already useful as an auxiliary tool. But it's dangerous without supervision."

In Reddit's legal community, a poster claiming to be a practicing lawyer wrote that "it makes certain tasks easier, but there are hallucinations and a lack of conceptual understanding, so expert supervision is required," and suggested a future role akin to a "next-generation Westlaw" (the legal research platform). While skeptical of full autonomous replacement, the prevailing assumption is that it will spread as a tool.


2) "Impossible for courtrooms and criminal cases. Society won't accept it."

In another thread in the legal community, in the context of criminal defense, reactions included "it's hard to imagine AI handling the subtle, case-by-case judgments of procedure" and "AI deciding guilt or sentencing is dystopian." Here the issues go beyond capability to legitimacy, transparency, and human acceptance.


3) "Who takes responsibility? Contracts and governance will be bottlenecks."

On Hacker News, there is a lively discussion about contractual and liability boundaries: who bears responsibility, the vendor of the AI agent, the provider of the foundation model, or the customer? As performance improves, there is an ironic scenario in which legal demand actually grows, in the form of legal work for those who use AI.


Additionally, legal AI company Harvey reported that Opus 4.6 scored highly in its own evaluation (BigLaw Bench), highlighting strengths in practical tasks in litigation and transactional work. This reads less as a research score and more as a sign of how hot the product side has become.



The Real Reason "30%" is Scary

So why can a score of around 30% still be a "threat"? There are two reasons.


The first is that the achievable gains are unevenly distributed. Legal work contains plenty of routine processing alongside the difficult judgment calls, and automating even just those routine parts shakes the industry's hiring and training structure, in which juniors gain experience on exactly that work.


The second is that retries and division of labor bring it closer to practical use. In APEX-Agents, the average score reportedly rises with multiple attempts rather than a single shot. In other words, as the ability to "miss at first but hit on retry" matures, the cost of human review can fall.


At this point, the focus of the discussion is no longer whether "lawyers will disappear."
It is shifting to "which jobs will become cheaper first" and "who supervises, and who bears responsibility."



The Likely Reality: The "AI Premise" of Legal Work

The realistic future scenario is probably like this.

  • Corporate legal departments will pre-process contract reviews and initial internal consultations with AI, while lawyers will focus on exceptions and negotiations.

  • Law firms will increase throughput in research and drafting and revisit their pricing structures (shifting from cost-based fees toward outcomes and value).

  • The control of "using AI" itself (logs, explanations, audits, reevaluation upon model updates) will become a new compliance area.

  • And the drafting of responsibility boundaries, disclaimers, and warranties will grow more sophisticated, thickening the body of "contract practice in the AI era."


Rather than asking whether AI will become a lawyer, the faster-moving question is how lawyers will reshape their work on the premise of AI. The rise in APEX-Agents scores was an event that brought that reality forward.



Sources