
Beyond the Benchmark: Why Harvey AI and CoCounsel Outperform Lawyers in Tests but Lag in Daily Practice

This meta-analysis for law firm innovation directors and technology committees examines the disconnect between top-tier benchmark scores from Harvey and CoCounsel and their ~20% regular adoption rate at large firms, combining Vals AI results with real-world survey data to build a practical evaluation framework.

Guide scope

Task or use case compared: Legal research and document analysis benchmark performance vs. real-world adoption
Audience segment: Law firm innovation directors, managing partners, and technology committees
Evaluation criteria: Benchmark accuracy, real-world adoption rate, cost and ROI, integration friction, training and change management, professional responsibility compliance
Last reviewed: 2026-06-12

In February 2025, the legal AI market received its first major independent benchmarking study. The Vals AI benchmark put four leading legal AI tools (Harvey, CoCounsel, Lexis+ AI, and a fourth unnamed platform) through a battery of tests designed by practicing lawyers at ten top US and UK law firms. Harvey scored highest overall, topping five of six task categories, while CoCounsel led on document summarization. More striking, Harvey and CoCounsel both surpassed the human lawyer baseline on tasks involving document analysis, information retrieval, and data extraction.

Yet three months later, the Skills/Artificial Lawyer survey of 100 firms — predominantly US Am Law 100, with UK and Canadian representation — painted a different picture. Firms are running an average of 18 AI solutions, but only about 20% of lawyers at the largest firms use AI legal assistants regularly. Four firms had stopped using Harvey; nine had stopped using CoCounsel. The primary reasons cited were cost and insufficient return on investment.

This is the benchmark-to-adoption gap: the disconnect between what AI tools can do in controlled tests and what they actually deliver in daily practice. For law firm innovation directors, managing partners, and technology committees conducting due diligence, understanding this gap matters more than knowing which platform scored higher. This article examines the Vals AI methodology and results, the limitations of benchmarks, the real-world adoption data, and the structural reasons why adoption lags — then builds a practical evaluation framework for buyers.

Figure: The benchmark-to-adoption gap: both tools outperform human lawyers in tests, yet regular usage at large firms hovers around 20%.

Vals AI Benchmark: Methodology and Top-Line Results

The Vals AI benchmark, published in February 2025, was designed to address a persistent problem in legal AI evaluation: the lack of standardized, independent testing. Rather than using generic AI benchmarks or vendor-designed tests, Vals worked with ten top US and UK law firms to collect real-world questions that practicing lawyers actually encounter. The study then tested four legal AI tools — Harvey, CoCounsel, Lexis+ AI, and one unnamed platform — against a control group of human lawyers.

The results, as reported by Legal IT Insider, showed a clear hierarchy:

Vals AI benchmark top-line results (February 2025). Harvey topped 5 of 6 tasks; CoCounsel led on document summarization.

Task Category | Top Performer | Key Score / Note
Data Extraction | Harvey | Highest overall score
Document Q&A | Harvey | CoCounsel scored 89.6% (third highest overall)
Redlining | Harvey | Highest overall score
Transcript Analysis | Harvey | Highest overall score
Chronology Generation | Harvey | Highest overall score
Document Summarization | CoCounsel | Highest score in this category

The headline finding was that Harvey and CoCounsel both surpassed the human lawyer baseline on tasks related to document analysis, information retrieval, and data extraction. This was not a narrow victory: the AI tools outperformed human lawyers across multiple task types, suggesting that for certain well-defined legal workflows, generative AI has already reached or exceeded professional-level competence.

Figure: Vals AI benchmark results: Harvey and CoCounsel both exceeded the human lawyer baseline on document analysis, information retrieval, and data extraction tasks.

What Benchmarks Don't Tell You: Key Limitations

The Vals AI benchmark is the most rigorous independent evaluation of legal AI tools to date, but it has limitations that buyers must understand before using its results as a primary decision factor.

Vendor-Consented Testing and Sample Bias

The study tested only tools whose vendors consented to participate. Lexis+ AI, one of the four tools initially included, withdrew from the report before publication. This means the results reflect a self-selected sample of vendors who were confident enough in their current performance to submit to independent testing. The tools that declined — or were not invited — may perform differently. The benchmark also tested only four tools, leaving out a growing ecosystem of smaller legal AI platforms.

The Vals benchmark also tested only specific, well-defined tasks: extract data from a contract, answer questions about a document, summarize a transcript. These are precisely the kinds of tasks where large language models excel. Legal practice, however, also involves open-ended reasoning, strategic judgment, client counseling, and ethical decision-making. The benchmark did not test these areas, and they are where errors carry professional consequences.

The Hallucination Problem

Benchmark scores measure accuracy on tasks where the correct answer is known. They do not measure how often AI tools generate plausible-sounding but fabricated information, a phenomenon known as hallucination. A separate Stanford study found that the legal research AI tools it tested hallucinated in roughly 17% to 33% of responses, a rate that would be unacceptable in any legal context where citation accuracy is required. For a dedicated discussion of hallucination risk across legal AI platforms, see our Westlaw CoCounsel vs Lexis+ AI: Hallucination Risk and Citation Verification comparison.
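To see why a per-response hallucination rate matters over a whole research session, the short sketch below computes the chance that at least one response in a batch contains a fabrication. It takes the 17% figure cited above as the per-response rate and assumes responses are independent, which is a simplification; the function name is illustrative and not drawn from either study.

```python
def prob_at_least_one_fabrication(rate: float, n_responses: int) -> float:
    """Chance that at least one of n independent responses is fabricated.

    rate: assumed per-response hallucination rate, between 0 and 1.
    n_responses: number of responses a lawyer relies on.
    """
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must be between 0 and 1")
    if n_responses < 0:
        raise ValueError("n_responses must be non-negative")
    # Complement of "every response is clean".
    return 1.0 - (1.0 - rate) ** n_responses


if __name__ == "__main__":
    for n in (1, 5, 10):
        p = prob_at_least_one_fabrication(0.17, n)
        print(f"{n:>2} responses: {p:.1%} chance of at least one fabrication")
```

Under these assumptions the risk compounds quickly, which is why output verification has to be a habit and not an occasional spot check.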

Real-World Adoption: Skills Survey Findings

The Skills/Artificial Lawyer survey, conducted in early 2025 with a sample of 100 firms (majority US Am Law 100, with UK and Canadian representation), provides the most current picture of how law firms are actually using AI tools — as opposed to how vendors claim they are being used.

Key Findings

  • Firms run an average of 18 AI solutions, but only about 20% of lawyers at the largest firms use AI legal assistants regularly.
  • Harvey ranked #1 for legal drafting; CoCounsel ranked #2.
  • Harvey's Vault product ranked #2 for due diligence, behind Kira.
  • Harvey and CoCounsel also led in contract negotiation and redlining.
  • 4 firms stopped using Harvey; 9 stopped using CoCounsel. The primary reasons cited were cost and insufficient ROI.
  • Innovation departments (59%) have overtaken IT (43%) as the primary drivers of AI strategy — a dramatic shift from the prior year, when IT led at 74%.

The shift in AI strategy ownership is particularly significant. When IT departments led AI adoption, the focus was on infrastructure, security, and vendor management. Now that innovation departments are taking the lead, the emphasis is shifting to use case identification, workflow integration, and measuring business impact. This change may help close the adoption gap, but it also creates new challenges around governance and professional responsibility.

Figure: The benchmark-to-adoption gap: AI tools score near 95% on controlled tasks, but only about 20% of lawyers at large firms use them regularly.

Why Adoption Lags Behind Benchmark Performance

The gap between benchmark scores and adoption rates is not a paradox — it is the predictable result of several structural factors that controlled tests do not capture.

Cost and ROI Uncertainty

Both Harvey and CoCounsel are priced at a premium that makes sense for a few power users but becomes difficult to justify at firm-wide scale. The Skills survey confirms that cost and insufficient ROI were the primary reasons firms discontinued use of both tools. When a tool costs $1,000+ per seat per month, the ROI calculation depends on the tool being used regularly by a significant portion of the firm — which the 20% adoption rate suggests is not happening.
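The arithmetic behind that ROI problem can be sketched in a few lines. The example below is a rough illustration and not a pricing model: the $1,000 seat cost is the illustrative figure from the paragraph above, the 20% regular-use rate comes from the survey, and the $400 value of a saved lawyer hour is a purely hypothetical input that a firm would replace with its own number.

```python
def cost_per_regular_user(seat_cost: float, regular_use_rate: float) -> float:
    """Monthly licence cost carried by each lawyer who actually uses the tool."""
    if not 0.0 < regular_use_rate <= 1.0:
        raise ValueError("regular_use_rate must be in (0, 1]")
    return seat_cost / regular_use_rate


def break_even_hours(seat_cost: float, regular_use_rate: float,
                     value_per_hour: float) -> float:
    """Hours each regular user must save per month to cover all licensed seats."""
    if value_per_hour <= 0:
        raise ValueError("value_per_hour must be positive")
    return cost_per_regular_user(seat_cost, regular_use_rate) / value_per_hour


if __name__ == "__main__":
    seat_cost = 1_000.0     # USD per seat per month (illustrative figure from the text)
    use_rate = 0.20         # share of licensed lawyers who are regular users (survey)
    value_per_hour = 400.0  # hypothetical value of one saved lawyer hour

    print("Cost per regular user:", cost_per_regular_user(seat_cost, use_rate))
    print("Break-even hours per regular user per month:",
          break_even_hours(seat_cost, use_rate, value_per_hour))
```

With these inputs, each regular user carries five times the seat price, because only one licensed lawyer in five uses the tool. Raising the regular-use rate lowers the break-even threshold in direct proportion.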

Integration Friction

A tool that scores well on isolated tasks may still fail in practice if it does not integrate with existing workflows. Lawyers work within document management systems, practice management platforms, and research databases. An AI tool that requires switching contexts, re-entering data, or learning a new interface faces a steep adoption barrier — regardless of how well it performs in a benchmark.

Training and Change Management

The 20% regular usage rate suggests that even when firms license AI tools, they struggle to get lawyers to change their established workflows. Effective use of generative AI requires understanding its capabilities and limitations, knowing how to prompt effectively, and developing the habit of verifying outputs. These are skills that must be taught and reinforced — they do not emerge from simply providing access to a tool.

Professional Responsibility Obligations

ABA Model Rule 1.1 requires lawyers to provide competent representation, which now includes understanding the benefits and risks of relevant technology. But competence also means knowing when not to rely on AI. The hallucination risk, the lack of transparency in some AI models, and the unresolved questions about data privacy and attorney-client privilege create a professional responsibility landscape that benchmarks do not address. For a deeper look at the ethical compliance dimensions, see our Westlaw CoCounsel vs Lexis+ AI: Ethical Compliance comparison.

A Practical Evaluation Framework for Law Firm Buyers

Given the disconnect between benchmark performance and real-world adoption, law firm buyers need an evaluation framework that goes beyond comparing test scores. The following framework is designed for innovation directors, managing partners, and technology committees conducting due diligence on Harvey, CoCounsel, or any legal AI platform.

A six-dimension evaluation framework for legal AI tools, designed to address the benchmark-to-adoption gap.

Evaluation Dimension | Key Questions | Why It Matters
Use Case Specificity | Which specific workflows will this tool support? How well does it match the firm's practice areas? | Benchmarks test generic tasks; real value depends on fit with actual work
Pilot Design | What success metrics will be used? How many users? Over what period? | Pilots that measure only usage, not outcomes, miss the ROI question
Integration | Does the tool integrate with existing DMS, practice management, and research tools? | Integration friction is a primary barrier to adoption
Professional Responsibility | What is the tool's hallucination rate? How does it handle data privacy? Is citation verification built in? | Ethical compliance is non-negotiable; see ABA Model Rule 1.1
Cost-Benefit Analysis | What is the per-seat cost at full deployment? What is the expected productivity gain? | The Skills survey shows cost/ROI is the top reason for discontinuation
Training and Change Management | What training does the vendor provide? What is the firm's plan for driving adoption? | Without training, even a strong tool may stall near the 20% regular-use rate seen in the survey
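One way a committee might turn the six dimensions into a comparable number is a simple weighted scorecard. The sketch below is only an illustration: the weights, the 1-to-5 scale, and the minimum professional responsibility score are all hypothetical choices that each firm would set for itself, and nothing here comes from the Vals or Skills data.

```python
# Hypothetical weights for the six dimensions; they sum to 1.0.
WEIGHTS = {
    "use_case_specificity": 0.20,
    "pilot_design": 0.10,
    "integration": 0.20,
    "professional_responsibility": 0.20,
    "cost_benefit": 0.20,
    "training_change_management": 0.10,
}

# Hypothetical floor: a tool scoring below this on professional
# responsibility is rejected regardless of its weighted total.
MIN_PROFESSIONAL_RESPONSIBILITY = 3


def weighted_score(scores: dict[str, int]) -> float:
    """Weighted average of 1-5 committee scores across all six dimensions."""
    missing = WEIGHTS.keys() - scores.keys()
    if missing:
        raise ValueError(f"missing scores for: {sorted(missing)}")
    for name in WEIGHTS:
        if not 1 <= scores[name] <= 5:
            raise ValueError(f"{name} must be scored from 1 to 5")
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


def passes_ethics_gate(scores: dict[str, int]) -> bool:
    """Treat professional responsibility as a gate, not just a weight."""
    return scores["professional_responsibility"] >= MIN_PROFESSIONAL_RESPONSIBILITY


if __name__ == "__main__":
    example_tool = {  # made-up scores for a made-up tool
        "use_case_specificity": 4,
        "pilot_design": 3,
        "integration": 2,
        "professional_responsibility": 4,
        "cost_benefit": 2,
        "training_change_management": 3,
    }
    print("Weighted score:", round(weighted_score(example_tool), 2))
    print("Passes ethics gate:", passes_ethics_gate(example_tool))
```

Handling professional responsibility as a separate gate reflects the table's point that ethical compliance is non-negotiable: a high score elsewhere should not be able to offset a failure there.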

Pilot Design Best Practices

The firms that have successfully adopted AI tools share common pilot design patterns:

  • Start with a narrow, well-defined use case where the tool's capabilities are clearly superior to existing methods.
  • Define success metrics before the pilot begins — not just usage rates, but time saved, error reduction, and user satisfaction.
  • Include a diverse group of users: partners, associates, paralegals, and legal ops professionals.
  • Build in a structured feedback loop to capture what works, what doesn't, and what needs to change.
  • Plan for the post-pilot decision: what criteria will trigger a firm-wide rollout, and what will trigger a discontinuation?
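The last two points, defining metrics up front and fixing the post-pilot decision rule, can be written down before the pilot starts. The sketch below shows one possible shape for that rule. The metric names, thresholds, and the three-way outcome are hypothetical and would need to be agreed by the firm's own committee.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PilotCriteria:
    """Thresholds agreed before the pilot begins (all values hypothetical)."""
    min_regular_use_rate: float  # share of pilot users using the tool regularly
    min_hours_saved: float       # hours saved per user per month
    max_error_rate: float        # share of reviewed outputs with errors
    min_satisfaction: float      # average user rating on a 1-5 scale


@dataclass(frozen=True)
class PilotResult:
    """Measurements collected at the end of the pilot."""
    regular_use_rate: float
    hours_saved: float
    error_rate: float
    satisfaction: float


def decide(result: PilotResult, criteria: PilotCriteria) -> str:
    """Return 'rollout', 'extend', or 'discontinue' from pre-agreed criteria."""
    checks = [
        result.regular_use_rate >= criteria.min_regular_use_rate,
        result.hours_saved >= criteria.min_hours_saved,
        result.error_rate <= criteria.max_error_rate,
        result.satisfaction >= criteria.min_satisfaction,
    ]
    passed = sum(checks)
    if passed == len(checks):
        return "rollout"      # every criterion met
    if passed >= len(checks) / 2:
        return "extend"       # mixed results: keep piloting and adjust
    return "discontinue"      # most criteria missed


if __name__ == "__main__":
    criteria = PilotCriteria(0.5, 5.0, 0.02, 4.0)
    result = PilotResult(regular_use_rate=0.35, hours_saved=6.0,
                         error_rate=0.01, satisfaction=3.8)
    print(decide(result, criteria))
```

Writing the rule down in advance keeps the rollout decision from being shaped by whichever metric happens to look best once the pilot ends.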

The Broader Platform Landscape

Harvey and CoCounsel are not the only options. For a comprehensive comparison of the major legal AI platforms — including Lexis+ AI and Bloomberg Law AI — see our Legal Research AI Platforms Compared guide. For detailed evaluations of each platform, see our Harvey AI Enterprise Legal Platform profile and Westlaw CoCounsel evaluation.

Conclusion: Bridging the Gap Between Test Scores and Daily Practice

The Vals AI benchmark established that Harvey and CoCounsel can outperform human lawyers on specific, well-defined legal tasks. That is a genuine achievement and a signal that the technology has reached a threshold of practical utility. But the Skills survey's finding that only about 20% of lawyers at large firms use AI tools regularly — and that multiple firms have discontinued both Harvey and CoCounsel — is a reminder that benchmark performance is not the same as real-world value.

The tools that will ultimately succeed are those that address the full adoption challenge: integration with existing workflows, transparent pricing that aligns with realized value, robust training and change management support, and demonstrable compliance with professional responsibility obligations. The firms that will benefit most are those that evaluate tools not on test scores alone, but on a structured framework that accounts for use case fit, pilot design, integration, ethics, and cost-benefit analysis.

The benchmark-to-adoption gap is not a reason to dismiss AI tools — it is a reason to evaluate them more carefully. The technology is ready. The question is whether the legal profession is ready to integrate it into daily practice.

Corrections & feedback

Submit corrections, flag outdated tool data, or share your evaluation experience. Comments are moderated. Nothing here constitutes legal advice.
