The useful answer to “which AI contract clause extraction benchmark should I trust?” is not a leaderboard. It is a set of questions about what was tested, which clauses failed, whether the system was used out of the box or inside a review architecture, and whether the reported cost assumes a deployment pattern your legal team can actually run.
That matters more in 2026 because the benchmark evidence no longer tells one simple story. Some frontier models still score well on aggregate measures. Some smaller, domain-adapted systems now match or beat them on structured extraction tasks. Some vendor-published results are methodologically useful but still need the usual discount for self-evaluation. And some scores that look high become much less reassuring once you ask whether the model found the renewal notice clause, the uncapped liability clause, or the joint IP ownership clause.

The best reading of the 2025–2026 AI contract clause extraction benchmarks is narrower, but more useful: architecture and domain adaptation can matter as much as model scale, and aggregate accuracy is a poor proxy for whether a contract team can safely rely on a system for a specific clause in a specific workflow.
The Benchmark Landscape Is Mixed by Design, Not Just by Result
The current evidence base has three different kinds of material, and they should not be read as if they carry the same weight.
| Benchmark or study | What it tested | What to take from it | Main caution |
|---|---|---|---|
| ContractEval | 19 models across 41 clause types using CUAD | Strong overall performance can hide near-failure on important clause categories | Benchmark dataset performance is still not the same as live review performance |
| Harvey Contract Intelligence benchmark | 4,000+ data points comparing out-of-the-box LLMs, humans, and Harvey Review tables | Provision-level review architecture can matter more than the underlying model | Vendor-published and product-specific |
| LegalOn 2026 Contract Review Benchmark | 3,282 attorney reviews across 21 guidelines using Elo scoring and two-pass bias control | Detailed attorney-preference design, useful for reading review-output claims | Published by the vendor whose product ranked first |
| Onit/Olava Extract study | 508 labelled instances across 26 fields | A domain-trained small model can outperform frontier APIs on extraction at much lower reported inference cost | Vendor lab evaluated its own system; cost assumes batched self-hosted inference |
| Govea et al. academic study | Structured legal classification, extraction, and summarization on a corpus of about 200 contracts | Smaller legal-domain models can outperform larger general models on structured legal tasks | Not a procurement benchmark for any one commercial platform |
| CLAUSE discrepancy benchmark | LLM legal understanding of commercial contract clauses | Extraction benchmarks do not fully measure legal discrepancy reasoning | Boundary marker rather than a direct extraction-tool comparison |
ContractEval is the cleanest place to start because it makes the procurement problem visible. It evaluated 19 models across 41 clause types using the CUAD dataset, and its results show why a single average score can be actively misleading in contract review. GPT-4.1 mini scored above 0.9 F1 on easier fields such as Governing Law and Parties, yet performed near zero on Uncapped Liability, Joint IP Ownership, and Notice Period to Terminate Renewal, the kinds of provisions that can create real cleanup work if they are missed.[1]
That is the benchmark result procurement teams should keep on the table when a sales deck says “best accuracy.” A model that reliably extracts party names may still be the wrong control for a renewal-risk project. A system that looks adequate across a full clause inventory may still need human-first review for a handful of clauses where false negatives are expensive.
The Harvey benchmark points in a different but compatible direction. In a study with more than 4,000 data points, out-of-the-box LLMs identified only 65–70% of valid deal points, while lawyers using Harvey’s Review tables outperformed either humans alone or AI alone by more than 5%.[2] The interesting part is not simply that Harvey’s product did well. It is the acknowledgement that provision-level checks and review-table architecture contributed more to accuracy than the choice of the underlying model.[2]
LegalOn’s 2026 benchmark is another useful but caveated vendor-published study. It reports 3,282 attorney reviews across 21 guidelines, with Elo scoring and a two-pass bias-control design. LegalOn was preferred 3.8 times more often on assignment verification, 2.6 times more often on PHI ownership, and 2.4 times more often on NDA purpose definition.[3] Those are specific enough to be worth reading. They are also product claims from the publisher whose system ranked first, so they should be treated as evidence to interrogate, not as an independent market verdict.
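For readers unfamiliar with Elo scoring, the sketch below shows the standard Elo update applied to a single pairwise attorney preference. It is an illustration of the general technique only: the K-factor of 32, the starting ratings of 1500, and the function names are assumptions, and nothing here reproduces LegalOn's actual scoring or bias-control procedure.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability, under the Elo model, that output A is preferred over output B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(rating_a: float, rating_b: float, a_preferred: bool, k: float = 32.0):
    """Return new (rating_a, rating_b) after one reviewer preference.

    k is an assumed K-factor; a real study would choose and disclose its own.
    """
    exp_a = expected_score(rating_a, rating_b)
    score_a = 1.0 if a_preferred else 0.0
    new_a = rating_a + k * (score_a - exp_a)
    # B's actual and expected scores are the complements of A's.
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - exp_a))
    return new_a, new_b


# Hypothetical example: two systems start level and a reviewer prefers system A.
system_a, system_b = update_elo(1500.0, 1500.0, a_preferred=True)
print(round(system_a), round(system_b))
```

Because each rating depends on which comparisons were run and in what order, an Elo ranking is only as informative as the disclosed pairing and blinding design behind it.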
The Onit/Olava Extract study puts cost and deployment squarely into the benchmark conversation. Across 508 labelled instances and 26 fields, the domain-trained mixture-of-experts small language model achieved micro F1 of 0.842, compared with 0.796 for Gemini 3.1 Pro and 0.794 for Claude Opus 4.6.[4] It also reported self-hosted inference cost of $0.018 per document versus $0.149–$0.456 for frontier API models, a 78–97% reduction, and higher precision than the frontier API models in the comparison.[4]
That is a meaningful result, especially for high-volume contract operations. It is also still a vendor-lab result evaluating the vendor’s own product. The right response is not to dismiss it; it is to ask whether the same configuration, batching assumptions, document types, and field definitions match the review environment being purchased for.
The independent academic study by Govea et al. helps keep the vendor results in perspective. In a peer-reviewed evaluation on a corpus of about 200 contracts, Legal-BERT, with 110 million parameters, achieved Macro-F1 of 0.921 in classification, span F1 of 0.903 in extraction, and ROUGE-L of 0.886 in summarization, outperforming GPT-3.5, listed at 6,100 million parameters, on structured tasks.[5] That does not prove that any small legal model is better than any frontier model. It does support the less glamorous point that legal-domain training and task fit can outweigh raw model size for structured contract work.
What Each Benchmark Actually Measures
Clause extraction sounds like one task until the workflow is put under a microscope. Some systems identify whether a clause exists. Some extract a span of contract text. Some normalize the answer into a structured field. Some compare a provision against a playbook. Some produce review comments that attorneys then rate. A benchmark can be strong for one of those jobs and weak evidence for another.
For procurement purposes, the first split is between extraction benchmarks and review-quality benchmarks. ContractEval and Onit/Olava are closer to extraction and field performance. Harvey and LegalOn include review architecture and attorney preference or deal-point identification. Govea et al. covers structured legal NLP tasks across classification, extraction, and summarization. CLAUSE sits at the edge of this article’s question because it audits LLM understanding of commercial contract clause discrepancies rather than simply asking whether a clause was extracted.[6]
That distinction affects what a benchmark can prove. A high extraction score does not prove that the system gives good negotiation guidance. A strong attorney-preference result does not isolate whether the model extracted the right span before generating the review note. A legal-understanding benchmark does not tell you whether the system can process a messy folder of vendor agreements by Friday afternoon.
| If the benchmark measures | It can help answer | It does not prove by itself |
|---|---|---|
| Clause existence | Did the system detect that a relevant clause is present? | Whether the extracted text is complete or legally sufficient |
| Span extraction | Did the system capture the right language? | Whether it interpreted the clause correctly against a playbook |
| Structured field extraction | Can the system turn contract language into reviewable data? | Whether exceptions and drafting variants are handled safely |
| Attorney preference | Which output lawyers preferred under the study design? | Which system will reduce risk in your workflow |
| Deal-point review architecture | Whether system design improves review outcomes | Whether the same lift appears with your clause set and staffing model |
| Legal discrepancy understanding | Whether a model grasps substantive clause differences | Whether it can reliably extract clauses from production documents |
This is why a model-name comparison is often the least useful part of the conversation. A contract review system is not just a foundation model. It includes document parsing, clause segmentation, retrieval, prompt design, validation rules, confidence thresholds, review-table design, audit logs, human escalation, and sometimes self-hosted inference. A third-party analysis in June 2026 made the same practical point: legal AI benchmarks increasingly reveal more about system architecture than model branding.[7]
Aggregate F1 Is Where the Trouble Starts
F1 is useful because it balances precision and recall. It is dangerous because it can hide which side of the error matters. In contract extraction, a false positive may waste a reviewer’s time. A false negative may leave an auto-renewal, liability carveout, consent requirement, or ownership exception out of the review set entirely.

ContractEval’s clause-level results make that problem concrete. The same model can look highly competent on Governing Law and Parties while nearly failing on Uncapped Liability, Joint IP Ownership, and Notice Period to Terminate Renewal.[1] Those failures do not carry equal cost. Easy fields are often the ones reviewers can spot quickly. High-risk missed clauses are the ones that may sit quietly until a renewal window closes or a dispute forces someone to reconstruct what the review process missed.
The false-negative pattern is especially important. ContractEval reported that several Qwen3 models had false “no related clause” rates above 30%.[1] In a live workflow, that failure mode can be worse than a messy extraction. A messy extraction still asks for attention. A confident absence tells the review team there is nothing to look at.
That does not mean every benchmark should be rejected unless it reports every clause separately. It means legal buyers should ask for per-clause precision, recall, and false-negative rates on the clause families they actually care about. If a vendor cannot provide that breakdown, the aggregate score is not enough to support a high-risk deployment.
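A small worked example shows how an aggregate score can mask a clause-level failure. The sketch below computes per-clause precision, recall, F1, and false-negative rate alongside a pooled (micro) F1. The counts are hypothetical and chosen only to illustrate the pattern; they are not taken from ContractEval or any other study.

```python
from dataclasses import dataclass


@dataclass
class ClauseCounts:
    tp: int  # clauses correctly extracted
    fp: int  # extractions that should not have been made
    fn: int  # clauses present in the contract but missed


def precision(c: ClauseCounts) -> float:
    return c.tp / (c.tp + c.fp) if (c.tp + c.fp) else 0.0


def recall(c: ClauseCounts) -> float:
    return c.tp / (c.tp + c.fn) if (c.tp + c.fn) else 0.0


def f1(c: ClauseCounts) -> float:
    p, r = precision(c), recall(c)
    return 2 * p * r / (p + r) if (p + r) else 0.0


def false_negative_rate(c: ClauseCounts) -> float:
    return c.fn / (c.tp + c.fn) if (c.tp + c.fn) else 0.0


def micro_f1(per_clause: dict[str, ClauseCounts]) -> float:
    """Pool counts across all clause types, then compute a single F1."""
    pooled = ClauseCounts(
        tp=sum(c.tp for c in per_clause.values()),
        fp=sum(c.fp for c in per_clause.values()),
        fn=sum(c.fn for c in per_clause.values()),
    )
    return f1(pooled)


# Hypothetical counts: two easy, frequent fields and one rare, high-risk clause.
counts = {
    "Governing Law": ClauseCounts(tp=95, fp=3, fn=5),
    "Parties": ClauseCounts(tp=97, fp=2, fn=3),
    "Uncapped Liability": ClauseCounts(tp=1, fp=1, fn=9),
}

print(f"micro F1: {micro_f1(counts):.3f}")
for name, c in counts.items():
    print(f"{name}: F1={f1(c):.3f}, false-negative rate={false_negative_rate(c):.2f}")
```

With these illustrative counts the pooled F1 stays above 0.9 even though nine of the ten Uncapped Liability clauses are missed, which is why the per-clause breakdown is the figure to request.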
The Clause List Matters More Than the Average
A procurement team reviewing NDAs, healthcare agreements, and enterprise SaaS paper does not have one universal clause-risk profile. Assignment, PHI ownership, limitation of liability, renewal notice, audit rights, change of control, data use, indemnity, and IP ownership do not create the same operational consequence when missed.
LegalOn’s benchmark is useful here because its reported wins are tied to named review guidelines rather than a single abstract score: assignment verification, PHI ownership, and NDA purpose definition.[3] The caveat is equally important. A buyer should not generalize those specific preferred-output results into “best at contract review.” They should ask whether those guidelines resemble their own playbook, whether attorney reviewers were blind to source, and whether the same result appears on their contract types.
Small, Domain-Trained Systems Are Now a Serious Procurement Option
The most operationally interesting finding in the current benchmark set is not that a smaller model can win a narrow contest. It is that a smaller, domain-trained system may be easier to run repeatedly, audit, and cost-justify for structured extraction.
Onit/Olava is the strongest example in the materials. Its domain-trained mixture-of-experts small language model reported micro F1 of 0.842, ahead of Gemini 3.1 Pro at 0.796 and Claude Opus 4.6 at 0.794, across 508 labelled instances and 26 fields.[4] It also reported precision of 0.812 versus a 0.686–0.783 range for the frontier API models, meaning fewer hallucinated extractions in that test.[4]

The cost result is where buyers need to slow down. The study reported $0.018 per document for self-hosted inference compared with $0.149–$0.456 for frontier API models.[4] That is a large gap, but the reported low cost assumes batched self-hosted inference. If the business need is single-document, real-time review inside a lawyer’s drafting session, the economics may change. If the need is portfolio-scale extraction over thousands of legacy agreements, batching may be exactly the relevant deployment pattern.
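The per-document figures can be turned into a quick back-of-the-envelope comparison. The sketch below uses the three costs quoted above; the 10,000-document portfolio size and the function names are assumptions for illustration. Computed directly from these per-document figures, the relative reduction works out to roughly 88% to 96%, so a headline percentage that differs from that range may rest on comparators or workload assumptions not reproduced in this article.

```python
def relative_reduction(baseline: float, alternative: float) -> float:
    """Fraction of the baseline cost saved by switching to the alternative."""
    return 1.0 - alternative / baseline


def portfolio_cost(per_document: float, n_documents: int) -> float:
    """Inference cost only; excludes hosting overhead and human review."""
    return per_document * n_documents


SELF_HOSTED = 0.018                # reported batched self-hosted cost per document
API_LOW, API_HIGH = 0.149, 0.456   # reported frontier API range per document
N_DOCS = 10_000                    # hypothetical portfolio size

for label, api_cost in (("cheapest API", API_LOW), ("priciest API", API_HIGH)):
    saved = relative_reduction(api_cost, SELF_HOSTED)
    print(
        f"{label}: ${portfolio_cost(api_cost, N_DOCS):,.0f} vs "
        f"${portfolio_cost(SELF_HOSTED, N_DOCS):,.0f} self-hosted "
        f"({saved:.0%} lower)"
    )
```

The arithmetic only holds if the workload can actually be batched; for single-document, real-time use the self-hosted per-document cost would need to be re-measured rather than assumed.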
The academic evidence points in the same direction without making a product claim. Govea et al. found that Legal-BERT outperformed GPT-3.5 on structured legal classification, extraction, and summarization tasks, despite being much smaller by parameter count.[5] That is not an argument for choosing old models over new ones. It is an argument against assuming that the largest general model is automatically the safest extraction engine.
The Harness Can Beat the Model
The Harvey benchmark is most useful when read as an architecture study rather than a product victory lap. Out-of-the-box LLMs identified only 65–70% of valid deal points in its study, while lawyers using Harvey’s Review tables performed more than 5% better than either humans alone or AI alone.[2] The study’s more important admission is that provision-level checks, not just the underlying model, drove the accuracy improvement.[2]
That finding matches what contract teams see in practice. Clause extraction is rarely one prompt against one document. The system has to decide which text unit counts as the provision, whether cross-references should be included, whether a missing clause is genuinely absent, and whether a low-confidence result should be escalated. The model may generate the answer, but the harness decides what question is asked, what evidence is retrieved, and what gets shown to the reviewer.
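One way to picture the harness is as a routing layer that sits between the model's output and the reviewer. The sketch below is a minimal illustration under stated assumptions: the `Extraction` shape, the 0.85 confidence threshold, the route names, and the high-risk clause list are all hypothetical and do not describe Harvey's or any other vendor's implementation.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Route(Enum):
    AUTO_ACCEPT = "auto_accept"
    HUMAN_REVIEW = "human_review"
    VERIFY_ABSENCE = "verify_absence"


@dataclass
class Extraction:
    clause_type: str
    text: Optional[str]  # None means the model reported "no related clause"
    confidence: float    # assumed to be a score in [0, 1]


# Hypothetical list of clauses where a missed provision is expensive.
HIGH_RISK = {
    "Uncapped Liability",
    "Joint IP Ownership",
    "Notice Period to Terminate Renewal",
}


def route(extraction: Extraction, threshold: float = 0.85) -> Route:
    """Decide what the reviewer sees, independent of which model produced the answer."""
    if extraction.text is None:
        # A confident "absent" on a high-risk clause is never accepted silently.
        if extraction.clause_type in HIGH_RISK:
            return Route.VERIFY_ABSENCE
        return Route.AUTO_ACCEPT if extraction.confidence >= threshold else Route.HUMAN_REVIEW
    if extraction.clause_type in HIGH_RISK or extraction.confidence < threshold:
        return Route.HUMAN_REVIEW
    return Route.AUTO_ACCEPT


print(route(Extraction("Governing Law", "Laws of the State of New York", 0.97)))
print(route(Extraction("Uncapped Liability", None, 0.99)))
```

The design choice worth noting is that the absence path is handled separately from the low-confidence path, because a reported "no clause" carries no text for a reviewer to sanity-check.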
A benchmark that compares raw foundation models can be useful for research. It is less useful for buying a contract review workflow. A benchmark that compares complete systems can be more realistic, but it also makes attribution harder. If a vendor’s result improves, is that because of model selection, clause segmentation, better examples, a retrieval layer, validation checks, attorney workflow design, or favorable test documents? Procurement should ask the question even when the answer is “several of the above.”
Vendor Benchmarks Are Evidence, Not Verdicts
Vendor-published benchmarks are not useless. Some are more transparent than many independent summaries, and vendors often have access to realistic workflow data that academic studies do not. But the developer-as-evaluator relationship changes how the evidence should be weighed.
LegalOn discloses a detailed review structure: 3,282 attorney reviews, 21 guidelines, Elo scoring, and two-pass bias control.[3] Onit/Olava discloses labelled instances, field count, comparator models, F1, precision, and cost assumptions.[4] Harvey discloses the scale of its benchmark and the relative performance of out-of-the-box LLMs, humans, and lawyers using its Review tables.[2] Those are all better than an isolated accuracy badge.
Still, internal benchmark design can quietly favor the home system. A vendor can choose clause types that match its training history, document formats that its parser handles well, review guidelines that mirror its product design, or cost scenarios that fit its deployment architecture. None of that requires bad faith. It is simply why the benchmark should start a diligence conversation rather than end it.
A headline F1 claim without public methodology deserves still less weight. If the source does not disclose the document set, clause labels, annotation process, comparator configuration, and per-clause results, it should be treated as a marketing claim. It may be true; it is just not yet decision-grade evidence.
What Current Benchmarks Still Do Not Tell You
The 2025–2026 benchmark set is stronger than the old demo-driven conversation, but it still leaves important gaps.
- Jurisdiction coverage remains thin. The materials do not establish performance across multiple jurisdictions in a way that would support broad global contracting claims.
- Language coverage is not proven. A system that performs well on English commercial contracts should not be assumed to perform equally well on multilingual portfolios.
- OCR-degraded and formatting-heavy inputs are not consistently tested. Scanned amendments, exhibits, tables, and legacy PDFs can change the extraction problem before the model even sees the text.
- Human agreement ceilings are often unclear. Without consistent inter-annotator agreement reporting, it is hard to know whether a model is failing or whether the clause definition itself is unstable.
- Reasoning is not the same as extraction. CLAUSE is useful because it reminds buyers that understanding a contractual discrepancy is a different task from locating a provision.[6]
- Cost is workload-dependent. Batched self-hosted inference, API calls, real-time drafting support, and human verification produce different total costs.
There is also an emerging caution around “thinking” or reasoning modes. One benchmark finding suggests that reasoning mode improved conciseness but reduced extraction correctness. That may evolve as models and prompting patterns change, but it is a useful reminder: a more elaborate answer is not automatically a more accurate extraction.
How to Read an AI Contract Clause Extraction Benchmark Before Procurement
The practical question is not whether the benchmark is impressive. It is whether the benchmark reduces uncertainty for the work your team is actually buying the tool to perform.
Start with the Clause Set
Ask for results by clause type, not just by model or product. The minimum useful view separates easy administrative fields from provisions that create legal or operational risk. ContractEval’s Governing Law and Parties results should not reassure anyone about Uncapped Liability or Notice Period to Terminate Renewal unless the vendor can show performance on those clauses too.[1]
- Which clause families were tested?
- Were absent clauses tested, or only present clauses?
- What were precision, recall, and false-negative rates for each high-risk clause?
- Did the test include drafting variants, cross-references, carveouts, schedules, and amendments?
- Were clause definitions close to your playbook, or merely close to the vendor’s taxonomy?
Separate Model Performance from Workflow Performance
A raw model benchmark answers a different question from a system benchmark. Harvey’s results are useful precisely because they show that lawyers using a structured review interface outperformed AI alone and humans alone in that study.[2] For procurement, that means the demo should include the review table, escalation logic, confidence display, and audit trail—not just a chat window producing a clean explanation.
If the vendor claims the architecture is what improves reliability, ask to see the architecture fail gracefully. What happens when the model says no clause exists? What happens when extracted text conflicts with a later amendment? What happens when a low-confidence answer is sent to a reviewer? The answer should be procedural, not poetic.
Interrogate the Cost Scenario
Onit/Olava’s reported $0.018 per document is meaningful because it is tied to a self-hosted, batched setup.[4] A legal operations team should ask whether its own use case looks like that. Portfolio remediation, due diligence, and repository cleanup may fit batch economics. Live matter support, negotiation redlines, and single-document lawyer queries may not.
Cost comparisons should also include human review. A cheaper extraction run that increases reviewer cleanup time may not be cheaper. A more expensive extraction run that reliably flags the few high-risk provisions that matter may be worth it. The benchmark will not answer that unless the test design includes downstream review effort.
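The point about downstream review effort can be made concrete with a simple total-cost calculation. In the sketch below, the two inference costs come from the figures quoted above, while the reviewer minutes per document and the hourly rate are hypothetical values chosen only to show how the ranking can flip.

```python
def total_cost_per_document(
    inference_cost: float,
    review_minutes: float,
    reviewer_hourly_rate: float,
) -> float:
    """Inference cost plus the cost of the human time spent checking the output."""
    return inference_cost + (review_minutes / 60.0) * reviewer_hourly_rate


HOURLY_RATE = 120.0  # hypothetical reviewer cost per hour

# Hypothetical scenario: the cheaper run needs more cleanup per document.
cheap_run = total_cost_per_document(0.018, review_minutes=6.0, reviewer_hourly_rate=HOURLY_RATE)
pricey_run = total_cost_per_document(0.456, review_minutes=4.0, reviewer_hourly_rate=HOURLY_RATE)

print(f"cheaper inference, more cleanup: ${cheap_run:.2f} per document")
print(f"pricier inference, less cleanup: ${pricey_run:.2f} per document")
```

Under these assumed review times, reviewer minutes dominate the total and the run with the higher inference price is cheaper overall, which is why a pilot should measure review time and not only inference spend.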
Ask Who Marked the Truth
The ground truth layer is not a clerical detail. Clause labels and span boundaries are legal judgments, especially when a provision is split across sections or modified by an exhibit. Ask who annotated the contracts, how disagreements were resolved, whether inter-annotator agreement was measured, and whether the benchmark counted partial extractions as correct.
This is where academic and vendor benchmarks can complement each other. Academic work such as Govea et al. provides independent evidence on structured legal NLP performance.[5] Vendor benchmarks may provide more workflow-specific measurement. Neither should be allowed to skip the annotation question.
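Inter-annotator agreement is commonly summarized with Cohen's kappa, which corrects raw agreement for the agreement two annotators would reach by chance. The sketch below is a plain-Python version for two annotators; the labels are hypothetical, and a real benchmark may use a different statistic or more than two annotators.

```python
from collections import Counter


def cohens_kappa(labels_a: list[str], labels_b: list[str]) -> float:
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("Both annotators must label the same non-empty set of items")
    n = len(labels_a)
    # Observed agreement: share of items where the two labels match.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's label frequencies, summed over labels.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in set(freq_a) | set(freq_b)) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical labels for whether an uncapped liability clause is present.
annotator_1 = ["present", "present", "absent", "absent"]
annotator_2 = ["present", "absent", "absent", "absent"]
print(cohens_kappa(annotator_1, annotator_2))
```

A low kappa on a clause type suggests the clause definition itself is unstable, in which case a model's apparent errors on that clause deserve a second look before they are counted against it.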
Run a Clause-Level Pilot, Not a Beauty Contest
A useful pilot does not need to mimic a public benchmark. It should test your clause inventory, your document quality, your review staffing, and your tolerance for missed clauses. The output should be a decision table: which clauses can be auto-extracted with spot checks, which require human verification, and which should remain lawyer-led until the system proves otherwise.
| Procurement question | What a good answer looks like | Warning sign |
|---|---|---|
| What clauses did you test? | Per-clause results for clauses matching the buyer’s playbook | Only aggregate accuracy or broad contract-type claims |
| What failures are most common? | Separate false positives, false negatives, hallucinated extractions, and partial spans | No discussion of error type |
| What is the deployment assumption? | Clear distinction between API, self-hosted, batch, and real-time processing | Cost per document without workload details |
| What role does the workflow play? | Disclosure of validation rules, review tables, confidence thresholds, and escalation paths | All credit assigned to the model name |
| Who created ground truth? | Qualified annotators, disagreement process, and ideally agreement data | Unspecified labels or vendor-only assertions |
| How will humans verify output? | Defined review obligations for low-confidence or high-risk clauses | Assumption that high benchmark score removes review need |
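The pilot's decision table can be produced mechanically once per-clause results exist. The sketch below maps each clause to one of the three review tiers described above; the thresholds, the tier rule that high-risk clauses always get at least human verification, and the pilot numbers are all assumptions that a legal team would set for itself.

```python
AUTO = "auto-extract with spot checks"
VERIFY = "human verification"
LAWYER_LED = "lawyer-led"


def review_tier(
    recall: float,
    precision: float,
    high_risk: bool,
    auto_recall: float = 0.95,    # assumed bar for unattended extraction
    floor_recall: float = 0.80,   # assumed bar below which the tool is not relied on
    min_precision: float = 0.85,  # assumed bar for tolerable reviewer noise
) -> str:
    """Assign a review tier to one clause type from its pilot results."""
    if recall < floor_recall:
        return LAWYER_LED
    if high_risk or recall < auto_recall or precision < min_precision:
        return VERIFY
    return AUTO


# Hypothetical pilot results: clause -> (recall, precision, high_risk)
pilot_results = {
    "Governing Law": (0.98, 0.97, False),
    "Renewal Notice": (0.88, 0.90, True),
    "Uncapped Liability": (0.40, 0.75, True),
}

decision_table = {
    clause: review_tier(recall, precision, high_risk)
    for clause, (recall, precision, high_risk) in pilot_results.items()
}

for clause, tier in decision_table.items():
    print(f"{clause}: {tier}")
```

Recall is checked first because, as argued earlier, a missed clause is usually the more expensive error in contract review.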
Where the Benchmarks Converge
The studies do not agree on one winner, and that is fine. They agree on several more useful points.
- Model scale is not a procurement strategy. Domain-specific smaller models and legal-domain architectures can perform very well on structured extraction tasks.[4][5]
- Out-of-the-box LLM performance is not enough for precision-critical contract review. Harvey’s benchmark found that raw LLMs identified only 65–70% of valid deal points in its study.[2]
- Clause-level variance is the central risk. ContractEval shows that strong performance on easy fields can coexist with near-zero performance on high-risk provisions.[1]
- System design matters. Provision-level checks, review tables, validation layers, and escalation workflows may determine whether a model’s output becomes useful legal operations infrastructure.[2][7]
- Cost claims are deployment claims. The Onit/Olava cost result is strongest for batched self-hosted inference, not every possible review scenario.[4]
- Vendor benchmarks need disclosure, not automatic rejection. The more specific the methodology, the more useful the benchmark becomes—but self-evaluation remains a material caveat.[2][3][4]
For readers who want a deeper comparison of purpose-built systems and general-purpose models, the related guide on AI contract review accuracy benchmarks is the natural next stop. For platform-specific context, see the Harvey AI enterprise legal platform evaluation and the Luminance AI contract review tool profile.
The Procurement Question to Use Instead
The wrong question is “Which AI model scored highest?” That question invites a vendor to collapse dataset choice, clause mix, deployment cost, and review design into a single number.
The better question is: which tested system performs reliably on our clause set, under our deployment constraints, with disclosed methodology and human verification where the benchmark shows weakness?
That question is less convenient for a procurement scorecard. It is much better for the person who will be asked, three months after implementation, why the system found the easy clauses and missed the one that mattered.
References
[1] ContractEval, arXiv, 2025.
[2] Contract Intelligence benchmark, Harvey.
[3] 2026 Contract Review Benchmark, LegalOn, 2026.
[4] Olava Extract study, arXiv, 2026.
[5] Govea et al. study, PMC.
[6] CLAUSE discrepancy benchmark, ACL, 2026.
[7] What Legal AI Benchmarks Reveal That Model Names Don't, Artificial Lawyer, June 2026.