What AI Paralegal Tools Still Get Wrong: Six Failure Modes Beyond Hallucinations

The cleanest AI-generated legal draft can be the most dangerous one on the desk. Not because it is full of obvious nonsense, but because it reads like work product: proper tone, familiar cadence, tidy citations, confident transitions. By the time a paralegal or junior lawyer is told to “just proofread it,” the document may already lack the one quality proofreading assumes: a mostly intact relationship to the record and the law.

That assumption is no longer safe. Damien Charlotin’s database had documented more than 1,174 court and tribunal decisions worldwide involving AI hallucinations by April 2026, with consequences that moved beyond judicial irritation into monetary sanctions, bar referrals, and mandatory training orders.[1] Those cases matter, but not because every AI error looks like the famous fake-citation disaster. They matter because they prove that polished legal prose can reach court while detached from its sources.

For paralegals using these tools, the practical question is narrower and more urgent than “Can AI hallucinate?” It is: what can go wrong in a document that still looks legally literate, and what kind of review catches it before filing, service, signature, or client delivery?

[Figure: Polished legal document with hidden citation and sentence-level structural errors]

Proofreading reads the document. Verification leaves it.

Traditional proofreading is recognition work. You look for typos, awkward phrasing, formatting slips, defined terms that changed shape, citations that violate local style, and sentences that do not scan. Reading aloud helps. Reading backward helps. A good proofreader can feel when a paragraph has been stitched together in the wrong order.

AI review is different. The most important defect may not be visible inside the sentence. It may live in the gap between the sentence and the cited case, between the chronology and the production set, between the quotation marks and the opinion, or between the user’s instruction and what the model quietly decided to do instead.

That is why “proofread the AI draft” is too small an instruction. A reviewer can recognize a plausible legal sentence without knowing whether the cited authority actually supports it. The document must be treated less like prose to polish and more like a chain of claims to test.

Six ways an AI legal draft can become unreliable

A useful starting point is Jeffrey I. Ehrlich’s May 2026 Advocate Magazine article, which lays out a practitioner-developed taxonomy of hallucination varieties and buy-in bias, drawing on documented courtroom examples including Mata v. Avianca, Coomer v. Lindell, Noland v. Land of the Free, and Campos.[2] It is not a consensus standard issued by a court, bar association, or professional body. It is still valuable because it separates errors that too often get collapsed into one word.

For workflow purposes, the taxonomy can be expanded into six failure modes that paralegals are likely to face when reviewing AI-assisted research, chronologies, memos, briefs, cite-checks, and contract summaries.

Failure mode	What the draft may look like	What must be verified
Fabricated authority	A case, statute, rule, regulation, or source is cited as if it exists.	Existence in the official or reliable source set.
Misattributed holding	A real authority is cited for a proposition it does not support.	The precise proposition, procedural posture, and limiting context.
Fabricated quotation	A quotation sounds judicial or statutory but does not appear in the cited source.	Exact words, punctuation, omissions, and page or pinpoint location.
Factual drift	A chronology or summary slowly changes facts from the record.	Each fact against the underlying document, transcript, exhibit, or data source.
Buy-in bias and prompt non-compliance	The model accepts a false premise or ignores an instruction while sounding cooperative.	Whether the answer actually follows the assignment and challenges unsupported assumptions.
Context-window degradation	Long sessions begin to lose, compress, or contaminate earlier information.	Continuity across drafts, especially after long prompts, uploads, or iterative revisions.

Fabricated authorities are only the obvious first category

The fabricated case is the failure mode everyone now knows to fear. It is also the one most likely to be addressed by purpose-built legal research platforms. A retrieval-based system connected to a legal database should be better positioned than a general chatbot to avoid inventing a case that is not in the database at all.

That distinction matters. ChatGPT and Claude, used without an authoritative legal database, do not have the same source constraints as Westlaw AI, Lexis+ AI, CoCounsel, Harvey, or other legal platforms designed around legal materials. The tools should not be flattened into one bucket.

But reducing nonexistent citations is not the same thing as making the answer reliable. A case can exist. The citation can be formatted correctly. The paragraph can still be wrong.

A real case can be attached to the wrong proposition

Misattributed holdings are quieter than fabricated authorities. They are also harder to catch by eye. The reviewer sees a familiar reporter citation, a court name, maybe even a real case name. Nothing about the sentence announces that the cited case decided a narrower question, arose on a different procedural posture, was later distinguished, or contains dicta being treated as a holding.

This is where AI-generated prose exploits a normal legal-reading habit. In a busy draft, a reviewer may read the cited proposition and think, “That sounds like the law.” But the question is not whether the proposition sounds right in the abstract. The question is whether that authority supports that proposition in that sentence.

The verification step is not a citation check in the narrow formatting sense. It requires opening the authority and matching the draft’s claim to the court’s actual reasoning. If the draft says a case “held” something, the reviewer should confirm that the issue was before the court and that the statement is not background, dicta, a party argument, a quotation from another case, or a rule applied only under materially different facts.

Quotation marks can create false confidence

A fabricated quotation is worse than a loose paraphrase because quotation marks tell the reader the language has already been checked. AI can produce language that sounds exactly like judicial writing: balanced, formal, slightly old-fashioned when needed, and comfortably embedded in a paragraph. That is precisely why it must be distrusted until located.

The check is mechanical and unforgiving. Search the cited authority for a distinctive phrase. If it is not there, do not massage the sentence until it becomes “close enough.” If the language appears in a different case, a headnote, a party brief, a secondary source, or nowhere at all, the draft needs correction before the surrounding analysis can be trusted.

This is one reason headnotes and AI summaries need careful handling. They may help a reviewer navigate, but they are not a substitute for the opinion, statute, regulation, contract, transcript, or record document that will bear the weight of the assertion.

Factual drift is the error that spreads

Legal teams often use AI to make a first pass through records: summarize deposition testimony, draft a chronology, identify recurring actors, or condense long email threads. This is a sensible use case when the output is treated as a working aid. The danger comes when the first-pass summary becomes the source for the next summary, and then the next draft, until an early mistake hardens into the team’s apparent record.

Factual drift rarely announces itself as fiction. It may change “received” to “reviewed,” turn an approximate time into a sequence, merge two employees, omit a qualification, or convert a witness’s uncertainty into a clean factual assertion. None of those changes may look dramatic in isolation. In a statement of facts, they can alter causation, notice, damages, privilege, or credibility.

The reviewer’s job is to keep the draft from becoming its own authority. Every material fact needs a record parent: exhibit, transcript page, declaration paragraph, contract section, board minutes, data export, or other source. If a fact cannot be walked back to that parent, it should not survive merely because it has appeared consistently in earlier AI-assisted drafts.

Buy-in bias makes the model a poor skeptic

Ehrlich’s discussion of buy-in bias is especially useful for legal support work because it describes a failure that does not always look like invention.[2] The model may accept the user’s framing, build around a false premise, and produce an answer that is internally polished but externally unearned.

A prompt that says, “Draft an argument that the agreement required written notice before termination,” may receive a polished argument even if the agreement has no such requirement. A prompt that says, “Summarize the cases holding that delay alone proves prejudice,” may produce a list framed around that proposition even if the law is more conditional. These are hypothetical examples, but the workflow problem is real: a cooperative model is not the same thing as an adversarial reviewer.

Paralegals can catch this only by checking the assignment against the source, not just the answer against grammar. Before polishing the output, ask whether the premise was established. If the prompt smuggled in a fact, legal standard, or procedural assumption, verify that assumption first.

Prompt non-compliance can hide inside a useful answer

Prompt non-compliance is not always dramatic. The tool may answer five of seven questions, ignore the requested jurisdiction, use cases outside the date range, omit contrary authority despite being asked for it, summarize only uploaded documents instead of comparing them, or provide a narrative when the task required a table of source-backed assertions.

The answer can still be useful. That is what makes the defect easy to miss. A reviewer who starts editing the prose too soon may never return to the instruction set. For AI-assisted legal work, the prompt itself becomes part of the review file. The first question is not “Is this well written?” It is “Did it do the assigned job?”

Long sessions can contaminate later work

Context-window degradation is the least visible failure mode in ordinary document review. In a long AI session, earlier materials, user corrections, discarded theories, draft language, and source excerpts may all remain part of the conversational environment. The model may compress, forget, blend, or over-weight pieces of that history.

For a litigation team, that matters when a research thread becomes a drafting thread, then a revision thread, then a cite-checking thread. The model may carry forward a mistaken premise that was corrected earlier, revive language from an abandoned argument, or treat a prior draft as more authoritative than the source documents. The reviewer may not see the contamination because the final answer arrives clean.

Purpose-built legal AI narrows some risks, not all of them

It is tempting to draw a bright line between general-purpose AI and legal AI: the general chatbot hallucinates; the legal platform verifies. The real line is less comforting. Legal platforms can reduce the risk of citing nonexistent authorities because they are designed around legal databases, retrieval, and source display. That is an important improvement. It is not a complete quality-control system.

Reported accuracy figures should be read with care, because legal AI benchmarks vary by task design, query type, source set, and scoring method. Still, the published numbers are enough to reject casual confidence. A secondary discussion of 2026 legal research accuracy cited a Stanford empirical study reporting Lexis+ AI at 65% accuracy and Westlaw AI at 42% accuracy on legal queries, and also described a Vals AI Legal Research Report from October 2025 showing a 78% to 81% ceiling.[3] Those figures do not mean every tool will perform that way on every office task. They do mean that professional review cannot be optional.

Retrieval-augmented generation, often shortened to RAG, changes the failure profile. Instead of answering only from model weights, the system retrieves documents and uses them to generate an answer. In legal tools, that can help ground the response in cases, statutes, regulations, briefs, contracts, or uploaded materials.

But retrieval is not the same as characterization. A tool may retrieve the right case and still overstate its holding. It may retrieve the right deposition and still smooth away uncertainty. It may display sources and still generate an inaccurate quotation. It may find relevant documents and still ignore an instruction to separate favorable from unfavorable authority. RAG helps with the “does this source exist?” question. It does not, by itself, answer “does this sentence fairly state what the source says?”

[Figure: Comparison of recognition-based proofreading and verification-based legal review]

Why a streak of correct answers is not a safe signal

One of the most dangerous habits in AI review is relaxing after several good answers. In human workflow, that instinct is understandable. If a junior team member has accurately checked ten citations in a row, a supervisor may begin to trust the eleventh. AI output invites the same comfort, but the analogy is weak.

A March 2026 Cornell/GWU arXiv paper described a simplified effective-attention-head model in which output could flip from correct to fabricated after sustained accurate responses.[4] That finding should not be overstated. It is a model-based result, not proof that production legal AI systems behave deterministically in the same way. The useful warning is more modest: a run of plausible, accurate output is not itself verification of the next answer.

Spot-checking inherits the same problem. If the reviewer checks three citations and they are sound, the fourth unreviewed citation is not thereby proven. If the model summarized two exhibits accurately, the third summary still needs its own source comparison. Trust by sample is defensible only when the team has a reason to sample rather than verify, and legal filings rarely give the last reviewer much comfort on that point.
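The arithmetic behind that caution can be made concrete. What follows is a minimal Python sketch with illustrative numbers that do not come from any cited study: it assumes a draft with 20 citations, 2 of them defective, and a reviewer who checks 3 chosen at random. The function name is hypothetical, and the calculation assumes defects are spread randomly, which real AI output may not honor.

```python
from math import comb


def chance_sample_is_clean(total: int, defective: int, sampled: int) -> float:
    """Probability that a random sample, drawn without replacement,
    contains none of the defective items (a hypergeometric calculation)."""
    if not (0 <= defective <= total and 0 <= sampled <= total):
        raise ValueError("defective and sampled must each be between 0 and total")
    # Ways to pick the sample from clean items only, divided by all possible samples.
    return comb(total - defective, sampled) / comb(total, sampled)


# Illustrative assumption: 20 citations, 2 bad, 3 spot-checked.
p = chance_sample_is_clean(total=20, defective=2, sampled=3)
print(round(p, 3))  # 0.716
```

Under those assumed numbers, roughly seven spot-checks in ten would come back clean even though two bad citations remain in the draft.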

A verification workflow for AI-assisted legal drafts

A workable review process has to be blunt enough to use under deadline pressure. It should not require the paralegal to diagnose model architecture. It should force the draft back to the materials that matter.

Review pass	Action	Primary question
1	Separate assertions from prose.	What is this sentence asking the reader to believe?
2	Identify the source for each material assertion.	Where did this come from?
3	Confirm the source exists and is the right source.	Is the cited authority or record document real and relevant?
4	Match the proposition to the source.	Does the source actually support the claim made?
5	Verify all quoted language directly.	Do the words inside quotation marks appear there?
6	Compare facts against the record.	Did the draft change sequence, actors, certainty, or scope?
7	Review prompt compliance and context contamination.	Did the tool answer the assignment cleanly, or carry over noise?

Start by marking claims, not mistakes

Before correcting style, mark every sentence that makes a legal or factual assertion. This includes standards of review, elements, burdens, exceptions, procedural history, factual chronology, descriptions of documents, and characterizations of testimony. Transitional sentences can matter too if they imply causation or sequence.

A sentence such as “The company terminated the agreement after repeated written warnings” contains several claims: there was a termination, there was an agreement, warnings existed, the warnings were written, there was more than one, and they preceded termination. If the record supports only some of that, the sentence is not ready.

Give each assertion a source parent

Every material assertion should have a parent source. For law, that means the case, statute, rule, regulation, contract provision, agency guidance, or other authority. For facts, it means the record document: exhibit, transcript page, declaration, discovery response, email, spreadsheet, board packet, or other file the team can produce and defend.

If the AI draft cites a source, open it. If it does not cite a source, do not assume the source was in the prompt or upload. Find it or flag it. “AI said so” is not a parent source, and neither is a prior AI-generated summary.
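One way to keep the parent-source rule honest is to track it in a simple structure. The sketch below is a hypothetical illustration in Python, not a feature of any legal platform: the class name, field names, source-type labels, and sample pinpoints are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional

# Assumed labels for this sketch: output that must never count as a parent source.
NON_PARENT_TYPES = {"ai_summary", "ai_draft"}


@dataclass
class Assertion:
    text: str
    source_type: Optional[str] = None  # e.g. "exhibit", "transcript", "case"
    pinpoint: Optional[str] = None     # e.g. a page, paragraph, or section reference


def needs_parent(a: Assertion) -> bool:
    """True if the assertion has no source, no pinpoint, or rests only on AI output."""
    if a.source_type is None or a.pinpoint is None:
        return True
    return a.source_type in NON_PARENT_TYPES


def flag_unparented(assertions: list[Assertion]) -> list[Assertion]:
    """Return the assertions a reviewer must find a source for or flag."""
    return [a for a in assertions if needs_parent(a)]


# Hypothetical entries; the pinpoints are placeholders, not real record cites.
draft_assertions = [
    Assertion("Termination followed written warnings.", "exhibit", "Ex. 12 at 3"),
    Assertion("The warnings were repeated.", "ai_summary", "chronology v2"),
    Assertion("Notice was timely."),
]
for a in flag_unparented(draft_assertions):
    print("FLAG:", a.text)
```

The second and third entries are flagged: one rests on a prior AI summary, the other cites nothing. The sketch only records what the reviewer has found; it does not open or read any source.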

Check existence before meaning

For authorities, first confirm the source exists in a reliable database or official source. Check the case name, court, date, reporter citation, docket number if relevant, and subsequent history where the task requires it. For statutes and regulations, check the current version and effective date if currency matters.

Only after existence is confirmed should the reviewer move to meaning. A real citation is not a cleared citation. It has merely survived the first gate.

Match the proposition at the right level of specificity

The most useful verification question is precise: “Does this source support this sentence as written?” Not the general topic. Not a nearby proposition. Not a broader principle that could be rewritten into support. The sentence as written.

Watch verbs carefully. “Held,” “found,” “recognized,” “noted,” “suggested,” “declined,” and “distinguished” do different work. AI drafts often choose the stronger verb because it makes the paragraph move. The source may require a weaker one.

Also check procedural posture. A statement made while denying a motion to dismiss does not necessarily carry the same weight in summary judgment analysis. A rule discussed in a dissent is not the court’s holding. A case applying state law is not automatically a statement of federal law, or vice versa.
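The verb check can be partly mechanized as a way of deciding which sentences to open a source for first. The following Python sketch is an assumption-laden illustration: the verb list is a hypothetical starting set based on the verbs named in this article, the sentence splitter is deliberately naive, and a flag means "go read the authority," never "this sentence is wrong."

```python
import re

# Assumed starting list for this sketch; a team would tune it to its own drafts.
STRONG_VERBS = (
    "held", "holds", "requires", "required",
    "bars", "barred", "establishes", "established",
)
_PATTERN = re.compile(r"\b(" + "|".join(STRONG_VERBS) + r")\b", re.IGNORECASE)


def split_sentences(text: str) -> list[str]:
    """Naive splitter on end punctuation; reporter abbreviations will confuse it."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def flag_strong_verbs(draft: str) -> list[tuple[str, list[str]]]:
    """Return (sentence, matched verbs) for each sentence using a strong verb."""
    flagged = []
    for sentence in split_sentences(draft):
        hits = [m.group(1).lower() for m in _PATTERN.finditer(sentence)]
        if hits:
            flagged.append((sentence, hits))
    return flagged


sample = (
    "The court held that notice was required. "
    "The panel noted the delay. "
    "The statute bars recovery."
)
for sentence, verbs in flag_strong_verbs(sample):
    print(verbs, "->", sentence)
```

On the sample text, the first and third sentences are flagged and the "noted" sentence is not. The reviewer still decides whether the source supports the stronger verb or needs a weaker one.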

Treat quotations as zero-tolerance items

Quotation review should not be approximate. Open the source, search the phrase, confirm the page or paragraph, and compare the exact language. Check ellipses, brackets, emphasis, internal quotation marks, and whether the quoted language comes from the court, a party, a cited source, or a headnote.

If time is short, quotations are not the place to economize. A paraphrase can be revised. A false quotation is a representation that the words appear in the source.
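The "search the phrase" step lends itself to a small script when the source text is available electronically. The Python sketch below is one possible approach under stated assumptions: the function names are hypothetical, the source is assumed to be plain text already pulled from the opinion or record, and the match is exact apart from whitespace and curly-versus-straight quotation marks. It does not handle ellipses or bracketed alterations, and a match does not reveal whether the words belong to the court, a party, or a quoted source.

```python
import re
import unicodedata

_QUOTE_MAP = str.maketrans({"\u201c": '"', "\u201d": '"', "\u2018": "'", "\u2019": "'"})


def normalize(text: str) -> str:
    """Fold typographic variants so that only real wording differences remain."""
    text = unicodedata.normalize("NFKC", text).translate(_QUOTE_MAP)
    return re.sub(r"\s+", " ", text).strip()


def extract_quotations(draft: str) -> list[str]:
    """Pull every span enclosed in double quotation marks, straight or curly."""
    return re.findall('[\u201c"]([^\u201c\u201d"]+)[\u201d"]', draft)


def quotation_appears(quote: str, source_text: str) -> bool:
    """True only if the quoted words appear verbatim in the normalized source."""
    return normalize(quote) in normalize(source_text)


# Hypothetical source and draft text, invented for the example.
source = "The court  noted that \u201cnotice must be\nreasonable under the circumstances.\u201d"
draft = (
    'The opinion says "notice must be reasonable under the circumstances" '
    'and "delay alone proves prejudice."'
)
for q in extract_quotations(draft):
    status = "found" if quotation_appears(q, source) else "NOT FOUND"
    print(status, "-", q)
```

The first quotation is found and the second is not. A "found" result only clears the first gate; the page or paragraph, the speaker, and any alterations still need a human comparison.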

Run a separate factual drift pass

Factual drift deserves its own pass because it hides under narrative smoothness. Compare the AI draft against the record in order. For each fact, check actor, date, sequence, document type, certainty, quantity, and whether the draft has converted an allegation into an established fact.

Chronologies are especially vulnerable. A good chronology is not just a list of events; it preserves uncertainty. If a witness says she “believes” a call happened before a meeting, the chronology should not silently promote that belief into a fixed sequence unless another source supports it.
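A drift pass is easier to run consistently when each fact is broken into the same fields every time. The Python sketch below is a hypothetical illustration: the field names are a reduced subset chosen for the example, and the values are entered by the reviewer after reading the record and the draft. The code only reports where the two disagree.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class Fact:
    actor: str
    action: str
    timing: str     # kept as text so an approximate time is not forced into a date
    certainty: str  # e.g. "stated", "believes", "alleged"


def drift(record: Fact, draft: Fact) -> dict[str, tuple[str, str]]:
    """Return each field where the draft departs from the record, as (record, draft)."""
    return {
        f.name: (getattr(record, f.name), getattr(draft, f.name))
        for f in fields(Fact)
        if getattr(record, f.name) != getattr(draft, f.name)
    }


# Hypothetical entries modeled on the kinds of drift described above.
from_record = Fact("witness", "received the email", "before the meeting", "believes")
from_draft = Fact("witness", "reviewed the email", "before the meeting", "stated")

for field_name, (was, now) in drift(from_record, from_draft).items():
    print(f"{field_name}: record says {was!r}, draft says {now!r}")
```

Here the comparison surfaces two changes, "received" becoming "reviewed" and a belief becoming a statement, while the actor and timing match.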

Review the prompt after reviewing the answer

Keep the prompt or task instruction with the review materials. After the answer has been checked for sources, compare it to the assignment. Did the tool limit itself to the correct jurisdiction? Did it include adverse authority if asked? Did it distinguish controlling from persuasive authority? Did it use only the uploaded documents if that was the instruction? Did it separate facts from legal conclusions?

This pass catches the useful-but-wrong answer: the one that would improve the draft while failing the assignment.
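Some parts of the compliance pass are mechanical enough to script once the reviewer has recorded what the tool actually returned. The Python sketch below is a hypothetical illustration covering only three of the checks: question count, jurisdiction, and date range. All class names, field names, and sample values are assumptions, and the case names are placeholders rather than real authorities.

```python
from dataclasses import dataclass


@dataclass
class Assignment:
    jurisdiction: str
    earliest_year: int
    latest_year: int
    questions: int


@dataclass
class CitedCase:
    name: str
    jurisdiction: str
    year: int


def compliance_issues(assignment: Assignment, cases: list[CitedCase],
                      questions_answered: int) -> list[str]:
    """List the ways the answer departs from the recorded assignment."""
    issues = []
    if questions_answered < assignment.questions:
        issues.append(f"answered {questions_answered} of {assignment.questions} questions")
    for c in cases:
        if c.jurisdiction != assignment.jurisdiction:
            issues.append(f"{c.name}: outside jurisdiction ({c.jurisdiction})")
        if not (assignment.earliest_year <= c.year <= assignment.latest_year):
            issues.append(f"{c.name}: outside date range ({c.year})")
    return issues


# Placeholder assignment and cases, invented for the example.
task = Assignment(jurisdiction="CA", earliest_year=2015, latest_year=2025, questions=7)
cited = [
    CitedCase("Case A", "CA", 2019),
    CitedCase("Case B", "NY", 2021),
    CitedCase("Case C", "CA", 2009),
]
for issue in compliance_issues(task, cited, questions_answered=5):
    print(issue)
```

The sample run reports three issues: two unanswered questions, one case from the wrong jurisdiction, and one case outside the date range. Checks such as omitted adverse authority or controlling versus persuasive weight still require reading the sources.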

Use clean sessions when the review task changes

When moving from brainstorming to research, from research to drafting, or from drafting to verification, consider starting a fresh session or using a separate review workspace. The point is not ritual cleanliness. The point is to keep discarded theories, earlier mistakes, and draft language from becoming invisible context for the next task.

If the platform allows source-bound review, use it deliberately. Upload only the materials needed for that pass. Label source sets clearly. Avoid asking the same thread to be researcher, advocate, editor, and cite-checker for a filing without any boundary between those roles.

What to check first at 10:40 p.m.

Deadline review is never ideal. If the document is moving and there is no time for a beautiful process, triage by consequence. The first pass should go to the statements that would embarrass the signer, mislead the court, alter the client’s position, or infect multiple later paragraphs.

Check every quotation against the source.
Check every cited legal proposition that uses strong verbs such as “held,” “requires,” “bars,” or “establishes.”
Check all facts tied to dates, notice, causation, damages, privilege, waiver, exhaustion, or jurisdiction.
Check whether the draft relies on a prior AI summary instead of the underlying record.
Check whether the tool answered the actual assignment or merely produced something useful-looking.

Formatting, style, and grammar still matter. They just cannot be allowed to consume the review window before source integrity is tested. A polished falsehood is not improved by perfect commas.

The reviewer’s standard is traceability

AI can be useful in legal support work. It can make a first pass through a record, surface candidate authorities, generate a rough chronology, suggest organization, and expose issues a tired team might otherwise postpone. The problem is not use. The problem is treating fluent output as if fluency were evidence.

Paralegals do not need to become model scientists to protect the work product. They do need a different review posture. The question is not whether the AI draft sounds like a legal document. The question is whether each material assertion can be traced to a primary source, checked at the right level of specificity, and corrected when the source does not carry the weight the sentence puts on it.

References

[1] AI Hallucination Cases Database, Damien Charlotin, April 2026.
[2] Before you buy legal AI, learn to use the AI you already have, Advocate Magazine, May 2026.
[3] What Makes the AI Legal Research Tool Accurate in 2026?, Poll the People.
[4] arXiv:2603.23857v1, arXiv, March 2026.
