Use case

Legal Research Agents

Your legal research agent surfaces a citation. The case name is plausible. The holding fits the argument. The case does not exist. Your associate submits the brief. Opposing counsel flags the fabricated citation in writing. This is not a hypothetical.

Where things go wrong

Non-existent case cited as real

The model generates a case citation — plausible court, plausible year, plausible party names — that does not exist in any legal database. The citation looks real, reads as authoritative, and fits the argument. It is fabricated. The Mata v. Avianca pattern, first documented in the Southern District of New York in 2023, has recurred in multiple subsequent cases involving AI-generated filings. The engineers who built those systems did not have hallucination scoring on the citation output.

Court sanction, bar complaint, public coverage. The attorney is liable for what was in the brief, regardless of whether the AI wrote the citation.

Real case cited with opposite holding

The agent retrieves a real case but misrepresents its holding. The case is cited as supporting the argument; it actually cuts against it. The correct case name and citation make it look verified. Opposing counsel reads the actual case.

The brief's credibility is destroyed in the reply. The error signals a research pipeline that produces citations without verification, not a one-off mistake.

Citation without source verification

The agent produces a citation string — case name, court, year, docket — without performing a lookup against a legal database. The output format is identical to a verified citation. Nothing in the output indicates the source was not checked.

The citation moves downstream as if verified. The failure is invisible in the output; only a downstream lookup or opposing review surfaces it.
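
One way to make the missing lookup visible is to carry verification status in the citation's type, so an unchecked citation cannot be passed where a checked one is required. The sketch below is illustrative only: the type names, the `verify` helper, and the injected `lookup` function are hypothetical and do not correspond to any particular legal database or SDK.

```typescript
// Hypothetical types for illustration; not part of any SDK.
type UnverifiedCitation = {
  status: 'unverified'
  caseName: string
  court: string
  year: number
  docket: string
}

type VerifiedCitation = Omit<UnverifiedCitation, 'status'> & {
  status: 'verified'
  sourceId: string   // identifier returned by the database lookup
  verifiedAt: string // ISO timestamp of the lookup
}

type Citation = UnverifiedCitation | VerifiedCitation

// The lookup is injected, so no specific legal database API is assumed.
// It returns a source identifier when the case is found, otherwise null.
type Lookup = (c: UnverifiedCitation) => string | null

function verify(c: UnverifiedCitation, lookup: Lookup, now: Date = new Date()): Citation {
  const sourceId = lookup(c)
  if (sourceId === null) return c // stays visibly unverified
  const verified: VerifiedCitation = {
    ...c,
    status: 'verified',
    sourceId,
    verifiedAt: now.toISOString(),
  }
  return verified
}

// Only citations that passed a lookup are allowed into the brief.
function citationsForBrief(all: Citation[]): VerifiedCitation[] {
  return all.filter((c): c is VerifiedCitation => c.status === 'verified')
}
```

With this shape, a citation string that never went through a lookup is distinguishable from one that did, both at compile time and in the serialized output.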

Confident output with no uncertainty signal

The agent produces a research memo with five citations. Three are accurate. Two are fabricated or misquoted. The memo is formatted consistently — same citation style, same confident prose throughout. There is no signal in the output that the two problematic citations are any different from the three correct ones.

A reviewer who spot-checks one correct citation and approves the rest has no way to distinguish the good from the bad without checking all five independently.
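
The arithmetic behind the spot-check problem can be worked through directly. The sketch below assumes the reviewer picks citations uniformly at random without replacement, which is a simplification; `missProbability` is a hypothetical helper name.

```typescript
// Probability that a random spot-check of k citations, out of n total
// with `bad` problematic ones, happens to land only on accurate citations.
// Assumes uniform random selection without replacement.
function missProbability(n: number, bad: number, k: number): number {
  const valid = [n, bad, k].every((v) => Number.isInteger(v) && v >= 0)
  if (!valid || bad > n || k > n) {
    throw new RangeError('expected integers with 0 <= bad <= n and 0 <= k <= n')
  }
  const good = n - bad
  let p = 1
  for (let i = 0; i < k; i++) {
    if (good - i <= 0) return 0 // more checks than accurate citations: a bad one is always hit
    p *= (good - i) / (n - i)
  }
  return p
}

// The memo above: five citations, two problematic.
for (let k = 1; k <= 5; k++) {
  console.log(`check ${k} of 5: miss probability ${missProbability(5, 2, k).toFixed(2)}`)
}
```

Under these assumptions, checking one citation misses both problems 60% of the time, and the miss probability only reaches zero once four of the five have been checked.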

Eval + control loop

What happens when a rule fires

Legal Research Agents control loop: the original span scores hallucination_rate 0.72, which is above threshold and triggers human review.

  • Step 1: Original span arrived
  • Step 2: Eval fires (hallucination_rate 0.72, above threshold)
  • Step 3: Human review (next call on the same failure path)
  • Step 4: Human queue (awaiting review)
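
The rule check in this loop can be pictured as a simple threshold comparison. The sketch below is a standalone illustration, not TruLayer's rule engine: the `SpanScore`, `Rule`, and `Decision` shapes are hypothetical, and the 0.5 threshold is an assumed value (the page only says that 0.72 is above threshold).

```typescript
// Hypothetical shapes for illustration; real rules are configured in the dashboard.
type Metric = 'hallucination_rate' | 'faithfulness'
type SpanScore = { spanId: string; metric: Metric; value: number }
type Rule = { metric: Metric; threshold: number; fireWhen: 'above' | 'below' }
type Decision = { spanId: string; fired: boolean; action: 'none' | 'human_review' }

function evaluate(score: SpanScore, rule: Rule): Decision {
  const crossed =
    rule.fireWhen === 'above' ? score.value > rule.threshold : score.value < rule.threshold
  const fired = score.metric === rule.metric && crossed
  return { spanId: score.spanId, fired, action: fired ? 'human_review' : 'none' }
}

// Assumed threshold of 0.5; the score of 0.72 comes from the example above.
const decision = evaluate(
  { spanId: 'span-1', metric: 'hallucination_rate', value: 0.72 },
  { metric: 'hallucination_rate', threshold: 0.5, fireWhen: 'above' },
)
console.log(decision.action)
```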

The response

How TruLayer closes the loop

  • Hallucination
  • Faithfulness

The legal research failure mode is a hallucination problem with a paper trail. A case citation that does not exist is a factual claim about the world — not a hallucination in the vague sense of "a wrong answer," but a specific, verifiable, false assertion of fact. TruLayer’s hallucination evaluator scores whether the agent’s output contains assertions that are not grounded in the provided context. For a legal research agent, the right context is the contents of a verified legal database lookup — not the model’s training data. When the hallucination score on a citation output exceeds threshold, it means the agent produced a claim it cannot trace back to its provided sources. That is the signal that routes the citation to a human review queue before the brief moves downstream.
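
To make "grounded in the provided context" concrete for citations, here is a deliberately simple stand-in: every citation in the agent's output must also appear in the set returned by the database lookup. This is plain string matching under assumed inputs, not how TruLayer's hallucination evaluator computes its score, and the function names are hypothetical.

```typescript
// Normalize case and whitespace so trivial formatting differences do not count as mismatches.
function normalizeCitation(citation: string): string {
  return citation.toLowerCase().replace(/\s+/g, ' ').trim()
}

// Returns the output citations that cannot be traced to any retrieved source.
function ungroundedCitations(outputCitations: string[], retrievedCitations: string[]): string[] {
  const grounded = new Set(retrievedCitations.map(normalizeCitation))
  return outputCitations.filter((c) => !grounded.has(normalizeCitation(c)))
}

// Placeholder citations, not real cases.
const retrieved = ['Alpha v. Beta, 1 Ex. 1 (2000)']
const produced = ['Alpha v. Beta, 1 Ex. 1 (2000)', 'Gamma v. Delta, 2 Ex. 2 (2001)']
console.log(ungroundedCitations(produced, retrieved))
```

Anything this function returns is a citation the agent asserted without a source in its context, which is the condition that should send it to review.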

The faithfulness evaluator runs alongside hallucination. Where hallucination asks "did the model assert something it has no basis for," faithfulness asks "does the output match what was actually in the retrieved source." For citation outputs, faithfulness catches the second failure mode: the case is real and the citation was retrieved, but the holding was misrepresented — the model’s summary of the case diverges from what the source document actually says. Both scores run inline on every span as each trace arrives. When either fires, the control loop acts on the next call in the same failure path: retry with a prompt that requires the model to quote the holding verbatim from the retrieved document; fall back to a model with lower generative latitude on summary tasks; or route to a human review queue for associate verification before the citation enters the brief.
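
The verbatim-quote retry lends itself to a mechanical check: if the prompt requires the holding to be quoted from the retrieved document, the quote should appear in that document. A minimal sketch, assuming the quote and the source text are both available as strings; `isVerbatimQuote` is a hypothetical helper and ignores only whitespace differences.

```typescript
// Collapse runs of whitespace so line wrapping in the source does not break the match.
function collapseWhitespace(s: string): string {
  return s.replace(/\s+/g, ' ').trim()
}

// True only when the non-empty quote appears word for word in the source text.
function isVerbatimQuote(quotedHolding: string, sourceText: string): boolean {
  const quote = collapseWhitespace(quotedHolding)
  return quote.length > 0 && collapseWhitespace(sourceText).includes(quote)
}

// Placeholder source text, not taken from a real opinion.
const sourceText = 'The court held that\n  the claim was time-barred under the statute.'
console.log(isVerbatimQuote('the claim was time-barred', sourceText))
console.log(isVerbatimQuote('the claim was timely', sourceText))
```

A paraphrase that reverses the holding fails this check even when the case name and citation are correct, which is the second failure mode described above.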

The per-trace before/after delta shows the original hallucination or faithfulness score alongside the post-remediation score. For a legal team doing pre-submission review, the trace answers the question that matters: which citations were flagged, what score they received, and whether the corrected output passed threshold. The alternative — manually verifying every citation in every research output — scales with headcount, not with the pipeline. The trace makes the review targeted rather than exhaustive.
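
A pre-submission review table built from those deltas might look like the following sketch. The record shape and the threshold are assumptions for illustration (scores are treated as "higher is worse", as with hallucination_rate), and nothing here reflects TruLayer's actual trace schema.

```typescript
// Hypothetical per-citation record: score before and after remediation.
type TraceScore = { citation: string; before: number; after: number }
type ReviewRow = TraceScore & { delta: number; flagged: boolean; passedAfter: boolean }

function reviewRows(scores: TraceScore[], threshold: number): ReviewRow[] {
  return scores.map((s) => ({
    ...s,
    delta: s.after - s.before,          // negative means the score improved
    flagged: s.before > threshold,      // the original output tripped the rule
    passedAfter: s.after <= threshold,  // the corrected output is under threshold
  }))
}

// Placeholder citations and scores; 0.5 is an assumed threshold.
const rows = reviewRows(
  [
    { citation: 'Alpha v. Beta', before: 0.72, after: 0.1 },
    { citation: 'Gamma v. Delta', before: 0.2, after: 0.2 },
  ],
  0.5,
)
// Reviewers only need to look at rows that were flagged.
console.log(rows.filter((r) => r.flagged))
```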

See it in practice

Instrument your legal research agent in two lines.

Wrap your LLM client. Every span from the instrumented client is captured and scored by every built-in evaluator. Eval rules and control-loop actions are configured in the dashboard, not in your application code.

agent.ts
import { TruLayer } from '@trulayer/sdk'
import OpenAI from 'openai'

const tl = new TruLayer({ apiKey: process.env.TRULAYER_API_KEY })
const openai = tl.instrument(new OpenAI())

// Every span from this client is captured, scored by all 25
// built-in evaluators, and surfaced in the research project.
// Eval rules + control-loop actions are configured in the dashboard,
// not in your application code.

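// Placeholder research task; replace with your agent's actual prompt.
const task = 'Summarize the holding of the retrieved case and quote it verbatim.'

const response = await openai.chat.completions.create({
  model: 'gpt-4o',
  messages: [{ role: 'user', content: task }],
})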

Ship reliable legal research agents.

Free tier includes 1M spans / month · No credit card