Use case

Clinical AI Assistants

Your health assistant described the right condition. The dosage range it cited was wrong. The output was delivered with the same confident tone as every accurate response before it. Nothing in the pipeline produced a different signal for the wrong answer versus the right one.

Where things go wrong

Wrong dosage range for a common medication

The model describes a condition accurately and then states a dosage range that is outside standard clinical guidelines — either too low to be effective or high enough to cause harm. The error is plausible. The formatting is clean. The response reads like every other accurate response the assistant has produced.

A patient acts on incorrect dosage information. The failure is invisible in the pipeline until an adverse outcome surfaces it.

Specialist referral mismatched to presenting symptoms

The assistant recommends a specialist referral by pattern-matching on a surface feature: a term in the symptom description that also appears in an adjacent clinical category. A patient presenting with cardiac symptoms is referred to a specialist whose scope does not cover cardiac care.

Delayed or missed care for the presenting condition; the referral appears correct in the output log, and the failure requires clinical review to identify.

Invented drug interaction

The model describes a drug interaction as clinically documented when it has no basis in the pharmacological literature. The interaction sounds plausible — both drug names are real, the mechanism described is coherent. The interaction itself does not exist.

A patient avoids a medication combination they could safely use, or seeks unnecessary clinical consultation, based on a fabricated contraindication.

Confident output with no clinical uncertainty signal

The assistant produces responses with consistent confidence across both high-certainty clinical facts and low-certainty inferences. Nothing in the output distinguishes "this is standard clinical guidance" from "this is the model’s best inference given incomplete information." Reviewers cannot tell from the output which responses warrant additional verification.

Clinical outputs that should trigger additional review are treated as authoritative, because nothing in the output distinguishes them from the outputs that can be trusted as written.

Eval + control loop

What happens when a rule fires

Clinical AI Assistants control loop: the original span scores hallucination_rate 0.68, above threshold, which triggers human review; the span is then awaiting review.

  • Step 1: Original span arrived
  • Step 2: Eval fires (hallucination_rate 0.68, above threshold)
  • Step 3: Human review (next call on the same failure path)
  • Step 4: Human queue (awaiting review)

The response

How TruLayer closes the loop

  • Hallucination
  • Faithfulness
  • PII Leakage

For clinical AI assistants, the evaluation discipline is the same as in any other high-consequence domain: every output span gets a score, and the score tells you whether the output should move downstream or be reviewed. TruLayer’s hallucination evaluator measures whether the assistant’s clinical assertions are grounded in the source material it was given. For a clinical AI that retrieves from a drug database or clinical guideline corpus, a hallucination score above threshold on a dosage or interaction claim means the model produced a clinical statement it cannot trace to its retrieved context. That span routes to a clinical reviewer queue before the same failure class can repeat on the next call, rather than after the patient interaction has completed.
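The score-then-route decision described above can be sketched in a few lines. This is a minimal illustration, not the TruLayer SDK: the `ScoredSpan` shape, the `routeSpan` function, the 0-to-1 score scale, and the 0.5 threshold are all assumptions made for the example.

```typescript
// Hypothetical shapes for illustration; these are not TruLayer SDK types.
type Route = 'deliver' | 'clinical_review'

interface ScoredSpan {
  spanId: string
  claimType: 'dosage' | 'interaction' | 'general'
  hallucinationRate: number // assumed scale: 0 = fully grounded, 1 = ungrounded
}

// Assumed threshold; the real value is whatever the engineering team configures.
const HALLUCINATION_THRESHOLD = 0.5

// A span strictly above the threshold goes to the reviewer queue; otherwise it is delivered.
function routeSpan(span: ScoredSpan, threshold: number = HALLUCINATION_THRESHOLD): Route {
  return span.hallucinationRate > threshold ? 'clinical_review' : 'deliver'
}

const flaggedSpan: ScoredSpan = { spanId: 'span-1', claimType: 'dosage', hallucinationRate: 0.68 }
const groundedSpan: ScoredSpan = { spanId: 'span-2', claimType: 'general', hallucinationRate: 0.12 }

console.log(routeSpan(flaggedSpan))  // clinical_review
console.log(routeSpan(groundedSpan)) // deliver
```

The point of the sketch is that the routing decision depends only on the score, not on how confident the response text sounds.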

The faithfulness evaluator runs alongside hallucination. Where hallucination catches invented content, faithfulness catches drift: the model retrieved the correct clinical source but summarized or paraphrased it in a way that changes the clinical meaning. A dosage described as "up to 400mg daily" becoming "400mg twice daily" in the summary is a faithfulness failure, not a hallucination — the source existed, but the output did not accurately represent it. Both evaluators are built-in; both run inline on every span as each trace arrives. The PII evaluator also runs on every span that might contain patient identifiers — ensuring that clinical context assembled for one patient does not carry protected fields into a subsequent interaction or a logging surface where they do not belong.
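To make the dosage example concrete, here is a toy check that flags the "up to 400mg daily" versus "400mg twice daily" drift. It is only a sketch under strong assumptions: a real faithfulness evaluator compares meaning rather than regex matches, and `parseDosage`, `checkDosageFaithfulness`, and the small set of supported phrasings are hypothetical names and choices invented for this illustration.

```typescript
// Toy illustration only: it parses a few dosage phrasings with a regex.
// It is not how a production faithfulness evaluator works.
interface Dosage {
  mgPerDose: number
  dosesPerDay: number
  isUpperBound: boolean // true when the text says "up to ..."
}

type Verdict = 'faithful' | 'drift' | 'unparsed'

// Illustrative frequency phrases mapped to doses per day (not exhaustive).
const DOSES_PER_DAY = new Map<string, number>([
  ['daily', 1],
  ['once daily', 1],
  ['twice daily', 2],
  ['three times daily', 3],
])

const DOSAGE_PATTERN = /(up to\s+)?(\d+)\s*mg\s+(once daily|twice daily|three times daily|daily)/i

function parseDosage(text: string): Dosage | null {
  const match = DOSAGE_PATTERN.exec(text)
  if (!match) return null
  const dosesPerDay = DOSES_PER_DAY.get((match[3] ?? '').toLowerCase())
  if (dosesPerDay === undefined) return null
  return {
    mgPerDose: Number(match[2]),
    dosesPerDay,
    isUpperBound: match[1] !== undefined,
  }
}

// Compares the total daily dose and the "up to" qualifier in source and output.
function checkDosageFaithfulness(sourceText: string, outputText: string): Verdict {
  const sourceDose = parseDosage(sourceText)
  const outputDose = parseDosage(outputText)
  if (!sourceDose || !outputDose) return 'unparsed'
  const sameDailyTotal =
    sourceDose.mgPerDose * sourceDose.dosesPerDay === outputDose.mgPerDose * outputDose.dosesPerDay
  const sameBound = sourceDose.isUpperBound === outputDose.isUpperBound
  return sameDailyTotal && sameBound ? 'faithful' : 'drift'
}

console.log(checkDosageFaithfulness('up to 400mg daily', '400mg twice daily'))  // drift
console.log(checkDosageFaithfulness('up to 400mg daily', 'Up to 400 mg daily')) // faithful
```

Both strings in the first call refer to a real source value, which is why this counts as drift rather than invented content: the daily total doubles and the "up to" qualifier disappears.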

When an eval rule fires, the control loop acts on the next call in the same failure path. The action types for clinical pipelines are the same three that apply across all TruLayer deployments: retry with a more constrained prompt that requires the model to cite its source for clinical claims; fall back to a model with lower generative latitude on clinical summaries; or route to a clinical reviewer queue for human verification before the output is returned to the end user. The choice of action is determined by the rule configuration — the engineering team defines the threshold, the action type, and the cascade depth. The per-trace remediation diff shows what changed after the control loop acted, giving the clinical AI engineering team a reviewable record of every flagged span, the score that triggered the flag, and whether the corrected output met threshold.
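One way to picture a rule with a threshold, an action type, and a cascade depth is as a small escalation table. The sketch below is hypothetical throughout: the `EvalRule` fields, the action names, the reading of cascade depth as "how many cascade steps may run", and the choice to end in human review once the cascade is exhausted are assumptions for illustration, not the TruLayer configuration schema.

```typescript
// Hypothetical rule shape; field names are assumptions, not a real configuration schema.
type Action = 'retry_constrained_prompt' | 'fallback_model' | 'human_review'

interface EvalRule {
  metric: string
  threshold: number
  cascade: Action[]    // actions to try, in order
  cascadeDepth: number // assumed meaning: how many cascade steps may run
}

// Returns the action for the next call on the same failure path, or 'none'
// when the score is within threshold. Once the cascade is exhausted, this
// sketch assumes the span goes to human review.
function nextAction(rule: EvalRule, score: number, attemptsSoFar: number): Action | 'none' {
  if (score <= rule.threshold) return 'none'
  const limit = Math.min(rule.cascadeDepth, rule.cascade.length)
  const action = attemptsSoFar < limit ? rule.cascade[attemptsSoFar] : undefined
  return action ?? 'human_review'
}

const dosageRule: EvalRule = {
  metric: 'hallucination_rate',
  threshold: 0.5, // assumed value
  cascade: ['retry_constrained_prompt', 'fallback_model', 'human_review'],
  cascadeDepth: 2,
}

console.log(nextAction(dosageRule, 0.68, 0)) // retry_constrained_prompt
console.log(nextAction(dosageRule, 0.68, 1)) // fallback_model
console.log(nextAction(dosageRule, 0.68, 2)) // human_review
console.log(nextAction(dosageRule, 0.3, 0))  // none
```

Keeping the rule as data rather than branching logic mirrors the idea that thresholds and actions are configured outside the application code.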

See it in practice

Instrument your clinical AI agent in two lines.

Wrap your LLM client. Every span from the wrapped client is captured and scored by every built-in evaluator. Eval rules and control-loop actions are configured in the dashboard.

agent.ts
import { TruLayer } from '@trulayer/sdk'
import OpenAI from 'openai'

const tl = new TruLayer({ apiKey: process.env.TRULAYER_API_KEY })
const openai = tl.instrument(new OpenAI())

// Every span from this client is captured, scored by all 25
// built-in evaluators, and surfaced in the clinical project.
// Eval rules + control-loop actions are configured in the dashboard,
// not in your application code.

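// `task` is a placeholder for the end user's question; the value
// below is illustrative only. Replace it with your own input.
const task = 'What is the usual adult dosage range for ibuprofen?'

const response = await openai.chat.completions.create({
  model: 'gpt-4o',
  messages: [{ role: 'user', content: task }],
})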

Ship reliable clinical AI assistants.

Free tier includes 1M spans / month · No credit card