Use case
AI Ops Agents
Your incident response agent restarts a healthy service that shared a label with the degraded one. Now you have two incidents and it is 2am. The original pager alert is still open. Nothing in the pipeline flagged the tool-call output before the restart executed.
Where things go wrong
Wrong service restarted (label match)
The agent matches on a service label that appears in both the degraded service and a healthy one running a dependency. It restarts the healthy service. The dependency goes down. The original incident is now a cascade.
A contained single-service incident becomes a multi-service outage; the additional downtime is attributable to the agent action, not the original failure.
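The ambiguity behind this failure can be reproduced in a few lines. The sketch below is illustrative only: the `Service` shape, the `resolveRestartTarget` helper, and the service names are hypothetical and not part of any TruLayer or orchestrator API. It shows a label selector matching both the degraded service and its healthy dependency, and a check that holds the restart rather than guessing.

```typescript
// Hypothetical types for illustration; not a real orchestrator API.
interface Service {
  id: string
  labels: Record<string, string>
  healthy: boolean
}

type RestartDecision =
  | { action: 'restart'; serviceId: string }
  | { action: 'hold'; reason: string }

// Resolve a label selector to a restart target. Hold instead of restarting
// when the selector does not match exactly one service, or when the single
// match is healthy.
function resolveRestartTarget(
  services: Service[],
  labelKey: string,
  labelValue: string,
): RestartDecision {
  const matches = services.filter((s) => s.labels[labelKey] === labelValue)
  const target = matches[0]
  if (matches.length !== 1 || target === undefined) {
    const ids = matches.map((s) => s.id).join(', ') || 'none'
    return {
      action: 'hold',
      reason: `selector matched ${matches.length} services: ${ids}`,
    }
  }
  if (target.healthy) {
    return { action: 'hold', reason: `${target.id} is healthy` }
  }
  return { action: 'restart', serviceId: target.id }
}

// The degraded service and a healthy dependency share the same label.
const services: Service[] = [
  { id: 'checkout-api', labels: { app: 'checkout' }, healthy: false },
  { id: 'checkout-cache', labels: { app: 'checkout' }, healthy: true },
]

const decision = resolveRestartTarget(services, 'app', 'checkout')
```

Here `decision` is a `hold` because the selector matches two services; an agent that picks the first match, or the wrong one, produces the cascade described above.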
Wrong team paged
The alert is in the payments service. The agent pattern-matches on a keyword and pages the platform on-call instead. The payments on-call is not paged. Forty-five minutes pass before a human notices the wrong team is responding.
Forty-five-minute response delay during an active incident; payments SLA breached; postmortem attributes root cause to agent misrouting.
Config push to prod instead of staging
The agent determines a configuration change is the correct fix and executes the runbook action against the wrong environment. The production environment now has a config change applied during an active incident — a change that was not reviewed, tested, or approved.
The config change compounds the incident or introduces a new failure mode; rolling back requires a separate incident response cycle.
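A mismatch between the environment a runbook step names and the environment a tool call targets is mechanically detectable. The sketch below assumes hypothetical `ConfigPush` and `GuardContext` shapes and a hypothetical `checkConfigPush` helper; it is an illustration of the comparison, not a description of how any particular product implements it.

```typescript
// Hypothetical types for illustration only.
type Environment = 'staging' | 'prod'

interface ConfigPush {
  environment: Environment
  key: string
  value: string
}

interface GuardContext {
  // Environment the runbook step names for this change.
  runbookEnvironment: Environment
  // Whether a human has approved this specific push.
  humanApproved: boolean
}

interface GuardResult {
  allowed: boolean
  reason: string
}

// Block the push when it targets a different environment than the runbook
// specifies, or when it targets prod without human approval.
function checkConfigPush(push: ConfigPush, ctx: GuardContext): GuardResult {
  if (push.environment !== ctx.runbookEnvironment) {
    return {
      allowed: false,
      reason: `runbook targets ${ctx.runbookEnvironment}, call targets ${push.environment}`,
    }
  }
  if (push.environment === 'prod' && !ctx.humanApproved) {
    return { allowed: false, reason: 'prod config pushes require human approval' }
  }
  return { allowed: true, reason: 'ok' }
}

// The agent targets prod while the runbook step names staging.
const guardResult = checkConfigPush(
  { environment: 'prod', key: 'pool_size', value: '50' },
  { runbookEnvironment: 'staging', humanApproved: false },
)
```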
Runbook step executed on wrong incident context
The agent executes a runbook action that is correct for a different class of incident than the one it was given. The runbook step is syntactically valid — it calls the right tool with well-formed parameters — but it is semantically wrong for the current failure mode.
Automated action delays recovery; the on-call engineer has to undo what the agent did before addressing the original failure.
Eval + control loop
What happens when a rule fires
The response
How TruLayer closes the loop
- Tool Choice
- Function Call
- Faithfulness
For AI ops agents, tool-call correctness is the evaluator that matters most. Every runbook action (service restart, pager escalation, config change, environment flag) is a tool call with parameters. The tool-call correctness evaluator scores whether the agent called the right action with the right parameters against the right target. A score below threshold means the agent’s tool call does not match what the runbook context specified, which is exactly what happens when it restarts the wrong service or targets prod instead of staging. Scoring happens inline on every span of every runbook execution, not after the postmortem surfaces the pattern.
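To make "right action, right parameters, right target" concrete, here is a deliberately naive scoring sketch. It is not TruLayer's evaluator: the `ToolCall` shape, the `scoreToolCall` function, the exact-match comparison, and the 0.8 threshold are all assumptions chosen for illustration. A production evaluator would likely be more nuanced than field-by-field equality.

```typescript
// Hypothetical tool-call shape for illustration only.
interface ToolCall {
  action: string
  target: string
  params: Record<string, string>
}

// Naive score: the fraction of expected fields (action, target, and each
// expected parameter) that the actual call matches exactly.
function scoreToolCall(actual: ToolCall, expected: ToolCall): number {
  const checks: boolean[] = [
    actual.action === expected.action,
    actual.target === expected.target,
    ...Object.entries(expected.params).map(([k, v]) => actual.params[k] === v),
  ]
  const passed = checks.filter(Boolean).length
  return passed / checks.length
}

// What the runbook context specifies for this incident.
const expected: ToolCall = {
  action: 'restart_service',
  target: 'checkout-api',
  params: { environment: 'staging' },
}

// What the agent actually called: right action, wrong target, wrong environment.
const actual: ToolCall = {
  action: 'restart_service',
  target: 'checkout-cache',
  params: { environment: 'prod' },
}

const ASSUMED_THRESHOLD = 0.8 // illustrative value, not a product default
const score = scoreToolCall(actual, expected) // 1 of 3 checks pass
const ruleFires = score < ASSUMED_THRESHOLD
```

In this example only the action matches, so the score is one third and the rule fires before the restart executes.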
The faithfulness evaluator runs alongside tool-call correctness. For runbook agents, faithfulness measures whether the agent’s chosen action is grounded in the runbook document it was given, that is, whether the step it is executing is actually specified for this class of incident. When both evaluators are configured on the same pipeline, a tool-call correctness failure and a faithfulness failure together signal the highest-risk scenario: the agent is calling the wrong action in the wrong context.
When a rule fires, the control loop acts before the next runbook step executes on the same failure path. It can:
- retry with a more constrained prompt that names the specific service and environment;
- fall back to a no-op action that pauses the runbook until a human confirms; or
- route to a human review queue so the on-call engineer approves the next action before it fires.
The human-in-the-loop (HITL) path is the appropriate default for any runbook action that is irreversible: config pushes, service restarts in prod, pager escalations.
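One way to picture the choice between retry, no-op fallback, and human review is as a small decision function. The sketch below is an assumption-laden illustration: in TruLayer these rules are configured in the dashboard rather than written in application code, and the `RuleInput` shape, the threshold, the retry limit, and the ordering of the branches are all hypothetical.

```typescript
// Hypothetical names and values for illustration only.
type ControlAction = 'proceed' | 'retry_constrained' | 'pause_noop' | 'human_review'

interface RuleInput {
  toolCallScore: number
  faithfulnessScore: number
  irreversible: boolean // e.g. config push, prod restart, pager escalation
  retriesUsed: number
}

const SCORE_THRESHOLD = 0.8 // assumed threshold
const MAX_RETRIES = 1 // assumed retry budget

function chooseControlAction(input: RuleInput): ControlAction {
  const toolFailed = input.toolCallScore < SCORE_THRESHOLD
  const faithFailed = input.faithfulnessScore < SCORE_THRESHOLD
  if (!toolFailed && !faithFailed) return 'proceed'
  // Irreversible actions default to human approval.
  if (input.irreversible) return 'human_review'
  // Both evaluators failing is the highest-risk case: wrong action, wrong context.
  if (toolFailed && faithFailed) return 'human_review'
  // A single failure on a reversible action gets a constrained retry first.
  if (input.retriesUsed < MAX_RETRIES) return 'retry_constrained'
  // Retries exhausted: pause the runbook until a human confirms.
  return 'pause_noop'
}

const nextAction = chooseControlAction({
  toolCallScore: 0.41,
  faithfulnessScore: 0.9,
  irreversible: true,
  retriesUsed: 0,
})
```

With an irreversible action and a tool-call correctness score of 0.41, `nextAction` is `human_review`, matching the default described above.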
The per-trace before/after delta surfaces exactly what the agent did, what score it received, what action the control loop took, and whether the retry produced a better result. For an SRE team running a postmortem, this is the difference between "the agent did something wrong and we are not sure what" and "here is the span, the tool-call correctness score was 0.41, the retry resolved it, the delay was 2 minutes." The trace does not replace the postmortem — it makes it shorter.
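As a rough picture of what such a delta might carry, the sketch below models one span before and after a retry. The `TraceDelta` shape, the field names, the span ID, and the post-retry score of 0.93 are hypothetical and chosen only to mirror the example in the paragraph above; they do not describe TruLayer's actual trace schema.

```typescript
// Hypothetical record shape for illustration only.
interface SpanScore {
  toolCall: string
  toolCallCorrectness: number
}

interface TraceDelta {
  spanId: string
  before: SpanScore
  controlAction: 'retry' | 'fallback' | 'human_review'
  after: SpanScore
  delaySeconds: number
}

// Produce a one-line postmortem summary from a before/after delta.
function summarizeDelta(d: TraceDelta): string {
  const change = d.after.toolCallCorrectness - d.before.toolCallCorrectness
  const outcome = change > 0 ? 'improved' : 'did not improve'
  const minutes = Math.round(d.delaySeconds / 60)
  return (
    `span ${d.spanId}: score ${d.before.toolCallCorrectness.toFixed(2)} -> ` +
    `${d.after.toolCallCorrectness.toFixed(2)} after ${d.controlAction} ` +
    `(${outcome}), delay ${minutes} min`
  )
}

const delta: TraceDelta = {
  spanId: 'span-123', // illustrative ID
  before: { toolCall: 'restart_service(checkout-cache)', toolCallCorrectness: 0.41 },
  controlAction: 'retry',
  after: { toolCall: 'restart_service(checkout-api)', toolCallCorrectness: 0.93 }, // illustrative score
  delaySeconds: 120,
}

const summary = summarizeDelta(delta)
// summary === 'span span-123: score 0.41 -> 0.93 after retry (improved), delay 2 min'
```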
See it in practice
Instrument your AI ops agent in two lines.
Wrap your LLM client. Every span from the wrapped client is captured and scored by every built-in evaluator. Eval rules and control-loop actions are configured in the dashboard.
import { TruLayer } from '@trulayer/sdk'
import OpenAI from 'openai'
const tl = new TruLayer({ apiKey: process.env.TRULAYER_API_KEY })
const openai = tl.instrument(new OpenAI())
// Every span from this client is captured, scored by all 25
// built-in evaluators, and surfaced in the incident-response project.
// Eval rules + control-loop actions are configured in the dashboard,
// not in your application code.
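// `task` is the incident description handed to the agent (placeholder value).
const task = 'Payments API p99 latency is above SLO; propose the next runbook step.'
const response = await openai.chat.completions.create({
  model: 'gpt-4o',
  messages: [{ role: 'user', content: task }],
})
Ship reliable AI ops agents.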
Free tier includes 1M spans / month · No credit card