Observability for AI systems — what changes when your service calls an LLM

Your golden signals don't cover the failure that will actually page you: the model returned a confident, well-formed, wrong answer. What observability for AI-enabled systems has to add — context as a span, quality as a signal, and the shift from passive monitoring to active investigation — grounded in where the field is in 2026.


A traditional service fails loudly. It throws, it times out, it returns a 5xx, latency spikes, the error rate crosses a threshold and PagerDuty does its job. Every signal we built observability around for fifteen years assumes that failure is visible at the edges — in status codes, durations, and counts.

Now your service calls an LLM, and its most important failure mode produces a 200 OK in 600 milliseconds with a perfectly well-formed body that is wrong. The model was confident. The JSON validated. The latency was great. And the answer was hallucinated, or subtly off, or sycophantically agreed with a bad premise in the prompt. None of your golden signals moved.

That’s the whole problem in one sentence: the failure has moved inside the response, where your existing observability stack doesn’t look. Everything in this post is about what you add to see it.


Your existing stack still matters — it just stops short

Let me be clear up front, because the LLM-observability vendors won’t be: latency, error rate, saturation, and traffic are not obsolete. An LLM call is still a network call to a thing that can be slow, rate-limited, down, or expensive, and you absolutely still monitor all of that. The token-cost dimension is genuinely new and worth a dashboard of its own — cost is now a first-class operational signal, not a monthly surprise.
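To make "cost as a signal" concrete, here is a minimal sketch that turns the token counts of a single call into a dollar figure you could emit as a metric. The price table, the model name, and the LLMCallUsage type are all hypothetical, invented for illustration; substitute your provider's actual rates and whatever usage object your client returns.

```python
from dataclasses import dataclass

# Hypothetical prices in USD per million tokens. These are NOT real provider
# rates; they exist only so the arithmetic below has something to work with.
PRICES_PER_MTOK = {"example-model": {"input": 3.00, "output": 15.00}}


@dataclass(frozen=True)
class LLMCallUsage:
    """Token usage for one LLM call (illustrative type, not a library class)."""
    model: str
    input_tokens: int
    output_tokens: int


def call_cost_usd(usage: LLMCallUsage) -> float:
    """Cost of one call, priced separately for input and output tokens."""
    price = PRICES_PER_MTOK[usage.model]
    total = (usage.input_tokens * price["input"]
             + usage.output_tokens * price["output"])
    return total / 1_000_000


usage = LLMCallUsage("example-model", input_tokens=12_000, output_tokens=800)
cost = call_cost_usd(usage)  # emit this per call, tagged by route and model
```

Emitting the figure per call, rather than reading it off an invoice, is what lets you alert on a route whose prompt size quietly doubled.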

But the four golden signals were designed for a world where a well-formed, on-time response is a successful response. That assumption is exactly what breaks. So the message isn’t “replace your observability.” It’s “your observability now has a blind spot precisely where it used to be complete.” Three additions close it.


Addition 1: context as a first-class span

When a microservice does the wrong thing, the cause is in a log line or a trace. When an LLM-backed system does the wrong thing, the cause is overwhelmingly a context deficiency — the model didn’t have the right information in front of it at inference time. The 2026 consensus on agent failures keeps landing on this: most of them are not model failures, they’re assembly failures.

And here’s the operational gut-punch: the assembled context — the actual system prompt, retrieved chunks, tool results, and memory that the model saw — is the single most diagnostic artifact in the entire system, and almost nobody captures it. You have logs of what the model returned and no record of what it saw. That’s debugging a crash with the stack trace deleted.

The fix is concrete and boring in the best way. Treat the assembled context as a span attribute. The 2026 tooling has converged on the open path here — instrument with OpenInference / OpenTelemetry libraries that emit OTLP-compatible spans, and export to whatever backend you already run (Phoenix, Langfuse, Datadog, your own collector — the point is it’s not a walled garden). Each LLM call becomes a span that carries:

  • the resolved prompt and which retrieval/memory sources fed it,
  • the model, parameters, and token counts,
  • the raw output and any post-validation verdicts.

Once context is a span, “why did the agent do that” stops being archaeology and becomes a query. This is the observability counterpart to building context well in the first place, which I covered in my post on context engineering: engineer the window on the way in, capture it as telemetry on the way out.
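A rough sketch of what "context as a span" can look like, using only the standard library so it runs anywhere. LLMSpan is a stand-in for a real tracing span, the attribute names (llm.model, llm.context, and so on) are illustrative rather than taken from any semantic convention, and call_model is a hypothetical callable wrapping your client. In a real setup you would set the same attributes on an OpenTelemetry span and let your exporter ship them.

```python
import hashlib
import json
import time
import uuid
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class LLMSpan:
    """Stand-in for a tracing span; not an OpenTelemetry class."""
    name: str
    trace_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    start_ns: int = field(default_factory=time.time_ns)
    end_ns: Optional[int] = None
    attributes: Dict[str, object] = field(default_factory=dict)


def traced_llm_call(system_prompt: str, retrieved_chunks: List[dict],
                    tool_results: List[str], model: str, params: dict,
                    call_model: Callable) -> LLMSpan:
    """Run one model call and record what the model saw next to what it said."""
    span = LLMSpan(name="llm.call")
    context = {
        "system_prompt": system_prompt,
        "retrieved_chunks": retrieved_chunks,
        "tool_results": tool_results,
    }
    serialized = json.dumps(context, sort_keys=True)
    # Capture the assembled context BEFORE the call, so it survives a failure.
    span.attributes.update({
        "llm.model": model,
        "llm.params": json.dumps(params, sort_keys=True),
        "llm.context": serialized,
        "llm.context.sha256": hashlib.sha256(serialized.encode()).hexdigest(),
        "llm.context.sources": [c["source"] for c in retrieved_chunks],
    })
    output, usage = call_model(context)  # hypothetical client wrapper
    span.attributes.update({
        "llm.output": output,
        "llm.tokens.input": usage["input_tokens"],
        "llm.tokens.output": usage["output_tokens"],
    })
    span.end_ns = time.time_ns()
    return span


def fake_model(context: dict):
    """Fake model used only to make the sketch runnable."""
    return ("Refunds are available within 30 days.",
            {"input_tokens": 42, "output_tokens": 9})


span = traced_llm_call(
    system_prompt="Answer only from the provided policy text.",
    retrieved_chunks=[{"source": "kb/refunds.md", "text": "Refunds: 30 days."}],
    tool_results=[],
    model="example-model",
    params={"temperature": 0.2},
    call_model=fake_model,
)
```

The hash is there so you can group calls by identical context without comparing full prompts. In practice you would probably also redact or truncate the serialized context before export, since it can carry user data.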


Addition 2: quality is a signal now

This is the one that breaks people’s mental model. In classic observability, correctness is assumed and performance is measured. For LLM systems you have to measure correctness too, continuously, in production — because it varies turn to turn even with identical inputs.

That means a class of signals that didn’t exist on your old dashboards:

  • Factual grounding — did the answer stay anchored to the retrieved context, or did it invent? Often checkable automatically against the very context span you’re now capturing.
  • Logical / schema validity beyond “did it parse” — did it satisfy the actual constraints of the task, not just the JSON shape.
  • Sycophancy and drift — is the model agreeing with bad premises in the prompt, and is its behavior sliding over time as inputs shift? Model decay is a reliability problem, and it’s invisible unless you score for it.
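As a deliberately crude illustration of the grounding check, the sketch below scores an answer by the fraction of its sentences whose content words mostly appear in the captured context. This is a lexical-overlap heuristic I am using as a stand-in, not a real grounding evaluator; the stopword list and the 0.6 threshold are arbitrary assumptions, and a production check would more likely use an entailment model or an LLM judge.

```python
import re

# Tiny illustrative stopword list; a real one would be much longer.
STOPWORDS = {"the", "a", "an", "is", "are", "of", "to", "in", "and",
             "for", "on", "within", "it"}


def content_words(text):
    """Lowercased alphanumeric tokens with stopwords removed."""
    return {w for w in re.findall(r"[a-z0-9]+", text.lower())
            if w not in STOPWORDS}


def grounding_score(answer, context, min_overlap=0.6):
    """Fraction of answer sentences whose content words mostly occur in context."""
    context_words = content_words(context)
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", answer.strip()) if s]
    scored = [content_words(s) for s in sentences]
    scored = [words for words in scored if words]  # skip empty sentences
    if not scored:
        return 0.0
    supported = sum(
        1 for words in scored
        if len(words & context_words) / len(words) >= min_overlap
    )
    return supported / len(scored)


context = ("Refunds are available within 30 days of purchase. "
           "Shipping fees are not refundable.")
answer = ("Refunds are available within 30 days of purchase. "
          "Refunds are also paid out in store credit plus a 10 percent bonus.")

score = grounding_score(answer, context)  # first sentence supported, second not
```

Here the first sentence is fully covered by the context and the second shares only one content word with it, so the score is 0.5. The point is less the heuristic than the shape: the context span is the input, and the output is a number you can chart.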

The mechanism teams are standardizing on in 2026 is automated quality scoring in the pipeline — a cheaper model, a rules layer, or an eval harness judging a sampled stream of production outputs and emitting scores as metrics you can alert on. The mindset shift: an LLM feature isn’t “up” because it’s returning 200s. It’s up when its quality distribution is inside the band you’ve decided is acceptable. That band is an SLO. Treat it like one.
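One way to read "the band is an SLO" in code: sample a deterministic slice of traffic, score it, and track what fraction of recent scores clears a minimum. The sketch below assumes scores in the range 0 to 1 coming from whatever judge you use; the class name, thresholds, and window size are all hypothetical choices, not a standard.

```python
import zlib
from collections import deque


def should_sample(request_id: str, rate: float) -> bool:
    """Deterministic sampling: the same request id always gets the same verdict."""
    bucket = zlib.crc32(request_id.encode("utf-8")) % 10_000
    return bucket < int(rate * 10_000)


class QualitySLO:
    """Tracks the share of recent quality scores at or above a minimum."""

    def __init__(self, min_score=0.7, target_good_fraction=0.95, window=200):
        self.min_score = min_score
        self.target_good_fraction = target_good_fraction
        self.scores = deque(maxlen=window)  # sliding window of recent scores

    def record(self, score: float) -> None:
        self.scores.append(score)

    def good_fraction(self) -> float:
        if not self.scores:
            return 1.0
        good = sum(1 for s in self.scores if s >= self.min_score)
        return good / len(self.scores)

    def breached(self, min_samples: int = 20) -> bool:
        """True once enough samples exist and too few of them are good."""
        if len(self.scores) < min_samples:
            return False
        return self.good_fraction() < self.target_good_fraction


slo = QualitySLO(min_score=0.7, target_good_fraction=0.9, window=50)
for _ in range(40):
    slo.record(0.9)
healthy = slo.breached()   # all 40 samples are good

for _ in range(10):
    slo.record(0.3)
degraded = slo.breached()  # 40 of 50 good is 0.8, below the 0.9 target
```

The min_samples guard keeps a handful of early bad scores from paging anyone, and the sliding window means a regression shows up as a falling fraction rather than a single bad data point.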


Addition 3: from passive monitoring to active investigation

The third shift is the most forward-looking, and it’s where the field is visibly moving this year. Classic observability is passive: you emit telemetry and wait for a human to query it when something looks wrong. The 2026 direction is active — the observability layer itself runs autonomous investigation, forming hypotheses and pulling the next diagnostic signal rather than waiting to be asked.

This is genuinely powerful and it is also the snake eating its own tail, which is why I’ll flag the discipline directly: the investigating agent is itself an AI system that needs every kind of observability in this post. If your root-cause agent is non-deterministic and unobserved, you haven’t removed the 3am uncertainty; you’ve added a layer to it. The teams doing this well keep the investigator firmly assistive: it gathers, correlates, and proposes a causally consistent root cause with its evidence, and a human still owns the call to act. That constraint isn’t timidity; it’s what keeps the tool on the reliability side of the ledger. I made the broader version of this argument in my post “When NOT to use AI in production SRE.”
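The "assistive, not autonomous" constraint can be enforced structurally rather than by convention. A minimal sketch, with every name hypothetical: the investigator can only produce a Proposal, and the function that acts refuses to run one that no human has signed off on.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


class ApprovalRequired(Exception):
    """Raised when an action is attempted without a human sign-off."""


@dataclass
class Proposal:
    """What the investigating agent is allowed to hand back (illustrative)."""
    hypothesis: str
    evidence: List[str]
    proposed_action: str
    approved_by: Optional[str] = None


def approve(proposal: Proposal, human: str) -> Proposal:
    """Record a human sign-off; a proposal with no evidence cannot be approved."""
    if not proposal.evidence:
        raise ValueError("refusing to approve a proposal with no evidence")
    proposal.approved_by = human
    return proposal


def act(proposal: Proposal, run_action: Callable[[str], str]) -> str:
    """Execute the proposed action only if a human has approved it."""
    if proposal.approved_by is None:
        raise ApprovalRequired(proposal.proposed_action)
    return run_action(proposal.proposed_action)


proposal = Proposal(
    hypothesis="Retrieval index is stale after last night's reindex job",
    evidence=["grounding score dropped at 02:10", "reindex job failed at 02:04"],
    proposed_action="rerun-reindex",
)

try:
    act(proposal, run_action=lambda action: f"ran {action}")
    blocked = False
except ApprovalRequired:
    blocked = True  # the agent alone cannot trigger the action

result = act(approve(proposal, human="oncall@example.com"),
             run_action=lambda action: f"ran {action}")
```

The approver's identity lands on the proposal itself, so the record of who made the call travels with the evidence they made it on.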


What good looks like

If you’re adding an LLM to a service this quarter, here’s the minimum bar I’d hold the work to:

  1. Keep the golden signals — latency, errors, saturation, traffic — on the LLM call. Add token cost as a peer signal.
  2. Capture the assembled context as a span. If you log one new thing, log what the model saw. Make it OTLP-compatible, exportable, and not locked to a vendor.
  3. Score a sampled stream of outputs for quality — grounding, validity, drift — and put an SLO and an alert on the distribution.
  4. If you add an investigation agent, observe it like any other agent and keep it assistive until you’ve earned the right not to.

The uncomfortable truth is that adding an LLM to your system doesn’t just add a dependency — it adds a dependency whose failures are invisible to the entire observability practice you’ve spent a career building. The good news is that closing the gap is mostly disciplined instrumentation, not magic. Capture what the model saw, measure whether it was right, and keep a human on the trigger. The rest is the observability you already know.


Related: Context engineering: the window is a budget · When NOT to use AI in production SRE · Observability and incident response basics