Observability and incident response — the SRE basics

A primer on the two operational disciplines every SRE team needs to run: observability (logs, metrics, traces) and incident response (roles, severities, blameless postmortems). Includes the practical shape of an incident and how AI is starting to absorb the lower rungs of both.


When a production system misbehaves, two disciplines are doing the work simultaneously: observability tells you what is happening and why, and incident response organizes the humans who use that information to make the system healthy again. They are inseparable in practice and worth understanding together.

This post is a primer on both — what each is, the standard taxonomy, and how the 2026 generation of AI tooling is starting to absorb the lower rungs of each.

A note on framing. I use “SRE” inclusively across this site — observability and incident response are practices every modern engineering organization needs, whether you call your team SRE, DevOps, platform engineering, or just “the on-call rotation.” The mechanics described here apply across all of those.


Part 1: Observability

What “observability” actually means

The term has been around since control theory in the 1960s — a system is observable if you can determine its internal state from its external outputs. In software, it has been over-claimed to the point of meaninglessness, but the working definition is useful: observability is the property of being able to ask new questions of a running system without shipping new code.

That definition matters because it draws the line between monitoring and observability. Monitoring answers questions you knew to ask in advance (CPU > 80%, error rate > 1%). Observability lets you investigate failures you didn’t predict.

The three pillars

The conventional taxonomy splits the data into three types. Each answers a different question.

Logs: discrete, timestamped, structured events. The right tool for “what specifically happened on this request, in this process, at this moment?” High cardinality (every request can have unique IDs), high cost at scale. Use for forensics and debugging, not for alerting.
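As a rough illustration of a structured log event, here is a minimal sketch using only the Python standard library. The logger name `checkout` and the fields `request_id` and `trace_id` are hypothetical, chosen only to show where high-cardinality identifiers would go.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object on one line."""

    def format(self, record: logging.LogRecord) -> str:
        event = {
            "ts": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname,
            "message": record.getMessage(),
            # Hypothetical high-cardinality fields, supplied via `extra=`.
            "request_id": getattr(record, "request_id", None),
            "trace_id": getattr(record, "trace_id", None),
        }
        return json.dumps(event)


def build_logger(name: str = "checkout") -> logging.Logger:
    """Return a logger that emits JSON lines to stderr."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    if not logger.handlers:  # avoid attaching duplicate handlers
        handler = logging.StreamHandler()
        handler.setFormatter(JsonFormatter())
        logger.addHandler(handler)
    return logger


logger = build_logger()
logger.info("payment declined", extra={"request_id": "req-123", "trace_id": "abc123"})
```

Carrying the trace ID on every event is what later makes it possible to jump from a log line to the matching trace.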

Metrics: pre-aggregated, time-series numbers. The right tool for “how is the system behaving in aggregate, over time?” Low cardinality (a label explosion will kill your Prometheus), cheap, fast to query. Use for SLI measurement, dashboards, and alerts.
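The arithmetic behind a label explosion is simple enough to sketch: in the worst case, one metric produces a time series for every combination of label values. The label names and counts below are made up for illustration.

```python
from math import prod


def series_count(label_values: dict[str, int]) -> int:
    """Upper bound on time series for one metric: the product of the
    number of distinct values each label can take."""
    return prod(label_values.values())


# Hypothetical labels with a small, bounded set of values each.
bounded = {"method": 5, "status_class": 5, "region": 4}

# The same metric after someone adds a per-user label.
exploded = {**bounded, "user_id": 100_000}

print(series_count(bounded))   # 100
print(series_count(exploded))  # 10000000
```

One unbounded label multiplies everything else, which is why identifiers like user or request IDs belong in logs and traces rather than in metric labels.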

Traces: the path of a single request through a distributed system, with timing at each hop. The right tool for “where did the latency come from?” and “what called what?” Sampled in production (you cannot afford 100% trace volume at scale).
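One common way to sample is to decide from the trace ID itself, so every service that sees the same trace makes the same keep-or-drop decision. The sketch below is a standalone illustration of that idea, not the implementation used by any particular tracing SDK; the function name and the choice of SHA-256 are assumptions.

```python
import hashlib


def keep_trace(trace_id: str, sample_rate: float) -> bool:
    """Deterministically keep roughly `sample_rate` of all traces.

    The trace ID is hashed into a number in [0, 1); the trace is kept
    when that number falls below the sample rate.
    """
    if not 0.0 <= sample_rate <= 1.0:
        raise ValueError("sample_rate must be between 0 and 1")
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < sample_rate


# The same trace ID always gets the same answer, on any host.
print(keep_trace("4bf92f3577b34da6a3ce929d0e0e4736", 0.10))
```

Because the decision depends only on the trace ID, a kept trace is kept at every hop and stays complete.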

The three pillars of observability — logs, metrics, traces

Courtesy: Honeycomb — What is observability? and OpenTelemetry — concepts

The pillars are not religious. Events, profiles, and exceptions show up in modern stacks alongside them, and the more interesting work happens at the joins — a trace ID in a log, a metric exemplar pointing at a trace, a profile attached to a slow span. The point of the model is not the count of pillars but that you can pivot between them.

A working observability stack in 2026

The CNCF-friendly default looks roughly like this:

  • Instrumentation: OpenTelemetry (the standard for all three signals).
  • Metrics: Prometheus + Thanos / Mimir for long-term storage.
  • Logs: Loki, OpenSearch, or a managed equivalent.
  • Traces: Tempo, Jaeger, or a managed equivalent.
  • UI: Grafana, or a vendor (Datadog, Honeycomb, New Relic).

What’s worth getting right early: a stable trace context propagated everywhere, structured logs (JSON, never plain text), and SLI-based alerts (alert when users are unhappy, not when a CPU graph crosses a line).
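To make "alert when users are unhappy" concrete, here is a minimal sketch of an error-budget burn-rate check. The function names, the 99.9% SLO target, and the 14.4 paging threshold are assumptions for illustration; 14.4 corresponds to spending 2% of a 30-day budget within one hour, a commonly cited starting point rather than a rule.

```python
def burn_rate(bad_events: int, total_events: int, slo_target: float) -> float:
    """How fast the error budget is being spent.

    1.0 means errors arrive at exactly the rate the SLO allows;
    higher values mean the budget runs out early.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_events == 0:
        return 0.0  # no traffic, nothing burned
    error_budget = 1.0 - slo_target
    observed_error_ratio = bad_events / total_events
    return observed_error_ratio / error_budget


def should_page(bad_events: int, total_events: int,
                slo_target: float = 0.999, threshold: float = 14.4) -> bool:
    """Page only when users are burning budget fast (assumed threshold)."""
    return burn_rate(bad_events, total_events, slo_target) >= threshold


# 2% of requests failing against a 99.9% SLO burns budget about 20x too fast.
print(should_page(bad_events=20, total_events=1000))
# 0.5% failing is about 5x: worth a ticket, not a page, under these assumptions.
print(should_page(bad_events=5, total_events=1000))
```

The input is the ratio of bad to total user requests, not a host-level signal such as CPU, so the alert only fires when users are actually affected.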

Reference observability stack — OpenTelemetry, Prometheus, Loki, Tempo, Grafana

Courtesy: Grafana — the LGTM stack and CNCF Observability TAG


Part 2: Incident response

What an incident actually is

An incident is an unplanned event that causes — or threatens to cause — a degradation in user-visible service. The word matters because it triggers a defined process: declare, respond, resolve, learn.

Not every alert is an incident. Not every incident requires a war room. The job of triage is to size what just happened.

Severity levels

Every team needs a shared vocabulary for “how bad is this?” The common ladder:

Severity | Meaning | Typical response
SEV1 | Complete outage, or critical user-facing breakage at scale | All-hands, customer comms within minutes, exec notification
SEV2 | Partial outage or major feature broken for many users | Dedicated IC, structured response, customer comms
SEV3 | Degraded experience or single-feature issue | Oncall handles in-channel, no exec ladder
SEV4 | Internal or low-impact issue, can wait for business hours | Ticket, fix during normal work

The exact thresholds don’t matter as much as everyone agreeing on them. The two common failure modes are severity inflation (everything is SEV2) and severity denial (nothing is ever SEV1 because that means waking the VP).
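One way to keep severities consistent is to write the agreed thresholds down as code. The sketch below does that for the SEV1 to SEV4 ladder; the inputs and the 50% and 5% cut-offs are purely hypothetical and stand in for whatever a team actually agrees on.

```python
from enum import IntEnum


class Severity(IntEnum):
    SEV1 = 1
    SEV2 = 2
    SEV3 = 3
    SEV4 = 4


def classify(user_visible: bool, affected_fraction: float,
             complete_outage: bool = False) -> Severity:
    """Map a rough impact estimate to a severity.

    The thresholds are illustrative assumptions, not a standard.
    """
    if not user_visible:
        return Severity.SEV4  # internal or low impact, can wait
    if complete_outage or affected_fraction >= 0.5:
        return Severity.SEV1  # outage, or breakage at scale
    if affected_fraction >= 0.05:
        return Severity.SEV2  # many users affected
    return Severity.SEV3      # degraded or single-feature issue


print(classify(user_visible=True, affected_fraction=0.2).name)  # SEV2
```

A shared function like this makes both inflation and denial harder, because changing a severity means arguing with the written threshold rather than with a colleague.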

Incident roles

Even small incidents benefit from explicit roles. The standard three:

  • Incident Commander (IC). Owns the response. Does not fix the system — coordinates the people who do. Decides when to escalate, when to call the incident over.
  • Communications Lead. Owns updates to customers, status pages, internal stakeholders. Frees the IC to focus on coordination.
  • Scribe. Captures the timeline in real time — what was tried, what was observed, what was decided. This is what the postmortem is built from.

In smaller teams these collapse into one person. Name the roles anyway, even if one person wears all three hats: naming them is what stops the response from drifting.
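The scribe's job maps naturally onto an append-only timeline. Here is a minimal sketch, assuming three entry kinds that mirror "what was tried, what was observed, what was decided"; the class and field names are hypothetical.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional

KINDS = {"tried", "observed", "decided"}


@dataclass(frozen=True)
class TimelineEntry:
    at: datetime
    kind: str
    text: str
    author: str


@dataclass
class IncidentTimeline:
    incident_id: str
    entries: list[TimelineEntry] = field(default_factory=list)

    def record(self, kind: str, text: str, author: str,
               at: Optional[datetime] = None) -> TimelineEntry:
        """Append one entry, timestamped in UTC unless `at` is given."""
        if kind not in KINDS:
            raise ValueError(f"kind must be one of {sorted(KINDS)}")
        entry = TimelineEntry(at or datetime.now(timezone.utc), kind, text, author)
        self.entries.append(entry)
        return entry

    def render(self) -> str:
        """Return the timeline in chronological order, one line per entry."""
        ordered = sorted(self.entries, key=lambda e: e.at)
        return "\n".join(
            f"{e.at:%H:%M:%S} [{e.kind}] {e.author}: {e.text}" for e in ordered
        )


timeline = IncidentTimeline("INC-0001")  # hypothetical incident ID
timeline.record("observed", "checkout error rate rising", author="sam")
timeline.record("decided", "roll back the latest deploy", author="alex")
print(timeline.render())
```

Entries are immutable once recorded, so the postmortem is built from what was known at the time rather than from hindsight.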

Incident response roles and command structure

Courtesy: PagerDuty — Incident Response documentation and Atlassian Incident Handbook

Blameless postmortems

After every significant incident — typically SEV2 and above — the team writes a postmortem. The non-negotiable rule is that it is blameless: it focuses on the system, the contributing factors, and the changes that would prevent recurrence. It does not focus on the human who clicked the button.

The reason is operational, not moral. Blame-driven postmortems produce defensive engineers who hide information, which produces shallow analyses, which produces repeat incidents. Blameless postmortems produce engineers who volunteer “here’s what I almost did wrong,” which is the substrate for actually learning.

A working postmortem has: a short summary, a timeline, root-cause analysis (usually multiple contributing factors, not one root cause), action items with owners and dates, and lessons learned. Action items without owners and dates are wishes, not commitments.
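The parts listed above can be captured as a small data structure with a check for the most common gap, action items that lack an owner or a date. This is a sketch under assumed field names, not a standard postmortem schema.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class ActionItem:
    description: str
    owner: Optional[str] = None
    due: Optional[date] = None


@dataclass
class Postmortem:
    summary: str
    timeline: list[str]
    contributing_factors: list[str]
    action_items: list[ActionItem]
    lessons_learned: list[str]


def review_gaps(pm: Postmortem) -> list[str]:
    """List what is missing before the postmortem can be called done."""
    gaps = []
    if not pm.summary.strip():
        gaps.append("missing summary")
    if not pm.timeline:
        gaps.append("missing timeline")
    if not pm.contributing_factors:
        gaps.append("missing contributing factors")
    if not pm.lessons_learned:
        gaps.append("missing lessons learned")
    for item in pm.action_items:
        if not item.owner:
            gaps.append(f"action item without owner: {item.description}")
        if item.due is None:
            gaps.append(f"action item without due date: {item.description}")
    return gaps


# Hypothetical draft with one complete and one incomplete action item.
draft = Postmortem(
    summary="Checkout errors for 40 minutes after a config change.",
    timeline=["10:01 error rate rising", "10:05 rollback started"],
    contributing_factors=["no canary stage", "alert fired late"],
    action_items=[
        ActionItem("Add canary stage", owner="alex", due=date(2026, 3, 1)),
        ActionItem("Tune alert threshold"),
    ],
    lessons_learned=["Rollback was fast because it was rehearsed."],
)
for gap in review_gaps(draft):
    print(gap)
```

The field is named `contributing_factors` rather than `root_cause` on purpose, since most incidents have several.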


What’s changing in 2026

Both disciplines are seeing the same shift. The lower rungs — the parts that are pattern-matching against historical data — are being absorbed by AI tooling. The judgement-heavy upper rungs are not.

In observability, that means LLM-assisted log triage, alert explanation in plain language, and automated correlation across the three pillars are now realistic to build in days, not quarters. (Observability for AI systems is the related and harder question — what changes when the service itself is calling an LLM, not just your observability stack.)

In incident response, agents are starting to draft the first version of the postmortem from the incident channel transcript, write the customer-facing status update, and propose remediation steps. They are not the IC, and teams are extending agent autonomy over those decisions carefully, for good reason: a wrong remediation during an incident can turn a SEV2 into a SEV1.

The discipline underneath both is unchanged. Measure what users see. Name the roles. Capture the timeline. Be blameless.

References