Alert fatigue? Let AI triage.

How I built alert-explainer — an open-source service that sits between Alertmanager and your on-call routing and turns every Prometheus alert into a plain-English brief in 1–4 seconds for under a cent. Design, reliability patterns, and production tradeoffs.


The first incident I ever ran solo was a HighErrorRate page at 2:14 a.m.

I was twenty-five. The alert payload was four labels and a Grafana link. I spent the next eleven minutes piecing together what the alert meant — which service it referred to, what the threshold was, what had deployed in the last hour, whether anyone else was paged. By the time I’d assembled enough context to act, the symptom had partially resolved on its own and I was no longer sure whether what I’d done mattered.

That eleven minutes was not a knowledge problem. The information existed. It was scattered across a runbook wiki, a deploy log, a Slack channel, and the metric system itself. The chronic on-call problem isn’t that engineers aren’t smart enough; it’s that there’s too much context in too many places and not enough time to gather it when something is breaking.

This post is about a tool I built to close that gap. It’s called alert-explainer. The code is open source at github.com/ajinb/alert-explainer. It sits between Alertmanager and your on-call routing and turns every alert into a structured triage brief in 1–4 seconds for well under a cent per alert. The design decisions — especially the reliability patterns — are the interesting part, so that’s what I’ll focus on here.


The problem, stated honestly

The dream is: on-call engineer gets paged, sees a clean brief instead of a bag of labels, knows what to do.

The real problem is subtler. A useful triage brief has to:

  • Be fast enough to read before the runbook URL loads — under five seconds end-to-end, ideally under three.
  • Be cheap enough that you can run it on every alert — a service that costs $0.10 per enrichment dies in the next budget review when alert volume hits ten thousand a month.
  • Be honest about what it doesn’t know — a confidently-wrong triage brief is worse than no brief at all, because it consumes the engineer’s first read.
  • Degrade rather than break — if Claude is rate-limited or the network blips, the engineer still gets the original alert. No new outage on top of the existing one.
  • Be operable — fits into the same Prometheus / Alertmanager / Slack / PagerDuty plumbing the team already runs. No parallel observability empire.

Most of those constraints aren’t LLM problems. They’re plain reliability engineering, applied to a non-deterministic component. The architecture below is what falls out when you take those constraints seriously.


Architecture

The service has four stages, three of which are not the LLM.

                ┌──────────────────┐
   alerts ────▶ │  POST /webhook   │   Alertmanager v4 payload
                └─────────┬────────┘

                ┌──────────────────┐   Priority Queue (critical → warning → info)
                │ asyncio queue    │   Queue-Based Load Leveling (bounded)
                └─────────┬────────┘

                ┌──────────────────┐   Circuit Breaker around the LLM
                │ enrich (Claude)  │   Prompt caching on the system prompt
                └─────────┬────────┘

                ┌──────────────────┐
                │   downstream     │   Slack | PagerDuty | custom relay
                └──────────────────┘

The four stages map to four Azure Well-Architected reliability patterns, and that’s not an accident — those four patterns are exactly what this kind of fronting service needs:

Stage      Pattern                       What it gives you
Webhook    Queue-Based Load Leveling     Alertmanager retries don’t stack up behind LLM latency
Queue      Priority Queue                Critical alerts drain before info alerts when there’s pressure
Enrich     Circuit Breaker               LLM degradation passes through, doesn’t propagate as an outage
/readyz    Health Endpoint Monitoring    Load balancers can shed traffic when we’re saturated

Each one is short; most are under fifty lines of Python. I deliberately didn’t pull in pybreaker or build a Redis-backed queue on day one: the simplest thing that holds is the right starting point, and graduating to a library is a contained migration once you actually need half-open probing or per-tenant breakers. (More on that in a minute.)


The four patterns, made concrete

Queue-Based Load Leveling at the webhook

The webhook returns 202 Accepted immediately and enqueues the alert on a bounded asyncio.PriorityQueue. Alertmanager doesn’t wait for enrichment to finish; it gets a fast acknowledgment and moves on. If the queue is full, the response includes a dropped count and /readyz flips to 503, so Alertmanager and any fronting load balancer can back off according to their own retry semantics.

Why bounded? Because an unbounded queue under sustained overload is a memory leak with extra steps. The bound is configurable; the default is 1000 alerts in flight, which with four workers at three to four seconds per enrichment is roughly a fifteen-minute backlog. If you breach that, the right answer is to scale horizontally, not buffer harder.
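
A minimal sketch of the webhook stage, assuming FastAPI. The identifiers here are illustrative, not the project’s actual ones:

import asyncio
import itertools
import time

from fastapi import FastAPI

app = FastAPI()
queue: asyncio.PriorityQueue = asyncio.PriorityQueue(maxsize=1000)  # bounded on purpose
SEVERITY_RANK = {"critical": 0, "warning": 2, "info": 3}
_seq = itertools.count()  # stable tiebreaker (explained in the next section)

@app.post("/webhook", status_code=202)
async def webhook(payload: dict) -> dict:
    # Acknowledge fast; never make Alertmanager wait on LLM latency.
    dropped = 0
    for alert in payload.get("alerts", []):
        rank = SEVERITY_RANK.get(alert.get("labels", {}).get("severity"), 3)
        try:
            queue.put_nowait((rank, next(_seq), time.monotonic(), alert))
        except asyncio.QueueFull:
            dropped += 1  # shed load here; /readyz flips to 503 separately
    return {"accepted": True, "dropped": dropped}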

Priority Queue with stable tiebreakers

The priority is the alert’s severity label, mapped to integers (critical=0, warning=2, info=3). The queue stores tuples of (rank, sequence, monotonic_time, alert) — the sequence is an itertools.count() tiebreaker so two same-priority alerts never have to compare their Alert objects (which would crash on a Pydantic model lacking __lt__).

That tiebreaker took ten minutes to add and saved an entire class of “queue crashed at 3 a.m. because two alerts arrived in the same microsecond” debugging story. It’s the kind of small, paranoid choice that’s invisible when it works and the only thing that matters when it doesn’t.
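
Here’s a self-contained repro of the failure mode and the fix, using a stand-in Pydantic model rather than the project’s real Alert class:

import asyncio
import itertools

from pydantic import BaseModel

class Alert(BaseModel):  # stand-in for the real Alert model; defines no __lt__
    name: str

async def main() -> None:
    broken: asyncio.PriorityQueue = asyncio.PriorityQueue()
    broken.put_nowait((0, Alert(name="a")))
    try:
        # Same priority: the tuple comparison falls through to Alert < Alert.
        broken.put_nowait((0, Alert(name="b")))
    except TypeError as exc:
        print(f"without tiebreaker: {exc}")  # '<' not supported between instances...

    fixed: asyncio.PriorityQueue = asyncio.PriorityQueue()
    seq = itertools.count()
    fixed.put_nowait((0, next(seq), Alert(name="a")))
    fixed.put_nowait((0, next(seq), Alert(name="b")))  # seq settles the tie; alerts never compared

asyncio.run(main())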

Circuit Breaker around the LLM call

The breaker is a 50-line class. After N consecutive failures (default 5), it opens; further calls short-circuit with a CircuitOpen exception until a cool-down (default 30 s) elapses. The next call is retry-eligible — success closes the breaker, failure re-opens it.
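
A sketch of that shape — the project’s actual class differs in details, and the names here are illustrative:

import time

class CircuitOpen(Exception):
    """Short-circuit signal: the breaker is open, don't call the LLM."""

class CircuitBreaker:
    def __init__(self, failure_threshold: int = 5, cooldown_s: float = 30.0) -> None:
        self.failure_threshold = failure_threshold
        self.cooldown_s = cooldown_s
        self._failures = 0
        self._opened_at: float | None = None

    @property
    def open(self) -> bool:
        return self._opened_at is not None

    def before_call(self) -> None:
        # Open and still cooling down: refuse the call outright.
        if self._opened_at is not None and time.monotonic() - self._opened_at < self.cooldown_s:
            raise CircuitOpen
        # Cooldown elapsed: allow a probe call through.

    def record_success(self) -> None:
        self._failures = 0
        self._opened_at = None  # probe succeeded: close the breaker

    def record_failure(self) -> None:
        self._failures += 1
        if self._failures >= self.failure_threshold:
            self._opened_at = time.monotonic()  # open (or re-open after a failed probe)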

The most important behavior isn’t the breaker itself; it’s what happens when it opens. The worker catches CircuitOpen, emits a structured log line, and returns the original alert wrapped in an EnrichedAlert with enrichment=None and enrichment_error="llm_circuit_open". The downstream sink — Slack, PagerDuty, whatever — receives the alert with a polite “AI enrichment unavailable” note and the original payload underneath.
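
Glued to the class above, the worker’s fallback path looks roughly like this. call_claude, log, Alert, and EnrichedAlert are stand-ins shaped by the description, not the repo’s code, and error handling beyond the breaker is elided:

breaker = CircuitBreaker()

async def enrich(alert: "Alert") -> "EnrichedAlert":
    """Enrich if the breaker allows it; otherwise pass the alert through."""
    try:
        breaker.before_call()
        brief = await call_claude(alert)  # the actual LLM call
    except CircuitOpen:
        log.warning("enrichment skipped: llm_circuit_open")  # structured log line
        return EnrichedAlert(alert=alert, enrichment=None,
                             enrichment_error="llm_circuit_open")
    except Exception:
        breaker.record_failure()
        raise  # let the worker's retry/error path deal with it
    breaker.record_success()
    return EnrichedAlert(alert=alert, enrichment=brief)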

This is the pattern I want every reader to internalize: the AI layer is additive, not load-bearing. If the LLM is down, the on-call experience degrades back to the unenriched baseline. Nobody gets paged about a thing they would otherwise not have been paged about. Nobody misses a page they would otherwise have received. The blast radius of an LLM outage is bounded to “the briefs are slightly less useful for thirty seconds.”

Health Endpoint Monitoring with semantically distinct probes

/healthz is liveness — it always returns 200 unless the event loop is wedged. Kubernetes uses this to decide whether to restart the pod.

/readyz is readiness — it returns 503 when the queue is at 95% capacity or the breaker is open. Load balancers use this to decide whether to send traffic. The two probes are answering genuinely different questions, and conflating them is the failure mode I see most often in junior platform code.

@app.get("/readyz")
async def readyz(response: Response) -> dict:
    queue_full = queue.depth >= int(settings.queue_maxsize * 0.95)
    breaker_open = queue.breaker_open
    ready = not (queue_full or breaker_open)
    if not ready:
        response.status_code = 503
    return {"ready": ready, "queue_depth": queue.depth, "breaker_open": breaker_open}

The cost story

The cost target was < $0.01 per enriched alert. Hitting that target is what lets you run this on every alert rather than gating it to “interesting” ones (which becomes an interesting argument every quarter).

Three choices got there:

  1. Default to Sonnet, not Opus. Triage briefs are a narrow, well-specified task. Sonnet is roughly a fifth of Opus’s cost for output that an on-call engineer cannot tell apart at 2 a.m. Opus is the right call when planning is the whole job (think multi-agent supervisor); Sonnet is the right call here.
  2. Prompt-cache the system prompt. The system prompt is ~600 tokens of static instructions that don’t change between alerts. With Anthropic’s prompt caching, the second and subsequent calls read the cache at ~10% of the regular input price, which on a hot path with sustained alert volume cuts the per-alert input bill by close to 90% (sketched after this list).
  3. Cap the response. max_tokens=1024 is enough for a structured JSON triage brief; anything longer is the model rambling, and rambling at $15 per million output tokens is the line item that quietly burns the budget.
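
Here’s roughly what choices 2 and 3 look like with the Anthropic Python SDK. The model id and prompt text are placeholders, not the project’s actual values:

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

SYSTEM_PROMPT = "You are an SRE triage assistant. ..."  # ~600 static tokens

def enrich(alert_json: str) -> str:
    response = client.messages.create(
        model="claude-sonnet-4-20250514",  # Sonnet, not Opus (choice 1)
        max_tokens=1024,                   # cap the brief (choice 3)
        system=[{
            "type": "text",
            "text": SYSTEM_PROMPT,
            # Mark the static prompt cacheable; subsequent calls read it
            # at ~10% of the regular input price (choice 2).
            "cache_control": {"type": "ephemeral"},
        }],
        messages=[{"role": "user", "content": alert_json}],
    )
    return response.content[0].text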

The math, with current Sonnet-class pricing ($3 per million input tokens, $15 per million output): ~600 tokens of cached system prompt + ~400 tokens of alert payload + ~500 tokens of structured response ≈ $0.009 per enriched alert on a warm cache. A cold cache pays the one-time cache write and lands closer to $0.011. The warm-cache number is the steady state, and it sits inside the target.
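
The same arithmetic as a script, so you can rerun it when pricing changes. The rates are assumptions pinned at today’s Sonnet-class numbers, not fetched from anywhere:

# Back-of-envelope cost per enriched alert.
INPUT_PER_M, OUTPUT_PER_M = 3.00, 15.00    # USD per million tokens (assumed)
CACHE_READ_PER_M = INPUT_PER_M * 0.10      # cache reads at ~10% of input price
CACHE_WRITE_PER_M = INPUT_PER_M * 1.25     # one-time cache-write premium (assumed)

system_t, alert_t, brief_t = 600, 400, 500  # tokens per call

warm = (system_t * CACHE_READ_PER_M + alert_t * INPUT_PER_M + brief_t * OUTPUT_PER_M) / 1e6
cold = (system_t * CACHE_WRITE_PER_M + alert_t * INPUT_PER_M + brief_t * OUTPUT_PER_M) / 1e6
print(f"warm ~ ${warm:.4f}, cold ~ ${cold:.4f}")  # warm ~ $0.0089, cold ~ $0.0110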


What it deliberately doesn’t do

Three things, and the omissions are the most opinionated part of the design:

  • It doesn’t act. It enriches; it does not remediate. There’s no “click here to restart the pod” button. Every action is still a human’s action. This is the same line I hold in the book chapter on multi-agent SRE systems — the agent produces evidence, the human takes the action that touches production.
  • It doesn’t replace your runbooks. It surfaces what’s in the alert and what the model can reason about from labels and annotations. If your runbook says “page the database team and don’t restart the leader,” the model doesn’t know that unless your alert annotations say it. Real runbooks still belong in your knowledge base.
  • It doesn’t ship its own monitoring. OpenTelemetry spans, structured logs, JSON /metrics — these go into the SIEM and APM the team already runs. Standing up a parallel observability stack for the AI layer is one of the failure modes I see most often when teams adopt these patterns; this project refuses to participate.

Where this fits in the bigger picture

alert-explainer is the third tool I’ve published under the AI-native SRE banner this month, after sre-ai-toolkit and incident-scribe. Each one is a single, narrow surface that takes one production-realistic problem and applies AI to it with the discipline of an SRE rather than the energy of a demo.

The next ones in the pipeline — k8s-ai-operator, cloud-cost-ai, llm-log-analyzer, cloudandsre-python — follow the same shape. Narrow scope, real reliability patterns, clear operational story, working code in five minutes. If that’s the kind of thing you want more of, the easiest way to follow along is the GitHub profile at github.com/ajinb and the blog at cloudandsre.com.


Try it in five minutes

git clone https://github.com/ajinb/alert-explainer.git
cd alert-explainer
python -m venv .venv && source .venv/bin/activate
pip install -e .

export ALERT_EXPLAINER_ANTHROPIC_API_KEY=sk-ant-...
python -m alert_explainer &

curl -X POST http://localhost:8080/webhook \
  -H 'content-type: application/json' \
  -d @examples/sample_alert.json

The enriched alert prints to stdout. Set ALERT_EXPLAINER_DOWNSTREAM_WEBHOOK_URL to a Slack incoming webhook to forward it. Wire it into Alertmanager via examples/alertmanager-snippet.yml.
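
If you want the shape of that snippet without opening it, it’s roughly the following. The receiver name, service address, and port are assumptions; the canonical version lives in examples/alertmanager-snippet.yml:

receivers:
  - name: alert-explainer
    webhook_configs:
      - url: http://alert-explainer:8080/webhook
        send_resolved: true

route:
  receiver: alert-explainer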

If you run it against a real alert and it tells you something a runbook doesn’t, I want to hear about it. The interesting failure modes — and the interesting successes — are the ones I haven’t seen yet.


Ajin Baby is an AI Platform & Cloud Infrastructure Architect at a Fortune 500 aviation technology company and the founder of cloudandsre.com, where he publishes production-grade tooling at the intersection of AI and SRE. He is currently pursuing an MS in Artificial Intelligence and is a 15+ year IEEE member.