Almost every “prompt engineering guide” you’ll find online was written for chatbots. The advice is good — for chatbots. “Be polite. Set context. Use examples. Ask the model to think step by step.” Pleasant. Useful in a marketing tool. Mostly wrong for SRE.
SRE workloads break the chatbot assumptions in three specific ways:
- The input is hostile. Alert payloads have nested labels nobody normalized, log lines that crashed the parser, error messages translated through five layers of middleware. Nothing is clean. Nothing is short.
- The output has to be machine-readable. Something downstream is going to consume it — a Slack formatter, a ticket creator, a remediation queue. “Helpful prose” is a bug.
- There is no human to recover from a wrong answer. When a chatbot hallucinates, the user re-prompts. When an SRE LLM hallucinates a runbook URL into a 3am alert summary, the on-call engineer follows it. There is no second chance.
After two years of building, reviewing, and operating LLM tools for SRE workflows — incident-scribe, alert-explainer, cloudandsre-skills, plus a pile of internal stuff I can’t show you — I have six prompt patterns I trust in production. None are clever. All of them earn their place.
1. The output contract is the prompt
If your prompt does not start with a JSON Schema or a literal example of the exact output structure you expect, you are not doing prompt engineering. You are vibing.
The single highest-ROI change you can make to any SRE LLM call is to specify the output as a contract, not a vibe:
Return JSON exactly matching this schema. No prose, no markdown, no
preamble. If a field is unknown, use null — never invent a value.
{
  "severity": "P1" | "P2" | "P3" | "P4" | "unknown",
  "category": "infra" | "app" | "config" | "external" | "unknown",
  "summary": string,             // <= 200 chars, factual only
  "first_actions": [string],     // 0-3 items, imperative voice
  "runbook_hint": string | null  // free text, no fabricated URLs
}
The model gets dramatically better the moment the output shape is non-negotiable. The downstream code gets to parse instead of regex. And — more importantly — the things the model does wrong become visible: a null is a known unknown, prose is an enforcement bug, an invented URL is a contract violation you can detect.
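Enforcement belongs on the consuming side too: parse into a typed model and treat any failure as a violation, not something to quietly repair. A minimal sketch, assuming pydantic v2; TriageBrief and parse_brief are illustrative names, not from any of the repos above:

from typing import List, Literal, Optional
from pydantic import BaseModel, Field, ValidationError

class TriageBrief(BaseModel):
    severity: Literal["P1", "P2", "P3", "P4", "unknown"]
    category: Literal["infra", "app", "config", "external", "unknown"]
    summary: str = Field(max_length=200)            # factual only
    first_actions: List[str] = Field(max_length=3)  # imperative voice
    runbook_hint: Optional[str] = None              # no fabricated URLs

def parse_brief(raw: str) -> Optional[TriageBrief]:
    # A failed parse is a contract violation. Log it and alert on it;
    # do not silently "fix" the output.
    try:
        return TriageBrief.model_validate_json(raw)
    except ValidationError:
        return None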
Every SRE LLM tool I’ve shipped has this kind of contract at the top of the prompt. It’s the difference between “AI tool” and “tool that uses AI.”
2. Anchor the model with one full example, not three half ones
The few-shot literature recommends 3–5 examples. For SRE that’s usually wrong, for two reasons: token budget (alert payloads are huge) and confusion (three slightly different examples teach the model that variation is acceptable, when what you want is rigidity).
Use one example. Make it complete. Make it look exactly like the input you actually expect, not a cleaned-up version. Include the messy nested label, the alphabet-soup pod name, the ANSI color codes nobody stripped.
EXAMPLE INPUT:
{
"alertname": "KubePodCrashLooping",
"labels": {"namespace":"prod","pod":"checkout-7d9b4f-xkz2m",...},
"annotations": {"summary":"Pod has restarted 12 times in 1h",...}
}
EXAMPLE OUTPUT:
{"severity":"P2","category":"app","summary":"checkout pod crash-looping in prod (12 restarts/1h)","first_actions":["check recent deploys","kubectl describe pod","kubectl logs --previous"],"runbook_hint":"deploy rollback if regression suspected"}
The example does more work than the schema and the instructions combined. Rule of thumb: if you find yourself writing more instructions to clarify the example, your example is wrong — fix the example, not the prose.
3. Forbid the model from being polite
For chatbots, “be helpful and friendly” is fine. For SRE tools, every word the model didn’t need to say is a token you paid for, a millisecond of latency, and a chance for prose to leak past your output parser.
Three lines I put in every system prompt:
- Do not greet, apologize, hedge, or summarize what you're about to do.
- Do not use markdown headers, bullets, or formatting unless the schema requires it.
- If you cannot answer, return the schema with null fields and stop. Do not explain why.
The hedging line in particular is doing real work. Models love to write “Based on the alert, it appears that…” before any actual content. That phrase carries no signal, inflates the token count, and — worse — gets pasted verbatim into incident channels by lazy downstream code.
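Pair the instruction with enforcement: reject anything that isn’t the schema before it reaches the channel. A minimal sketch; assert_no_preamble is a name I’m making up here:

def assert_no_preamble(raw: str) -> str:
    # If the model hedged anyway, fail loudly instead of letting
    # "Based on the alert, it appears that..." reach the incident channel.
    stripped = raw.strip()
    if not stripped.startswith("{"):
        raise ValueError(f"prose leaked past the contract: {stripped[:80]!r}")
    return stripped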
4. Give the model a written-down policy for “unknown”
The most expensive failure mode of an SRE LLM is confidently wrong. The cheapest way to prevent it is a policy on what to do when the model doesn’t know:
Unknown-handling policy (apply strictly):
- If a field is not derivable from the input, use null. Never guess.
- Never fabricate URLs, runbook IDs, ticket numbers, person names, or
resource identifiers. If the input doesn't contain it, it does not exist.
- If the entire input is unparseable, return:
{"severity":"unknown","category":"unknown",
"summary":"input could not be parsed","first_actions":[],
"runbook_hint":null}
This sounds obvious. It is not what models do by default. By default, models invent runbook URLs that look like the URLs they’ve seen in training data. They invent ticket numbers in your team’s format. They invent the on-call engineer’s name based on the alert namespace. None of this is acceptable.
A written-down null-policy is what stops it. Models follow rules they can read; they do not follow rules you assumed they’d infer.
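The null-policy is also checkable after the fact. One cheap guard, sketched here under the assumption that the only legitimate URLs are ones already present in the input:

import re

URL_RE = re.compile(r"https?://\S+")

def fabricated_urls(input_text: str, output_text: str) -> set[str]:
    # Any URL in the output that never appeared in the input was
    # invented by the model and violates the policy.
    return set(URL_RE.findall(output_text)) - set(URL_RE.findall(input_text))

A non-empty result is a contract violation you can count, alert on, and regression-test against.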
5. Move “context” out of the prompt and into a tool
Every prompt-engineering guide will tell you to “give the model context.” That advice is a trap for SRE tools. Stuffing the prompt with the last 24 hours of related alerts, the deploy log, the on-call schedule, and the most recent five postmortems will:
- blow up your context window,
- cost real money on real volume,
- leak data you didn’t mean to leak, and
- not actually improve the answer — the model can’t pay attention to all of it.
The right move is to give the model one thing in the prompt — the alert payload, the log line, the incident summary — and let it ask for anything else via tool calls. A lookup_recent_deploys(service, since) tool call is cheaper, more auditable, more cacheable, and almost always more accurate than a 50KB blob of context dumped into the prompt.
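What the declaration looks like depends on your client; as a sketch, here is lookup_recent_deploys in the OpenAI-style function-tool format (the handler behind it is yours to write):

LOOKUP_RECENT_DEPLOYS = {
    "type": "function",
    "function": {
        "name": "lookup_recent_deploys",
        "description": "Return deploys for a service since a timestamp.",
        "parameters": {
            "type": "object",
            "properties": {
                "service": {"type": "string"},
                "since": {"type": "string", "format": "date-time"},
            },
            "required": ["service", "since"],
        },
    },
}

The model only ever sees this declaration; whether and how the call actually executes is your code’s decision.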
This is the architectural shift behind the MCP gateway pattern I wrote about last week. Tools, not context, are how you give an SRE LLM access to your stack — and the gateway is what keeps the tool calls safe.
6. Test against recorded model outputs, not mocked ones
This isn’t a prompt pattern; it’s the testing pattern that makes the other five matter. Most “LLM unit tests” in the wild mock the model. They look like:
from unittest import mock

def test_alert_summary():
    fake_response = '{"severity":"P2","summary":"cool"}'
    with mock.patch("llm.call", return_value=fake_response):
        result = summarize_alert(payload)
    assert result.severity == "P2"
This test passes forever and tells you nothing.
The minimum viable test for an SRE LLM tool is a recorded-replay harness: real model output, captured once, replayed in CI. When you change the prompt, you re-run against the real model, save the new outputs, and diff them. The diff is your code review. The test is “did the schema parse” — but the diff is what tells you whether your prompt change was actually an improvement or a regression you didn’t notice.
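A minimal replay harness, assuming pytest, one recorded model output per JSON file under an "output" key, and the hypothetical TriageBrief contract model from pattern 1:

import json
import pathlib

import pytest

from triage_contract import TriageBrief  # hypothetical module holding the pattern-1 model

RECORDINGS = sorted(pathlib.Path("tests/recordings").glob("*.json"))

@pytest.mark.parametrize("case", RECORDINGS, ids=lambda p: p.stem)
def test_recorded_output_still_parses(case):
    # Real model output, captured once, replayed in CI. The assertion is
    # only "does the contract parse"; the git diff of the recordings is
    # the actual review of the prompt change.
    recorded = json.loads(case.read_text())
    TriageBrief.model_validate(recorded["output"])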
I have shipped exactly zero LLM tools to production without this harness. I have reviewed several built without it. Every single one had at least one prompt-engineering “improvement” in its history that quietly made the output worse, with no test failing to catch it.
Putting it together
A production-ready prompt for an SRE tool ends up looking like this — short, ugly, no fluff:
You translate a Prometheus alert into a structured triage brief.
Return JSON exactly matching this schema. No prose, no markdown,
no preamble. If a field is unknown, use null — never invent a value.
<schema>
... (as in pattern 1)
</schema>
Unknown-handling policy:
- Never fabricate URLs, runbook IDs, ticket numbers, or person names.
- If unparseable, return the unknown sentinel object.
Do not greet, apologize, hedge, or summarize what you're about to do.
EXAMPLE INPUT:
... (one full example, messy)
EXAMPLE OUTPUT:
... (one valid output)
INPUT:
{{ alert_payload }}
That’s it. Six patterns, one prompt template. Every SRE LLM tool I trust in production looks roughly like this. Every SRE LLM tool I’ve watched fail in production violated at least three of them.
The reason “prompt engineering for SRE” needs its own writeup isn’t that the techniques are exotic. It’s that the audience is different. Chatbots get to be charming. Tools that show up in your incident channel at 3am do not.
If you’re shipping LLM tools into your SRE stack and want to compare notes, my contact is at cloudandsre.com. The companion repos to this post are incident-scribe, alert-explainer, and cloudandsre-skills — all Apache-2.0.