Harness engineering: the third phase of AI maturity

Agent = Model + Harness. In 2026 the model is rarely the bottleneck; the scaffolding around it is. Here's what a production-grade SRE harness actually contains, with a reference implementation you can run offline and a core loop of about 40 lines: tool orchestration, guardrails, context and memory, verification, and observability.


There’s a quiet result from this spring that should reorder how you think about building agents. In March 2026 the LangChain team took their coding agent from 30th to 5th on Terminal Bench 2.0 — without touching the underlying model. Same weights, same API. The entire gain came from rebuilding the code around the model: the tool loop, the verification steps, the context handling, the guardrails.

That’s the whole thesis of harness engineering, and it’s the third phase of a progression you’ve already lived through:

  • 2022–2023: prompt engineering. Get the wording right.
  • 2024–2025: context engineering. Get the right information into the window.
  • 2026: harness engineering. Engineer the environment the model runs in — the loop, the tools, the rails, the audit trail.

The equation people keep writing is Agent = Model + Harness. The model reasons; the harness is what turns reasoning into something you’d actually let near production. After a year of building agentic infrastructure for a safety-critical platform, I’m convinced the harness is where almost all the reliability lives — and where almost none of the public examples spend their effort.

So this post does two things: it names the five components a real harness has, and it ships a working reference harness you can clone and run in two minutes with no API key.


Why the harness, not the model

A frontier model dropped into a bare while loop is a demo. It will confidently restart the wrong pod, conclude a root cause it never gathered evidence for, run forever when it gets confused, and leave you nothing to reconstruct afterward. None of those are reasoning failures. They’re harness failures — the model was never given a structure that made the safe path the only path.

This is the same argument I made about the MCP gateway pattern: the intelligence is additive, the infrastructure is load-bearing. The gateway is the harness for tool access specifically. Harness engineering is the same instinct applied to the whole agent loop.


The five components

The industry has more or less converged on five. Names vary; the jobs don’t. Here’s how each one looks in my reference harness, sre-harness — a small SRE diagnostic agent that triages a flapping pod.

1. Tool orchestration

The agent acts only through a registry of declared tools, and each tool carries one piece of metadata that everything else keys off of: is it mutating?

reg.register(ToolSpec("get_logs", "Recent log lines for one pod.", get_logs))
reg.register(ToolSpec("restart_pod", "Restart a pod. Mutating.", restart_pod, mutating=True))

The agent never decides for itself whether an action is safe. The tool’s own declaration does. get_logs is read-only and runs freely; restart_pod is mutating and has to clear the rails below. That single boolean is the seam the whole safety model hangs on.
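To make the registry idea concrete, here is a minimal sketch of what a ToolSpec and its registry could look like. The constructor shape mirrors the two register calls above; the ToolRegistry class, its get and catalog methods, and the stub tool bodies are assumptions for illustration, not the actual sre-harness code.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class ToolSpec:
    """Declares one tool; `mutating` is the flag the guardrails key off."""
    name: str
    description: str
    fn: Callable[..., str]
    mutating: bool = False  # read-only unless the declaration says otherwise

    def __call__(self, **kwargs) -> str:
        return self.fn(**kwargs)


class ToolRegistry:
    """Hypothetical registry: the only path from the agent to a tool."""

    def __init__(self) -> None:
        self._tools: Dict[str, ToolSpec] = {}

    def register(self, spec: ToolSpec) -> None:
        if spec.name in self._tools:
            raise ValueError(f"tool {spec.name!r} already registered")
        self._tools[spec.name] = spec

    def get(self, name: str) -> ToolSpec:
        # Undeclared tools fail loudly instead of being improvised.
        if name not in self._tools:
            raise KeyError(f"unknown tool {name!r}")
        return self._tools[name]

    def catalog(self) -> List[dict]:
        """What the model is shown: name, description, and the mutating flag."""
        return [
            {"name": s.name, "description": s.description, "mutating": s.mutating}
            for s in self._tools.values()
        ]


# Stub tools standing in for real cluster calls (illustrative only).
def get_logs(pod: str) -> str:
    return f"(stub) recent log lines for {pod}"


def restart_pod(pod: str) -> str:
    return f"(stub) restarted {pod}"


reg = ToolRegistry()
reg.register(ToolSpec("get_logs", "Recent log lines for one pod.", get_logs))
reg.register(ToolSpec("restart_pod", "Restart a pod. Mutating.", restart_pod, mutating=True))
```

Defaulting mutating to False keeps read-only declarations terse, but it also means a forgotten flag fails open. A stricter variant might require the flag on every declaration.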

2. Guardrails

Three independent rails, checked in order before any tool runs:

  • Step budget — a hard ceiling on actions, so a confused agent burns out instead of looping forever.
  • Allowlist — the agent may only call tools explicitly granted for this run. A read-only investigation simply doesn’t get restart_pod in its set.
  • Approval gate — mutating tools require a human (or a policy) to approve the specific call. Read-only tools pass straight through.

The point isn’t that writes are forbidden. It’s that the AI layer is read-only by default, and every mutation is an explicit, recorded yes — not an emergent decision. This is the same line every tool I’ve shipped this year holds: alert-explainer tells you what to investigate, never what to mutate; the SRE skills refuse to propose remediation. An agent that confidently says “restart the pod” and just does it is exactly the agent you don’t want anywhere near production at 2 a.m.
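The three rails can be sketched in a few dozen lines. This is an illustration under assumptions, not the repo's implementation: the Decision and Guardrails shapes, the deny_all default, and the max_steps default are hypothetical, and authorize takes a tool name plus its mutating flag (rather than a ToolSpec) so the block stands alone.

```python
from dataclasses import dataclass
from typing import Callable, Dict, FrozenSet


@dataclass(frozen=True)
class Decision:
    allowed: bool
    reason: str = ""


# An approver sees the tool name and the exact arguments of this one call.
Approver = Callable[[str, Dict[str, object]], bool]


def deny_all(tool: str, args: Dict[str, object]) -> bool:
    """Default approver: every mutation is refused unless someone opts in."""
    return False


class Guardrails:
    def __init__(self, allowlist: FrozenSet[str], max_steps: int = 8,
                 approver: Approver = deny_all) -> None:
        self.allowlist = allowlist
        self.max_steps = max_steps
        self.approver = approver
        self.steps = 0

    def check_budget(self) -> Decision:
        """Rail 1: a hard ceiling on actions."""
        if self.steps >= self.max_steps:
            return Decision(False, f"step budget of {self.max_steps} exhausted")
        self.steps += 1
        return Decision(True)

    def authorize(self, tool: str, mutating: bool,
                  args: Dict[str, object]) -> Decision:
        """Rails 2 and 3, in order: allowlist first, then approval for mutations."""
        if tool not in self.allowlist:
            return Decision(False, f"tool '{tool}' is not in the allowlist for this run")
        if mutating and not self.approver(tool, args):
            return Decision(False, f"mutating tool '{tool}' was not approved")
        return Decision(True)


# A read-only investigation: restart_pod is simply not granted.
read_only = Guardrails(allowlist=frozenset({"get_logs"}))
# A remediation run: the tool is granted, but the default approver still says no.
remediation = Guardrails(allowlist=frozenset({"get_logs", "restart_pod"}))
```

Because the default approver refuses everything, a run that forgets to wire in an approver stays read-only rather than silently gaining write access.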

3. Context & memory

Working memory holds the task plus every observation gathered so far — and compacts it before feeding it back, so a long investigation doesn’t blow the context window:

def render(self) -> str:
    recent = self._observations[-self.max_observations:]
    dropped = len(self._observations) - len(recent)
    ...

In the reference it’s a list with a cap. In production it’s a vector store, a summarizer, a tiered memory system. The loop doesn’t care — it depends only on add_observation and render. That seam is the point: you can replace the entire memory strategy without touching the agent loop.
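For readers who want the elided part spelled out, one possible completion of a capped working memory is sketched below. The first two lines of render follow the excerpt above; the constructor, the rendered text format, and the validation of max_observations are assumptions.

```python
from typing import List


class WorkingMemory:
    """Holds the task plus observations; render() shows only the most recent ones."""

    def __init__(self, task: str, max_observations: int = 5) -> None:
        # A cap of 0 would make the slice below return everything, so reject it.
        if max_observations < 1:
            raise ValueError("max_observations must be at least 1")
        self.task = task
        self.max_observations = max_observations
        self._observations: List[str] = []

    def add_observation(self, text: str) -> None:
        self._observations.append(text)

    def render(self) -> str:
        recent = self._observations[-self.max_observations:]
        dropped = len(self._observations) - len(recent)
        lines = [f"Task: {self.task}"]
        if dropped:
            # Tell the model that context was compacted instead of hiding it.
            lines.append(f"({dropped} earlier observations omitted)")
        lines.extend(f"- {obs}" for obs in recent)
        return "\n".join(lines)
```

A summarizer-backed or vector-backed memory would expose the same two methods, which is what keeps the loop unaware of the swap.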

4. Verification loops

Before the harness returns an answer, it checks the answer against what actually happened. A verifier is a small predicate that knows one thing about what a valid result looks like:

  • evidence_required — the agent must have called a read tool before concluding. A conclusion with zero gathered evidence is a hallucination, and we catch it structurally rather than hoping the model behaved.
  • no_unapproved_mutation — defense in depth behind the guardrail: the trace must contain no executed mutation that lacked an approval.

This is the part most home-grown agents skip, and it’s the cheapest reliability you’ll ever buy. You don’t need the model to be trustworthy if you can check its work.
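Both verifiers reduce to small predicates over the trace. The sketch below assumes the trace is a list of event dicts and that an executed tool appears as a hypothetical tool_result event carrying mutating and approved fields; those names, the Verdict type, and this run_verifiers signature are illustrative and may differ from the repo's.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Event = Dict[str, object]


@dataclass(frozen=True)
class Verdict:
    name: str
    passed: bool
    detail: str = ""


Verifier = Callable[[List[Event], str], Verdict]


def evidence_required(trace: List[Event], answer: str) -> Verdict:
    """Pass only if at least one read-only tool actually returned a result."""
    reads = [e for e in trace
             if e.get("kind") == "tool_result" and not e.get("mutating", False)]
    if reads:
        return Verdict("evidence_required", True)
    return Verdict("evidence_required", False,
                   "conclusion reached with no read-tool results")


def no_unapproved_mutation(trace: List[Event], answer: str) -> Verdict:
    """Pass only if every executed mutation carries an approval."""
    bad = [str(e.get("tool")) for e in trace
           if e.get("kind") == "tool_result"
           and e.get("mutating") and not e.get("approved")]
    if bad:
        return Verdict("no_unapproved_mutation", False,
                       "unapproved mutation executed: " + ", ".join(bad))
    return Verdict("no_unapproved_mutation", True)


def run_verifiers(verifiers: List[Verifier], trace: List[Event],
                  answer: str) -> List[Verdict]:
    # Run every verifier; a failure is reported, not raised, so the caller decides.
    return [v(trace, answer) for v in verifiers]
```

Neither predicate inspects the answer text here. They check what the agent did, which is why they stay cheap and deterministic.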

5. Observability

Every decision (tool call, refusal, verification, final answer) emits a structured JSON-line event with a sequence number and timestamp. A trimmed excerpt, with timestamps omitted:

{"seq": 5, "kind": "tool_call", "tool": "restart_pod", "mutating": true}
{"seq": 6, "kind": "tool_refused", "tool": "restart_pod", "reason": "mutating tool 'restart_pod' was not approved"}

The trace is the audit log. In a regulated or safety-critical environment, “what did the agent do” has to be answerable from an artifact, not reconstructed from a vibe. A refusal is logged as carefully as an allow. This is the part that makes an agent shippable in industries that have auditors.
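A JSON-lines tracer needs very little code. The sketch below is one way to do it with only the standard library; the Tracer class, the emit method, and the ts field name are assumptions, while seq, kind, tool, mutating, and reason follow the events shown above.

```python
import io
import json
import time
from typing import IO, Dict, List


class Tracer:
    """Appends one JSON object per line to a sink and keeps an in-memory copy."""

    def __init__(self, sink: IO[str]) -> None:
        self._sink = sink
        self._seq = 0
        self.events: List[Dict[str, object]] = []

    def emit(self, kind: str, **fields: object) -> Dict[str, object]:
        # Callers may not overwrite the ordering or timing fields.
        clash = fields.keys() & {"seq", "ts"}
        if clash:
            raise ValueError(f"reserved field(s): {sorted(clash)}")
        self._seq += 1
        event = {"seq": self._seq, "ts": time.time(), "kind": kind, **fields}
        self.events.append(event)
        self._sink.write(json.dumps(event) + "\n")
        self._sink.flush()  # keep the trace useful even if the run crashes
        return event


# Usage: a refusal is recorded with the same care as an allowed call.
buf = io.StringIO()
tracer = Tracer(buf)
tracer.emit("tool_call", tool="restart_pod", mutating=True)
tracer.emit("tool_refused", tool="restart_pod",
            reason="mutating tool 'restart_pod' was not approved")
```

In a real deployment the sink would more likely be an append-only file or a log shipper than a StringIO, but the loop only ever sees emit.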


What it looks like running

The demo runs offline — a deterministic model drives a real triage so every component fires without an API key. It triages a CrashLoopBackOff twice.

Run one, read-only: the agent lists pods, pulls metrics (memory at 97%, three OOM kills), reads logs (OutOfMemoryError: Java heap space against a 512Mi limit), and concludes it’s an undersized heap — and that a restart alone won’t fix it. restart_pod was never in its allowlist.

Run two, remediation attempt: a more eager model gathers the same evidence and then tries to restart_pod. The tool is granted this time, but the approver says no — and the harness refuses the mutation, logs it, and moves on:

{"kind": "tool_refused", "tool": "restart_pod", "reason": "mutating tool 'restart_pod' was not approved"}

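Same model class, same tool registry. What kept both runs safe was the allowlist in the first and the approval gate in the second, so the difference between a safe agent and a dangerous one is entirely in the harness.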


The loop itself

Here’s the thing worth internalizing: the loop that ties all five components together is about forty lines. Check budget → ask model → authorize → execute → observe, with a trace event at every decision and verification at the end. That’s it. The components are small on purpose. Harness engineering isn’t a heavyweight framework you adopt — it’s a set of responsibilities you can’t skip, however you implement them.

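while True:
    # Rail 1: stop cleanly once the step budget is spent.
    if not guardrails.check_budget().allowed:
        return Result(answer=None, stopped_reason=...)
    # Ask the model for the next action, given compacted memory and the tool catalog.
    action = model.decide(memory.render(), catalog)
    if action.is_final:
        # Verify the answer against the trace before returning it.
        return Result(action.final, run_verifiers(self.verifiers, tracer, action.final))
    # spec and args come from the action via the registry; that lookup and the
    # per-decision trace events are elided from this excerpt.
    decision = guardrails.authorize(spec, args)
    if not decision.allowed:
        # A refusal is fed back as an observation so the model can move on.
        memory.add_observation(f"{spec.name} refused: {decision.reason}")
        continue
    output = spec(**args)
    memory.add_observation(...)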

Swap FakeModel for AnthropicModel and the loop is unchanged. That’s the literal meaning of Agent = Model + Harness: the model is a dependency behind one decide method, and everything that makes the agent reliable lives outside it.
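The single-method seam can be written down as a typing.Protocol. The sketch below is an assumption about its shape: the Action fields, the Model protocol, and ScriptedModel (a deterministic stand-in in the spirit of FakeModel) are illustrative names, and only decide, is_final, and final appear in the loop above.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Protocol


@dataclass(frozen=True)
class Action:
    """Either a tool call (tool + args) or a final answer, never both."""
    tool: Optional[str] = None
    args: Dict[str, object] = field(default_factory=dict)
    final: Optional[str] = None

    @property
    def is_final(self) -> bool:
        return self.final is not None


class Model(Protocol):
    """Everything the loop needs from a model: one method."""

    def decide(self, context: str, catalog: List[dict]) -> Action: ...


class ScriptedModel:
    """Deterministic model that replays a fixed list of actions (no API key needed)."""

    def __init__(self, script: List[Action]) -> None:
        self._script = list(script)
        self._i = 0

    def decide(self, context: str, catalog: List[dict]) -> Action:
        if self._i >= len(self._script):
            # Out of script: end the run instead of raising inside the loop.
            return Action(final="script exhausted")
        action = self._script[self._i]
        self._i += 1
        return action


model: Model = ScriptedModel([
    Action(tool="get_logs", args={"pod": "api-0"}),
    Action(final="undersized heap; a restart alone will not fix it"),
])
```

Any class with a matching decide method satisfies the protocol structurally, so an API-backed model can replace the scripted one without inheriting from anything.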


Try it

git clone https://github.com/ajinb/sre-harness.git
cd sre-harness
pip install -e ".[dev]"
python examples/triage_demo.py

No API key. You’ll watch the five components fire on a real triage, then watch a mutation get refused. To use real Claude: pip install -e ".[anthropic]", set ANTHROPIC_API_KEY, and pass AnthropicModel() to the harness.

The repo is Apache-2.0 and deliberately small: it’s a reference to read and outgrow, not a framework to depend on. If you build a sharper verifier or a tighter approval flow, open a PR. The bar is the same one I hold for everything published under this brand: would a senior SRE sign off on what this agent did, from the trace alone?


Ajin Baby is an AI Platform & Cloud Infrastructure Architect at a Fortune-500 aviation technology company and the founder of cloudandsre.com, where he publishes production-grade tooling at the intersection of AI and SRE. He is currently pursuing an MS in Artificial Intelligence and is a 15+ year IEEE member.