Published · 10 min read

Self-Healing Performance Tests

The Golden Map, the Quality Gate, and M.I.N.D. — how LoadMagic rebuilt self-healing around a deterministic diff, and when AI still has to step in.

Self-Healing Started as Insurance

Self-healing was not some grand AI vision. It started as insurance. Insurance against the kind of accidents that happen when you work with JMeter every day: a misclick that deletes an extractor, a drag-and-drop that moves a sampler to the wrong thread group, a find-and-replace that corrupts a JSON path. These things are easy to do. JMeter's UI can be fiddly at the best of times. I have corrupted my own test plans more than once.

But the insurance policy turned out to protect against more than fumbled mouse clicks. It caught genuine errors in code and business logic too. The most reliable form of self-healing we built is the Suzy pipeline: when a Groovy script has a bug — whether I introduced it or whether Suzy (our scripting agent) produced it — the error gets routed back to her in fix mode with the full error context. She reads the stack trace, traces it to the root cause, and produces a corrected script.

I have not seen her fail at those fixes in a long time. The error context is what makes the difference. When an agent knows what went wrong, where in the stack trace it happened, and what the surrounding code looks like, the fix is straightforward. That is a pattern we discovered early and built everything else on top of: quality context produces quality repairs.

Two Worlds Collide

The first version of self-healing grew organically. Think of a failure scenario, write a code path to handle it. See a broken extractor? Investigate the response. Check the boundaries. Rebuild the regex. Validate.

It worked. At first.

The problem was complexity. Every failure scenario needed its own code path. Every code path needed its own maintenance. There were layers upon layers of investigative paths, each branching into sub-investigations. The self-healing system became a miniature version of the complexity it was supposed to solve.

Then we decided to rebuild our world view — the central data model that tracks every dynamic value through its lifecycle. Strategically, the right move. Operationally, chaos. We had to maintain the old self-healing system while building the new one. Two architectures in parallel. Old assumptions meeting new data structures.

We were trying to extract the old system seam by seam, rewiring each piece to the new world view. In theory, a clean strangle pattern. In practice, the old world kept creeping back. Every time we thought we had reached the end of the tunnel, another part of the foundations collapsed. A fix in the new system exposed a hidden dependency on the old system. A refactor that looked clean broke three things downstream.

There were many long days and late nights. The struggle refined three principles that now govern how we build everything at LoadMagic.

One Tidy

Our version of Kaizen. Every time you touch a piece of code, you are already holding context. Use that context to make the code better, not only fix the immediate issue. Five dimensions: centralised, modular, reusable, maintainable, visible.

Honest visibility

Identify what you know and what you do not know. Build gap identification into the process so blind spots surface immediately. We had too many silent failures — places where the system appeared to succeed but dropped data. The rule now: every data path needs explicit verification.

Shift left on data quality

Obtain all the data and organise it as soon as possible. Inject indexing at creation time for better tracking and retrieval. Pre-build, pre-organise. Stop investigating failures after the fact. Start preventing them through better data.

These principles sound obvious when written down. They emerged from pain, not theory.

The Golden Map

All that work led to what we now call the Golden Map.

The concept: we predict what needs correlating, wire it up, and prove it works by running a validation test with real traffic. When every extractor captures the right value on a live run, the entire state gets stamped. That stamp becomes the Golden Map — a frozen snapshot of the world view at the moment everything was proven to work. It captures the last known good configuration for every dynamic value in the test plan.

The self-healing that sits on top of this is simple. When something breaks, we do not investigate. We do not run an AI agent to diagnose root causes. We check the differences between the current state and the Golden Map, and restore what changed.

No investigation. No root cause analysis. Diff and restore.
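The stamp-then-restore cycle can be sketched in a few lines. This is a minimal illustration, not LoadMagic's implementation: the `ExtractorConfig` shape and field names are assumptions, and a real test plan holds far more state per value.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ExtractorConfig:
    """Last known good configuration for one dynamic value (hypothetical shape)."""
    variable: str
    expression: str   # e.g. a JSON path or regex
    scope: str        # e.g. which sampler the extractor is attached to

def stamp_golden_map(world_view: dict[str, ExtractorConfig]) -> dict[str, ExtractorConfig]:
    """Freeze the proven state. A shallow copy suffices because configs are immutable."""
    return dict(world_view)

def diff_and_restore(current: dict[str, ExtractorConfig],
                     golden: dict[str, ExtractorConfig]) -> dict[str, ExtractorConfig]:
    """Restore anything that drifted from the golden snapshot. No investigation."""
    repaired = dict(current)
    for name, golden_cfg in golden.items():
        if repaired.get(name) != golden_cfg:   # deleted, moved, or corrupted
            repaired[name] = golden_cfg        # put back the last known good config
    return repaired
```

The point the sketch makes: the repair is a pure function of two snapshots. There is no branching on *why* the extractor broke, which is exactly what keeps it fast and deterministic.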

It is more capable than the old approach. The quality of the world view data is now good enough that we do not need AI to self-heal. That has three advantages:

  • Faster — a diff operation takes milliseconds, not the seconds an LLM call requires.
  • Cheaper — no API tokens burned on diagnosis.
  • More predictable — a deterministic diff produces the same result every time; an AI agent might reason its way to a different conclusion on different runs.

Do not tell Carrie I said this.

The Golden Map also solved the regression problem. Before, when a test broke, the question was "what went wrong?" That question can have many answers, and investigating each one takes time. With the Golden Map, the question becomes "what changed?" That question has one answer: the diff. If an extractor that was capturing a valid session token yesterday is returning nothing today, the fix is to restore the golden configuration and re-run. If the restore works, the problem was corruption or accidental modification. If it does not, the application itself has changed — and that is useful information too.

A simpler system with good data beats a complex system that investigates every failure. You can shortcut entire investigative pathways if you do the work upfront to build on strong foundations.

What the Quality Gate Does

The quality gate sits between test execution and the self-healing pipeline. Its job is to answer one question after every test run: is this script ready?

Before you trust results, the gate checks every dynamic value in the test plan against the execution data. For each value, it asks: did the extractor capture something? Was the captured value used in subsequent requests? Did those requests succeed?

The gate classifies each variable into one of four states:

Proven

Extractor captured the right value and the test ran without errors.

Broken

Extraction failed — the extractor returned nothing or the wrong value, or a downstream request failed.

Wired

Extractor exists but has not been validated against a live run yet.

Candidate

Scanner identified something that looks dynamic but no extractor has been wired.
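The four states form a simple decision ladder over the execution data. A hedged sketch of that ladder — the boolean inputs are my own simplification of the gate's checks, not the actual schema:

```python
def classify(wired: bool, ran: bool, captured_ok: bool, downstream_ok: bool) -> str:
    """Classify one dynamic value after a test run (simplified inputs)."""
    if not wired:
        return "Candidate"   # scanner flagged it, but no extractor exists yet
    if not ran:
        return "Wired"       # extractor exists, never validated on a live run
    if captured_ok and downstream_ok:
        return "Proven"      # captured the right value, downstream requests succeeded
    return "Broken"          # capture failed, or a request using the value failed
```

Order matters: a value cannot be Proven or Broken until it has both an extractor and live execution data, which is why the ladder checks wiring first and run evidence second.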

When everything is Proven, the gate stamps the Golden Map and the test plan is ready for load. When anything is Broken, the gate builds a set of candidate actions: specific instructions for what needs fixing. Those actions go to the self-healing pipeline, which applies the Golden Map restore (fast, deterministic) or, if the restore does not work, escalates to Carrie for deeper investigation (slower, AI-powered, but capable of handling application changes).

A convergence loop automates this cycle: run the test, check the gate, apply fixes, run again. Safeguards prevent infinite loops: if the system makes no progress across two iterations, or if it detects oscillation (the same variable flipping between Broken and Wired), or if it exceeds a maximum iteration count, it stops and reports what it found.
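The loop and its three stop conditions can be sketched as follows. The callback names, the iteration cap, and the exact stall/oscillation tests are assumptions for illustration:

```python
def converge(run_test, check_gate, apply_fixes, max_iterations: int = 5) -> str:
    """Run the test, check the gate, apply fixes, repeat — with safeguards."""
    history: list[frozenset] = []             # broken-variable sets seen so far
    for _ in range(max_iterations):
        run_test()
        broken = frozenset(check_gate())      # names of variables not yet Proven
        if not broken:
            return "ready"                    # everything Proven: stamp the Golden Map
        if len(history) >= 2 and history[-1] == history[-2] == broken:
            return "stalled"                  # no progress across two iterations
        if history and broken != history[-1] and broken in history:
            return "oscillating"              # a previously seen state has come back
        history.append(broken)
        apply_fixes(broken)
    return "max_iterations"                   # iteration cap reached; report findings
```

Tracking the whole history of broken sets, rather than just the last one, is what lets a single loop detect both stalls (the same set repeating) and oscillation (an older set reappearing after the system had moved past it).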

The key architectural decision: the server decides what happens to each variable. The client obeys. We call this gate authority. It eliminated a whole class of bugs where the client and server disagreed about what needed fixing, and it means the quality gate can enforce policies (like "do not modify extractors that are already Proven") without relying on every client implementation to get the logic right.
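Gate authority is easiest to see as a server-side decision table: the server maps every variable's state to exactly one instruction, and the client just executes the list it receives. The action names and the shape of the inputs below are hypothetical:

```python
def decide_actions(states: dict[str, str], golden: dict[str, str]) -> dict[str, str]:
    """Server-side gate authority: one instruction per variable, no client logic."""
    actions = {}
    for name, state in states.items():
        if state == "Proven":
            actions[name] = "leave_alone"        # policy enforced centrally, not per client
        elif state == "Broken" and name in golden:
            actions[name] = "restore_golden"     # fast, deterministic path
        elif state == "Broken":
            actions[name] = "escalate_to_agent"  # no golden config to fall back on
        elif state == "Candidate":
            actions[name] = "wire_extractor"
        else:                                    # Wired
            actions[name] = "validate_on_next_run"
    return actions
```

Because the "do not modify Proven extractors" rule lives in this one function, no client can violate it by accident — the client never has to reimplement the policy at all.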

Where This Breaks Down — and M.I.N.D.

I need to be honest about the boundaries.

The Golden Map approach works well when the test plan changes but the application stays the same. Accidental deletion, configuration drift, extractor corruption — these are the scenarios where diff-and-restore is reliable and fast.

When the application itself changes, things are harder. A platform update that moves a token from a response header to a JSON body. A redesigned authentication flow that adds a new CSRF token. A server-side change that alters the structure of API responses. In these cases, the Golden Map restore will fail because the golden configuration no longer matches the application. The system detects this (the restored extractor breaks on the next run) and escalates to the AI agents for deeper investigation. Carrie and Rupert handle most of these, but it is slower and less predictable than a deterministic restore.

M.I.N.D. — persistent learning

M.I.N.D. is live in production today: persistent per-application knowledge that compounds across sessions.

The principle: everything we do should build on top of what we have done before. We already capture rich data during correlation. The prediction, the wiring, the validation — it all generates observations about how an application behaves. M.I.N.D. persists those observations so that the second time someone correlates the same application, the system starts with what it learned the first time.

The initial design was a confidence-based decay system with graduated tiers, statistical weighting, and time-based deprecation. We simplified it. The current direction is closer to binary: a pattern works or it does not. When newer, better intelligence arrives, we adapt. When a pattern that used to work fails, that single failure triggers re-investigation.

The idea is to recognise situations and patterns we have seen before and apply fixes that worked before, rather than investigating from scratch every time. That is how experienced performance engineers work. You see a familiar token pattern, you already know where to look and what extractor to use. M.I.N.D. gives the system that same accumulated experience.

Errors that agents were making on first contact with unfamiliar applications are becoming less frequent as the knowledge base grows. That compounding is exactly the point. M.I.N.D. gets better as it sees more applications, and the data it accumulates is the foundation for the specialised models we plan to train down the line.

The Principle Underneath

Invest in data quality, build a verifiable baseline, and most repair operations reduce to a diff. AI is still there for the cases where the application itself has changed — but the Golden Map handles the majority of real-world breakage with no AI at all.

The self-healing story comes down to one thing: doing the hard work upfront so that recovery becomes trivial. A simpler system with good data beats a complex system that investigates every failure.

AI Performance Engineering book cover

The full engineering story

Chapter 8 of AI Performance Engineering has the architecture diagrams, the failure scenarios we still cannot handle, and the M.I.N.D. design iterations in full.

Related: Why Correlation Needs Three Layers · How Five Agents Fix Correlation · The 120x Claim, Audited