
The 120x Claim, Audited

A founder runs a stopwatch on his own correlation claim. Same HAR, same workstation — LoadMagic autocorrelation against manual correlation with ChatGPT as a coding assistant. The time gap was expected. The coverage gap was not.

The Saturday Morning Guinea Pig

I used myself as a guinea pig.

Early on a Saturday morning — questionable life choices aside — I decided to stop claiming LoadMagic was faster and measure it. Same test plan. Same HAR recording. Same workstation. One run with LoadMagic's autocorrelation. One run doing it manually, with ChatGPT open in a browser tab for help.

I expected the manual run to be slow. I did not expect what happened next.

Seventeen minutes in, I still had no working script. The JSON extractors looked perfect. Values captured fine. ChatGPT's path suggestions were sensible. But the correlated script did not work. At one point I thought my demo site was broken.

To sanity-check, I restarted everything. Re-imported the HAR file. Let Carrie, LoadMagic's autocorrelation, take over. Thirty seconds later: fully correlated, fully functional script.

The formal study came after that Saturday. I wanted proper numbers, not a feeling. So I ran it again with a stopwatch, documented every mouse click, every tool switch, every wrong turn. The results were worse than I expected, and I am the person who built the alternative.

The Numbers

One flow, nine requests, two runs on the same workstation: LoadMagic's autocorrelation against manual correlation with ChatGPT as a coding assistant.

| Metric | LoadMagic | Manual + ChatGPT | Difference |
|---|---|---|---|
| Total time | 75 seconds | 25 min 20 sec | 20.3x faster |
| Context switches | 0 | 15 | Eliminated |
| Human errors | 0 | 6 | Eliminated |
| Productive time | 100% | 44% | +56 points |
| Candidates found | 6 of 6 | 2 of 6 | 3x coverage |

The headline is 20.3x faster on correlation for this flow (25 min 20 sec is 1,520 seconds; 1,520 / 75 ≈ 20.3). On a 9-request test the absolute time saved is modest. The coverage gap matters more.

The manual run missed four out of six correlation candidates. That means the "finished" manual script was submitting hardcoded values to the server — values that would be invalid in any real test run. The script would have passed validation and delivered meaningless results.

A test that looks correct but sends stale data is worse than no test at all. It creates false confidence.

Where the Time Goes

The minute-by-minute breakdown showed three distinct phases in the manual session.

Phase 1 — Setup wrestling (0:00 to 6:00)

Six minutes before any productive work began. I needed to find and display the HAR file. The browser could not render it. Switched to an IDE. Switched to a text editor. Switched back to the IDE. Spent two minutes pretty-printing JSON. I ended up using LoadMagic to download the file in a readable format.

Six minutes of setup. LoadMagic: zero.
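In hindsight, the pretty-printing step at least had a cheap fix I did not reach for under pressure: a HAR file is plain JSON. A minimal sketch, with recording.har as a hypothetical filename:

```python
import json

# A HAR recording is just JSON; re-serialise it with indentation
# to get a readable copy for manual inspection.
with open("recording.har") as src:
    har = json.load(src)
with open("recording.pretty.har", "w") as dst:
    json.dump(har, dst, indent=2)
```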

Phase 2 — Productive scripting (6:00 to 19:00)

The first candidate took 14 minutes: finding the response source, asking ChatGPT for a regex, realising I needed JSONPath expressions instead, pasting extractors into the right place, renaming variables, running search-and-replace. Fifteen context switches across four tools: IDE, text editor, ChatGPT in the browser, and the test plan. Across the session, that is an interruption every 100 seconds.

At that frequency, sustained focus is impossible. Setup overhead and mental re-orientation consumed 56% of the manual session, leaving less than half for productive scripting.
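For readers who have never correlated by hand, each candidate reduces to the same two moves: extract a dynamic value from one response, then substitute it for the recorded value in every later request that echoes it. A minimal sketch of one such step, with the endpoint URLs and the session_token field as hypothetical stand-ins for whatever your application actually issues:

```python
import json
import urllib.request

BASE = "https://demo.example.com"  # hypothetical demo application

# Extract: pull the dynamic value out of the login response body
# (the JSONPath equivalent of $.session_token).
with urllib.request.urlopen(f"{BASE}/api/login") as resp:
    token = json.load(resp)["session_token"]

# Substitute: replace the hardcoded value captured in the HAR recording
# with the live one before replaying the next request in the flow.
payload = json.dumps({"session_token": token, "item": "sku-123"}).encode()
req = urllib.request.Request(
    f"{BASE}/api/cart",
    data=payload,
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(resp.status)  # 200 only if the correlation actually worked
```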

Phase 3 — The debugging spiral (19:00 to 25:20)

This is where it got ugly. The second candidate triggered a chain reaction. A single wrong dropdown selection — setting the extractor to "request" instead of "response" — and then six minutes of debugging that went nowhere. I tried adding a line feed. Tried case-insensitive regex. Tried removing a trailing space. Asked ChatGPT for help three separate times. Gave up on header extraction and switched to body extraction. Guessed the wrong capture group. Checked with ChatGPT, corrected it, shortened the regex to avoid an emoji in the boundary, and finally got it working.

One wrong dropdown. Six failed attempts. Five additional context switches. Six minutes lost on something LoadMagic handled in under 20 seconds.
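The capture-group guess is worth illustrating, because it fails silently: extraction succeeds, the variable is populated, and nothing looks wrong until a downstream request is rejected. A minimal sketch with a made-up response body:

```python
import re

# Hypothetical response fragment containing the dynamic value.
body = 'window.config = {"csrf": "a91f3c", "user": "demo"};'

# The pattern has two groups; grabbing the wrong one still "works"
# at extraction time, so the error only surfaces downstream.
match = re.search(r'"(csrf)":\s*"([0-9a-f]+)"', body)
wrong = match.group(1)   # 'csrf'   -- the key, not the token
right = match.group(2)   # 'a91f3c' -- the value we actually need
print(wrong, right)
```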

Three Hidden Cost Drivers

The time study revealed three cost drivers that do not show up in planning estimates.

The Error Cascade

A single configuration mistake does not stay small. It triggers a debugging spiral where each attempted fix creates new uncertainty. The wrong dropdown selection in the study generated a cascade: try line feed, try case-insensitive, try space removal, give up on header, switch to body, guess capture group wrong, verify with AI, try shorter regex, success. That is nine steps to recover from one click.

The manual error rate in the study was about one cascade per three candidates. In a flow with 38 candidates, you would expect 12 cascade-triggering errors. In complex flows, you are not debugging a single extractor — you are tracing failures across upstream and downstream dependencies.

Cognitive Fragmentation

0.59 context switches per minute. An interruption every 100 seconds across four separate environments. No one can sustain focus at that frequency. Non-productive activity consumed 56% of the manual session. Not laziness. Not lack of skill. Structural fragmentation built into the manual workflow.

LoadMagic runs in one window. Zero context switches. Detection, extraction, and replacement happen in a single pass.

The Script Museum

When scripts cost 25 minutes each to correlate manually, and break every time the application changes, teams face a choice: re-correlate and pay the cost again, or abandon the script and move on. Most choose abandonment. The performance testing programme accumulates a museum of stale, untrusted scripts that nobody maintains. Testing coverage shrinks with every release cycle until the team is running a handful of happy-path tests and hoping for the best.

Automation makes maintenance viable. When re-correlating a script takes 75 seconds instead of 25 minutes, updating scripts after every release becomes practical rather than theoretical.

What Happens at Scale

The study measured a simple 9-request login flow. Six dynamic values. Real applications are bigger.

Manual effort scales with both candidate count and error probability. Automated effort scales only with candidate count. As complexity grows, manual costs accelerate while automated costs stay flat. Applying the observed unit costs to four complexity profiles drawn from TPC and SAP benchmarks:

| Profile | Manual | Automated | Value Driver |
|---|---|---|---|
| Happy Path (observed) | 25 min | 75 sec | Speed (20x) |
| Standard Commercial | 3.4–4.6 hrs | ~10 min | Efficiency |
| Complex Financial | ~1 week (42–46 hrs) | 1.2–1.8 hrs | Data integrity |
| Heavy Enterprise | 8–10 weeks | 1.5–2 days | Feasibility |

I want to be honest about confidence levels. The Happy Path numbers come from direct observation. The Standard Commercial projection is moderate confidence, derived from the observed unit costs with stated assumptions. The Financial and Enterprise projections are lower confidence, further extrapolations that need validation against real applications at those scales. The structural argument — that manual costs accelerate faster than linear while automated costs stay near-linear — is well-supported. The specific numbers at each tier are estimates, not measurements.
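A toy cost model makes the structural argument concrete. The sketch below is a first-order approximation only: the unit costs come from the observed run, the error rate is the roughly one-cascade-per-three-candidates figure above, and it deliberately omits the effect that makes manual costs superlinear in practice, namely that error probability and cascade depth themselves rise with flow complexity. It will not reproduce the table; it only shows the shape of the gap:

```python
# Toy cost model for correlation effort, in minutes. Unit costs are taken
# from the observed 9-request run; the rest are labelled assumptions.
SETUP_MANUAL = 6.0         # observed: minutes lost before productive work
T_CAND_MANUAL = 6.0        # assumed: minutes per candidate once warmed up
T_CASCADE = 6.0            # observed: minutes per debugging cascade
P_ERROR = 1 / 3            # observed: about one cascade per three candidates
T_CAND_AUTO = 75 / 6 / 60  # observed: 75 s for 6 candidates, in minutes

def manual_minutes(candidates: int) -> float:
    expected_cascades = candidates * P_ERROR
    return SETUP_MANUAL + candidates * T_CAND_MANUAL + expected_cascades * T_CASCADE

def automated_minutes(candidates: int) -> float:
    return candidates * T_CAND_AUTO

for n in (6, 38, 150):
    print(f"{n:>3} candidates: manual ~{manual_minutes(n) / 60:.1f} h, "
          f"automated ~{automated_minutes(n):.1f} min")
```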

> 56% of the manual session was consumed by context switching and setup — not productive scripting

Run Your Own Time Study

You do not have to take my word for it. Measure correlation efficiency on your own applications.

  1. Pick a representative user journey. Something you have correlated before.
  2. Record a fresh HAR file of that journey.
  3. Start a stopwatch. Correlate the script manually, the way your team does it now. Note every tool switch, every error, every time you pause to think.
  4. Record the total time, the number of candidates you found, and the number of replacements you made.
  5. Now run the same HAR through an automated correlation tool. Record the same metrics.
  6. Compare total time, coverage (did you find the same candidates?), and accuracy (did the extractors work first time?).
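If you want a head start on step 4, a rough way to count candidates is to scan the HAR for response values that reappear in later requests. A minimal sketch, assuming a recording saved as recording.har; the token-shaped regex is a crude heuristic of mine, not any tool's detection logic:

```python
import json
import re

# Crude heuristic for values that look dynamic: long runs of token-ish
# characters (session IDs, JWT fragments, GUID-like strings).
TOKEN_RE = re.compile(r"[A-Za-z0-9_\-]{16,}")

with open("recording.har") as f:
    entries = json.load(f)["log"]["entries"]

candidates = set()
for i, entry in enumerate(entries):
    body = (entry["response"].get("content") or {}).get("text") or ""
    for value in set(TOKEN_RE.findall(body)):
        # A response value is a correlation candidate if any *later*
        # request echoes it in its URL, headers, or body.
        for later in entries[i + 1:]:
            req = later["request"]
            post = (req.get("postData") or {}).get("text") or ""
            headers = " ".join(h["value"] for h in req.get("headers", []))
            if value in req["url"] or value in headers or value in post:
                candidates.add(value)
                break

print(f"{len(candidates)} candidate value(s) to correlate:")
for value in sorted(candidates):
    print(" ", value[:40])
```

Expect false positives and misses; the point is a baseline count to compare against what you found by hand.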

If your experience is anything like mine on that Saturday morning, the gap will surprise you.


The numbers, in full

Chapter 7 has the full minute-by-minute breakdown, the four complexity profiles, and the confidence discussion. The rest of AI Performance Engineering shows the architecture behind the numbers.

Related: The Correlation Spectrum · Why Correlation Needs Three Layers · Self-Healing Performance Tests