Notes from the Coalface
Most of what gets written about AI and automation is written from a distance. Strategy decks. Analyst reports. Case studies polished until the friction has been removed.
This isn't that.
Workhorse is an inventory and order management system for UK product businesses — the kind of operations that run on spreadsheets, supplier relationships, and people who've been doing this long enough to know where the bodies are buried. We're building AI automation into that environment right now, and we're writing down what we find.
The question that started this series was deceptively simple: if automation is supposed to remove labour, why does the labour keep showing up?
Not everywhere. The routine work does disappear. Orders that used to be processed manually get handled without anyone touching them. But the people who used to do that work aren't sitting idle — they're busier, in some cases — doing something slightly different. Checking. Deciding. Catching the things the system flagged, or worse, didn't flag. The headcount problem doesn't solve itself. It moves.
What we've found, piece by piece, is that this isn't a technology problem. It starts with a distinction that sounds technical but turns out to matter more than almost anything else about how operational software is built: a system that records what happened and a system that owns what happens next are different products. Most operational software is only the first one.
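That distinction can be made concrete in a few lines. This is a sketch, not Workhorse's actual data model — every name here is invented for illustration — but it shows the shape of the difference: a record describes what happened, while an execution decision explicitly declares whether acting on that record is permitted, and why.

```python
# Hypothetical sketch: a system of record vs. a system that owns execution.
from dataclasses import dataclass, field

@dataclass
class OrderRecord:
    """Records what happened. Its existence implies nothing about action."""
    order_id: str
    lines: list = field(default_factory=list)

@dataclass
class ExecutionDecision:
    """Owns what happens next: permission is declared, never assumed."""
    order_id: str
    permitted: bool   # is the system allowed to act on this record?
    reason: str       # kept so the decision can be retraced later

def may_execute(record: OrderRecord, credit_ok: bool) -> ExecutionDecision:
    # Illustrative gates only: the point is that they exist explicitly.
    if not record.lines:
        return ExecutionDecision(record.order_id, False, "empty order")
    if not credit_ok:
        return ExecutionDecision(record.order_id, False, "credit hold")
    return ExecutionDecision(record.order_id, True, "all gates passed")
```

Most operational software ships the first dataclass and leaves the second implicit, living in the heads of the people doing the checking.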
From there, the problems compound. Fragmented stacks — a purchasing tool, a forecasting tool, an ERP underneath all of it — move decisions between systems without any of them owning the outcome. Operational systems routinely create executable state without ever declaring whether execution is actually permitted — and the humans who catch the ones that shouldn't go out aren't a safety feature. They're evidence that the system never defined what execution requires. That pattern, it turns out, isn't accidental and it isn't fixable with better tooling. It's structural.
Which is where things get uncomfortable for anyone selling AI as the answer. Better forecasts and higher confidence scores don't reduce the review load — because the review was never about whether the record is correct. It's about whether the system is permitted to act on it. Authority is the bottleneck, and intelligence doesn't move it.
What makes this harder to fix than it looks is that the volume doesn't stay flat. As automation increases, more records arrive at the permission boundary — and the humans absorbing that load aren't checking arithmetic, they're carrying commercial risk the system was never authorised to carry. Accuracy makes it worse, not better: a 95% success rate doesn't produce 95% less work, because the system still can't tell you which 95% are safe. Until it can, you check everything. The pressure compounds precisely because the system is getting smarter.
The automation boundary in an operational workflow isn't placed where someone decided to stop. It forms after an irreversible mistake reaches a customer — a wrong shipment, a duplicated order, a disputed invoice — and it lands at the last point the error was still fixable. After that, every order gets checked. And the boundary doesn't move, because nothing in the system has changed that would prevent the same mistake from recurring undetected.
The instinct, when errors keep appearing, is to reduce them — to invest in accuracy until the checking becomes optional. But accuracy isn't what drives the verification load. A team that took their error rate from 5% to 2% found the checking unchanged: the system still couldn't identify which orders it might have got wrong, so every order still got checked. The exit from universal verification isn't a higher score. It's a system that can tell you where it's uncertain.
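The arithmetic behind that claim fits in a few lines. The numbers here are made up — order volume, accuracy, and flag rate are all illustrative — but they show why a higher score alone leaves the review load untouched, while per-record uncertainty lets it shrink:

```python
# Illustrative numbers only: why accuracy alone doesn't shrink the queue.
orders = 1_000

# At 95% (or 98%) accuracy, the reviewer still can't tell the good
# records from the bad ones, so every order is checked regardless.
reviewed_blind = orders

# If the system can instead flag the records it is least sure about,
# review concentrates on that subset (an assumed 10% here).
flag_rate = 0.10
reviewed_flagged = int(orders * flag_rate)

print(reviewed_blind, reviewed_flagged)  # 1000 100
```

Raising accuracy changes the first number not at all; only per-record uncertainty changes the second.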
There's a second problem running underneath the authority question, and in some ways it's harder to see. The review gate assumes records wait safely until someone looks at them — but in most operational systems, an unreviewed order is already participating in execution the moment it exists. Stock availability adjusts, demand signals update, warehouse teams start planning — all against a record no one has approved. By the time a reviewer opens it, the downstream decisions are already made.
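The difference shows up in when a record's side effects fire. In the pattern just described, creation and execution are the same event; a held variant, sketched below with invented names, defers the side effects until a reviewer explicitly approves:

```python
# Hypothetical sketch: two ways a new order record can enter a system.
stock = {"SKU-1": 100}
pending = []

def ingest_live(order):
    """The pattern above: the record participates in execution the
    moment it exists. Stock moves before anyone has reviewed it."""
    stock[order["sku"]] -= order["qty"]

def ingest_held(order):
    """Alternative: the record waits in a review queue and has no
    downstream effect until it is explicitly approved."""
    pending.append(order)

def approve(order):
    pending.remove(order)
    stock[order["sku"]] -= order["qty"]
```

With `ingest_live`, the review gate is theatre: the decisions it was meant to guard have already happened.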
The gate-is-already-leaking problem assumed that records at least meant what they said — that the ambiguity was about timing, not content. That turns out to be only half of it. In any live operational environment, the sources the AI draws from will always contain records that are individually accurate but collectively contradictory — configuration notes, JIRA tickets, code documentation that each reflect a different moment in a client's history. When an engineer corrects the AI's answer, she's resolving that conflict from knowledge the sources don't contain, and nothing about that resolution gets back to the system. Next time the same query arrives, the same conflicting sources produce the same confident wrong answer.
That recurrence turned out to sit on something more basic than the knowledge loop. A system that logs the fact of an action isn't the same as one that can tell you why it thought the action was right, and without a per-record trail from input to output, every correction is a dead end — it fixes the output but not the cause. Authority, when it eventually arrives, can only govern records whose path it can retrace; the trail has to be there first or there's nothing for the rules to act on.
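A minimal sketch of what a per-record trail could look like, with names and structure invented for illustration: each derived record carries references to the inputs and the rule that produced it, so a correction can be walked back to a cause instead of dead-ending at the output.

```python
# Hypothetical provenance trail: each output remembers its inputs and rule.
trail = {}  # record_id -> {"inputs": [...], "rule": ...}

def derive(record_id, inputs, rule, compute):
    """Produce a record and store how it was produced."""
    trail[record_id] = {"inputs": inputs, "rule": rule}
    return compute(inputs)

def retrace(record_id):
    """Answer: why did the system think this was right?"""
    return trail.get(record_id)

# Example: a reorder quantity derived from two source records.
qty = derive(
    "PO-42",
    inputs=["forecast:week-12", "stock:SKU-1"],
    rule="reorder-point-v3",
    compute=lambda _: 250,  # stand-in for the real calculation
)
```

Now a correction to `PO-42` isn't just an overwrite: `retrace` names the forecast, the stock record, and the rule version that produced it, which is where the fix actually belongs.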
Which makes the usual sequencing worse than it looks. The default plan, almost everywhere, is to define execution authority properly later — after the pilot, after the platform decision, once things settle. But the records don't sit still while you wait: once a confirmation is ingested, the stock ledger updates against it, picks get generated from it, invoices reference it, and authority defined afterwards can only face forwards. By the time the rules are ready, the set they were meant to govern has already moved past their reach.
We're still in the middle of it. The series follows the evidence one step at a time, and the evidence keeps moving. What's below is what we've established so far.