The Number That Looks Right

The expensive mistakes were rarely the numbers that looked wrong; they were the ones that looked normal enough to trust.

Published on: 11th May 2026


Our NL/SQL setup lets the team ask questions of operational data in plain English. Stock positions, supplier performance, order patterns. The kind of thing that used to need someone who knew both SQL and the schema. It works well. People who couldn’t write SQL six months ago are now asking questions they wouldn’t have bothered to ask before, because the cost of asking has dropped to seconds.

It also gets things wrong in a particular way that we didn’t see coming.

When a query is wrong because the syntax is broken, you find out straight away. The query fails. When a query is wrong because the answer is obviously implausible — a stock figure ten times too large, a supplier with negative lead time — you also find out almost straight away, because someone looking at the number flinches. Those errors are cheap.

The errors that aren’t cheap are the ones where the number looks plausible. A figure for average order value comes back roughly where you’d expect. Slightly down on last quarter, broadly in line with what the team has been seeing. The number is wrong, but you can’t tell that from looking at it. The query joined two tables in a way that quietly dropped a category of order — a category the asker didn’t know existed in the data, and that the system didn’t flag because, structurally, nothing was broken. The join did what it was told. The filter did what it was told. The answer is wrong because the question, asked in English, didn’t carry the assumption that those orders mattered.
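A minimal sketch of that failure mode, using a hypothetical schema (the table names, categories, and figures are illustrative, not from our system): an inner join against a lookup table that is missing one category quietly excludes those orders, and the resulting average lands close enough to the real one to pass a glance.

```python
import sqlite3

# Hypothetical schema for illustration: an orders table and a categories
# lookup that is missing one category the orders still reference.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE categories (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, category_id INTEGER, value REAL);
    INSERT INTO categories VALUES (1, 'retail'), (2, 'trade');
    -- category_id 3 was never added to the lookup table
    INSERT INTO orders VALUES
        (1, 1, 120.0), (2, 2, 80.0), (3, 3, 40.0), (4, 1, 100.0);
""")

# The generated query: the INNER JOIN silently drops order 3,
# because its category has no row in the lookup table.
inner_avg = con.execute("""
    SELECT AVG(o.value) FROM orders o
    JOIN categories c ON c.id = o.category_id
""").fetchone()[0]

# The figure the asker actually wanted: every order counted.
true_avg = con.execute("SELECT AVG(value) FROM orders").fetchone()[0]

print(inner_avg)  # 100.0 -- plausible, slightly off, structurally "correct"
print(true_avg)   # 85.0
```

Nothing fails here. The query runs, the number is in a believable range, and the only evidence of the dropped category is a row count the consumer never sees.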

By the time anyone notices the number was wrong — and someone usually does, eventually, when the figure shows up next to a contradicting one and someone has to reconcile them — the decision based on it has already been taken. The pricing email has gone out. The supplier conversation has happened. The cost of the error isn’t the wrong number. It’s everything downstream of someone trusting it.

The thing that makes this category of error different is that it’s invisible at the point you receive the information. With the retrieval system we wrote about last week, the engineer reading the answer is at least in a position to feel when something is off, because they’re the one who would have to defend the answer if it were wrong. With NL/SQL the consumer is two steps removed. The asker accepts the number. The consumer of the asker’s report accepts it again. Neither of them is in a position to question the SQL that produced it.

The instinct, when this becomes visible, is to invest in better question parsing, stricter rules around the generation of the query. Train the system to ask clarifying questions. Get it to surface its assumptions, flag when a join might be ambiguous. We’re doing some of that. It helps at the margins. But it doesn’t solve the underlying problem, because the underlying problem isn’t that the system is bad at translating questions. It’s that the consumer of the answer has no way to evaluate it. They can’t see the SQL. They can’t see the schema. They can’t see the assumptions baked into the views the query is hitting. They see a number. The number looks right.

And here is where this surface differs from the others. On the inventory side, you can imagine the system holding back — release this record, refuse that one, ask before posting. There is a moment where something is about to happen, and you can sit a rule in front of it. On NL/SQL there is no equivalent moment. The output is a number, and the number is already in someone's hands. The thing you might want to control — whether to act on it — is happening in their head, on a Tuesday morning, while they're writing an email.

So if you wanted to put a rule in the system — only release answers that meet some standard, hold back the ones that don't — you run into a basic problem. The rule would have to refer to something about the answer: which tables it touched, which rows got dropped, what the view assumed. None of that comes back with the number. The number arrives on its own. You can write the rule, but there's nothing for it to read.
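To make the gap concrete, here is a sketch (names and fields are hypothetical, not our implementation) of what an answer would have to carry for a release rule to have something to read: the tables touched, rows dropped by joins, and any unstated assumptions, travelling with the number instead of being discarded at query time.

```python
from dataclasses import dataclass, field

# Hypothetical provenance record: what would need to travel with the
# number for a release rule to read. Field names are illustrative.
@dataclass
class Answer:
    value: float
    sql: str
    tables: list
    rows_dropped_by_joins: int = 0
    assumptions: list = field(default_factory=list)

def release_rule(ans: Answer) -> bool:
    """Release only answers whose workings raise no flags."""
    if ans.rows_dropped_by_joins > 0:
        return False  # hold back: a join silently excluded data
    if ans.assumptions:
        return False  # hold back: unstated assumptions need surfacing
    return True

clean = Answer(85.0, "SELECT AVG(value) FROM orders", ["orders"])
suspect = Answer(
    100.0,
    "SELECT AVG(o.value) FROM orders o JOIN categories c ON c.id = o.category_id",
    ["orders", "categories"],
    rows_dropped_by_joins=1,
)

print(release_rule(clean))    # True
print(release_rule(suspect))  # False
```

The rule itself is trivial. The point is that it can only exist once the provenance exists; with a bare number, both answers above are indistinguishable.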

We are implementing tracing that allows us to see what’s going on. What’s become clear, having watched plausible-looking numbers turn out to be wrong in subtle ways, is that for NL/SQL, provenance has to come first. Not as a feature added later. Until the answer shows its workings, no rule about which answers to trust has anything to refer to.
