95% accurate and completely useless

The model shipped on a Thursday, and the team did the thing teams do. Screenshot of the dashboard, 95% accuracy, posted in the company channel with a rocket next to it. The fraud system worked. Everyone moved on to the next thing.


June 10, 2026 · 3 min read · 3 sections · By Ahmed Abdullah

Introduction

The number was real. That was the problem.

Fraud at this company ran around 3% of transactions. Which means a model that does absolutely nothing, that waves every payment through with a cheerful “looks fine,” scores 97% just by showing up. Their celebrated 95% was, in plain terms, losing to a brick. But nobody runs the brick comparison, because 95% pattern-matches to “good,” and good doesn’t get interrogated.
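The brick comparison takes a few lines to run. Here is a minimal sketch, assuming a sample of 1,000 transactions at the 3% fraud rate described above; the function names and the sample are illustrative, not the team's actual code.

```python
# Sketch: compare a model's accuracy with a "do nothing" baseline.
# The 1,000-transaction sample is an assumption made for illustration.

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the labels (1 = fraud, 0 = legitimate)."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

def approve_everything(y_true):
    """The 'brick': predicts 'legitimate' for every transaction."""
    return [0] * len(y_true)

# 1,000 transactions, 30 of them fraud (3%).
labels = [1] * 30 + [0] * 970
brick_accuracy = accuracy(labels, approve_everything(labels))
model_accuracy = 0.95  # the figure from the dashboard

print(f"brick: {brick_accuracy:.0%}, model: {model_accuracy:.0%}")
print("model beats brick:", model_accuracy > brick_accuracy)
```

If the last line prints False, the headline number is not yet evidence that the model does anything.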

It got interrogated three months later, when their largest merchant churned.

The story there was ugly in a quiet way. The model had been catching the obvious fraud, the clumsy stuff, the cases it was always going to catch. The fraud it missed was the careful kind, the actual professionals, and those were the only cases large enough to matter. Meanwhile, to keep its score clean, it had been gently freezing legitimate cards on the edges, including a run of the merchant’s best customers on a Saturday night. The merchant didn’t churn over fraud. They churned over the model’s idea of being careful.

By the time they booked a consultation call, the dashboard still glowed 95% and no one in the building believed it anymore.

A score that adds up the wrong things

Here is what the number was hiding. A missed fraud cost them a chargeback, an annoyance, a figure with a dollar sign and not many zeros. A frozen legitimate payment cost them a furious customer and sometimes the whole account. Those two mistakes are nothing alike, and accuracy had been quietly averaging them into one number as if they were. A metric that treats your cheapest error and your most expensive error as the same event is not measuring your business. It is flattering it.

So we stopped averaging. We scored the two failures separately and weighted each by what it actually did to revenue. The proud single percentage came apart into a few numbers that were uglier to look at and impossible to argue with. Which is the point of a metric. The one on the dashboard had been comfortable, and comfortable is how a bad number survives a year.
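Scoring the two failures separately might look something like the sketch below. The cost figures (50 per missed fraud, 400 per frozen legitimate payment) and the names `ErrorReport` and `score_errors` are assumptions for illustration; real costs would have to come from the business's own revenue data.

```python
from dataclasses import dataclass

@dataclass
class ErrorReport:
    missed_fraud: int          # fraud the model approved (false negatives)
    frozen_legit: int          # legitimate payments it blocked (false positives)
    missed_fraud_rate: float   # share of all fraud that was missed
    frozen_legit_rate: float   # share of all legitimate payments that were blocked
    total_cost: float          # both errors weighted by their assumed cost

def score_errors(y_true, y_pred, cost_missed_fraud, cost_frozen_legit):
    """Count each kind of error on its own, then weight by cost (1 = fraud)."""
    missed = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    frozen = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    n_fraud = sum(y_true)
    n_legit = len(y_true) - n_fraud
    return ErrorReport(
        missed_fraud=missed,
        frozen_legit=frozen,
        missed_fraud_rate=missed / n_fraud if n_fraud else 0.0,
        frozen_legit_rate=frozen / n_legit if n_legit else 0.0,
        total_cost=missed * cost_missed_fraud + frozen * cost_frozen_legit,
    )

# Toy data: two fraud cases, four legitimate payments.
labels      = [1, 1, 0, 0, 0, 0]
predictions = [1, 0, 1, 0, 0, 0]
report = score_errors(labels, predictions,
                      cost_missed_fraud=50.0,    # hypothetical chargeback cost
                      cost_frozen_legit=400.0)   # hypothetical cost of an angry customer
print(report)
```

On this toy data the model is right four times out of six, yet the single frozen payment accounts for most of the cost. That is the gap the averaged number was covering.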

The part that felt like cheating

Then we let the model quit. On the genuinely ambiguous transactions, the ones balanced right on the knife edge, it stopped forcing a guess and handed them to a human for a ten-second look. Confidence and correctness are different things, and a model that can feel the difference is worth more than one that is a fraction more accurate and sure of itself about everything.
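One simple way to let a model quit is a review band around the decision threshold. The sketch below assumes the model emits a fraud score between 0 and 1; the band edges (0.30 and 0.80) and the `route` function are hypothetical, and in practice the edges would be tuned against the cost of each error and the reviewers' capacity.

```python
def route(fraud_score, approve_below=0.30, block_above=0.80):
    """Return 'approve', 'block', or 'review' for a fraud score in [0, 1].

    Scores inside the band [approve_below, block_above] are treated as
    ambiguous and sent to a human instead of being forced into a guess.
    """
    if not 0.0 <= fraud_score <= 1.0:
        raise ValueError("fraud_score must be between 0 and 1")
    if fraud_score < approve_below:
        return "approve"
    if fraud_score > block_above:
        return "block"
    return "review"  # ambiguous: hand to a human

decisions = [route(s) for s in (0.05, 0.55, 0.92)]
print(decisions)
```

Widening the band sends more transactions to review and fewer to an automatic freeze, which is the trade the team chose to make.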

The headline accuracy went down. It got worse on paper. The next big merchant they signed stayed, the chargebacks they actually cared about dropped, and the Saturday night freezes stopped. The dashboard now shows a number nobody would screenshot with a rocket, and the company underneath it is in better shape than the day the brag went out.

That is the whole lesson, and it travels past fraud. A churn score, an underwriting flag, a stockout alert, a demand forecast: anything that puts a number in front of a person who then makes a decision on it lives or dies on the same two questions. Not “how accurate is it.” Instead: what does it cost us when it’s wrong, and does the thing know when it might be. If you can’t answer the second, the figure on your deck isn’t a result. It’s set dressing.
