95% accurate and completely useless
Introduction
The model shipped on a Thursday, and the team did the thing teams do. Screenshot of the dashboard, 95% accuracy, posted in the company channel with a rocket next to it. The fraud system worked. Everyone moved on to the next thing.
The number was real. That was the problem.
Fraud at this company ran around 3% of transactions. Which means a model that does absolutely nothing, that waves every payment through with a cheerful “looks fine,” scores 97% just by showing up. Their celebrated 95% was, in plain terms, losing to a brick. But nobody runs the brick comparison, because 95% pattern-matches to “good,” and good doesn’t get interrogated.
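The brick comparison takes about ten lines. What follows is a minimal sketch, not the team's actual tooling: the labels are synthetic, built only to match the roughly 3% fraud rate above, and the function name `always_legit_baseline` is made up for illustration.

```python
def always_legit_baseline(labels):
    """Score a 'model' that approves every transaction.

    labels: list of 0/1 ints, where 1 means the transaction was fraud.
    Returns (accuracy, fraud_recall) for the do-nothing model.
    """
    total = len(labels)
    fraud = sum(labels)
    # Every legitimate transaction is "correct"; every fraud is missed.
    accuracy = (total - fraud) / total
    fraud_recall = 0.0  # it never flags anything, so it catches no fraud
    return accuracy, fraud_recall


# Synthetic data (an assumption): 30 fraud cases in 1,000 transactions, i.e. 3%.
labels = [1] * 30 + [0] * 970

accuracy, fraud_recall = always_legit_baseline(labels)
print(f"baseline accuracy: {accuracy:.2%}, fraud caught: {fraud_recall:.0%}")
```

Any accuracy figure on an imbalanced problem only means something next to this number. A model has to beat the brick before it earns a rocket.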
It got interrogated three months later, when their largest merchant churned.
The story there was ugly in a quiet way. The model had been catching the obvious fraud, the clumsy stuff, the cases it was always going to catch. The fraud it missed was the careful kind, the actual professionals, who were also the only ones large enough to matter. Meanwhile, to keep its score clean, it had been gently freezing legitimate cards on the edges, including a run of the merchant’s best customers on a Saturday night. The merchant didn’t churn over fraud. They churned over the model’s idea of being careful.
By the time they booked a consultation call, the dashboard still glowed 95% and no one in the building believed it anymore.
A score that adds up the wrong things
Here is what the number was hiding. A missed fraud cost them a chargeback, an annoyance, a figure with a dollar sign and not many zeros. A frozen legitimate payment cost them a furious customer and sometimes the whole account. Those two mistakes are nothing alike, and accuracy had been quietly averaging them into one number as if they were. A metric that treats your cheapest error and your most expensive error as the same event is not measuring your business. It is flattering it.
So we stopped averaging. We scored the two failures separately and weighted each by what it actually did to revenue. The proud single percentage came apart into a few numbers that were uglier to look at and impossible to argue with. Which is the point of a metric. The one on the dashboard had been comfortable, and comfortable is how a bad number survives a year.
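In code, the idea of scoring the two failures separately might look something like the sketch below. The dollar figures are placeholders invented for the example, not the company's real costs, and `ErrorCosts` and `score_by_cost` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class ErrorCosts:
    missed_fraud: float  # hypothetical cost of one missed fraud (a chargeback)
    false_freeze: float  # hypothetical cost of freezing one legitimate payment


def score_by_cost(labels, preds, costs):
    """Count each kind of error separately, then weight it by its cost.

    labels, preds: lists of 0/1 ints, where 1 means fraud / flagged as fraud.
    """
    missed = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
    frozen = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
    return {
        "missed_fraud": missed,
        "false_freezes": frozen,
        "total_cost": missed * costs.missed_fraud + frozen * costs.false_freeze,
    }


# Toy example: one missed fraud and one frozen legitimate payment.
labels = [1, 1, 0, 0, 0, 0]
preds = [1, 0, 1, 0, 0, 0]
costs = ErrorCosts(missed_fraud=50.0, false_freeze=400.0)  # placeholder values

report = score_by_cost(labels, preds, costs)
print(report)
```

Accuracy would count those two mistakes as the same event, two wrong out of six. The cost view shows that one of them is doing most of the damage.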
The part that felt like cheating
Then we let the model quit. On the genuinely ambiguous transactions, the ones balanced right on the knife edge, it stopped forcing a guess and handed them to a human for a ten-second look. Confidence and correctness are different things, and a model that can feel the difference is worth more than one that is a fraction more accurate and sure of itself about everything.
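One common way to let a model quit is a pair of thresholds with a review band between them. This is a sketch of that general technique, not a description of what was actually deployed: the function name `route` and the 0.2 and 0.8 cutoffs are assumptions, and it presumes the model emits a fraud probability that is reasonably well calibrated.

```python
def route(fraud_probability, low=0.2, high=0.8):
    """Decide what to do with one transaction.

    fraud_probability: the model's score in [0, 1].
    low, high: hypothetical cutoffs; anything strictly between them is
    treated as too ambiguous to decide automatically.
    """
    if not 0.0 <= fraud_probability <= 1.0:
        raise ValueError("fraud_probability must be between 0 and 1")
    if fraud_probability >= high:
        return "block"
    if fraud_probability <= low:
        return "approve"
    return "review"  # hand it to a human instead of forcing a guess


for p in (0.05, 0.5, 0.95):
    print(p, route(p))
```

Where the cutoffs sit is a business decision, not a modeling one. Widening the band sends more transactions to a human and fewer to a coin flip, and the right width depends on what each kind of error costs.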
The headline accuracy went down. It got worse on paper. The next big merchant they signed stayed, the chargebacks they actually cared about dropped, and the Saturday night freezes stopped. The dashboard now shows a number nobody would screenshot with a rocket, and the company underneath it is in better shape than the day the brag went out.
That is the whole lesson, and it travels past fraud. A churn score, an underwriting flag, a stockout alert, a demand forecast: anything that puts a number in front of a person who then makes a decision on it lives or dies on the same two questions. Not “how accurate is it.” Instead: what does it cost us when it’s wrong, and does the thing know when it might be. If you can’t answer the second, the figure on your deck isn’t a result. It’s set dressing.