Tensor LabsTENSORLABS

The pipeline trusted the filename

The unwritten convention holding your pipeline together

June 23, 20263 min read3 sectionsBy Ahmed Abdullah
The pipeline trusted the filename

Introduction

The monthly report came back two hundred pages short and nobody could say why. Same pipeline as always. No errors in the logs. The job had reported success, posted its green summary, and gone to sleep. It was only when a regional manager asked where his entire territory had gone that anyone went looking.

His territory had been renamed. That was the whole crime.

The pipeline ingested files dropped into a shared folder, and it decided what each file was by reading its name. A file ending in one suffix was treated as the master record. A file with the region's name in the right position got matched to that region. This had worked for years, quietly, because everyone who dropped files in knew the unwritten rule and followed it without thinking. There was a convention. There was no contract. The difference between those two words is the difference between the report being whole and the report being short.

An unwritten convention is a contract you have agreed not to enforce.

A rename, and a region disappears

Someone new had exported the territory data, named the file the sensible, human way, and dropped it in. The name no longer matched the pattern the pipeline was scanning for. So the pipeline did not error. It did not warn. It simply did not see the file as belonging to anything, skipped it the way you'd step over a parcel that isn't addressed to you, and processed everything else perfectly. Success, said the log, truthfully. It had successfully processed every file it recognised. It just declined to mention the one it didn't.

This is the failure mode that scares me more than a crash. A crash is loud. A crash pages someone. This was a silent subtraction, a report that was wrong in exactly the way that looks right, because a slightly shorter stack of pages is not obviously a shorter stack. If the manager had been on holiday that month, the number would have shipped, been believed, and quietly distorted whatever decision it fed.

The root cause was not the rename. The rename was always going to happen. The root cause was that the system carried an assumption it never checked: that every file in the folder would be named by someone who knew the rule. The moment that stopped being true, the system had no way to notice, because it was never counting what it expected to see against what it found.

The rule nobody wrote down

The fix is cheap and slightly humiliating, which is the best kind. Make the pipeline declare its expectations out loud. It expected files for twelve regions and matched ten. That is not a success, that is a two-region hole, and the run should refuse to call itself green until a human has looked at the gap. You do not need smarter file matching. You need the job to count, and to flinch when the count is wrong.

Every long-lived data pipeline is held together by a few of these conventions, load-bearing and undocumented, known only to the people who were there when it was built. They work right up until the day someone reasonable does something reasonable that nobody wrote down.

When TensorLabs inherits a pipeline, the first thing we hunt for is the rule nobody wrote down, because that is the one that fails without telling you.