Tensor Labs

The data nobody ever queried

When keeping everything becomes a standing liability

June 23, 2026 · 3 min read · 3 sections · By Ahmed Abdullah
Introduction

Two numbers, sitting next to each other in the same review. The storage and processing bill for the candidate data lake: a comfortably large monthly figure, growing every quarter, line-itemed and approved without much thought because data is an asset and assets cost money. The number of times anyone had actually queried most of that data in the past year: zero. Not low. Zero.

They had been paying, every month, to carefully preserve information they never once looked at.

It had accumulated the way these things always do, one reasonable decision at a time. Every integration captured a few more fields, just in case. Every new feature logged a bit more candidate behaviour, because you might want it later for a model. Nobody ever stood up in a meeting and proposed building a vast archive of personal data with no clear use. It assembled itself out of a hundred small "we should probably keep that" decisions, each individually sensible, collectively a liability with a monthly invoice attached.

Data you are not using is not an asset sitting in reserve. It is a risk sitting in storage.

The two numbers that should never sit together

And here is the part the storage bill doesn't show. The cost on the invoice was the cheap cost. The real exposure was that every one of those unused fields was personal data about real candidates, sitting in a system, governed by laws that do not care whether you ever query it. The day there is a breach, the regulator does not ask how often you used the data. The regulator asks why you still had it. "We thought we might need it someday" is not a defence; it is an admission. Each unqueried field was a small bet that the convenience of maybe-needing-it-later outweighed the standing risk of holding it, and nobody had ever actually made that bet on purpose. It just happened.

The instinct that drives this is a good instinct pointed the wrong way. Engineers and data teams are trained, correctly, that throwing data away is irreversible and that you tend to regret it. So the safe-feeling default is to keep everything. But "keep everything" is only safe if data is inert, and personal data is not inert. It is a perishable liability that happens to look like an asset on the balance sheet, and the longer you hold the part you don't use, the worse that trade gets.

Delete things on purpose

The fix is unfashionable: delete things on purpose. Tie collection to a use, not to a hunch. For every field you capture, name the question it answers and the model it feeds. If you cannot name one, that is not data you are saving for later; it is data you are storing for the regulator to find. Put an expiry on what you keep. Make "we might need it" pass an actual test before it earns a permanent home in a system full of other people's personal information.
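One way to make that test concrete is to write the answer down next to the field itself. The sketch below is illustrative only: `FieldPolicy`, `review`, `is_expired`, and the example field names are hypothetical, not part of any real system. It also assumes a field counts as justified when at least one use (a question or a model) is named, which is one reading of the rule above.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional


@dataclass(frozen=True)
class FieldPolicy:
    """Hypothetical record of why a field is kept and for how long."""
    name: str
    question: Optional[str]        # the question this field answers, if any
    model: Optional[str]           # the model this field feeds, if any
    retention_days: Optional[int]  # None means nobody has set an expiry


def review(policy: FieldPolicy) -> str:
    """Approve collection only for fields with a named use and an expiry."""
    if not (policy.question or policy.model):
        return "do not collect: no named use"
    if policy.retention_days is None:
        return "blocked: no expiry set"
    return "collect"


def is_expired(policy: FieldPolicy, captured_on: date, today: date) -> bool:
    """True once a stored value has outlived its retention window."""
    if policy.retention_days is None:
        # No expiry on record; review() already refuses such fields.
        return False
    return today > captured_on + timedelta(days=policy.retention_days)


if __name__ == "__main__":
    # Example fields are invented for illustration.
    policies = [
        FieldPolicy("interview_score", "Did the candidate pass the panel?", None, 365),
        FieldPolicy("scroll_depth", None, None, None),
        FieldPolicy("cv_text", None, "ranking-model", None),
    ]
    for p in policies:
        print(p.name, "->", review(p))
```

The point of the design is that "we might need it" has to be typed into a field before anything is stored. A field with no question, no model, or no expiry never reaches the lake, and a scheduled job can call `is_expired` to decide what to delete.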

The cheapest, safest data is the data you were honest enough not to keep. It cannot leak, it cannot be subpoenaed, and it does not show up on next quarter's bill.

TensorLabs asks what you are going to query before it asks what you are going to store, because the second question, left alone, answers itself with "everything," and everything is exactly the wrong amount.