Tensor Labs

The map that ran out of memory

June 6, 2026 · 3 min read · 3 sections · By Tensor Labs
Introduction

Somewhere between the demo and the third customer, the product started dying.

Not gracefully. Out of memory, process killed, support tickets stacking up. And only when someone drew a bigger box on the map, which is a miserable thing to debug, because the obvious culprit is also the most expensive one to rule out.

The founder had already reached his verdict. Out of memory means buy more memory. He was a week from tripling the server bill.

We got on a consultation call and asked the annoying question first. What is the work actually proportional to?

The symptom and the cause are rarely neighbors

The app drew analytical heat-maps over whatever region a user selected. To do that, it sliced the region into a grid and ran interpolation math on every cell.

The grid had a fixed resolution. Every region, no matter its size, got chopped into cells of the same small size. A small area gave a sane number of cells. A bigger one gave hundreds of thousands, then past a million, and the interpolation tried to grab several gigabytes in one breath. On a box with ten gigabytes of memory, it fell over.

So yes, technically, out of memory. But memory was never the problem. The work scaled with the area of the rectangle the user dragged, not the amount of real data inside it. Double the side of the box and the math quadruples, to say roughly the same thing.
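To make the scaling concrete, here is a minimal sketch of how cell count behaves when every region is tiled at one fixed cell size. The 10 meter cell and the box sizes are made-up numbers for illustration, not the app's real settings, and `fixed_grid_cells` is a hypothetical helper.

```python
import math


def fixed_grid_cells(width_m: float, height_m: float, cell_m: float) -> int:
    """Number of cells when a region is tiled at one fixed cell size."""
    cols = math.ceil(width_m / cell_m)
    rows = math.ceil(height_m / cell_m)
    return cols * rows


CELL_M = 10.0  # hypothetical fixed cell size, in meters

# Square boxes of growing side length. The data inside could be identical
# in every case; only the dragged rectangle changes.
for side_km in (1, 2, 4, 8, 16):
    side_m = side_km * 1000.0
    cells = fixed_grid_cells(side_m, side_m, CELL_M)
    print(f"{side_km:>2} km box -> {cells:>9,} cells")
```

Each doubling of the side multiplies the cell count by four. With these illustrative numbers, a 16 km box already lands at about 2.5 million cells, and nothing in the function ever looked at how many data points were inside.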

A bigger server buys about one more zoom level before the same crash, at a permanently higher monthly rate. You pay more to fail slightly later. That is the whole trade.

Make the expensive thing adaptive

The fix never touched the hardware.

We made the grid adaptive. Cell size scales to the region and the density of actual points, so a huge area gets a coarse grid instead of an absurdly fine one. We put a preflight check in front of the job that estimates the memory it will need before it runs, so the system can say no like an adult instead of dying mid-request. The heavy interpolation now runs only where it earns its keep.
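One way an adaptive grid like this could be sized is sketched below. It is an assumption-laden illustration, not the production code: `adaptive_cell_size`, `grid_shape`, the 5,000 cell budget, the four cells per data point, and the 10 meter floor are all hypothetical choices.

```python
import math


def adaptive_cell_size(
    width_m: float,
    height_m: float,
    n_points: int,
    max_cells: int = 5000,         # hypothetical cell budget
    cells_per_point: float = 4.0,  # hypothetical density heuristic
    min_cell_m: float = 10.0,      # hypothetical finest useful cell
) -> float:
    """Pick a cell size from the region's area and the amount of real data."""
    area = width_m * height_m
    # Aim for a few cells per data point, but never more than the budget.
    target_cells = max(1, min(max_cells, int(n_points * cells_per_point)))
    # Square cells: cell_size^2 * target_cells is roughly the area.
    cell_m = math.sqrt(area / target_cells)
    return max(cell_m, min_cell_m)


def grid_shape(width_m: float, height_m: float, cell_m: float) -> tuple[int, int]:
    """Columns and rows needed to cover the region at this cell size."""
    return math.ceil(width_m / cell_m), math.ceil(height_m / cell_m)


# Same point count, very different box sizes.
for side_m in (1_000.0, 16_000.0):
    cell_m = adaptive_cell_size(side_m, side_m, n_points=100_000)
    cols, rows = grid_shape(side_m, side_m, cell_m)
    print(f"{side_m:>8,.0f} m box: cell {cell_m:6.1f} m, {cols * rows:,} cells")
```

Because rows and columns are rounded up, the final count can land slightly above the budget, but it stays in the low thousands whether the box is one kilometer wide or sixteen. The large box simply gets larger cells.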

Same hardware. The exact region that had been reaching for gigabytes ran on a few thousand cells. Around fifteen megabytes. Sixty times less, for a picture no human eye can tell apart from the old one.

The bill stayed flat. The crashing stopped. Nobody bought anything.

Now, in fairness to the buy-a-bigger-box instinct: sometimes you really are out of compute and the answer really is more of it. Scaling is real. But “it crashed, so we need a bigger machine” quietly skips the question that decides which world you are actually in. Does this workload grow with your data, or with something it has no business growing with? If it is the second, hardware just shoves the cliff a few feet further out and waits.

Ask it before the invoice, not after. When something falls over at scale, find what the work is truly proportional to. Hunt the step that grows faster than your data does. Then put a cheap check in front of the expensive one, so it fails with a sentence instead of a stack trace.
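A cheap check in front of an expensive step can be very small. The sketch below assumes, purely for illustration, that the interpolation builds a dense cells-by-points matrix of 64-bit floats; the real memory model depends on the method used. `RegionTooLargeError`, `estimate_bytes`, and `preflight` are hypothetical names.

```python
class RegionTooLargeError(ValueError):
    """Raised before any heavy work starts, with a readable explanation."""


def estimate_bytes(cols: int, rows: int, n_points: int, bytes_per_value: int = 8) -> int:
    # Assumed cost model: one float64 per (cell, point) pair.
    return cols * rows * n_points * bytes_per_value


def preflight(cols: int, rows: int, n_points: int, budget_bytes: int) -> int:
    """Return the estimated bytes, or refuse the job with a clear message."""
    need = estimate_bytes(cols, rows, n_points)
    if need > budget_bytes:
        raise RegionTooLargeError(
            f"Region needs about {need / 1e6:,.0f} MB for interpolation, "
            f"but the limit is {budget_bytes / 1e6:,.0f} MB. "
            "Try a smaller region or a coarser grid."
        )
    return need


BUDGET = 1_000_000_000  # hypothetical 1 GB per-request budget

# Try a modest grid and an oversized one against the same budget.
for cols, rows in ((70, 70), (1000, 1000)):
    try:
        need = preflight(cols, rows, n_points=500, budget_bytes=BUDGET)
        print(f"{cols}x{rows}: ok, about {need / 1e6:,.1f} MB")
    except RegionTooLargeError as err:
        print(f"{cols}x{rows}: refused. {err}")
```

The estimate is a multiplication, so it costs nothing next to the job it guards. The oversized request gets a sentence the user can act on, and the process that would have been killed never starts.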

None of this is exotic. It is just the part that is easy to skip when a bigger instance is one click away and the thing is on fire right now.

The founder kept his server budget and lost his crash, which is a strange way for a call to end when it could have ended in a quote. The map still draws whatever box you hand it. It just stopped mistaking a bigger rectangle for a bigger problem.