Tensor Labs

The million-token search engine that isn’t

A library with no card catalog has every book you could want and no way to find any of them. Bigger libraries don’t fix search. They make search worse. That is most of what is wrong with the current marketing of long context windows.

May 5, 2026 · 3 min read · By Tensor Labs

The boast that wasn’t

A new model release ships with a million-token context window. The marketing slide reads like the end of retrieval. Why bother building a vector store if the whole codebase fits in the prompt? Why think about chunking, embeddings, or rerankers if the model can hold an entire quarterly report in its head?

Because attention isn’t recall. The model can read every page of every book in the library and still miss the one you needed. The published benchmarks are clear about this, and almost nobody reads them.

Where the model loses the thread

On a recent project, the same questions ran against the same 600,000 tokens of documentation through two pipelines. The first pasted everything into the model’s context. The second was a sixty-line retrieval pipeline: chunk, embed, top-k, rerank. The retrieval version was noticeably more accurate on multi-hop questions, and an order of magnitude cheaper per query. The big-context version did not get worse because it was given more context. It got worse because the relevant context arrived buried in irrelevant context.
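For concreteness, the retrieval side looks roughly like the sketch below. This is a minimal illustration, not the project’s actual pipeline: the embedding and reranker model names, the chunk size, and the k values are all assumptions filled in for the example.

```python
# Minimal chunk -> embed -> top-k -> rerank sketch. Models, chunk sizes,
# and k values are illustrative assumptions, not the project's real config.
import numpy as np
from sentence_transformers import SentenceTransformer, CrossEncoder

embedder = SentenceTransformer("all-MiniLM-L6-v2")               # bi-encoder: fast recall
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")  # cross-encoder: precision

def chunk(text: str, size: int = 800, overlap: int = 100) -> list[str]:
    """Fixed-size character windows; real pipelines often split on structure instead."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

def build_index(docs: list[str]) -> tuple[list[str], np.ndarray]:
    """Chunk every document and embed the chunks once, up front."""
    chunks = [c for d in docs for c in chunk(d)]
    vecs = embedder.encode(chunks, normalize_embeddings=True)
    return chunks, np.asarray(vecs)

def retrieve(query: str, chunks: list[str], vecs: np.ndarray,
             k: int = 20, final_k: int = 5) -> list[str]:
    """Cheap cosine top-k for candidates, then a cross-encoder rerank of those."""
    qv = embedder.encode([query], normalize_embeddings=True)[0]
    top = np.argsort(vecs @ qv)[::-1][:k]          # cosine similarity on unit vectors
    scores = reranker.predict([(query, chunks[i]) for i in top])
    order = np.argsort(scores)[::-1][:final_k]
    return [chunks[top[i]] for i in order]
```

The cost gap falls straight out of the token counts: a handful of reranked chunks is a few thousand prompt tokens per query, against the full 600,000 for the paste-everything version.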

That is the part the marketing skips. The model sees every token. It does not give every token equal attention. Past a certain distractor density, multi-hop accuracy degrades visibly. Independent benchmarks (NIAH variants, RULER, the Needles series) report effective context that is a fraction of the advertised window. The 1M-token boast is a capacity figure. The number that matters in production is the recall figure.
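If you want that recall figure for your own stack rather than the vendor’s, the probe is simple to run. Below is a minimal needle-in-a-haystack check in the spirit of those benchmarks; `ask_model` is a hypothetical stand-in for whatever model API you actually call.

```python
# One needle-in-a-haystack probe: plant a fact at a given depth in filler
# text and ask for it back. `ask_model` is a hypothetical stand-in.
import random

def niah_trial(ask_model, filler: str, context_chars: int, depth: float) -> bool:
    """Return True if the model recovers a fact planted at `depth` (0..1)."""
    secret = str(random.randint(100000, 999999))
    needle = f"The vault code is {secret}."
    haystack = (filler * (context_chars // len(filler) + 1))[:context_chars]
    pos = int(len(haystack) * depth)
    prompt = (haystack[:pos] + " " + needle + " " + haystack[pos:]
              + "\n\nWhat is the vault code?")
    return secret in ask_model(prompt)
```

Sweep `depth` and `context_chars` and the gap shows up as a grid: the advertised window is where recall should stay flat, and the effective window is where it actually does.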

(The hardest version of this is the team that ran the benchmark, saw the degradation, and shipped the long-context version anyway because it was less work to integrate.)

What the catalog actually does

A retrieval pipeline does more than fetch the right pages. It shrinks the search space the model has to reason over. It removes distractors. It lets the model spend its attention on the hundred tokens that matter, not the hundred thousand around them.
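Concretely, the payoff is in what the model is finally handed. Continuing the hypothetical sketch above, the assembled prompt carries a few focused passages instead of the whole corpus:

```python
# Continues the retrieval sketch above (retrieve, chunks, vecs come from it).
def build_prompt(query: str, chunks: list[str], vecs, k: int = 5) -> str:
    """Assemble a small, distractor-free prompt from the top reranked chunks."""
    passages = retrieve(query, chunks, vecs, final_k=k)
    context = "\n\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    # A few thousand on-topic tokens, not 600,000 tokens of mostly noise.
    return f"Answer using only the passages below.\n\n{context}\n\nQuestion: {query}"
```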

That is what a card catalog did in a library. Not store the books. Let the librarian find them. The catalog was a small thing. It made the entire library work.

To be fair, long context wins in some cases. Documents that need to be read end-to-end (legal arguments, narrative analysis, code review on a small file) benefit from the model seeing everything at once. Conversations with months of memory benefit from never dropping context. Some workloads are not retrieval problems and shouldn’t be forced into one.

But “paste the whole repo” is not engineering. It is the laziest possible answer to the question of how the model should access information.

Bigger libraries, smaller answers

The library with no catalog is the same library either way. The books are still there. The question is whether you can find one.

Most teams that have shipped both architectures arrive at the same conclusion: the retrieval version ages better. The bigger the library gets, the more the catalog earns its keep, not less.