The build went green by giving up

When a retry hides the bug it was meant to catch

June 23, 2026 | 3 min read | 3 sections | By Ahmed Abdullah

Introduction

The pull request had a green check beside it, so it got merged. That is the whole point of a green check: it is permission to stop paying attention. Two days later the bug that check was supposed to catch took down a checkout flow for forty minutes, and someone finally went back to ask why the test had passed.

It had passed on the third attempt.

The suite was set to retry flaky tests up to three times before reporting failure. Reasonable on paper. Tests flake for dull reasons: a slow container, a port that isn't ready, a network hiccup in the runner. A retry smooths that noise so humans can trust the signal. Except this test was not flaking. It was failing correctly, about one run in three, because the bug it covered was itself intermittent. The retry could not tell the two apart. It treated a flaky test and a real-but-occasional bug the same way: run it again, and if green ever shows up, call the whole thing green.
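A minimal sketch of that policy, assuming a hypothetical `run_with_retries` helper and a test modeled as a zero-argument callable that returns True on pass. It is not any particular runner's implementation; it only illustrates why the final verdict carries no trace of the attempts that came before it.

```python
from typing import Callable, Iterable


def run_with_retries(test: Callable[[], bool], max_attempts: int = 3) -> bool:
    """Hypothetical retry-until-green policy: green if any attempt passes."""
    for _ in range(max_attempts):
        if test():
            return True  # the earlier red attempts are discarded here
    return False


def make_scripted_test(outcomes: Iterable[bool]) -> Callable[[], bool]:
    """Build a fake test that replays a fixed sequence of pass/fail outcomes."""
    it = iter(outcomes)
    return lambda: next(it)


# A flaky test and an intermittent bug can share the same history:
# red, red, green. The policy returns the same verdict for both.
flaky_verdict = run_with_retries(make_scripted_test([False, False, True]))
buggy_verdict = run_with_retries(make_scripted_test([False, False, True]))
```

Both verdicts come back as a plain boolean, so nothing downstream can distinguish a slow container from a real defect.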

A retry cannot tell a flaky test from an intermittent bug. It forgives both.

Green on the third try

So the runner did exactly what it was told. Red, shrug, again. Red, shrug, again. On the third try it finally caught a pass, reported success, and the board went clean. The bug rode through the merge with a passing grade stapled to it.

The cruel part is that this failure mode gets worse the better it works. Every time a retry rescues a genuinely flaky test, the team trusts it a little more. The retry count creeps up. Three becomes five, because five keeps the board even greener. And a higher retry count is a wider hole in the net for precisely the bugs you most want to catch: the intermittent ones, the heisenbugs, the failures that only surface under load. You are quietly tuning the system to hide its most dangerous category of problem.
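The arithmetic behind that creep can be sketched under one loud assumption: each attempt is independent and fails with a fixed probability. Real intermittent failures are often correlated, so treat the numbers as illustrative. The function name `masked_probability` is hypothetical, and the one-in-three failure rate is the rough figure quoted earlier for this test.

```python
def masked_probability(failure_rate: float, max_attempts: int) -> float:
    """Chance that at least one of max_attempts independent runs passes,
    which is the chance a retry-until-green policy reports green."""
    return 1.0 - failure_rate ** max_attempts


# Assumed per-run failure rate of one in three.
for attempts in (1, 3, 5):
    print(attempts, round(masked_probability(1 / 3, attempts), 3))
# prints:
# 1 0.667
# 3 0.963
# 5 0.996
```

Under those assumptions, moving from three attempts to five takes the chance of a green board from roughly 96% to over 99%, while the bug itself has not changed at all.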

The net you tuned to catch nothing

The fix is not to rip out retries. Flaky tests are real, and blocking every merge on them is its own misery. The fix is to make the retry visible. Count it. A test that only passes on the second or third try is not a pass, it is a warning, and it belongs somewhere a human looks. This test needed two retries this week. This one needed three and is getting worse. A green check that quietly required three attempts is a different fact than one that passed first try, and most teams render the two identically.
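One way to count it, sketched with hypothetical names (`RetryOutcome`, `run_counted`, and the "warn" status are assumptions for illustration, not a feature of any specific test runner): keep the attempt number alongside the verdict and report a retried pass as its own state.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass
class RetryOutcome:
    name: str
    passed: bool
    attempts: int  # number of runs it took to reach the final verdict

    @property
    def status(self) -> str:
        """'pass' only for a first-try green; a retried green is 'warn'."""
        if not self.passed:
            return "fail"
        return "pass" if self.attempts == 1 else "warn"


def run_counted(name: str, test: Callable[[], bool],
                max_attempts: int = 3) -> RetryOutcome:
    """Retry like before, but keep the attempt count instead of discarding it."""
    for attempt in range(1, max_attempts + 1):
        if test():
            return RetryOutcome(name, True, attempt)
    return RetryOutcome(name, False, max_attempts)


def scripted(outcomes: Iterable[bool]) -> Callable[[], bool]:
    """Fake test that replays a fixed sequence of pass/fail outcomes."""
    it = iter(outcomes)
    return lambda: next(it)


clean = run_counted("test_login", scripted([True]))
rescued = run_counted("test_checkout", scripted([False, False, True]))
print(clean.status, clean.attempts)      # pass 1
print(rescued.status, rescued.attempts)  # warn 3
```

The merge can still go through on a "warn" if the team chooses, but the attempt count is now a recorded fact that can be tracked per test over time rather than folded into a single green check.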

Flakiness is not noise to be silenced. It is the system telling you which parts of itself are unstable, and the retry button is, very politely, dropping that message in the bin. The instinct to make the board green is the right instinct pointed at the wrong target. You wanted confidence. You bought the appearance of it.

There is an honest version of green and a cosmetic one, and from across the room they look the same. The difference is whether anyone is counting what it cost to get there.

TensorLabs has spent more than one engagement deleting retry logic that turned out to be the only thing standing between a team and a bug they already had. Unglamorous work. Usually the cheapest forty minutes of downtime a team ever buys back.
