Test your prompts like you test your code

A prompt is code you have decided not to test

June 30, 2026 | 3 min read | By Ahmed Abdullah

Introduction

Complaints started arriving on a Monday, and they did not make sense. The product's AI assistant had been summarizing support tickets cleanly for months. Now a slice of summaries were dropping the customer's actual problem and leading with the sign-off instead. Nothing in the application code had changed. What had changed, three days earlier, was eleven words in a prompt, a small wording improvement someone made to fix a different complaint, shipped on a Friday with a thumbs-up from whoever read the three examples in the pull request.

Eleven words, a behavior change nobody could see, and no way to know which slice of inputs it quietly broke. That is what shipping prompts by feel becomes once you have real users. The change looks fine on the handful of cases you check by hand and fails on the thousand you did not.

A prompt is code. It is just code you have decided not to test.

The right way: a golden set and a gate

The method is to treat every prompt and every model version as what they are, a dependency that changes behavior, and put it through the same regression testing as the rest of your code. You build a golden set: a curated collection of real inputs paired with what good output looks like, the tricky tickets, the edge cases, the ones that have burned you before. Every proposed change to a prompt or a model runs against the whole set automatically, and if quality drops, the change does not merge. Same gate as a failing unit test.

This turns "does this prompt seem better" into a number you can stand behind. You stop arguing from three cherry-picked examples and start measuring against the hundred cases that actually matter.
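A minimal sketch of the golden set and the gate described above, in Python. Everything named here is an assumption made for illustration: the `GoldenCase` shape, the phrase-matching scorer, and the two stand-in summarizers that take the place of real prompt calls. A real harness would call the model and use richer scoring, but the control flow is the same: run every case, compute one number, compare it to the baseline.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class GoldenCase:
    """One curated input plus what a good output is expected to contain."""
    case_id: str
    ticket_text: str
    must_mention: tuple[str, ...]  # phrases a good summary should include


def score_case(summary: str, case: GoldenCase) -> float:
    """Fraction of required phrases present in the summary (0.0 to 1.0)."""
    if not case.must_mention:
        return 1.0
    text = summary.lower()
    hits = sum(1 for phrase in case.must_mention if phrase.lower() in text)
    return hits / len(case.must_mention)


def run_golden_set(summarize: Callable[[str], str], cases: list[GoldenCase]) -> float:
    """Mean score of one prompt/model combination over the whole golden set."""
    if not cases:
        raise ValueError("golden set is empty")
    return sum(score_case(summarize(c.ticket_text), c) for c in cases) / len(cases)


def gate(candidate_score: float, baseline_score: float, tolerance: float = 0.0) -> bool:
    """True when the change may merge: quality did not drop below the baseline."""
    return candidate_score >= baseline_score - tolerance


# Hypothetical golden set: two tickets with sign-offs, like the ones in the story.
GOLDEN_SET = [
    GoldenCase("t-001", "App crashes on login since the update. Thanks, Dana", ("crash", "login")),
    GoldenCase("t-002", "I was charged twice for my subscription. Regards, Lee", ("charged twice",)),
]


def baseline_summarize(ticket: str) -> str:
    # Stand-in for the current prompt: keeps the first sentence.
    return ticket.split(".")[0]


def candidate_summarize(ticket: str) -> str:
    # Stand-in for a regressed prompt: leads with the sign-off instead.
    return ticket.split(".")[-1].strip()


baseline = run_golden_set(baseline_summarize, GOLDEN_SET)
candidate = run_golden_set(candidate_summarize, GOLDEN_SET)
may_merge = gate(candidate, baseline)
```

In a CI job, the last step would typically exit with a non-zero status when `may_merge` is false, which is what makes it behave like a failing unit test on the pull request.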

How you score output that has no single right answer

The obvious objection is that language output is not pass-or-fail like a function return. True, and the method handles it on two levels. First, hard checks cover the things that are objective: is it valid JSON, did it stay under the length limit, did it avoid inventing a price, does it cite a real ticket? Then, for the fuzzy quality that resists assertions, you use a model as a judge: a separate call that scores each output against a rubric. Did the summary capture the core issue? Is the tone right? Run that judge across the whole golden set and you get a quality score per change, tracked over time like any other metric.
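The two levels can be sketched in Python as follows. This is an illustration under stated assumptions, not a reference implementation: the output schema (`ticket_id` and `summary` keys), the 400-character limit, the price pattern, and the rubric question names are all hypothetical. The judge is passed in as a plain callable so the sketch runs without a model; in practice it would wrap the separate model call.

```python
import json
import re
from typing import Callable

MAX_SUMMARY_CHARS = 400  # hypothetical length limit
RUBRIC = ("captures_core_issue", "tone_is_appropriate")  # hypothetical rubric questions


def hard_checks(raw_output: str, ticket_text: str, known_ticket_ids: set[str]) -> list[str]:
    """Return the names of failed objective checks; an empty list means all passed."""
    try:
        data = json.loads(raw_output)
    except json.JSONDecodeError:
        return ["invalid_json"]
    if not isinstance(data, dict):
        return ["invalid_json"]

    failures = []
    summary = str(data.get("summary", ""))
    if len(summary) > MAX_SUMMARY_CHARS:
        failures.append("too_long")

    # Crude check: any dollar amount in the summary must appear verbatim in the ticket.
    for price in re.findall(r"\$\d+(?:\.\d{2})?", summary):
        if price not in ticket_text:
            failures.append("invented_price")
            break

    ticket_id = data.get("ticket_id")
    if not isinstance(ticket_id, str) or ticket_id not in known_ticket_ids:
        failures.append("unknown_ticket")
    return failures


def judge_score(judge: Callable[[str, str, str], bool], ticket_text: str, summary: str) -> float:
    """Fraction of rubric questions the judge answers 'yes' for this output."""
    return sum(judge(question, ticket_text, summary) for question in RUBRIC) / len(RUBRIC)


# Example inputs (made up for illustration).
ticket = "I was charged $20.00 twice this month."
good_output = json.dumps({"ticket_id": "T-1", "summary": "Customer was charged $20.00 twice."})
bad_output = json.dumps({"ticket_id": "T-9", "summary": "Refund of $35.00 promised."})


def stub_judge(question: str, ticket_text: str, summary: str) -> bool:
    # Stand-in for a model call: says yes to one rubric question only.
    return question == "captures_core_issue"
```

Running the hard checks first keeps the judge for outputs that are at least structurally valid, so the judge's score is not spent on output that would be rejected anyway.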

You cannot improve what you refuse to measure, and "looks good to me" is not a measurement.

There is a quieter regression this also catches. The model underneath you can change without warning when a provider updates it, and a prompt tuned for the old behavior can degrade overnight through no act of your own. Pin your model versions, and let the eval set tell you, before your customers do, whether an upgrade actually is one.
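One way to let the eval set answer "is this upgrade actually one" is to compare per-case scores between the pinned version and the candidate, rather than only the averages, since a mean can hide a broken slice. The sketch below assumes per-case scores have already been computed for both versions; the model identifiers, the score values, and the `min_drop` threshold are hypothetical.

```python
# Hypothetical identifiers: a dated, pinned snapshot and the version under evaluation.
PINNED_MODEL = "example-model-2026-01-15"
CANDIDATE_MODEL = "example-model-2026-06-01"


def find_regressions(
    pinned_scores: dict[str, float],
    candidate_scores: dict[str, float],
    min_drop: float = 0.05,
) -> list[str]:
    """Case ids whose score dropped by at least min_drop under the candidate."""
    missing = pinned_scores.keys() - candidate_scores.keys()
    if missing:
        raise ValueError(f"candidate run is missing cases: {sorted(missing)}")
    return sorted(
        case_id
        for case_id, old in pinned_scores.items()
        if old - candidate_scores[case_id] >= min_drop
    )


def is_upgrade(
    pinned_scores: dict[str, float],
    candidate_scores: dict[str, float],
    min_drop: float = 0.05,
) -> bool:
    """True when the mean did not fall and no single case regressed noticeably."""
    if not pinned_scores:
        raise ValueError("no scores for the pinned version")
    regressions = find_regressions(pinned_scores, candidate_scores, min_drop)
    mean_old = sum(pinned_scores.values()) / len(pinned_scores)
    mean_new = sum(candidate_scores[c] for c in pinned_scores) / len(pinned_scores)
    return mean_new >= mean_old and not regressions


# Made-up per-case scores for the two versions over the same golden set.
pinned = {"t-001": 0.9, "t-002": 0.8, "t-003": 1.0}
candidate = {"t-001": 0.95, "t-002": 0.4, "t-003": 1.0}

regressed_cases = find_regressions(pinned, candidate)
safe_to_upgrade = is_upgrade(pinned, candidate)
```

Here the candidate improves one case and holds another, but `t-002` falls sharply, so the comparison flags it and the pin stays where it is until that case is understood.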

Why this separates shipping AI from demoing it

Anyone can get a language feature working in a demo. Keeping it working, across model updates and prompt tweaks and the long tail of real inputs, is the part that decides whether the feature survives contact with customers. The teams that ship AI confidently are not braver. They have a gate that tells them the moment they broke something, before the something is a Monday full of complaints.

We built this harness for a team that was tuning prompts in production and finding out from users, and the cycle inverted. Changes that regressed quality got caught at the pull request, and the team started shipping prompt improvements weekly instead of fearing them.

Test your prompts like you test your code and AI stops being the part of the product you are afraid to touch. It becomes the part you can change every week, because you will know the instant you make it worse.

TensorLabs builds the evaluation and prompt-testing infrastructure behind that kind of ship-it-with-confidence AI.
