Test your prompts like you test your code
A prompt is code you have decided not to test

Introduction
Complaints started arriving on a Monday, and they did not make sense. The product's AI assistant had been summarizing support tickets cleanly for months. Now a slice of the summaries was dropping the customer's actual problem and leading with the sign-off instead. Nothing in the application code had changed. What had changed, three days earlier, was eleven words in a prompt: a small wording improvement someone made to fix a different complaint, shipped on a Friday with a thumbs-up from whoever read the three examples in the pull request.
Eleven words, a behavior change nobody could see, and no way to know which slice of inputs it quietly broke. That is what shipping prompts by feel becomes once you have real users. The change looks fine on the handful of cases you check by hand and fails on the thousand you did not.
A prompt is code. It is just code you have decided not to test.
The right way: a golden set and a gate
The method is to treat every prompt and every model version as what they are, a dependency that changes behavior, and put it through the same regression testing as the rest of your code. You build a golden set: a curated collection of real inputs paired with what good output looks like, the tricky tickets, the edge cases, the ones that have burned you before. Every proposed change to a prompt or a model runs against the whole set automatically, and if quality drops, the change does not merge. Same gate as a failing unit test.
This turns "does this prompt seem better" into a number you can stand behind. You stop arguing from three cherry-picked examples and start measuring against the hundred cases that actually matter.
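As a rough illustration, the gate described above might look like the following Python sketch. Everything named here (`GoldenCase`, `pass_rate`, `gate`, the two example tickets) is hypothetical, and the two summarizer functions are stand-ins for a real prompt-plus-model call. The scoring is a deliberately simple phrase check; a real harness would likely use richer checks.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class GoldenCase:
    case_id: str
    ticket: str
    must_mention: List[str]  # phrases a good summary has to contain


def pass_rate(summarize: Callable[[str], str], cases: List[GoldenCase]) -> float:
    """Fraction of golden cases whose summary mentions every required phrase."""
    passed = 0
    for case in cases:
        summary = summarize(case.ticket).lower()
        if all(phrase.lower() in summary for phrase in case.must_mention):
            passed += 1
    return passed / len(cases)


def gate(candidate, baseline, cases, tolerance: float = 0.0) -> bool:
    """Allow the merge only if the candidate scores at least as well as the baseline."""
    return pass_rate(candidate, cases) >= pass_rate(baseline, cases) - tolerance


# Made-up tickets for illustration; a real golden set comes from real inputs.
GOLDEN_SET = [
    GoldenCase("t-001", "I was charged twice for my May invoice. Thanks, Dana", ["charged twice"]),
    GoldenCase("t-002", "Password reset email never arrives. Best regards, Lee", ["password reset"]),
]


def baseline_prompt(ticket: str) -> str:
    """Stand-in for the current prompt: leads with the customer's problem."""
    return ticket.split(".")[0]


def candidate_prompt(ticket: str) -> str:
    """Stand-in for the changed prompt: regresses and leads with the sign-off."""
    return ticket.split(".")[-1].strip()


if __name__ == "__main__":
    ok = gate(candidate_prompt, baseline_prompt, GOLDEN_SET)
    print("merge allowed" if ok else "merge blocked: quality dropped on the golden set")
```

Wired into CI, a `False` from `gate` would fail the build the same way a failing unit test does.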
How you score output that has no single right answer
The obvious objection is that language output is not pass-or-fail like a function return. True, and the method handles it on two levels. First come hard checks for the things that are objective: is it valid JSON, did it stay under the length limit, did it avoid inventing a price, does it cite a real ticket. Then, for the fuzzy quality that resists assertions, you use a model as a judge, a separate call that scores each output against a rubric: did the summary capture the core issue, is the tone right. Run that judge across the whole golden set and you get a quality score per change, tracked over time like any other metric.
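The two levels can be sketched in Python roughly as follows. This assumes the summarizer returns JSON with `summary` and `ticket_id` fields, that prices are written in a `$12.34` style, and that the judge is an injected callable wrapping the separate model call. The field names, the 400-character limit, and the function names are all assumptions made for illustration.

```python
import json
import re
from statistics import mean
from typing import Callable, List, Set

PRICE = re.compile(r"\$\d+(?:\.\d{2})?")  # assumed price format, e.g. $19.99


def hard_checks(raw_output: str, ticket_text: str, known_ticket_ids: Set[str],
                max_chars: int = 400) -> List[str]:
    """Return the names of the objective checks this output fails."""
    try:
        data = json.loads(raw_output)
    except json.JSONDecodeError:
        return ["invalid_json"]
    if not isinstance(data, dict):
        return ["invalid_json"]

    failures = []
    summary = str(data.get("summary", ""))
    if len(summary) > max_chars:
        failures.append("too_long")
    # Any price in the summary must also appear in the source ticket.
    if set(PRICE.findall(summary)) - set(PRICE.findall(ticket_text)):
        failures.append("invented_price")
    if str(data.get("ticket_id", "")) not in known_ticket_ids:
        failures.append("unknown_ticket")
    return failures


def judged_quality(outputs: List[dict], judge: Callable[[str, str, str], float],
                   rubric: str) -> float:
    """Average rubric score over the set; `judge` wraps the separate model call."""
    return mean([judge(rubric, o["ticket"], o["summary"]) for o in outputs])


# Made-up example data.
TICKET = "Order 8841 was billed $19.99 twice."
GOOD = json.dumps({"ticket_id": "T-8841", "summary": "Customer was billed $19.99 twice."})
BAD = json.dumps({"ticket_id": "T-0000", "summary": "Customer wants a $25.00 refund."})


def stub_judge(rubric: str, ticket: str, summary: str) -> float:
    """Placeholder for a real model-as-judge call; returns a score in [0, 1]."""
    return 1.0 if "billed" in summary else 0.0
```

Running the hard checks first keeps the judge call, which is slower and costs money, for outputs that already pass the objective bar.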
You cannot improve what you refuse to measure, and "looks good to me" is not a measurement.
There is a quieter regression this also catches. The model underneath you can change without warning when a provider updates it, and a prompt tuned for the old behavior can degrade overnight through no act of your own. Pin your model versions, and let the eval set tell you, before your customers do, whether an upgrade actually is one.
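One way to let the eval set decide whether an upgrade is one is to compare per-case scores between the pinned configuration and the candidate. The Python sketch below assumes each run yields a score per golden case. The model identifier, prompt version, and the 0.2 per-case drop threshold are placeholders, not real values.

```python
from statistics import mean
from typing import Dict, List, Tuple

# Hypothetical identifiers: pin an exact, dated version rather than a moving alias.
PINNED = {"model": "example-model-2025-01-15", "prompt_version": "ticket-summary-v7"}


def compare_runs(baseline: Dict[str, float], candidate: Dict[str, float],
                 max_case_drop: float = 0.2) -> Tuple[bool, List[str]]:
    """Approve an upgrade only if the mean does not fall and no case drops sharply."""
    regressed = sorted(
        case_id for case_id, old in baseline.items()
        if old - candidate.get(case_id, 0.0) > max_case_drop
    )
    candidate_mean = mean([candidate.get(case_id, 0.0) for case_id in baseline])
    mean_ok = candidate_mean >= mean(list(baseline.values()))
    return mean_ok and not regressed, regressed


# Made-up scores: the average improves, but one case collapses.
pinned_scores = {"t-001": 0.9, "t-002": 0.8, "t-003": 0.7}
upgrade_scores = {"t-001": 1.0, "t-002": 1.0, "t-003": 0.45}

approved, regressed_cases = compare_runs(pinned_scores, upgrade_scores)
```

Checking individual cases as well as the average matters here: in the example the mean goes up, yet `t-003` regresses enough to block the upgrade.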
Why this separates shipping AI from demoing it
Anyone can get a language feature working in a demo. Keeping it working, across model updates and prompt tweaks and the long tail of real inputs, is the part that decides whether the feature survives contact with customers. The teams that ship AI confidently are not braver. They have a gate that tells them the moment they broke something, before the something is a Monday full of complaints.
We built this harness for a team that was tuning prompts in production and finding out about regressions from users, and the cycle inverted. Changes that regressed quality got caught at the pull request, and the team started shipping prompt improvements weekly instead of fearing them.
Test your prompts like you test your code and AI stops being the part of the product you are afraid to touch. It becomes the part you can change every week, because you will know the instant you make it worse.
TensorLabs builds the evaluation and prompt-testing infrastructure behind that kind of ship-it-with-confidence AI.