Unit Tests for LLMs: Catching Model Drift Before Your Users Do

Written by Derek Fisher | Sat | Jun 13, 2026 | 1:42 PM Z

I've spent a good amount of time in software development and application security. In those roles, you lived and died by the testing around the feature you were building. Unit tests, integration tests, performance tests, and a half dozen other types of tests were used to suss out any regressions or deviations from the intended purpose of the application.

Likewise, in AppSec, running tools such as SAST, DAST, and SCA was, and still is, the most scalable way to test an application for security concerns. Being able to identify vulnerabilities in the code or the runtime environment goes a long way toward keeping critical vulnerabilities out of production. I'm willfully ignoring the current mindset that LLMs will replace these tools in the near future, since that remains to be seen, let alone operationalized.

The sole purpose of these test harnesses is to ensure that the application you are deploying is free (or as free as can be identified) from something that will bite you down the road—whether it's a regression or a vulnerability. But now that we're in the age of LLMs, is there an equivalent set of tests that can be added to look for drift in the model or the responses? Why, yes, there is!

Enter Promptfoo

I've been working on a little project that I'm hoping can help people trying to break into the cybersecurity field. It's called Clarus and it's a platform dedicated to helping people get into a role that aligns with their goals, skills, and knowledge. And yes, it's backed by an LLM that helps develop and validate the user journey to that goal. As development has progressed (it's still very much under construction), I've been looking for ways to catch drift in responses from the model as more features are added and prompts change on a regular basis. I had played around with Promptfoo early on but wanted to try to utilize it again with some more intention.

If you're not aware, Promptfoo is an open-source, developer-focused framework for testing and evaluating LLM applications, and the easiest way to describe it is "unit tests plus CI, for prompts and models." I've recently started to run it against Clarus, driving the whole thing from Claude Code inside VSCode. I wanted to write up how it works and, perhaps more importantly, how to read what it tells you.

In the dojo of prompts, every failure is a lesson

Promptfoo wants a few things from you, declared in a single YAML config. Once you've given it these, it runs each input against the provider, grades each response against your assertions, and produces a pass/fail report with reasoning.

Configuration – Items like evaluateOptions, defaultTest, and lifecycle hooks round out the testing fixture.

Providers – The system under test. This can be a raw model, but the important move is pointing it at something real. In my case, the provider is the live chat API, not a model in isolation. Example:

Prompts/inputs – The user messages you want to send. Example:

Tests and assertions – What a good response must, and must not, contain. Example:
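As a rough illustration of how a provider, a prompt, and a test fit together, here is a minimal sketch that builds a Promptfoo-style config as a Python dictionary and prints it as JSON. The same structure would be written as YAML in practice; whether your Promptfoo version also reads a JSON config file is worth verifying. The endpoint URL, the request body shape, the forbidden phrases, and the function name build_config are all hypothetical and are not taken from the Clarus project.

```python
import json


def build_config(chat_url: str) -> dict:
    """Return a Promptfoo-style config with one provider, one prompt, one test."""
    return {
        "description": "Chat API regression sketch",
        # Provider: the system under test, here an HTTP endpoint, not a raw model.
        "providers": [
            {
                "id": "https",
                "config": {
                    "url": chat_url,
                    "method": "POST",
                    "headers": {"Content-Type": "application/json"},
                    # Hypothetical request shape; a real API will differ.
                    "body": {"message": "{{prompt}}"},
                },
            }
        ],
        # Prompt: the user message, filled in from each test's vars.
        "prompts": ["{{question}}"],
        "tests": [
            {
                "description": "Explains the SOC analyst role",
                "vars": {"question": "What does a SOC analyst do?"},
                "assert": [
                    # Deterministic: the response must mention this word.
                    {"type": "icontains", "value": "monitor"},
                    # Negative deterministic: none of these (made-up) phrases may appear.
                    {"type": "not-icontains-any", "value": ["guaranteed", "100% certain"]},
                    # LLM-as-judge: graded against a rubric with a pass threshold.
                    {
                        "type": "llm-rubric",
                        "value": "Accurately describes SOC work without inventing processes.",
                        "threshold": 0.75,
                    },
                ],
            }
        ],
    }


if __name__ == "__main__":
    print(json.dumps(build_config("https://example.invalid/api/chat"), indent=2))
```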

The single test above shows the three assertion families: deterministic (icontains), negative deterministic (not-icontains-any), and LLM-as-judge (llm-rubric). You can mix as many of each as you want per test, and there are dozens to choose from in Promptfoo. One thing to keep in mind is that, in this setup, Promptfoo isn't grading the LLM in isolation; it's calling the live API. This means that you're testing the whole app: retrieval, prompt assembly, guardrails, the API layer, all of it. A small response transform converts the server's streaming (SSE) output into the final assistant message so it can be graded. So when a test fails, it failed against the thing your users actually hit, not a sanitized lab version of it.
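To make the response-transform idea concrete, here is a minimal sketch of the logic in Python. It assumes a hypothetical stream format in which each SSE event is a "data:" line carrying a JSON object with a "delta" text field, and the stream ends with "data: [DONE]". The real Clarus stream may look nothing like this, and in Promptfoo itself the transform would typically live in the provider config. The function name sse_to_message is made up for illustration.

```python
import json


def sse_to_message(raw: str) -> str:
    """Join the text deltas from an SSE stream into one assistant message."""
    parts = []
    for line in raw.splitlines():
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators, comments, and event-name lines
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":  # assumed end-of-stream sentinel
            break
        try:
            event = json.loads(payload)
        except json.JSONDecodeError:
            continue  # ignore keep-alives or malformed chunks
        if isinstance(event, dict):
            parts.append(str(event.get("delta", "")))  # assumed field name
    return "".join(parts)


if __name__ == "__main__":
    stream = 'data: {"delta": "Hello"}\n\ndata: {"delta": " world"}\n\ndata: [DONE]\n'
    print(sse_to_message(stream))
```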

I started out by building a 28-scenario regression suite, grouped into behavioral categories. The grouping isn't cosmetic; it's how you reason about coverage, the same way you'd reason about it for any other test plan. This testing strategy will evolve, and the goal is to build out more scenarios and categories as the platform grows and the model changes.

The assertion philosophy

Every scenario layers two kinds of checks, and the order matters.

The cheap, deterministic ones run first. These are the simple questions: does the response mention MITRE ATT&CK? Does it have a numbered list? They can be validated with a string match or a regex, and they're deterministic because the same response always gets the same verdict.

The expensive, judgment-based ones run when there's no other way to express what "good" means. Examples would be whether the response was "warm," or whether it asked follow-up questions.

The rule of thumb: use deterministic checks where you can, and LLM judges where you must (those LLM tokens add up!). The deterministic layer keeps your costs down and your failures interpretable. The judge layer covers the things a regex will never catch, like whether the assistant refused a bad request gracefully. For the subjective rubrics, I pinned the judge to temperature 0 for repeatability and set pass thresholds (0.75 is a reasonable starting point) so a "mostly fine" answer (maybe 0.70) doesn't quietly sail through.
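The layering can be sketched in a few lines of Python. This is an illustration of the idea, not a description of how Promptfoo schedules assertions internally: the cheap checks run first, and the judge (a stand-in callable here, where a real one would call an LLM) is consulted only if they all pass. The function names and the 0.75 default are assumptions drawn from the text above.

```python
import re
from typing import Callable, Dict


def has_numbered_list(text: str) -> bool:
    """Regex check: at least one line that starts like '1.' or '2)'."""
    return re.search(r"^\s*\d+[.)]\s+\S", text, flags=re.MULTILINE) is not None


def mentions(term: str) -> Callable[[str], bool]:
    """Build a case-insensitive substring check for one term."""
    return lambda text: term.lower() in text.lower()


def grade(
    response: str,
    checks: Dict[str, Callable[[str], bool]],
    judge: Callable[[str], float],
    threshold: float = 0.75,
) -> dict:
    """Run deterministic checks first; call the judge only if they all pass."""
    failed = [name for name, check in checks.items() if not check(response)]
    if failed:
        # Cheap failure: interpretable, and no judge tokens were spent.
        return {"passed": False, "failed_checks": failed, "judge_score": None}
    score = judge(response)
    return {"passed": score >= threshold, "failed_checks": [], "judge_score": score}


if __name__ == "__main__":
    checks = {"mitre": mentions("MITRE ATT&CK"), "numbered_list": has_numbered_list}
    answer = "1. Map detections to MITRE ATT&CK\n2. Review open alerts"
    # A "mostly fine" 0.70 from the judge does not clear the 0.75 bar.
    print(grade(answer, checks, judge=lambda _: 0.70))
```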

Calibrate locally before you automate anything

Running the suite locally on my beefy machine is usually quick (a couple of minutes). The local run exists so the rubrics can be calibrated before the suite is wired into a pipeline, where a bad rubric becomes everyone's problem. The flow is straightforward:

Provision an isolated test identity – A dedicated, least-privilege test user with a fresh auth token, so evals run with only the privileges of a normal end user.
Seed any required state – For example, a user profile the endpoint expects to exist.
Run the suite – npx promptfoo eval, optionally filtered to a subset of tests while you iterate.
Inspect results – npx promptfoo view opens a browser UI showing each input, the model's response, and the judge's reasoning for every pass and fail.
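For the "run the suite" step, a small wrapper can make the filtered and unfiltered runs repeatable. The sketch below only assembles the command line; the config and output file names are placeholders, and the -c, -o, and --filter-pattern options are used as I understand them from the Promptfoo CLI, so check them against your installed version before relying on this.

```python
import shlex
from typing import List, Optional


def build_eval_command(
    config_path: str,
    output_path: str,
    filter_pattern: Optional[str] = None,
) -> List[str]:
    """Assemble an `npx promptfoo eval` command as an argument list."""
    cmd = ["npx", "promptfoo", "eval", "-c", config_path, "-o", output_path]
    if filter_pattern:
        # Restrict the run to tests whose description matches the pattern.
        cmd += ["--filter-pattern", filter_pattern]
    return cmd


if __name__ == "__main__":
    # File names and the pattern are placeholders for illustration.
    print(shlex.join(build_eval_command("promptfooconfig.yaml", "results.json", "SOC")))
```

The returned list can be handed to subprocess.run as-is, which avoids shell-quoting problems when a filter pattern contains spaces or regex characters.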

That last step is critical. The judge's reasoning is the difference between "20% passed, this is broken" and "20% passed, and here's exactly why." The latter will often lead you to review the rubric for strictness.

Making it a merge gate

A test suite that only runs when someone remembers to run it is a suggestion, not a control. So we can pivot this same suite to become a GitHub Actions workflow, and the pattern generalizes well beyond my Clarus app:

Triggers: Manual dispatch today. The design supports PR triggers and a weekly canary, but both are currently disabled while the code is stabilized. Either can be re-enabled by adding the pull_request and schedule blocks back.
Steps: Check out, set up Node, provision a fresh token at run time, run the suite, upload results.json as a build artifact, and enforce a pass-rate threshold gate.
The quality gate: The job fails if the suite pass rate drops below a configured threshold (which is separate from the per-test score threshold). I started permissive at 0.60, with a plan to ratchet to 0.80 once the rubrics and code have stabilized and this becomes an actual merge gate.
Secrets and config: Credentials and the API base use OIDC federation.
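The pass-rate gate in the steps above can be sketched as a short script that runs after the eval step. It assumes the results file exposes counts under results.stats with successes, failures, and errors keys; that layout is my reading of Promptfoo's JSON output and should be confirmed against a real results.json. It also assumes errors count against the pass rate, which is a design choice rather than something the workflow above specifies.

```python
import json
import sys


def pass_rate(successes: int, failures: int, errors: int = 0) -> float:
    """Fraction of test cases that passed; errors count as non-passes."""
    total = successes + failures + errors
    if total == 0:
        raise ValueError("no test results to grade")
    return successes / total


def gate(stats: dict, threshold: float) -> bool:
    """True if the suite clears the configured pass-rate threshold."""
    rate = pass_rate(
        stats.get("successes", 0), stats.get("failures", 0), stats.get("errors", 0)
    )
    return rate >= threshold


def main(path: str = "results.json", threshold: float = 0.60) -> int:
    with open(path, encoding="utf-8") as fh:
        data = json.load(fh)
    stats = data["results"]["stats"]  # assumed layout of the output file
    ok = gate(stats, threshold)
    print(f"pass-rate gate at {threshold:.2f}: {'ok' if ok else 'FAILED'}")
    return 0 if ok else 1  # a non-zero exit code fails the CI job


if __name__ == "__main__":
    sys.exit(main())
```

Ratcheting from 0.60 to 0.80 then becomes a one-line change to the threshold argument instead of a workflow rewrite.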

The division of labor works well. Once fully configured in CI, PR-time evals will catch behavioral regressions before they merge. The weekly canary will catch drift that lands outside any PR, like a model version bump or a change in your retrieval data that quietly changes the behavior.

Reading the results without panicking

The first time I ran Promptfoo, it was deflating. Here's a representative early smoke run:

An 80% failure rate looks like a disaster, but it isn't. Errors and failures are completely different animals.

The smoke run had zero errors. That means the auth worked, the streaming transform worked, the state seeding worked, and the entire harness is sound. A pipeline with zero errors and failing assertions is fundamentally healthy. It means you're asking real questions and getting real, gradeable answers. A pipeline throwing errors, even at a 100% "pass" rate, is telling you something else.

This is easier to see with a real pair from my run than in the abstract. Take B3 ("What does a SOC analyst do?"), which passed at 0.96. It has four assertions: three cheap deterministic checks looking for the words "monitor" and "incident," and one LLM rubric grading whether the answer accurately describes SOC work without inventing processes. All four passed, because a genuine answer about a SOC analyst is going to say "monitor" and "incident" almost by definition. The deterministic checks and the judge agreed: good answer.

Now look at B2 ("What are the key roles in cybersecurity?"), which failed at 0.56. The rubric (the part actually judging answer quality) passed at a perfect 1.0. The judge confirmed the response listed five well-described roles in the right range, grounded in real job-market data. By any reasonable standard, it was a good answer. Where it fell down was the other assertion, a hardcoded keyword check that scanned for eight specific role terms (SOC, Pentest, GRC, Analyst, and so on) and required at least four. It matched exactly one: "incident." Because the two assertions are weighted equally, a 0.125 on the keyword check and a 1.0 on the rubric average out to 0.56, and the scenario goes red.
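The B2 arithmetic is easy to reproduce. The sketch below assumes the scenario score is a plain weighted average of its assertion scores, which matches the numbers reported above; the helper name weighted_score is made up for illustration.

```python
from typing import List, Tuple


def weighted_score(components: List[Tuple[float, float]]) -> float:
    """Weighted average of (score, weight) pairs."""
    total_weight = sum(weight for _, weight in components)
    if total_weight == 0:
        raise ValueError("at least one assertion needs a non-zero weight")
    return sum(score * weight for score, weight in components) / total_weight


# B2: the keyword check matched 1 of 8 role terms; the rubric scored 1.0.
keyword_score = 1 / 8  # 0.125
rubric_score = 1.0

# Equal weights, as in the run described above.
b2_score = weighted_score([(keyword_score, 1.0), (rubric_score, 1.0)])

if __name__ == "__main__":
    print(b2_score)  # 0.5625, which the report shows as 0.56
```

Seen this way, a perfect rubric score cannot rescue the scenario on its own: with equal weights, the keyword check would need to reach 0.5 for the average to hit 0.75.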

While this looks like a simple miscalibration of the assertion, there are a few things going on here. Clarus answered with the NICE Framework's seven work-role categories (e.g., Oversight & Governance, Design & Development, Protection & Defense). Those are real, but they're the top-level buckets, not the answer a career advisor would give. A genuinely useful response names the specific work roles underneath them. So the expected response should be something like: "The Systems Authorization (OG-WRL-013) work role, in the Oversight & Governance category, is in demand per CyberSeek." Instead, the model stayed one level too abstract to actually be helpful. That's a real grounding failure, and the rubric was too shallow to notice.

So, what to do with this test? To make B2 pass honestly on the next run, I don't touch the product at first. I would fix the test to require specific NICE work roles, then fix the product to deliver them. Only then does a green B2 actually mean what I want it to mean. A failure isn't always "bug or bad test"; sometimes it's both, and that calls for closer examination.
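One possible shape for the tightened B2 check is a deterministic test for specific work-role identifiers instead of generic keywords. The sketch below assumes a good answer cites NICE work-role IDs in the same form as the OG-WRL-013 example above (two capital letters, "WRL", three digits). A real test might match role names instead, and the function names here are hypothetical.

```python
import re
from typing import List

# Matches IDs shaped like "OG-WRL-013"; the pattern is inferred from that one example.
WORK_ROLE_ID = re.compile(r"\b[A-Z]{2}-WRL-\d{3}\b")


def cited_work_roles(text: str) -> List[str]:
    """Return the distinct work-role IDs mentioned in a response."""
    return sorted(set(WORK_ROLE_ID.findall(text)))


def names_specific_roles(text: str, minimum: int = 1) -> bool:
    """Deterministic check: does the answer go below the category level?"""
    return len(cited_work_roles(text)) >= minimum


if __name__ == "__main__":
    too_abstract = "Key areas include Oversight & Governance and Design & Development."
    specific = (
        "The Systems Authorization (OG-WRL-013) work role, "
        "in the Oversight & Governance category, is in demand."
    )
    print(names_specific_roles(too_abstract), names_specific_roles(specific))
```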

The path forward

None of this is exotic, and none of it should be foreign to anyone steeped in testing (for security or not). And that's the point. We are not inventing a new discipline for AI; we are applying the one we already have. Eval-driven development lets us grade behavior instead of byte-for-byte output, and lets us regression test an entire LLM product by pointing the harness at the live API.

As LLMs become further integrated into applications, the teams that ship the LLM-backed features safely won't be the ones with the cleverest prompts. They'll be the ones who treated those prompts like every other piece of production code they've shipped previously—versioned, tested, and gated. The model may be new, but the job isn't.

This article was published originally here.
