
From Prototype to Production: Change Control for AI Decisions

You update a prompt. You test it on five examples. Looks good. You ship.

Three days later, a support ticket: a field that used to contain "$4.2M in Q3 revenue" now says "strong revenue growth". Another ticket: a claim that listed a specific hire date is gone entirely. Your prompt change improved 95% of cases and silently broke 3%. The inputs that regressed weren't in your five test cases. They never are.

This is the gap between prototype and production. Prototypes need to work on a few examples. Production systems need to prove that nothing broke.

Change control for AI decisions

We've solved this problem before. Not for AI, but for code.

| Git | Verist |
| --- | ------ |
| Code change | Prompt or model update |
| Run tests | Recompute against stored inputs |
| PR diff | Decision diff |
| Merge | Approve and ship |

The difference: with code, the diff exists automatically. With AI, it doesn't exist unless you explicitly compute it. You wouldn't merge a PR without reviewing the diff. Why would you ship a prompt change without seeing what decisions it would alter?

Verist applies this workflow to AI decisions. Define a step, capture baselines, change your prompt, recompute, and get a diff showing exactly what changed before you deploy.

Try it: catch a prompt regression in 60 seconds

```bash
npm install verist @verist/llm zod openai
```

verist is the kernel (steps, replay, diff). @verist/llm adds LLM provider adapters.

Define a step

A step wraps an LLM call with typed input and output. defineExtractionStep handles the boilerplate of calling the model, parsing the response, and validating against a Zod schema.

```typescript
import { defineExtractionStep, createOpenAI } from "@verist/llm";
import { run, unwrap, recompute, formatDiff } from "verist";
import OpenAI from "openai";
import { z } from "zod";

const ClaimsSchema = z.object({
  claims: z.array(z.string()),
});

const extractClaims = defineExtractionStep({
  name: "extract-claims",
  input: z.object({ text: z.string() }),
  output: ClaimsSchema,
  request: (input) => ({
    model: "gpt-4o-mini",
    temperature: 0,
    messages: [
      {
        role: "system",
        content: `Extract specific, verifiable claims from the text.
Each claim must contain a concrete number, name, or date.
Return raw JSON only, no markdown: { "claims": ["claim1", ...] }`,
      },
      { role: "user", content: input.text },
    ],
    responseFormat: "json",
  }),
});
```

This is a self-contained definition. No global registry, no side effects at definition time. The step declares what goes in, what comes out, and how to call the model.

Run and capture a baseline

```typescript
const adapters = { llm: createOpenAI({ client: new OpenAI() }) };

const text = `Acme Corp reported $4.2M in Q3 revenue, up 18% year-over-year.
CEO Jane Park announced 3 new enterprise clients and plans to expand
the engineering team from 45 to 60 people by March 2025.`;

const baseline = unwrap(await run(extractClaims, { text }, { adapters }));
```

run() executes the step and captures the result as an artifact. unwrap() extracts the value from the Result type (Verist uses errors-as-values, not exceptions). The baseline now holds the output along with the artifacts needed to recompute later.

```text
Baseline: 4 claims
  - Acme Corp reported $4.2M in Q3 revenue
  - Revenue up 18% year-over-year
  - CEO Jane Park announced 3 new enterprise clients
  - Plans to expand engineering team from 45 to 60 by March 2025
```

Recompute with a new prompt

Now change the prompt. Maybe you want something more concise, so you switch to a summarization prompt:

```typescript
const vagueStep = defineExtractionStep({
  name: "extract-claims",
  input: z.object({ text: z.string() }),
  output: ClaimsSchema,
  request: (input) => ({
    model: "gpt-4o-mini",
    temperature: 0,
    messages: [
      {
        role: "system",
        content: `Summarize the key points from the text. Be concise.
Return raw JSON only, no markdown: { "claims": ["point1", ...] }`,
      },
      { role: "user", content: input.text },
    ],
    responseFormat: "json",
  }),
});

// Replay the new step logic against the baseline's input
const result = unwrap(await recompute(baseline, vagueStep, { adapters }));

console.log(formatDiff(result.outputDiff));
```

recompute() runs the new step definition against the same input from the baseline, without re-running your application code. It returns a diff comparing the old output to the new one.

```text
  claims[0]: "Acme Corp reported $4.2M in Q3 revenue"
          -> "Acme Corp had strong Q3 revenue growth"
- claims[3]: "Plans to expand engineering team from 45 to 60 by March 2025"
```

That's the regression. The vague prompt lost specificity in claims[0] and dropped claims[3] entirely. You caught it before shipping. No customer tickets. No silent regressions.
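For intuition about what a diff over an output like this computes, here is a positional array diff sketched in a few lines. This is illustration only; Verist's outputDiff is produced by the library and is richer than this:

```typescript
// Positional diff over two string arrays: for each index, record whether
// the entry changed, was removed, or was added. A sketch, not Verist's
// actual diff algorithm.
type ClaimDiff =
  | { kind: "changed"; index: number; before: string; after: string }
  | { kind: "removed"; index: number; before: string }
  | { kind: "added"; index: number; after: string };

function diffClaims(before: string[], after: string[]): ClaimDiff[] {
  const diffs: ClaimDiff[] = [];
  const len = Math.max(before.length, after.length);
  for (let i = 0; i < len; i++) {
    if (i >= after.length) {
      diffs.push({ kind: "removed", index: i, before: before[i] });
    } else if (i >= before.length) {
      diffs.push({ kind: "added", index: i, after: after[i] });
    } else if (before[i] !== after[i]) {
      diffs.push({ kind: "changed", index: i, before: before[i], after: after[i] });
    }
  }
  return diffs;
}

const diffs = diffClaims(
  ["Acme Corp reported $4.2M in Q3 revenue", "Revenue up 18% year-over-year"],
  ["Acme Corp had strong Q3 revenue growth", "Revenue up 18% year-over-year"],
);
// diffs[0] is the "changed" entry for index 0
```

The point is not the algorithm; it's that the diff exists at all, computed over stored inputs rather than eyeballed over five examples.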

Schema violations: catching what logs miss

Back to the opening scenario. You changed a prompt, tested on five examples, and shipped. Three percent of cases broke. But what kind of breakage?

Sometimes the model doesn't just change a value. It returns something structurally wrong: a string where you expected a number, a missing required field, an array that's suddenly empty. Logs would show the raw response. Verist validates the output against your Zod schema on every recompute and surfaces violations explicitly.

```typescript
const result = unwrap(await recompute(baseline, updatedStep, { adapters }));

if (result.schemaViolations.length > 0) {
  for (const v of result.schemaViolations) {
    console.log(`${v.path.join(".")}: ${v.kind} (${v.message})`);
  }
}
// claims.0: type (Expected string, received number)
```

Each violation has a path, a kind ("missing", "type", "refinement", or "other"), and a message. In CI, schema violations always fail the build, even with --no-fail-on-diff. Value changes are debatable. Structural breakage is not.
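To make the violation shape concrete, here is a hand-rolled check for the claims schema that emits the same path/kind/message structure. This is an illustration of the idea, not Verist's implementation; in practice the validation runs through your Zod schema:

```typescript
// Emits violations shaped like { path, kind, message } for a claims
// payload. Hand-rolled for illustration; real validation goes through Zod.
type Violation = {
  path: (string | number)[];
  kind: "missing" | "type" | "other";
  message: string;
};

function checkClaims(output: unknown): Violation[] {
  const violations: Violation[] = [];
  const obj = output as { claims?: unknown };
  if (obj.claims === undefined) {
    violations.push({ path: ["claims"], kind: "missing", message: "Required" });
  } else if (!Array.isArray(obj.claims)) {
    violations.push({ path: ["claims"], kind: "type", message: "Expected array" });
  } else {
    obj.claims.forEach((c, i) => {
      if (typeof c !== "string") {
        violations.push({
          path: ["claims", i],
          kind: "type",
          message: `Expected string, received ${typeof c}`,
        });
      }
    });
  }
  return violations;
}

const vs = checkClaims({ claims: ["ok", 42] });
// vs[0].path.join(".") === "claims.1"
```

A path plus a machine-readable kind is what lets CI treat structural breakage differently from value drift.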

Try it without API keys

You don't need OpenAI credentials to see the workflow in action. verist init scaffolds a deterministic step using regex extraction:

```bash
npx verist init
npx verist capture --step parse-contact --input "verist/inputs/*.json"
npx verist test --step parse-contact
```

This creates a step, captures baselines from sample inputs, and runs a regression test. No LLM calls, no API keys. Once you're ready to see LLM diffs, add your key and run the full example:

```bash
OPENAI_API_KEY=sk-... npx tsx examples/prompt-diff/quickstart.ts
```
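The scaffolded parse-contact step is deterministic by design: regex extraction gives the same output for the same input every time, which makes the capture/test loop easy to see. As a purely hypothetical sketch of what such an extractor's core logic could look like (the actual scaffold from verist init will differ):

```typescript
// Hypothetical deterministic extractor: pull an email and a phone number
// out of free text with regexes. Not the actual verist init scaffold.
function parseContact(text: string): { email: string | null; phone: string | null } {
  const email = text.match(/[\w.+-]+@[\w-]+\.[\w.]+/)?.[0] ?? null;
  const phone = text.match(/\+?\d[\d\s().-]{7,}\d/)?.[0] ?? null;
  return { email, phone };
}

const contact = parseContact("Reach Jane at jane@acme.com or +1 (555) 010-2030.");
// contact.email === "jane@acme.com"
```

Because there is no model in the loop, a diff here always means your step logic changed, which is exactly the property that makes it a good first regression test.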

Human corrections that survive

In production, humans correct AI mistakes. A reviewer fixes a misclassified claim. A support agent overrides an extracted value. These corrections are expensive to make and easy to lose. Recompute with a new prompt and most systems wipe them out.

Verist separates the two concerns with a three-layer state model:

```text
computed  +  overlay  =  effective
(AI)        (human)      (what the app sees)
```

  • Computed: AI-derived values, rewritten on every recompute
  • Overlay: Human corrections, never touched by automation
  • Effective: Shallow merge where overlay wins: { ...computed, ...overlay }

```typescript
import { effectiveState } from "@verist/storage";

// AI extracted: { amount: "$4.2M", currency: "USD" }
// Human corrected currency to EUR

const state = {
  computed: { amount: "$4.2M", currency: "USD" },
  overlay: { currency: "EUR" }, // human correction
};

const effective = effectiveState(state);
// -> { amount: "$4.2M", currency: "EUR" }
```

When you recompute with a new prompt, the computed layer updates. The overlay stays. If the new prompt extracts { amount: "$4.2M", currency: "GBP" }, the effective state is still { amount: "$4.2M", currency: "EUR" } because the human correction takes precedence.
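The semantics are small enough to reproduce in plain TypeScript, which is a useful way to convince yourself the overlay really survives. This is only a sketch of the merge behavior, with illustrative names; effectiveState in @verist/storage is the actual primitive:

```typescript
// Sketch of the three-layer semantics: a recompute rewrites only the
// computed layer, and the shallow merge always lets the overlay win.
type Layered<T> = { computed: T; overlay: Partial<T> };

const merge = <T extends object>(s: Layered<T>): T => ({
  ...s.computed,
  ...s.overlay,
});

type Extraction = { amount: string; currency: string };

let state: Layered<Extraction> = {
  computed: { amount: "$4.2M", currency: "USD" },
  overlay: { currency: "EUR" }, // human correction
};

// A recompute with a new prompt replaces the computed layer only.
state = { ...state, computed: { amount: "$4.2M", currency: "GBP" } };

const effective = merge(state);
// effective.currency is still "EUR": the human correction wins.
```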

This isn't a merge strategy you have to build. It's a primitive in the storage layer.

CI: block regressions before merge

Once you have baselines captured, add a regression gate to your CI pipeline:

```yaml
# .github/workflows/verist.yml
name: Verist regression check
on: [push, pull_request]

jobs:
  verist-test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: "22"
      - run: npm install
      - run: npx verist test --step extract-claims
```

verist test recomputes every baseline for the step and exits with code 1 if anything changed. Schema violations always fail. Value changes fail by default but can be relaxed with --no-fail-on-diff for steps where some drift is acceptable.

For PR comments with a summary table:

```yaml
- run: npx verist test --step extract-claims --format markdown > verist-report.md
  continue-on-error: true
- uses: marocchino/sticky-pull-request-comment@v2
  with:
    path: verist-report.md
```

The JSON format (--format json) gives you machine-readable output with counts for passed, changed, schemaViolations, and failed, so you can build custom thresholds or notifications.
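A custom gate over those counts might look like the following. The four count fields come from the paragraph above; everything else in the sketch (the function, the threshold parameter) is an assumption for illustration:

```typescript
// Custom CI gate over verist's JSON report counts. The Report fields
// mirror the documented counts; shouldFailBuild and maxChanged are
// illustrative, not part of the library.
type Report = {
  passed: number;
  changed: number;
  schemaViolations: number;
  failed: number;
};

function shouldFailBuild(report: Report, maxChanged = 0): boolean {
  // Structural breakage and hard failures always fail the build;
  // value drift fails only past an agreed budget.
  if (report.schemaViolations > 0 || report.failed > 0) return true;
  return report.changed > maxChanged;
}

const report: Report = { passed: 42, changed: 2, schemaViolations: 0, failed: 0 };
const fails = shouldFailBuild(report, 5); // false: within the drift budget
```

This mirrors the CLI's own policy: schema violations are non-negotiable, value changes are a threshold you choose.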

What this is not

  • Not an agent framework. No autonomous loops, no memory, no tool calling. Verist is the layer underneath that makes decisions reviewable.
  • Not observability. Logs tell you what happened. Verist tells you what would change before you ship.
  • Not a hosted platform. It's a library. Your code, your infrastructure, your database.
  • Not evals. Eval frameworks score outputs against a ground truth you label. Verist doesn't require ground truth. It diffs old output against new output so you can review the delta.

When this fits

  • Prompt iteration – You're tuning prompts and need to know what breaks across your full input set, not just a handful of cherry-picked examples.
  • Model upgrades – You're switching from GPT-4 to Claude or upgrading to a new version and want to quantify the impact before deploying.
  • Safe recompute – You need to reprocess historical data with new logic without overwriting the human corrections your team has made.
  • Decision audit – Regulated or high-stakes domains where you need to reproduce and explain any past AI decision.

If your AI outputs are consumed by downstream code and a silent change would cause a bug, a bad customer experience, or a compliance issue, you need change control.

If you've ever hesitated to change a prompt because you couldn't predict the blast radius, that's what Verist solves.

Try it

```bash
npm install verist @verist/cli zod
npx verist init
```

https://verist.dev

Released under the Apache 2.0 License.