# Verist: Next Steps (Impact Plan)

## Focus (from verist-ops problems)
- Tier 1 wedge: Structured Output Regression
- Tier 2 expansion: Safe Recompute (overrides preserved)
- Tier 3 strategic: Decision Audit + Decision Backtesting
## Shipped

- `verist init` scaffolds a deterministic step + sample inputs (no API keys needed).
- `verist capture --sample N --seed S` for deterministic sampling.
- `verist capture --meta key=value` persisted in baseline envelopes.
- `verist test --format json|markdown` with exit codes (0 = clean, 1 = diffs, 2 = infra).
- Anthropic adapter with normalized `llm-input`/`llm-output` artifacts.
- OpenAI adapter supports `baseURL` (Ollama, Azure, Fireworks, etc.).
- Cross-provider normalized artifacts hash identically for equivalent content.
- `examples/prompt-diff/quickstart.ts` – end-to-end LLM regression demo.
- README quickstart covers both zero-friction (regex) and LLM paths.
- CI integration guide with GitHub Actions examples.
- Observational schema validation in recompute; `RecomputeResult.status` classifies diffs.
- In-memory `RunStore` + overlay recompute example.
- `defineExtractionStep()` shorthand – eliminates schema duplication and manual `onArtifact`.
- `fail()` for structured step errors with `retryable` flag.
- `StepResult.artifacts` – automatic artifact collection without callbacks.
- `ctx.emitEvent()` – audit events without manual plumbing.
- Flattened `StepResult` – `result.value.output` instead of `result.value.output.delta`.
- 27 DX issues resolved from sandbox testing (see `../verist-sandbox/issues.md`).
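The Tier 1 loop these features add up to, capture a structured output as a baseline, re-run the step, diff field by field, can be sketched in a few lines. This is illustrative only, not Verist's actual API; the step functions and diff helper are invented for the example:

```typescript
// Illustrative sketch of the capture -> recompute -> diff loop.
// Not Verist's API: extractV1/extractV2/diffFields are invented here.

type Json = Record<string, unknown>;

// A "step" is any function from input text to structured output.
const extractV1 = (doc: string): Json => ({
  amount: Number((doc.match(/\$(\d+)/) ?? [])[1]),
  currency: "USD",
});

// The "changed" step: a refactor that silently drops a field.
const extractV2 = (doc: string): Json => ({
  amount: Number((doc.match(/\$(\d+)/) ?? [])[1]),
});

// Diff two structured outputs at top-level-field granularity.
function diffFields(baseline: Json, current: Json): string[] {
  const keys = new Set([...Object.keys(baseline), ...Object.keys(current)]);
  const diffs: string[] = [];
  for (const k of keys) {
    const a = JSON.stringify(baseline[k]);
    const b = JSON.stringify(current[k]);
    if (a !== b) diffs.push(`${k}: ${a ?? "<missing>"} -> ${b ?? "<missing>"}`);
  }
  return diffs;
}

const input = "Invoice total: $42";
const baseline = diffFields(extractV1(input), extractV2(input)); // capture vs. re-run
console.log(baseline); // the regression: `currency` disappeared
```

The point of the sketch is the shape of the check, not the mechanics: the baseline is data, the step is re-runnable, and the diff is per-field rather than a raw string compare.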
## Current State

The Tier 1 API is stable at v0.0.5. A user can go from `verist init` to first diff without API keys, and from `examples/prompt-diff` to a real LLM regression diff with one command. CI output formats are stable. The last 6 PRs were DX-driven refinements – the API surface feels settled.
The gap is no longer tooling – it's validation and distribution. No external team has used Verist in production. The thesis (structured output regression is an acute pain) is well-reasoned but unproven with paying customers.
Three open DX issues remain (adapter annotation for non-LLM steps, `diff()` discoverability, `createSnapshotFromResult` naming) – none are blockers for adoption.
## Top 5 Deliverables (Adoption-First)

### 1. README as Adoption Funnel (P0)
**Why first:** The README is the front door. A prospect who can't self-qualify in 60 seconds bounces. Right now it shows capabilities but doesn't help someone decide "is this for me?"
**Scope:**

- Add a "Good fit / Not a fit" checklist above the quickstart.
- Funnel to one adoption path: `init → capture → test` (the Tier 1 wedge).
- Lead with the problem ("You updated your extraction prompt. What broke?"), not the solution.
- Cut secondary content (Tier 2/3 features, architecture details) to linked pages.
- Ensure the quickstart terminal output is visible and compelling (the "aha" diff).
**Acceptance:**
- A new user can self-qualify before installing.
- The README tells one story with one call to action.
### 2. Polish `verist init` → First Diff (P0)

**Why second:** The zero-friction path IS the wedge. If `verist init` → `verist test` doesn't deliver a clear "aha" in under 60 seconds, the README promise falls flat.
**Scope:**

- Audit the `init` scaffolding end-to-end: install, init, capture baseline, break, diff.
- Ensure the generated step + inputs produce a meaningful, easy-to-read diff.
- The scaffolded project should run `verist test` out of the box with zero edits.
- Terminal output should be self-explanatory (no need to read docs to understand the diff).
- Consider: can `init` scaffold a `verist.config.ts` so `capture` and `test` work immediately?
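This plan doesn't pin down what `verist.config.ts` actually contains, so the following is a purely hypothetical sketch of what a scaffolded config might look like: every field name here (`baselineDir`, `steps`, `run`) is an assumption for illustration, not Verist's documented schema. The scaffolded step is deterministic so capture and test need no API keys:

```typescript
// Hypothetical verist.config.ts shape — all field names are assumptions
// for illustration, not Verist's documented schema.

interface VeristConfig {
  baselineDir: string; // where captured baseline envelopes would live
  steps: Array<{
    name: string;
    run: (input: string) => Record<string, unknown>;
  }>;
}

const config: VeristConfig = {
  baselineDir: ".verist/baselines",
  steps: [
    {
      name: "extract-invoice",
      // Deterministic regex step: capture/test work without API keys.
      run: (input) => ({ total: input.match(/\$(\d+)/)?.[1] ?? null }),
    },
  ],
};

export default config;
```

Whatever the real shape is, the design question in the bullet above is the same: the scaffold should emit enough config that `capture` and `test` run with zero edits.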
**Acceptance:**

- `npx verist init && verist capture && verist test` produces a clear regression diff.
- A first-time user understands what happened without reading docs.
### 3. Problem-Framing Content (P1)
**Why third:** Distribution is the bottleneck, not features. The right engineers need to encounter the problem framing before they encounter the tool.
**Scope:**
- Blog post / article: "You updated your extraction prompt. What broke?"
- Frame the problem (silent regressions in structured LLM output), not the tool.
- Include a concrete before/after: prompt change → field disappears → downstream breaks.
- End with the solution pattern (capture → recompute → diff) and link to Verist.
- Short demo GIF: capture baseline → tweak prompt → see diff in terminal.
**Acceptance:**
- One published piece that frames the problem clearly.
- Shareable on HN, AI engineering communities, Twitter/X.
### 4. Copyable CI Workflow Template (P1)

**Why fourth:** Bridges "I tried it locally" → "it's in my pipeline." CI integration is the stickiness mechanism – once diffs run on every PR, Verist becomes infrastructure.
**Scope:**

- Working `.github/workflows/verist.yml` in `examples/ci/`.
- Handles: checkout, install, run `verist test --format markdown`, post PR comment.
- Works with committed baselines (no capture step in CI – baselines are checked in).
- Document the two patterns: baselines-in-repo vs baselines-from-capture.
- Exit codes already work (0 = clean, 1 = diffs, 2 = infra) – the template should use them.
**Acceptance:**

- Copy-paste into any repo with `verist.config.ts` + committed baselines → works.
- PR comment shows markdown diff table on regression.
### 5. Safe Recompute End-to-End Example (P1)

**Why fifth:** This is the Tier 2 hook – the reason teams stay after adopting for regression testing. The `examples/overlay-recompute/` example exists but doesn't tell a compelling story yet.
**Scope:**
- Rework the overlay-recompute example into a clear narrative:
  - AI extracts a risk assessment from a document.
  - Human reviewer corrects one field (e.g., risk level: "medium" → "high").
  - Model upgrades. Recompute runs.
  - AI output changes, but the human correction is preserved in effective state.
- Show the three-layer state model visually in terminal output.
- Make it runnable without API keys (deterministic step, like the `init` scaffolding).
**Acceptance:**
- Running example that shows human corrections surviving a recompute.
- Clear before/after demonstrating the problem (corrections lost) and solution (preserved).
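The merge semantics the example needs to demonstrate can be sketched independently of Verist's API. This is a toy model, not the actual implementation: AI output is the base layer, human overrides sit on top, and effective state is the merge, so recomputing the base layer never clobbers an override:

```typescript
// Toy sketch of the three-layer state model — not Verist's
// implementation, just the merge semantics it needs to preserve.

type Assessment = Record<string, string>;

// Layer 1: AI output from the original model.
const aiV1: Assessment = { riskLevel: "medium", category: "credit" };

// Layer 2: a human reviewer corrects one field.
const overrides: Partial<Assessment> = { riskLevel: "high" };

// Layer 3: effective state = AI output with overrides applied on top.
const effective = (ai: Assessment): Assessment => ({ ...ai, ...overrides });

// Model upgrade: recompute replaces only the AI layer.
const aiV2: Assessment = { riskLevel: "low", category: "credit" };

console.log(effective(aiV1).riskLevel); // "high" — the correction wins
console.log(effective(aiV2).riskLevel); // still "high" after recompute
```

The before/after the acceptance criteria ask for falls out directly: without the overlay, the recompute would have silently reset `riskLevel` to the new AI value.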
## Priorities
| Priority | Deliverable | Impact |
|---|---|---|
| P0 | README as adoption funnel | Front door – self-qualification in 60 seconds |
| P0 | Polish init → first diff | Delivers the "aha" that the README promises |
| P1 | Problem-framing content | Gets the problem in front of the right people |
| P1 | Copyable CI workflow | Stickiness – diffs on every PR |
| P1 | Safe recompute example | Tier 2 hook – why teams stay |
## What Not to Build (Yet)
- Adapter step-level declaration – nice DX but affects few users (non-LLM adapters only)
- Domain primitives (claims, evidence, verdicts) – user space, not kernel
- Review queues – enterprise feature, not adoption driver
- Dashboard – premature before paying users
- Additional storage backends – Postgres is enough
- Backtesting windows – Tier 3, no demand yet (YAGNI)
- More LLM adapters – two providers cover the majority
- `createSnapshotFromResult` rename – less important now that `recompute(StepResult)` exists
## Immediate Next Steps

- Rewrite README with fit/no-fit checklist and single adoption funnel
- Audit and polish the `verist init` end-to-end flow
- Draft problem-framing blog post outline