Verist: Next Steps (Impact Plan)

Focus (from verist-ops problems)

  • Tier 1 wedge: Structured Output Regression
  • Tier 2 expansion: Safe Recompute (overrides preserved)
  • Tier 3 strategic: Decision Audit + Decision Backtesting

Shipped

  • verist init scaffolds a deterministic step + sample inputs (no API keys needed).
  • verist capture --sample N --seed S for deterministic sampling.
  • verist capture --meta key=value persisted in baseline envelopes.
  • verist test --format json|markdown with exit codes (0 = clean, 1 = diffs, 2 = infra).
  • Anthropic adapter with normalized llm-input / llm-output artifacts.
  • OpenAI adapter supports baseURL (Ollama, Azure, Fireworks, etc.).
  • Cross-provider normalized artifacts hash identically for equivalent content.
  • examples/prompt-diff/quickstart.ts – end-to-end LLM regression demo.
  • README quickstart covers both zero-friction (regex) and LLM paths.
  • CI integration guide with GitHub Actions examples.
  • Observational schema validation in recompute; RecomputeResult.status classifies diffs.
  • In-memory RunStore + overlay recompute example.
  • defineExtractionStep() shorthand – eliminates schema duplication and manual onArtifact.
  • fail() for structured step errors with retryable flag.
  • StepResult.artifacts – automatic artifact collection without callbacks.
  • ctx.emitEvent() – audit events without manual plumbing.
  • Flattened StepResult – result.value.output instead of result.value.output.delta.
  • 27 DX issues resolved from sandbox testing (see ../verist-sandbox/issues.md).
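Taken together, the shipped pieces form a capture → test loop that CI can gate on via the documented exit codes (0 = clean, 1 = diffs, 2 = infra). A minimal sketch of that gating, using a stand-in shell function in place of the real CLI so the snippet runs anywhere:

```shell
# Stand-in for the real `verist` binary so this sketch is self-contained;
# delete this function when running against an actual installation.
verist() { return 1; }   # simulate exit code 1: diffs found

verist test --format json
case $? in
  0) echo "baselines clean" ;;
  1) echo "regression diffs found" ;;
  2) echo "infra failure" ;;
esac
```

In a real pipeline the `case` would map to pass/fail/retry behavior rather than echoes.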

Current State

The Tier 1 API is stable at v0.0.5. A user can go from verist init to first diff without API keys, and from examples/prompt-diff to a real LLM regression diff with one command. CI output formats are stable. The last 6 PRs were DX-driven refinements – the API surface feels settled.

The gap is no longer tooling – it's validation and distribution. No external team has used Verist in production. The thesis (structured output regression is acute pain) is well-reasoned but unproven with paying customers.

Three open DX issues remain (adapter annotation for non-LLM steps, diff() discoverability, createSnapshotFromResult naming) – none are blockers for adoption.


Top 5 Deliverables (Adoption-First)

1. README as Adoption Funnel (P0)

Why first: The README is the front door. A prospect who can't self-qualify in 60 seconds bounces. Right now it shows capabilities but doesn't help someone decide "is this for me?"

Scope:

  • Add a "Good fit / Not a fit" checklist above the quickstart.
  • Funnel to one adoption path: init → capture → test (the Tier 1 wedge).
  • Lead with the problem ("You updated your extraction prompt. What broke?"), not the solution.
  • Cut secondary content (Tier 2/3 features, architecture details) to linked pages.
  • Ensure the quickstart terminal output is visible and compelling (the "aha" diff).

Acceptance:

  • A new user can self-qualify before installing.
  • The README tells one story with one call to action.

2. Polish verist init → First Diff (P0)

Why second: The zero-friction path IS the wedge. If verist init → verist test doesn't deliver a clear "aha" in under 60 seconds, the README promise falls flat.

Scope:

  • Audit the init scaffolding end-to-end: install, init, capture baseline, break, diff.
  • Ensure the generated step + inputs produce a meaningful, easy-to-read diff.
  • The scaffolded project should run verist test out of the box with zero edits.
  • Terminal output should be self-explanatory (no need to read docs to understand the diff).
  • Consider: can init scaffold a verist.config.ts so capture and test work immediately?
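On the last point, a scaffolded config could be as small as the following. Every field name here is a hypothetical placeholder for illustration, not the actual Verist config schema:

```typescript
// Hypothetical sketch of what `verist init` might scaffold as verist.config.ts.
// `steps` and `baselineDir` are invented names, not the real API surface.
export default {
  steps: ["./steps/extract.ts"],      // the deterministic demo step
  baselineDir: ".verist/baselines",   // where `verist capture` would write envelopes
};
```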

Acceptance:

  • npx verist init && verist capture && verist test produces a clear regression diff.
  • A first-time user understands what happened without reading docs.

3. Problem-Framing Content (P1)

Why third: Distribution is the bottleneck, not features. The right engineers need to encounter the problem framing before they encounter the tool.

Scope:

  • Blog post / article: "You updated your extraction prompt. What broke?"
  • Frame the problem (silent regressions in structured LLM output), not the tool.
  • Include a concrete before/after: prompt change → field disappears → downstream breaks.
  • End with the solution pattern (capture → recompute → diff) and link to Verist.
  • Short demo GIF: capture baseline → tweak prompt → see diff in terminal.

Acceptance:

  • One published piece that frames the problem clearly.
  • Shareable on HN, AI engineering communities, Twitter/X.

4. Copyable CI Workflow Template (P1)

Why fourth: Bridges "I tried it locally" → "it's in my pipeline." CI integration is the stickiness mechanism – once diffs run on every PR, Verist becomes infrastructure.

Scope:

  • Working .github/workflows/verist.yml in examples/ci/.
  • Handles: checkout, install, run verist test --format markdown, post PR comment.
  • Works with committed baselines (no capture step in CI – baselines are checked in).
  • Document the two patterns: baselines-in-repo vs baselines-from-capture.
  • Exit codes already work (0 = clean, 1 = diffs, 2 = infra) – template should use them.
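A hedged sketch of what the template could look like — the verist invocation matches the flags documented above, but the comment-posting step (here via the gh CLI) is one possible approach, not a fixed choice:

```yaml
# Sketch of examples/ci/verist.yml — adjust the node version, install command,
# and PR-comment mechanism to taste. Assumes baselines are committed to the repo.
name: verist
on: pull_request

jobs:
  regression:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
      - run: npm ci
      # Exit code 1 (diffs) fails the step; the redirect preserves the exit code.
      - run: npx verist test --format markdown > diff.md
      - if: failure()
        run: gh pr comment "$PR" --body-file diff.md
        env:
          GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
          PR: ${{ github.event.pull_request.number }}
```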

Acceptance:

  • Copy-paste into any repo with verist.config.ts + committed baselines → works.
  • PR comment shows markdown diff table on regression.

5. Safe Recompute End-to-End Example (P1)

Why fifth: This is the Tier 2 hook – the reason teams stay after adopting for regression testing. The examples/overlay-recompute/ directory exists but doesn't yet tell a compelling story.

Scope:

  • Rework the overlay-recompute example into a clear narrative:
    1. AI extracts a risk assessment from a document.
    2. Human reviewer corrects one field (e.g., risk level: "medium" → "high").
    3. Model upgrades. Recompute runs.
    4. AI output changes, but the human correction is preserved in effective state.
  • Show the three-layer state model visually in terminal output.
  • Make it runnable without API keys (deterministic step, like the init scaffolding).

Acceptance:

  • Running example that shows human corrections surviving a recompute.
  • Clear before/after demonstrating the problem (corrections lost) and solution (preserved).

Priorities

| Priority | Deliverable | Impact |
| --- | --- | --- |
| P0 | README as adoption funnel | Front door – self-qualification in 60 seconds |
| P0 | Polish init → first diff | Delivers the "aha" that the README promises |
| P1 | Problem-framing content | Gets the problem in front of the right people |
| P1 | Copyable CI workflow | Stickiness – diffs on every PR |
| P1 | Safe recompute example | Tier 2 hook – why teams stay |

What Not to Build (Yet)

  • Adapter step-level declaration – nice DX but affects few users (non-LLM adapters only)
  • Domain primitives (claims, evidence, verdicts) – user space, not kernel
  • Review queues – enterprise feature, not adoption driver
  • Dashboard – premature before paying users
  • Additional storage backends – Postgres is enough
  • Backtesting windows – Tier 3, no demand yet (YAGNI)
  • More LLM adapters – two providers cover the majority
  • createSnapshotFromResult rename – less important now that recompute(StepResult) exists

Immediate Next Steps

  1. Rewrite README with fit/no-fit checklist and single adoption funnel
  2. Audit and polish verist init end-to-end flow
  3. Draft problem-framing blog post outline

Released under the Apache 2.0 License.