# ADR-001: Deterministic State Machine Over Autonomous Agents

## Status

Accepted

## Context

* **Problem**: How should Verist orchestrate AI workflows?
* **Why now**: Foundational decision that shapes the entire API
* **Constraints**: Must support audit trails, reproducibility, regulatory compliance

## Decision

* **Chosen option**: Explicit state machine with pure step functions
* **Rationale**:
  * Steps are testable in isolation
  * State transitions are inspectable and replayable
  * No hidden control flow or emergent behavior

## Alternatives

* **Autonomous agents**: LLM decides next action. Rejected – non-deterministic, hard to audit, unpredictable costs
* **DAG orchestrator (Airflow-style)**: External orchestration service. Rejected – operational overhead, state lives outside domain database

## Consequences

* **Positive**: Full auditability, reproducible runs, simple debugging (read SQL + step code)
* **Negative**: More explicit wiring required, no "magic" agent loops
* **Follow-ups**: Define step interface (ADR-002), audit event schema (ADR-003)

---

---
url: /adr/002-commands.md
---
# ADR-002: Declarative Commands for Control Flow

## Status

Accepted

## Context

* **Problem**: Steps can express state changes (`output`) and audit records (`events`), but have no way to express "what should happen next"
* **Why now**: Users need branching, fan-out, and human-in-the-loop patterns without Verist becoming an orchestrator
* **Constraints**: Must stay declarative; execution remains external per ADR-001

## Decision

* **Chosen option**: Add optional `commands: Command[]` to `StepReturn`
* **Rationale**:
  * Commands are declarative data, not executed by core
  * Enables control flow while keeping orchestration external
  * Minimal addition to existing API

## Alternatives

* **Workflow-level router function**: Adds complexity, another place for routing logic
* **Implicit routing via step names**: Magic behavior, hard to audit
* **No routing support**: Pushes complexity to every user

## Consequences

* **Positive**: Steps can express branching, fan-out, human review without coupling to infrastructure
* **Negative**: External runners must interpret commands
* **Follow-ups**: Document common patterns; consider command validation against workflow steps

---

---
url: /adr/003-state-layers.md
---
# ADR-003: State Layers

## Status

Accepted

## Context

* **Problem**: AI-derived state must be recomputable without losing human corrections
* **Why now**: Users need to correct AI mistakes while preserving the ability to reprocess with new models
* **Constraints**: Recomputation must be safe and predictable; human decisions are authoritative

## Decision

* **Chosen option**: Three-layer state model with computed, overlay, and effective views
* **Rationale**:
  * Computed layer holds AI-derived values, safely rewritable on reprocessing
  * Overlay layer holds human corrections, never touched by recomputation
  * Effective layer is computed as `{ ...computed, ...overlay }` – overlay values take precedence
  * Computed and overlay are the only persisted sources of truth; effective is a derived view and MUST NOT be persisted
  * Overlay is authoritative per field and replaces computed values entirely (no implicit deep-merge)
  * Overlay represents human-authoritative decisions and MUST NOT be written by automated steps
* **Merge semantics clarification**:
  * Effective state uses shallow spread: `{ ...computed, ...overlay }`
  * If overlay contains an explicit key (even with value `undefined`), that key overrides computed
  * This means `overlay: { score: undefined }` will result in `effective.score === undefined`, not `computed.score`
  * Avoid storing `undefined` in overlay; use key deletion or tombstone values instead

## Alternatives

* **Single mutable state**: Human edits and AI outputs mixed. Rejected – recomputation would destroy corrections
* **Version branches**: Fork state on each edit. Rejected – complex merge logic, storage overhead
* **Immutable snapshots only**: No in-place corrections. Rejected – poor UX for human reviewers

## Consequences

* **Positive**: Safe recomputation; human decisions preserved; clear data lineage
* **Negative**: Increased write-path complexity; read-path requires composition of layers
* **Follow-ups**:
  * Define storage schema patterns
  * Invariant evaluation phase (computed vs effective) must be explicit
  * Overlay values MUST be validated against the current state schema; invalid entries are surfaced as conflicts or quarantined, never silently dropped
  * Any deep-merge semantics must be explicit and opt-in outside the kernel

## References

* SPEC-overview (State Management section)

---

---
url: /adr/004-replay-semantics.md
---
# ADR-004: Replay Semantics

## Status

Accepted

## Context

* **Problem**: Steps call LLMs and external services that produce non-deterministic outputs. We need to reproduce past executions exactly for debugging, compliance, and regression testing.
* **Why now**: Audit requirements demand proof that behavior can be reconstructed; testing needs deterministic fixtures.
* **Constraints**: Cannot require all content to be stored (compliance may forbid it); must support both exact replay and fresh recomputation.

## Decision

* **Chosen option**: Artifact-based replay with separate recompute path
* **Rationale**:
  * Artifacts capture non-deterministic inputs (LLM responses) by content hash
  * Replay uses stored artifacts to reproduce exact output
  * Recompute uses fresh calls and diffs against original
  * Hash-only mode enables compliance without storing sensitive content

## Alternatives

* **Event sourcing**: Replay by re-applying all events. Rejected – events are outputs not inputs; doesn't capture LLM responses
* **Mocked adapters**: Replace adapters with recorded responses. Rejected – tightly couples to adapter implementation; complex setup
* **Snapshot entire state**: Store full state at each step. Rejected – doesn't enable re-execution; storage overhead for unchanged fields

## Consequences

* **Positive**: Exact replay for debugging; fresh recompute reveals model drift; hash-only mode for compliance
* **Negative**: Step authors must ensure artifact capture at non-deterministic points; storage required for artifact content
* **Follow-ups**: Define adapter wrapping patterns for automatic capture; document compliance mode configuration

## References

* SPEC-replay
* SPEC-kernel-invariants (Invariant 6: Replay Is Exact)

---

---
url: /adr/005-package-stability.md
---
# ADR-005: Package Stability Tiers

## Status

Accepted

## Context

* **Problem**: Users need to know which APIs are stable for production use vs. evolving
* **Why now**: The library is approaching public release; stability expectations must be explicit
* **Constraints**: Two packages exist (`@verist/core`, `@verist/replay`) with different evolution rates

## Decision

* **Chosen option**: Define two stability tiers with explicit guarantees

### Tier 1: Kernel (`@verist/core`)

Stability promise:

* Breaking changes require major version bump
* API surface kept minimal – new exports are additions, not replacements
* Deprecated APIs supported for at least one major version

Core exports with stability guarantee:

* `defineStep`, `defineWorkflow`, `run`
* `Result` type and helpers (`ok`, `err`, `isOk`, `isErr`, `map`, `flatMap`)
* Command types and builders (`invoke`, `fanout`, `review`, `emit`, `suspend`)
* `AuditEvent`, `LLMTrace` schemas

### Tier 2: Capabilities (`@verist/replay`)

Stability promise:

* May evolve faster than kernel
* Breaking changes documented in changelog
* Semantic versioning applies, but expect more minor version bumps

Reason: Replay, diff, and recomputation patterns are newer and may need iteration as real-world usage reveals edge cases.

## Alternatives

* **Single stability tier**: Rejected – forces either too-slow core evolution or too-unstable guarantees
* **No explicit tiers**: Rejected – users cannot make informed dependency decisions

## Consequences

* **Positive**: Users can depend on kernel stability for compliance/audit use cases
* **Positive**: Replay package can iterate based on real-world feedback
* **Negative**: Must maintain discipline when adding to kernel
* **Follow-ups**: Consider `@verist/trust-kit` as Tier 2 when implemented

## References

* SPEC-kernel-invariants.md – defines what the kernel guarantees
* vision.md – "Trust Kit (optional layer)" suggests tiered stability

---

---
url: /adr/006-command-categories.md
---
# ADR-006: Command Categories

## Status

Accepted

## Context

* **Problem**: Commands have implicit semantic categories that affect how runners must handle them, but this is not documented
* **Why now**: Helper functions (`isControlCommand`, `isBlockingCommand`, `isSideEffectCommand`) were added in #9, formalizing the categories
* **Constraints**: Categories must align with existing command semantics and runner contracts

## Decision

* **Chosen option**: Document three command categories with distinct behaviors
* **Rationale**:
  * Makes implicit semantics explicit for implementers
  * Guides correct runner implementation
  * Helps users reason about command effects

## Categories

| Category | Commands        | Behavior                            | Helper                  |
| -------- | --------------- | ----------------------------------- | ----------------------- |
| Control  | invoke, fanout  | Dispatch execution to other steps   | `isControlCommand()`    |
| Barrier  | review, suspend | Block until external input resolves | `isBlockingCommand()`   |
| Effect   | emit            | Produce side effects, not replayed  | `isSideEffectCommand()` |

### Control Commands

Dispatch execution to other steps. The runner schedules the target step(s) for execution.

* **invoke**: Single step execution with given input
* **fanout**: Multiple parallel executions of the same step with different inputs

Control commands do not block the current step's completion.

### Barrier Commands

Block workflow execution until external input arrives. The runner must persist the suspension and resume when the barrier resolves.

* **review**: Human approval gate. Sibling commands are deferred until review resolves.
* **suspend**: Wait for external data/callback. Sibling commands are discarded – the resumed step emits new commands.

Both require the runner to persist state for later resumption.

### Effect Commands

Produce side effects to external systems. Not replayed during recomputation.

* **emit**: Publish domain event to external topic/queue

Effect commands are fire-and-forget from the step's perspective.

## Alternatives

* **No formal categories**: Leaves semantics implicit, harder for implementers
* **Per-command documentation only**: Misses the grouping insight
* **More granular categories**: Over-complicates for current command set

## Consequences

* **Positive**: Clear contract for runner implementations; easier reasoning about command effects
* **Negative**: None – documentation only
* **Follow-ups**: Consider category-specific validation in future runner implementations

---

---
url: /adr/007-unified-run-api.md
---
# ADR-007: Unified run() API

## Status

Accepted

## Context

* **Problem**: Two execution APIs (`run()` and `runStep()`) create confusion. Docs repeatedly explain "identical execution semantics" but different "identity discipline" – a code smell indicating unnecessary surface area.
* **Why now**: API simplification before broader adoption. Reducing cognitive load improves onboarding.
* **Constraints**: Must preserve production discipline (explicit identity for audit/replay) while enabling quick-start simplicity.

## Decision

* **Chosen option**: Single `run()` function with optional identity parameters
* **Rationale**:
  * One mental model instead of two
  * Progressive disclosure: start simple, add identity when needed
  * Matches "minimal surface area" design principle

### New API

```typescript
// Minimal (dev/testing)
const result = await run(step, input, { adapters });

// Production (explicit identity)
const result = await run(step, input, {
  adapters,
  workflowId: "verify-document",
  workflowVersion: "1.0.0",
  runId: crypto.randomUUID(),
});

// Result always includes actual identity used
result.value.workflowId; // step.name (defaulted) or explicit
result.value.workflowVersion; // "0.0.0" (defaulted) or explicit
result.value.runId; // generated or explicit
```

### Defaults

| Parameter         | Default               |
| ----------------- | --------------------- |
| `workflowId`      | `step.name`           |
| `workflowVersion` | `"0.0.0"`             |
| `runId`           | `crypto.randomUUID()` |

### Context Factory

`createContextFactory()` remains for advanced use cases (shared adapters, custom metadata). For simple cases, pass `adapters` directly.

### runStep for Advanced Use

`runStep()` remains available for advanced scenarios requiring custom context factories (e.g., `@verist/pipeline` sharing context across stages). It is not part of the primary onboarding path and should be considered internal/advanced API.

## Alternatives

* **Remove runStep entirely**: Rejected – needed by pipeline and custom orchestration.
* **Add `isDefaulted` flag to result**: Rejected – if you care about identity, pass it explicitly; if you didn't, you already know it was defaulted.

## Consequences

* **Positive**: Simpler API, easier onboarding, reduced docs
* **Negative**: Breaking change for `runStep()` users (migration: replace `runStep({...})` with `run(step, input, {...})`)
* **Follow-ups**: Update SPEC-overview, package READMEs

## References

* SPEC-overview
* packages/core/README.md

---

---
url: /adr/008-artifact-capture-hook.md
---
# ADR-008: Artifact Capture Hook in Core

## Status

Accepted

## Context

* **Problem**: Replay + diff is Verist's core differentiator, but enabling it requires manual adapter wrapping and separate `@verist/replay` integration. The friction to "first diff" is too high.
* **Why now**: Replay should be nearly zero-friction to match "fastest reason to try Verist today" positioning.
* **Constraints**: Kernel must stay minimal – replay semantics (snapshot structure, artifact precedence, hash-only mode) should not leak into core.

## Decision

* **Chosen option**: Add optional `onArtifact` callback to `run()` options
* **Rationale**:
  * Enables artifact capture without bloating kernel
  * Storage remains caller's responsibility (array, DB, S3)
  * `@verist/replay` stays separate but consumes artifacts cleanly

### API

```typescript
const artifacts: Artifact[] = [];

const result = await run(step, input, {
  adapters,
  onArtifact: (artifact) => artifacts.push(artifact),
});

// artifacts now contains step-output and any adapter-emitted artifacts
// Pass to @verist/replay for snapshot creation and recompute
```

### What Core Emits

Core automatically emits `step-output` artifact containing `{ output, events }` when `onArtifact` is provided. The artifact includes a content hash computed via Web Crypto API, adding minimal per-step overhead.

### What Adapters Emit

Adapters (e.g., `@verist/llm`) emit their own artifacts (`llm-input`, `llm-output`) via the same callback. The callback is passed through context:

```typescript
// In adapter
if (ctx.onArtifact) {
  ctx.onArtifact(captureArtifact("llm-input", request));
}
```

### What Stays in Replay Helpers

* `createSnapshot()`, `createSnapshotFromResult()`
* `recompute()`, `diff()`, `formatDiff()`
* Artifact kind precedence rules
* Hash-only mode handling

## Alternatives

* **Merge replay into core**: Rejected – drags replay policy (artifact precedence, snapshot schema) into kernel, violates minimal surface area.
* **Auto-snapshot on result**: Rejected – pressures core to know replay semantics; unclear storage responsibility.

## Consequences

* **Positive**: Fast path to "first diff", kernel stays minimal, clean extension point
* **Negative**: Adapters need to check for callback (minor)
* **Follow-ups**: Update @verist/llm to emit artifacts via callback, add `withReplay()` helper to @verist/replay

## References

* SPEC-replay
* packages/replay/README.md

---

---
url: /adr/009-pipeline-error-handling.md
---
# ADR-009: Pipeline Error Handling and Audit Trail

## Status

Accepted

## Context

* **Problem**: Pipeline `onError: "skip"` has an audit gap – skipped stages emit no events, violating "nothing important is lost" principle. The name "skip" also implies hiding rather than explicit continuation.
* **Why now**: Audit-first is a kernel invariant; pipelines should maintain the evidence trail.
* **Constraints**: Feature is genuinely useful for optional enrichment stages; removal would force boilerplate try/catch in every step.

## Decision

* **Chosen option**: Rename `"skip"` to `"continue"`, pipeline runner emits audit event on continuation
* **Rationale**:
  * `"continue"` better expresses intent (error acknowledged, proceeding)
  * Pipeline-owned audit event closes the evidence gap
  * Keeps feature without violating audit-first principle

### Renamed Option

```typescript
interface PipelineStageConfig {
  step: Step<any, any, any>;
  wire?: (prevOutput: unknown, pipelineInput: unknown) => unknown;
  onError?: "fail" | "continue"; // renamed from "skip"
}
```

### Pipeline-Owned Audit Event

When a stage fails and `onError: "continue"` is set, pipeline runner emits:

```typescript
{
  type: "pipeline.stage_error",  // namespaced to distinguish from step events
  payload: {
    stepName: string;
    code: string;
    message: string;
  }
}
```

This event is included in `PipelineResult.stages[n].events` for the continued stage, maintaining the audit trail. The `pipeline.` prefix distinguishes pipeline-owned events from step-emitted events.

### StageResult Changes

```typescript
interface StageResult {
  stepName: string;
  status: "completed" | "failed" | "continued" | "suspended"; // "skipped" → "continued"
  // ...
}
```

## Alternatives

* **Remove feature entirely**: Rejected – forces try/catch boilerplate into every "optional" step, scattering error policy.
* **Keep "skip" naming**: Rejected – sounds like hiding; "continue" is more honest.
* **Emit event from step**: Rejected – step didn't run to completion, can't emit events; pipeline must own this.

## Consequences

* **Positive**: Audit trail complete, naming clearer, feature preserved
* **Negative**: Breaking change for `onError: "skip"` users (migration: rename to `"continue"`)
* **Follow-ups**: Update SPEC-pipeline, packages/pipeline/README.md

## References

* SPEC-pipeline
* SPEC-kernel-invariants (Events Are Immutable)

---

---
url: /adr/010-execution-loop.md
---
# ADR-010: Execution Loop with Outbox Pattern

## Status

Accepted

## Context

* **Problem**: The kernel defines step execution semantics but provides no reference implementation for the execution loop (queue -> run -> persist -> enqueue next). SPEC-commands requires atomic persistence of commands with output+events, but this isn't implemented.
* **Why now**: Adoption stalls without a runnable end-to-end example. Users cannot validate the kernel works in production-like conditions.
* **Constraints**: Must work with at-least-once delivery queues. Must handle "commit succeeded but enqueue failed" and vice versa.

## Decision

### 1. Transactional Outbox for Command Dispatch

Persist commands in an outbox table atomically with state commit. A separate dispatcher drains the outbox to the queue.

**Rationale:** Prevents "commit succeeded but command lost," supports safe retries.

### 2. Deterministic Dedupe Keys

Each command gets a deterministic key: `hash(workflowId, runId, stepId, command)` (using stable serialization). The key is used as the queue job ID.

**Rationale:** Duplicate delivery dedupes cleanly; idempotency is provable.

**Queue requirement:** Completed jobs must be retained long enough to dedupe re-enqueues (e.g. BullMQ `removeOnComplete: { age: N }`).

### 3. Unified Block Model (Review + Suspend)

Represent blocking states in a single `verist_blocks` table with a `type` discriminator.

**Invariants:** One active block per run; one blocking command per step result. `resolveBlock()` is idempotent.

### 4. Command Status Lifecycle (Minimal)

Commands have status: `pending | deferred | leased | dispatched | rejected | failed`.

**Rationale:** Separates review gating (`deferred`), dispatch leasing, and terminal outcomes.

### 5. Runner Lives Outside Core

The execution loop (`executeStep` + `dispatchOutbox`) lives in `examples/`, not `@verist/core`.

### 6. Synthetic StepIds for Resume

Resume commands use `stepId = resume:<blockId>` to distinguish from normal invocations.

### 7. Lease-Based Dispatcher

Dispatcher uses `SELECT ... FOR UPDATE SKIP LOCKED` with lease fields. Expired leases can be reclaimed by any dispatcher.

## Alternatives

* **No outbox (direct enqueue after commit)**: Rejected. "Commit succeeded, enqueue failed" causes silent workflow stalls. Kernel idempotency doesn't help here.

* **Separate review/suspend tables**: Rejected. Creates parallel subsystems that drift. Unified model with type discriminator is simpler and more extensible.

* **Runner in `@verist/core`**: Rejected. Violates "orchestration is external" principle. Blurs kernel boundary.

* **Random job IDs**: Rejected. Breaks idempotency proof. Duplicate delivery would create duplicate jobs.

## Consequences

* **Positive**: End-to-end execution is demonstrably correct. Failure modes are explicit. Idempotency is provable.
* **Negative**: One more table (outbox). Dispatcher is a separate process/loop.
* **Follow-ups**: Implement `@verist/storage-pg` adapter, `@verist/queue` BullMQ adapter, canonical example.

## References

* SPEC-commands: Commands SHOULD be persisted atomically with output + events
* SPEC-suspend: Runner contract for blocking commands

---

---
url: /adr/011-package-consolidation.md
---
# ADR-011: Consolidate core + replay into single `verist` package

## Status

Accepted

## Context

* **Problem**: `@verist/core` and `@verist/replay` were separate packages, but replay depends entirely on core types and is always used together. Consumers had to install and import from two packages for basic workflows.
* **Why now**: Several packages (`@verist/batch`, `@verist/pipeline`, `@verist/queue`, `@verist/otel`, `@verist/artifacts`) were speculative and unused. Consolidating now reduces maintenance surface before the first stable release.
* **Constraints**: The merged package must export everything both packages previously exported so dependent packages (`@verist/llm`, `@verist/storage`, `@verist/storage-pg`, `@verist/cli`) compile with only import path changes.

## Decision

* **Chosen option**: Merge `@verist/core` and `@verist/replay` into a single `verist` package. Delete unused packages.
* **Rationale**:
  * Single import for the primary API (`import { defineStep, recompute, diff } from "verist"`)
  * Eliminates circular dependency risk between core and replay
  * Removes 5 speculative packages that added maintenance cost with no consumers

## Alternatives

* **Keep separate packages**: Higher install/import friction, split API surface for tightly coupled functionality.
* **Merge into `@verist/core`**: Keeps scoped name but loses the cleaner `verist` top-level import.

## Consequences

* **Positive**: Single dependency for most consumers; simpler workspace; fewer packages to version/publish.
* **Negative**: Larger single package (though tree-shaking mitigates bundle size).
* **Follow-ups**: Update sandbox project and docs to use `verist` imports. Re-introduce queue/pipeline packages only when there are real consumers.

## References

* ADR-005: Package stability tiers (superseded for deleted packages)

---

---
url: /adr/012-structured-step-errors.md
---
# ADR-012: Structured Step Errors via `fail()`

## Status

Accepted

## Context

* **Problem**: Steps can only succeed (return `StepReturn`) or throw. When `extract()` returns `err({ code: "rate_limit", retryable: true })`, the step must throw, and `runStep()` re-wraps the exception as `err({ code: "execution_failed" })`. The original error code and `retryable` flag are lost. Runners that need retry logic must parse error message strings.

* **Why now**: This directly violates kernel invariant #9 (Errors Are Values) at the step boundary – the one place where it matters most. Every LLM step in the sandbox (scenarios 02, 04) works around this by throwing, losing structured error metadata that runners need for retry and audit.

* **Constraints**: Must be additive (existing steps that throw must continue to work). Must not introduce nested `Result` types that confuse the API. Must preserve the existing `StepError` contract for callers of `runStep()` and `run()`.

## Decision

* **Chosen option**: Introduce a tagged `StepFailure` value via `fail()` helper. Steps can `return fail(...)` as an alternative to throwing.
* **Rationale**:
  * Consistent with "errors as values" – failures are returned, not thrown
  * Zero-cost discrimination via `_tag` field – `runStep()` checks one property
  * Additive – existing steps that throw continue to work unchanged

### API

```typescript
import { defineStep, fail } from "verist";
import { extract, type LLMContext } from "@verist/llm";

const step = defineStep({
  name: "extract-job",
  input: z.object({ text: z.string() }),
  output: schema,
  run: async (input, ctx: LLMContext) => {
    const result = await extract(ctx, request, schema);
    if (!result.ok) return fail(result.error);
    return { output: result.value.data };
  },
});

// Caller sees structured error
const result = await run(step, input, { adapters: { llm } });
if (!result.ok && result.error.retryable) {
  // retry with backoff
}
```

### Types (Summary)

* `fail()` returns a tagged `StepFailure` with `code`, `message`, optional `retryable`, optional `cause`.
* `runStep()` and `recompute()` detect `StepFailure` and normalize to `StepError` with required `retryable`.
* Kernel-owned `StepError.code` values are `input_validation`, `output_validation`, `execution_failed`. Any other string code is treated as domain-specific.

### Detection in `runStep()` and `recompute()`

Both call `step.run()` and treat a returned `StepFailure` as a structured error, normalizing `retryable` to `false` when omitted. Thrown exceptions are still wrapped as `execution_failed`. Throwing is reserved for programmer errors and invariant violations.

## Alternatives

* **Nested `Result` from step `run()`**: Step returns `Result<StepReturn, StepError>`, `runStep()` unwraps. Rejected – creates two layers of `Result` (step → `runStep()` → caller), confusing types. Detection requires checking `.ok` on the return value, which collides with any output schema that happens to have an `ok` field.

* **Force all steps to return `Result`**: Breaking change. Rejected – forces migration of all existing steps. Mixed styles (return vs throw) are inevitable in any ecosystem.

* **Error subclasses**: Steps throw `StepError extends Error` with typed fields. Rejected – still uses exceptions for expected failures, violating invariant #9. `instanceof` checks are fragile across package boundaries.

* **`errorCode` property on standard `Error`**: Steps throw `Error` with custom properties. Rejected – no type safety, properties are optional and unstructured, easy to forget.

## Consequences

* **Positive**: Structured error codes and `retryable` flag survive from adapter through step to runner. Runners can implement retry policies without string parsing. Consistent with existing `Result` patterns in storage and LLM layers.
* **Negative**: Two return paths from steps (return value vs throw). `_tag` is a convention, not enforced by TypeScript's type system at the return site (step could return a plain object with `_tag`). Mitigated: `fail()` is the only documented way to create `StepFailure`.
* **Follow-ups**: Update SPEC-steps, SPEC-overview API sketch, kernel-invariants #9. Update sandbox scenarios 02/04 to use `fail()` instead of throwing.

## References

* SPEC-steps
* SPEC-kernel-invariants (#9: Errors Are Values)
* plan.local.md §1
* sandbox/issues.md #1

---

---
url: /adr/000-template.md
---
# ADR-NNN: \[Short, Actionable Title]

## Status

Proposed | Accepted | Deprecated | Superseded

## Context

* **Problem**: What decision is needed?
* **Why now**: Trigger / constraint / change
* **Constraints**: Technical, organizational, legal (if any)

## Decision

* **Chosen option**: One sentence, declarative
* **Rationale**: Key reasons (max 3 bullets)

## Alternatives

* **Option A**: Why not
* **Option B**: Why not

## Consequences

* **Positive**: What improves
* **Negative**: Trade-offs / risks
* **Follow-ups**: Required actions or future ADRs

## References

* Docs / PRs / Issues

---

---
url: /guides/anti-patterns.md
---
# Anti-Patterns

These patterns break Verist's guarantees. Avoid them.

## Hidden state in steps

::: danger Don't

```ts
let cache = {}; // module-level state

const step = defineStep({
  run: async (input, ctx) => {
    if (cache[input.id]) return cache[input.id]; // [!code error]
    const result = await ctx.adapters.llm.run(input);
    cache[input.id] = result;
    return { output: result };
  },
});
```

:::

::: tip Do
Pass all required data through input or adapters. State lives in the database, not in memory.
:::

**Why it breaks:** Replay and recompute assume the step has no memory between runs. Hidden state makes outputs non-reproducible.

## Writing to the database inside steps

::: danger Don't

```ts
const step = defineStep({
  run: async (input, ctx) => {
    await db.insert({ id: input.id, status: "processed" }); // [!code error]
    return { output: { processed: true } };
  },
});
```

:::

::: tip Do
Return an output. Let your runner commit it atomically after the step completes.
:::

**Why it breaks:** If the step fails after the write, you have partial state. Replay becomes impossible.

## Retries inside steps

::: danger Don't

```ts
const step = defineStep({
  run: async (input, ctx) => {
    for (let i = 0; i < 3; i++) {
      // [!code error]
      try {
        return await riskyCall(input);
      } catch (e) {
        continue;
      }
    }
    throw new Error("Failed after retries");
  },
});
```

:::

::: tip Do
Let the runner handle retries. Steps should fail fast and return errors as values.
:::

**Why it breaks:** Internal retries hide failure modes. The audit trail shows success even though three attempts happened.

## Side effects in steps

::: danger Don't

```ts
const step = defineStep({
  run: async (input, ctx) => {
    await sendEmail(input.userId, "Your request was processed"); // [!code error]
    return { output: { notified: true } };
  },
});
```

:::

::: tip Do
Return an `emit` command. Let your runner send the email after the step commits.
:::

**Why it breaks:** If the step is replayed or recomputed, the email is sent again. Side effects should be explicit commands interpreted by your runner.

## Ignoring commands

::: danger Don't

```ts
const result = await run(step, input, ctx);
if (result.ok) {
  await store.commit(result.value.output);
  // commands silently dropped // [!code error]
}
```

:::

::: tip Do
Interpret every command. Use your queue for `invoke`, your review system for `review`, your event bus for `emit`.
:::

**Why it breaks:** Commands represent intended control flow. Dropping them means the workflow stalls silently.

## Mixing computed and overlay writes

::: danger Don't

```ts
const step = defineStep({
  run: async (input, ctx) => {
    await store.writeOverlay({ score: 0.95 }); // [!code error]
    return { output: { score: 0.85 } };
  },
});
```

:::

::: tip Do
Steps write to computed (via output). Only your review UI writes to overlay.
:::

**Why it breaks:** The overlay is for human overrides. If steps write to it, human decisions can be silently overwritten.

## Silent recompute

::: danger Don't

```ts
const result = await recompute(snapshot, step, newCtx);
if (result.ok) {
  await store.commit(result.value.output); // [!code error]
}
```

:::

::: tip Do
Show the diff to a reviewer. Only persist after explicit approval.
:::

**Why it breaks:** Recompute is for seeing what *would* change. Persisting without review defeats the purpose.

## Capturing too little

::: danger Don't

```ts
const result = await run(step, input, {
  adapters,
  // no onArtifact callback // [!code error]
});
```

:::

::: tip Do
Capture all non-deterministic inputs/outputs via `onArtifact`. Store them with your snapshot.
:::

**Why it breaks:** Replay only works if you have all the artifacts. Missing artifacts mean approximate replay at best.

## Summary

| Anti-pattern          | Consequence                 |
| --------------------- | --------------------------- |
| Hidden state          | Non-reproducible outputs    |
| DB writes in steps    | Partial state on failure    |
| Retries in steps      | Hidden failure modes        |
| Side effects in steps | Duplicate effects on replay |
| Ignoring commands     | Silent workflow stalls      |
| Steps writing overlay | Human overrides lost        |
| Silent recompute      | Unreviewed changes          |
| Missing artifacts     | Approximate replay          |

---

---
url: /guides/architecture.md
---
# Architecture Overview

This page shows how Verist fits into a production system. Use it to understand the full picture before diving into specific guides.

## System diagram

```text
┌────────────────────────────────────────────────────────────────┐
│                         Your System                            │
│                                                                │
│  ┌─────────┐     ┌────────────────────────────────────────┐    │
│  │  Queue  │────▶│              Runner                    │    │
│  └─────────┘     │                                        │    │
│       ▲          │  1. Load state from DB                 │    │
│       │          │  2. Call run(step, input, ctx)         │    │
│       │          │  3. Commit output + events             │    │
│       │          │  4. Enqueue commands                   │    │
│       │          │  5. Capture artifacts → snapshot       │    │
│       │          └────────────────────────────────────────┘    │
│       │                          │                             │
│       │                          ▼                             │
│       │          ┌────────────────────────────────────────┐    │
│       │          │            Database                    │    │
│       │          │                                        │    │
│       │          │  computed   overlay   events           │    │
│       │          │  ────────   ───────   ──────           │    │
│       │          │  Step output Human    Audit trail      │    │
│       │          │             overrides                  │    │
│       │          │                                        │    │
│       │          │  effective = { ...computed, ...overlay }    │
│       │          └────────────────────────────────────────┘    │
│       │                          │                             │
│       │                          ▼                             │
│       │          ┌────────────────────────────────────────┐    │
│       │          │         Snapshot Store                 │    │
│       │          │                                        │    │
│       │          │  Artifacts + metadata for replay       │    │
│       └──────────└────────────────────────────────────────┘    │
│                                                                │
│  ┌─────────────────────────────────────────────────────────┐   │
│  │                    Review UI                            │   │
│  │                                                         │   │
│  │  Show diff → Human approves/overrides → Write overlay   │   │
│  └─────────────────────────────────────────────────────────┘   │
└────────────────────────────────────────────────────────────────┘
```

## What Verist provides vs what you provide

| You provide           | Verist provides                                |
| --------------------- | ---------------------------------------------- |
| Queue and runner loop | `run()`, `defineStep`, `defineWorkflow`        |
| Database and storage  | State layer semantics (computed, overlay)      |
| Snapshot persistence  | `createSnapshotFromResult()`, artifact capture |
| Review UI             | Diff formatting, replay, recompute             |
| LLM adapters          | Context factory, artifact hooks                |

Verist is a kernel. You wire it into your infrastructure.

## Happy path: run a step

```text
Queue job arrives
       │
       ▼
Runner loads state from DB
       │
       ▼
run(step, input, ctx)
       │
       ▼
Step returns { output, events, commands }
       │
       ▼
Runner commits output + events to DB
       │
       ▼
Runner enqueues commands
       │
       ▼
Runner captures snapshot (if needed)
```

## Recompute path: model upgrade

```text
Load snapshot from store
       │
       ▼
recompute(snapshot, step, { adapters, validate: true })
       │
       ▼
Compare original vs new output
       │
       ▼
Return { status, outputDiff, commandsDiff, schemaViolations }
       │
       ▼
Review UI shows diff
       │
       ▼
Reviewer approves or overrides
       │
       ▼
If override: write to overlay
```

## Human override path

```text
Step produces computed = 0.42
       │
       ▼
Reviewer disagrees, sets overlay = 0.90
       │
       ▼
Later: model upgrade, recompute
       │
       ▼
New computed = 0.38
       │
       ▼
Effective = 0.90 (overlay wins)
```

Human decisions survive recomputation. The overlay is never overwritten by steps.

## Where each component lives

| Component         | Package                                  | Your responsibility                 |
| ----------------- | ---------------------------------------- | ----------------------------------- |
| Step definition   | `verist`                                 | Define step logic                   |
| Run execution     | `verist`                                 | Call `run()` in your runner         |
| Snapshot creation | `verist`                                 | Persist snapshots                   |
| Replay/recompute  | `verist`                                 | Load snapshots, run `recompute()`   |
| State storage     | `@verist/storage` + `@verist/storage-pg` | Storage contract + Postgres adapter |
| Queue             | Your choice                              | Job dispatch and retry              |
| Review UI         | Your choice                              | Display diffs, collect overrides    |

## Key guarantees

| Guarantee             | How it works                                          |
| --------------------- | ----------------------------------------------------- |
| Determinism           | Given recorded artifacts, step output is reproducible |
| Audit trail           | Every step emits structured events                    |
| Human authority       | Overlay always wins over computed                     |
| Explicit control flow | Commands are data, not implicit execution             |
| Safe retries          | Steps are idempotent by design                        |

## When to read what

| Goal                        | Read                                   |
| --------------------------- | -------------------------------------- |
| Understand the mental model | [Concepts](../concepts/overview)       |
| Get started quickly         | [First Step](./first-step)             |
| Learn replay and diff       | [Replay and Diff](./replay-and-diff)   |
| Build a runner              | [Reference Runner](./reference-runner) |
| Add human overrides         | [Human Overrides](./overrides)         |
| Store state                 | [Storage and State](./storage)         |

## Deep dive

For contributors and advanced users, the following resources cover kernel internals and design rationale.

**Specifications** – formal contracts for each kernel subsystem:

* [Overview](../specs/overview) – core concepts, invariants, API surface
* [Steps](../specs/steps) – step execution semantics
* [Commands](../specs/commands) – command type system and semantics
* [Replay](../specs/replay) – replay and recompute contracts
* [Suspend](../specs/suspend) – suspend/resume protocol
* [Kernel Invariants](../specs/kernel-invariants) – non-negotiable guarantees

**Architecture Decision Records (ADRs)** – why things are the way they are:

* [ADR-001 Determinism](../adr/001-determinism)
* [ADR-002 Commands](../adr/002-commands)
* [ADR-003 State Layers](../adr/003-state-layers)
* [ADR-004 Replay Semantics](../adr/004-replay-semantics)
* [ADR-005 Package Stability](../adr/005-package-stability)
* [ADR-006 Command Categories](../adr/006-command-categories)
* [ADR-007 Unified Run API](../adr/007-unified-run-api)
* [ADR-008 Artifact Capture](../adr/008-artifact-capture-hook)
* [ADR-009 Pipeline Errors](../adr/009-pipeline-error-handling)
* [ADR-010 Execution Loop](../adr/010-execution-loop)
* [ADR-011 Package Consolidation](../adr/011-package-consolidation)
* [ADR-012 Structured Step Errors](../adr/012-structured-step-errors)

---

---
url: /guides/batch.md
---
# Batch Runs

Batch is for running the same step across a large list of inputs with per-item outcomes.

Use cases: backfills, model upgrades, reprocessing a dataset.

## What batch gives you

| Feature            | Description                            |
| ------------------ | -------------------------------------- |
| Per-item results   | Success and failure tracked separately |
| Explicit diffs     | Each item gets its own diff            |
| No hidden failures | "Some failed" is never hidden          |

## The flow

1. Run step for each item
2. Persist each output + events
3. Collect per-item results
4. Continue even if some items fail

## Why not just a for-loop?

You can. But most teams eventually need:

* Concurrency control
* Per-item audit events
* Explicit failure accounting
* Deterministic replay later

That's what the batch helpers are for.

## Example

```ts
const inputs = [{ id: "a" }, { id: "b" }, { id: "c" }];

for (const input of inputs) {
  const result = await run(step, input, ctxFor(input));
  if (result.ok) {
    await store.commit(...);
  } else {
    await store.recordFailure(...);
  }
}
```

For built-in concurrency and reporting, wrap this in a helper that manages parallelism and per-item accounting.

## Design tips

* Keep batch items **independent**
* Avoid cross-item shared state
* Store a **batchId** so you can trace the run later
* Use **idempotent inputs** so retries are safe

---

---
url: /guides/ci-integration.md
---
# CI Integration

Verist CLI supports machine-readable output for CI pipelines. Use `--format json` for scripting or `--format markdown` for PR comments.

## Output formats

```bash
verist test --step extract-claims --format json      # structured JSON
verist test --step extract-claims --format markdown   # summary table
verist test --step extract-claims --format text       # human-readable (default)
```

### JSON schema

```json
{
  "version": 1,
  "step": "extract-claims",
  "status": "pass",
  "summary": "12 baseline(s), 11 clean, 1 changed",
  "counts": {
    "total": 12,
    "passed": 11,
    "changed": 1,
    "schemaViolations": 0,
    "failed": 0,
    "commandsChanged": 0,
    "diffUnavailable": 0
  },
  "baselines": [
    {
      "filename": "invoice-a1b2c3d4.json",
      "status": "value_changed",
      "comparable": true,
      "schemaViolations": [],
      "outputDiff": { "equal": false, "entries": [...] },
      "commandsDiff": null
    }
  ]
}
```

### JSON `status` field

| Value     | Meaning                                                         |
| --------- | --------------------------------------------------------------- |
| `"pass"`  | All baselines clean                                             |
| `"fail"`  | Regressions detected (value changes, schema violations)         |
| `"error"` | Infrastructure failure (corrupted baselines, execution crashes) |

## Exit codes

| Code | Meaning                                                   |
| ---- | --------------------------------------------------------- |
| 0    | Clean — no regressions                                    |
| 1    | Regressions detected                                      |
| 2    | Infrastructure failure (config error, corrupted baseline) |

## GitHub Actions

### Basic: fail on regressions

```yaml
name: Verist regression check
on: [push, pull_request]

jobs:
  verist-test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: oven-sh/setup-bun@v2
      - run: bun install
      - run: bunx verist test --step extract-claims
```

### PR comment with markdown summary

```yaml
name: Verist diff report
on: pull_request

jobs:
  verist-diff:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: oven-sh/setup-bun@v2
      - run: bun install

      - name: Run verist test
        continue-on-error: true
        run: bunx verist test --step extract-claims --format markdown > verist-report.md

      - name: Comment on PR
        uses: marocchino/sticky-pull-request-comment@v2
        with:
          path: verist-report.md
```

### JSON output for custom logic

```yaml
- name: Run verist test (JSON)
  continue-on-error: true
  run: bunx verist test --step extract-claims --format json > verist-result.json

- name: Parse results
  id: verist
  run: |
    status=$(jq -r '.status' verist-result.json)
    changed=$(jq '.counts.changed' verist-result.json)
    echo "status=$status" >> "$GITHUB_OUTPUT"
    echo "changed=$changed" >> "$GITHUB_OUTPUT"

- name: Check results
  if: steps.verist.outputs.status == 'fail'
  run: |
    echo "Verist detected $ changed baselines"
    exit 1
```

## Sampling for large baseline sets

For steps with many inputs, use `--sample` during capture for faster CI runs:

```bash
verist capture --step extract-claims --input "inputs/*.json" --sample 50 --seed 42
verist test --step extract-claims
```

The `--seed` option ensures deterministic selection across runs. Omit `--seed` for a default seed of 0.

## Filtering baselines

### By label

Tag baselines with `--label` during capture and filter in test or diff:

```bash
verist capture --step extract-claims --input "inputs/*.json" --label "v2-prompt"
verist test --step extract-claims --label "v2-prompt"
verist diff --step extract-claims --label "v2-prompt"
```

### By metadata

Tag baselines with `--meta` during capture and filter during test:

```bash
verist capture --step extract-claims --input "inputs/*.json" --meta model=gpt-4o
verist test --step extract-claims --meta model=gpt-4o --format json
```

---

---
url: /concepts/overview.md
---
# Core Concepts

This page explains the mental model to carry into the rest of the docs.

::: tip
Prefer learning by code? Start with [Your First Step](../guides/first-step) and come back later.
:::

## Verist in one sentence

Verist is a deterministic workflow kernel for AI systems: replay, recompute, and diffs for decisions.

## The big idea

Most AI workflows are hard to trust because you can't reproduce decisions or understand what changed after a model upgrade. Verist fixes that by being strict about:

| Discipline                 | Why it matters                                   |
| -------------------------- | ------------------------------------------------ |
| **Explicit inputs**        | No hidden dependencies                           |
| **Explicit artifacts**     | Store what the model saw and returned            |
| **Explicit state changes** | Steps return outputs, nothing else mutates state |

That discipline makes replay and diff possible.

## Core objects

### Step

A deterministic function that takes input + context and returns:

| Field        | Description                        |
| ------------ | ---------------------------------- |
| **output**   | Partial state update               |
| **events**   | Audit records                      |
| **commands** | What should happen next (optional) |

A step is the only place where "work" happens.

**What a step is NOT:** task runner, scheduler, database writer, internal retry mechanism.

### Workflow

A named set of steps plus a version. It doesn't run on its own – your runner does.

**What a workflow is NOT:** orchestrator, self-executing runtime, required for single-step usage.

### State

State lives in your database. Not in memory. Not in a hidden framework cache.

| Layer         | Source                        |
| ------------- | ----------------------------- |
| **computed**  | Derived from step outputs     |
| **overlay**   | Human overrides               |
| **effective** | `{ ...computed, ...overlay }` |

Overlay always wins so human decisions survive recomputation.

### Artifacts

Captured inputs/outputs from non-deterministic calls (LLMs, external APIs, file reads). Stored and hashed.

Artifacts make replay and recompute possible. They are not logs – they are required for exact replay.

### Replay vs recompute

|         | Replay           | Recompute            |
| ------- | ---------------- | -------------------- |
| Uses    | Stored artifacts | Fresh adapters       |
| Answers | "What happened?" | "What would change?" |

## Common mistakes

* Confusing replay with recompute
* Expecting workflows to "run themselves"
* Treating artifacts as logs instead of inputs/outputs

## Why the kernel is strict

Verist refuses to be "helpful" in ways that hide state or control flow. That's the only way to make AI decisions reviewable later.

If you want speed and autonomy, an agent framework might fit better. If you want correctness under scrutiny, Verist is built for you.

---

---
url: /guides/errors.md
---
# Error Handling

Verist treats errors as values. A step returns `Result<T, StepError>` and **never throws** for expected failures.

## Error types

| Code                | Cause                        |
| ------------------- | ---------------------------- |
| `input_validation`  | Input does not match schema  |
| `execution_failed`  | Adapter or step logic threw  |
| `output_validation` | Output does not match schema |

All three return a `Result` with a `StepError` value.

## Basic handling

```ts
const result = await run(step, input, ctx);

if (!result.ok) {
  switch (result.error.code) {
    case "input_validation":
    case "output_validation":
      // bug in caller or step; fix and retry safely
      break;
    case "execution_failed":
      // external dependency or transient failure
      break;
  }
}
```

## Response patterns

| Error type             | Response                                               |
| ---------------------- | ------------------------------------------------------ |
| Validation errors      | Treat as bugs or incompatible inputs; fix and rerun    |
| Execution errors       | Classify as transient vs permanent; retry only if safe |
| Partial batch failures | Keep per-item `Result` and re-run only failures        |

## Suspend vs error

If the run needs **human input** or **external confirmation**, prefer a `suspend` command instead of throwing.

* Errors are for failures
* Suspends are for waiting

## Errors in recompute

Recompute can surface errors that never happened in the original run:

* A new adapter behaves differently
* A prompt changed in a way that broke assumptions
* A dependency became unavailable

Treat error diffs as high-priority regressions in review.

::: info
Output validation in `recompute()` is **observational** – it populates `schemaViolations` instead of returning an error. This lets you see schema issues alongside value diffs rather than short-circuiting. Input validation remains strict (`err(input_validation)`).
:::

## Audit

* Persist `StepError` details alongside the run
* Record retry attempts as audit events
* Keep error metadata stable so diffs are reviewable

---

---
url: /faq.md
---
# FAQ

## Is Verist a framework or a library?

A library. It doesn't run your system. It gives you strict primitives (steps, artifacts, replay) that you wire into your own runner.

## Can I use Verist with an agent framework?

Yes. Verist sits underneath agent frameworks to provide replay and diffs – the trust layer.

## Why explicit state?

Replay only works when you can reconstruct exact inputs and state for a run. Hidden state makes decisions impossible to reproduce.

## Do I need a database?

For production, yes. Verist assumes your database is the source of truth. For dev and tests, use `createMemoryStore()` from `@verist/storage` — no database required.

## Where do I start?

[Your First Step](./guides/first-step) – a single file that gives you replay + diff without a full workflow. For a learning path, see [Mental Map](./guides/mental-map).

## Replay vs recompute?

|         | Replay           | Recompute              |
| ------- | ---------------- | ---------------------- |
| Uses    | Stored artifacts | Fresh adapters         |
| Output  | Byte-identical   | May differ             |
| Purpose | Audit, reproduce | Test changes, see diff |

### When to use which

| Scenario                  | Use               |
| ------------------------- | ----------------- |
| Incident investigation    | Replay            |
| Bug reproduction          | Replay            |
| Model upgrade             | Recompute         |
| Prompt change             | Recompute         |
| What-if analysis          | Recompute         |
| Backfill on historic data | Batch + recompute |

## Will recompute overwrite human changes?

No. Human overrides live in the overlay state and always take precedence.

## Where do artifacts live?

You decide. Most teams store artifacts in a blob store with references in the database.

## Can I use Verist without LLMs?

Yes. Verist works for any deterministic step. LLMs are the most common use case.

---

---
url: /getting-started.md
---
# Getting Started

::: tip Zero-config quickstart
Run `verist init` to scaffold a working project with no API keys needed.
:::

This guide covers both CLI and programmatic usage. For how the pieces fit together in production, see [Architecture Overview](guides/architecture).

## Install

::: code-group

```bash [bun]
bun add verist @verist/cli zod
```

```bash [npm]
npm install verist @verist/cli zod
```

:::

## CLI Quickstart

The fastest way to see Verist in action — no API keys required:

```bash
npx verist init
npx verist capture --step parse-contact --input "verist/inputs/*.json"
npx verist test --step parse-contact
```

This scaffolds a `parse-contact` step that extracts name, email, and phone via regex. Edit the step logic, re-run `verist test`, and see the diff.

## Programmatic API

### Core concepts

Each step returns:

* **output** – partial state update
* **events** – audit records (append-only)
* **commands** – declarative next steps (optional)

```text
(input, context) → { output, events?, commands? }
```

### Define and run a step

```ts
import { z } from "zod";
import { defineStep, run } from "verist";

const verifyDocument = defineStep({
  name: "verify-document",
  input: z.object({ docId: z.string(), text: z.string() }),
  output: z.object({
    verdict: z.enum(["accept", "reject"]),
    confidence: z.number(),
  }),
  run: async (input, ctx) => {
    const verdict = await ctx.adapters.llm.verify(input.text);
    return {
      output: { verdict, confidence: 0.84 },
      events: [{ type: "document_verified", payload: { docId: input.docId } }],
    };
  },
});

const result = await run(
  verifyDocument,
  { docId: "doc-1", text: "Invoice #1042 for ACME Corp." },
  {
    adapters: {
      llm: {
        verify: async (text) => (text.includes("fraud") ? "reject" : "accept"),
      },
    },
  },
);

if (result.ok) {
  console.log(result.value.output);
  // { verdict: "accept", confidence: 0.84 }
}
```

## Add explicit identity

For stable IDs and version tracking, pass explicit identity to `run()`:

```ts
import { defineWorkflow, run } from "verist";

const workflow = defineWorkflow({
  name: "verify-document",
  version: "1.0.0",
  steps: { verifyDocument },
});

const result = await run(
  workflow.getStep("verifyDocument"),
  { docId: "doc-1", text: "..." },
  {
    adapters,
    workflowId: workflow.name,
    workflowVersion: workflow.version,
    runId: "run-1",
  },
);
```

## LLM Providers

Verist supports OpenAI and Anthropic via `@verist/llm`. Any OpenAI-compatible API (Ollama, Azure OpenAI, etc.) works via the `baseURL` option.

```ts
import OpenAI from "openai";
import Anthropic from "@anthropic-ai/sdk";
import { createOpenAI, createAnthropic } from "@verist/llm";

// OpenAI (or any compatible API via baseURL)
const openai = createOpenAI({
  client: new OpenAI({ apiKey: process.env.OPENAI_API_KEY }),
});

// Anthropic
const anthropic = createAnthropic({
  client: new Anthropic({ apiKey: process.env.ANTHROPIC_API_KEY }),
});

// Ollama, Azure, etc.
const ollama = createOpenAI({
  client: new OpenAI({
    baseURL: "http://localhost:11434/v1",
    apiKey: "ollama",
  }),
});
```

## Runner wiring

Verist does not ship an orchestrator. A minimal runner typically does:

1. Load state
2. Execute `run()` with explicit identity
3. Commit output + events atomically
4. Interpret commands (enqueue, fan-out, review, emit)
5. Capture artifacts if you need replay/recompute

See [Reference Runner](./guides/reference-runner) for a concrete loop.

## Kernel guarantees

Verist guarantees:

* Steps are deterministic given input + adapters
* State lives in your database
* Commands are data (never executed by the kernel)
* Overlay wins over computed for human overrides
* Errors are values (`Result`), not exceptions

## Commands

Commands are plain objects. Use helpers for common patterns:

```ts
import { invoke, fanout, review, emit } from "verist";

return {
  output: { score, verdict },
  events,
  commands: [
    invoke("verify", { id }), // call another step
    fanout("score", inputs), // parallel execution
    review("low confidence", data), // human review
    emit("doc.verified", { id }), // external event
  ],
};
```

## Decision checklist

| If you...               | Then skip...     |
| ----------------------- | ---------------- |
| Don't store state yet   | Storage adapters |
| Don't branch or fan out | Commands         |
| Don't pause runs        | Suspend/resume   |
| Don't need stable IDs   | Workflows        |

---

---
url: /glossary.md
---
# Glossary

| Term                 | Definition                                                                                                 |
| -------------------- | ---------------------------------------------------------------------------------------------------------- |
| **Artifact**         | Captured input/output from a non-deterministic call (LLM, API, file read). Stored and hashed for replay.   |
| **Command**          | Declarative instruction returned by a step. Interpreted by your runner, not by the kernel.                 |
| **Computed**         | State derived from step outputs. Updated via shallow merge on each commit.                                 |
| **Diff**             | Comparison between original output and recomputed output.                                                  |
| **Effective**        | Read-only merge of computed + overlay: `{ ...computed, ...overlay }`. What the system acts on.             |
| **Event**            | Audit record emitted by a step. Append-only.                                                               |
| **Output**           | In-memory result of a step: `{ output, events, commands? }`. Exists during a run; artifacts persist after. |
| **Overlay**          | Human overrides applied on top of computed state. Overlay always wins.                                     |
| **Recompute**        | Re-run a step with new adapters and compare the diff.                                                      |
| **Recompute Status** | Classification of a recompute result: `clean`, `value_changed`, or `schema_violation`.                     |
| **Replay**           | Reproduce a past run exactly using stored artifacts.                                                       |
| **Schema Violation** | A schema mismatch detected during recompute output validation. Observational – does not gate the result.   |
| **Snapshot**         | Stored bundle of artifacts, metadata, and step output used for replay/recompute.                           |
| **Step**             | A deterministic function that returns `{ output, events, commands? }`.                                     |
| **Workflow**         | Named set of steps plus a version; provides stable identity and type-safe wiring.                          |

## Data flow

```text
run → output (memory) → artifacts (stored) → snapshot (bundle)
```

---

---
url: /guides/overrides.md
---
# Human Overrides

Verist assumes humans can overrule AI output. Those decisions are not a side effect – they are state.

## Why this exists

You upgrade a model. Three past decisions change. A reviewer fixes one decision manually. You recompute again a week later – the human correction still wins.

```text
model v1 → computed = 0.42
human fix → overlay  = 0.90
model v2 → computed = 0.38
effective = 0.90  ← human override preserved
```

## The three-layer state model

| Layer         | Source                        |
| ------------- | ----------------------------- |
| **computed**  | What steps produce            |
| **overlay**   | Human corrections             |
| **effective** | `{ ...computed, ...overlay }` |

```ts
import { effectiveState } from "@verist/storage";

const effective = effectiveState(snapshot);
// equivalent to { ...snapshot.computed, ...snapshot.overlay }
```

Overlay always wins. Human corrections survive recompute.

```text
step → computed
review UI → overlay
recompute → computed updated, overlay unchanged
```

::: warning
Steps must never write to the overlay.
:::

## Why this matters

Without an overlay layer, you choose between:

* Keeping human edits, but losing recompute
* Recomputing, but losing human edits

Verist avoids that tradeoff.

## How to apply an override

Overrides are an external concern. Your UI or review tool writes them to the overlay layer in storage.

Typical flow:

1. Run a step and store the computed output
2. Show the diff to a reviewer
3. If they override, write to overlay

## Design tips

* Use **small, targeted overrides** (single fields) instead of large patches
* Record a **reason** alongside the override
* Treat overrides as **audit events**

---

---
url: /notes/plan.local.md
---
# Identity-Aware Array Diff

Date: 2026-02-09

## Problem

`diff()` compares arrays by index. LLMs don't guarantee stable ordering.
When `recompute` re-runs an extraction step that returns entities in a
different order, every index reports as changed – even though nothing
meaningful changed. This undermines the core value prop: showing humans
what *actually* changed.

See `../verist-ops/docs/issues/index-based-array-diff.md` for the full
write-up and examples.

## Analysis of Approaches

The issue doc proposes four approaches (A–D). Here's why none are optimal
as stated, and what the right answer is.

**A. `keyBy` on `diff()`** – Pollutes the general-purpose structural diff
with domain-specific concerns. `diff()` should stay a clean JSON comparator.

**B. Normalize (sort) in user code** – Zero framework changes, but burden
on every step author, easy to forget, and *still wrong for
insertions/deletions* (sorted index-based diff shows shifted indices, not
the actual add/remove).

**C. `normalizeForDiff` hook** – Declared once per step, but only handles
ordering. A step that adds or removes entities still produces misleading
diffs. Also, an imperative normalization function is more boilerplate than
necessary.

**D. Identity-aware diff in the kernel** – Correct, but the issue assumes
this requires a new diff algorithm with heuristics. It doesn't.

## Solution: `keyBy` + array-to-map normalization

The key insight: **converting a keyed array to a map before diffing gives
identity-aware comparison for free** – the existing `diff()` already
handles objects correctly (additions, removals, per-field changes).

```typescript
const extract = defineStep({
  name: "extract-entities",
  input: z.object({ documentId: z.string() }),
  output: z.object({
    entities: z.array(EntitySchema),
  }),
  keyBy: {
    entities: "id",
  },
  async run(input, ctx) { /* ... */ },
});
```

When `recompute` compares outputs, it normalizes keyed arrays to maps
before diffing. Normalization is transient – stored artifacts always
contain the original arrays.

```text
// Before normalization (arrays, unstable order)
before = { entities: [{id:"e1", text:"John"}, {id:"e2", text:"Acme"}] }
after  = { entities: [{id:"e2", text:"Acme"}, {id:"e1", text:"John"}] }

// After normalization (maps, keyed by identity)
before = { entities: {e1: {id:"e1", text:"John"}, e2: {id:"e2", text:"Acme"}} }
after  = { entities: {e2: {id:"e2", text:"Acme"}, e1: {id:"e1", text:"John"}} }

// diff() on maps → (no changes)
// Object key order doesn't matter – diff iterates sorted keys.
```

### Why this works for all cases

| Scenario       | Index-based (broken)                | Map-based (correct)                  |
| -------------- | ----------------------------------- | ------------------------------------ |
| Reorder only   | Every index shows as changed        | No changes                           |
| Field change   | Correct if same index, noisy if not | `entities.e2.confidence: 0.9 → 0.7` |
| Entity added   | Shifted indices, misleading         | `+ entities.e4: {...}`               |
| Entity removed | Shifted indices, misleading         | `- entities.e1: {...}`               |
| Add + reorder  | Complete noise                      | Clean addition only                  |

### Why this is optimal

1. **`diff()` is unchanged** – stays a general-purpose structural comparator.
   No new algorithm, no options, no complexity.
2. **Correct for all cases** – reordering, additions, removals, field
   changes – all handled by existing object diffing.
3. **Minimal API** – one field (`keyBy`) on step definition. Declarative.
4. **Explicit** – step author declares identity, no magic heuristics.
5. **Readable paths** – `entities.e2.confidence` instead of
   `entities[1].confidence`.
6. **Zero-cost when unused** – no `keyBy` declared → no normalization,
   behavior identical to today.

### Key invariants

**`keyBy` affects only diff normalization.** It has no effect on
execution, replay, validation, or storage.

**Stored data stays raw.** Normalization is transient, applied only at
comparison time inside `recompute` and `compareSnapshots`. Snapshots and
artifacts always store the original array order. This preserves:

* Exact reproduction via replay (byte-identical output)
* Round-trip fidelity with external systems
* Original insertion order when it matters downstream

**`KeyFn` must be a pure function of the element.** No external state,
no randomness, no mutation. Called once per element.

### `keyBy` type

```typescript
type KeyFn = string | ((item: unknown) => string | number);

interface StepConfig<TInput, TOutput, TAdapters> {
  // ... existing fields ...

  /**
   * Identity keys for array fields in the output.
   * Used by recompute to match array elements by identity instead of
   * by index, producing clean diffs when LLMs return unstable ordering.
   *
   * Has no effect on execution, replay, validation, or storage.
   *
   * Key: dot-path to an array field in the output.
   * Value: field name within each element, or a function returning
   * a unique key.
   */
  keyBy?: Record<string, KeyFn>;
}
```

Examples:

```typescript
keyBy: {
  entities: "id",                                       // simple field
  "results.claims": "claimId",                          // nested path
  lineItems: (item) => `${item.section}:${item.line}`,  // composite key
}
```

### Dot-path resolution rules

Paths resolve through plain objects only. Keep it simple and predictable:

* `"entities"` – top-level field, must be an array
* `"results.claims"` – nested field via object traversal
* No numeric indexing (`"items.0.sub"` is NOT supported)
* No wildcards
* Terminal segment must point to an array; intermediates must be objects

### Error handling

`normalizeForDiff` validates at normalization time (fail-fast, clear
messages near the source):

| Condition                              | Behavior                                                 |
| -------------------------------------- | -------------------------------------------------------- |
| Path doesn't resolve (missing/non-obj) | Skip silently – expected for `Partial<T>` outputs        |
| Path resolves but value is not array   | Skip silently – suspicious but non-fatal; schema         |
|                                        | validation is a separate concern                         |
| Element lacks the key field            | Throw: `keyBy: element at entities[2] has no field "id"` |
| Key extractor returns non-string       | Throw: `keyBy: key must be string or number, got object` |
| Duplicate keys in same array           | Throw: `keyBy: duplicate key "e1" in entities`           |
| Empty `keyBy` / no `keyBy`             | No-op, return value unchanged                            |

Silent skips are necessary because step outputs are `Partial<T>` – a
keyed array field may not be present in every run.

**In `recompute`:** normalization errors must be caught and returned as
structured `RecomputeError` (code: `"normalization_failed"`), not
uncaught throws. The current try/catch in `recompute` only covers step
execution – normalization happens after, so it needs its own handling.

### Integration points

**`recompute()`** (primary) – Before calling `diff()`, normalize both
original and recomputed outputs using the step's `keyBy`. The step is
already a parameter, so `keyBy` is available with no plumbing.

**`compareSnapshots()`** – Add optional `keyBy` parameter. Accept raw
`keyBy` (not the full step) to keep the function focused.

**`normalizeForDiff()` exported** – Public utility for ad-hoc comparisons
in tests or custom workflows. Pure function, no side effects.

### What this does NOT change

* `diff()` signature and behavior – unchanged
* `applyDiff()` – not used in recompute; the normalized diff is for
  human review, not mechanical application to the original array
* `formatDiff()` – works as-is; map paths are strings, rendered as
  `entities.e2.confidence`
* Snapshot format – no changes to stored artifacts

## Implementation

### `normalizeForDiff(value, keyBy)` utility

```typescript
/**
 * Normalize value for identity-aware diffing.
 * Converts arrays at keyed paths to Record<key, element>.
 *
 * Paths that don't resolve to an array are silently skipped
 * (partial outputs may omit keyed fields).
 */
export function normalizeForDiff(
  value: unknown,
  keyBy: Record<string, KeyFn>,
): unknown
```

* Walk the value, resolve each dot-path in `keyBy`
* At each matched array, convert to `Record<string, element>`
* Validate: duplicate keys, missing key fields → throw
* Shallow-copy only the parent object holding the array, not the
  entire value tree (performance, referential transparency)
* No-op when `keyBy` is empty

### Changes to `recompute.ts`

```typescript
const normalize = (v: unknown) =>
  normalizeForDiff(v, step.keyBy ?? {});

// Wrap normalization in try/catch → structured RecomputeError
let normalizedOriginal: unknown;
let normalizedNew: unknown;
try {
  normalizedOriginal = normalize(originalOutput);
  normalizedNew = normalize(outputForDiff);
} catch (cause) {
  return err({
    code: "normalization_failed",
    message: cause instanceof Error ? cause.message : String(cause),
    retryable: false,
    cause,
  });
}

const outputDiff = comparable
  ? diff(normalizedOriginal, normalizedNew)
  : undefined;
```

### Changes to `step.ts`

* Add `keyBy` to `StepConfig` and `Step` interfaces
* Carry through in `defineStep()`

### Tests

* Reordered array with key → `equal: true`
* Field change within matched entity → correct path and values
* Entity added → shows as object addition
* Entity removed → shows as object removal
* Composite key function → correct matching
* Nested dot-path → correct resolution
* Duplicate key → throws with message
* Missing key field on element → throws with message
* Key path absent in partial output → silently skipped
* Path resolves to non-array → silently skipped
* No `keyBy` declared → behavior unchanged (index-based)
* `recompute` with normalization error → structured `RecomputeError`

## Connection to LangExtract

LangExtract returns `extractions: Array<Extraction>` with no guaranteed
ordering. Each extraction has fields like `extraction_class`,
`extraction_text`, `attributes`. A Verist step wrapping LangExtract:

```typescript
keyBy: {
  extractions: (e) => `${e.extraction_class}:${e.extraction_text}`,
}
```

This produces clean diffs when re-running extraction with different models
or prompts – showing exactly which entities changed, were added, or were
removed.

## Scope

Small, focused change:

* 1 new public utility (`normalizeForDiff`)
* 2 type changes (`StepConfig`, `Step`)
* 2 integration points (`recompute`, `compareSnapshots`)
* \~80 lines of implementation + ~150 lines of tests

---

---
url: /integrations.md
---
# Integrations

Verist wraps AI tools – it doesn't replace them.

## How it works

Any external tool – LLM provider, extraction library, classifier – runs inside a `defineStep`. Verist captures artifacts, enables replay, and produces diffs when you recompute with a new model or prompt.

```text
your tool (extraction, classification, generation)
        ↓
  defineStep({ run })   ← Verist wraps the call
        ↓
  artifacts captured    ← input, output, events logged
        ↓
  recompute + diff      ← see what changed after a model upgrade
```

## What Verist adds

| Concern                 | Your tool provides         | Verist adds                        |
| ----------------------- | -------------------------- | ---------------------------------- |
| **Extraction quality**  | Prompts, models, grounding | –                                  |
| **Replay**              | –                          | Reproduce past runs from artifacts |
| **Diff**                | –                          | See exactly what changed           |
| **Audit trail**         | –                          | Structured events per execution    |
| **Human overrides**     | –                          | Overlay layer survives recompute   |
| **Identity-aware diff** | –                          | `keyBy` for stable array matching  |

## Available integrations

* [LangExtract](./langextract.md) – structured extraction with source grounding

More integrations coming soon. Any tool that runs inside a step gets replay and diff automatically.

---

---
url: /integrations/langextract.md
---
# LangExtract

Wrap [LangExtract](https://github.com/google/langextract) extraction calls in a Verist step — get replay, regression diffs, and [human override preservation](../guides/overrides.md).

## What LangExtract does

LangExtract is a Python library (by Google) for structured extraction from unstructured text. It maps extractions to source locations for traceability and uses schema-constrained generation on supported models (Gemini) with few-shot guidance as fallback elsewhere.

## Why pair with Verist

| Concern            | LangExtract               | Verist                           |
| ------------------ | ------------------------- | -------------------------------- |
| Extraction quality | Prompts, source grounding | –                                |
| Reproducibility    | –                         | Replay from stored artifacts     |
| Regression diffs   | –                         | See which entities changed       |
| Stable array diffs | –                         | `keyBy` matches entities by ID   |
| Human corrections  | –                         | Overlay layer survives recompute |

## Example: extraction step

LangExtract runs as a Python service. Your Verist step calls it via an adapter.

```ts
import { z } from "zod";
import { defineStep, run, createSnapshotFromResult } from "verist";

const entitySchema = z.object({
  id: z.string(),
  class: z.string(),
  text: z.string(),
  attributes: z.record(z.string(), z.string()).optional(),
});

const extractEntities = defineStep({
  name: "extract-entities",
  input: z.object({ documentId: z.string(), text: z.string() }),
  output: z.object({ entities: z.array(entitySchema) }),

  // Match entities by id instead of array index during diff.
  // Prevents noisy diffs when the LLM returns entities in different order.
  keyBy: { entities: "id" },

  run: async (input, ctx) => {
    const entities = await ctx.adapters.extractor.extract(input.text);
    return {
      output: { entities },
      events: [
        { type: "entities_extracted", payload: { count: entities.length } },
      ],
    };
  },
});
```

Wire the adapter when running the step:

```ts
const result = await run(
  extractEntities,
  { documentId: "doc-42", text: clinicalNote },
  {
    adapters: {
      extractor: {
        extract: async (text) => {
          const res = await fetch("http://localhost:8000/extract", {
            method: "POST",
            headers: { "Content-Type": "application/json" },
            body: JSON.stringify({ text }),
          });
          if (!res.ok) throw new Error(`Extraction failed: ${res.status}`);
          const body = await res.json();
          return body.entities; // LangExtract response → entities array
        },
      },
    },
  },
);

if (result.ok) {
  const snapshot = await createSnapshotFromResult(result.value);
  await db.snapshots.insert(snapshot);
}
```

This example captures step output as a snapshot for diffing. For byte-accurate replay of the extraction call itself, emit the raw request/response as artifacts via [`onArtifact`](../guides/replay-and-diff.md#artifact-capture-in-runners).

## Recompute + diff

After upgrading your extraction model or prompt, recompute to see what changed:

```ts
import { recompute, formatDiff } from "verist";

const recomputeResult = await recompute(snapshot, extractEntities, {
  adapters: { extractor: newExtractor }, // [!code highlight]
});

if (recomputeResult.ok) {
  const { status, outputDiff } = recomputeResult.value;
  console.log("Status:", status);
  // "clean" | "value_changed" | "schema_violation"

  if (outputDiff && !outputDiff.equal) {
    console.log(formatDiff(outputDiff));
  }
}
```

::: tip Production tips

* **Pin the model version** (e.g., `gemini-2.5-flash`) — diffs are meaningless if the baseline model drifts.
* **Use temperature 0** for extraction — non-determinism adds noise to diffs.
* **Ensure stable IDs** on extracted entities — `keyBy` needs them to match elements across runs.

:::

## `keyBy` for extraction results

LLMs often return array elements in unstable order. Without `keyBy`, recompute reports every entity as changed whenever the order shifts.

```ts
keyBy: { entities: "id" },
```

Verist normalizes arrays into maps keyed by `id` before diffing. Only actual content changes appear in the diff.

For composite keys (e.g., class + text), use a function:

```ts
keyBy: {
  entities: (item) => {
    const e = item as { class: string; text: string };
    return `${e.class}::${e.text}`;
  },
},
```

::: warning
Keys must be unique and present on every element. Duplicate or missing keys cause recompute to fail with `normalization_failed`.
:::

## See also

* [Your First Step](../guides/first-step.md) — minimal setup
* [Replay and Diff](../guides/replay-and-diff.md) — the full recompute flow
* [Human Overrides](../guides/overrides.md) — preserving corrections across recomputes

---

---
url: /guides/mental-map.md
---
# Mental Map

Verist can feel bigger than it is. This page shows what you **must** know first, what is **optional**, and when to come back for the rest.

::: tip See the full picture
For a system-level view of how all the pieces connect, see [Architecture Overview](./architecture).
:::

## Adoption levels

### Level 1 – One step, immediate value

**Goal:** Replay + diff without workflow ceremony.

What to learn:

* `defineStep` + `run()`
* Audit events
* Capture a snapshot and recompute diff

::: tip
Most users can stop here and still get 80% of the value.
:::

**Read:** [Your First Step](./first-step), [Replay and Diff](./replay-and-diff)

### Level 2 – Stable identity

**Goal:** Make decisions reproducible across deploys.

What to learn:

* Explicit `workflowId` / `workflowVersion`
* Storing snapshots
* Human review of diffs

**Read:** [Workflows](./workflows), [Human Overrides](./overrides)

### Level 3 – Multi-step workflows

**Goal:** Compose steps safely.

What to learn:

* `defineWorkflow`
* Typed `invoke` / `fanout`
* Commands as control flow

**Read:** [Workflows](./workflows), [Storage and State](./storage)

### Level 4 – Persistence and human review

**Goal:** Production correctness under review.

What to learn:

* Storage adapters
* Computed vs overlay state
* Overrides and audit trails
* Runner wiring

**Read:** [Storage and State](./storage), [Human Overrides](./overrides), [Reference Runner](./reference-runner)

### Level 5 – Advanced

**Goal:** Scale, pause, and batch safely.

What to learn:

* Suspend / resume
* Batch recompute
* Pipelines

**Read:** [Suspend and Resume](./suspend-resume), [Batch Runs](./batch), [Pipelines](./pipeline)

## What you can ignore (for now)

* **Specs and ADRs** – useful for guarantees and deep details, not required to use Verist
* **Pipelines** – only needed for strict linear orchestration
* **Suspend / resume** – only for long-running or human-in-the-loop steps

## Decision checklist

| If you...                          | Stay at... |
| ---------------------------------- | ---------- |
| Only need replay + diff            | Level 1    |
| Need stable IDs across deployments | Level 2    |
| Have multiple steps                | Level 3    |
| Need human review or overrides     | Level 4    |
| Need pause/batch/linear pipelines  | Level 5    |

---

---
url: /guides/pipeline.md
---
# Pipelines

A pipeline runs steps in sequence with explicit boundaries.

Use it when you want composition without building a full orchestrator.

## What pipelines are good for

* Clean, linear flow: `extract → verify → score`
* Explicit checkpoints between stages
* Clear audit trail between stages

## What pipelines are not

* A graph engine
* A scheduler
* A replacement for your queue

## Mental model

A pipeline is a list of stages. Each stage takes the current state and produces an output. The runner applies each output and moves on.

## When to use what

| Use case         | Use       |
| ---------------- | --------- |
| Linear flow      | Pipeline  |
| Branching        | Commands  |
| Audit + identity | Workflow  |
| One-off          | Bare step |

If your flow is not linear, see [Workflows](./workflows) first.

## Design tips

* Keep stages small and deterministic
* Emit audit events per stage
* Capture artifacts at stage boundaries

---

---
url: /guides/reference-runner.md
---
# Reference Runner

Verist does not ship an orchestrator. This is a minimal runner loop to show where the kernel fits.

::: info
This is **not** production-grade. It is a reference for wiring.
:::

## What the runner does

1. Load state
2. Run a step
3. Commit output + events
4. Enqueue commands
5. Capture artifacts

## Minimal loop

```ts
import { run, createSnapshotFromResult } from "verist";

for (;;) {
  const job = await queue.take();
  if (!job) continue;

  const { step, input, identity } = job;
  const extraArtifacts = [];

  const result = await run(step, input, {
    adapters,
    workflowId: identity.workflowId,
    workflowVersion: identity.workflowVersion,
    runId: identity.runId,
    onArtifact: (artifact) => {
      if (
        artifact.kind !== "step-output" &&
        artifact.kind !== "step-commands"
      ) {
        extraArtifacts.push(artifact);
      }
    },
  });

  if (!result.ok) {
    await store.recordFailure(result.error);
    continue;
  }

  await store.commit({
    workflowId: result.value.workflowId,
    runId: result.value.runId,
    stepId: result.value.stepName,
    expectedVersion: await store.currentVersion(result.value.runId),
    output: result.value.output,
    events: result.value.events,
  });

  for (const cmd of result.value.commands ?? []) {
    await queue.enqueue(cmd);
  }

  const snapshot = await createSnapshotFromResult(result.value, {
    artifacts: extraArtifacts,
  });

  await snapshotStore.save(snapshot);
}
```

## Notes

* Capture artifacts only if you need replay/recompute
* Commands are auto-captured; use `captureCommands: false` to suppress
* Keep state in your DB; the kernel is stateless by design

See [Anti-Patterns](./anti-patterns) for common mistakes to avoid when building a runner.

---

---
url: /guides/replay-and-diff.md
---
# Replay and Diff

This is the heart of Verist. If you only use one feature, use this.

| Capability    | Description                                          |
| ------------- | ---------------------------------------------------- |
| **Replay**    | Re-run a past decision and get byte-identical output |
| **Recompute** | Run the same step with a new model or prompt         |
| **Diff**      | See exactly what changed before you ship             |

## The flow

```text
run step → capture artifacts → store snapshot → later: recompute and diff
```

## 1. Run a step

```ts
import { defineStep, run } from "verist";
import { z } from "zod";

const verifyDocument = defineStep({
  name: "verify-document",
  input: z.object({ docId: z.string(), text: z.string() }),
  output: z.object({
    verdict: z.enum(["accept", "reject"]),
    confidence: z.number(),
  }),
  run: async (input, ctx) => {
    const verdict = await ctx.adapters.llm.verify(input.text);
    return {
      output: { verdict, confidence: 0.84 },
      events: [{ type: "document_verified", payload: { docId: input.docId } }],
    };
  },
});

const result = await run(
  verifyDocument,
  { docId: "doc-1", text: "Hello" },
  {
    adapters: { llm: yourLlmAdapter },
    workflowId: "verify-document",
    workflowVersion: "1.0.0",
    runId: "run-1",
  },
);
```

## 2. Capture artifacts and store snapshot

```ts
import { createSnapshotFromResult } from "verist";

if (!result.ok) throw new Error(result.error.message);

const snapshot = await createSnapshotFromResult(result.value);

await db.snapshots.insert(snapshot);
```

### Artifact capture in runners

Artifacts are emitted via `onArtifact` during execution. Store them, then attach to the snapshot:

```ts
import type { Artifact } from "verist";

const extraArtifacts: Artifact[] = [];

const result = await run(verifyDocument, input, {
  adapters,
  onArtifact: (artifact) => {
    if (artifact.kind !== "step-output" && artifact.kind !== "step-commands") {
      extraArtifacts.push(artifact);
    }
  },
});

if (result.ok) {
  const snapshot = await createSnapshotFromResult(result.value, {
    artifacts: extraArtifacts,
  });
  await snapshotStore.save(snapshot);
}
```

### Storage options

| Setup               | Use case                                                 |
| ------------------- | -------------------------------------------------------- |
| In-memory           | Local dev, quick iteration                               |
| Blob store (S3/GCS) | Store artifact content, keep references in DB            |
| Database            | Snapshot metadata in DB, large payloads in content store |

## 3. Recompute and diff later

```ts
import { recompute, formatDiff } from "verist";

const recomputeResult = await recompute(snapshot, verifyDocument, {
  adapters: { llm: newModelAdapter }, // [!code highlight]
});

if (recomputeResult.ok) {
  const { status, outputDiff, commandsDiff, schemaViolations } =
    recomputeResult.value;

  if (status === "schema_violation")
    console.log("Violations:", schemaViolations);
  if (outputDiff && !outputDiff.equal) console.log(formatDiff(outputDiff));
  if (commandsDiff && !commandsDiff.equal)
    console.log(formatDiff(commandsDiff));
}
```

## What gets diffed

| What changed            | How it shows up                             |
| ----------------------- | ------------------------------------------- |
| Output values           | `outputDiff` from `recompute()`             |
| Control flow            | `commandsDiff` (auto-captured when present) |
| Schema violations       | `schemaViolations` (requires `validate`)    |
| Inputs across snapshots | `inputDiff` from `compareSnapshots()`       |

The `status` field classifies each result: `"clean"`, `"value_changed"`, or `"schema_violation"` (highest severity wins). Command changes are orthogonal and tracked separately.

::: info
Events are audit logs and are **not** diffed. If original output is missing or hash-only, `comparable` is `false` and `outputDiff` is `undefined`.
:::

## Replay vs recompute

| Situation                   | Use               |
| --------------------------- | ----------------- |
| Audit / incident review     | Replay            |
| Debugging a past decision   | Replay            |
| Model or prompt upgrade     | Recompute         |
| New adapter or feature flag | Recompute         |
| Backfill on historic data   | Batch + recompute |

## What to capture

Capture anything that can change across runs:

* LLM input and output
* External API responses
* File contents
* Feature flags or config that affects behavior

**Rule:** If it can change, it should be an artifact.

## Stable diffs with `keyBy`

LLMs often return array elements in unstable order. Without identity keys, recompute reports every element as changed whenever the order shifts.

Use `keyBy` on `defineStep` to match array elements by identity instead of index:

```ts
const extractEntities = defineStep({
  name: "extract-entities",
  input: z.object({ text: z.string() }),
  output: z.object({ entities: z.array(entitySchema) }),
  keyBy: { entities: "id" }, // [!code highlight]
  run: async (input, ctx) => {
    /* ... */
  },
});
```

Verist normalizes keyed arrays into maps before diffing, so only actual content changes appear in the diff. Keys must be unique and present on every element.

For composite keys, use a function:

```ts
keyBy: {
  entities: (item) => {
    const e = item as { class: string; text: string };
    return `${e.class}::${e.text}`;
  },
},
```

## Common mistakes

| Mistake                        | Consequence                  |
| ------------------------------ | ---------------------------- |
| Not capturing artifacts        | Replay won't be exact        |
| Not storing snapshots          | Recompute becomes impossible |
| Mixing side effects into steps | Non-deterministic outputs    |

---

---
url: /specs/commands.md
---
# SPEC: Commands

Commands are declarative data returned by steps to express "what should happen next." The kernel does not execute commands – external runners interpret them.

Commands are **intent, not action**. Steps return commands; runners decide how to execute them.

## Command Types

| Type      | Purpose                   | Runner Obligation                  |
| --------- | ------------------------- | ---------------------------------- |
| `invoke`  | Schedule another step     | Enqueue or execute the target step |
| `fanout`  | Schedule step N times     | Enqueue or execute for each input  |
| `review`  | Request human approval    | Create gate, block all commands    |
| `emit`    | External integration      | Publish to topic/queue             |
| `suspend` | Await external input/data | Persist suspension, block commands |

Commands are kernel-defined: `invoke | fanout | review | emit | suspend`. For custom integrations, use `emit` with a topic (e.g., `emit("slack:alerts", payload)`) or `invoke` a dedicated integration step.

## Runner Contract

### 1. Commands MUST Be Handled

Runners MUST either:

* Execute the command, or
* Persist it for later execution

Silently ignoring commands violates the contract. Commands SHOULD be persisted atomically with `output + events` to prevent "committed state but lost command" scenarios.

### 2. Blocking Commands: Review and Suspend

`review` and `suspend` are **blocking commands** – they halt execution of sibling commands.

**At most one blocking command**: A step result MUST NOT contain multiple blocking commands. Runners MUST fail the step execution if they find two or more `suspend`, two or more `review`, or any combination of both. This is an orchestration error.

**Review** (human approval):

* Step output/events are committed as **provisional state**
* Sibling commands are **deferred** until review resolves
* Run enters "pending review" state
* Resolution: approve (execute deferred), reject (discard deferred), override (apply correction, then continue)

**Suspend** (await external data):

* The step's output and events are committed
* A suspension record is created with the checkpoint
* All sibling commands are **discarded** (not deferred)
* The run enters "suspended" state

Resume flow: external trigger calls resume, runner invokes `resumeStep` with `{ checkpoint, resumeData }`, resumed step emits new commands.

The distinction: `review` awaits human judgment on computed results (sibling commands remain valid); `suspend` awaits external data that may change execution context (sibling commands may no longer be valid).

### 3. Command Order Is Advisory

For commands without a `review` barrier, array order is advisory. Runners MAY:

* Execute in parallel (multiple `invoke` to independent steps)
* Batch `fanout` items
* Reorder for efficiency

Runners MUST NOT reorder in ways that violate data dependencies.

### 4. Idempotency

Runners SHOULD deduplicate command execution. A recommended approach:

```text
commandKey = hash(workflowId, runId, stepName, canonicalized(command))
```

Use stable JSON serialization (sorted keys) for the command payload. Execute each `commandKey` at most once.

`suspend` commands are idempotent at the runner level: repeated execution MUST NOT create multiple open suspension records for the same `(workflowId, runId, stepName)`. This is typically enforced via database constraint (see SPEC-suspend).

## Command Details

### invoke

```typescript
{ type: "invoke", step: string, input: unknown }
```

Request another step to run. The `step` field is a step name. Runners resolve names to implementations.

### fanout

```typescript
{ type: "fanout", step: string, inputs: unknown[] }
```

Equivalent to multiple `invoke` commands:

```typescript
// fanout("process", [a, b, c]) is equivalent to:
commands: [invoke("process", a), invoke("process", b), invoke("process", c)];
```

Runners MAY parallelize. Results are typically aggregated by a subsequent step that queries state.

### review

```typescript
{ type: "review", reason: string, payload?: unknown }
```

Request human review. The `reason` explains why. Optional `payload` provides context.

Runners MUST track review state.

### emit

```typescript
{ type: "emit", topic: string, payload: unknown }
```

Publish to an external system. Unlike audit events (internal log), emit is for integration: webhooks, message buses, notifications. Emit commands are not replayed during recompute – they represent one-time side effects.

Use topic namespacing for routing (e.g., `doc.verified`, `slack:alerts`).

### suspend

```typescript
{ type: "suspend", reason: string, checkpoint: unknown, resumeStep?: string }
```

Pause workflow execution until external data arrives. The `checkpoint` captures serialized state for resume (MUST be JSON-serializable). `resumeStep` specifies which step handles the resume (defaults to the suspending step, but a dedicated resume handler is recommended).

```typescript
// Suspend with explicit resume handler (recommended)
commands: [
  suspend({
    reason: "awaiting_documentation",
    checkpoint: { claimId, requestedDocType: "financial" },
    resumeStep: "handleDocumentation",
  }),
];

// Suspend for webhook callback
commands: [
  suspend({
    reason: "awaiting_callback",
    checkpoint: { webhookId },
    resumeStep: "handleWebhookResponse",
  }),
];
```

See SPEC-suspend for full resume semantics and runner contract.

## Anti-Patterns

* Executing commands inside steps (side effects in step code)
* Assuming command array order implies execution order

---

---
url: /specs/kernel-invariants.md
---
# SPEC: Kernel Invariants

Core guarantees that Verist maintains at all times. Code that violates these invariants is incorrect.

These invariants are part of the **Tier 1 (Kernel)** stability guarantee (see ADR-005).

## 1. Steps Are Pure

Given identical inputs and artifact playback, a step produces identical outputs. Side effects happen through adapters, not directly in step code.

## 2. State Lives in Database

The database is the source of truth. In-memory state is ephemeral. Queue jobs are pointers, not payloads.

## 3. Commands Are Data

Steps return commands as plain objects describing intent. The kernel does not execute commands – runners interpret them.

## 4. Outputs Are Partial

Steps return partial state updates (changed fields only). Runners merge outputs into persisted state.

## 5. Events Are Immutable

Audit events are append-only. They are never modified or deleted.

## 6. Replay Is Exact

With captured artifacts, replay produces byte-identical outputs. All nondeterminism must be artifacted.

## 7. Overlay Wins

Human corrections (overlay) take precedence over computed values when deriving effective state.

## 8. Hashes Are Mandatory

Every LLM interaction records input and output hashes to enable audit, dedupe, and correlation.

## 9. Errors Are Values

Expected failures return `Result` values; thrown exceptions indicate bugs, not business logic failures. See ADR-012.

## 10. Version Is Auditable

Every step execution records the workflow version and exposes it in results.

## 11. No Runtime Assumptions

Steps and runners must assume short-lived, stateless execution. No reliance on durable memory, background loops, or local filesystem state.

---

---
url: /specs/overview.md
---
# SPEC: Overview

## Concepts

**Workflow** – Named sequence of steps with typed state.

**Step** – Pure function: `(input, context) => { output, events?, commands? }`. Atomic, idempotent. May call LLMs. Commands express routing declaratively.

**State** – Domain data persisted between steps. Source of truth is database.

**Event** – Audit record emitted by steps. Immutable log of what happened and why.

**Context** – Runtime dependencies (db, llm, queue) injected into steps.

## Flow

```
Queue message → Load state → Run step → Persist output + events → Emit next message
```

Steps don't know about queues. Orchestration is external.
The kernel is compatible with short-lived, stateless execution environments (edge/serverless) by design.

## Kernel Invariants

### Step Invariants

Steps are the atomic unit of the kernel. These rules are non-negotiable:

* **Idempotent**: Same input + state → same output. Safe to retry.
* **Pure**: No global mutable state. All dependencies injected via context.
* **Explicit effects**: Side effects expressed only via events and commands.
* **Replay-safe**: Replayable given only recorded input, state, and artifacts.
* **Self-contained**: No implicit dependencies on execution order or external state.

Any value that influences a step's output must be input, state, or artifact-addressable.

### Serialization Contract

Kernel serialization (`stableStringify`, `hashValue`) follows JSON semantics for `undefined` normalization. This applies to all hashing and artifact identity in Verist:

* Top-level `undefined` becomes `null`
* Object properties with `undefined` values are omitted
* `undefined` in arrays becomes `null`

This ensures deterministic hashing regardless of how values were constructed.

## API Sketch

```typescript
const workflow = defineWorkflow({
  name: "verify-document",
  version: "1.0.0", // required for audit
  steps: { extract, verify, score },
});

const step = defineStep({
  name: "extract",
  input: z.object({ documentId: z.string() }),
  output: z.object({ claims: z.array(ClaimSchema) }),
  run: async (input, ctx) => {
    const doc = await ctx.adapters.db.getDocument(input.documentId);
    const claims = await ctx.adapters.llm.extract(doc.content);
    return {
      output: { claims },
      events: [{ type: "claims_extracted", payload: { count: claims.length } }],
      // Type-safe: workflow.invoke infers input type from step schema
      commands: [workflow.invoke("verify", { claims })],
    };
  },
});
```

Expected failures return values instead of throwing. Use `fail()` to preserve structured error metadata for runners (see ADR-012 and SPEC-steps):

```typescript
import { fail } from "verist";

const result = await someAdapterCall();
if (!result.ok) return fail(result.error);
```

## Commands

Steps return optional commands to express "what should happen next" declaratively:

```typescript
type Command =
  | { type: "invoke"; step: string; input: unknown } // request next step
  | { type: "fanout"; step: string; inputs: unknown[] } // parallel processing
  | { type: "review"; reason: string; payload?: unknown } // human-in-the-loop
  | { type: "emit"; topic: string; payload: unknown } // external event
  | {
      type: "suspend";
      reason: string;
      checkpoint: unknown;
      resumeStep?: string;
    }; // await external input
```

Commands are data, not execution. The external runner interprets them.

### Command Semantics

* **invoke / fanout**: Control commands. These direct execution to other steps. `fanout` inputs are logically independent; each input represents an isolated step execution. Runners may batch or parallelize, but must not share mutable state between executions.

* **review**: Blocking command. Workflow progression must stop until an external decision is provided. How the decision is captured and how execution resumes are runner concerns.

* **suspend**: Blocking command. Pauses workflow until external data arrives. Unlike review (human approval), suspend awaits data/callbacks. Sibling commands are discarded; the resumed step emits new commands. See SPEC-suspend.

* **emit**: Side-effect command for external systems. Distinct from audit events; not part of the internal evidence log.

## Audit Event Structure

```typescript
// Core emits minimal events; orchestrator adds id, timestamp, workflowId, stepName
interface AuditEvent {
  type: string;
  payload?: Record<string, unknown>;
  llmTrace?: LLMTrace;
}

interface LLMTrace {
  model: string;
  promptTokens: number;
  completionTokens: number;
  durationMs: number;
  inputHash: string; // always present for audit correlation
  outputHash: string; // always present for deduplication
  input?: unknown; // optional - can be omitted for compliance
  output?: unknown; // optional - can be omitted for compliance
}
```

## State Management

* **Computed**: AI-derived, rewritten on recompute
* **Overlay**: Human overrides, never touched by recomputation
* **Effective**: `{ ...computed, ...overlay }` – overlay keys take precedence

Recomputation never modifies human decisions. See ADR-003 for merge semantics.

## Execution Contract

After calling `run()`, the caller **must** complete the following to honor the kernel's guarantees. Below, `stepResult` refers to the unwrapped `StepResult` (i.e., `result.value` when `result.ok === true`).

1. **Persist the output** – Merge `stepResult.output` into storage (computed layer). Without this, state is lost.

2. **Record events** – Write `stepResult.events` to your audit log. Events are the evidence trail.

3. **Enqueue commands** – If `stepResult.commands` is non-empty, translate each command to your queue/orchestration system.

```typescript
const result = await run(step, input, {
  adapters,
  workflowId: "verify-document",
  workflowVersion: "1.0.0",
  runId: crypto.randomUUID(),
});

if (result.ok) {
  // 1. Persist output
  await store.commit({
    workflowId: result.value.workflowId,
    runId: result.value.runId,
    stepId: result.value.stepName,
    expectedVersion: currentVersion,
    output: result.value.output,
    events: result.value.events,
  });

  // 2. Enqueue commands
  for (const cmd of result.value.commands ?? []) {
    await queue.enqueue(cmd);
  }
}
```

**Verist guarantees correctness if and only if this contract is honored.**

Skipping any step breaks invariants:

* Skipping persistence → state drift, failed replays
* Skipping events → incomplete audit trail
* Skipping commands → workflow stalls

## Kernel Boundaries

Verist is a kernel, not a platform. The following are explicitly external:

**Orchestration**: The kernel produces commands; external runners execute them. Verist does not manage queues, workers, or retry policies.

**Approval workflows**: When a step returns a `review` command, the kernel's job is done. Approval UIs, escalation logic, and persistence of approvals are external.

**Diff review and persistence**: `recompute()` produces a diff. The decision to accept, reject, or modify that diff – and persist the outcome – is external.

The kernel's contract ends at: `(input, artifacts) → (output, events, commands)`.

**Runner constraints**: External orchestrators must treat steps as pure black boxes:

* Must not mutate state directly
* Must not interpret step output beyond commands
* Must not inject implicit retries or branching logic
* Must not add behavior not expressed in commands

**Runtime assumptions**: Runners may be short-lived and stateless. Steps must not depend on:

* In-memory durable state across executions
* Background loops or long-lived workers
* Local filesystem for persistence

**Layer boundaries**: Higher-level packages consume only the kernel's public outputs – audit events and step outputs – never raw internal state. This ensures the kernel remains universal and extensions are purely additive.

## Future: Trust Kit

The Trust Kit is a planned Tier 2 capability package (`@verist/trust`) that will provide opt-in primitives for high-trust workflows:

* Evidence tiers and verdict types
* Contradiction detection and review escalation
* Computed/overlay helpers with conflict surfacing
* Immutable snapshot builders

The kernel (`verist`) remains universal; Trust Kit adds domain-specific trust primitives.

---

---
url: /specs/replay.md
---
# SPEC: Replay

Deterministic replay and recomputation for Verist workflows.

## Concepts

**Artifact** – A captured non-deterministic value with its content hash.

**Snapshot** – Point-in-time capture of step execution with all artifacts needed to replay.

**Replay** – Exact reproduction of past execution using stored artifacts; output is byte-identical.

**Recompute** – Fresh execution with current adapters; produces diffs vs. the original snapshot.

## Types

```typescript
interface Artifact {
  hash: string; // SHA-256 of content
  kind: ArtifactKind;
  content?: unknown; // Optional for compliance
}

// Reserved kinds (kernel-defined)
// - "step-output": step's output + events, used by replay/recompute
// - "step-commands": step's commands, used by recompute command diffing
// User-defined kinds (e.g., "llm-input", "llm-output") are opaque metadata
type ArtifactKind = "step-output" | "step-commands" | (string & {});

interface Snapshot {
  workflowId: string;
  workflowVersion: string;
  stepName: string;
  input: unknown;
  inputHash: string;
  artifacts: Artifact[];
  capturedAt: number; // Unix timestamp (ms)
}

interface DiffResult {
  equal: boolean;
  entries: DiffEntry[];
}

interface DiffEntry {
  path: (string | number)[];
  before: unknown;
  after: unknown;
}
```

## Artifact Capture

### Core Integration

The `run()` function accepts an optional `onArtifact` callback for artifact capture:

```typescript
const artifacts: Artifact[] = [];

const result = await run(step, input, {
  adapters,
  onArtifact: (artifact) => artifacts.push(artifact),
});

// artifacts now contains step-output and any adapter-emitted artifacts
```

When `onArtifact` is provided, core automatically emits a `step-output` artifact containing `{ output, events }`.

### Adapter Integration

Adapters emit their own artifacts via the callback passed through context:

```typescript
// In LLM adapter
if (ctx.onArtifact) {
  ctx.onArtifact(captureArtifact("llm-input", request));
  // ... execute LLM call ...
  ctx.onArtifact(captureArtifact("llm-output", response));
}
```

### withReplay Helper

A convenience wrapper captures artifacts and creates a snapshot in one call:

```typescript
const { result, artifacts } = await withReplay(step, input, { adapters });
const snapshot = createSnapshot({ ...result.value, artifacts });
```

## API

### Hashing

```typescript
const hash = await hashValue(value);
const { hash, content } = await hashWithContent(value);
```

Hashes are deterministic: identical values produce identical hashes regardless of key order. Uses Web Crypto API (async) for cross-platform support.

**Serialization semantics:** `undefined` values follow JSON behavior – top-level `undefined` becomes `null`; object properties with `undefined` are omitted; `undefined` in arrays becomes `null`.

### Capturing Artifacts

```typescript
const artifact = await captureArtifact("llm-output", response);
const hashOnly = await captureArtifact("llm-output", response, {
  hashOnly: true,
});
```

### Creating Snapshots

```typescript
const snapshot = await createSnapshot({
  workflowId,
  workflowVersion,
  stepName,
  input,
  artifacts,
});
```

### Diffing

```typescript
const result = diff(before, after);
const updated = applyDiff(base, result);
```

### Loading Output

```typescript
const result = await loadOutput(snapshot);
if (result.ok) {
  console.log(result.value);
}
```

Loading output requires a `step-output` artifact with content. Hash-only artifacts cannot be loaded.

### Recompute

```typescript
const result = await recompute(snapshot, step, ctx, {
  validate: true,
  strictOutput: true,
});
```

Options:

| Option             | Default | Description                                                                         |
| ------------------ | ------- | ----------------------------------------------------------------------------------- |
| `validate`         | `true`  | Enable schema validation (input: strict gate, output: observational)                |
| `strictOutput`     | `false` | Validate output against full `outputSchema` instead of partial. Requires `validate` |
| `captureArtifacts` | `false` | Capture output artifact (`true` for full content, or `CaptureOptions`)              |

Recompute verifies the input hash before execution. If it does not match, returns `err()` with code `input_hash_mismatch`.

### Comparing Snapshots

```typescript
const { inputDiff, outputDiff, commandsDiff } = compareSnapshots(a, b);
```

## Semantics

* **Replay** must be byte-identical to the original output when artifacts are available.
* **Recompute** compares current output and commands to the original snapshot.
* **Snapshot integrity**: A snapshot is valid iff `step-output` was produced from the same `inputHash` recorded in the snapshot.
* **Command diffs** are first-class: control-flow changes are reviewable. `createSnapshotFromResult()` auto-captures commands as a `step-commands` artifact when present. Use `captureCommands: false` to suppress.
* **Emission order**: Core emits `step-output` after step execution completes (after any adapter artifacts emitted during the run). Core awaits artifact hashing before invoking `onArtifact`; callbacks are invoked sequentially in emission order. When multiple artifacts share a kind, the first emitted takes precedence.
* **Core emits `step-output` only**: The `step-commands` artifact is optionally emitted by replay helpers (e.g., `withReplay`). If neither artifact exists, command diffing returns `undefined`.
* **Command capture is automatic**: `createSnapshotFromResult()` emits `step-commands` whenever commands are present. Suppress with `captureCommands: false`. If neither artifact exists, `commandsDiff` falls back to commands embedded in `step-output` (if present) or returns `undefined`.
* **Artifact precedence**: when both `step-commands` and `step-output` contain commands, `step-commands` is authoritative.
* **Hash-only limits diffing**: if `commandsHashOnly: true` is used and no other source provides command content, `commandsDiff` will be `undefined`.

---

---
url: /specs/steps.md
---
# SPEC: Steps

Authoritative spec for step definition, execution, context, and error handling.

## Step Definition

A step is a pure function with typed input and output schemas:

```typescript
const step = defineStep({
  name: "extract",
  input: z.object({ documentId: z.string() }),
  output: z.object({ claims: z.array(z.string()) }),
  run: async (input, ctx) => {
    return { output: { claims: ["..."] } };
  },
});
```

### Properties

| Property | Type                                                 | Description                                               |
| -------- | ---------------------------------------------------- | --------------------------------------------------------- |
| `name`   | `string`                                             | Unique step identifier (default `workflowId` in `run()`). |
| `input`  | `z.ZodType<TInput>`                                  | Input schema (strictly validated).                        |
| `output` | `z.ZodType<TOutput>`                                 | Output schema; runtime output is `Partial<TOutput>`.      |
| `run`    | `(input, ctx) => Promise<StepReturn \| StepFailure>` | Step logic.                                               |

### `StepReturn`

```typescript
interface StepReturn<TOutput extends object> {
  output: Partial<TOutput>;
  events?: AuditEvent[];
  commands?: Command[];
}
```

`output` is a partial state update; `events` and `commands` are optional.

### `StepFailure`

Steps can return structured errors via `fail()` instead of throwing:

```typescript
run: async (input, ctx: LLMContext) => {
  const result = await extract(ctx, request, schema);
  if (!result.ok) return fail(result.error);
  return { output: result.value.data };
};
```

See ADR-012. `runStep()` and `recompute()` preserve `code` and `retryable`. Thrown exceptions become `execution_failed`.

## Step Context

Context is injected into `run()` as the second parameter.

```typescript
interface StepContext<TAdapters extends BaseAdapters = BaseAdapters> {
  adapters: TAdapters;
  workflowId: string;
  workflowVersion: string;
  runId: string;
  onArtifact?: OnArtifact;
  emitEvent: (event: AuditEvent) => void;
}
```

### Adapters

External services are injected via `adapters`. The type is inferred from the `ctx` annotation:

```typescript
// LLMContext alias covers the common case
run: async (input, ctx: LLMContext) => { ... }

// Custom adapters use StepContext<T> directly (or a type alias)
type ReviewContext = StepContext<{ requireReview: boolean }>;
run: async (input, ctx: ReviewContext) => { ... }
```

### `emitEvent`

Emits audit events during execution. Events are merged with `StepReturn.events`.

### `onArtifact`

Captures adapter-emitted artifacts during execution (e.g., `llm-input`, `llm-output`). `StepResult.artifacts` contains only adapter-emitted kinds; reserved kinds (`step-output`, `step-commands`) are created at snapshot time.

## Step Execution

### `run()` – Simplified API

```typescript
const result = await run(step, input, { adapters: { llm } });
```

Defaults: `workflowId = step.name`, `workflowVersion = "0.0.0"`, `runId = crypto.randomUUID()`.

### `runStep()` – Full Control

```typescript
const result = await runStep({
  step,
  input,
  contextFactory: createContextFactory(adapters),
  workflowId: "my-workflow",
  workflowVersion: "1.0.0",
  runId: crypto.randomUUID(),
});
```

For explicit workflow/version control, custom context factories, and multi-step workflows.

### Execution Flow

1. Validate input (`input_validation` on failure)
2. Create context (with `emitEvent`, `onArtifact`)
3. Run step; `StepFailure` becomes `StepError`, throws become `execution_failed`
4. Validate output (`output_validation` on failure)
5. Merge events; collect artifacts; return `Result<StepResult, StepError>`

### `StepResult`

```typescript
interface StepResult<TInput, TOutput extends object> {
  input: TInput;
  output: Partial<TOutput>;
  events: AuditEvent[];
  commands?: Command[];
  artifacts: Artifact[]; // adapter-emitted only; reserved kinds added at snapshot time
  stepName: string;
  workflowId: string;
  workflowVersion: string;
  runId: string;
}
```

### `StepError`

```typescript
interface StepError {
  code: StepErrorCode;
  message: string;
  retryable: boolean; // always present — normalized by runStep()
  cause?: unknown;
}

type StepErrorCode =
  | "input_validation"
  | "output_validation"
  | "execution_failed"
  | (string & {}); // open for step-defined codes
```

`retryable` is always present on `StepError` and means “safe to retry with identical input and context.” Kernel-owned codes are `input_validation`, `output_validation`, `execution_failed`. Other codes are domain-specific.

## Extraction Steps

`defineExtractionStep` in `@verist/llm` eliminates boilerplate for the common LLM extraction pattern:

```typescript
import { defineExtractionStep } from "@verist/llm";

const step = defineExtractionStep({
  name: "extract-job",
  input: z.object({ text: z.string() }),
  output: schema,
  request: (input) => ({
    model: "gpt-4o",
    messages: [{ role: "user", content: `Extract: ${input.text}` }],
    responseFormat: "json",
  }),
});
```

Internally calls `defineStep` + `extract` + `fail`. Use `defineStep` + `extract()` for custom logic (pre/post-processing, multiple LLM calls, conditional extraction).

---

---
url: /specs/suspend.md
---
# SPEC: Suspend/Resume

Long-running workflows that pause for external input and resume when ready.

## Problem

Some workflows cannot complete in a single execution:

* Verification requires founder to upload documentation
* Approval needs human review with unbounded latency
* External system callback hasn't arrived yet

These workflows must **suspend** (persist state and exit) then **resume** (continue from where they left off) when the blocking condition resolves.

## Concepts

**Suspension** – A workflow pause with serialized state. The step signals it cannot proceed, and the runner persists enough context to resume later.

**Suspension Reason** – Why the workflow paused. Enables routing: `awaiting_input` goes to founder UI, `awaiting_callback` waits for webhook.

**Resume Trigger** – External event that unblocks the workflow: founder uploads document, webhook arrives, timeout expires.

**Checkpoint** – Serialized state captured at suspension time. Immutable – resume adds new data alongside it.

## Design Principles

1. **Suspend is a command** – Follows existing command pattern (data, not action)
2. **Checkpoint is immutable** – Resume doesn't mutate suspension record
3. **Resume is a new step execution** – Not "continuing" old execution
4. **State in database** – Suspension records live in DB, not memory/KV
5. **Sibling commands are discarded** – Resumed step emits new commands

## Types

```typescript
interface SuspendCommand {
  type: "suspend";
  reason: string;
  checkpoint: unknown; // Serialized state for resume. MUST be JSON-serializable.
  resumeStep?: string; // Step to invoke on resume (defaults to current step)
}

interface SuspensionRecord {
  id: string;
  workflowId: string;
  workflowVersion: string; // Version that produced this suspension
  runId: string;
  stepName: string;
  reason: string;
  checkpoint: unknown; // Immutable, set at suspend time
  resumeStep: string; // Step to invoke on resume
  resumeData?: unknown; // Set once at resume time
  suspendedAt: Date;
  resumedAt?: Date;
}

interface ResumePayload {
  checkpoint: unknown; // From suspension
  resumeData: unknown; // New data triggering resume
}
```

**Checkpoint constraints:**

* MUST be JSON-serializable (no functions, circular refs, or non-JSON types)
* SHOULD contain references (IDs, hashes) rather than large payloads
* Runners MAY enforce size limits and reject oversized checkpoints

## API

### Returning Suspend Command

```typescript
const verifyClaim = defineStep({
  name: "verifyClaim",
  input: VerifyClaimInput,
  output: VerifyClaimOutput,

  async run({ input, adapters }) {
    const claim = await adapters.db.getClaim(input.claimId);
    const method = resolveVerificationMethod(claim);

    if (method === "documentation_required") {
      // Cannot proceed without founder input
      return {
        output: { status: "awaiting_input" },
        events: [
          { type: "verification_suspended", payload: { claimId: claim.id } },
        ],
        commands: [
          suspend({
            reason: "awaiting_documentation",
            checkpoint: {
              claimId: claim.id,
              claimText: claim.text,
              requestedDocType: "financial_statement",
            },
            resumeStep: "handleDocumentation",
          }),
        ],
      };
    }

    // Normal verification flow...
  },
});
```

### Command Builder

```typescript
function suspend(args: Omit<SuspendCommand, "type">): SuspendCommand {
  return { type: "suspend", ...args };
}
```

### Runner Handling

```typescript
// Runner interprets suspend command
async function handleStepResult(result: StepResult, tx: Transaction) {
  const commands = result.commands ?? [];
  const blockingCmds = commands.filter(
    (c) => c.type === "suspend" || c.type === "review",
  );

  // Validate: at most one blocking command
  if (blockingCmds.length > 1) {
    throw new OrchestrationError("Multiple blocking commands in step result");
  }

  const suspendCmd = commands.find((c) => c.type === "suspend");

  if (suspendCmd) {
    // Atomic: output + events + suspension record + run status
    await tx.runs.update(result.runId, { status: "suspended" });
    await tx.suspensions.insert({
      id: generateId(),
      workflowId: result.workflowId,
      workflowVersion: result.workflowVersion,
      runId: result.runId,
      stepName: result.stepName,
      reason: suspendCmd.reason,
      checkpoint: suspendCmd.checkpoint,
      resumeStep: suspendCmd.resumeStep ?? result.stepName,
      suspendedAt: new Date(),
    });
    // Sibling commands are discarded  – resumed step will emit new commands
    return;
  }

  // Normal command handling...
}
```

### Resume Flow

```typescript
// External trigger (e.g., founder uploaded document)
async function resumeWorkflow(suspensionId: string, resumeData: unknown) {
  // Atomic update: only succeeds if not already resumed
  const suspension = await db.suspensions.updateWhere(
    { id: suspensionId, resumedAt: null },
    { resumeData, resumedAt: new Date() },
  );

  if (!suspension) {
    throw new Error("Invalid or already resumed");
  }

  // Enqueue pointer, not payload (per kernel invariant #2)
  await queue.enqueue(suspension.resumeStep, { suspensionId });
}
```

The atomic `updateWhere` ensures only one caller wins the race. Concurrent resume attempts will fail the condition and return null.

The resume step loads checkpoint and resumeData from the suspension record:

```typescript
// Resume step handler
async function handleResumeJob(job: { suspensionId: string }) {
  const suspension = await db.suspensions.get(job.suspensionId);
  const input: ResumePayload = {
    checkpoint: suspension.checkpoint,
    resumeData: suspension.resumeData,
  };
  // Execute step with resume payload...
}
```

### Resume Step (Recommended)

Use a separate step for resume handling. This keeps each step focused and avoids input type unions:

```typescript
// Fresh execution step
const verifyClaim = defineStep({
  name: "verifyClaim",
  input: VerifyClaimInput,
  output: VerifyClaimOutput,

  async run({ input, adapters }) {
    const claim = await adapters.db.getClaim(input.claimId);

    if (needsDocumentation(claim)) {
      return {
        output: { status: "awaiting_input" },
        events: [],
        commands: [
          suspend({
            reason: "awaiting_documentation",
            checkpoint: { claimId: claim.id },
            resumeStep: "handleDocumentation",
          }),
        ],
      };
    }
    // ...
  },
});

// Separate resume handler
const handleDocumentation = defineStep({
  name: "handleDocumentation",
  input: z.object({
    checkpoint: z.object({ claimId: z.string() }),
    resumeData: z.object({ documentIds: z.array(z.string()) }),
  }),
  output: VerifyClaimOutput,

  async run({ input, adapters }) {
    const { checkpoint, resumeData } = input;
    const documents = await adapters.db.getDocuments(resumeData.documentIds);
    return verifyWithDocumentation(checkpoint.claimId, documents);
  },
});
```

This pattern keeps step inputs simple and explicit.

## Runner Contract

### 1. Atomic Persistence

When handling a `suspend` command, runners MUST persist the following atomically (single transaction):

* Step's output and events
* Suspension record
* Run status update to `suspended`

This follows the general command contract (SPEC-commands): commands are persisted atomically with output+events. Partial persistence leads to "state committed but no suspension record" or vice versa.

### 2. At Most One Blocking Command

A step result MUST NOT contain multiple blocking commands (`suspend` or `review`). Runners MUST fail the step execution if they find:

* Two or more `suspend` commands
* Two or more `review` commands
* Any combination of `suspend` and `review`

This is an orchestration error.

### 3. Suspend Discards Sibling Commands

When a `suspend` command is present, sibling commands are **discarded** (not deferred). The resumed step is responsible for emitting any needed commands.

```typescript
// The invoke command is discarded  – resumed step will emit new commands
commands: [
  suspend({ reason: "awaiting_input", checkpoint }),
  invoke("nextStep", data), // Discarded
];
```

Unlike `review`, sibling commands are deferred; `suspend` discards them.

### 4. Suspension Records Are Queryable

Runners MUST store suspensions in a queryable store (not ephemeral queue).

### 5. Resume Is Idempotent

Multiple resume attempts for the same suspension MUST be handled atomically:

* Use conditional update: `UPDATE ... WHERE resumed_at IS NULL RETURNING *`
* Only enqueue if the update affected a row
* Concurrent callers: exactly one wins, others get no-op or error

### 6. Checkpoint Immutability

The `checkpoint` field MUST NOT be modified after suspension. Resume data goes in a separate field. This preserves the exact state at suspension time for audit.

The suspension record follows append-only semantics: `resumeData` and `resumedAt` are set once on resume, never updated thereafter.

## Usage Patterns

A workflow may suspend at different points. Use `resumeStep` to route to the correct handler:

```typescript
// Suspend for documentation
commands: [
  suspend({
    reason: "awaiting_documentation",
    checkpoint,
    resumeStep: "handleDocumentation",
  }),
];

// Suspend for approval
commands: [
  suspend({
    reason: "awaiting_approval",
    checkpoint,
    resumeStep: "handleApproval",
  }),
];
```

## Anti-Patterns

### Polling for External State

```typescript
// BAD: busy-waiting in step
run: async (input, ctx) => {
  while (true) {
    const doc = await ctx.adapters.db.getDocument(input.docId);
    if (doc) return { output: { doc }, events: [] };
    await sleep(1000); // Blocks worker
  }
};

// GOOD: suspend and resume
run: async (input, ctx) => {
  const doc = await ctx.adapters.db.getDocument(input.docId);
  if (!doc) {
    return {
      output: {},
      events: [],
      commands: [
        suspend({
          reason: "awaiting_document",
          checkpoint: { docId: input.docId },
          resumeStep: "handleDocument",
        }),
      ],
    };
  }
  return { output: { doc }, events: [] };
};
```

### Mutable Checkpoint

```typescript
// BAD: modifying suspension record
await db.suspensions.update(id, {
  checkpoint: { ...suspension.checkpoint, newField: value }, // Mutates checkpoint
});

// GOOD: add to separate field
await db.suspensions.update(id, {
  resumeData: { documentIds: ["doc-1"] }, // Separate mutable field
});
```

---

---
url: /guides/storage.md
---
# Storage and State

Verist keeps state in your database. The kernel never holds state for you.

## The state layers

| Layer        | Description               |
| ------------ | ------------------------- |
| **computed** | Derived from step outputs |
| **overlay**  | Human overrides           |

Effective state is `{ ...computed, ...overlay }`. Overlay always wins.

## What to store per step

Every successful step should result in three writes:

| Write        | Purpose                    |
| ------------ | -------------------------- |
| **Output**   | Computed state update      |
| **Events**   | Audit log                  |
| **Commands** | For your runner to execute |

::: warning
If you skip any of them, you break replay guarantees.
:::

## Storage adapters

`@verist/storage` defines the `RunStore` contract and provides `createMemoryStore()` for dev and tests:

```ts
import { createMemoryStore, effectiveState } from "@verist/storage";

const store = createMemoryStore();
```

For production, use `@verist/storage-pg` which adds:

* Optimistic concurrency via Postgres
* Persistent computed + overlay state
* Audit event persistence
* Command outbox hooks

You can also implement the `RunStore` contract yourself.

## Minimal commit flow

```ts
const result = await run(step, input, ctx);

if (result.ok) {
  await store.commit({
    workflowId: result.value.workflowId,
    runId: result.value.runId,
    stepId: result.value.stepName,
    expectedVersion: currentVersion,
    output: result.value.output,
    events: result.value.events,
  });

  for (const cmd of result.value.commands ?? []) {
    await queue.enqueue(cmd);
  }
}
```

See [Reference Runner](./reference-runner) for a full loop with artifact capture.

## Concurrency and retries

Steps are idempotent by design, so retries are safe.

Use optimistic locking (version column or compare-and-swap) to prevent two workers from committing different outputs for the same run.

## Anti-patterns

* Writing state inside steps
* Storing computed state in memory between runs
* Dropping audit events to save space

---

---
url: /guides/suspend-resume.md
---
# Suspend and Resume

Suspension is for when a step needs to pause and wait for external data.

Examples:

* Waiting for a webhook callback
* Waiting for a human-provided document
* Waiting for a long-running external job

## How it works

1. A step returns a `suspend` command with a checkpoint
2. The runner stores that checkpoint and stops execution
3. When external data arrives, you resume the workflow

## Suspend vs review

| Command     | Waits for      | Sibling commands                       |
| ----------- | -------------- | -------------------------------------- |
| **Review**  | Human decision | Deferred                               |
| **Suspend** | External data  | Discarded (resume with fresh commands) |

## Checkpoint contents

Keep it small and serializable:

* IDs you need to resume
* Parameters required to continue
* Nothing private you can't store

## Resume flow

A resume handler (often a dedicated step) receives:

* The original checkpoint
* The resume data (webhook payload, uploaded file, etc.)

It runs like any other step and emits new commands.

## Design tips

* Use a **dedicated resume step** for clarity
* Store **resume reasons** for audit
* Don't keep mutable state in memory while suspended

---

---
url: /notes/dts-generation-plan.local.md
---
# TypeScript Declaration Files (.d.ts) Generation Plan

**Status**: Planning\
**Priority**: Medium (not blocking, but should be done before public release)\
**Context**: Russian feedback translation + research from TS docs, Bun docs, and reference repos

***

## Executive Summary

**Current State**: Using `types: ./src/index.ts` in package.json exports\
**Target State**: Generate proper `.d.ts` files in `dist/` directory\
**Why Change**: Better API control, standard compliance, future-proof for subpaths

The current setup **works** but isn't the gold standard. It's acceptable during active development but should be fixed before public release.

***

## Problem Analysis

### Current Configuration (packages/core/package.json)

```json
{
  "exports": {
    ".": {
      "types": "./src/index.ts",  // ⚠️ Points to source
      "bun": "./src/index.ts",
      "default": "./dist/index.js"
    }
  },
  "types": "./dist/index.d.ts"  // ⚠️ Conflicts with exports
}
```

### Issues with Current Approach

1. **Type Surface Leak**: TypeScript sees source files directly, potentially exposing internal types not meant to be public
2. **Non-Standard**: Most mature libraries use `.d.ts` files, not source `.ts` files
3. **Tooling Compatibility**: Some tools expect `.d.ts` files specifically
4. **Future-Proofing**: Will cause issues when adding:
   * Subpaths (e.g., `@verist/core/testing`)
   * `stripInternal` compiler option
   * Complex type transformations

### Why It Works Now

* Zero external users (active development)
* Bun-first environment
* Modern TypeScript with `moduleResolution: bundler`
* Reduced friction during rapid iteration

### Critical Insight from Feedback

The project emphasizes **curated root exports** and **protecting DX** by hiding low-level APIs. The current `types: ./src/index.ts` approach **undermines this philosophy** because:

* TS can still "see" types even if values aren't exported
* IDE autocomplete may suggest internal types
* Contradicts the "tight public surface" design goal

***

## Research Findings

### TypeScript Official Guidance

From [TypeScript Handbook](https://www.typescriptlang.org/docs/handbook/declaration-files/publishing.html):

* **Recommendation**: Bundle declarations with your package
* **Standard Pattern**:
  ```json
  {
    "main": "./lib/main.js",
    "types": "./lib/main.d.ts"
  }
  ```
* **For Modern Exports**:
  ```json
  {
    "exports": {
      ".": {
        "types": "./dist/index.d.ts",
        "default": "./dist/index.js"
      }
    }
  }
  ```

### Bun's Approach

Bun itself uses:

* `@types/bun` package with `index.d.ts` referencing `bun-types`
* Standard `.d.ts` files in published packages
* Clear separation of runtime and type definitions

### Generation Methods

1. **TypeScript Compiler (tsc)**
   * Most common, battle-tested
   * Options: `declaration: true`, `emitDeclarationOnly: true`
   * Can use `stripInternal` for private APIs

2. **Bun Build**
   * Currently NO native `.d.ts` generation (as of Bun 1.x)
   * Roadmap item, but not yet available

3. **tsup**
   * Wrapper around esbuild with dts generation
   * Popular in modern TS libraries
   * Command: `tsup src/index.ts --dts --format esm`

4. **API Extractor** (Microsoft)
   * Advanced: API reports, documentation, rollup
   * Overkill for current needs

***

## Recommended Solution

### Phase 1: Immediate (Current Development)

**Status Quo is Acceptable**

* ✅ Keep `types: ./src/index.ts` for now
* ✅ Maintain fast iteration speed
* ⚠️ **Do NOT add subpaths yet**
* ⚠️ Document this as temporary

**Rationale**: No external users, active refactoring, Bun-native workflow.

### Phase 2: Pre-Release (Before v1.0 or Public Announcement)

**Switch to Generated .d.ts Files**

#### Option A: Use tsc (Recommended)

**Pros**:

* Official TypeScript tooling
* Zero dependencies (already have typescript)
* Precise control over output
* Can use `stripInternal` for hiding private APIs

**Cons**:

* Separate build step
* Slightly slower than bundler-only approach

**Implementation**:

1. **Update tsconfig.json** (root):
   ```json
   {
     "compilerOptions": {
       "declaration": true,
       "emitDeclarationOnly": false,
       "declarationMap": true,
       "stripInternal": true
     }
   }
   ```

2. **Update package tsconfig.json**:
   ```json
   {
     "extends": "../../tsconfig.json",
     "compilerOptions": {
       "rootDir": "src",
       "outDir": "dist",
       "declaration": true,
       "emitDeclarationOnly": true,
       "declarationMap": true
     },
     "include": ["src"]
   }
   ```

3. **Update build script** (scripts/build.ts):
   ```typescript
   // After Bun.build(), run:
   await Bun.$`tsc -p ${join(pkgDir, 'tsconfig.json')}`;
   ```

4. **Update package.json exports**:
   ```json
   {
     "exports": {
       ".": {
         "types": "./dist/index.d.ts",
         "default": "./dist/index.js"
       }
     }
   }
   ```

#### Option B: Use tsup

**Pros**:

* Single tool for bundling + types
* Popular in modern ecosystem
* Simpler configuration

**Cons**:

* Additional dependency
* Less control over type generation
* May conflict with existing Bun build setup

**Implementation**:

```bash
bun add -D tsup
```

```typescript
// Update build.ts to use tsup instead of Bun.build
import { build } from 'tsup';

await build({
  entry: ['src/index.ts'],
  format: ['esm'],
  dts: true,
  sourcemap: true,
  outDir: 'dist',
});
```

### Phase 3: Future Enhancements

Once `.d.ts` generation is in place:

1. **Add Subpaths**:
   ```json
   "exports": {
     ".": { "types": "./dist/index.d.ts", "default": "./dist/index.js" },
     "./testing": { "types": "./dist/testing.d.ts", "default": "./dist/testing.js" }
   }
   ```

2. **API Documentation**:
   * Consider API Extractor for API reports
   * Generate markdown docs from JSDoc comments

3. **Type Testing**:
   * Add `@ts-expect-error` tests for type-level behavior
   * Consider `tsd` or `expect-type` for type assertions

***

## Implementation Checklist

### Pre-Release Tasks

* \[ ] **Decision**: Choose tsc (recommended) or tsup
* \[ ] Update root `tsconfig.json` with declaration settings
* \[ ] Update per-package `tsconfig.json` files
* \[ ] Modify `scripts/build.ts` to generate `.d.ts` files
* \[ ] Update all `package.json` exports to point to `./dist/*.d.ts`
* \[ ] Test that types resolve correctly in consuming projects
* \[ ] Verify IDE autocomplete works as expected
* \[ ] Add `.d.ts` generation to CI/CD pipeline
* \[ ] Update documentation about type exports

### Optional Enhancements

* \[ ] Add `stripInternal` to hide internal APIs
* \[ ] Use `@internal` JSDoc tags for internal-only exports
* \[ ] Generate API documentation from types
* \[ ] Set up type-testing framework

***

## Migration Strategy

### Step-by-Step Rollout

1. **Test in One Package First**: Start with `@verist/core`
2. **Verify Consumer Experience**: Test in a separate test project
3. **Roll Out to All Packages**: Apply to entire monorepo
4. **Update Documentation**: Explain type exports in README

### Verification Steps

```bash
# 1. Build with types
bun run build

# 2. Check generated files
ls -la packages/core/dist/
# Should see: index.js, index.d.ts, index.d.ts.map

# 3. Test type resolution
mkdir test-consumer && cd test-consumer
bun init -y
bun add ../packages/core
# Create index.ts that imports from @verist/core
# Check that autocomplete and type-checking work
```

### Rollback Plan

If issues arise:

1. Revert package.json exports to `./src/index.ts`
2. Keep generated `.d.ts` in dist but don't reference
3. Debug issue separately without blocking development

***

## Technical Details

### Current Build Process

From `scripts/build.ts`:

```typescript
await Bun.build({
  entrypoints,
  outdir: distDir,
  format: "esm",
  target: "node",
  packages: "external",
  sourcemap: "linked",
});
```

**Note**: Bun.build() does NOT generate `.d.ts` files (as of Bun 1.x).

### Proposed Addition (Option A: tsc)

```typescript
// After successful Bun.build()
if (buildResult.success) {
  // Generate type declarations
  const tscResult = await Bun.$`tsc -p ${join(pkgDir, 'tsconfig.json')}`.quiet();
  
  if (tscResult.exitCode !== 0) {
    console.error(`✗ ${pkg} (tsc failed)`);
    console.error(tscResult.stderr.toString());
    process.exit(1);
  }
}
```

### Root tsconfig.json Changes

Current config has `noEmit: true` - this is correct for the root config.

Per-package configs should override this:

```json
{
  "extends": "../../tsconfig.json",
  "compilerOptions": {
    "noEmit": false,  // Override root
    "declaration": true,
    "emitDeclarationOnly": true
  }
}
```

***

## Performance Considerations

### Build Time Impact

* **Current**: ~0.5s per package (Bun.build only)
* **With tsc**: Estimated +0.3-0.5s per package
* **Total Impact**: ~3-5s additional for entire monorepo (10 packages)

**Mitigation**:

* Only generate types during `bun run build` (pre-publish)
* Dev workflow continues using source files
* CI/CD pipeline runs full build with types

### Development Experience

**No Impact** on day-to-day development:

* TypeScript checking already runs via IDE/editor
* `bun run check` (tsc in check mode) unchanged
* Hot reload and testing use source files directly

***

## Standards & Best Practices

### Package.json Exports Best Practice

```json
{
  "name": "@verist/core",
  "version": "1.0.0",
  "type": "module",
  "sideEffects": false,
  
  // Legacy fields for older tools
  "main": "./dist/index.js",
  "types": "./dist/index.d.ts",
  
  // Modern exports map
  "exports": {
    ".": {
      "types": "./dist/index.d.ts",
      "import": "./dist/index.js",
      "default": "./dist/index.js"
    },
    "./package.json": "./package.json"
  },
  
  "files": [
    "dist",
    "README.md",
    "LICENSE"
  ]
}
```

**Note**: Remove `src` from `files` array once using `.d.ts` exclusively.

### TypeScript Configuration Best Practice

```json
{
  "compilerOptions": {
    // Type generation
    "declaration": true,
    "declarationMap": true,
    "emitDeclarationOnly": true,
    
    // Strip internal APIs
    "stripInternal": true,
    
    // Output structure
    "rootDir": "src",
    "outDir": "dist",
    
    // Module resolution (already correct)
    "module": "Preserve",
    "moduleResolution": "bundler"
  }
}
```

***

## FAQ

### Q: Why not use Bun.build() for types?

**A**: Bun doesn't support `.d.ts` generation yet (as of Bun 1.x). It's on the roadmap but not available.

### Q: Should we commit .d.ts files to Git?

**A**: No. Generate during build/publish. Add to `.gitignore`:

```
packages/*/dist/
```

### Q: What about declaration maps?

**A**: Yes, include them:

* Helps with "Go to Definition" jumping to source
* Useful for debugging
* Minimal size overhead

### Q: How do we hide internal APIs?

**A**: Three approaches:

1. Don't export from `index.ts` (already doing this)
2. Use `@internal` JSDoc tag
3. Enable `stripInternal` compiler option (recommended)

### Q: Will this break existing consumers?

**A**: No, if done correctly:

* Types remain compatible
* Only the file path changes (`.ts` → `.d.ts`)
* Modern TypeScript handles both

### Q: Do we need separate .d.ts for each subpath?

**A**: Yes, when you add subpaths in the future:

```
dist/
  index.d.ts       # Main entrypoint
  testing.d.ts     # Testing utilities
  internal.d.ts    # Internal APIs (if exposed)
```

***

## Confidence Levels

| Statement | Confidence |
|-----------|-----------|
| Current setup works but isn't optimal | **0.9** |
| Should switch to .d.ts before v1.0 | **0.85** |
| tsc is the best tool for this project | **0.8** |
| No immediate rush to implement | **0.85** |
| Will prevent future issues | **0.9** |

***

## References

* [TypeScript: Publishing Declaration Files](https://www.typescriptlang.org/docs/handbook/declaration-files/publishing.html)
* [TypeScript: Declaration Maps](https://www.typescriptlang.org/docs/handbook/release-notes/typescript-2-9.html#declarationmap)
* [Bun Bundler Docs](https://bun.sh/docs/bundler)
* [Node.js Package Entry Points](https://nodejs.org/api/packages.html#package-entry-points)
* [Bun GitHub: @types/bun package.json](https://github.com/oven-sh/bun/blob/main/packages/@types/bun/package.json)

***

## Next Steps

1. **Review this plan** with team/stakeholders
2. **Choose timing**: Now vs. pre-release
3. **Select tool**: tsc (recommended) vs. tsup
4. **Create implementation ticket** if approved
5. **Test in isolated branch** before merging

***

**Conclusion**: The current setup is acceptable for now but should be upgraded to proper `.d.ts` generation before public release. This change is straightforward, low-risk, and aligns with the project's emphasis on controlled public APIs and professional engineering practices.

---

---
url: /notes/next-steps.local.md
---
# Verist: Next Steps (Impact Plan)

## Focus (from verist-ops problems)

* Tier 1 wedge: Structured Output Regression
* Tier 2 expansion: Safe Recompute (overrides preserved)
* Tier 3 strategic: Decision Audit + Decision Backtesting

***

## Shipped

* `verist init` scaffolds a deterministic step + sample inputs (no API keys needed).
* `verist capture --sample N --seed S` for deterministic sampling.
* `verist capture --meta key=value` persisted in baseline envelopes.
* `verist test --format json|markdown` with exit codes (0 = clean, 1 = diffs, 2 = infra).
* Anthropic adapter with normalized `llm-input` / `llm-output` artifacts.
* OpenAI adapter supports `baseURL` (Ollama, Azure, Fireworks, etc.).
* Cross-provider normalized artifacts hash identically for equivalent content.
* `examples/prompt-diff/quickstart.ts` – end-to-end LLM regression demo.
* README quickstart covers both zero-friction (regex) and LLM paths.
* CI integration guide with GitHub Actions examples.
* Observational schema validation in recompute; `RecomputeResult.status` classifies diffs.
* In-memory `RunStore` + overlay recompute example.
* `defineExtractionStep()` shorthand – eliminates schema duplication and manual `onArtifact`.
* `fail()` for structured step errors with `retryable` flag.
* `StepResult.artifacts` – automatic artifact collection without callbacks.
* `ctx.emitEvent()` – audit events without manual plumbing.
* Flattened `StepResult` – `result.value.output` instead of `result.value.output.delta`.
* 27 DX issues resolved from sandbox testing (see `../verist-sandbox/issues.md`).

***

## Current State

The Tier 1 API is stable at v0.0.5. A user can go from `verist init` to first diff
without API keys, and from `examples/prompt-diff` to a real LLM regression diff with one
command. CI output formats are stable. The last 6 PRs were DX-driven refinements – the
API surface feels settled.

The gap is no longer tooling – it's validation and distribution. No external team has
used Verist in production. The thesis (structured output regression is acute pain) is
well-reasoned but unproven with paying customers.

Three open DX issues remain (adapter annotation for non-LLM steps, `diff()` discoverability,
`createSnapshotFromResult` naming) – none are blockers for adoption.

***

## Top 5 Deliverables (Adoption-First)

### 1. README as Adoption Funnel (P0)

**Why first:** The README is the front door. A prospect who can't self-qualify in 60 seconds
bounces. Right now it shows capabilities but doesn't help someone decide "is this for me?"

Scope:

* Add a "Good fit / Not a fit" checklist above the quickstart.
* Funnel to one adoption path: `init → capture → test` (the Tier 1 wedge).
* Lead with the problem ("You updated your extraction prompt. What broke?"), not the solution.
* Cut secondary content (Tier 2/3 features, architecture details) to linked pages.
* Ensure the quickstart terminal output is visible and compelling (the "aha" diff).

Acceptance:

* A new user can self-qualify before installing.
* The README tells one story with one call to action.

***

### 2. Polish `verist init` → First Diff (P0)

**Why second:** The zero-friction path IS the wedge. If `verist init` → `verist test` doesn't
deliver a clear "aha" in under 60 seconds, the README promise falls flat.

Scope:

* Audit the `init` scaffolding end-to-end: install, init, capture baseline, break, diff.
* Ensure the generated step + inputs produce a meaningful, easy-to-read diff.
* The scaffolded project should run `verist test` out of the box with zero edits.
* Terminal output should be self-explanatory (no need to read docs to understand the diff).
* Consider: can `init` scaffold a `verist.config.ts` so `capture` and `test` work immediately?

Acceptance:

* `npx verist init && verist capture && verist test` produces a clear regression diff.
* A first-time user understands what happened without reading docs.

***

### 3. Problem-Framing Content (P1)

**Why third:** Distribution is the bottleneck, not features. The right engineers need to
encounter the problem framing before they encounter the tool.

Scope:

* Blog post / article: "You updated your extraction prompt. What broke?"
* Frame the problem (silent regressions in structured LLM output), not the tool.
* Include a concrete before/after: prompt change → field disappears → downstream breaks.
* End with the solution pattern (capture → recompute → diff) and link to Verist.
* Short demo GIF: capture baseline → tweak prompt → see diff in terminal.

Acceptance:

* One published piece that frames the problem clearly.
* Shareable on HN, AI engineering communities, Twitter/X.

***

### 4. Copyable CI Workflow Template (P1)

**Why fourth:** Bridges "I tried it locally" → "it's in my pipeline." CI integration is
the stickiness mechanism – once diffs run on every PR, Verist becomes infrastructure.

Scope:

* Working `.github/workflows/verist.yml` in `examples/ci/`.
* Handles: checkout, install, run `verist test --format markdown`, post PR comment.
* Works with committed baselines (no capture step in CI – baselines are checked in).
* Document the two patterns: baselines-in-repo vs baselines-from-capture.
* Exit codes already work (0 = clean, 1 = diffs, 2 = infra) – template should use them.

Acceptance:

* Copy-paste into any repo with `verist.config.ts` + committed baselines → works.
* PR comment shows markdown diff table on regression.

***

### 5. Safe Recompute End-to-End Example (P1)

**Why fifth:** This is the Tier 2 hook – the reason teams stay after adopting for regression
testing. The `examples/overlay-recompute/` exists but doesn't tell a compelling story yet.

Scope:

* Rework the overlay-recompute example into a clear narrative:
  1. AI extracts a risk assessment from a document.
  2. Human reviewer corrects one field (e.g., risk level: "medium" → "high").
  3. Model upgrades. Recompute runs.
  4. AI output changes, but the human correction is preserved in effective state.
* Show the three-layer state model visually in terminal output.
* Make it runnable without API keys (deterministic step, like the init scaffolding).

Acceptance:

* Running example that shows human corrections surviving a recompute.
* Clear before/after demonstrating the problem (corrections lost) and solution (preserved).

***

## Priorities

| Priority | Deliverable                  | Impact                                       |
| -------- | ---------------------------- | -------------------------------------------- |
| **P0**   | README as adoption funnel    | Front door – self-qualification in 60 seconds |
| **P0**   | Polish init → first diff     | Delivers the "aha" that the README promises   |
| **P1**   | Problem-framing content      | Gets the problem in front of the right people |
| **P1**   | Copyable CI workflow         | Stickiness – diffs on every PR                |
| **P1**   | Safe recompute example       | Tier 2 hook – why teams stay                  |

***

## What Not to Build (Yet)

* Adapter step-level declaration – nice DX but affects few users (non-LLM adapters only)
* Domain primitives (claims, evidence, verdicts) – user space, not kernel
* Review queues – enterprise feature, not adoption driver
* Dashboard – premature before paying users
* Additional storage backends – Postgres is enough
* Backtesting windows – Tier 3, no demand yet (YAGNI)
* More LLM adapters – two providers cover the majority
* `createSnapshotFromResult` rename – less important now that `recompute(StepResult)` exists

***

## Immediate Next Steps

1. Rewrite README with fit/no-fit checklist and single adoption funnel
2. Audit and polish `verist init` end-to-end flow
3. Draft problem-framing blog post outline

---

---
url: /why-verist.md
---
# Why Verist

Verist is a deterministic, audit-first workflow kernel for AI systems. It gives you replay, recompute, and diffs for AI decisions – so you can upgrade models and prompts without guessing what will break.

## The trust gap

Modern AI workflows create failures you cannot reproduce:

* Decisions change silently with model or prompt updates
* Logs show what happened, but not why
* Human corrections get overwritten by recomputation
* Agent frameworks introduce hidden state and emergent control flow

Acceptable for demos. Not for review-heavy or high-impact systems.

## What Verist gives you

| Capability                | Description                                                                                              |
| ------------------------- | -------------------------------------------------------------------------------------------------------- |
| **Replay + diff**         | Capture artifacts during a run, replay exactly, or recompute and review the diff before shipping         |
| **Database-backed state** | All state lives in your database. Steps return outputs; nothing important is implicit or in memory       |
| **Audit-first**           | Every step produces structured audit events. The evidence trail is part of the API, not optional logging |
| **Human authority**       | Human overrides are first-class and survive recomputation                                                |
| **Minimal kernel**        | Small, explicit library that fits under your existing runner, queue, and UI                              |

## How Verist differs from agent frameworks

Agent frameworks optimize for autonomy and speed. Verist optimizes for control and accountability.

| Dimension       | Agent frameworks        | Verist                      |
| --------------- | ----------------------- | --------------------------- |
| Primary goal    | Autonomy, speed         | Trust, correctness          |
| Control flow    | Often implicit          | Explicit, code-defined      |
| State           | In-memory + checkpoints | Database as source of truth |
| Replay          | Best-effort             | Exact, artifact-based       |
| Auditability    | Optional                | Core primitive              |
| Human overrides | Fragile                 | Preserved by design         |

Verist can sit underneath an agent framework when you need guarantees, or replace ad-hoc scripts when a workflow becomes critical.

## When Verist fits

Use Verist when:

* AI decisions must be reproducible and explainable
* Model/prompt upgrades need reviewable diffs
* Human review is part of the workflow
* You are accountable to audits, compliance, or users

## When to skip Verist

* Internal chatbots without audit or replay requirements
* Fast prototypes and throwaway scripts
* Exploratory agents where outcomes are intentionally non-deterministic
* One-off scripts where you don't need diff or long-term replay

If you want speed over correctness, Verist will feel heavy.

---

---
url: /guides/workflows.md
---
# Workflows

A workflow is a named bundle of steps with a version. It doesn't run by itself – your runner does.

## Why use a workflow

| Benefit          | Description                              |
| ---------------- | ---------------------------------------- |
| Stable identity  | Replay and audit tied to a consistent ID |
| Versioning       | Track model/prompt changes               |
| Type-safe wiring | Commands are validated at compile time   |

If you only have one step, skip this. The moment you compose steps, it's worth it.

## When to use what

| Use case                | Use       |
| ----------------------- | --------- |
| One-off call            | Bare step |
| Linear flow             | Pipeline  |
| Branching or fan-out    | Commands  |
| Audit + stable identity | Workflow  |

## Define a workflow

```ts
import { defineWorkflow } from "verist";

const workflow = defineWorkflow({
  name: "verify-document",
  version: "1.2.0",
  steps: { extract, verify, score },
});
```

## Type-safe commands

Use `workflow.invoke()` for type-checked step invocation:

```ts
commands: [workflow.invoke("verify", { documentId })],
```

This catches wiring errors before runtime.

## Running a workflow step

A workflow doesn't replace `run()`. You still call `run()` with a step, passing identity from the workflow:

```ts
const result = await run(
  workflow.getStep("extract"),
  { documentId },
  {
    adapters,
    workflowId: workflow.name,
    workflowVersion: workflow.version,
    runId: "run-42",
  },
);
```

## Versioning

* **Bump versions** when a model or prompt change could affect output
* **Keep versions stable** across deploys if behavior is unchanged
* **Use semver** if that matches your release process

## Runner obligations by command

Commands are data. Your runner interprets them. Here's what each command expects:

| Command   | Runner must                          | Runner must not                       |
| --------- | ------------------------------------ | ------------------------------------- |
| `invoke`  | Enqueue the target step              | Execute inline without queueing       |
| `fanout`  | Enqueue all items, track completion  | Assume ordering or atomicity          |
| `review`  | Block siblings, await human decision | Persist output before approval        |
| `suspend` | Store checkpoint, stop execution     | Keep sibling commands (they're stale) |
| `emit`    | Dispatch to event bus or webhook     | Treat as synchronous call             |

::: warning What breaks if you ignore commands

* **Skipped `invoke`** – workflow stalls, dependent steps never run
* **Skipped `review`** – changes ship without human approval
* **Skipped `suspend`** – long-running workflows can't resume
* **Skipped `emit`** – external systems miss notifications
  :::

## When to skip workflows

If you just need one isolated step (e.g., a single summarizer), skip workflows and use the default identity. You can add a workflow later without rewriting your steps.

---

---
url: /guides/first-step.md
---
# Your First Step

::: tip Runnable example
Try the [Prompt Diff Quickstart](https://github.com/verist-ai/verist/tree/main/examples/prompt-diff) — a single file you can run to see replay + diff in action.
:::

Wrap one function, get replay and diff. No workflows, no queues, no complexity.

## Install

::: code-group

```bash [bun]
bun add verist zod
```

```bash [npm]
npm install verist zod
```

:::

## Define a step

A step is a function with typed input/output and audit events.

```ts
import { z } from "zod";
import { defineStep, run } from "verist";

const verifyDocument = defineStep({
  name: "verify-document",
  input: z.object({ docId: z.string(), text: z.string() }),
  output: z.object({
    verdict: z.enum(["accept", "reject"]),
    confidence: z.number(),
  }),
  run: async (input, ctx) => {
    const verdict = await ctx.adapters.llm.verify(input.text);
    return {
      output: { verdict, confidence: 0.84 },
      events: [{ type: "document_verified", payload: { docId: input.docId } }],
    };
  },
});
```

## Run it

```ts
const result = await run(
  verifyDocument,
  { docId: "doc-1", text: "Invoice #1042 for ACME Corp." },
  {
    adapters: {
      llm: {
        verify: async (text) => (text.includes("fraud") ? "reject" : "accept"),
      },
    },
  },
);

if (result.ok) {
  console.log(result.value.output);
  // { verdict: "accept", confidence: 0.84 }
}
```

## Add replay + diff

Capture the output as a snapshot, then recompute with a new model to see what changed:

```ts
import { createSnapshotFromResult, recompute, formatDiff } from "verist";

if (!result.ok) throw new Error(result.error.message);

const snapshot = await createSnapshotFromResult(result.value);

// Recompute with a different adapter
const recomputeResult = await recompute(snapshot, verifyDocument, {
  adapters: { llm: { verify: async () => "reject" } }, // [!code highlight]
});

if (recomputeResult.ok) {
  const { status, outputDiff } = recomputeResult.value;
  console.log("Status:", status); // "clean" | "value_changed" | "schema_violation"
  if (outputDiff && !outputDiff.equal) {
    console.log(formatDiff(outputDiff));
    // Shows exactly which fields changed
  }
}
```

## What you get

| Feature          | Description                                      |
| ---------------- | ------------------------------------------------ |
| **Typed I/O**    | Zod schemas validate input and output            |
| **Audit events** | Structured records for every execution           |
| **Replay**       | Reproduce past runs from stored artifacts        |
| **Diff**         | See exactly what changes with new models/prompts |

## When to add explicit identity

`run()` uses defaults for workflow identity (`workflowId` = step name, `version` = "0.0.0"). Pass explicit values when you need:

* Stable workflow IDs across deployments
* Version tracking across prompt/model changes
* Multi-step workflows with typed commands
* State persistence with `@verist/storage-pg`

```ts
const result = await run(verifyDocument, input, {
  adapters,
  workflowId: "verify-document",
  workflowVersion: "1.0.0",
  runId: crypto.randomUUID(),
});
```

Most teams never need overlays or contradiction handling. Verist is useful even if you stop at replay + diff.