Cascade routes any task through a cheap-to-expensive layer cascade: deterministic Python, symbolic graph reasoning, AST-validated codegen, failure-feedback pattern memory, a governed CLI subprocess, and finally an LLM provider. Every step is gated against a 10-predicate safety conjunction. Every output is stamped into an HMAC-chained receipt log. Every successful LLM call trains the local layers so the next similar request never reaches the LLM at all.
If you use Claude Code, Codex, Cursor, or Ollama, Cascade is the flight recorder and guardrail layer for your AI sessions. Risk classification, local-only routing for private IP, and hash-chained receipts for everything the AI touches.
Non-technical owners who need to keep staff AI use within policy. Leasing offices, cannabis retail, repair shops. Pick a pack, connect your AI accounts, see everything in the dashboard. $49/mo.
LLM vendors win when usage grows. Cascade wins when usage shrinks.
Every LLM call trains the local layers to make the next call unnecessary.
Underneath the cost model is a physics model. Entropy detects disorder. Coherence measures synchronization. Free-energy cost decides whether an action is worth executing. Signal regimes read the external environment. The receipt chain is the immutable ledger of what computation actually happened, and at what cost. This is not a metaphor. These are the signals the code computes.
A task entering Cascade is classified by the pre-dispatch router, checked against the 10-gate predicate, then dispatched to the cheapest layer that can handle it. If that layer fails, it escalates. The LLM (L6) is the last resort, not the default. Each successful LLM call is converted into a deterministic pattern stored in L4, so the next similar request never reaches the LLM at all.
| Stage | Name | Latency | LLM tokens | When used |
|---|---|---|---|---|
| R | Pre-Dispatch Router | ~0 ms | 0 | Every task: classify, route, and select a provider before any layer runs |
| L1 | Multi-Op Code Emitter | ~0.05 ms | 0 | High-frequency exact-match requests |
| L2 | Symbolic Graph Reasoner | ~42 ms | 0 | Structured queries resolved by 22-edge graph traversal |
| L3 | Validated Python Generator | ~94 ms | 0 | Code generation via an AST-validated 10-stage pipeline; no model call |
| L4 | Pattern Memory / Learner | ~0.4–13 s | 0 | Previously solved requests; cost decay lives here |
| L5 | Governed CLI Orchestrator | ~10–60 s | 0 | Real-world CLI execution (gh, gcloud, terraform, docker; 23 adapters) |
| L6 | LLM Provider | varies | paid | Novel requests only; last resort. Every success feeds back into L4. |
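The escalation described above can be sketched in a few lines. This is a minimal illustration, not Cascade's implementation: it assumes each layer is a function that returns an answer or `None`, and the names `run_cascade`, `l1_exact`, `l4_memory`, `l6_llm`, and `patterns` are all hypothetical.

```python
from typing import Callable, Optional

# Hypothetical layer signature: return an answer, or None if the layer cannot resolve the task.
Layer = Callable[[str], Optional[str]]

def run_cascade(task: str, layers: list[tuple[str, Layer]]) -> tuple[str, str]:
    """Try layers cheapest-first and return (layer_name, answer) from the first that resolves."""
    for name, layer in layers:
        answer = layer(task)
        if answer is not None:
            return name, answer
    raise RuntimeError("no layer resolved the task")

# Toy pattern store standing in for L4 (assumption: exact-match keys).
patterns: dict[str, str] = {}

def l1_exact(task: str) -> Optional[str]:
    # Stand-in for the exact-match emitter.
    return {"ping": "pong"}.get(task)

def l4_memory(task: str) -> Optional[str]:
    return patterns.get(task)

def l6_llm(task: str) -> Optional[str]:
    answer = f"llm-answer({task})"  # placeholder for a paid provider call
    patterns[task] = answer         # feed the success back into the L4 stand-in
    return answer

LAYERS = [("L1", l1_exact), ("L4", l4_memory), ("L6", l6_llm)]
```

Calling `run_cascade("novel", LAYERS)` twice resolves at `L6` the first time and at `L4` the second, which is the cost-decay behavior in miniature.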
Before any layer runs, a zero-cost deterministic classifier reads the task and emits a routing plan: task type (codegen / system_op / reasoning / orchestration / trivial), which layer to start at, which provider to use at L6, and whether to decompose into sub-steps. A fast_path verdict restricts execution to L1 only. A deep_review verdict bypasses the cheap layers and forces L6 directly. A deny verdict blocks the task before the gate cycle runs, so budget-exhausted requests never consume gate compute. Every routing decision is recorded on the receipt alongside the task result.
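To make the routing plan concrete, here is a hedged sketch of what such a classifier could look like. The keyword rules, the `RoutingPlan` fields, and the `route` function are illustrative assumptions; only the verdict names, task types, and provider defaults come from the text above.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class RoutingPlan:
    task_type: str
    verdict: str      # fast_path | standard_path | deep_review | deny
    start_layer: str  # first layer the dispatcher should try
    provider: str     # provider to use if the LLM layer is reached

def route(task: str, budget_remaining: int) -> RoutingPlan:
    """Keyword-based stand-in for the deterministic classifier (illustrative rules only)."""
    text = task.lower()
    if budget_remaining <= 0:
        # Denied before any gate or layer runs.
        return RoutingPlan("unknown", "deny", "none", "none")
    if len(text.split()) <= 3:
        return RoutingPlan("trivial", "fast_path", "L1", "ollama")
    if any(word in text for word in ("write", "implement", "refactor")):
        return RoutingPlan("codegen", "standard_path", "L1", "ollama")
    if any(word in text for word in ("why", "prove", "design")):
        return RoutingPlan("reasoning", "deep_review", "L6", "anthropic")
    return RoutingPlan("reasoning", "standard_path", "L1", "anthropic")
```

Because the function is pure and keyed only on the task text and remaining budget, the same input always yields the same plan, which is what lets the plan be recorded on the receipt and replayed later.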
Deterministic Python emitter for high-frequency well-formed requests. No model invocation. Pure lookup-and-render. The cheapest possible answer the system can give.
A 22-edge symbolic graph for structured queries that exceed pure emission but resolve via graph traversal. Still deterministic; still no LLM cost.
AST-validated 10-stage pipeline for code generation tasks. Templates assemble syntax-correct Python; the AST validator gates malformed output before any execution. Production-grade governed codegen, no LLM call.
Failure-feedback pattern memory. Each successful LLM completion at L6 is converted into a parameterized template and indexed here. Subsequent similar requests hit L4 instead of L6: the same answer at a fraction of the cost. This is where cost decay lives. As L4 pattern coverage grows, L6 traffic share declines; the cost dashboard makes this visible in real time as L4 hit rate vs. L6 dispatch rate over time.
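One way to picture "converted into a parameterized template" is the toy below. It is a sketch under strong assumptions: only integers are treated as parameters, matching is exact on the remaining text, and the `PatternMemory` class is hypothetical rather than Cascade's learner.

```python
import re
from typing import Optional

class PatternMemory:
    """Toy pattern store: numbers in a request become slots so similar requests share a key."""

    NUMBER = re.compile(r"\d+")

    def __init__(self) -> None:
        self._templates: dict[str, str] = {}

    def _key(self, request: str) -> tuple[str, list[str]]:
        # "first 5 open PRs" -> ("first {} open PRs", ["5"])
        return self.NUMBER.sub("{}", request), self.NUMBER.findall(request)

    def learn(self, request: str, completion: str) -> None:
        """Store a successful completion with the request's numbers replaced by slots."""
        key, values = self._key(request)
        template = completion
        for index, value in enumerate(values):
            template = template.replace(value, f"{{{index}}}")
        self._templates[key] = template

    def lookup(self, request: str) -> Optional[str]:
        """Return the rendered template for a similar request, or None on a miss."""
        key, values = self._key(request)
        template = self._templates.get(key)
        return None if template is None else template.format(*values)
```

After `learn("list the first 5 open PRs", "gh pr list --limit 5 --state open")`, a lookup for "list the first 20 open PRs" renders the stored template with 20 in place of 5. A real learner would need safer slot extraction (this toy breaks on completions containing braces or overlapping numbers) and fuzzier matching.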
Cascade's structural differentiator. Any command-line tool on the host becomes a governed worker with hash-chained receipts. Adapters for gh, gcloud, terraform, docker, kubectl, ollama, claude, codex, gemini, python, node, npm, pip, and others are registered out of the box (23 in total); new CLIs are added by JSON edit, with no source change. Destructive commands (terraform apply, git push --force) are tier-gated; globally forbidden patterns (rm -rf /) are blocked unconditionally; blocked commands are still receipted so compliance can prove the system refused.
Anthropic, OpenAI, or local Ollama. The provider is auto-selected by the router based on task type (Ollama for trivial and codegen, Anthropic for novel reasoning). Only invoked when L1–L5 have not resolved the task. Every L6 success is captured back into L4 pattern memory, so this layer's traffic share decays over the system's lifetime.
Every request is evaluated against a 10-predicate gate before dispatch. Every dispatch is recorded as a hash-chained receipt. A blocked task is still receipted, so compliance can prove the system refused.
A task that fails any gate never reaches a layer; the denial is receipted with the exact failing gate ID. All ten run before dispatch:
G1 Size: input within token budget
G2 Safety: content safety classifier pass
G3 Jailbreak: injection / prompt-override patterns blocked
G4 Credentials: PII / secrets / key patterns absent
G5 Tier: request risk tier ≤ caller's authorized tier
G6 Entropy: spectral disorder score below ceiling
G7 Budget: request quota not exhausted
G8 Authorization: caller session credential valid
G9 Coherence: cross-gate agreement above floor
G10 Policy: policy DSL rules satisfied
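The gate list above is a conjunction: a task is dispatched only if every predicate holds. A minimal sketch follows, assuming a request is a plain dict and showing only three of the ten gates. The field names, thresholds, and `evaluate_gates` are hypothetical.

```python
from typing import Callable

Gate = Callable[[dict], bool]

# Three of the ten gates with made-up thresholds; the others would follow the same shape.
GATES: list[tuple[str, Gate]] = [
    ("G1", lambda req: req["tokens"] <= 4096),                   # Size
    ("G5", lambda req: req["risk_tier"] <= req["caller_tier"]),  # Tier
    ("G7", lambda req: req["quota_remaining"] > 0),              # Budget
]

def evaluate_gates(request: dict) -> tuple[bool, list[str]]:
    """Run every gate and return (allowed, failing_gate_ids)."""
    failed = [gate_id for gate_id, predicate in GATES if not predicate(request)]
    return (not failed, failed)
```

Running all gates instead of stopping at the first failure costs a little more compute, but the denial receipt can then name every failing gate ID.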
A second-layer policy that classifies CLI invocations against a global forbid list and a destructive-command tier table. Every command is matched against the whitelist before subprocess execution.
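The command policy can be sketched as a three-way check: forbid list, allowlist, then tier table. The tables below reuse the examples named in this document, while the tier names other than MEDIUM, the return strings, and `classify_command` are assumptions made for illustration.

```python
import shlex

FORBIDDEN = ["rm -rf /"]                                    # blocked unconditionally
DESTRUCTIVE = {"terraform apply": "HIGH", "git push --force": "HIGH"}
ALLOWED_PREFIXES = {"gh", "git", "terraform", "docker"}     # stand-in for the adapter registry
TIER_RANK = {"LOW": 0, "MEDIUM": 1, "HIGH": 2}

def _starts_with(tokens: list[str], pattern: str) -> bool:
    """True when the command's leading tokens equal the pattern's tokens."""
    parts = pattern.split()
    return tokens[:len(parts)] == parts

def classify_command(command: str, caller_tier: str) -> str:
    """Classify a CLI invocation before any subprocess is spawned."""
    tokens = shlex.split(command)
    if any(_starts_with(tokens, pattern) for pattern in FORBIDDEN):
        return "blocked:forbidden"
    if not tokens or tokens[0] not in ALLOWED_PREFIXES:
        return "blocked:not_allowlisted"
    for pattern, required in DESTRUCTIVE.items():
        if _starts_with(tokens, pattern) and TIER_RANK[caller_tier] < TIER_RANK[required]:
            return "blocked:tier"
    return "allowed"
```

Matching on tokens instead of raw substrings keeps `rm -rf /tmp/build` from being confused with `rm -rf /`; in this sketch it is still blocked, but for the more accurate reason that `rm` is not allowlisted.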
SHA-256 chain link plus HMAC tag per entry. Tamper-evident, replayable. Receipt verification is a single-pass function over the log file. Auditors can prove the chain has not been edited since write.
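A hash chain with a per-entry HMAC tag can be built from the Python standard library alone. The sketch below shows the general technique; the entry layout (`prev`, `payload`, `hash`, `tag`), the genesis value, and the function names are assumptions, not Cascade's on-disk format.

```python
import hashlib
import hmac
import json

GENESIS = "0" * 64  # assumed placeholder hash for the first entry

def _body(prev_hash: str, payload: dict) -> bytes:
    # Canonical serialization so the same entry always hashes the same way.
    return json.dumps({"prev": prev_hash, "payload": payload}, sort_keys=True).encode()

def append_receipt(log: list[dict], payload: dict, key: bytes) -> dict:
    """Append an entry linked to the previous entry's hash and tagged with an HMAC."""
    prev_hash = log[-1]["hash"] if log else GENESIS
    body = _body(prev_hash, payload)
    entry = {
        "prev": prev_hash,
        "payload": payload,
        "hash": hashlib.sha256(body).hexdigest(),
        "tag": hmac.new(key, body, hashlib.sha256).hexdigest(),
    }
    log.append(entry)
    return entry

def verify_chain(log: list[dict], key: bytes) -> bool:
    """Single pass over the log: check each link, each hash, and each HMAC tag."""
    prev_hash = GENESIS
    for entry in log:
        body = _body(entry["prev"], entry["payload"])
        expected_tag = hmac.new(key, body, hashlib.sha256).hexdigest()
        if entry["prev"] != prev_hash:
            return False
        if entry["hash"] != hashlib.sha256(body).hexdigest():
            return False
        if not hmac.compare_digest(entry["tag"], expected_tag):
            return False
        prev_hash = entry["hash"]
    return True
```

The SHA-256 link makes reordering or deleting entries detectable, and the HMAC tag means an editor without the key cannot simply recompute the hashes after changing a payload.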
Multi-step workflows where stdout of step N is available to step N+1 as {{prev}} or {{step_K.output}}. Fail-fast aborts on any gate-block. Parent and child receipts capture the full audit trail.
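The placeholder substitution can be illustrated with a small renderer. This is a sketch only: it assumes steps are numbered from 1 and that outputs are kept in a list, and `render_step` is a hypothetical helper, not part of `manager.chain_runner`.

```python
import re

PLACEHOLDER = re.compile(r"\{\{(prev|step_(\d+)\.output)\}\}")

def render_step(template: str, outputs: list[str]) -> str:
    """Fill {{prev}} and {{step_K.output}} from earlier outputs (steps numbered from 1)."""
    def substitute(match: re.Match) -> str:
        if match.group(1) == "prev":
            if not outputs:
                raise ValueError("{{prev}} used before any step has run")
            return outputs[-1]
        index = int(match.group(2)) - 1
        if not 0 <= index < len(outputs):
            raise ValueError(f"step {index + 1} has not run yet")
        return outputs[index]
    # A function replacement avoids backslash-escape surprises in stdout text.
    return PLACEHOLDER.sub(substitute, template)
```

For example, with outputs `["a", "b"]`, the template `"{{step_1.output}} then {{prev}}"` renders as `"a then b"`. Raising on an unresolved placeholder fits the fail-fast behavior described above.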
Prefix any command with dry: to record intent without execution. Useful for previewing destructive workflows or for compliance walkthroughs that should not mutate state.
Aggregates receipts into per-layer cost and per-tenant usage. Surfaces the L4 hit rate climbing and the L6 dispatch rate decaying over time: the empirical proof that pattern memory is reducing inference spend.
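The aggregation itself is a simple fold over receipts. The sketch below assumes each receipt carries `layer`, `tenant`, and `cost_usd` fields; those field names and the `aggregate` function are illustrative, not the dashboard's schema.

```python
from collections import defaultdict

def aggregate(receipts: list[dict]) -> dict:
    """Roll receipts up into per-layer cost, per-tenant call counts, and the L4 hit rate."""
    cost_by_layer: dict[str, float] = defaultdict(float)
    calls_by_tenant: dict[str, int] = defaultdict(int)
    for receipt in receipts:
        cost_by_layer[receipt["layer"]] += receipt["cost_usd"]
        calls_by_tenant[receipt["tenant"]] += 1
    total = len(receipts)
    l4_hits = sum(1 for receipt in receipts if receipt["layer"] == "L4")
    return {
        "cost_by_layer": dict(cost_by_layer),
        "calls_by_tenant": dict(calls_by_tenant),
        "l4_hit_rate": l4_hits / total if total else 0.0,
    }
```

Computing the same summary over successive time windows gives the hit-rate and dispatch-rate curves the dashboard plots.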
The pre-dispatch router's classification actively alters dispatch; it is not advisory. fast_path: only L1 runs; L2–L6 are skipped. deep_review: cheap layers are skipped; L6 is forced. deny: task is blocked and receipted before gate cycles are spent. standard_path: the normal cascade, L1 through L6. Provider auto-selection: Ollama for trivial and codegen tasks; Anthropic for reasoning and novel tasks. Selection is wired from the routing decision, not from caller configuration.
After every completed task, the meta-loop hook records the routing outcome: which layer resolved it, at what cost, and with what result. Over time this data surfaces which task types consistently hit expensive layers and allows the routing thresholds to tighten. The router identifies task patterns that benefit from fast-path caching and adjusts accordingly.
Multi-node Cascade deployments share pattern memory and receipt chains. An L4 pattern hit on one node is available fleet-wide within one sync cycle. Receipt chains are replicated, so a node failure does not break audit continuity. Deployed for high-availability agent pipelines and multi-region enterprise deployments.
Cascade monitors layer health continuously. If a layer degrades (latency spike, gate failure rate rising, receipt chain write errors), the autonomic controller reroutes traffic around that layer and logs the bypass in the receipt chain. Automated failover without operator involvement or service interruption.
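A rolling failure-rate window is one simple way to decide when a layer should be bypassed. The sketch below is an assumption about how such a controller could work; the window size, the threshold, and the names `LayerHealth` and `healthy_layers` are hypothetical.

```python
from collections import deque

class LayerHealth:
    """Tracks recent successes and failures for one layer over a fixed-size window."""

    def __init__(self, window: int = 20, max_failure_rate: float = 0.5) -> None:
        self.samples: deque = deque(maxlen=window)
        self.max_failure_rate = max_failure_rate

    def record(self, ok: bool) -> None:
        self.samples.append(ok)

    @property
    def degraded(self) -> bool:
        if not self.samples:
            return False
        failures = self.samples.count(False)
        return failures / len(self.samples) > self.max_failure_rate

def healthy_layers(order: list[str], health: dict[str, LayerHealth]) -> list[str]:
    """Return the dispatch order with degraded layers removed."""
    return [name for name in order if not health[name].degraded]
```

Because the window is bounded, a layer that recovers is readmitted automatically once enough recent calls succeed.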
The 23 built-in adapters are entries in a JSON registry. Register any new CLI tool by adding a descriptor (command prefix, risk tier, allowed flag patterns, forbidden patterns, receipt template); no source change or redeploy is required. A marketplace verifier test suite validates each new adapter against the governance contract before promotion.
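A descriptor check might look like the following. The five fields mirror the ones listed above, but the JSON key names, the tier vocabulary, and `validate_descriptor` are assumptions for illustration; the real registry schema may differ.

```python
import json

# Assumed key names for the five descriptor fields named in the text.
REQUIRED_FIELDS = {
    "command_prefix",
    "risk_tier",
    "allowed_flag_patterns",
    "forbidden_patterns",
    "receipt_template",
}
KNOWN_TIERS = {"LOW", "MEDIUM", "HIGH"}  # assumed tier names

def validate_descriptor(raw: str) -> list[str]:
    """Return a list of problems; an empty list means the descriptor passes these checks."""
    descriptor = json.loads(raw)
    missing = sorted(REQUIRED_FIELDS - set(descriptor))
    problems = [f"missing field: {name}" for name in missing]
    if "risk_tier" in descriptor and descriptor["risk_tier"] not in KNOWN_TIERS:
        problems.append("risk_tier must be one of LOW, MEDIUM, HIGH")
    return problems
```

Returning a list of problems instead of raising on the first one lets a verifier report everything wrong with a new adapter in one pass.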
A typical Cascade chain mixes governed CLI calls with deterministic and LLM steps. Every step is receipted with parent-child linkage.
$ python -m cascade.chain # Three-step example
from manager.chain_runner import run_chain
result = run_chain([
"$ gh pr list --limit 5", # L5, governed gh CLI
"Summarize these PRs in 2 sentences:\n{{prev}}", # L4 if pattern hit, else L6
"$ echo summary captured", # L5, terminal sink
], risk_tier="MEDIUM")
→ step 1: passed 10-gate · L5 dispatch · receipt 9f3e…
→ step 2: passed 10-gate · L4 pattern hit · receipt b71c… · cost 0
→ step 3: passed 10-gate · L5 dispatch · receipt 4e22…
→ chain receipt: a8d1… · parent of 3 children · verify ok
Most agent frameworks treat execution as a function call. Cascade treats it as a physical process: one that consumes energy, generates entropy, maintains coherence, and must be governed against thermodynamic limits. These aren't metaphors: they're the signals the code computes before every dispatch.
Measures disorder in incoming prompts and agent outputs: obfuscation, injection payloads, semantic drift, output collapse. High-entropy tasks are quarantined or escalated before they consume expensive compute. The spectral drift monitor (SDM) implements this as a sub-millisecond hot path.
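The text does not specify how the spectral disorder score is computed, so the sketch below swaps in a much simpler and well-known stand-in: Shannon entropy over the characters of the input. It only illustrates the idea of an entropy ceiling; the ceiling value and `entropy_gate` are hypothetical.

```python
import math
from collections import Counter

def shannon_entropy(text: str) -> float:
    """Bits per character of the empirical character distribution."""
    if not text:
        return 0.0
    total = len(text)
    counts = Counter(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

def entropy_gate(text: str, ceiling: float = 4.8) -> bool:
    """Pass when the input's character entropy is at or below the (assumed) ceiling."""
    return shannon_entropy(text) <= ceiling
```

Natural-language prompts tend to sit well below the entropy of random or encoded payloads, which is why even a crude character-level measure can flag some obfuscated inputs. It would not catch semantic drift, which needs a richer signal.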
Tracks synchronization across the execution stack: gate agreement, cross-service state consistency, and prediction accuracy over time. The Enable Equation requires coherence to exceed threshold before any action is authorized. 46,530 cycles measured; self-prediction error reached 0.00019 at cycle 46,529.
Every routing decision has an explicit cost signal: deterministic L1 (~0 tokens), graph L2 (0 tokens), validated codegen L3 (0 tokens), pattern memory L4 (near-zero), LLM L6 (expensive). The pre-dispatch router computes the cheapest admissible layer for each task type before any execution begins.
Reads the external environment the way a control system reads its plant. Provider latency, failure rates, cost signals, and task type all inform the routing decision. Fast-path for trivial, deep-review for novel, deny for budget-exhausted: the regime determines the route, not the caller's preference.
The deterministic layers (L1–L5) collapse the high-dimensional space of possible AI outputs into a low-dimensional structured response before anything reaches a model. 92.9% of cognition handled deterministically means the model sees only genuinely novel requests: the residual after reduction.
Every gate decision, dispatch, cost expenditure, and execution outcome is SHA-256 chained into an immutable receipt ledger. The receipt is not a log; it's the cryptographic proof of what computation happened, what it cost, and whether it was authorized. This is the thermodynamic accounting layer: entropy produced, energy spent, work done.
What closes the loop: The six physics primitives above each operate independently today. The next build, the Dissipation Controller, wires them into one active meta-governor that reads all sensors simultaneously and steers execution in real time. Predictive entropy regulation, coherence-triggered isolation, and dissipation signatures on every receipt. Designed and scoped; build next.
Managing LLM spend at scale. The cost-decay model (every L6 call feeds L4 pattern memory, reducing next-call cost to zero) is the primary economic argument. The cost dashboard surfaces L4 hit rate climbing and L6 traffic declining in real time. Repetitive agent pipelines see the steepest decay curve.
Compliance officers and legal teams who need pre-execution gating and an exportable audit trail. Every gate decision, dispatch, and denial is recorded in a tamper-evident HMAC chain. A regulator can replay the full session from the receipt log without access to the live system, including what was blocked and why.
Developer teams assembling multi-step workflows that mix LLM calls with real-world CLI operations. The chain runner handles multi-step sequences with {{prev}} output chaining, parent–child receipt linkage, and fail-fast abort on any gate block. Wire in any CLI tool in minutes via the adapter registry.
Teams who need provable evidence that operations were authorized, denied, or bypassed. The 10-gate predicate surfaces the exact failing gate on any denial. Dry-run mode previews destructive workflows without execution. Receipt verification is a single-pass function, so replay does not require reconstructing system state.
Route to the LLM by default. Add hooks before and after. Cost grows with task volume. No first-class hash chain. No pre-execution governance. No mechanism for inference cost to decrease over time.
Route to the cheapest layer that resolves the task. LLM is last resort. Every LLM success becomes a deterministic pattern at L4, so the next similar request never hits the LLM. Inference cost asymptotes toward zero over the lifetime of the deployment. Provider is auto-selected per task type: Ollama for trivial work, Anthropic only for genuinely novel requests. Hash-chained receipts are the primary substrate, not an afterthought.
The economic flip: LLM vendors are incentivized to grow your bill. Cascade is incentivized to shrink it. You pay a flat platform fee; your provider invoice declines as pattern memory accumulates. That economic asymmetry is the moat, and the reason this is licensed directly rather than sold through an LLM vendor's marketplace.
If you run Ollama, Claude CLI, Gemini CLI, Codex, and local scripts in the same workflow, you are manually deciding which brain handles each task. Cascade automates that decision: classify the task, pick the cheapest safe executor, gate anything risky, record what happened, and learn from the outcome so the next similar task costs less.
"Add a login form and connect it to backend auth."
Frontend UI skeleton → deterministic template
TypeScript boilerplate → fast local generation
Backend contract check → policy engine
Security-sensitive auth logic → Claude required
Final diff review → Ship Gate
Tests → local pytest / npm
Every step → receipt logged
"Sort my notes and make a plan."
Sorting / summarizing → Ollama (stays local)
Private / sensitive docs → local only, never sent out
Financial / IP docs → manual approval required
Complex reasoning → Claude or Gemini
Final action plan → receipt logged
Claude CLI: review auth; cannot auto-commit
Ollama: summarize local notes; stays local
Gemini CLI: frontend review
Codex: edit tests; cannot push
Human: approve product / release decisions
Private docs never leave local machine
Financial / legal / IP docs: local-first
Expensive models only when local confidence is low
Every routing decision receipted, full audit of what went where
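The privacy rules above reduce to a small decision function. This is a sketch under assumptions: the sensitivity labels, the confidence threshold, and `pick_executor` are hypothetical, and "needs_approval" stands in for the manual-approval step mentioned for financial and IP documents.

```python
LOCAL_ONLY = {"private"}                   # never leaves the local machine
LOCAL_FIRST = {"financial", "legal", "ip"}  # local unless a human approves escalation

def pick_executor(sensitivity: str, local_confidence: float, threshold: float = 0.7) -> str:
    """Choose where a task may run given its sensitivity label and local model confidence."""
    if sensitivity in LOCAL_ONLY:
        return "local"
    if local_confidence >= threshold:
        return "local"
    if sensitivity in LOCAL_FIRST:
        return "needs_approval"
    return "remote"
```

Checking the local-only set first means no confidence score, however low, can send a private document to a remote provider.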
Claude calls getting expensive → prefer Ollama next
Gemini output failed tests twice → route review to Claude
Ollama handled similar task well → use Ollama again
Provider latency chaotic → deprioritize that provider
$ cascade ask "summarize this repo" # Ollama first, Claude if needed
$ cascade code "add frontend pricing page" # template → codegen → review gate
$ cascade review --provider claude # force Claude for this one
$ cascade route "fix backend auth bug" # Cascade decides the executor
$ cascade ship-gate # run all pre-ship checks
$ cascade doctor # layer health + cost dashboard
$ python -m pytest tests -q
....s................................................................... [ 16%]
........................................................................ [ 32%]
........................................................................ [ 49%]
........................................................................ [ 65%]
........................................................................ [ 82%]
........................................................................ [ 98%]
...... [100%]
1,357 passed, 2 failed in 217.84s
Verified 2026-05-21. 71 test files across 14 suites: governance, gate, CLI adapter, federation, chain runner, cost dashboard, drift detector, autonomic health, marketplace verifier, layer health, learner cache, HumanEval subset (L3 codegen layer validated against a recognized code-generation benchmark, pass rate available on request), executable smoke, and integration end-to-end. cascade@0.1.0 · Docker Compose ready · FastAPI control plane included · LICENSE: Proprietary.
This comparison shows a typical workload: 100 incoming user requests per day, of which 60% are routine (docs lookup, FAQ answers, simple logic) and 40% need reasoning or creativity.
| Scenario | Raw LLM Calls | Cascade (Day 1) | Cascade (Day 30) | Savings |
|---|---|---|---|---|
| Tokens per request (avg) | 850 | 340 | 120 | 86% ↓ |
| Cost per 100 requests | $17.50 | $8.75 | $2.50 | 86% ↓ |
| Monthly cost (3,000 requests) | $525 | $263 | $75 | 86% ↓ |
| Why it works | Every request → Claude | Routes to cheapest layer; learns patterns | 60% of requests hit L4 (cached); 20% deterministic; 20% → Claude | |
Deterministic layer handles 20% of requests (template matching, simple logic) without LLM cost. Remaining 80% still route to Claude, but with risk gates pre-filtered.
L4 pattern memory accumulates. Successful Claude completions for common request types are cached. Repeated similar requests cost 0 tokens (hit cache, not LLM).
60% of requests hit L4 cache. 20% use deterministic layer. Only 20% need Claude. Cost per request dropped 86%. System improves itself as traffic patterns emerge.
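The savings figures in the table can be checked with a few lines of arithmetic. The helper names below are illustrative; the inputs are the table's own numbers.

```python
def savings(baseline: float, current: float) -> float:
    """Fractional reduction from baseline to current."""
    return 1 - current / baseline

def monthly_cost(cost_per_100: float, requests_per_month: int = 3000) -> float:
    """Scale a per-100-requests cost to a month of traffic."""
    return cost_per_100 * requests_per_month / 100

token_savings = savings(850, 120)     # about 0.859, shown as 86% in the table
cost_savings = savings(17.50, 2.50)   # about 0.857, shown as 86% in the table
day1_monthly = monthly_cost(8.75)     # 262.5, rounded to $263 in the table
day30_monthly = monthly_cost(2.50)    # 75.0
```

Both the token and the cost rows round to the same 86% reduction, and the monthly figures are the per-100 costs scaled to 3,000 requests.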
Cascade ships as a Docker Compose stack. One command stands up the full runtime: layer engine, FastAPI control plane, receipt ledger, and cost dashboard. No external dependencies for local deployment.
docker compose up: the full Cascade stack is ready in under 30 seconds. FastAPI control plane on :8080. Receipt ledger on local volume or S3 backend. Cost dashboard on :8090.
Three primary endpoints: POST /task submits and returns layer dispatch + receipt ID. GET /receipts replays the full audit chain. GET /cost-dashboard returns per-layer cost aggregates and L4 hit rate over time. Any language, any caller.
from manager.chain_runner import run_chain gives you multi-step chains with {{prev}} output chaining, risk tier per step, and parent–child receipt linkage. No Cascade-specific DSL to learn.
Each tenant gets a namespaced receipt chain and isolated L4 pattern memory. An L4 pattern learned from tenant A is not visible to tenant B. The cost dashboard aggregates per tenant key: the operator sees the full fleet; each tenant sees only their own audit trail.
Register any CLI tool by adding a JSON descriptor to the governance registry. No source change, no redeploy. The marketplace verifier validates the new adapter against the governance contract before it can be dispatched in production.
Multi-node deployments sync pattern memory and receipt chains. Pattern accumulation is fleet-wide, not per node. Any node failure is receipted and routed around. High-availability agent pipelines without audit-chain gaps.
Control Tower is the operator dashboard for Cascade. Real-time visibility into layer dispatch decisions, gate evaluations, receipt ledger, cost trends, and per-layer performance. One pane of glass for the entire Cascade fleet.
Watch gates fire in real time. See which predicates pass/fail, why tasks route to specific layers, and what happens when a layer fails and escalates. Every gate decision is linked to its receipt for forensic replay.
Search and replay any task execution. Full entry/exit/cost/latency chain. Verify gate decisions, compare L4 hit rate trends, audit who called what and when. HMAC-chained so you can cryptographically verify nothing was tampered with.
Per-layer cost aggregates over time. Watch L4 pattern memory grow and LLM call volume shrink. Identify which task types are most expensive and which are learning fastest. Cost dashboard updates per minute.
When a gate predicate fails, when a layer fails, or when cost exceeds thresholds, Control Tower logs it and can trigger webhooks. No task disappears silently; every failure is surfaced to operators.
Pilot engagements stand up Cascade against a representative workload, register your CLIs in the governance registry, wire the receipt chain into your audit pipeline, and walk a cost-decay measurement after 30 days.