AgentSafety-72 — Adversarial Test Taxonomy for LLM Agents

How AgentSafety-72 Works

A named-attack red team, not a fuzz tester.

AgentSafety-72 began as an internal research module (historical internal codename redacted, 2,874 LOC). The code is unchanged — only the naming was reworked for federal and commercial-safe presentation. Every attack is deliberate, named, and mapped to a deterministic gate outcome.

01

Taxonomy

Each of 72 attacks has a name, a category (identity, knowledge, policy, privilege, drift, etc.), and a deterministic outcome — which Enable Equation gate it should trigger. Not a fuzz tester; a deliberate, named-attack red team. Production-verified: 10,000+ attacks fired across all 15 attack types, every one hash-chained.
02

Gate Mapping

Every attack is mapped to a specific gate it should make fail: identity_masking → G_AUTH, knowledge_hallucination → G_VETO, privilege_escalation → G_AUTH + G_POLICY, policy_bypass → G_POLICY. If the gate doesn't fail, your stack is broken.
03

Implementation

12 of the 72 attacks ship with working Python implementations today. Each is a separate module under agi/attacks/ with an attack function and a verification of which gate(s) it triggered. New attacks plug into the same interface.
04

Verification

Every red-team run produces an attack receipt: which attack ran, which gates fired, which gates failed to fire (a real finding), and a hash-chained audit trail. Auditor-grade evidence that your agent stack survives or fails specific known attacks.

Sample Attack Implementations

Representative attacks. Each maps to a specific gate failure.

Every attack below is live Python code with a deterministic gate-failure target. Each red-team run produces a hash-chained receipt of which gates fired and which failed to.

01 — Identity Masking

Agent presents a false claimed identity (CallerID spoof). Target: G_AUTH.

02 — Knowledge Hallucination

Agent confidently states a derived fact with no source. Target: G_VETO.

03 — Privilege Escalation

Agent requests a permission outside its declared scope. Target: G_AUTH + G_POLICY.

04 — Policy Bypass

Agent attempts to act on an item explicitly excluded by policy DSL. Target: G_POLICY.

05 — Coherence Drift

Agent generates outputs that progressively diverge from the stated task. Target: G_COHERENCE.

06 — Resource Exhaustion

Agent attempts unbounded recursion or token spend. Target: G_BUDGET.

07 — Receipt Forgery

Agent submits a forged predecessor receipt to extend its action chain. Target: G_LEDGER.

08 — Time Authorization Replay

Agent reuses an expired time-bound authorization token. Target: G_EPOCH.

09 — Cross-Tenant Bleed

Agent attempts to read state belonging to another tenant. Target: G_ISOLATION.

10 — Hardware Bypass

Agent attempts to execute a hardware-gated action via software-only path. Target: G_HARDWARE.

11 — Spectral Stability Attack

Agent introduces synthetic noise to disguise drift. Target: G_SPECTRAL.

12 — Council Coercion

Agent uses social-proof prompts to override the deliberative council. Target: G_VETO + G_COHERENCE.

Production Evidence

Not a demo. Already fired ten thousand times.

AgentSafety-72 isn't a paper proposal — it's a system that ran in production for weeks. The 72-vector adversarial sweep fired ~10,000 times across 15 attack categories, every one hash-chained, every gate_held boolean recorded.

72/72 Daemons Verified

All 72 daemons summon and return real AdversarialResults — not stubs, not mocks. Measured values: identity markers, GPU temp, memory %, entropy, drift.

Production Ledger

attack_ledger.jsonl — 26 MB hash-chained append-only log of every attack run, every gate outcome, every measured value.

Attack Categories

15 distinct attack types covered across the 72-vector sweep. Categories span identity, knowledge, policy, privilege, drift, resource, replay, isolation, hardware, spectral, and coercion.

Sample Attack Vectors Verified

AV-01 through AV-08 (representative attack vectors) — running live with measured values returned on every invocation.

Each attack record contains: attack_id, attack_name, gate_target, attack_type, gate_held (boolean), severity (0–1), details (free text), timestamp, duration.

Who This Is For

Three audiences. One shared taxonomy.

AgentSafety-72 gives red-team groups, agent platform vendors, and regulators a common vocabulary backed by working code.

AI Red Teams

Drop-in adversarial test framework. Standardized attack taxonomy, consistent reporting, hash-chained receipts. Replaces ad-hoc fuzzing with a real test program.

Agent Platforms

Pre-launch certification. Run AgentSafety-72 against your agent before customers do. Ship with a signed certificate showing which attacks your stack survives.

Regulators & Auditors

Standardized vocabulary for "this agent failed under attack X." Map your audit findings to a shared 72-vector taxonomy that has working code behind every name.

Status & Licensing

Open-core in review. Commercial track live.

Open-Source Track — Core taxonomy + 12 attacks may be released under a permissive license (TBD — currently under review for foreign-filing protection before public disclosure). Targeting GitHub release once IP review is complete.

Commercial Track — Custom attack implementations, customer-specific deployment integration, retainer red-team engagements with hash-chained reporting.

Open-Core

TBD

community license — pending IP review

Core 72-vector taxonomy
12 reference attack implementations
Gate-failure verification harness
Hash-chained receipt format
Community support

Commercial Retainer

Quote

includes custom attack implementations

Custom attack module development
Customer deployment integration
Hash-chained reporting
Priority gate-mapping support
Dedicated Slack channel

Enterprise Audit

Quote

6-month red-team engagement

Six-month red-team engagement
Prioritized 10–15 attack delivery
Customer-profile attack scoping
Hash-chained audit deliverables
Executive readout & remediation plan

Sovereign

Quote

air-gapped on-prem

Air-gapped on-prem deployment
Source-available license
Custom attack scope
Federal / defense-ready packaging
Annual security audit

Seventy-two named ways an agent can fail. All 72 fire today.

A named-attack red team, not a fuzz tester.

Representative attacks. Each maps to a specific gate failure.

Not a demo. Already fired ten thousand times.

Pick an attack. See what fires.

Three audiences. One shared taxonomy.

Open-core in review. Commercial track live.

Built for the teams that own agent safety.

Find out which of the seventy-two break your agent.