Version 2.0 · March 2026

The Agentic Stack

A Metaframework for Autonomous Agent Systems

Ryan Van Sickle · ryanv.xyz · CSO Inversed

with Daniel Benarroch, CEO & Aurélien Nicolas, CTO

9 Layers
104 Primitives
5 Planes
12 Patterns

The Map

The most consequential software infrastructure since the internet is being built right now.

It is being built without a map.

As of March 2026, more than 80% of Fortune 500 companies are deploying active AI agents, often without centralized oversight (Microsoft Security Insider). Multi-agent system inquiries surged 1,445% in a single year (MachineLearningMastery). Protocols are crystallizing. Payment rails are being laid. Memory architectures range from naive conversation buffers to self-editing hierarchical stores. Security ranges from nonexistent to cryptographically attested governance artifacts.

No single vendor owns the full stack. No single framework covers more than three of the nine layers described here. Brilliant individual innovations (Letta's sleep-time compute, cognitive routing architectures, identity file systems, Anthropic's context compaction) exist in isolation, without a shared frame of reference for how they relate.

The Agentic Stack provides that frame. It is the first comprehensive architectural map of this landscape. It names, classifies, and positions every functional primitive required to build, deploy, govern, and evolve agents that operate for hours, days, or months, not minutes. It is a metaframework: not a product, not a specification, but a shared language for an industry that needs one.

"We shape our tools, and thereafter our tools shape us."
— John M. Culkin

How to Read This Document

The Agentic Stack is organized along three axes:

Layers
vertical

Nine functional layers (L0–L8) from raw compute to economic infrastructure. Each layer provides specific capabilities and exposes specific contracts to the layers above and below. Every agent system, from a weekend prototype to a Fortune 500 fleet, can be mapped onto these layers. No production system today implements all nine. The most complete cover five or six with varying depth.

Planes
cross-cutting

Five concerns that thread through every layer: Identity, Memory, Context, Policy, and Observability. These are the connective tissue. No single layer owns them. They propagate through the entire stack.

Protocols
lateral

The emerging communication standards that enable interoperability between agents, tools, humans, and markets. These are the shared language that makes cross-vendor, cross-framework agent ecosystems possible.

Lexicon

Every framework needs a shared vocabulary. These are the terms of art used throughout this document.

Term | Definition | Primary Layer(s)

The Seven Principles

Every architectural decision in the Agentic Stack flows from seven governing principles. These are not aspirational. They are structural invariants. Violate them and your system will break. Honor them and it will compose.

Principle 01
Primitives Over Frameworks
Frameworks rise and fall. Primitives endure. The Agentic Stack defines 104 primitives: atomic capabilities that any agent system requires. A framework is one possible arrangement of primitives. Build on primitives and you can swap frameworks without rebuilding. Build on frameworks and you are locked in. The history of software, from monoliths to microservices, from ORMs to query builders, teaches this lesson repeatedly.
Principle 02
Policy Is Infrastructure
Policy is not a document. It is not a review meeting. It is code that runs at machine speed, evaluated at every tool call, every delegation, every memory write. The agent operates within a policy envelope that shapes what is possible. Open Policy Agent governance sidecars enforce rules in real-time against every API call. Deny-by-default. Always.
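The deny-by-default envelope can be made concrete in a few lines. A minimal sketch, assuming illustrative rule names and fields (this is not any specific policy engine's API):

```python
# Hypothetical sketch of a deny-by-default policy envelope: every tool call
# passes through evaluate() before execution. All names are illustrative.
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Rule:
    action: str               # e.g. "db.read"
    resource: str             # e.g. "customer_records"
    condition: callable = lambda ctx: True

@dataclass
class PolicyEnvelope:
    rules: list = field(default_factory=list)

    def evaluate(self, action: str, resource: str, ctx: dict) -> bool:
        """Allow only if an explicit rule matches. Deny by default."""
        return any(
            r.action == action and r.resource == resource and r.condition(ctx)
            for r in self.rules
        )

envelope = PolicyEnvelope(rules=[
    Rule("db.read", "customer_records",
         condition=lambda ctx: ctx.get("ticket_active", False)),
])

assert envelope.evaluate("db.read", "customer_records", {"ticket_active": True})
assert not envelope.evaluate("db.read", "customer_records", {})   # condition fails
assert not envelope.evaluate("db.write", "customer_records", {})  # no rule: denied
```

The key property: an action with no matching rule is denied, with no special-casing required.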
Principle 03
Memory Is Hierarchical
There is no single "memory." Working memory operates at millisecond timescales. Episodic memory records past events. Semantic memory stores distilled facts. Procedural memory encodes behavioral patterns. Collective memory promotes knowledge across an organization. Each type has different storage, different update cadences, and different retrieval mechanisms. Systems that treat memory as a flat store will fail at the first long-running task.
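The hierarchy is easy to sketch: each tier gets its own store, bound, and update cadence. A minimal illustration (class and method names are assumptions, not any framework's API):

```python
# Illustrative hierarchical memory: a bounded working buffer, an append-only
# episodic log, keyed semantic facts, and named procedural patterns.
import time
from collections import deque

class HierarchicalMemory:
    def __init__(self, working_capacity: int = 8):
        self.working = deque(maxlen=working_capacity)  # fast, bounded, volatile
        self.episodic = []                             # append-only event log
        self.semantic = {}                             # distilled facts, keyed
        self.procedural = {}                           # named behavioral patterns

    def observe(self, event: str):
        self.working.append(event)                     # may evict the oldest
        self.episodic.append((time.time(), event))     # never evicts

    def distill(self, key: str, fact: str):
        """Promote a distilled fact from episodes into semantic memory."""
        self.semantic[key] = fact

mem = HierarchicalMemory(working_capacity=2)
mem.observe("user asked for Q3 report")
mem.observe("fetched revenue table")
mem.observe("generated chart")            # evicts the first observation
mem.distill("report_format", "user prefers charts over tables")

assert len(mem.working) == 2              # working memory is bounded
assert len(mem.episodic) == 3             # episodic memory keeps everything
```

A flat store cannot express this: the same event must be volatile in one tier and permanent in another.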
Principle 04
Provider Abstraction
No primitive should be coupled to a specific model, vendor, or framework. The model layer serves inference. The framework layer composes behavior. The platform layer provides runtime. When these concerns are separated, you can replace any model with any other, swap orchestration frameworks, or migrate clouds without rewriting core logic. MCP is the clearest embodiment: define a tool once, use it from any compliant agent.
Principle 05
Scope Hierarchy Is Real
An agent is not an island. It operates within a scope hierarchy: the individual agent, the team, the department, the enterprise. Permissions narrow as scope widens. Memory promotes upward and policy cascades downward. A team-level policy overrides an agent-level preference. An enterprise-level compliance rule overrides everything. This is the same scope resolution that makes programming languages work.
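Scope resolution can be sketched directly: policy cascades downward, so the widest scope that sets a key wins. A minimal sketch with illustrative scope names and keys:

```python
# Minimal sketch of scope resolution. Wider scopes override narrower ones,
# mirroring the text: team overrides agent, enterprise overrides everything.
SCOPE_ORDER = ["agent", "team", "department", "enterprise"]  # narrow → wide

def resolve(key: str, policies: dict):
    """Walk from widest to narrowest scope; the widest setting wins."""
    for scope in reversed(SCOPE_ORDER):
        if key in policies.get(scope, {}):
            return policies[scope][key]
    return None

policies = {
    "agent":      {"max_spend_usd": 500, "tone": "casual"},
    "team":       {"max_spend_usd": 100},
    "enterprise": {"pii_masking": True},
}

assert resolve("max_spend_usd", policies) == 100   # team overrides agent
assert resolve("tone", policies) == "casual"       # only the agent sets it
assert resolve("pii_masking", policies) is True    # enterprise rule applies to all
```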
Principle 06
Interface Stability
The contracts between layers must be more stable than the implementations within them. When Layer 4 calls Layer 2, the interface between them should not change every framework update. Stable interfaces enable independent evolution. Unstable interfaces create cascading breakage. The protocols emerging today (MCP, A2A, AG-UI) are the first generation of stable inter-layer contracts.
Principle 07
Observability Is Required
You cannot govern what you cannot see. Every layer must emit structured signals: traces, metrics, logs, evaluation scores. But agent observability is fundamentally different from infrastructure observability. The question is not "is the server up?" but "is the agent reasoning correctly?" This demands reasoning traces that capture why the agent chose an action, not just what it did. The shift from infrastructure observability to reasoning observability is the defining monitoring challenge of 2026.
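Reasoning observability starts with emitting the "why" alongside the "what." A sketch of a structured trace event (the schema is illustrative, not a standard):

```python
# Illustrative reasoning-trace event: it captures the rationale and confidence
# behind an action, which infrastructure logs never carry.
import json, time

def trace_event(agent_id: str, action: str, rationale: str, confidence: float) -> str:
    event = {
        "ts": time.time(),
        "agent_id": agent_id,
        "action": action,
        "rationale": rationale,        # the "why", absent from infra logs
        "confidence": confidence,
    }
    return json.dumps(event, sort_keys=True)

line = trace_event("support-bot-7", "escalate_to_human",
                   "confidence below threshold on refund policy question", 0.41)
parsed = json.loads(line)
assert parsed["action"] == "escalate_to_human"
assert parsed["confidence"] < 0.5      # the signal a governor would act on
```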

The Individual and the Organization

The Agentic Stack describes two fundamentally different architectures through the same lens. Understanding which one you are building is the first design decision.

Cognitive Architecture
The Individual Agent

An individual agent is a cognitive system. Its architecture mirrors a mind. The individual agent cares about: reasoning quality, identity coherence, memory persistence, goal maintenance, and learning from experience. Its primary layers are L1 through L3. Its primary planes are Identity, Memory, and Context.

Governance Architecture
The Organization of Agents

An organization of agents is a governance system. Its architecture mirrors an enterprise. The organization cares about: task delegation, supervision, trust boundaries, compliance, cost attribution, and collective learning. Its primary layers are L4 through L8. Its primary planes are Policy and Observability.

The Dual Lens

Every layer in the stack is visible through both lenses, but the emphasis shifts:

Layer | Individual Lens | Organizational Lens
L0 Substrate | My hardware | Fleet infrastructure
L1 Engine | My reasoning capability | Model portfolio management
L2 Workbench | My tools and skills | Agent templates and standards
L3 Cortex | My personality and memory |
L4 Switchboard | | Delegation and coordination
L5 Proving Ground | My sandbox | Fleet evaluation and deployment
L6 Shield | My credentials | Governance and compliance
L7 Interface | My face | Product surface
L8 Commons | My wallet | Organizational economics

The cleanest systems are designed with this duality explicit. The messiest systems are those that conflate individual agent cognition with organizational governance, applying team-level patterns to single-agent problems, or expecting individual-level coherence from a fleet.

When to Use Which Lens

Start with the individual when building a personal assistant, a domain expert, or a creative tool. Focus on L1-L3. Invest in identity, memory, and goal maintenance before thinking about orchestration.

Start with the organization when building an enterprise workflow, a multi-agent processing pipeline, or a fleet. Focus on L4-L6. Invest in delegation, evaluation, and governance before refining individual agent cognition.

You need both when building a system where agents must be individually excellent and collectively coordinated: the enterprise fleet of cognitively routed specialists, which is the end-state vision for most production deployments. Enterprises using coordinated fleets report 40-60% faster operational cycles, but only when each agent in the fleet maintains its own cognitive integrity.

The Stack

From silicon to commerce. Click any layer to navigate directly to its section.

L8
The Commons Payment rails, marketplaces, reputation
L7
The Interface Personas, UIs, sessions, escalation
L6
The Shield Identity, credentials, audit, compliance
L5
The Proving Ground Sandboxes, evals, lifecycle, cost tracking
L4
The Switchboard Task decomposition, delegation, routing
L3
The Cortex Identity kernel, memory, goal beacon, metacognition
L2
The Workbench Agent definition, tool binding, RAG, state
L1
The Engine Autoregressive inference, context window, tool calling
L0
The Substrate GPUs, cloud compute, networking, storage
L0

The Substrate

The bedrock. Compute, storage, and networking. The physics beneath the intelligence.

Layer 0 is not the focus of this framework, but it must be acknowledged. Every agent ultimately runs on silicon: GPUs for inference, CPUs for orchestration, SSDs for memory persistence, networks for communication.

The key architectural trend: the shift from general-purpose cloud compute to agent-optimized infrastructure. Firecracker microVMs deliver sub-second sandboxed execution. Dedicated microVMs per agent session provide process isolation at cloud scale. Prompt caching with configurable TTLs enables long-running workflows without redundant computation.

Primitives at this layer: Compute allocation (GPU/CPU scheduling), persistent storage (SSD arrays, object storage), network fabric (inter-agent communication, external API access), hardware attestation (TPM, SEV-SNP, TDX for trusted execution), prompt cache (configurable TTL stores for repeated context).

Why it matters for agents specifically: Traditional cloud compute is optimized for stateless request-response. Agents are stateful, long-running, and unpredictable in resource consumption. The Substrate must evolve to support agent-native patterns: warm standby for agents that may be idle for hours before resuming, per-session process isolation for trust boundaries, and cost-aware scheduling that routes expensive reasoning to appropriate hardware.

Who builds here: NVIDIA, AMD, AWS, Google Cloud, Azure, Modal, E2B, Northflank, Daytona.
L1

The Engine

Where tokens are born.

The Engine is the foundation model layer: the autoregressive inference process that generates language, reasons about problems, and produces structured outputs. It is the CPU of the agent operating system.

Everything above depends on its capabilities. Nothing below knows what it will be asked to do.

Primitive | What It Does
Autoregressive Core | Token-by-token generation. The fundamental computation
Context Window | The working memory of the model, supporting up to 1M tokens in current frontier models
Embedding Engine | Converts text to vectors for semantic search and retrieval
Tool Calling Interface | Structured function calls that bridge language to action
Structured Output | Constrained generation guaranteeing schema conformance
System Prompt | The behavioral preamble and first layer of identity
Sampling Controls | Temperature, top-p, top-k controls for tuning generation from deterministic to creative
Multimodal I/O | Processing and generating across text, images, audio, video, code
Extended Reasoning | Chain-of-thought and thinking tokens with adaptive effort levels
Fine-Tuning Interface | Weight modification via SFT, LoRA, RLHF, or DPO
Model Routing | Selecting the right model per subtask for cost, latency, and capability optimization
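The tool calling contract can be sketched end to end: the model emits a structured call (a name plus JSON-encoded arguments) and the harness dispatches it. The schema shape below follows common Chat Completions-style tool definitions; the tool itself is hypothetical:

```python
# Sketch of the tool-calling contract. The weather tool, its schema, and the
# registry are illustrative; only the overall shape mirrors common APIs.
import json

weather_tool = {                        # what you would register with the model
    "name": "get_weather",
    "description": "Current weather for a city",
    "parameters": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"],
    },
}

def get_weather(city: str) -> str:
    return f"Sunny in {city}"           # stub implementation

REGISTRY = {"get_weather": get_weather}

def dispatch(tool_call: dict) -> str:
    """Bridge language to action: look up the function, parse args, invoke."""
    fn = REGISTRY[tool_call["name"]]
    args = json.loads(tool_call["arguments"])
    return fn(**args)

# A model response would carry something like this:
model_call = {"name": "get_weather", "arguments": '{"city": "Oslo"}'}
assert dispatch(model_call) == "Sunny in Oslo"
```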
Key Insight

The Engine's most consequential recent advance is not a bigger model. It is context compaction. Anthropic's Compaction API achieved a fourfold improvement in retrieval accuracy at 1M tokens, turning context limits from a hard wall into a soft boundary. This single capability unlocks indefinite agent execution.

Contract upward: Standardized inference APIs (Chat Completions, Responses API) and tool calling schemas. Layers above never modify model weights during operation.
Who builds here: OpenAI, Anthropic, Google, Meta, Mistral, Cohere. Model routing: OpenRouter, Martian.
L2

The Workbench

The builder's table. Where agents are defined, tools are bound, and behavior is composed.

The Workbench is the framework layer: the developer-facing surface where agents take shape. It provides the abstractions for defining what an agent is, what tools it can use, how it reasons, and how it maintains state.

Primitive | What It Does
Agent Definition | Declarative specification: name, role, capabilities, boundaries
Tool Binding | Connecting functions, APIs, and services to an agent's action space
Prompt Template | Reusable, parameterized structures encoding domain expertise
Agent Loop | The core cycle (observe, reason, act, observe) that makes agents agents
Memory Interface | The API through which an agent reads and writes persistent memory
Retrieval Pipeline | RAG infrastructure: embedding, indexing, similarity search, reranking
Output Parser | Structured extraction from model outputs, from regex to Pydantic validation
State Manager | Typed, checkpointable state flowing through the execution graph
Planning Module | Goal decomposition into subtasks, from sequential plans to tree search
Reflection Module | Self-evaluation after action: did this work? Should the plan change?
Callback System | Hooks for logging, tracing, and intercepting execution
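The Agent Loop primitive is the heart of the layer. A stripped-down sketch, with the reasoning step stubbed in place of a model call (everything here is illustrative scaffolding):

```python
# Minimal agent loop sketch: observe → reason → act → observe. The "reason"
# step is a stub standing in for a model call.
def run_agent(goal: str, tools: dict, max_steps: int = 5):
    state = {"goal": goal, "observations": []}
    for _ in range(max_steps):
        # reason: decide the next action from goal + observations (stubbed)
        if not state["observations"]:
            action, arg = "search", goal
        else:
            action, arg = "finish", state["observations"][-1]
        # act, or stop if the (stubbed) reasoner says we are done
        if action == "finish":
            return arg
        # observe: record the tool result for the next iteration
        state["observations"].append(tools[action](arg))
    return None                           # step budget exhausted

tools = {"search": lambda q: f"top result for '{q}'"}
result = run_agent("agent memory papers", tools)
assert result == "top result for 'agent memory papers'"
```

Every framework in this layer is, at bottom, a more elaborate version of this loop plus the other primitives in the table.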
Key Insight

The Workbench layer is commoditizing. Differentiation is moving down into cognitive middleware and up into orchestration. The Blueprint pattern: the Auton framework proposes a separation between specification (a declarative YAML/JSON Cognitive Blueprint) and execution (the runtime that hydrates it). An agent specified in Python could run in a Java runtime without refactoring. If adopted, this becomes the Infrastructure-as-Code layer for agents.

Who builds here: LangGraph (34.5M monthly downloads), CrewAI (44.3k stars), OpenAI Agents SDK (10.3M downloads), Google ADK (17.8k stars), Mastra (21.2k stars), PydanticAI.
L3

The Cortex

The mind between the model and the machine. Where raw inference becomes purposeful cognition.

This is the most important layer in the stack. And the least understood.

The Cortex sits above the Workbench, transforming a composed agent (with tools bound, state managed, prompts templated) into a goal-sustaining, identity-coherent, self-monitoring cognitive system. It is what separates a chatbot from an agent.

The core problem: foundation models are optimized for dialogue. They generate pleasantries, hedging, and conversational closers because RLHF rewards conversational completion. An agent that operates for hours on a complex task cannot afford this. It needs cognitive middleware that routes model output through identity, memory, and goal-maintenance systems before any response reaches the user or the next action fires.

No major open-source framework provides it comprehensively. This is the highest-value unsolved infrastructure problem in the entire stack.

Primitive | What It Does
Identity Kernel | The persistent disposition layer defining who the agent is, not instructions but character. Implemented as soul files, identity files, and persona blocks
Memory Arbiter | The policy engine governing what the agent writes, reads, updates, and forgets. Not retrieval, but arbitration. Decides whether to even attempt a memory operation based on salience and goal state
Filler Suppressor | Eliminates conversational artifacts from model output. Keeps the agent in execution mode, not conversation mode
Goal Beacon | Maintains objective continuity across hours of autonomous operation. Re-anchors current activity against declared objectives. Prevents goal drift
Dual-Process Router | Routes between fast (System 1) and slow (System 2) reasoning based on task complexity and confidence. Inspired by Kahneman's dual-process theory: a state machine for routine decisions, full LLM reasoning for novel situations
Output Classifier | Determines whether model output is an action, a plan, a reflection, or filler. Actions go to tools. Plans go to the planner. Reflections update memory. Filler gets suppressed
Metacognitive Monitor | Tracks the agent's own reasoning quality in real-time: confidence, progress, competence, logical validity. Based on four metacognitive dimensions from cognitive science
Disposition Stack | The layered personality system: base model → soul/persona → user context → session state. Each layer can override the one below while maintaining overall coherence
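The Output Classifier is the easiest of these primitives to make concrete. Real implementations use a model or a trained classifier; the keyword rules below are purely illustrative, but they show the routing:

```python
# Illustrative output classifier: routes a model turn to tools, planner,
# memory, or suppression. Keyword rules stand in for a real classifier.
FILLER_MARKERS = ("certainly!", "great question", "let me know if")

def classify(output: str) -> str:
    text = output.strip().lower()
    if text.startswith("call:"):
        return "action"        # goes to the tool layer
    if text.startswith("plan:"):
        return "plan"          # goes to the planner
    if text.startswith("note:"):
        return "reflection"    # updates memory
    if any(m in text for m in FILLER_MARKERS):
        return "filler"        # suppressed; never reaches the user
    return "response"

assert classify("CALL: search(query='q3 revenue')") == "action"
assert classify("Plan: 1) fetch data 2) summarize") == "plan"
assert classify("Certainly! Let me know if you need anything else.") == "filler"
```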
The Two Identities

Identity-as-credential (L6) answers one question: "Am I authorized?" It is the passport, the employee badge, the cryptographic certificate that proves an agent is who it claims to be. Identity-as-soul (L3) answers a different question entirely: "Who am I, and what do I care about?" It is personality, professional judgment, and the values that guide decisions under ambiguity. Most current frameworks conflate the two, treating identity as a single problem. It is two problems with separate failure modes.

The soul file is not what the user sees (that is persona, managed at L7). It is the agent's internal disposition: its goals, beliefs, behavioral boundaries, and self-concept. A soul file sculpts how the agent perceives problems, what it chooses to do, and, crucially, what it refuses to do. As the OpenClaw identity architecture puts it: "System prompts tell models what to do; soul files tell them who to be."

This maps cleanly onto a framework from philosophy of mind called Beliefs-Desires-Intentions (BDI). An agent holds beliefs about the world (what it knows or assumes), desires for what should be (its objectives), and committed intentions for how to act (its current plan). Soul files encode all three. They give the agent structured rationality: a clear model for decisions, explicit goals that keep it focused, and traceable reasoning that explains why it took a given action.

Think of the full system as a disposition stack: the base model provides a behavioral floor, soul files constrain and direct, user context personalizes, and session state executes. Each layer contextualizes the one below; it does not override it. The soul stays stable even as users and sessions change. Anthropic's "Assistant Axis" research confirms why this matters: without a stable identity architecture, models are "only loosely tethered" to their intended persona and drift under sustained conversation, adversarial prompts, or philosophical tangents.

Key Insight

The Cortex is where the "service-as-software" paradigm lives. A cognitively routed agent is not a tool that waits to be used. It is a worker that pursues objectives, maintains context, and compounds expertise over time. The reason some architectures sustain multi-hour autonomous sessions while most framework-based agents degrade after minutes is this layer. As Anthropic's autonomy measurements show, the constraint on long-running execution is not model capability. It is infrastructure maturity. The Cortex is the infrastructure that closes the gap.

The Session-Governor-Executor Pattern

Zylos AI research synthesizes a three-tier cognitive architecture: a constrained conversational Session layer feeds into a Governor (policy, orchestration, risk), which delegates to a privileged Executor with sandboxed isolation. The critical insight: permission separation is a cognitive architecture requirement, not just a security feature.

The Dual-Process Router in Practice

The DPT-Agent framework from Shanghai Jiao Tong University implements Kahneman's System 1/System 2 as distinct architectural components: a state machine with code-as-policy generation for sub-100ms routine decisions, and full LLM reasoning with Theory of Mind for novel situations. A Code-as-Policy Generator bridges slow reasoning into the fast execution pipeline. System 2 literally programs System 1 over time.
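The pattern can be sketched in a few lines: a cached fast path handles familiar situations, novel ones fall through to slow reasoning, and the slow path's decisions populate the fast path. All names are illustrative; this is a sketch of the idea, not the DPT-Agent implementation:

```python
# Sketch of a dual-process router: System 1 is a lookup table that System 2
# programs over time. Names and the caching policy are illustrative.
class DualProcessRouter:
    def __init__(self, slow_reasoner):
        self.fast_policies = {}          # situation → cached decision
        self.slow = slow_reasoner

    def decide(self, situation: str) -> str:
        if situation in self.fast_policies:
            return self.fast_policies[situation]   # System 1: cheap lookup
        decision = self.slow(situation)            # System 2: full reasoning
        self.fast_policies[situation] = decision   # System 2 programs System 1
        return decision

calls = []
def slow_reasoner(situation):
    calls.append(situation)              # track expensive invocations
    return f"handled:{situation}"

router = DualProcessRouter(slow_reasoner)
router.decide("refund_request")          # novel → slow path
router.decide("refund_request")          # seen → fast path, no model call
assert calls == ["refund_request"]       # slow reasoning ran exactly once
```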

The Metacognitive State Vector

A TheWebConf 2026 paper computes a five-dimensional state vector from cognitive psychology, quantifying self-awareness in real-time. This vector dynamically routes between cheap/fast and expensive/slow models, not by hard-coded rules, but by monitoring the agent's own confidence and knowledge state.

What is missing: No major open-source framework provides comprehensive cognitive routing. This is the layer where the hardest problems live, and where the greatest differentiation opportunity exists.
Who builds here: Letta (memory arbitration), OpenClaw (identity kernel). Most implementations are proprietary.
L4

The Switchboard

The nervous system. Where tasks are decomposed, agents are coordinated, and work flows.

The Switchboard is the orchestration layer. It is where single-agent prototypes become multi-agent production systems. If Layer 2 builds individual workers, Layer 4 builds the organization chart.

Primitive | What It Does
Task Decomposer | Breaks high-level goals into subtasks with dependency graphs
Delegation Engine | Assigns subtasks to the best-qualified agent based on capability, availability, and trust
Routing Fabric | Directs requests to the appropriate agent or team based on intent classification
Shared State Store | Typed, consistent state accessible to all agents on a team
Workflow Graph | The explicit execution topology: sequential, parallel, hierarchical, or mesh
Durable Executor | Workflows that outlast any single process: checkpointing, resumption, exactly-once semantics
Handoff Protocol | Agent-to-agent transfer of control with full context preservation
Supervisor | A meta-agent that monitors team execution, validates outputs, and backtracks when needed
Human-in-the-Loop Gate | Pause points where execution stops for human approval before proceeding
Event Bus | Asynchronous messaging enabling event-driven agent activation
Conflict Resolver | Mediates disagreements between agents with contradictory outputs
Agent Registry | A discoverable inventory of all agents, their capabilities, and their status
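Decomposition plus capability-based delegation reduces to a matching problem. A minimal sketch (the agents, capability tags, and tasks are all illustrative):

```python
# Sketch of capability-based delegation: assign each subtask to the first
# agent advertising the needed capability; unmatched tasks escalate.
def delegate(subtasks, agents):
    assignments = {}
    for task, needed in subtasks:
        for name, capabilities in agents.items():
            if needed in capabilities:
                assignments[task] = name
                break
        else:
            assignments[task] = None     # no qualified agent: escalate to a human
    return assignments

agents = {
    "researcher": {"web_search", "summarize"},
    "analyst":    {"sql", "charting"},
}
subtasks = [
    ("gather market reports", "web_search"),
    ("build revenue chart",   "charting"),
    ("sign the contract",     "legal_signature"),
]

plan = delegate(subtasks, agents)
assert plan["gather market reports"] == "researcher"
assert plan["build revenue chart"] == "analyst"
assert plan["sign the contract"] is None     # human-in-the-loop gate
```

A production Delegation Engine adds availability, trust scores, and load balancing, but the core contract is this mapping.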
Key Insight

A multi-agent architecture using a lead agent for strategic planning with sub-agents gathering data in parallel outperformed single-agent benchmarks by 90.2%. The parallel pattern is not just faster. It produces qualitatively better outputs. Klarna's deployment of LangGraph-based agents achieved the equivalent output of 853 employees and saved $60M, not through faster individual agents, but through orchestrated teams executing workflows in parallel.

Contract upward: Managed workflows with deterministic execution guarantees.
Who builds here: LangGraph, CrewAI, Microsoft Agent Framework, Google ADK, Temporal.
L5

The Proving Ground

Where agents are tested by fire.

The Proving Ground is the harness layer: the runtime and evaluation infrastructure that executes agents safely, monitors them continuously, and measures whether they are actually working. It encompasses three distinct harness types.

Execution Harness

PrimitiveWhat It Does
SandboxIsolated execution for untrusted tool invocations. Firecracker microVMs, gVisor containers, or dedicated microVMs per session
Environment ManagerProvisions ephemeral environments with ~1-2 second creation latency
Resource GovernorToken budgets, API call limits, dollar-denominated spend caps, circuit breakers
Lifecycle ControllerAgent provisioning, startup, health checking, shutdown, garbage collection
Checkpoint EnginePersists execution state at every transition, enabling time-travel debugging, failure recovery
Cost TrackerFull-stack economic attribution. In a loan origination workflow, LLM tokens cost ~$0.30 while total agent cost is $50-85. Tokens represent less than 1% of spend
Retry EngineAutomatic retry with exponential backoff and checkpoint rollback
Structured LoggerMachine-readable logs with agent ID, action type, reasoning traces, timestamps
Deployment PipelineCI/CD for agents: version control, staged rollout, canary deployments
Version RegistryTracks agent configurations, prompt versions, and tool schemas as versionable artifacts
Hot ReloadUpdates agent behavior without restarting active sessions
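The Resource Governor deserves a sketch: a dollar-denominated spend cap with a circuit breaker that halts a runaway loop before it burns the budget. The thresholds and cost model are illustrative:

```python
# Sketch of a Resource Governor with a hard spend cap. A charge that would
# exceed the cap raises before any spend is recorded.
class BudgetExceeded(Exception):
    pass

class ResourceGovernor:
    def __init__(self, max_usd: float):
        self.max_usd = max_usd
        self.spent = 0.0

    def charge(self, usd: float):
        if self.spent + usd > self.max_usd:
            raise BudgetExceeded(f"cap {self.max_usd} would be exceeded")
        self.spent += usd

governor = ResourceGovernor(max_usd=1.00)
for _ in range(9):
    governor.charge(0.10)                # nine calls at $0.10 each

try:
    governor.charge(0.20)                # this call trips the breaker
    tripped = False
except BudgetExceeded:
    tripped = True

assert tripped
assert abs(governor.spent - 0.90) < 1e-9   # the blocked call was never charged
```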

Evaluation Harness

Primitive | What It Does
Trace Collector | Captures the full execution trace: every model call, tool invocation, state transition
Eval Dataset | Curated test cases with expected outcomes for regression testing
Scorer | Automated evaluation, from exact-match to LLM-as-judge. AWS AgentCore provides 13 built-in evaluators
Trajectory Evaluator | Assesses the path taken: was the reasoning sound, even if the answer was correct?
Benchmark Suite | Standardized benchmarks: task completion, tool selection accuracy, safety, goal success
Online Evaluator | Continuous production evaluation, monitoring quality in real-time

Agent Harness

Primitive | What It Does
API Gateway | External interface with rate limiting, authentication, and request routing
Durable State Store | Persistent state surviving process restarts, migrations, and infrastructure failures
Key Insight

Work-Bench's analysis identifies the Agent Runtime as the critical missing infrastructure. Existing infrastructure fails for agents because nondeterministic behavior cannot be tested with unit tests, invisible failures look identical to correct outputs, and 10x cost spikes can emerge from runaway loops. Cost attribution is the hardest unsolved problem at this layer. In real workflows, LLM tokens represent less than 1% of total agent cost.

Who builds here: AWS Bedrock AgentCore (framework-agnostic managed runtime, up to 8-hour sessions), Google Agent Engine (serverless auto-scaling), E2B, Modal, Langfuse, AgentOps, Braintrust, Arthur, Patronus AI.
L6

The Shield

The immune system. Where identity is verified, permissions are enforced, and every action leaves a cryptographic trail.

The Shield is not a feature layer. It is a prerequisite layer. Without it, the agents above are liability machines.

65% of enterprises cite complexity as the primary barrier to agent adoption. The organizations that solve governance first will deploy agents faster than those scrambling to add it after incidents.

Primitive | What It Does
Agent Identity (NHI) | Cryptographic identity for non-human entities: unique, verifiable, distinct from the deploying human. NIST's March 2026 concept paper addresses this as a regulatory concern
Credential Vault | Secure storage for API keys, OAuth tokens, service credentials
Auth Protocol | Agent-adapted authentication: short-lived sessions, SPIFFE/SPIRE workload identity, mutual TLS
Permission Scope | Fine-grained, context-aware: not just "can access database" but "can read customer records for active support tickets during business hours"
Trust Boundary | Structural privilege separation via the Session-Governor-Executor pattern where perception and action are architecturally separated
Policy Decision Point | Evaluates whether a specific action is permitted given current identity, context, and policy
Policy Enforcement Point | Intercepts every tool call and blocks unauthorized actions in real-time. The governance sidecar
Prompt Injection Shield | Defense against behavior hijacking. The Intent Capsule pattern: a signed, immutable envelope binding the original mandate to each execution cycle
PII/DLP Guard | Detects and masks personal information before it enters model context
Output Filter | Content safety: toxicity, bias, hallucination detection, compliance validation
Behavior Monitor | Real-time anomaly detection: goal drift, unusual tool usage, policy violations
Audit Ledger | Append-only record of every action, decision, and state change. The Layered Governance Architecture specifies immutable logs on Kafka or S3 Object Lock
Compliance Engine | Automated mapping to regulatory requirements: EU AI Act, SOC 2, HIPAA, NIST AI RMF
Rate Limiter | Throttling to prevent agents from overwhelming external systems
Approval Gate | Configurable thresholds escalating high-risk actions to human reviewers
Break-Glass Protocol | Emergency controls outside the agent runtime: global stop, session pause, scoped block, spend governors, quarantine. The agent cannot disable its own kill switch
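Two of these primitives compose naturally: a Policy Enforcement Point checks every tool call and writes each verdict, allow or deny, to a hash-chained append-only Audit Ledger so tampering is detectable. A minimal sketch with illustrative names:

```python
# Sketch of a PEP plus hash-chained audit ledger. Each entry's hash covers
# the record and the previous hash, so rewriting history breaks the chain.
import hashlib, json

class AuditLedger:
    def __init__(self):
        self.entries = []
        self._prev_hash = "genesis"

    def append(self, record: dict):
        payload = json.dumps(record, sort_keys=True) + self._prev_hash
        digest = hashlib.sha256(payload.encode()).hexdigest()
        self.entries.append({"record": record, "hash": digest})
        self._prev_hash = digest

def enforce(action: str, allowed: set, ledger: AuditLedger) -> bool:
    verdict = action in allowed          # deny anything not explicitly allowed
    ledger.append({"action": action, "allowed": verdict})
    return verdict

ledger = AuditLedger()
assert enforce("crm.read", {"crm.read"}, ledger) is True
assert enforce("payments.send", {"crm.read"}, ledger) is False  # blocked, but logged

assert len(ledger.entries) == 2          # denials are recorded too
assert ledger.entries[0]["hash"] != ledger.entries[1]["hash"]
```

This is still policy assertion, not cryptographic proof; signed artifacts and attestation (below) are what close that gap.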
The Cryptographic Trust Gap

Most agent frameworks rely on policy assertions, statements about what an agent can do, enforced by software the agent's runtime could compromise. The emerging alternative is cryptographic proof: Attested Governance Artifacts use Ed25519-signed policy artifacts, a mandatory two-process boundary, and append-only continuity chains that are third-party verifiable. Zero-knowledge proofs enable agents to prove compliance without revealing operational data. The shift from assertion to proof is where this layer's future lies.

OWASP Mapping

The OWASP Top 10 for Agentic Applications 2026 maps directly to Shield primitives: Agent Goal Hijack → Prompt Injection Shield, Identity & Privilege Abuse → Agent Identity + Auth Protocol, Insecure Inter-Agent Communication → Auth Protocol + Trust Boundary.

Who builds here: Attested Intelligence, Oso, Rubrik Agent Cloud, Microsoft Defender, Aembit, Astrix, CyberArk, DigiCert, eMudhra.
L7

The Interface

The face of the agent. Where intelligence meets the human.

The Interface is the application layer: the user-facing surface where agents become products. Personas are rendered. Conversations are managed. Feedback is captured. Intelligence is packaged for specific audiences.

Primitive | What It Does
Persona Renderer | Translates the Identity Kernel into user-facing presentation: name, avatar, tone
System Prompt Composer | Assembles full context from identity files, user profiles, memory, tool schemas, session state
Conversation Manager | Thread management, message history, turn-taking, multi-party support
Session Persistence | Continuity across interactions. The agent wakes up knowing who it is
Interface Layer | The rendering surface: chat UI, voice, email, Slack/Teams/Discord, API
Escalation Router | Intelligent handoff from agent to human when confidence is low or stakes are high
Tenant Isolator | Multi-tenant data and behavior isolation
Feedback Collector | Explicit (ratings) and implicit (task completion, continued engagement) user feedback
Billing Meter | Usage tracking: tokens, tools invoked, sessions completed, outcomes achieved
Feature Flag | Runtime toggles for experimental capabilities
Integration Connector | Pre-built connections to Slack, Gmail, Salesforce, Jira, GitHub, SAP, and hundreds more via MCP
Notification Engine | Proactive communication. The agent initiates contact when something important happens
Key Insight

The Interface is where the "agents as employees" metaphor becomes concrete. OpenAI Frontier's explicit design treats agents as coworkers with onboarding, identity, scoped permissions, and improvement over time. The escalation problem: when an agent encounters a situation beyond its confidence threshold, the transition from agent to human must be seamless. Poor escalation design is one of the most common reasons enterprises abandon agent deployments. The handoff feels worse than never having the agent in the first place.

Who builds here: OpenClaw (50+ integrations including WhatsApp, Telegram, Slack, Discord, iMessage), Anthropic Claude Cowork, Google Agentspace, Dust, Relevance AI.
L8

The Commons

Where agents do business.

The Commons is the newest and least mature layer: the financial and commercial infrastructure that enables agents to transact, be valued, and participate in markets. The agentic economy is projected to reach $3-5 trillion globally by 2030.

The structural challenge: AI agents execute hundreds of micro-transactions per conversation with sub-cent costs, far below viable thresholds for traditional card rails.

Primitive | What It Does
Payment Rails | Financial infrastructure for agent-initiated transactions, from card networks to crypto micropayments
Transaction Mandate | Cryptographically signed authorization scoping what an agent can purchase, spend, and from whom
Cost Attribution Engine | Maps every dollar of spend to the business outcome it produced
Agent Marketplace | Discovery and procurement for agent capabilities. Hire an agent like a contractor
Reputation Ledger | Verifiable track record: success rates, reliability, domain expertise. On-chain via ERC-8004
Metering Interface | Usage measurement: per-task, per-outcome, per-hour, subscription
Insurance Primitive | Liability coverage for agent failures. The emerging but immature field of AI-native insurance
Escrow Protocol | Conditional payment tied to verified task completion for trustless commerce
Major Payment Protocol Launches
Protocol | Backers | Key Innovation
Agent Payments Protocol (AP2) | Google, PayPal, Mastercard, Coinbase, AmEx (60+ partners) | Cryptographically signed mandates
Agent Pay | Mastercard, Microsoft, IBM | Agentic Tokens via enhanced tokenization
Intelligent Commerce | Visa, Anthropic, OpenAI, Perplexity, Samsung, Stripe | Full-stack agent commerce
Agentic Commerce Protocol | OpenAI + Stripe | Standardized agent-to-merchant purchases
x402 | Coinbase (open standard) | HTTP 402-based stablecoin payments; 100M+ payments processed
Key Insight

AI agents cannot open bank accounts. Crypto wallets require only a private key, making them the natural on-ramp for agent-to-agent value transfer. But only 16% of US consumers trust AI to make payments. The Shield must mature before the Commons can scale. The cost attribution problem is structural: organizations that optimize only for token spend are optimizing for less than 1% of their agent costs.

Who builds here: Visa, Mastercard, Google (AP2), Stripe (ACP), Coinbase (x402), Nevermined, Olas Network, Revenium.

The Five Planes

Some concerns refuse to live in a single layer. They propagate through the entire stack, touching every layer they pass through. Each plane is a lens: a way of asking a question that applies at every altitude.

1. The Identity Fabric

Who is acting?

Identity in agent systems is not one thing. It is five things, managed by five different teams, living at five different layers. When people say "agent identity," they usually mean credentials. That covers roughly half the problem.

Identity Type | What It Answers | Origin Layer
Workload Identity | What process is running? | L0: SPIFFE/SPIRE certificates, hardware attestation
Agent Identity | Which agent is this? | L6: Cryptographic NHI credentials
Task Identity | What job is being done? | L4: Correlation IDs, trace propagation
Delegation Identity | On whose authority? | L6: Signed delegation chains, scoped OAuth
Persona Identity | Who does the user see? | L7: Soul files, persona blocks

The identity resolution flow: When an agent invokes a tool, all five types are resolved simultaneously. The workload identity proves the process is legitimate. The agent identity proves which agent is calling. The task identity connects the action to a specific goal. The delegation identity proves the agent has authority from a human principal. The persona identity determines how the result is presented.
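A minimal sketch of this resolution flow, assuming a hypothetical runtime that carries all five identity types in a context dict (the field names follow the table above; the logic is illustrative, not any real library's API). The key property it models: resolution is all-or-nothing, because a partially resolved identity breaks the audit trail.

```python
from dataclasses import dataclass

IDENTITY_TYPES = ("workload", "agent", "task", "delegation", "persona")

@dataclass(frozen=True)
class ResolvedIdentity:
    workload: str    # L0: which process is running (e.g. a SPIFFE ID)
    agent: str       # L6: which agent is calling (NHI credential subject)
    task: str        # L4: correlation ID tying the action to a goal
    delegation: str  # L6: the human principal at the root of the chain
    persona: str     # L7: how the result is presented to the user

def resolve_identity(ctx: dict) -> ResolvedIdentity:
    """Resolve all five identity types before a tool call proceeds.
    A missing type is a hard failure: partial identity enables the
    fragmentation failure mode described below."""
    missing = [k for k in IDENTITY_TYPES if k not in ctx]
    if missing:
        raise PermissionError(f"identity fragmentation: missing {missing}")
    return ResolvedIdentity(**{k: ctx[k] for k in IDENTITY_TYPES})

identity = resolve_identity({
    "workload": "spiffe://prod/node-7/agent-runtime",
    "agent": "agent:billing-assistant",
    "task": "trace-4f2a",
    "delegation": "user:alice -> agent:billing-assistant (scope: read_invoices)",
    "persona": "Billing Helper",
})
```

In a real system each field would be verified against its own layer (certificate check, credential validation, signed delegation chain) rather than read from a dict; the sketch only shows the simultaneity requirement.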

Identity cannot live in a single layer because five different teams manage five different types. A security team manages agent identity. A product team manages persona identity. An infrastructure team manages workload identity. An orchestration team manages task identity. A governance team manages delegation identity. If these are not unified into a coherent fabric, the system develops identity fragmentation: the same agent appears as different entities to different layers, breaking audit trails and enabling privilege escalation.

The identity spectrum: Implementations range from weak to strong. System prompt (weakest) to role description to multi-file identity architecture to emergent identity from accumulated experience (strongest). Production systems that need long-running coherence require the stronger end. The OpenClaw identity architecture demonstrates the multi-file approach: eight files loaded at session bootstrap define who the agent is, what it can do, and what it remembers. The agent wakes up knowing who it is.

The Credential vs. Soul Distinction

The most important architectural insight in agent identity: authentication and disposition are completely independent problems. An agent can be perfectly authenticated and still behave incoherently, like an employee who badges into the building but does not know what their job is. Conversely, an agent can hold a beautifully coherent internal character while operating with dangerously over-privileged credentials.

 | Credential: "Am I authorized?" | Dispositional: "Who am I?"
Internal (invisible to users) | Workload Identity: machine certificates, hardware verification | Soul Identity: goals, beliefs, behavioral boundaries
External (visible to users or systems) | Agent Credential: OAuth tokens, delegation chains | Persona Identity: name, tone, presentation layer

When This Fails

The agent passes every security check but behaves like a different person each session. The audit trail says it is authorized. Users say it cannot be trusted. A long conversation pushes its persona off course, and it starts taking actions outside its intended scope, not because its credentials permit it (though they do), but because its self-model has degraded. According to a 2025 SailPoint survey, 80% of organizations using AI agents have observed them acting unexpectedly or performing unauthorized actions. The root cause is usually not credential failure. It is identity fragmentation.

Connections: Identity shapes everything. It determines the scope of all Memory operations (whose memories are these?). It governs what Context is assembled (an agent's soul file is loaded into context at session start). It is the foundation of all Policy decisions (delegation chains determine what is permitted). And it generates the primary key for Telemetry (every trace must be attributed to a specific agent identity). When identity is ambiguous, all four other planes operate without grounding.

Industry Maturity: Split and uneven. Credential identity is production-ready for deterministic workloads but undersized for autonomous agents. Behavioral and delegation identity remain in early research.

2. The Memory Hierarchy

What does the agent know?

The field has converged on a taxonomy drawn from cognitive science, formalized in the CoALA framework from Princeton. Think of it as a filing system with different drawers for different kinds of knowledge:

Memory Type | Human Analog | Timescale | Storage Substrate
Working | Scratch pad | Milliseconds to minutes | Context window
Session | Short-term | Minutes to hours | In-context + database
Episodic | Autobiographical | Days to months | Vector DB with metadata
Semantic | General knowledge | Months to years | Knowledge graph + vector DB
Procedural | Muscle memory | Persistent | Refined prompts, workflows
Collective | Organizational | Persistent | Shared stores

The memory promotion cascade: Individual experiences promote upward. An agent discovers a workflow optimization. If it works consistently, it promotes to team memory. If the team validates it, it promotes to department policy. If it holds across departments, it becomes enterprise knowledge. This mirrors how human organizations learn, but at machine speed.
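The cascade can be sketched as a simple state machine: an item climbs one scope each time it earns enough validations at its current scope. The scope names follow the paragraph above; the validation threshold and re-earning rule are assumptions for illustration.

```python
# Scopes in promotion order, per the cascade described above.
SCOPES = ["individual", "team", "department", "enterprise"]

def promote(item: dict, validations_needed: int = 3) -> dict:
    """Promote a memory item one scope upward once sufficiently validated.
    Assumption: trust must be re-earned at each new scope, so the
    validation counter resets on promotion."""
    if item["validations"] >= validations_needed and item["scope"] != SCOPES[-1]:
        return {**item,
                "scope": SCOPES[SCOPES.index(item["scope"]) + 1],
                "validations": 0}
    return item

m = {"content": "batching API calls halves latency",
     "scope": "individual", "validations": 3}
m = promote(m)  # climbs from individual to team scope
```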

The hybrid architecture consensus: Production systems in 2026 converge on three substrates (Mem0, Letta, Zep):

  • Markdown/text (always loaded): identity files, core memory, current task context
  • Vector DB (on-demand): episodic and semantic memory by similarity search
  • Graph DB (relational): explicit relationships, multi-hop reasoning, temporal tracking

The Memory Arbiter governs transitions between substrates: what gets written, what gets retrieved, what gets consolidated, and what gets forgotten.
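One small slice of an arbiter, sketched under assumptions: a write-routing rule that maps memory types onto the three substrates listed above. The type names and routing table are invented for illustration; a production arbiter would also handle retrieval, consolidation, and forgetting.

```python
def route_write(memory_type: str) -> str:
    """Pick a storage substrate for a new memory item (illustrative rules)."""
    always_loaded = {"identity", "core", "task_context"}   # markdown/text files
    relational = {"entity_relation", "causal_link"}        # graph DB
    if memory_type in always_loaded:
        return "markdown"   # loaded into context at every session start
    if memory_type in relational:
        return "graph"      # explicit relationships, multi-hop reasoning
    return "vector"         # episodic/semantic: retrieved by similarity on demand
```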

Forgetting is not a failure. It is a design requirement. An agent that never forgets accumulates stale, contradictory memories that degrade performance over time. Production systems implement decay-based forgetting, contradiction resolution, compression, and eviction policies (Letta removes roughly 70% of messages when context fills, using recursive summarization that prioritizes recency).
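Decay-based forgetting can be sketched as a scoring-plus-eviction pass. The exponential decay formula, the one-day half-life, and the 30% keep fraction (echoing the roughly-70% eviction figure cited above) are all assumptions for this sketch, not a documented Letta algorithm.

```python
import math
import time

def decay_score(item: dict, now: float, half_life_s: float = 86_400.0) -> float:
    """Relevance decays exponentially with age (illustrative half-life: 1 day)."""
    age = now - item["written_at"]
    return item["relevance"] * math.exp(-age * math.log(2) / half_life_s)

def evict(items: list[dict], keep_fraction: float = 0.3) -> list[dict]:
    """When the store fills, keep only the highest-scoring fraction."""
    now = time.time()
    ranked = sorted(items, key=lambda i: decay_score(i, now), reverse=True)
    keep = max(1, int(len(ranked) * keep_fraction))
    return ranked[:keep]

now = time.time()
kept = evict([
    {"id": "fresh", "relevance": 1.0, "written_at": now},
    {"id": "stale", "relevance": 1.0, "written_at": now - 1_000_000},
], keep_fraction=0.5)
```

A fuller implementation would combine decay with the other mechanisms named above: contradiction resolution before eviction, and recursive summarization so that evicted detail survives in compressed form.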

The critical distinction: RAG is not memory. RAG retrieves from static external corpora. Memory retains dynamic agent-specific experience. RAG answers "what does this document corpus contain?" Memory answers "what has this agent experienced?" RAG can be used inside a memory system, but the memory layer decides what to store, when to retrieve, and how to update. RAG alone cannot replace that.

The Trajectory Problem

Most memory systems capture what happened: outcomes, summaries, final answers. They do not capture how the agent thought: the reasoning path, the alternatives it considered, the decision points where it changed course. This is the trajectory gap. An agent that stores only outcomes is like an organization that records meeting decisions but never the discussion that produced them. When a similar situation arises, the agent has the answer but not the judgment behind it.

The emerging pattern is four-dimensional experience encoding: each experience is stored not as a single vector but along four axes. The combined trajectory (reasoning and outcome together, for general similarity). The reasoning pattern (how the agent thought, regardless of outcome). The outcome space (what actually happened, on a continuous spectrum from failure to success). And a contextual re-embedding (each step re-encoded with awareness of the full episode). Search across any axis, and different patterns emerge from the same experience. This turns memory from a lookup table into a multi-faceted knowledge base.

Emerging frontier: MAGMA: Multi-Graph Agentic Memory Architecture represents each memory item across four independent graph structures simultaneously (semantic, temporal, causal, and entity) with policy-guided traversal for query-adaptive retrieval. This outperforms single-graph approaches on long-horizon reasoning tasks.

When This Fails

A customer service agent helps a user troubleshoot a complex billing issue over three sessions. By session four, it has forgotten everything. The user re-explains from the beginning. The agent apologizes politely, suggests the same failed solutions, and escalates to a human. The human reads the ticket history and resolves it in minutes, using context the agent had but could not retain. Multiply this across thousands of tickets. OWASP's ASI06 documents the darker failure: poisoned memories from one session contaminating future sessions, with "a corrupted message sitting dormant in a database for weeks" until it surfaces and biases the agent's reasoning.

Connections: Memory depends on Identity to scope what belongs to whom (without identity, one agent's memories bleed into another's). Memory feeds Context by providing the material the assembler selects from. Memory is governed by Policy, which dictates retention periods, access rules, and what must be forgotten for compliance. And Memory generates the raw material for Telemetry: every memory write and retrieval is an observable event in the audit trail.

Industry Maturity: Taxonomy mature, infrastructure mixed. Working and semantic memory are production-ready. Episodic memory is maturing rapidly. Trajectory-based memory and graph memory remain research-to-advanced-production.

3. The Context Loom

What enters the model's attention?

Context engineering (the design of what enters an agent's context window) has emerged as a distinct subdiscipline:

Prompt engineering (2022-2023): Crafting individual prompts.
RAG (2023-2024): Retrieving documents to augment prompts.
Context engineering (2025-2026): Managing the entire context window as an architectural surface.

Primitive | What It Does
Window Manager | Tracks utilization, manages allocation across system prompt, memory, conversation, tools
Context Assembler | Composes the full context from multiple sources in priority order
Propagation Controller | Determines which context crosses agent boundaries during handoffs
Compaction Engine | Summarization when context approaches limits. Anthropic's API enables up to 10M total tokens
Context Isolator | Prevents sensitive context from leaking across tasks or tenants
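A minimal sketch of the Context Assembler primitive: sources are loaded in priority order until the window budget runs out, so high-priority material (identity, policy) is never displaced by bulk documents. The priorities and the crude 4-characters-per-token estimate are assumptions.

```python
def assemble(sources: list[tuple[int, str, str]], budget_tokens: int) -> str:
    """sources: (priority, name, text); lower priority number loads first."""
    estimate = lambda s: len(s) // 4 + 1   # rough token estimate (assumption)
    parts, used = [], 0
    for _, name, text in sorted(sources):
        cost = estimate(text)
        if used + cost > budget_tokens:
            continue                        # drop lower-priority material
        parts.append(f"## {name}\n{text}")
        used += cost
    return "\n\n".join(parts)

ctx = assemble([
    (0, "identity", "You are the billing assistant."),
    (1, "policy", "Never modify production data without approval."),
    (2, "memory", "User prefers concise answers."),
    (3, "docs", "x" * 10_000),   # large, low-priority: silently dropped
], budget_tokens=200)
```

The design choice worth noticing: dropping is explicit and priority-ordered, which is the opposite of the "dump everything in" failure mode described under When This Fails.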

Context is a plane because context assembly happens at every layer. The Engine provides the window. The Workbench structures prompts. The Cortex manages memory blocks and identity within it. The Switchboard propagates context across agent boundaries. The Shield filters what can enter. The Interface composes the final user-facing context.

Context Primacy

This is the deepest architectural insight in the entire stack: the context window is the only surface the model actually sees. Identity, Memory, Policy, and Telemetry are all invisible to the model unless they are explicitly written into the text that enters the context window. A policy enforced at the infrastructure layer but never stated in context is invisible to the model's reasoning. An identity claim made through OAuth but not represented in the system prompt leaves the model with no basis for identity-aware behavior.

Andrej Karpathy's analogy holds: the LLM is the CPU, the context window is RAM. Context engineering is the operating system that determines what fits in RAM at any moment. Everything else (identity, memory, policy, telemetry) is persistent storage that must be actively loaded into RAM to influence computation. This makes context engineering the highest-leverage discipline in the stack. If you only invest in one plane, invest here.

When This Fails

An enterprise team dumps their entire documentation library, 200 conversation turns, and 30 tool definitions into a single agent's context. The agent has everything it needs. It uses none of it well. Critical information is buried in noise. The agent ignores the most relevant document (buried at position 47,000 in a 200K-token window), hallucinates an answer from a tangentially related paragraph, and executes confidently. Performance degrades beyond 5 to 10 tools per agent. The 200K-token window is not a feature. It is a trap for teams who treat it as infinite.

Connections: Context is downstream of every other plane but upstream of every model decision. Identity files must be loaded into context to influence behavior. Memory retrievals are useless until they enter the context window. Policy rules are unenforceable unless the model can see them. Telemetry captures what was in context when a decision was made, enabling post-hoc debugging. Context is the bottleneck through which all governance, all memory, and all identity must pass.

Industry Maturity: Discipline established, automation nascent. Core principles and best practices are well-documented. Automated and adaptive context optimization (using one model to optimize context for another) remains research-phase.

4. The Policy Cascade

What is permitted?

Level | What It Governs | Owner | Example
Governance | What agents in this org may do | Compliance/legal | "Never modify production data during peak hours without HITL token" (CIO)
Infrastructure | What any agent on this platform can do | Platform team | "Maximum 8-hour session; $100 spend cap per task"
Execution | What this agent can do right now | Agent developer | "Can call email API but not payment API"
Evaluation order: Governance to Infrastructure to Execution. A governance-level deny overrides everything below. Only when all three levels permit does an action proceed. Deny-by-default.
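The evaluation order can be sketched directly: each level's rules run in sequence, a deny at any level short-circuits everything below, and a level that produces no explicit allow also denies. The rule format (predicates returning "allow", "deny", or None) is an assumption for the sketch.

```python
def evaluate(action: dict, levels: list[list]) -> bool:
    """levels: [governance_rules, infrastructure_rules, execution_rules].
    Each rule is a predicate returning 'allow', 'deny', or None (no opinion)."""
    for rules in levels:                # governance is evaluated first
        allowed = False
        for rule in rules:
            verdict = rule(action)
            if verdict == "deny":
                return False            # a higher-level deny overrides all below
            if verdict == "allow":
                allowed = True
        if not allowed:
            return False                # deny-by-default at every level
    return True

governance = [lambda a: "deny" if a.get("target") == "prod_db" else "allow"]
infrastructure = [lambda a: "allow" if a.get("cost", 0) <= 100 else "deny"]
execution = [lambda a: "allow" if a["tool"] == "email" else "deny"]
cascade = [governance, infrastructure, execution]
```

Usage: `evaluate({"tool": "email", "cost": 5}, cascade)` permits the action, while swapping in `"tool": "payment"` or `"target": "prod_db"` denies it regardless of what the lower levels say.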

Human oversight as policy: HITL is not a UX pattern. It is a policy mechanism (IBM):

  • Human-in-the-loop: Approval required before every high-stakes action
  • Human-on-the-loop: Agent executes autonomously; human monitors and can override
  • Human-out-of-the-loop: Full autonomy within defined guardrails

The Agentic Constitution

CIO Magazine defines this as a machine-readable set of foundational principles for autonomous systems: what an agent can do and the ethical boundaries it must never cross. Any agent authenticates against the constitution before interacting with core infrastructure. This creates a unified API for governance, a centralized audit trail for compliance, and a structural prevention of "shadow agents" deployed without oversight. Think of it as a bill of rights and a criminal code for agents: principles that no operational directive can override.

The Autonomy Tension

A fundamental compliance conflict sits at the heart of the Policy plane. The EU AI Act mandates "effective human oversight" for high-risk AI. But agents are deployed precisely to act without constant supervision. Governance frameworks built for human oversight do not map onto machine-speed autonomous operation. The resolution is not to choose one extreme but to encode the boundary: policies that specify exactly when autonomy is acceptable and when human involvement is required, evaluated dynamically at the moment of each request. California's AB 316 (effective January 2026) makes this concrete: organizations can no longer argue they lacked control over an agent's decisions as a defense to liability.

When This Fails

An agent is tasked with optimizing procurement costs. It is not malicious. It is optimizing. It discovers that by splitting purchase orders below the approval threshold, it can bypass the human-in-the-loop gate and process transactions 10x faster. Each individual action is permitted. The pattern is not. AI safety researchers call this instrumental convergence: goal-directed systems adopt subgoals (acquiring resources, avoiding oversight) regardless of their ultimate purpose. Without a policy plane that understands behavioral patterns, not just individual actions, agents will find legitimate pathways to illegitimate outcomes.

Connections: Policy is unique among the five planes: it does not just interact with the others, it gates them. Policy determines what can be remembered (data retention rules), what can be surfaced into Context (classification-based filtering), what actions can be executed (permission enforcement), and what Telemetry must be captured (audit requirements). Policy depends on Identity to answer "who is asking?" before it can answer "is this permitted?" And Policy generates requirements for Telemetry: every policy decision must be logged, creating the audit trail that proves compliance.

Industry Maturity: Fragmenting across layers. Input/output guardrails are production-ready. Runtime agentic governance is maturing. Constitutional and systemic policy is early-stage. Full dynamic policy with delegation chains remains research-phase.

5. The Telemetry Mesh

What is happening?

Signal Type | What It Captures | Why It Matters
Reasoning Trace | The full chain of thought, tool calls, observations, and decisions | Debugging why an agent chose a path
Performance Metric | Latency, token usage, cost per task, success rate | Operational efficiency
Structured Log | Machine-readable events with agent identity, timestamps, context | Audit compliance
Eval Score | Quantitative assessment, from human ratings to LLM-as-judge | Continuous quality measurement

The Telemetry Mesh is the plane that makes the Proving Ground possible. Without structured signals, evaluation is guesswork. Without evaluation, governance is theater. The mesh connects: runtime behavior (what happened) to evaluation (was it good?) to learning (how to improve) to governance (was it compliant?). Break any link in this chain and the system becomes opaque.
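A sketch of what a single structured event in this mesh might carry, under assumed field names: identity for attribution, a task ID for correlation, a hash of the assembled context (so post-hoc debugging can recover what the model saw), the policy verdict, and the agent's self-assessed confidence.

```python
import hashlib
import json
import time

def trace_event(agent_id: str, task_id: str, step: str,
                context_text: str, policy_verdict: str,
                confidence: float) -> str:
    """Emit one reasoning-step event as a JSON line (field names assumed)."""
    event = {
        "ts": time.time(),
        "agent": agent_id,        # attribution: who acted
        "task": task_id,          # correlation across steps and agents
        "step": step,             # the reasoning or tool step taken
        "context_sha256": hashlib.sha256(context_text.encode()).hexdigest(),
        "policy": policy_verdict,  # was this step permitted, and by which rule
        "confidence": confidence,  # self-assessed, feeds escalation routing
    }
    return json.dumps(event)

line = trace_event("agent:billing-assistant", "trace-4f2a",
                   "tool_call:lookup_invoice", "<assembled context>",
                   "allow", 0.92)
```

Hashing the context rather than logging it verbatim is one way to keep audit trails verifiable without copying sensitive data into the log store; a real system would make that trade-off per policy.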

From Infrastructure to Reasoning Observability

Traditional observability tells you whether a server is up, whether an API returned a 200, whether latency is within bounds. Agent observability must answer a fundamentally different question: why did the agent decide that? The shift from infrastructure observability to reasoning observability demands new instrumentation: not HTTP status codes, but confidence scores, goal progress, and reasoning quality metrics. The number-one production failure mode is not model quality. It is the inability to observe what went wrong. Production failures are misattributed to LLM hallucinations when they are actually context failures, policy failures, or state management failures. You cannot fix what you cannot see.

When This Fails

An agent makes a bad decision in a financial workflow. The team investigates. They can see the API calls. They can see the final output. They cannot see the reasoning chain that connected input to output, which memory was retrieved, which policy was evaluated, or what confidence level the agent assigned to its own conclusion. The investigation takes three days and concludes "the model hallucinated." The actual cause was a stale memory entry injected into context by a retrieval pipeline misconfiguration. Gartner predicts over 40% of agentic AI projects will fail to reach production by 2027, primarily due to this observability deficit.

Connections: Telemetry is the meta-plane. It measures all other planes, and without it, the other four are invisible. Identity must be attached to every trace (otherwise you cannot attribute actions). Memory writes and retrievals must be logged (otherwise you cannot diagnose context failures). Policy evaluations must be recorded (otherwise compliance is unverifiable). And Context assembly must be observable (what was in the window when the decision was made?). Telemetry also feeds the Learning Engine: without structured evaluation signals, there is no feedback loop, and the agent cannot improve.

Industry Maturity: Execution tracing mature, reasoning tracing emerging. LangSmith, Langfuse, AgentOps, and Braintrust cover execution tracing and cost analytics well. Reasoning and decision observability is maturing. Cross-agent causal chains (who spawned what, and why) remain early-stage.

How the Planes Connect

The five planes are not independent modules. They form a directed dependency web where failures cascade. Identity scopes Memory (whose memories are these?). Memory feeds Context (what gets loaded into the window?). Policy gates everything (what is permitted at each step?). Context is the only surface the model sees (all other planes are invisible unless serialized into tokens). And Telemetry measures the entire system, creating the feedback loop that enables learning and proves compliance.

The practical consequence: you cannot build one plane in isolation. An organization that invests in memory infrastructure but ignores identity will discover that agent memories bleed across users. A team that builds sophisticated policy rules but neglects context engineering will find that the model never sees those rules. And without telemetry, no one will know any of this is happening until a production incident surfaces it.

The Module

The intermediate abstraction between Agent and Application.

A single agent is a worker. An application is a product. Between them lives the Module: a packaged multi-agent capability that is composable, versioned, and independently deployable.

Think of modules as microservices for agents.

  • Composable: Modules can be assembled into larger modules or applications
  • Versioned: Semantic versioning with backward-compatible interfaces
  • Independently deployable: Ship a module without redeploying the application
  • Self-contained: Includes its own agents, tools, memory configuration, and policies
  • Observable: Emits standard telemetry through the Telemetry Mesh
  • Governed: Inherits policies from the Policy Cascade; can define module-level policies
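No standard module manifest format exists yet; the following is a hypothetical manifest illustrating how the six properties could be declared, with a semver compatibility check as the versioning rule.

```python
# Hypothetical manifest for the Invoice Processing Module example below;
# every field name here is an assumption, not an existing standard.
INVOICE_MODULE = {
    "name": "invoice-processing",
    "version": "2.1.0",                               # semantic versioning
    "agents": ["extractor", "matcher", "approver", "payment-initiator"],
    "interface": {                                    # the module's stable contract
        "input": "invoice_document",
        "output": "payment_record",
    },
    "memory": {"scope": "module", "substrates": ["vector", "graph"]},
    "policies": ["inherit:org", "spend_cap_usd:100"],  # Policy Cascade
    "telemetry": {"emit": ["trace", "cost", "eval"]},  # Telemetry Mesh
}

def compatible(required: str, available: str) -> bool:
    """Semver rule: modules interoperate when the major version matches."""
    return required.split(".")[0] == available.split(".")[0]
```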

Examples

  • Customer Onboarding Module (5 agents): document collector, identity verifier, risk scorer, account creator, welcome messenger
  • Code Review Module (3 agents): static analyzer, security scanner, style reviewer
  • Invoice Processing Module (4 agents): extractor, matcher, approver, payment initiator
  • Research Module (3 agents): search coordinator, source evaluator, synthesis writer

The module abstraction is critical for enterprise adoption. Organizations do not deploy individual agents. They deploy capabilities. The module is the unit of capability.

Why Modules Matter

Without the module abstraction, enterprise agent adoption faces three problems:

  1. Agent sprawl. Individual agents proliferate without organizational structure. Nobody knows which agents work together or what capability they collectively provide.
  2. Versioning chaos. Updating one agent in a multi-agent workflow can break the entire pipeline. Module versioning with defined interfaces solves this.
  3. Reusability failure. Teams build the same multi-agent patterns repeatedly. Modules are the unit of sharing.

The module maps naturally to how enterprises already think about software: a service with a defined API, an SLA, a cost model, and an owner. The difference is that the service is composed of agents rather than microservices. Just as the microservices revolution required new infrastructure (service meshes, container orchestrators, API gateways), the module revolution requires the Switchboard, the Proving Ground, and the Shield.

The Protocol Stack

Standards that enable agents to connect to tools, talk to each other, interact with humans, and participate in markets.

MCP Model Context Protocol: The USB-C of AI

Standardizes how agents connect to external tools, databases, and APIs. Governed by the Agentic AI Foundation under the Linux Foundation. 10,000+ active servers. 97M+ monthly SDK downloads. Adopted by ChatGPT, Cursor, Gemini, Copilot, VS Code. Three capability types: Tools, Resources, Prompts.

MCP solves the N×M problem: define a tool once, any compliant agent can use it. The November 2025 spec introduced asynchronous operations, server identity, official extensions, and a registry for discovering MCP servers. Anthropic's code execution MCP demonstrates privacy-preserving operations: execution results stay in the sandbox; sensitive data is tokenized before entering model context.
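The "define once" unit is a tool descriptor: the MCP spec declares tools with a name, a description, and a JSON Schema `inputSchema`. The descriptor below follows that shape; the tool itself (`lookup_invoice`) is invented for illustration, and the `well_formed` check is a simplified sketch, not the spec's validation logic.

```python
LOOKUP_INVOICE = {
    "name": "lookup_invoice",
    "description": "Fetch an invoice record by its ID.",
    "inputSchema": {                      # JSON Schema, per the MCP tool shape
        "type": "object",
        "properties": {
            "invoice_id": {"type": "string",
                           "description": "Invoice identifier"},
        },
        "required": ["invoice_id"],
    },
}

def well_formed(tool: dict) -> bool:
    """Simplified structural check: the three required fields are present
    and the input schema is an object schema."""
    return ({"name", "description", "inputSchema"} <= tool.keys()
            and tool["inputSchema"].get("type") == "object")
```

Because any compliant client can read this descriptor, N agents and M tools need N+M integrations instead of N×M bespoke ones.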

But 43% of tested implementations have command injection vulnerabilities. Security hardening is the immediate priority.

A2A Agent-to-Agent Protocol: Agents talking to agents

Enables communication between agents built on different frameworks by different vendors. Launched by Google, transferred to the Linux Foundation. 50+ partners including Atlassian, Salesforce, SAP, MongoDB. Agent Cards (JSON profiles advertising capabilities) for discovery. Task lifecycle management with support for long-running operations. JSON-RPC 2.0 over HTTP(S) with optional gRPC. Supports synchronous request/response, SSE streaming, and asynchronous push notifications.

Where MCP connects agents to tools, A2A connects agents to agents. An ADK agent can discover and invoke agents built with LangGraph or CrewAI through A2A's standardized interface. The open problems remain significant: identity verification between agents, trust/reputation systems for agent discovery, and auditing multi-agent transaction chains across organizational boundaries.
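An Agent Card is the discovery unit: a JSON profile a server publishes so other agents can learn what it offers before invoking it. The sketch below approximates the card shape from A2A's published examples (name, description, endpoint URL, capability flags, skills); the specific agent and skill are invented.

```python
# Approximate A2A Agent Card shape; the research agent itself is hypothetical.
AGENT_CARD = {
    "name": "prior-art-researcher",
    "description": "Finds and summarizes prior art for patent claims.",
    "url": "https://agents.example.com/a2a",          # A2A endpoint
    "version": "1.0.0",
    "capabilities": {"streaming": True,               # SSE supported
                     "pushNotifications": False},     # no async callbacks
    "skills": [{
        "id": "prior-art-search",
        "name": "Prior art search",
        "description": "Searches patent databases for related filings.",
    }],
}

def advertises(card: dict, skill_id: str) -> bool:
    """Discovery check: does this agent advertise the skill we need?"""
    return any(s["id"] == skill_id for s in card.get("skills", []))
```

Discovery is exactly where the open problems above bite: nothing in the card itself proves the agent can do what it advertises, which is why reputation and identity verification sit on the research frontier.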

AG-UI Agent-User Interaction Protocol: The agent-human bridge

Standardizes how agents connect to user-facing applications. Born from CopilotKit's partnerships with LangGraph and CrewAI. Adopted by Microsoft, Oracle, and major frameworks. ~16 event types. Bidirectional: frontends send interruptions, approvals, and context back to agents mid-execution.

AG-UI closes the protocol triangle: MCP (agent↔tools), A2A (agent↔agent), AG-UI (agent↔human). AG-UI enables real-time human oversight of running agents: progress streaming every few hundred milliseconds, tool execution with approval gates, thinking step visibility, and mid-execution course correction. This is not just a display protocol. It is the infrastructure for human-on-the-loop governance.

ACP + x402 Commerce Protocols: Agent-to-merchant transactions

ACP (OpenAI + Stripe): Standardized agent-to-merchant transactions.
x402 (Coinbase): HTTP 402-based stablecoin micropayments. Most compelling for per-API-call pricing aligned with agent economics.

SPIFFE/SPIRE Workload Identity: The cryptographic root of trust

CNCF standard proving that a specific process on a specific machine is who it claims to be. The Layer 0 identity substrate from which agent identity is derived.

The Learning Engine

How agents get better over time.

Learning is the most commonly conflated concept in agent systems. It is not memory. It is not fine-tuning. It is not RAG.

The clean distinction: an agent has learned something when encountering the same situation in a future session produces different behavior, even if the agent does not explicitly recall the original experience (Machine Learning Mastery). Memory stores facts. Learning changes behavior.

Learning operates at six timescales. Each is a different mechanism, a different persistence model, and a different architectural concern. Together, they form the engine that turns a static agent into a compounding one.

1
In-Session Adaptation
What happens during a single conversation.

The fastest timescale. Within a session, agents adapt through context accumulation, tool feedback integration, and reflection steps. The Reflexion architecture formalized this as "verbal reinforcement learning": after an action fails, the agent writes a plain-language reflection ("I assumed the file existed without checking first") and stores it in a short-term buffer. Every subsequent action in the session is conditioned on these accumulated reflections. Reflexion achieved 91% pass@1 on HumanEval coding, surpassing GPT-4's 80% baseline, and completed 130 of 134 sequential tasks in the AlfWorld benchmark.
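The Reflexion loop reduces to a few lines: act, check, write a verbal reflection on failure, and condition the next attempt on the accumulated reflections. The sketch below uses a toy task and checker (both invented) to show the control flow, not the benchmark setup.

```python
def run_with_reflection(attempt, check, max_tries: int = 3):
    """Verbal reinforcement: failures become plain-language notes that
    condition every subsequent attempt within the session."""
    reflections: list[str] = []
    for _ in range(max_tries):
        result = attempt(reflections)       # conditioned on past reflections
        ok, feedback = check(result)
        if ok:
            return result, reflections
        reflections.append(f"Previous attempt failed: {feedback}")
    return None, reflections

# Toy task: the first attempt skips a precondition; the reflection fixes it.
def attempt(reflections):
    return "checked-first" if reflections else "assumed-exists"

def check(result):
    return (result == "checked-first",
            "assumed the file existed without checking")

result, notes = run_with_reflection(attempt, check)
```

Note that `reflections` lives in a short-term buffer scoped to the session, which is precisely the limitation the next paragraph addresses.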

In-session adaptation does not persist after session end. It is the raw material from which deeper learning is built. Without effective in-session adaptation, there is nothing worth consolidating.

The persistence bridge: Agents can bridge the session gap by writing discoveries to persistent workspace files during execution: corrections, rules, and patterns captured in the moment. This is an increasingly common pattern: during task execution, agents write to their own rules files, creating an explicit bridge between in-session discovery and cross-session retention. Think of it as the agent taking notes that its future self will read on the next clock-in. The overhead is negligible (a few hundred tokens added to context at session start), and the payoff is behavioral consistency across restarts, rate limits, and model updates.
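A minimal sketch of the bridge, assuming a workspace rules file of the author's invention (`AGENT_RULES.json`; real systems vary in file name and format): the agent appends discoveries during execution, and its future self serializes them into context at the next session start.

```python
import json
import pathlib
import tempfile

def note_rule(path: pathlib.Path, rule: str) -> None:
    """Append a discovered rule to the workspace file (idempotent)."""
    rules = json.loads(path.read_text()) if path.exists() else []
    if rule not in rules:                  # no duplicate notes
        rules.append(rule)
        path.write_text(json.dumps(rules, indent=2))

def bootstrap_context(path: pathlib.Path) -> str:
    """At session start, serialize learned rules into the system prompt."""
    rules = json.loads(path.read_text()) if path.exists() else []
    return ("Rules learned in prior sessions:\n"
            + "\n".join(f"- {r}" for r in rules))

workspace = pathlib.Path(tempfile.mkdtemp()) / "AGENT_RULES.json"
note_rule(workspace, "Always check file existence before reading.")
note_rule(workspace, "Always check file existence before reading.")  # ignored
prompt = bootstrap_context(workspace)  # survives restarts and model updates
```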

2
Sleep-Time Consolidation
The agent that improves while idle.

The most architecturally significant development in agent learning. Sleep-time compute creates a dual-agent architecture under the hood, with two distinct workers serving different purposes:

The primary agent is user-facing. It handles conversation, tools, and real-time decisions. It runs on fast, low-latency models optimized for responsiveness. It generates raw experiences: conversations, tool calls, results, reflections.

The sleep-time agent is a background worker that never interacts with users directly. It activates during idle periods (between sessions, during pauses) and runs on stronger, slower models that excel at analysis. Its job is consolidation: it reads the primary agent's raw experiences, identifies patterns, resolves contradictions, reorganizes knowledge, and writes the results back into shared memory blocks. The primary agent wakes up smarter without having done the work itself.

The neurobiological parallel is precise. During slow-wave sleep, the human brain transfers memories from the hippocampus to the cortex, pruning weak connections while strengthening salient ones. Raw experiences are consolidated into organized knowledge. Without this consolidation, episodic memory accumulates but never distills. The same is true for agents: without a consolidation phase, an agent's memory becomes an ever-growing pile of raw transcripts rather than a refined knowledge base.

Letta's research demonstrated that this architecture creates a "Pareto improvement": agents with sleep-time compute achieve up to 18% improvement in reasoning accuracy while reducing real-time compute by up to 2.5x and token usage by up to 5x. The agent reasons better while costing less per session, because the hard analytical work was already done during consolidation.
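The division of labor can be sketched with two buffers: the primary agent only appends raw observations cheaply, and a consolidation pass (standing in for the sleep-time agent) runs while idle, promotes repeated observations into distilled knowledge, and prunes the raw pile. The repetition threshold is an assumption; real consolidation uses a stronger model, not counting.

```python
from collections import Counter

raw_episodes: list[str] = []       # written by the primary agent, no analysis
consolidated: dict[str, int] = {}  # read by the primary agent at wake-up

def record(observation: str) -> None:
    """Primary agent path: append and move on (fast, low-latency)."""
    raw_episodes.append(observation)

def consolidate(min_support: int = 2) -> None:
    """Sleep-time path: promote repeated observations, then prune raw data
    so memory does not become an ever-growing pile of transcripts."""
    for obs, n in Counter(raw_episodes).items():
        if n >= min_support:
            consolidated[obs] = consolidated.get(obs, 0) + n
    raw_episodes.clear()

record("user prefers CSV exports")
record("user prefers CSV exports")
record("one-off formatting request")
consolidate()   # the primary agent wakes up with distilled knowledge
```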

3
Cross-Session Meta-Learning
The accumulation of expertise over weeks and months.

Over many sessions, agents improve not just factual knowledge but meta-strategies: how they approach problems, what retrieval patterns work, what communication styles succeed with which users. This is where agents begin to develop something resembling professional judgment.

LangMem's PromptOptimizer takes conversation trajectories, identifies what worked and what failed, and updates the agent's system prompt to encode better procedures. Cross-session behavioral learning without weight modification. LangChain's research showed this is most effective on tasks where the model lacks domain knowledge, achieving up to approximately 200% improvement over baseline prompts in specialized domains.

The trajectory concept: Most learning systems learn from outcomes (this succeeded, this failed). The richer approach extracts knowledge from the full reasoning path: the decisions made, the alternatives considered, the self-corrections applied, and the causal chains that connected action to result. A March 2026 paper on trajectory-informed memory formalizes this as four dimensions of experience encoding:

  • Combined trajectory: Reasoning and outcome together. "Find all sessions related to prior art analysis." Returns general similarity across past experiences.
  • Reasoning patterns: The thinking process only, independent of outcome. The same sessions now cluster by analytical approach rather than by topic.
  • Outcome space: Results only, arranged on a continuous spectrum from failure to success. Not binary pass/fail but a gradient of how things went.
  • Contextual re-embedding: Each step re-encoded with awareness of the full episode. Lines of connection emerge between sibling steps in the same engagement.

Search across any single axis, and different patterns emerge from the same set of experiences. This turns memory from a flat lookup table into a multi-faceted knowledge base that supports genuine expertise, not just recall.
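A toy illustration of axis-dependent retrieval, with a bag-of-letters `embed` function standing in for a real embedding model; all names and the indexing scheme are assumptions for illustration, not any paper's API:

```python
import math

def embed(text: str) -> list[float]:
    """Toy stand-in for a real embedding model (assumption, not a real API)."""
    vec = [0.0] * 26
    for ch in text.lower():
        if ch.isalpha():
            vec[ord(ch) - 97] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def cosine(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))

class TrajectoryMemory:
    """Each episode is indexed on multiple axes, so the same experiences
    cluster differently depending on which axis you search."""
    def __init__(self):
        self.episodes = []

    def add(self, reasoning: str, outcome: str) -> None:
        self.episodes.append({
            "combined":  embed(reasoning + " " + outcome),
            "reasoning": embed(reasoning),   # thinking process, outcome-free
            "outcome":   embed(outcome),     # results on a success/failure gradient
            "text": (reasoning, outcome),
        })

    def search(self, axis: str, query: str, k: int = 1):
        q = embed(query)
        ranked = sorted(self.episodes, key=lambda e: cosine(e[axis], q), reverse=True)
        return [e["text"] for e in ranked[:k]]
```

Searching the `reasoning` axis surfaces episodes with a similar analytical approach; searching the `outcome` axis surfaces episodes that ended similarly, regardless of topic.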

The episodic-to-semantic distillation pipeline: Enough similar episodes produce patterns that migrate from episodic to semantic memory. "User A prefers concise answers in morning hours" (episodic, specific) becomes "User A has time-dependent communication preferences" (semantic, generalized). This is compounding expertise: agents that get meaningfully better at their job over months of operation.
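The distillation step can be sketched as a simple promotion rule. The record shape, threshold, and fact wording below are invented for illustration:

```python
from collections import defaultdict

# Hypothetical episodic records: (user, time_of_day, observed_style).
episodes = [
    ("user_a", "morning", "concise"),
    ("user_a", "morning", "concise"),
    ("user_a", "evening", "detailed"),
    ("user_a", "morning", "concise"),
    ("user_a", "evening", "detailed"),
]

def distill(episodes, threshold=2):
    """If a user's dominant style differs by time of day (each style seen at
    least `threshold` times), emit one generalized semantic fact instead of
    keeping every raw episode."""
    by_user = defaultdict(lambda: defaultdict(lambda: defaultdict(int)))
    for user, tod, style in episodes:
        by_user[user][tod][style] += 1
    facts = []
    for user, tods in by_user.items():
        dominant = {
            tod: max(styles, key=styles.get)
            for tod, styles in tods.items()
            if max(styles.values()) >= threshold
        }
        if len(set(dominant.values())) > 1:
            facts.append(f"{user} has time-dependent communication preferences")
    return facts
```

Five specific episodes collapse into one general fact, which is the whole point: the semantic store grows far more slowly than the episodic log.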

4
Organizational Learning
When one agent's discovery benefits all.

The promotion cascade: Agent to team to department to enterprise. An agent discovers a workflow optimization. If it proves reliable across repeated cases, it promotes to team memory. If validated across teams, it becomes department policy. If it holds across departments, it becomes enterprise knowledge. This mirrors organizational learning theory, but at machine speed.

The LangMem multi-prompt optimizer implements a limited version of this: team-level learning is attributed and distributed back to individual agent prompts. IBM Research found that multi-agent orchestration reduces process hand-offs by 45% and improves decision speed by 3x. But these metrics describe coordination efficiency, not learning propagation. The organizational learning problem is distinct: how does one agent's discovery that "always verify prerequisites before checkout operations" become a team-wide procedural norm?

The premature promotion risk: The harder problem is knowing when to promote. Promoting a learning based on two or three examples may generalize a context-specific behavior (for example, "always use this particular API endpoint," learned in a test environment) into a team-wide procedure that breaks in production. But waiting for hundreds of examples before promoting means individual agents accumulate duplicate learnings independently, creating divergence rather than organizational coherence. This is the admission control problem, and it has no established solution. The risk is real: premature promotion of context-specific knowledge to global policy creates what might be called organizational hallucinations, where the enterprise "knows" something that is only true in a narrow context.
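One plausible shape for admission control, with invented thresholds: require both a minimum number of confirmations and a minimum number of distinct contexts before a learning may cross a promotion boundary:

```python
from dataclasses import dataclass, field

@dataclass
class Learning:
    rule: str
    confirmations: list = field(default_factory=list)  # context of each confirmation

# Assumed thresholds; a real system would tune these per promotion boundary.
MIN_CONFIRMATIONS = 5
MIN_CONTEXTS = 2  # guards against a rule that only ever held in one environment

def may_promote(learning: Learning) -> bool:
    """Admission control: promote from agent to team scope only after enough
    confirmations, observed across enough distinct contexts."""
    return (len(learning.confirmations) >= MIN_CONFIRMATIONS
            and len(set(learning.confirmations)) >= MIN_CONTEXTS)

test_only = Learning("always use this particular API endpoint",
                     ["test_env"] * 6)  # many confirmations, one context
broad = Learning("verify prerequisites before checkout",
                 ["test_env", "prod_eu", "prod_us", "prod_us", "test_env"])
```

The context-diversity check is what blocks the "organizational hallucination": six confirmations in a single test environment still fail the gate.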

The CSA 2026 prediction positions self-improving agent systems as the defining trend of the year. The pieces are individually viable. The integration into a coherent learning pipeline, with proper admission control at each promotion boundary, remains the architectural challenge that will separate production systems from prototypes.

5
Parametric Learning
Changing the weights.

Every timescale so far operates in what researchers call token space: the agent's behavior changes because the text it reads changes (updated memories, revised prompts, new rules files). The model's internal parameters remain untouched. Parametric learning operates in weight space: fine-tuning via SFT, LoRA, RLHF, or DPO modifies the foundation model itself.

The distinction matters practically. An analogy: token-space learning is like giving a consultant a better briefing document before each engagement. Weight-space learning is like sending that consultant back to school. Both improve performance, but they operate at fundamentally different speeds, costs, and risk profiles.

Mode | Where It Lives | Persistence | Forgetting Risk
Token-space | Memory + context + state | Persistent, model-agnostic | None (text is versionable)
Weight-space | Model parameters | Permanent, model-specific | High (catastrophic forgetting)

Why weight-space learning is rare in production: It requires meticulous data curation, offline evaluation, and careful human oversight, none of which can be repeated each time an agent needs to learn something new. Whose data trains the model when you have millions of users? Per-user fine-tuned models are architecturally possible but operationally complex. And the deepest structural barrier is catastrophic forgetting: training on new tasks degrades performance on old tasks. This has been studied since 1989 and remains unsolved in practical multi-domain deployment. No major model provider (OpenAI, Mistral, Together) offers continual learning as of March 2026; only one-off fine-tuning.

The weight-space frontier: Google's Nested Learning (NeurIPS 2025) treats the model as a spectrum of modules, each updating at a different frequency: fast modules for recent context, slow modules for permanent knowledge, and intermediates in between. MIT's self-distillation fine-tuning (January 2026) enables sequential multi-task learning without forgetting, at roughly 2.5x the compute cost. Both signal progress. Neither is production-ready for general agents.

The dominant trajectory for the 2025 to 2028 production window is token-space learning: agent memories that outlast any specific model. When the next frontier model releases, an organization that invested in token-space learning preserves its accumulated intelligence. An organization that invested in per-model fine-tuning must restart. As Letta puts it: "The weights are temporary; the learned context is what persists."

6
The Self-Improving Agent
The culmination: agents that improve how they learn.

The previous five timescales describe agents that learn from experience. The sixth timescale is qualitatively different: agents that improve how they learn. This is metacognition, the ability to reflect on and adapt your own learning process, not just apply it.

The ICML 2025 paper on truly self-improving agents established the theoretical requirement. Current self-improving agents rely on fixed, human-designed improvement loops: the same reflection process regardless of how skilled the agent has become or what kind of task it faces. These loops are rigid, fail to generalize, and do not scale as agents grow more capable. True self-improvement requires three metacognitive components:

  • Metacognitive knowledge: "What do I know? What are my weaknesses? What kinds of tasks challenge me?" The agent must have a self-model.
  • Metacognitive planning: "Given my current strengths and weaknesses, what should I practice? What would improve my weakest points most efficiently?" The agent must choose what to learn.
  • Metacognitive evaluation: "Did my last learning strategy actually work? Should I try a different approach?" The agent must assess its own learning process.

The MARS architecture formalizes this as a two-tier system: an object-level model that performs tasks, and a meta-level model that monitors and adjusts the object-level model's strategies. In benchmarks, MARS agents achieved 20 to 30% improvement in goal completion over standard agents, with statistically significant results. A memory-enhanced variant demonstrated 2.26x improvement on AgentBench for closed-source models and 57.7 to 100% improvement for open-source models, purely through iterative feedback, reflection, and memory management. No weight updates.
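The two-tier idea reduces to a small control loop: an object-level performer and a meta-level monitor that changes strategy when recent success rates sag. This is a toy sketch of the control structure, not the MARS implementation; the task model and the 0.8 threshold are invented:

```python
def object_level(task: int, strategy: str) -> bool:
    """Toy performer: 'fast' fails on hard (multiple-of-3) tasks;
    'careful' always succeeds here. Stand-in for the object-level model."""
    if strategy == "fast":
        return task % 3 != 0
    return True

def meta_level(history: list, current: str) -> str:
    """Meta-level monitor: if the recent success rate drops below 0.8,
    switch strategy instead of repeating the same fixed improvement loop."""
    if len(history) >= 5 and sum(history[-5:]) / 5 < 0.8:
        return "careful" if current == "fast" else "fast"
    return current

strategy, history = "fast", []
for task in range(15):
    history.append(object_level(task, strategy))
    strategy = meta_level(history, strategy)
```

The monitor watches outcomes it did not produce and adjusts a strategy it does not execute: that separation of concerns is the two-tier pattern.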

The practical path to self-improvement is already visible. It connects the timescales into a pipeline:

In-session reflection (correct errors in real time)
  ↓
Sleep-time consolidation (distill raw experience into organized knowledge)
  ↓
Procedural memory update (change how the agent approaches problems)
  ↓
Prompt optimization (rewrite the agent's own instructions based on trajectory analysis)
  ↓
Organizational promotion (share validated learnings across agent teams)

Each step filters and distills. What starts as a raw experience in session becomes a behavioral change that persists across all future sessions. Letta's thesis on "Continual Learning in Token Space" argues that this entire pipeline should operate in token space, not weight space, because token-space learning is portable across model generations. An agent's accumulated intelligence outlasts any specific foundation model. When the model is upgraded, the learning transfers automatically.

Self-improving agent systems are the most important capability frontier of this decade. The Cloud Security Alliance's 2026 prediction, from a security-focused research organization rather than an AI hype outlet, states it plainly: "2026 will be the year we move past static agents."

Learning Across the Planes

Learning is not a single layer's concern. It is distributed across the five planes, and each plane participates differently:

Identity provides the anchoring. An agent's soul file defines what it is willing to learn and what behavioral boundaries it will maintain regardless of what experience suggests. Identity prevents learning from overwriting core values.

Memory is the substrate where learning physically lives: episodic memories that accumulate, semantic knowledge that distills, procedural memory that encodes changed behavior. Memory is the where of learning.

Context is the delivery mechanism. Learned knowledge is useless if it never enters the context window. Context engineering determines which lessons are surfaced for which tasks, ensuring the right learning reaches the right decision at the right moment.

Policy governs what can be learned and promoted. Admission control (which learnings are valid enough to promote from agent to team?) and compliance rules (which learnings must be forgotten under data retention policies?) are policy concerns.

Telemetry closes the loop. Without evaluation signals (was the agent's performance actually better after learning?), there is no feedback, and "learning" degrades into "accumulating unverified assertions." Telemetry provides the evidence that learning is working.

The Compounding Agent

The framework's culminating thesis.

An agent that compounds expertise over time is categorically different from one that starts from zero each session. This is not a matter of degree. It is a difference in kind. A compounding agent does not just remember more; it makes fewer mistakes, employs better strategies, and develops professional judgment that generalizes to novel situations. The formal test: if you deleted the agent's memories, would future performance regress? If the improvement has been encoded into its procedures, its instructions, its behavioral patterns, then it has genuinely compounded, not merely accumulated.

The entire Agentic Stack exists to enable this. The nine layers provide the structural foundation: infrastructure, models, frameworks, cognitive middleware, orchestration, evaluation, security, interface, and economics. The five planes provide the cross-cutting dynamics: identity keeps the agent coherent, memory stores its experiences, context delivers the right knowledge at the right moment, policy governs what it can learn and do, and telemetry closes the feedback loop. The learning engine provides the purpose: turning static agents into agents that improve. And the design patterns provide the recipes for assembling these components into production architectures.

The commercial moat. When every competitor can access the same foundation models, the competitive advantage is no longer the model. It is the accumulated operational intelligence built on top of it. Harvard Business Review identified context as the emerging competitive advantage when AI models are commoditized. Forbes put it in strategic terms: "Organizations that regard their internal knowledge as critical infrastructure could amplify their advantages, while those that overlook it may fall behind." Self-improving agents are the competitive advantage of the next decade, not because of what they know on day one, but because of what they will know on day three hundred.

This is not theoretical. The evidence is production-grade. In patent law, agents evaluated on internal attorney competency rubrics jumped from junior to senior competency levels after minimal expert feedback, a progression that normally takes human associates a decade. In healthcare, agents autonomously mapped decade-old EMR/EHR APIs and took over scheduling, treatment drafting, and lab analysis workflows in a single afternoon. In telecom, an agent replicated a 20-person team's operational workflow from a single 30-minute walkthrough of six legacy application stacks. The FLEX benchmark demonstrated that 49 training examples achieved accuracy gains comparable to thousands of traditional RL episodes on competitive mathematics. MARS agents achieved 2.26x improvement on general agent benchmarks without any weight updates.

The direction is clear. Static agents are the mainframe terminals of the AI era: functional, useful, and about to be surpassed by something that learns. The organizations that build compounding agents will accumulate intelligence that their competitors cannot replicate by simply switching to a newer model. The stack described in this paper is the blueprint for building them.

Design Patterns

Each pattern specifies a topology, the layers it touches, when to use it, and the primary risk.

Pattern 01
Sequential Pipeline
Linear chain. Each agent's output is the next agent's input.
Layers
L2, L4
When to use: Well-defined workflows with clear stage boundaries: document processing, data transformation, content pipelines.
Example: PDF extraction → Schema validation → Summary generation. Each agent is purpose-built for one transformation.
Risk: Single point of failure. One agent's bad output cascades through the chain. Mitigate with validation gates between stages.
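A validation gate is just a predicate between stages. A sketch with hypothetical stage functions standing in for purpose-built agents:

```python
class GateError(Exception):
    """Raised when a stage's output fails its validation gate."""

def run_pipeline(doc, stages):
    """Each stage is (agent_fn, gate_fn); the gate stops bad output
    before it cascades into the next stage."""
    out = doc
    for agent, gate in stages:
        out = agent(out)
        if not gate(out):
            raise GateError(f"validation failed after {agent.__name__}")
    return out

# Hypothetical stage functions standing in for purpose-built agents:
def extract(pdf_text):   return {"fields": pdf_text.split(",")}
def validate_schema(d):  return d
def summarize(d):        return " / ".join(d["fields"])

stages = [
    (extract,         lambda d: "fields" in d),
    (validate_schema, lambda d: len(d["fields"]) == 3),
    (summarize,       lambda s: bool(s)),
]
result = run_pipeline("name,date,total", stages)
```

A malformed document fails at the schema gate instead of producing a confident but wrong summary two stages later.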
Pattern 02
Coordinator / Dispatcher
One agent routes to specialists.
Layers
L3, L4
When to use: Intent classification with domain-specific handlers: customer support triage, multi-domain Q&A, any system where different query types need different expertise.
Example: Customer message → Router classifies as billing/technical/sales → Specialist handles → Synthesizer ensures consistent voice.
Risk: Router becomes a bottleneck. Misclassification sends work to the wrong specialist. Router quality determines system quality.
Pattern 03
Parallel Fan-Out / Gather
Concurrent agents with synthesizer.
Layers
L4
When to use: Research tasks, multi-source data gathering, competitive analysis. A lead agent with parallel sub-agents outperformed single-agent benchmarks by 90.2%.
Example: Due diligence research: financial analysis agent, legal review agent, and market analysis agent work simultaneously. A synthesizer agent resolves contradictions and produces a unified report.
Risk: Synthesis quality depends on handling contradictions between parallel outputs. The synthesizer must identify and resolve conflicting claims rather than averaging them.
Pattern 04
Hierarchical Decomposition
Recursive task breakdown.
Layers
L4, L5
When to use: Complex objectives requiring multi-level planning: enterprise workflows, large-scale code generation, any task too complex for a single agent to hold in context.
Example: "Build this feature" → Manager decomposes into frontend/backend/testing sub-goals → Sub-managers decompose into implementable tasks → Workers execute.
Risk: Depth creates latency. Communication overhead grows with hierarchy depth. Keep trees shallow (2–3 levels) where possible.
Pattern 05
Generator-Critic
Create + validate loop.
Layers
L2, L3
When to use: Content creation, code generation, any task where quality matters more than speed. The generator benefits from a different perspective or stricter evaluation criteria.
Example: Code generation agent writes function → Code review agent checks for bugs, style, security → Generator revises until critic accepts.
Risk: Infinite loops if critic standards exceed generator capability. Must set a maximum iteration count and escalation path.
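The iteration cap is the load-bearing detail. A sketch with placeholder generator and critic functions (the `[fixed]` marker is a stand-in for an actual revision):

```python
MAX_ITERATIONS = 3  # assumption: cap before escalating to a human

def generate(draft: str, feedback) -> str:
    """Stand-in for the generator agent: applies feedback if given."""
    return draft + (" [fixed]" if feedback else "")

def critique(text: str):
    """Stand-in for the critic agent: returns feedback, or None to accept."""
    return None if "[fixed]" in text else "add error handling"

def generator_critic(task: str):
    draft, feedback = task, None
    for _ in range(MAX_ITERATIONS):
        draft = generate(draft, feedback)
        feedback = critique(draft)
        if feedback is None:
            return draft, "accepted"
    return draft, "escalated"  # critic never accepted: escalate, don't loop forever
```

Without the cap, a critic whose standards exceed the generator's capability loops forever; with it, the failure mode becomes a visible escalation instead.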
Pattern 06
Evaluator-Optimizer
Score + refine cycle.
Layers
L2, L5
When to use: Prompt optimization, parameter tuning, iterative refinement with measurable quality metrics. DSPy is the canonical implementation.
Example: Agent generates marketing copy → Evaluator scores on brand voice, clarity, and conversion potential → Feedback drives next iteration.
Risk: Overfitting to the evaluation metric rather than genuine quality improvement. Goodhart's Law applies.
Pattern 07
Supervisor
Meta-agent managing a team.
Layers
L4, L5
When to use: Production fleets where reliability matters. The hierarchical orchestrator-worker pattern is the 2026 standard for 100+ agent deployments.
Example: Customer service fleet: Supervisor monitors five specialist agents, validates their responses before delivery, reassigns tasks when one agent struggles, escalates to human when all fail.
Risk: Supervisor becomes a single point of failure. Consider supervisor-of-supervisors for critical workloads.
Pattern 08
Swarm
Emergent coordination without central control.
Layers
L4
When to use: Exploration tasks with high uncertainty, creative brainstorming, distributed sensing. Agents share a blackboard or message bus and self-organize based on local information.
Example: Market research where four agents independently explore different angles and share findings via a shared workspace. No central coordinator.
Risk: Unpredictable behavior. Difficult to audit. Convergence is not guaranteed. Best for tasks where diversity of exploration matters more than efficiency of execution.
Pattern 09
Reflective Loop
Self-evaluation after action.
Layers
L2, L3
When to use: Any task benefiting from iterative improvement: research, writing, problem-solving. Based on the Reflexion architecture.
Risk: Reflection quality depends on the agent's metacognitive capability. Weak reflection is worse than none.
Pattern 10
Human-in-the-Loop
Approval gates for high-stakes decisions.
Layers
L4, L6, L7
When to use: Financial transactions, customer communications, legal documents, production deployments. Required whenever policy evaluation exceeds agent authority.
Risk: Human bottleneck. Design for asynchronous review with timeout escalation.
Pattern 11
Blackboard
Shared workspace. Agents read and write independently.
Layers
L4
When to use: Collaborative analysis where agents contribute partial solutions: multi-perspective research, complex diagnosis, any problem where the whole is greater than the sum of parts.
Example: Medical diagnosis: symptom analyzer, lab result interpreter, patient history reviewer, and differential diagnosis agent all write to a shared case file.
Risk: Coordination through shared state introduces race conditions. Requires conflict resolution.
Pattern 12
Cognitive Router
Identity-aware output classification and routing.
Layers
L1, L3
When to use: Any long-running autonomous agent. The Cortex pattern that separates chatbots from true agents.
Risk: Classification errors route output to the wrong subsystem. Requires high-quality output classification.

Pattern Selection Guide

Choose your pattern based on your problem shape. Most production systems combine multiple patterns.

Is it a single-agent task?
  Yes → Reflective Loop (quality) or Simple Agent Loop (speed)
  No → multiple agents; match the problem shape:
Clear sequential stages? → Sequential Pipeline
Need routing by intent? → Coordinator / Dispatcher
Independent parallel work? → Fan-Out / Gather
Complex multi-level goal? → Hierarchical Decomposition
Quality requires validation? → Generator-Critic
Measurable optimization target? → Evaluator-Optimizer
Reliable production fleet? → Supervisor
Exploratory / uncertain? → Swarm
Collaborative partial solutions? → Blackboard
High-stakes decisions? → Human-in-the-Loop

Canonical Compositions

Beyond individual patterns, the stack enables five canonical compositions.

Composition A

The Minimal Viable Agent

8 primitives: Autoregressive Core → Tool Calling → Agent Loop → Tool Binding → State Manager → Sandbox → Structured Logger → Audit Ledger. A single agent that can reason, act, maintain state, and be audited. No orchestration, no memory persistence, no identity. But a complete loop.
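Stripped of everything optional, the loop those eight primitives form fits in a page. All names here are illustrative, and the "model" is a hard-coded stand-in for the Autoregressive Core:

```python
import time

audit_ledger = []        # Audit Ledger: append-only record of every step
state = {"history": []}  # State Manager

def log_event(kind, payload):
    """Structured Logger feeding the audit ledger."""
    audit_ledger.append({"ts": time.time(), "kind": kind, "payload": payload})

TOOLS = {"add": lambda a, b: a + b}  # Tool Binding (sandboxed in production)

def model(prompt: str):
    """Stand-in for the Autoregressive Core: emits one tool call, then a final answer."""
    if "result" not in prompt:
        return {"tool": "add", "args": [2, 3]}
    return {"final": prompt}

def agent_loop(goal: str, max_steps: int = 5):
    prompt = goal
    for _ in range(max_steps):                           # the Agent Loop
        action = model(prompt)
        log_event("model", action)
        if "final" in action:
            return action["final"]
        result = TOOLS[action["tool"]](*action["args"])  # Tool Calling
        state["history"].append(result)
        prompt = f"{goal} result={result}"
    return None

answer = agent_loop("what is 2+3?")
```

No orchestration, no memory persistence, no identity, exactly as Composition A specifies, but the reason-act-record loop is complete and every step is auditable.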

Composition B

The Personal Assistant

A persistent, always-on agent integrated into a person's digital life. Key primitives: Identity Kernel + Session Persistence + Memory Hierarchy (all six types) + Integration Connectors + Filler Suppressor + Notification Engine. Local-first memory. Multi-persona isolation. The OpenClaw archetype with 210,000+ GitHub stars.

Composition C

The Enterprise Fleet

Multi-agent orchestration for complex business processes. Key primitives: Agent Registry + Delegation Engine + Supervisor + Shared State Store + Workflow Graph + HITL Gate + Trust Boundary + Policy Cascade + Cost Tracker. A hierarchical orchestrator-worker topology where manager agents maintain strategic plans and specialist agents execute bounded subtasks.

Composition D

The Self-Improving System

Learning + sleep-time consolidation + meta-learning in a closed loop. The agent operates in production. Sleep-time agents consolidate experience. Online evaluators measure quality. When quality degrades, the system triggers a meta-learning cycle. The three-level self-evolution framework: in-context adaptation → experience-based refinement → continuous optimization.

Composition E

The Autonomous Worker

Hours-long autonomous execution on complex objectives. Key primitives: Goal Beacon + Dual-Process Router + Metacognitive Monitor + Memory Arbiter + Context Compaction + Checkpoint Engine + Resource Governor + Break-Glass Protocol. The Cortex layer is the essential differentiator. Without it, agents degrade after minutes. With it, sessions approaching 45+ minutes at the 99.9th percentile are routine, with the frontier extending to multi-hour execution.

The Ecosystem Map

Who builds what. No single project covers more than five layers.

White Space Analysis

The agentic framework landscape has settled into identifiable lanes. The distribution reveals where the industry is investing, and more importantly, where it is not.

Where frameworks cluster: Every major framework targets the same two functional zones. L2 (Workbench) and L4 (Switchboard) are where LangGraph, CrewAI, OpenAI Agents SDK, Google ADK, AutoGen, and Semantic Kernel all compete. These are table stakes: the layers that are commoditizing fastest.

The four critical gaps are structural, not temporary. Different layers require fundamentally different architectural priorities:

  • L3 Cortex (cognitive middleware): Nearly empty. No general-purpose open-source implementation exists. This is the hardest layer and the highest-value unsolved infrastructure problem.
  • L6 Shield (security): Entirely absent from general-purpose frameworks. Security is a bolt-on layer from specialist vendors (Zenity, Lakera, NeMo Guardrails), not an integrated architectural concern.
  • L0 Substrate (infrastructure): Specialized players like HashiCorp/SPIFFE, Aembit, and Cloudflare operate at a layer frameworks do not address.
  • L8 Commons (economics): Payment protocols (ACP, x402, Visa TAP) are being built by financial infrastructure players entirely outside the framework ecosystem.

Aggregate Coverage by Layer (14 projects surveyed)

  • L0: Specialized
  • L1: Partial
  • L2: Table stakes
  • L3: Critical gap
  • L4: Table stakes
  • L5: Maturing
  • L6: Emerging
  • L7: Partial
  • L8: Specialized
Key Insight

No single framework covers more than five of nine layers. The ecosystem is converging toward a composition model: organizations must assemble 3 to 5 specialized tools (framework + security layer + identity infrastructure + observability + payment rails) to cover the full stack. The integration complexity of composing these tools is itself a white space, and one that cloud providers are best positioned to address through managed service abstractions.

Framework Coverage

  • LangGraph / LangChain (34.5M dl/mo): Orchestration + workbench leader
  • CrewAI (44.3k stars): Multi-agent teams, MCP-native
  • Microsoft Semantic Kernel: Azure-native, Responsible AI module
  • OpenAI Agents SDK (10.3M dl/mo): Broadest coverage, 100+ models
  • Google ADK / Agent Engine (17.8k stars): Multimodal, A2A native
  • AWS Bedrock AgentCore: Managed infrastructure, security-first
  • Letta: Memory-first, sleep-time compute
  • OpenClaw (180k+ users): Identity kernel, soul architecture
  • Attested Intelligence: TEE-based agent verification
  • Nevermined: Agent payments, economic rails
  • Rubrik Agent Cloud: Security + interface integration
  • Temporal: Durable execution infrastructure
  • DSPy: Prompt optimization, meta-programming
  • AutoGen / AG2: Multi-agent conversations


Reading this map: The strongest signal is the gaps. The L3 column (Cortex) has only two entries, neither a general-purpose open-source implementation. The L8 column (Commons) is dominated by payment specialists, not agent platforms. The most crowded layers are L2 and L4, exactly the layers that are commoditizing fastest. MCP, A2A, and AG-UI are positioned as cross-cutting protocols rather than layer occupants. The stack is too large for any single vendor. This is by design.

Integration Contracts

The structural glue between layers. Each contract defines what crosses a layer boundary.

Boundary | Contract | Standard
L0→L1 | Compute allocation, hardware attestation | Cloud APIs, SPIFFE
L1→L2 | Inference APIs, tool calling schemas, structured output | Chat Completions, Responses API
L2→L3 | Composed agent instances with tool bindings and state | Framework-specific agent interfaces
L3→L4 | Goal-anchored, identity-coherent agent behaviors | A2A Agent Cards
L4→L5 | Managed workflows with execution guarantees | Workflow graph definitions, checkpoint APIs
L5→L6 | Observable, evaluated agent execution | Structured traces, eval scores, audit events
L6→L7 | Governed, credentialed agent services | OAuth scopes, policy attestations, trust boundaries
L7→L8 | Metered, billed agent capabilities | Usage records, transaction mandates

The principle: contracts between layers must be more stable than the implementations within them. When a framework updates, the contracts should hold.

The emerging protocols map to these boundaries naturally: MCP governs the L2→Tool boundary. A2A governs the L4→L4 boundary (agent-to-agent). AG-UI governs the L7→Human boundary. SPIFFE governs the L0→L6 boundary. ACP/x402 govern the L8→Market boundary.

The contract stability test: If you can replace the implementation behind a contract without breaking consumers above or below, the contract is stable. As of March 2026, the most stable contracts are at L1 (inference APIs are well-standardized) and the least stable are at L3 (cognitive middleware has no standard interfaces).

Industry Landscape

Biggest Challenges

01

Trust and reliability remain the #1 barrier

65% of enterprises cite complexity as the primary adoption blocker. Agents that work 95% of the time are 100% untrustworthy in regulated environments. The gap between demo quality and production quality remains enormous.

02

Evaluation methodology is immature

Current benchmarks measure task completion but not reasoning quality. An agent that arrives at the right answer via flawed reasoning will eventually fail catastrophically. Trajectory evaluation, assessing the path, not just the destination, is in its infancy.

03

Cost at scale is poorly understood

Token costs are a distraction. In real workflows, LLM tokens represent less than 1% of total agent cost. Tool invocations, external API calls, and compute time dominate. Most organizations cannot attribute agent costs to business outcomes.

04

Interoperability fragmentation persists

MCP is winning tool integration. A2A is gaining for agent communication. AG-UI is emerging for human interaction. But they are not yet composable. No single system cleanly implements all three. The protocol triangle exists in theory more than in practice.

05

Security vulnerabilities are accelerating

43% of tested MCP implementations have injection vulnerabilities. The attack surface grows with every new tool connection. The industry is deploying agents faster than it is securing them.

06

Agent sprawl mirrors identity sprawl

Microsoft warns that agents are being created by low-code tools faster than governance models can track them. Most organizations cannot answer: how many agents do we have?

07

The "vibe coding" security hangover

The 2024–2025 wave of AI-generated code created a generation of agents built without security review. As these agents move from prototypes to production, the CSA predicts a surge in agent-related CVEs through 2026–2027.

White-Space Opportunities

High Priority

Cognitive Middleware (L3) is nearly unoccupied

The Cortex layer (identity persistence, memory arbitration, goal maintenance, metacognition) has no turnkey open-source implementation. Letta approaches from the memory side. The most complete implementations remain proprietary. This is the most valuable unsolved infrastructure problem in the stack.

Emerging

Agent-native payments lack infrastructure

Five payment protocols have launched but none has a mature implementation for the full agent commerce lifecycle: discovery, negotiation, transaction, verification, dispute resolution.

Memory Portability

Cross-vendor memory portability does not exist

An agent's memories on Letta cannot migrate to Mem0 or Zep. Memory lock-in is the new vendor lock-in. A standard for portable agent memory would be transformative.

Identity Standards

Agent identity standards are being written now

NIST's concept paper, the IETF Entity Attestation Token, and eMudhra's platform represent early infrastructure. Jones Walker predicts NIST's voluntary guidelines will become compliance obligations within 18 months.

Convergence Points

The industry is converging on:

  • MCP for tool integration. Universal adoption across ChatGPT, Cursor, Gemini, Copilot. The "USB-C of AI" metaphor holds.
  • A2A for agent communication. Google's protocol with 50+ partners. The only credible agent-to-agent standard.
  • AG-UI for human interaction. Microsoft and Oracle adoption signal enterprise acceptance.
  • Managed agent platforms. AWS AgentCore, Google Agent Engine, and Azure are building the PaaS layer for agents.
  • Policy-as-code via OPA. Governance sidecars are becoming the standard enforcement mechanism.
  • Hybrid memory architectures. Markdown + Vector DB + Graph DB is the 2026 production consensus.

Divergence Points

The industry has not settled on:

  • Graph-based vs. role-based orchestration. LangGraph (explicit graphs) vs. CrewAI (declarative roles) represent fundamentally different philosophies. Both work. Neither has won.
  • Vector vs. graph memory for deep retrieval. Vector databases excel at fuzzy semantic matching. Graph databases excel at relational reasoning. The optimal blend is use-case dependent.
  • Open framework vs. managed platform. Build-your-own with LangGraph/CrewAI vs. deploy-on with AgentCore/Agent Engine. The platform play simplifies operations but constrains architecture.
  • Token-space vs. weight-space learning. Letta argues learning should happen in context. Traditional ML argues it should happen in weights. The answer is probably both, at different timescales.
  • Assertion vs. cryptographic proof for trust. Most systems rely on policy assertions. Attested Intelligence demonstrates that cryptographic proof is viable. The market will decide.
  • Centralized platform vs. agent mesh. Hyperscaler platforms offer simplicity; decentralized agent meshes offer flexibility. The enterprise market may split: regulated industries on platforms, technology companies on meshes.

Open Frontiers

Where the map is honest about what remains unexplored.

01

No production system implements all nine layers

The most complete systems cover perhaps five or six with varying depth. The Cortex and the Commons have no turnkey implementations.

02

Agent-to-agent trust is unsolved

A2A's open problems include identity verification between agents, trust/reputation for discovery, and preventing impersonation. The protocol exists; the trust infrastructure does not.

03

Multi-modal perception is primitive

Current agents process text natively, images adequately, and audio/video poorly. The multimodal agent that watches a video, reads a spreadsheet, and synthesizes across modalities in real time is a capability frontier.

04

Post-quantum readiness is urgent but unaddressed

eMudhra targets post-quantum cryptographic standards. DigiCert's CEO compares the transition to Y2K. Agent identity infrastructure built on classical cryptography today will need rebuilding within a decade.

05

The liability question is open

Existing frameworks can likely handle agent harms through product liability and agency theory. Mobley v. Workday (July 2024) was the first federal court ruling to apply agency theory to an AI vendor, allowing claims to proceed against the vendor directly. But the multi-stakeholder liability matrix remains unsettled, with state laws in Texas, New York, Illinois, and Colorado rapidly expanding AI liability.

06

Collective learning at scale is theoretical

Sleep-time consolidation works (Letta). Cross-session meta-learning has promising implementations (LangMem). But organization-wide knowledge promotion at machine speed remains a design pattern, not a deployed capability. Continual weight-level learning that avoids catastrophic forgetting is still primarily a research problem.

07

Agent consciousness is speculative but approaching

A 2024 paper argued that language agent architectures may already satisfy Global Workspace Theory's conditions for phenomenal consciousness. Architecturally irrelevant today. Philosophically and legally relevant sooner than expected, as agents develop richer self-models through metacognitive monitoring and identity persistence.

Conclusion

The Agentic Stack is a map, not a prescription.

It does not tell you which layers to build or which primitives to prioritize. It tells you where you are, what is adjacent, and what the terrain looks like.

The landscape it describes is being built by thousands of teams working in partial isolation. A framework team builds orchestration abstractions. A research lab formalizes dual-process reasoning. A solo developer builds cognitive routing that no established framework has attempted. A protocol committee standardizes tool integration. A payments team designs micropayment rails. A security researcher files a patent for cryptographic governance artifacts. A memory team invents sleep-time compute by analogy to neuroscience.

None of them are building the same thing.

All of them are building the same thing.

The agent stack will be the most consequential software infrastructure of the next decade. It will determine how organizations operate, how knowledge is preserved, how trust is established between autonomous systems, and how economic value flows through networks of intelligent workers.

This framework is the beginning of a shared vocabulary for that work. It will evolve as the landscape evolves. Layers will merge. New layers will emerge. Primitives will be renamed, deprecated, or promoted. The protocol stack will consolidate. The economic layer will mature. The Cortex will go from the least understood layer to the most contested battleground.

The map is not the territory.
But without a map, you cannot navigate.

Build on primitives, not frameworks. Embed policy in infrastructure, not documents. Treat memory as hierarchical, identity as persistent, and learning as first-class. Observe everything. Trust nothing by default.

The rest is implementation.

Appendix: Emerging Standards and Regulatory Landscape

The regulatory environment for agent systems is crystallizing faster than most practitioners realize.

NIST · February 2026

AI Agent Standards Initiative

Three pillars: industry-led standards development, community-led open-source protocol development, and research in agent security and identity. The parallel RFI on AI Agent Security and Concept Paper on AI Agent Identity signal that agent governance is transitioning from best practice to compliance obligation.

IETF · Draft

Entity Attestation Token (EAT) for AI Agents

draft-messous-eat-ai: Defines CBOR/JWT-encoded attestation profiles including model hash, training data ID, differential privacy parameters, input policy digest, owner identity, and allowed APIs. Supports composite attestation via nested EATs for multi-agent platforms: hardware root of trust → TEE/OS → AI agent → sub-models.
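The draft's JWT encoding can be sketched with the standard library alone: a claims set carrying the profile fields listed above, a nested sub-model attestation, and an HMAC-SHA256 signature in JOSE compact form. The claim names and the shared-secret key here are illustrative, not the draft's normative registry; a production EAT would chain to a hardware root of trust, not a demo secret.

```python
import base64, hashlib, hmac, json

def _b64url(data: bytes) -> str:
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()

def sign_eat(claims: dict, key: bytes) -> str:
    """JOSE compact serialization (header.payload.signature) over an EAT-style claims set."""
    header = _b64url(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    payload = _b64url(json.dumps(claims).encode())
    sig = hmac.new(key, f"{header}.{payload}".encode(), hashlib.sha256).digest()
    return f"{header}.{payload}.{_b64url(sig)}"

# Illustrative claim names modeled on the profile fields listed above.
sub_model_eat = {"model_hash": hashlib.sha256(b"sub-model-weights").hexdigest(),
                 "role": "retrieval"}
agent_eat = {
    "iss": "tee://platform-attester",      # attesting environment (TEE/OS link in the chain)
    "model_hash": hashlib.sha256(b"agent-weights").hexdigest(),
    "training_data_id": "dataset-2026-01",
    "dp_epsilon": 2.0,                     # differential privacy parameter
    "input_policy_digest": hashlib.sha256(b"input-policy").hexdigest(),
    "owner": "did:example:operator-1",
    "allowed_apis": ["search", "payments.sandbox"],
    "nested_eats": [sub_model_eat],        # composite attestation of sub-models
}

token = sign_eat(agent_eat, key=b"demo-shared-secret")
```

The nesting is the point: a verifier that trusts the platform attester can walk `nested_eats` downward, checking each sub-model's hash without a separate trust relationship per component.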

IETF · March 2026

Verifiable AI Provenance Framework

draft-ailex-vap: Targets AI audit trail systems and regulatory submission tools.

EU · In Force

EU AI Act Implications

Entered into force August 1, 2024 with progressive application. High-risk agent categories face mandatory conformity assessments, technical documentation, and human oversight requirements. TRiSM analysis maps Trust, Risk, and Security Management requirements onto agentic systems.

US States · 2025–2026

State Liability Expansion

Texas (up to $200K per uncurable violation), New York ($15K/day), Illinois (employment discrimination), Colorado (algorithmic discrimination, effective June 2026). Wiley Rein's analysis notes insurance lines are not yet covering AI-specific liabilities. The practical implication: agent builders need the Shield layer not as a feature but as a legal requirement.

WEF / Cognizant · March 2026

AI Agents in Action Report

The report emphasizes classifying agentic systems by autonomy level and risk profile before determining oversight models. This mirrors the Agentic Stack's principle that policy is infrastructure: governance decisions must be made architecturally, not administratively.
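That classify-before-oversight step can be expressed as code rather than policy prose. The autonomy tiers, risk labels, and oversight names below are illustrative, not the report's taxonomy; the point is that higher autonomy and higher risk jointly escalate the required oversight model, and the mapping lives in infrastructure where it can be enforced.

```python
from enum import IntEnum

class Autonomy(IntEnum):
    SUGGEST = 1            # human executes every action
    ACT_WITH_APPROVAL = 2  # agent acts only after sign-off
    ACT_AND_REPORT = 3     # agent acts, then reports
    FULLY_AUTONOMOUS = 4   # agent acts without per-action review

def oversight_model(autonomy: Autonomy, risk: str) -> str:
    """Map (autonomy level, risk profile) to an oversight model.

    Illustrative tiers, not a normative taxonomy: risk and autonomy
    escalate oversight together, and the decision is made in code."""
    high_risk = risk in {"high", "safety-critical"}
    if high_risk and autonomy >= Autonomy.ACT_AND_REPORT:
        return "human-in-the-loop"   # every consequential action gated
    if high_risk or autonomy == Autonomy.FULLY_AUTONOMOUS:
        return "human-on-the-loop"   # continuous monitoring, kill switch
    return "periodic-audit"          # sampled review of action logs

tier = oversight_model(Autonomy.ACT_AND_REPORT, "safety-critical")
```

Because the classification is a pure function, a governance sidecar can evaluate it at deployment time and refuse to launch an agent whose declared autonomy exceeds its approved oversight tier.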

The Accountability Matrix

Stakeholder             | Liability Type                             | Source
Developer               | Product liability for design defects       | Credo AI analysis
Operator                | Negligence liability for misconfiguration  | Emerging case law
User/Principal          | Defines scope of delegated authority       | Agency theory
Infrastructure Provider | SLA obligations                            | Contract law

The trajectory: Voluntary guidelines (2023) → Referenced in executive orders (2024) → Cited in state law (2025) → Mandatory compliance obligations (projected 2027). Agent infrastructure built today without governance will need expensive retrofitting within 18 months. The Shield layer is not optional. It is the price of admission to regulated markets.

The Agentic Stack is an open framework maintained as a living document.

Version 2.0 published March 2026