Jason Clinton’s OC3 2026 talk on confidential computing and scaling laws pushed me to finally write about Anthropic’s Confidential Inference Systems paper, a joint publication with Irregular (formerly Pattern Labs), released June 2025. It’s one of the cleaner public treatments of what it actually means to make AI trust verifiable rather than asserted. It also points directly at a gap the industry hasn’t begun to close: trust in agentic systems.

This post unpacks the paper’s core contributions, where its claims need qualification, and where I think the conversation has to go next: from confidential inference to attested agent identity.

The Trust Problem in AI Today

When you send a prompt to an AI service, you’re trusting an enormous stack: the model provider, the cloud infrastructure operator, the networking layer, the data center staff. Every layer can see your data unless active technical measures prevent it.

Today’s answer is mostly contractual. Privacy policies. SOC 2 Type II reports. Data Processing Agreements. These are meaningful, but they’re promises, not proofs. A breach, an insider threat, a subpoena, or a misconfiguration can invalidate all of them simultaneously.

The confidential computing community has been arguing for a decade that the right answer is cryptographic assurance: hardware-rooted proof that data was processed in a specific, verified environment. The Anthropic paper is significant because it shows what that argument looks like when applied rigorously to production AI inference. And because Anthropic is large enough that if they ship this, it sets a floor for what “serious” AI privacy architecture looks like.

The Three-Party Trust Model

The paper establishes a clean mental model with three principals:

Model owner (Anthropic): needs to protect model weights from theft, exfiltration, or misuse. Frontier model weights represent enormous R&D investment and carry dual-use risk: wrong hands, wrong capabilities.

Data owner / user: needs prompt privacy. Enterprise use cases like proprietary code, legal strategy, and medical records demand inputs that no infrastructure operator can read.

Cloud / service provider: hosts the model as a service, and needs to give credible assurances to both parties without requiring either to simply take its word.

The architecturally important position: the cloud provider is untrusted by default. Even a well-intentioned operator shouldn’t be a required trust assumption. The goal is a system that holds even if the cloud operator is compromised, malicious, or subpoenaed. Most AI services today are designed assuming a trusted operator, with access controls layered on top. This design inverts that.

The Architecture: A Minimal Trusted Surface

The challenge: hardware accelerators, whether GPUs or custom AI chips, don’t yet uniformly support confidential computing. NVIDIA’s H100 (Hopper) CC mode exists, but comes with real constraints: it requires single-tenant VM access, adds roughly 15-30% performance overhead, and offers limited attestation granularity. The paper’s hybrid design is an architectural response to those limitations, not an ideal state.

The paper is careful to distinguish two accelerator stories. In the ideal case, the accelerator itself natively supports confidential computing: encrypted inputs arrive at the chip, cleartext lives only in private accelerator memory, debugging and measurement features are shut off during execution, and the accelerator can attest its own state.

The fallback is the bridged design Anthropic emphasizes: a CPU-side TEE mediates access to an accelerator that is not fully confidential on its own. That’s practical today, and the paper is explicit that it is a compromise rather than an end state. My read is that unless hardware controls constrain accelerator access to the TEE, the CPU-to-accelerator hop remains the weak seam.

The core mechanism is a small trusted loader: a minimal program that sits between the inference stack and the hardware accelerator and is the only component that ever handles cleartext. Everything else in the system is “untrusted,” meaning it can change, be misconfigured, or be adversarial without affecting the security guarantee.

The loader does three things:

  • Accepts encrypted data, decrypts it inside the protected environment, passes it to the accelerator
  • Invokes inference, gets results back
  • Re-encrypts results before they leave the trusted boundary

To the rest of the inference stack, the loader presents itself as a virtual accelerator, an abstraction that hides the security plumbing. The inference server talks to it like a GPU. The loader only accepts programs signed by Anthropic’s secure CI/CD pipeline, which requires multi-engineer review before deployment.
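To make the shape of that concrete, here’s a minimal sketch of the loader’s control flow. Everything in it is illustrative: the paper publishes no code, `FakeAccelerator` and the key-handling details are stand-ins, and a real loader runs inside a TEE with its key released only after attestation succeeds.

```python
# A minimal sketch of the loader's decrypt -> invoke -> re-encrypt loop.
# Illustrative only: the paper publishes no code, FakeAccelerator is a
# stand-in device, and a real loader runs inside a TEE with its key
# released only after attestation succeeds.

from cryptography.fernet import Fernet
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey
from cryptography.exceptions import InvalidSignature


class FakeAccelerator:
    """Stand-in for the real device behind the 'virtual accelerator'."""

    def load(self, program: bytes) -> None:
        self.program = program

    def run(self, prompt: bytes) -> bytes:
        return b"completion for: " + prompt


class TrustedLoader:
    """The only component that ever handles cleartext."""

    def __init__(self, accelerator, session_key: bytes, signer_public_key):
        self.accelerator = accelerator
        self.cipher = Fernet(session_key)   # key released post-attestation
        self.signer = signer_public_key     # secure CI/CD pipeline's key

    def load_program(self, program: bytes, signature: bytes) -> None:
        # Refuse anything not signed by the secure CI/CD pipeline.
        try:
            self.signer.verify(signature, program)
        except InvalidSignature:
            raise PermissionError("unsigned inference program rejected")
        self.accelerator.load(program)

    def infer(self, encrypted_prompt: bytes) -> bytes:
        prompt = self.cipher.decrypt(encrypted_prompt)  # 1. decrypt inside TEE
        result = self.accelerator.run(prompt)           # 2. invoke accelerator
        return self.cipher.encrypt(result)              # 3. re-encrypt result


# Usage: the pipeline signs the program; the loader enforces the signature.
signing_key = Ed25519PrivateKey.generate()
loader = TrustedLoader(FakeAccelerator(), Fernet.generate_key(),
                       signing_key.public_key())
program = b"inference-engine-v1"
loader.load_program(program, signing_key.sign(program))
```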

The same loader architecture also solves a second problem most people miss: model weight security. Weights are stored encrypted, decrypted only inside the loader, never released from it. Weight exfiltration requires defeating hardware attestation, not just stealing a file.

The paper also gets specific about model-weight protection. The RAND report on Securing AI Model Weights defines five security levels for frontier models. The top two, SL4 and SL5, are meant for models whose theft risk looks more like a nation-state problem than a conventional enterprise security problem.

At that level, ordinary at-rest encryption is not enough. The paper points toward TEE-isolated weight keys, logged KMS use, and even controlled availability: keeping the key service cold unless a legitimate model load is happening. It also gestures at confidential provisioning, where unencrypted weights are destroyed immediately after provisioning, or training/finalization happens inside a TEE.
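Here’s a sketch of what controlled availability could look like. The windowing and approval mechanics are my illustration, not the paper’s design; `ColdWeightKeyService` and its ticket parameter are hypothetical.

```python
# Sketch of controlled availability: the weight-key service stays cold
# except during an approved load window, and every decision is logged.
# The window/approval mechanics here are my illustration, not the
# paper's design.

import time


class ColdWeightKeyService:
    def __init__(self, weight_key: bytes):
        self._weight_key = weight_key
        self._window_expiry = 0.0   # cold by default
        self.audit_log = []         # stand-in for an external, append-only log

    def open_load_window(self, ticket: str, seconds: int = 300) -> None:
        # In practice this would sit behind multi-party approval; a
        # "ticket" string stands in for that process here.
        self._window_expiry = time.time() + seconds
        self.audit_log.append(("window_opened", ticket, time.time()))

    def release_key(self, attested_loader_id: str) -> bytes:
        if time.time() > self._window_expiry:
            self.audit_log.append(("denied_cold", attested_loader_id, time.time()))
            raise PermissionError("key service is cold: no load window open")
        self.audit_log.append(("released", attested_loader_id, time.time()))
        return self._weight_key
```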

The important idea is simple: sensitive weights need protection while they are being loaded and used, not only while they are sitting on disk.

How Attestation Actually Works

The trust chain runs through hardware attestation. In modern confidential computing platforms (AMD SEV-SNP, Intel TDX, NVIDIA H100 CC mode), attestation is rooted in dedicated hardware security components: AMD’s Platform Security Processor, Intel’s TDX module, and NVIDIA’s on-GPU security controller. In the architecture described, a virtual TPM within the confidential VM records cryptographic measurements of the boot stack, forming the basis of the attestation report.
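The measurement chaining itself is simple to state precisely. This sketch mirrors TPM PCR-extend semantics: each stage folds the hash of the next component into a running register, so the final value commits to the entire stack in order. The stage names are illustrative.

```python
# Measured boot in one function: each stage extends a register with the
# hash of the next component, so the final value commits to the whole
# stack in order. Mirrors TPM PCR-extend semantics; stage names are
# illustrative.

import hashlib


def extend(register: bytes, component: bytes) -> bytes:
    return hashlib.sha256(register + hashlib.sha256(component).digest()).digest()


register = b"\x00" * 32  # reset at boot
for stage in [b"firmware", b"bootloader", b"kernel", b"trusted-loader"]:
    register = extend(register, stage)
print(register.hex())    # the "expected measurement" a verifier checks
```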

The flow:

  1. Boot measurement: At each boot stage, the security processor records a cryptographic hash. By the time the loader is running, a measurement of the complete software stack exists.
  2. Attestation report: The loader produces a signed report: “I am running this exact code, with debug disabled, in hardware-isolated memory. Here is the proof.”
  3. Keyserver validation: The loader presents its attestation to a keyserver. If the measurements match expected values, keys are released. If not (wrong code, debug on, unexpected config), no keys; see the sketch after this list.
  4. Encrypted data path: User data is encrypted before it reaches Anthropic’s servers, under a public key tied to the attested loader. Only a loader that passes attestation can decrypt it.
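Step 3 is the decision point worth making concrete. A minimal sketch of the keyserver’s logic follows; the field names are illustrative, real SEV-SNP or TDX reports differ in structure, and `signature_valid` stands in for verifying the vendor certificate chain.

```python
# Minimal sketch of the keyserver's decision at step 3. Field names are
# illustrative; real SEV-SNP or TDX reports differ in structure, and
# signature_valid stands in for verifying the vendor certificate chain.

from dataclasses import dataclass


@dataclass
class AttestationReport:
    measurement: str        # hash chain covering boot stack + loader
    debug_enabled: bool
    signature_valid: bool   # vendor cert chain checked out


EXPECTED_MEASUREMENTS = {
    # Placeholder digest: published, reviewed builds only.
    "9f3a...": "loader-v1.4.2",
}


def release_key(report: AttestationReport, keystore: dict) -> bytes:
    if not report.signature_valid:
        raise PermissionError("report not rooted in vendor hardware key")
    if report.debug_enabled:
        raise PermissionError("debug mode enabled: no keys")
    build = EXPECTED_MEASUREMENTS.get(report.measurement)
    if build is None:
        raise PermissionError("unknown measurement: no keys")
    return keystore[build]
```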

One distinction the paper makes that product marketing often blurs: TLS to the provider is not the same as end-to-end encryption to the enclave. In the weaker mode, the service provider terminates encryption and forwards plaintext into the TEE. That can still improve model security, but it does not give the user confidential input handling. In the stronger mode, the client validates an attestation document, uses the enclave’s signed ephemeral public key or an attestation-gated KMS policy, and encrypts so that only the enclave can decrypt. If someone says “confidential inference,” this is the first distinction worth pressing on.
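Here’s what the stronger mode looks like from the client side, as a sketch: validate the attestation document, then encrypt under the enclave’s ephemeral public key so only the attested enclave can decrypt. The X25519 + HKDF + AES-GCM construction is a plausible choice, not one the paper mandates, and the attestation-document fields are invented for illustration.

```python
# Client-side sketch of the stronger mode: validate attestation, then
# encrypt so only the attested enclave can decrypt. X25519 + HKDF +
# AES-GCM is a plausible construction, not one the paper mandates.

import os
from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.asymmetric.x25519 import (
    X25519PrivateKey, X25519PublicKey,
)
from cryptography.hazmat.primitives.ciphers.aead import AESGCM
from cryptography.hazmat.primitives.kdf.hkdf import HKDF
from cryptography.hazmat.primitives.serialization import Encoding, PublicFormat


def encrypt_to_enclave(attestation_doc: dict, prompt: bytes) -> dict:
    # 1. Validate the attestation document first (measurement allowlist,
    #    vendor signature chain); elided here, see the keyserver sketch.
    assert attestation_doc["measurement_ok"], "refuse to send on bad attestation"

    # 2. Derive a key only the enclave's private key can recover.
    enclave_pub = X25519PublicKey.from_public_bytes(attestation_doc["enclave_pub"])
    client_priv = X25519PrivateKey.generate()
    key = HKDF(algorithm=hashes.SHA256(), length=32, salt=None,
               info=b"confidential-inference").derive(
                   client_priv.exchange(enclave_pub))

    # 3. Encrypt; the service provider only ever relays ciphertext.
    nonce = os.urandom(12)
    return {
        "client_pub": client_priv.public_key().public_bytes(
            Encoding.Raw, PublicFormat.Raw),
        "nonce": nonce,
        "ciphertext": AESGCM(key).encrypt(nonce, prompt, None),
    }


# Demo with a locally generated "enclave" key standing in for the real one.
enclave_priv = X25519PrivateKey.generate()
doc = {"measurement_ok": True,
       "enclave_pub": enclave_priv.public_key().public_bytes(
           Encoding.Raw, PublicFormat.Raw)}
message = encrypt_to_enclave(doc, b"proprietary source code")
```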

One important caveat: these guarantees hold against a specific threat model, namely a compromised or malicious cloud operator, a misconfigured hypervisor, or an insider threat at the infrastructure layer. They do not protect against physical hardware attacks, supply-chain compromise of the hardware itself, or side-channel exploitation (cache timing, power analysis, the class of attacks Spectre/Meltdown exposed). Those are acknowledged, documented out-of-scope limitations: not flaws in the design, but boundaries that matter when evaluating what you’re actually buying.

There’s also an architectural tension the paper surfaces without fully resolving: the keyserver/KMS is a critical trust anchor. The paper outlines three places this logic can live, each with a different tradeoff:

  • Cloud-native KMS: operationally easiest, but it reintroduces account and policy compromise (see the policy sketch after this list).
  • External KMS run by the model owner: reduces dependence on the cloud provider, but adds network exposure and attestation-validation complexity.
  • In-enclave KMS: minimizes key exposure and is where the paper seems to lean for SL4/SL5-style protection, but it is operationally hard.
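Cloud-native doesn’t have to mean attestation-blind, though. AWS’s Nitro Enclaves integration, for instance, lets a KMS key policy condition decryption on the enclave’s measured image hash. A sketch of such a policy follows, written as a Python dict for illustration; the account, role, and measurement values are placeholders, and the exact condition-key names should be checked against current AWS documentation.

```python
# Sketch of an attestation-gated key policy in a cloud-native KMS,
# modeled on AWS KMS condition keys for Nitro Enclaves. The account,
# role, and measurement values are placeholders; verify the exact
# condition-key names against current AWS documentation.

weight_key_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Sid": "DecryptOnlyInsideAttestedEnclave",
        "Effect": "Allow",
        "Principal": {"AWS": "arn:aws:iam::111122223333:role/inference-host"},
        "Action": "kms:Decrypt",
        "Resource": "*",
        "Condition": {
            # Key release requires an attestation document whose image
            # measurement matches the reviewed, published loader build.
            "StringEqualsIgnoreCase": {
                "kms:RecipientAttestation:ImageSha384": "9f3a...placeholder"
            }
        },
    }],
}
```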

My own view is that the missing piece is still transparency: publish expected measurements and key-release policy in a Sigstore/Rekor-style log so verification doesn’t quietly collapse back into “trust Anthropic.”

The Design Principle Worth Internalizing

The most generalizable insight: minimize the trusted computing base (TCB). Don’t try to make the entire system trustworthy; that’s not achievable at scale. Make the trusted surface as small as possible, verify it rigorously, and let everything else be untrusted.

The loader is intentionally tiny. It doesn’t implement inference logic. It doesn’t understand model architecture. It decrypts, invokes, re-encrypts. That simplicity is a security property: fewer lines of code, fewer vulnerabilities, and more auditable behavior.

The paper also goes further than just saying “use reproducible builds.” It lays out three paths for gaining confidence in the measured binary: fully reproducible builds, analysis of the compiled enclave, or a pre-built third-party/open-source enclave whose assurances are published and verifiable.

The common point is that source audit alone is not enough. Someone other than the builder has to be able to convince themselves that the measured enclave binary really matches the reviewed code. Reproducibility is still the cleanest answer, but I think it should terminate in a transparency log where expected measurements are publicly committed, so external parties can verify independently rather than taking Anthropic’s word on what should be attested.
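As a sketch of what that check could look like: before trusting a measurement, a verifier confirms it was publicly committed to an append-only log. The entry format below is invented for illustration; a real deployment would use something Rekor-shaped with proper inclusion proofs.

```python
# Sketch of the transparency-log check: before trusting a measurement,
# confirm it was publicly committed in an append-only log. The entry
# format is invented for illustration; a real deployment would use
# something Rekor-shaped with inclusion proofs.

import hashlib
import json


def measurement_is_published(measurement: str, log_entries: list[dict]) -> bool:
    """Return True iff the measurement appears in the mirrored log."""
    for entry in log_entries:
        # Integrity check: the body must hash to the logged digest, so a
        # mirror can't silently rewrite history.
        if hashlib.sha256(entry["body"].encode()).hexdigest() != entry["digest"]:
            continue
        body = json.loads(entry["body"])
        if (body.get("kind") == "enclave-measurement"
                and body.get("measurement") == measurement):
            return True
    return False
```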

Just as importantly, the paper refuses to treat confidential computing as magic. It separates systemic risks, such as flaws in enclave hardware, hypervisors, accelerator isolation, vendor supply chains, or the underlying cryptography, from introduced risks created by the implementation itself: unauthenticated enclave launch, debug-mode execution, weak attestation validation, or an overgrown build pipeline. That distinction matters because it tells builders where platform trust ends and where their own engineering failures begin.

Where I Think This Goes Next

This is the point where I move from summarizing the paper to extending it. The paper solves confidential inference. It doesn’t address confidential agency.

AI systems are increasingly not just answering but acting. Calling APIs. Writing to databases. Sending messages. Executing financial transactions. Making decisions on behalf of users across trust boundaries they didn’t explicitly review. When AI moves from assistant to agent, the trust problem doesn’t go away; it compounds.

The attestation model in this paper answers: “Can I trust that this inference environment is handling my data correctly?” It doesn’t answer: “Can I trust that this agent, acting on my behalf, is who it claims to be? That its context hasn’t been tampered with? That the specific action it’s about to take was actually authorized by me, right now, and can be proven after the fact?”

These are the questions of attested agent identity, and today’s answer to all of them is “trust the API key and the system prompt.” That’s the same “trust us” story inference was telling before TEEs. The surface is different; the structural problem is the same.

For agentic systems, this gap may become even more consequential than model exfiltration or prompt injection. Those are attacks against the AI system itself. Agentic identity failure is an attack against the authorization layer, the mechanism that determines what actions an AI is permitted to take on a human’s behalf. As agents become capable of consequential actions, including financial transactions, code deployment, and contract execution, the gap between “the API key was present” and “the human actually authorized this action, in this context, with this agent version” becomes expensive. Not just embarrassing. Expensive.

There’s existing work nearby: SPIFFE/SPIRE for workload identity, W3C Decentralized Identifiers for verifiable identity, and AI workload identity research from Microsoft and others. These are useful building blocks, but they don’t cleanly solve the agent-specific problem: dynamic context, multi-boundary workflows, sub-agent delegation, and real-time proof to relying parties.

At OC3 2026, I spoke about hardening MCP servers with confidential computing, specifically the attack surface that exists when a tool-calling protocol like MCP has no mechanism for the receiving server to verify which agent is calling, under whose authorization, or whether the execution environment matches any verifiable claim. That gap exists in every major agentic framework today.

The reference architecture I’ve been developing connects the layers: confidential VMs → hardware-rooted agent identity issuance (building on attestation primitives like those in this paper) → signed, policy-bound tool invocations → verifiable audit trails that survive multi-hop agent chains. The Anthropic paper provides the foundation layer: the hardware attestation primitives and the TCB minimization principles. Extending them up to the agentic stack is the next problem.
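To make one layer of that concrete, here’s a sketch of a signed, policy-bound tool invocation. The fields and flow are my working design, not anything from the Anthropic paper or the MCP spec; `attestation_ref` and `policy_id` are illustrative handles for the attestation quote and the human-authorization record.

```python
# Sketch of a signed, policy-bound tool invocation. The fields and flow
# are my working design, not anything from the Anthropic paper or the
# MCP spec; attestation_ref and policy_id are illustrative handles.

import json
import time
from cryptography.exceptions import InvalidSignature
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey


def sign_invocation(agent_key: Ed25519PrivateKey, tool: str, args: dict,
                    attestation_ref: str, policy_id: str) -> dict:
    claim = {
        "tool": tool,
        "args": args,
        "attestation": attestation_ref,  # ties the call to a measured TEE
        "policy": policy_id,             # what the human actually authorized
        "issued_at": time.time(),
    }
    payload = json.dumps(claim, sort_keys=True).encode()
    return {"payload": payload, "signature": agent_key.sign(payload)}


def verify_invocation(msg: dict, agent_pub, allowed_tools: set,
                      max_age_s: float = 60.0) -> bool:
    try:
        agent_pub.verify(msg["signature"], msg["payload"])
    except InvalidSignature:
        return False
    claim = json.loads(msg["payload"])
    fresh = time.time() - claim["issued_at"] <= max_age_s
    return fresh and claim["tool"] in allowed_tools


# A relying party (e.g. an MCP server) checks signature, freshness, policy.
agent_key = Ed25519PrivateKey.generate()
msg = sign_invocation(agent_key, "deploy_service", {"env": "staging"},
                      attestation_ref="tee-quote:placeholder",
                      policy_id="human-approval-ticket:placeholder")
assert verify_invocation(msg, agent_key.public_key(), {"deploy_service"})
```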

What This Means for Builders

If you’re designing AI infrastructure, a few things should shift:

Treat SOC 2 as a starting point, not a trust anchor. Contracts fail silently. A breach, subpoena, or insider invalidates them instantly. Hardware attestation fails loudly. You know precisely when and why it failed. The architecture should prefer the latter.

Assume the operator is untrusted. Design systems where even a compromised cloud provider cannot access sensitive data. This is achievable today on modern TEE platforms. H100 CC mode has tradeoffs, but the primitives exist.

Your TCB is your attack surface. Everything in your trust boundary is something that can be exploited or manipulated. The Anthropic loader model is the right instinct: minimize ruthlessly, verify everything in the boundary, and accept that the rest of the stack will sometimes be wrong.

Start thinking about agent attestation before you need it. The agentic identity problem doesn’t have a production-ready solution yet. But the organizations building agentic infrastructure today are accumulating technical debt in this layer. The patterns from inference attestation, including hardware roots of trust, minimal TCB, and transparent key management, apply. The design work is worth starting now.


Further reading: