llm-security · prompt-injection · ai-agents · policy-engine

Indirect prompt injection defense with source-trust classification

Tool output and untrusted content can hijack agents. Source-trust levels let policies treat tool_output and untrusted_content as higher risk automatically.

May 9, 2026 · 7 min read

Indirect prompt injection hides instructions in web pages, emails, or tool output. If the model treats that attacker-written text as a command, it can end up calling dangerous tools. Sanctum uses source-trust classification so that policies can treat content from untrusted sources differently from instructions given by the human operator.

Source-trust levels

Sanctum defines seven source-trust levels, including untrusted_content and tool_output. Policies can require verification for any action proposed when source trust is low, regardless of what the model “wants.”
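A minimal sketch of such a rule, in Python. This is an illustration and not Sanctum's API: only the level names tool_output and untrusted_content and the REQUIRE_VERIFICATION outcome come from this post. The human_operator level, the ALLOW outcome, and the decide function are hypothetical, and the remaining levels are left out because they are not listed here.

```python
from enum import Enum


class SourceTrust(Enum):
    # These two level names appear in the post.
    TOOL_OUTPUT = "tool_output"
    UNTRUSTED_CONTENT = "untrusted_content"
    # Hypothetical stand-in for the human operator; the post does not name this level.
    HUMAN_OPERATOR = "human_operator"


# Assumption: these are the levels a policy would count as "low" trust.
LOW_TRUST = frozenset({SourceTrust.TOOL_OUTPUT, SourceTrust.UNTRUSTED_CONTENT})


def decide(source: SourceTrust) -> str:
    """Decide from the origin of the proposal alone, ignoring the model's stated intent."""
    if source in LOW_TRUST:
        return "REQUIRE_VERIFICATION"
    return "ALLOW"  # hypothetical outcome name


print(decide(SourceTrust.UNTRUSTED_CONTENT))
print(decide(SourceTrust.HUMAN_OPERATOR))
```

The point of the design is that the decision takes only the source as input, so persuasive text inside a fetched page cannot change the outcome.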

Deterministic defense

Combine source trust with blast-radius scoring: external destinations, physical-world flags, and estimated value determine whether an action gets REQUIRE_VERIFICATION before it executes.
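One way the combination could look, sketched in Python under stated assumptions. The three inputs (external destination, physical-world flag, estimated value) and the REQUIRE_VERIFICATION outcome come from the text above. The weights, the value limit, the thresholds, the ALLOW outcome, and every class and function name are hypothetical and are not taken from Sanctum.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProposedAction:
    source_trust: str            # e.g. "tool_output", "untrusted_content"
    external_destination: bool   # data or effects leave the local environment
    physical_world: bool         # the action touches a physical system
    estimated_value: float       # estimated value at stake, in arbitrary units


LOW_TRUST_SOURCES = frozenset({"tool_output", "untrusted_content"})
VALUE_LIMIT = 100.0  # hypothetical cutoff for a "high value" action


def blast_radius(action: ProposedAction) -> int:
    """Add up fixed, hypothetical weights for each risk signal."""
    score = 0
    if action.external_destination:
        score += 1
    if action.physical_world:
        score += 2
    if action.estimated_value > VALUE_LIMIT:
        score += 1
    return score


def gate(action: ProposedAction) -> str:
    """Pure function of the action's fields: the same input always gives the same decision."""
    # Assumption: low source trust lowers the score needed to trigger verification.
    threshold = 1 if action.source_trust in LOW_TRUST_SOURCES else 2
    if blast_radius(action) >= threshold:
        return "REQUIRE_VERIFICATION"
    return "ALLOW"  # hypothetical outcome name


print(gate(ProposedAction("tool_output", True, False, 0.0)))
print(gate(ProposedAction("human_operator", True, False, 0.0)))
```

In this sketch a low-trust action with a blast radius of zero is still allowed. A stricter policy could require verification for every low-trust action, as described under source-trust levels.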

