llm-security · prompt-injection · ai-agents · policy-engine

Indirect Prompt Injection Defense: Source-Trust for Agents

Treat tool output as untrusted. Source-trust levels let policies automatically score tool_output and untrusted_content as higher risk.

May 9, 2026 · 7 min read

Indirect prompt injection hides instructions in web pages, emails, or tool output. A model that obeys the attacker's text can end up calling dangerous tools. Sanctum uses source-trust classification so that policies treat untrusted content differently from instructions given by the human operator.

Source-trust levels

Sanctum defines seven source-trust levels, including untrusted_content and tool_output. Policies can require verification for any action proposed when source trust is low, regardless of what the model “wants.”
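A minimal sketch of that rule in Python, not Sanctum's actual API. Only tool_output and untrusted_content are level names taken from this post; the human_operator label, the ProposedAction type, the decide function, and the ALLOW decision are hypothetical names introduced for illustration, and the remaining levels are left out because they are not listed here.

```python
from dataclasses import dataclass
from enum import Enum


class SourceTrust(Enum):
    # Only tool_output and untrusted_content are named in this post.
    # HUMAN_OPERATOR is a hypothetical label for the trusted end of the scale.
    HUMAN_OPERATOR = "human_operator"
    TOOL_OUTPUT = "tool_output"
    UNTRUSTED_CONTENT = "untrusted_content"


# Assumption: these are the levels a policy treats as low trust.
LOW_TRUST = {SourceTrust.TOOL_OUTPUT, SourceTrust.UNTRUSTED_CONTENT}


@dataclass(frozen=True)
class ProposedAction:
    tool: str
    source_trust: SourceTrust  # trust level of the content that prompted the action


def decide(action: ProposedAction) -> str:
    """Decide from source trust alone; the model's stated intent is never consulted."""
    if action.source_trust in LOW_TRUST:
        return "REQUIRE_VERIFICATION"
    return "ALLOW"  # hypothetical decision name


from_email = ProposedAction(tool="send_payment", source_trust=SourceTrust.UNTRUSTED_CONTENT)
from_operator = ProposedAction(tool="send_payment", source_trust=SourceTrust.HUMAN_OPERATOR)
print(decide(from_email))
print(decide(from_operator))
```

The same tool call gets a different decision depending only on where the triggering content came from, which is the point of classifying sources before evaluating actions.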

Deterministic defense

Combine source trust with blast-radius scoring: external destinations, physical-world flags, and estimated value together determine whether an action gets REQUIRE_VERIFICATION before execution.
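One way this combination could look, as a sketch under stated assumptions rather than Sanctum's scoring model. The three signals come from the paragraph above; the weights, the value cutoff, the per-trust thresholds, the human_operator label, and the ALLOW decision are all hypothetical.

```python
from dataclasses import dataclass

# Level names from this post; treating them as "low trust" is an assumption.
LOW_TRUST = {"tool_output", "untrusted_content"}


@dataclass(frozen=True)
class Action:
    source_trust: str           # e.g. "tool_output" or "untrusted_content"
    external_destination: bool  # data or funds leave the organization
    physical_world: bool        # action has a physical-world effect
    estimated_value: float      # hypothetical unit, e.g. USD


def blast_radius(action: Action) -> int:
    """Additive score over the three signals; the weights are illustrative only."""
    score = 0
    if action.external_destination:
        score += 2
    if action.physical_world:
        score += 3
    if action.estimated_value >= 1000:  # hypothetical cutoff
        score += 2
    return score


def gate(action: Action) -> str:
    """Pure function of the inputs: the same action always gets the same decision."""
    # Assumption: low-trust sources get a much stricter threshold.
    threshold = 1 if action.source_trust in LOW_TRUST else 4
    if blast_radius(action) >= threshold:
        return "REQUIRE_VERIFICATION"
    return "ALLOW"  # hypothetical decision name


injected = Action("untrusted_content", external_destination=True,
                  physical_world=False, estimated_value=50.0)
operator = Action("human_operator", external_destination=True,
                  physical_world=False, estimated_value=50.0)
print(gate(injected))
print(gate(operator))
```

Because the gate is a plain function with no model call inside it, injected text cannot argue its way past the check; it can only change the inputs that the policy already scores.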

Give every agent action a trust boundary.
