Indirect prompt injection defense with source-trust classification
Tool output and untrusted content can hijack agents. Source-trust levels let policies treat tool_output and untrusted_content as higher risk automatically.
Indirect prompt injection hides instructions in web pages, emails, or tool output. If the model follows the attacker's text, it can end up calling dangerous tools. Sanctum classifies each input by source trust so that policies treat untrusted content differently from instructions given by the human operator.
Source-trust levels
Sanctum defines seven source-trust levels, including untrusted_content and tool_output. A policy can require verification for any action proposed while source trust is low, regardless of what the model “wants.”
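To make the idea concrete, here is a minimal sketch of a policy that looks only at the source-trust label. It is not Sanctum's actual API. The names SourceTrust, ProposedAction, evaluate, the OPERATOR level, and the ALLOW decision are assumptions made for illustration. Only untrusted_content, tool_output, and REQUIRE_VERIFICATION come from the post, and the remaining levels are left out because the post does not name them.

```python
from dataclasses import dataclass
from enum import Enum


class SourceTrust(Enum):
    # The post names only these two of the seven levels.
    UNTRUSTED_CONTENT = "untrusted_content"
    TOOL_OUTPUT = "tool_output"
    # Hypothetical stand-in for input that comes from the human operator.
    OPERATOR = "operator"


class Decision(Enum):
    ALLOW = "ALLOW"  # hypothetical name for the permissive outcome
    REQUIRE_VERIFICATION = "REQUIRE_VERIFICATION"


# Levels this sketch treats as "low" trust (an assumption).
LOW_TRUST = frozenset({SourceTrust.UNTRUSTED_CONTENT, SourceTrust.TOOL_OUTPUT})


@dataclass(frozen=True)
class ProposedAction:
    tool: str
    # Trust level of the content that led the model to propose this action.
    source_trust: SourceTrust


def evaluate(action: ProposedAction) -> Decision:
    """Decide from the source-trust label alone, never from the model's text."""
    if action.source_trust in LOW_TRUST:
        return Decision.REQUIRE_VERIFICATION
    return Decision.ALLOW


if __name__ == "__main__":
    injected = ProposedAction("send_email", SourceTrust.UNTRUSTED_CONTENT)
    print(evaluate(injected).value)
```

The point of the design is that evaluate never reads the model's reasoning or the injected text, so persuasive attacker wording cannot change the outcome.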
Deterministic defense
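Combine source trust with blast-radius scoring: external destinations, physical-world flags, and estimated value determine whether an action receives REQUIRE_VERIFICATION before it executes. Because the decision is computed from these fields rather than from model output, the same action always gets the same result.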
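One way the combination might look is sketched below. The weights, the value cutoff, the thresholds, and the names Action, blast_radius, and decide are all hypothetical, since the post does not describe how Sanctum computes its score. The sketch assumes that low source trust lowers the threshold so that any nonzero blast radius triggers verification.

```python
from dataclasses import dataclass

# Source-trust labels named in the post that this sketch treats as low trust.
LOW_TRUST_SOURCES = frozenset({"untrusted_content", "tool_output"})


@dataclass(frozen=True)
class Action:
    source_trust: str
    external_destination: bool  # data or effects leave the system
    physical_world: bool        # action touches the physical world
    estimated_value: float      # assumed unit: dollars


def blast_radius(action: Action) -> int:
    """Sum hypothetical weights for each risk signal the post lists."""
    score = 0
    if action.external_destination:
        score += 2
    if action.physical_world:
        score += 3
    if action.estimated_value >= 100.0:  # hypothetical cutoff
        score += 2
    return score


def decide(action: Action, threshold: int = 3) -> str:
    """Pure function of the action's fields: same input, same decision."""
    # Assumption: low-trust sources get a stricter threshold of 1.
    limit = 1 if action.source_trust in LOW_TRUST_SOURCES else threshold
    if blast_radius(action) >= limit:
        return "REQUIRE_VERIFICATION"
    return "ALLOW"  # hypothetical name for the permissive outcome


if __name__ == "__main__":
    upload = Action("tool_output", True, False, 0.0)
    print(decide(upload))
```

In this sketch an external upload proposed from tool output is held for verification, while the same upload requested by the operator passes, because only the threshold changes with source trust.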
