The Problem With Traditional GuardRails

GuardRails are an attempt to preserve an agent’s open-ended intelligence and keep it inside hard policy walls. GuardRails wrap every call with lightweight but strict checkpoints that veto disallowed behaviour while letting everything else flow. However, traditional GuardRails have serious drawbacks. Most GuardRail models have been designed keeping only LLM security in mind, which includes general checks for toxicity, harmful content or violence. A lot of checks are static keyword or regex based checks, which fail to capture the semantics of the input.

Consequently, some efforts were made to create programmable security policies specific to Agents. One that stands out is NeMo by Nvidia, which provides programmable guardrails for Ai Agents in the form of static checks, and prompt defined intents and rails. But again a major challenge of this approach remains on the effectiveness of the guard model in being able to identify different strategies used by an adversary. In theory, it’s a tedious and an almost impossible task to identify every single possible adversarial prompting strategy which could be used to jailbreak a model.

Last updated