QuillGuard: Adversarial Dynamic Agent Rails
The QuillAI Adversarial Rails core architecture mainly consists of three components: an attacker LLM (adversary), which generates jailbreak prompts, a guardrails enforcement model, which enforces input prompt policies; and a judgment model, which confirms the effectiveness of the attack. The whole process can be understood as a series of steps.
Step 1: Adversarial Attack Generation
QuillGuard trains an RL Agent’s policies on the exact invariants that break a victim agent’s safe behaviour. These invariants are prompt injection or memory injection attacks, which should never pass through an agent given its operational boundaries. The adversary iteratively rewrites jailbreak prompts, context injections, and tool‑call manipulations until it breaches those invariants. In our case, we make use of the Llama-3-8b-Jailbreak model as an adversary.
Step 2: Attack Confirmation
The success of an attack is confirmed by checking if the action/output of the victim agent matches the invariant. If the attack is successful, then the adversary is rewarded, and the attacking prompt is handed over to the profiler. This can be achieved by taking any LLM with a decent parameter count. In our case, we make use of the Qwen2.5-7 B-Instruct model as it’s lightweight and has faster response times.
Step 3: Attack Prompt Profiling
This process reads through the successful attack prompt to create its profile, taking into account the prompt’s writing style or obfuscation techniques. This profile is then injected into the live profiler, which prevents the same attack from occurring. The profile policies are stored in a config.yml file and are enforced by a guardrails base model to ensure effectiveness. For our experiment, we made use of `nvidia/llama-3.1-nemoguard-8b-content-safety `
This process is then repeated, forming the RL policy loop for our adversarial agent, which becomes progressively better at attacking victim agents. The dynamic templates are updated alongside, ensuring an ever-evolving security system. This way, QuillGuard creates customised security profiles for each agent.
Last updated