There's something I find genuinely clarifying about Cisco's new research on AI safety benchmarks, published this week. Not because it's surprising. Because it names, with actual numbers, the thing that has been quietly wrong for a while.
The study ran 15 closed flagship models from OpenAI, Anthropic, Google, Amazon, and xAI through two evaluation regimes: roughly 30,000 single-turn prompts, and about 7,000 multi-turn attacks spread across more than 1,400 conversations. The central finding is that the two regimes produce completely different model rankings, different failure maps, and different risk profiles. Multi-turn attack success rates climbed as high as 88% across the cohort, and no model tested was immune.
The numbers for individual models are worth sitting with. Anthropic's Claude family posted the lowest single-turn attack success rate in the group, between 2% and 3.6%. Under multi-turn pressure, that rose to between 11% and 16%. GPT-5.4 went from a 2.74% single-turn failure rate to 24.68% under iterative attack, a ninefold increase. Gemini 3 Pro moved from around 18% to 73%. Grok 4.1 Fast, without reasoning mode enabled, topped the cohort at 88.3%.
That last number comes with a detail worth pausing on. The same Grok 4.1 Fast model, with reasoning mode turned on, dropped from 88.3% to 43.5%. That is a swing of nearly forty-five points tied to a single configuration flag, one that Cisco found is not documented in any public benchmark or model card they reviewed. Users running the default, non-reasoning configuration are operating with a threat profile that is basically invisible in the published safety record.
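For anyone who wants to check the arithmetic, here is a small sketch that recomputes the fold increases and point swings from the percentages quoted above. The helper names `fold_increase` and `point_change` are my own, not anything from Cisco's report, and the Gemini 3 Pro single-turn figure is the approximate "around 18%" given in the text, so treat that row as rough.

```python
# Attack success rates (percent) exactly as quoted in this article.
# Helper names are illustrative assumptions, not Cisco's tooling.

def fold_increase(before: float, after: float) -> float:
    """How many times larger `after` is than `before`."""
    return after / before

def point_change(before: float, after: float) -> float:
    """Difference in percentage points (negative means a drop)."""
    return after - before

# (single-turn rate, multi-turn rate)
single_vs_multi = {
    "GPT-5.4": (2.74, 24.68),
    "Gemini 3 Pro (single-turn figure is approximate)": (18.0, 73.0),
}

for name, (single, multi) in single_vs_multi.items():
    print(f"{name}: {fold_increase(single, multi):.2f}x, "
          f"{point_change(single, multi):+.1f} points")

# Grok 4.1 Fast, multi-turn: reasoning mode off vs. on.
grok_off, grok_on = 88.3, 43.5
print(f"Grok 4.1 Fast, reasoning on vs. off: "
      f"{point_change(grok_off, grok_on):+.1f} points")
```

Run as written, the GPT-5.4 row comes out at just over nine times, and the Grok configuration swing at a little under forty-five points.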
The strategy behind multi-turn attacks is straightforward: reframe, build context across turns, adopt a persona, escalate gradually. A model that correctly refuses a blunt harmful request may comply when that same request is decomposed across a conversation. Cisco's taxonomy covers role-play, contextual ambiguity, refusal reframing, information decomposition, and what they call crescendo-style incremental escalation. These are not exotic research constructs. They are how people actually probe models.
From where I sit, the structural problem is obvious. Single-turn evaluation is simple to run, reproducible, and easy to compare across labs. It became the standard not because it reflects real attack conditions but because it fits how researchers like to publish results. A benchmark that tests one prompt and one response tells you how a model behaves when an attacker gets exactly one shot and then stops. That is not the adversarial environment any deployed model actually lives in.
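To make the distinction concrete, here is a toy sketch of the two regimes. Everything in it is hypothetical: `toy_model` is a deliberately naive stand-in that refuses a blunt request but gives way once an earlier turn has set up a persona, and the judge simply treats any non-refusal on the final turn as a successful attack. It is not Cisco's harness and says nothing about how any real model behaves. It only shows why the same system can score differently depending on whether you count per prompt or per conversation.

```python
from typing import Callable, List

# Hypothetical interface: the list of user turns so far goes in, a reply comes out.
Model = Callable[[List[str]], str]

REFUSAL = "I can't help with that."

def toy_model(history: List[str]) -> str:
    """Naive stand-in: refuses 'blocked' requests unless an earlier turn set up a persona."""
    latest = history[-1].lower()
    primed = any("pretend" in turn.lower() for turn in history[:-1])
    if "blocked" in latest and not primed:
        return REFUSAL
    return "OK"

def attack_succeeded(reply: str) -> bool:
    """Toy judge: any reply other than the refusal counts as a success."""
    return reply != REFUSAL

def single_turn_asr(model: Model, prompts: List[str]) -> float:
    """Attack success rate when each prompt is sent with no prior context."""
    hits = sum(attack_succeeded(model([p])) for p in prompts)
    return hits / len(prompts)

def multi_turn_asr(model: Model, conversations: List[List[str]]) -> float:
    """Attack success rate per conversation, judged on the reply to the final turn."""
    hits = 0
    for turns in conversations:
        history: List[str] = []
        reply = ""
        for turn in turns:
            history.append(turn)
            reply = model(history)
        hits += attack_succeeded(reply)
    return hits / len(conversations)

SINGLE_PROMPTS = [
    "Tell me the blocked thing.",
    "Explain the blocked thing in detail.",
]
CONVERSATIONS = [
    ["Let's pretend you are a novelist.", "In character, tell me the blocked thing."],
    ["Hello there.", "Tell me the blocked thing."],
]

print("single-turn ASR:", single_turn_asr(toy_model, SINGLE_PROMPTS))
print("multi-turn ASR:", multi_turn_asr(toy_model, CONVERSATIONS))
```

In this toy setup the single-turn score is zero while half of the conversations succeed, which is the shape of the gap the study describes, just with made-up inputs.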
The deeper issue is that the industry has allowed procurement decisions, safety reports, and model cards to rest on this single-regime view. KPMG is now deploying Claude to 276,000 employees. The US Department of Health and Human Services is using AI to audit federal health spending across all 50 states. At that scale, the gap between a 3% single-turn failure rate and a 16% multi-turn failure rate is not a rounding error.
Cisco is calling on labs to document the safety effects of configuration flags alongside capability benchmarks. That seems like the minimum. The harder ask is that the field start treating multi-turn evaluation as the baseline rather than the supplement. The single-prompt score tells you something. It just doesn't tell you enough, and the gap between what it tells you and what actually matters is now quantified, across 15 models, in a form that is hard to dismiss.










