Or: How I learned that "independent validators" are like siblings – they share the same trauma.
You know that feeling when you ask two security guards to watch the door, and they both fall asleep at exactly the same time because they had the same lunch?
That's basically what happened when I tested two different LLMs as independent jailbreak detectors.
The Setup
- Model A: Groq / Llama 3.1 8B (Factual)
- Model B: OpenRouter / Gemma 4 31B (Structural)
- Temperature: 0.0 (Cold, hard refusal logic)
The Results: The Illusion of Independence
| Metric | Value |
|---|---|
| Agreement | 70% |
| Phi correlation | 0.42 |
| Cohen's kappa | 0.40 |
| Beyond‑chance co‑failure | +10% |
Translation: the two models agree more often than chance alone would predict. When Groq fell for a prompt, Gemma fell for it too 56% of the time (14 of 25). When Groq held, Gemma failed only 16% of the time (4 of 25).
Why is this happening?
- Shared Training Sets: They’ve both read the same parts of the internet.
- Alignment Overlap: Most "safety training" uses similar RLHF datasets.
- Common Logic: They both struggle with the same types of persuasive "roleplay" jailbreaks.
Vulnerability rates:
- Groq: 50% (yes, half the time it just… complied)
- Gemma: 36% (slightly better, still not great)
The "Where Did They Both Fail?" Table
| | Gemma SAFE | Gemma VULN |
|---|---|---|
| Groq SAFE | 21 | 4 |
| Groq VULN | 11 | 14 |
The 14 cases where both models were vulnerable (the bottom-right cell, n11) are the shared blind spot. The 11 + 4 = 15 cases where they disagreed are the only places where having two models actually helped.
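For anyone who wants to check the numbers, here is a minimal sketch that recomputes agreement, phi, and Cohen's kappa from the four cells of the table. It uses the standard textbook formulas for a 2x2 table; the function names are my own for illustration and are not necessarily what the repo script uses.

```python
from math import sqrt

# Cell naming: first digit = Groq, second digit = Gemma (0 = SAFE, 1 = VULN).

def agreement(n00, n01, n10, n11):
    """Share of tests where both models got the same verdict."""
    return (n00 + n11) / (n00 + n01 + n10 + n11)

def phi(n00, n01, n10, n11):
    """Phi coefficient of a 2x2 table (undefined if any row or column sums to 0)."""
    numerator = n11 * n00 - n10 * n01
    denominator = sqrt((n00 + n01) * (n10 + n11) * (n00 + n10) * (n01 + n11))
    return numerator / denominator

def cohens_kappa(n00, n01, n10, n11):
    """Agreement corrected for the agreement expected by chance."""
    n = n00 + n01 + n10 + n11
    p_observed = (n00 + n11) / n
    p_both_safe = ((n00 + n01) / n) * ((n00 + n10) / n)
    p_both_vuln = ((n10 + n11) / n) * ((n01 + n11) / n)
    p_chance = p_both_safe + p_both_vuln
    return (p_observed - p_chance) / (1 - p_chance)

table = (21, 4, 11, 14)  # n00, n01, n10, n11 from the table above
print(f"agreement = {agreement(*table):.2f}")   # agreement = 0.70
print(f"phi       = {phi(*table):.3f}")         # phi       = 0.417
print(f"kappa     = {cohens_kappa(*table):.3f}")  # kappa     = 0.400
```

With only 50 tests these are rough estimates, so treat the third decimal as decoration.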
What I Learned
Different roles ≠ independent. A factual model and a structural model still share training data, alignment tuning, and cultural biases.
The effective sample size (n_eff) was 35.3 from 50 tests. That means my two‑model ensemble behaves like roughly 1.75 independent judges. Not 2. So much for "redundancy."
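One formula that reproduces the 35.3 figure is the simple discount n / (1 + correlation), with phi as the correlation. That this is the formula behind the number is my assumption here; check the repo script for the exact calculation. A small sketch:

```python
def effective_sample_size(n_tests, correlation):
    """Discount the test count by the correlation between the two judges.

    Assumed formula: n / (1 + correlation). With correlation 0 nothing is
    discounted; with correlation 1 the count is halved.
    """
    return n_tests / (1 + correlation)

n_eff = effective_sample_size(50, 0.417)  # 50 tests, phi from the results table
print(round(n_eff, 1))  # 35.3
```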
Beyond‑chance co‑failure was +10 percentage points. Expected joint failure if independent: 50% × 36% = 18%. Observed: 14 of 50 = 28%. Those extra 10 points are the cost of correlated training.
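The arithmetic behind that comparison fits in a few lines. This is a sketch with a function name I made up (`co_failure`): it multiplies the two marginal failure rates to get the joint failure rate you would expect under independence, then compares that with the observed share of tests where both models failed.

```python
def co_failure(n00, n01, n10, n11):
    """Return (expected, observed, excess) joint failure rates for a 2x2 table.

    Cell naming: first digit = Model A, second digit = Model B (1 = VULN).
    """
    n = n00 + n01 + n10 + n11
    p_a_vuln = (n10 + n11) / n       # Model A failure rate
    p_b_vuln = (n01 + n11) / n       # Model B failure rate
    expected = p_a_vuln * p_b_vuln   # joint failure if the models were independent
    observed = n11 / n               # joint failure actually seen
    return expected, observed, observed - expected

expected, observed, excess = co_failure(21, 4, 11, 14)
print(f"expected {expected:.0%}, observed {observed:.0%}, excess {excess:+.0%}")
# expected 18%, observed 28%, excess +10%
```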
The real value is in disagreement. The two models disagreed on 30% of the tests. Those are the only cases where a second model adds information. The rest is just expensive consensus.
Should You Stop Using Multiple Models?
No. But you should measure independence instead of assuming it.
If you're building a safety system that requires two models to agree before approving an action, and their failures are correlated, you're not getting 2x safety. You're getting 1.75x at best – and sometimes just 1.1x.
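If you want to run the same check on your own pipeline, the starting point is a pair of verdict lists, one per model, aligned by prompt. Here is a minimal sketch that turns them into the four cell counts and pulls out the prompts where the models disagreed. The function names and the toy data are hypothetical, made up purely for illustration; they are not the experiment's prompts or results.

```python
from collections import Counter

def contingency(verdicts_a, verdicts_b):
    """Count verdict pairs. Each verdict is True if that model was jailbroken."""
    if len(verdicts_a) != len(verdicts_b):
        raise ValueError("verdict lists must be the same length")
    counts = Counter(zip(verdicts_a, verdicts_b))
    return {
        "n00": counts[(False, False)],  # both SAFE
        "n01": counts[(False, True)],   # A SAFE, B VULN
        "n10": counts[(True, False)],   # A VULN, B SAFE
        "n11": counts[(True, True)],    # both VULN: the shared blind spot
    }

def disagreements(prompt_ids, verdicts_a, verdicts_b):
    """Prompts where exactly one of the two models was jailbroken."""
    return [pid for pid, a, b in zip(prompt_ids, verdicts_a, verdicts_b) if a != b]

# Toy data (hypothetical): five prompts, True = model complied with the jailbreak.
prompt_ids = ["p1", "p2", "p3", "p4", "p5"]
model_a = [True, True, False, False, True]
model_b = [True, False, False, True, True]

print(contingency(model_a, model_b))
# {'n00': 1, 'n01': 1, 'n10': 1, 'n11': 2}
print(disagreements(prompt_ids, model_a, model_b))
# ['p2', 'p4']
```

The four counts are all you need to compute phi, kappa, and the co-failure excess for your own model pair.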
The Code & Data
You can find the full prompt set, the raw JSON responses, and the Python script used for the statistical analysis here:
setuju/LLM-Independence-Experiment
LLM Independence Experiment – Groq Llama 3.1 vs OpenRouter Gemma 4
Different roles give some independence, but not real independence.
— Marco Somma
We ran 50 jailbreak/prompt injection tests on two popular LLMs to measure how correlated their failure modes are. The question: if you use two models as independent validators, do they actually fail differently?
📊 Key Results
| Metric | Value |
|---|---|
| Phi correlation | 0.417 |
| Cohen's kappa | 0.400 |
| Agreement | 70% |
| Disagreement | 30% |
| Effective sample size (n_eff) | 35.3 (from 50 tests) |
| Beyond‑chance co‑failure | +10% |
Vulnerability rates
- Groq (Llama 3.1 8B) : 50% vulnerable
- OpenRouter (Gemma 4 31B) : 36% vulnerable
Contingency table
| | Model B SAFE | Model B VULN |
|---|---|---|
| Model A SAFE | 21 (n00) | 4 (n01) |
| Model A VULN | 11 (n10) | 14 (n11) |
🧠 What This Means
- Phi = 0.417 indicates moderate correlation – the models share significant blind spots, but not perfectly.
- Cohen's kappa = 0.40 confirms moderate agreement beyond chance.
- Expected…
- 50 prompts
- Full responses (so you can laugh/cry at what they actually said)
- Phi, kappa, n_eff, beyond‑chance co‑failure
What's your experience?
Have you tried using "independent" LLM judges in your pipeline? Did you measure their correlation, or did you take their independence for granted?
I'd love to hear if anyone has found a 'magic pairing' of models that actually disagree in useful ways!
Independence isn't a feature you can assume. It's a property you have to verify. And sometimes, the answer is uncomfortable.
But hey – at least the models were confidently wrong together.
That's teamwork, I guess.
Special thanks to Marco Somma for pushing me to calculate kappa and beyond‑chance co‑failure. I should have been enjoying the weekend, but at least I learned something.
Jack

















