ReplicatorBench Shows LLM Agents Fail to Replicate Half of Social Science Studies

Key Takeaways

The ReplicatorBench benchmark finds that top-tier LLM agents successfully replicate only around half of the social science studies in a dataset already verified as reproducible by humans.
Agents struggle most with data cleaning and translating ambiguous methodology descriptions into working statistical code.
Autonomous replication could cut costs dramatically compared to human-led systematic reviews, but only if agentic reliability improves through better long-horizon planning.

A new benchmark called ReplicatorBench puts a hard number on something researchers have long debated: AI agents can autonomously replicate roughly half of verifiable social science studies at a cost of $2 to $15 per paper, compared to thousands of dollars for a human researcher. The catch is that the other half fail, often due to sloppy data handling rather than statistical complexity. That gap between “impressive” and “trustworthy” is where the real work lies.

Replication is the backbone of scientific integrity. In psychology and economics, reproducing a single paper can take a human researcher dozens of hours — deciphering old methodology, tracking down datasets, debugging legacy code. ReplicatorBench automates that evaluation: give an agent the original paper and raw data, then see if it can reproduce the exact statistical results. Testing frontier models like GPT-4o and Claude 3.5 Sonnet on this task gives a clear read on where autonomous research agents actually stand.

Criteria for Evaluating AI Research Agents

To make sense of what LLM agents offer here, you need to measure them against human-led replication on the dimensions that actually matter:

Methodological Rigor: Can the agent accurately interpret complex, often ambiguous paper instructions and translate them into valid statistical procedures?
Operational Cost: What does a single replication actually cost — API fees for AI, or salary and stipends for human researchers?
Temporal Efficiency: How long from initial prompt or assignment to a finished replication report?
Error Transparency: When a replication fails, can you tell whether the original paper was flawed or the agent made an error?

LLM Agents: The Autonomous Approach via ReplicatorBench

The agentic workflow tested in ReplicatorBench is straightforward: a reasoning engine, a code execution sandbox (Python or R), and the ability to parse PDFs. Agents were tasked with replicating around 100 studies already verified as reproducible by humans — a clean ground truth to measure against.
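How a replication gets scored as a success is not spelled out here, so the following is only a minimal sketch of one plausible check: compare the agent's reproduced statistic with the value reported in the paper, within a relative tolerance. The function name `matches_reported` and the 1% tolerance are assumptions for illustration, not ReplicatorBench's actual criterion.

```python
import math

def matches_reported(reproduced: float, reported: float, rel_tol: float = 0.01) -> bool:
    """Return True if the reproduced value is within rel_tol of the reported one.

    rel_tol=0.01 (1%) is an illustrative assumption, not a benchmark setting.
    """
    # abs_tol keeps the comparison meaningful when the reported value is zero
    return math.isclose(reproduced, reported, rel_tol=rel_tol, abs_tol=1e-9)

# Hypothetical numbers: a paper reports a coefficient of 0.342
print(matches_reported(0.3418, 0.342))  # small rounding difference
print(matches_reported(0.291, 0.342))   # substantively different estimate
```

A tolerance-based comparison avoids penalising an agent for rounding differences, while still catching estimates that differ in substance.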

Results show agents succeed on roughly half of attempts. The bottleneck isn’t statistical complexity; it’s data preprocessing. Social science datasets are messy, and the cleaning steps are often described vaguely in a paper’s Methods section. Agents frequently generate incorrect variable names or miss outlier-removal protocols mentioned in passing. What they do have going for them is speed: an agent can attempt a replication in 10 to 20 minutes, where a human researcher might take a week.

Cost per replication runs $2 to $15 in API fees depending on the model and number of iterations — orders of magnitude cheaper than the thousands typically spent on human-led systematic reviews. The main risk is false negatives: an agent fails to replicate a valid study due to its own coding errors, potentially leading researchers to wrongly flag the original as flawed.
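Because roughly half of attempts fail, the per-attempt price understates what a successful replication costs. A rough sketch of that arithmetic follows, assuming every attempt costs the same and succeeds independently at the stated rate. The function `cost_per_success` is a hypothetical helper, not part of the benchmark.

```python
def cost_per_success(cost_per_attempt: float, success_rate: float) -> float:
    """Expected spend per successful replication.

    Assumes each attempt costs the same and succeeds independently
    with probability success_rate.
    """
    if not 0 < success_rate <= 1:
        raise ValueError("success_rate must be in (0, 1]")
    return cost_per_attempt / success_rate

# Figures from the article: $2 to $15 per attempt, roughly 50% success
low = cost_per_success(2.0, 0.5)
high = cost_per_success(15.0, 0.5)
print(f"${low:.2f} to ${high:.2f} per successful replication")
# prints "$4.00 to $30.00 per successful replication"
```

Even with the cost doubled to account for failures, the figure stays far below the thousands of dollars cited for human-led work.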

Human Researchers: The Traditional Gold Standard

Human-led replication holds its position for a reason: tacit knowledge. An experienced researcher knows that a survey question from 2012 probably needs a specific inflation adjustment even when the paper doesn’t spell out the formula. Humans can also debug the original authors’ intent — emailing investigators for missing data files or clarification on ambiguous code comments.

But the human approach doesn’t scale. The replication crisis exists largely because there’s no funding or professional incentive to re-test thousands of existing studies. Researchers also bring their own biases — they may work harder to replicate a high-profile study, or be more motivated to find flaws in a rival’s work. And for all the talk of transparency, the mental steps a human takes are often just as opaque as a model’s output, buried under pages of narrative text.

Analysis of Success Rates and Failure Modes

ReplicatorBench data shows a clear split based on task complexity. For studies using simple linear regressions or standard t-tests on clean datasets, agent success rates approach 80%. For tasks requiring long-horizon reasoning — multi-stage data cleaning followed by complex instrumental variable analysis — success rates drop sharply, in some cases below 30%.

The most common failure modes:

Instruction Following: Agents miss a single sentence buried in a 30-page PDF specifying a data subset, producing a mismatch in final p-values.
Library Dependency Errors: Agents attempt to use deprecated Python or R libraries, or fail to resolve environment conflicts during code execution.
Statistical Misinterpretation: The agent runs the code correctly but identifies the wrong output number as the paper’s main result.
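One cheap guard against the first failure mode is to compare the number of rows left after cleaning with the sample size the paper reports. The sketch below assumes the paper states its analytic N; the function `check_sample_size` and the example numbers are hypothetical.

```python
def check_sample_size(rows_after_cleaning: int, reported_n: int) -> str:
    """Compare the agent's analytic sample with the N reported in the paper.

    A mismatch suggests a missed subset or exclusion rule, so the run can be
    flagged before any p-values are compared.
    """
    if rows_after_cleaning == reported_n:
        return "ok"
    diff = rows_after_cleaning - reported_n
    direction = "extra" if diff > 0 else "missing"
    return f"mismatch: {abs(diff)} {direction} rows vs reported N={reported_n}"

# Hypothetical example: the paper reports N=1204 but the agent kept 1350 rows
print(check_sample_size(1350, 1204))
# prints "mismatch: 146 extra rows vs reported N=1204"
```

A check like this will not say which sentence the agent missed, but it narrows the failure to data preparation rather than the statistical model.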

The engine is powerful; the steering isn’t there yet. A 50% success rate is a long way from the 95% or higher threshold that scientific verification actually demands. Closing that gap — through better long-horizon planning and agentic reliability — is the real frontier. If you’re building research automation pipelines, this is worth tracking alongside developments in how orchestration tools handle high-volume agentic workflows.

Comparative Summary: Cost, Speed, and Reliability

The trade-off is scale versus certainty. Here’s how the two approaches compare based on current research findings:

Feature | LLM Agent (ReplicatorBench) | Human Researcher
Success Rate | Around 50% | Around 90% (with effort)
Time per Study | 10–20 minutes | 20–60 hours
Cost per Study | $2–$15 | $1,500–$5,000
Scalability | Massive (parallel execution) | Limited (human labour pool)

The practical use case right now is triage: run agents across a thousand papers, flag the ones that look problematic and hand the highest-priority cases to human researchers. That’s a genuinely useful workflow, even at a 50% success rate.
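The triage step can be sketched as a simple ranking: keep only the papers the agent failed to replicate, sort them by priority, and pass on as many as the human team can review. The `ReplicationResult` fields and the `priority` score are assumptions for illustration; how priority is assigned is not specified here.

```python
from dataclasses import dataclass

@dataclass
class ReplicationResult:
    paper_id: str
    replicated: bool   # did the agent reproduce the reported result?
    priority: float    # hypothetical score supplied by the caller

def triage(results: list[ReplicationResult], human_capacity: int) -> list[str]:
    """Return the paper IDs to hand to human reviewers.

    Only failed replications are flagged; they are ranked by priority and
    truncated to the number of reviews the human team can absorb.
    """
    flagged = [r for r in results if not r.replicated]
    flagged.sort(key=lambda r: r.priority, reverse=True)
    return [r.paper_id for r in flagged[:human_capacity]]

results = [
    ReplicationResult("paper-a", replicated=True, priority=0.9),
    ReplicationResult("paper-b", replicated=False, priority=0.4),
    ReplicationResult("paper-c", replicated=False, priority=0.8),
    ReplicationResult("paper-d", replicated=False, priority=0.1),
]
print(triage(results, human_capacity=2))  # ['paper-c', 'paper-b']
```

Since an agent can fail because of its own coding errors, a flag here is a prompt for human review, not a verdict on the paper.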

Enterprise and Academic Recommendations

For research institutions, government agencies and pharma companies looking to integrate agentic AI into verification workflows, a hybrid model is the only defensible path right now. Relying solely on current-generation agents introduces too much noise and too many incorrect conclusions.

The architecture that makes sense is a human-in-the-loop replication pipeline. The agent does the heavy lifting — parsing PDFs, identifying data sources, writing the initial analysis script. A human researcher then spends 30 minutes reviewing the agent’s code and output rather than 30 hours doing the work from scratch. You preserve human oversight while capturing the bulk of the efficiency gains. The output review bottleneck is a known challenge in these workflows — worth planning for before you scale.
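To put the review-versus-rework trade in numbers, here is a back-of-the-envelope calculation for a thousand-paper batch. It assumes a flat per-paper effort and ignores the cases where review uncovers problems that need deeper work; `total_hours` is a hypothetical helper.

```python
def total_hours(papers: int, hours_per_paper: float) -> float:
    """Total human hours for a batch, assuming a flat per-paper effort."""
    return papers * hours_per_paper

papers = 1000                             # illustrative batch size
from_scratch = total_hours(papers, 30.0)  # 30 hours of manual replication each
review_only = total_hours(papers, 0.5)    # 30 minutes of review each

print(from_scratch, review_only, from_scratch / review_only)
# prints "30000.0 500.0 60.0"
```

Under these assumptions the human workload shrinks by a factor of 60, though 500 hours of review is still a real staffing cost.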

Developers building these agents should prioritise traceability over raw accuracy. The goal isn’t just a final number — it’s a step-by-step audit log of every decision made during data cleaning. That transparency is what builds the trust needed to move from research assistant to autonomous peer.
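A minimal sketch of what such an audit log could look like: each cleaning step is applied through a wrapper that records what was done and how many rows it removed. The rows, the cleaning rules, and the `apply_step` helper are all hypothetical, chosen only to illustrate the idea.

```python
import json

def apply_step(rows, description, keep):
    """Filter rows with `keep` and return (kept_rows, audit_entry)."""
    kept = [r for r in rows if keep(r)]
    entry = {
        "step": description,
        "rows_before": len(rows),
        "rows_after": len(kept),
        "rows_dropped": len(rows) - len(kept),
    }
    return kept, entry

# Hypothetical survey rows and cleaning rules, for illustration only
rows = [
    {"id": 1, "age": 34, "income": 52000},
    {"id": 2, "age": None, "income": 48000},
    {"id": 3, "age": 29, "income": 9900000},
    {"id": 4, "age": 51, "income": 61000},
]

audit_log = []
rows, entry = apply_step(rows, "drop rows with missing age",
                         lambda r: r["age"] is not None)
audit_log.append(entry)
rows, entry = apply_step(rows, "drop income outliers above 1,000,000",
                         lambda r: r["income"] <= 1_000_000)
audit_log.append(entry)

# The log is plain JSON, so a reviewer can read it without running any code
print(json.dumps(audit_log, indent=2))
```

Recording row counts before and after each step lets a reviewer see at a glance which decision changed the sample, without re-running the analysis.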

There’s also a structural fix needed on the publishing side. If agents are to replicate results reliably, journals need to move toward standardised data formats and machine-readable method descriptions. Many of the failures in ReplicatorBench reflect how vaguely humans communicate methodology in writing, not just the limitations of the models themselves.
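What a machine-readable method description would contain is not defined here. As one hedged illustration, the sketch below encodes a method as structured data that serialises to JSON, with the exclusion rules stated explicitly. The `MethodSpec` schema, its field names, and the example values are hypothetical and do not reflect any existing journal standard.

```python
import json
from dataclasses import dataclass, asdict, field

@dataclass
class MethodSpec:
    """Hypothetical schema; field names are illustrative, not a standard."""
    dataset: str
    outcome: str
    predictors: list[str]
    model: str
    exclusions: list[str] = field(default_factory=list)

spec = MethodSpec(
    dataset="survey_2012.csv",
    outcome="trust_score",
    predictors=["age", "income"],
    model="ols",
    exclusions=["age is missing", "income > 1000000"],
)

# Serialise so an agent can read the exclusion rules directly
payload = json.dumps(asdict(spec), indent=2)
print(payload)

# Round trip to confirm nothing is lost in serialisation
assert MethodSpec(**json.loads(payload)) == spec
```

Stating exclusions as explicit fields removes the need for an agent to find them in narrative text, which is where the instruction-following failures originate.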

Originally published at https://autonainews.com/replicatorbench-shows-llm-agents-fail-to-replicate-half-of-social-science-studies/