Overview
Systematic reviews are the gold standard of evidence synthesis — but they take an average of 12-18 months to complete. The most time-consuming stages are title-abstract screening, full-text screening, data extraction, and risk of bias assessment.
Since 2023, a growing body of research has evaluated whether large language models (LLMs) like GPT-4, Claude, and Gemini can reliably automate these stages. The findings are encouraging: for screening and data extraction, LLMs achieve high sensitivity and accuracy, often matching or exceeding human performance, especially when multiple models are used in ensemble. Agreement with human raters is more moderate for risk of bias assessment.
Below we summarize 8 peer-reviewed studies spanning screening, data extraction, risk of bias assessment, and end-to-end automation.
96.7%
Screening sensitivity
Cao et al. 2025
93.1%
Extraction accuracy
Cao et al. 2025
92.4%
Data point accuracy
Jensen et al. 2025
12 reviews in 2 days
vs. ~12 work-years
Cao et al. 2025
Screening Studies
Title-abstract and full-text screening is the most studied application of LLMs in systematic reviews. Across multiple independent studies, GPT-4-class models achieve accuracy of 90% or higher, outperform traditional ML screening tools, and can cut manual screening time by orders of magnitude.
Evaluating the Efficacy of Large Language Models for Systematic Review and Meta-Analysis Screening
Luo R, Sastimoglu Z, Faisal AI, Deen MJ
- Evaluated GPT-3.5 Turbo across 24,534 studies in three systematic reviews
- Title-abstract screening accuracy ranged from 80% to 95%
- Full-text screening showed strongest performance gains, with improved sensitivity and Matthews Correlation Coefficient
- Demonstrated reduction from year-long review timelines to hours while maintaining accuracy
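The Matthews Correlation Coefficient mentioned above balances all four cells of the screening confusion matrix, which matters because relevant papers are usually rare. A minimal sketch, using illustrative counts rather than figures from the study:

```python
# Matthews Correlation Coefficient (MCC) for binary screening decisions.
# The counts below are illustrative, not data from Luo et al.
def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)).
    Ranges from -1 to +1 and stays informative when included
    papers are rare, unlike raw accuracy."""
    num = tp * tn - fp * fn
    den = ((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)) ** 0.5
    return num / den if den else 0.0

# A screener that keeps 90 of 100 relevant papers (10 false negatives)
# and wrongly includes 50 of 900 irrelevant ones scores about 0.73.
example = mcc(tp=90, tn=850, fp=50, fn=10)
```

Note that a screener could reach 90% accuracy on this corpus by excluding everything; MCC would score that strategy at zero, which is why full-text screening studies report it alongside sensitivity.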
Transforming Literature Screening: The Emerging Role of Large Language Models in Systematic Reviews
Delgado-Chaves FM et al.
- Evaluated 18 different LLMs across three systematic reviews
- LLMs enable substantial reduction in manual screening workload
- Performance depends on the interplay between inclusion/exclusion criteria and the specific LLM
- Refining criteria formulation with LLM support improves screening performance
Evaluating the Effectiveness of Large Language Models in Abstract Screening: A Comparative Analysis
Li M, Sun J, Tan X
- Compared GPT-4, GPT-3.5, Gemini, Llama, and Claude across three benchmark datasets
- GPT-4 achieved balanced sensitivity and specificity with accuracy consistently at or above 90%
- LLMs outperformed traditional ML tools (ASReview, Abstrackr) in sensitivity and AUC
- LLMs show promise as autonomous AI reviewers or collaborative partners with human experts
Can Large Language Models Replace Humans in Systematic Reviews? Evaluating GPT-4's Efficacy in Screening and Extracting Data
Khraisha Q et al.
- GPT-4 achieved human-level accuracy in full-text screening with highly reliable prompts
- Cross-language screening capability validated across peer-reviewed and grey literature
- Performance at title-abstract level was affected by chance agreement and dataset imbalance
- Full-text screening showed more robust, 'human-like' performance levels
Data Extraction
Data extraction — pulling study characteristics, outcomes, and population details from full-text papers — is traditionally one of the most labor-intensive phases. Recent studies show LLMs can serve as reliable second raters with accuracy above 88%.
ChatGPT-4o Can Serve as the Second Rater for Data Extraction in Systematic Reviews
Jensen MM, Danielsen MB, Riis J, Kristjansen KA, Andersen S, Okubo Y, Jorgensen MG
- GPT-4o extracted 484 data points across 11 papers with 92.4% accuracy (95% CI: 89.5-94.5%)
- Fabricated data appeared in only 5.2% of extractions (95% CI: 3.4-7.4%)
- Reproducibility between two sessions was 94.1%
- Qualified as a reliable second reviewer for systematic review data extraction
Novel AI Applications in Systematic Review: GPT-4 Assisted Data Extraction, Analysis, Review of Bias
Oami T et al.
- GPT-4 showed 88.6% concordance with original human review for data extraction
- Less than 5% of discrepancies were due to inaccuracies or omissions
- Four specialized GPT-4 models developed for study characteristics, outcomes, bias domains, and RoB evaluation
- Risk of bias assessment showed fair-to-moderate intra-rater agreement (ICC = 0.518)
Risk of Bias Assessment
Risk of bias assessment requires nuanced judgment about study design and conduct. LLMs show moderate agreement with human reviewers — strongest for identifying low-risk studies and weaker for detecting high-risk domains. AI-assisted RoB works best as a starting point for human review.
Using a Large Language Model (ChatGPT-4o) to Assess the Risk of Bias in Randomized Controlled Trials of Medical Interventions
Rose CJ, Bidonde J, Ringsten M, Glanville J, Potrebny T, Cooper C, Muller AE, Bergsund HB, Meneses-Echavez JF, Berg RC
- Analyzed 75 RCTs from Cochrane reviews using RoB 2.0
- Human-ChatGPT agreement was 50.7% overall, significantly above the 37.1% expected by chance
- Strongest domain: randomization process (78.7% agreement, kappa = 0.537)
- ChatGPT intrarater reliability: 74.7% agreement (kappa = 0.621), showing consistent self-agreement
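The kappa values above correct raw agreement for what two raters would agree on by chance alone. A minimal sketch of Cohen's kappa, using illustrative RoB-style labels rather than data from the study:

```python
# Cohen's kappa: chance-corrected agreement between two raters.
# The example labels are illustrative, not taken from Rose et al.
def cohens_kappa(labels_a: list[str], labels_b: list[str]) -> float:
    """Kappa = (p_o - p_e) / (1 - p_e), where p_o is observed agreement
    and p_e is the agreement expected by chance, estimated from each
    rater's marginal label frequencies."""
    n = len(labels_a)
    categories = set(labels_a) | set(labels_b)
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    p_e = sum(
        (labels_a.count(c) / n) * (labels_b.count(c) / n) for c in categories
    )
    return (p_o - p_e) / (1 - p_e)

# Two raters agreeing on 4 of 6 hypothetical RoB judgments:
human = ["low", "low", "low", "some", "some", "high"]
model = ["low", "low", "some", "some", "high", "high"]
kappa = cohens_kappa(human, model)  # 4/6 observed agreement corrects to 0.5
```

This is why a kappa of 0.62 alongside 74.7% raw agreement is a meaningful result: the correction strips out agreement that follows from both raters favoring the same labels.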
End-to-End Automation
The most ambitious studies evaluate LLMs across the entire systematic review pipeline — from search to analysis. The results suggest that AI-assisted workflows can compress months of work into days while maintaining or improving quality.
Automation of Systematic Reviews with Large Language Models
Cao L, Arora A et al.
- otto-SR achieved 96.7% sensitivity and 97.9% specificity in screening (vs. human: 81.7% sensitivity, 98.1% specificity)
- Data extraction accuracy of 93.1% (vs. human: 79.7%)
- Reproduced and updated 12 Cochrane reviews in 2 days — approximately 12 work-years of traditional effort
- High interrater reliability for risk of bias: ROB2 (0.98), Newcastle-Ottawa (0.95), QUADAS-2 (0.74)
What This Means for Your Reviews
The evidence is clear: LLMs are not a replacement for human judgment, but they are a powerful accelerant. The strongest results come from multi-model ensemble approaches — using multiple independent AI models with human oversight — which is exactly how ReviewBrewery works.
ReviewBrewery uses a three-model ensemble (GPT-4o, Claude Sonnet, and Gemini Flash) with tiebreaker resolution and full human-in-the-loop control. Every AI decision can be reviewed, overridden, and audited. This approach is directly informed by the research above.
Multi-model ensemble: why it matters
Single-model approaches are vulnerable to systematic blind spots. By screening each paper with three independent models and using a tiebreaker for conflicts, ensemble methods achieve higher sensitivity than any single model alone — reducing the risk of missing relevant studies.
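The voting scheme described above can be sketched in a few lines. This is a minimal illustration of majority voting with a tiebreaker, not ReviewBrewery's actual implementation; the function name, model labels, and the include-leaning tiebreaker rule are all assumptions:

```python
from collections import Counter

def ensemble_screen(votes: dict[str, str], tiebreaker: str = "include") -> str:
    """Majority vote over per-model include/exclude decisions.
    With three models a strict tie is impossible, but the tiebreaker
    keeps the rule well defined if a model abstains or errors out."""
    tally = Counter(votes.values())
    (top, top_n), *rest = tally.most_common()
    if rest and rest[0][1] == top_n:  # tie between the leading decisions
        return tiebreaker  # err toward inclusion to protect sensitivity
    return top

# Two of three hypothetical models vote to include, so the paper advances
# to full-text review.
decision = ensemble_screen(
    {"model_a": "include", "model_b": "include", "model_c": "exclude"}
)
```

Defaulting the tiebreaker to "include" reflects the asymmetry in screening: a wrongly included paper costs a few minutes at full-text review, while a wrongly excluded paper is lost from the evidence base.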
We are committed to building on peer-reviewed evidence and contributing to the research ourselves. As new studies are published, we will update this page to reflect the latest findings.
Try ReviewBrewery free