Overview
Systematic reviews are the gold standard of evidence synthesis — but they take an average of 12-18 months to complete. The most time-consuming stages are title-abstract screening, full-text screening, data extraction, and risk of bias assessment.
Since 2023, a growing body of research has evaluated whether large language models (LLMs) like GPT-4, Claude, and Gemini can reliably automate these stages. The findings are encouraging: for screening and data extraction, LLMs achieve high sensitivity and accuracy, often matching or exceeding human performance, especially when multiple models are used in ensemble. Agreement with human raters is more moderate for risk of bias assessment.
Below we summarize 8 peer-reviewed studies spanning screening, data extraction, risk of bias assessment, and end-to-end automation.
96.7%
Screening sensitivity
Cao et al. 2025
93.1%
Extraction accuracy
Cao et al. 2025
92.4%
Data point accuracy
Jensen et al. 2025
12 reviews in 2 days
vs. ~12 work-years
Cao et al. 2025
Screening Studies
Title-abstract and full-text screening is the most studied application of LLMs in systematic reviews. Across multiple independent studies, GPT-4-class models achieve accuracy of 90% or higher, outperform traditional ML screening tools, and can cut manual screening time by orders of magnitude.
Evaluating the Efficacy of Large Language Models for Systematic Review and Meta-Analysis Screening
Luo R, Sastimoglu Z, Faisal AI, Deen MJ
- Evaluated GPT-3.5 Turbo across 24,534 studies in three systematic reviews
- Title-abstract screening accuracy ranged from 80% to 95%
- Full-text screening showed strongest performance gains, with improved sensitivity and Matthews Correlation Coefficient
- Demonstrated reduction from year-long review timelines to hours while maintaining accuracy
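The Matthews Correlation Coefficient mentioned above balances all four cells of the screening confusion matrix, which matters because relevant papers are usually rare. A minimal sketch, using illustrative counts rather than figures from the study:

```python
# Matthews Correlation Coefficient (MCC) for binary screening decisions.
# The counts below are illustrative, not data from Luo et al.
def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)).
    Ranges from -1 to +1 and stays informative when included
    papers are rare, unlike raw accuracy."""
    num = tp * tn - fp * fn
    den = ((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)) ** 0.5
    return num / den if den else 0.0

# A screener that keeps 90 of 100 relevant papers (10 false negatives)
# and wrongly includes 50 of 900 irrelevant ones scores about 0.73.
example = mcc(tp=90, tn=850, fp=50, fn=10)
```

Note that a screener could reach 90% accuracy on this corpus by excluding everything; MCC would score that strategy at zero, which is why full-text screening studies report it alongside sensitivity.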
Transforming Literature Screening: The Emerging Role of Large Language Models in Systematic Reviews
Delgado-Chaves FM et al.
- Evaluated 18 different LLMs across three systematic reviews
- LLMs enable substantial reduction in manual screening workload
- Performance depends on the interplay between inclusion/exclusion criteria and the specific LLM
- Refining criteria formulation with LLM support improves screening performance
Evaluating the Effectiveness of Large Language Models in Abstract Screening: A Comparative Analysis
Li M, Sun J, Tan X
- Compared GPT-4, GPT-3.5, Gemini, Llama, and Claude across three benchmark datasets
- GPT-4 achieved balanced sensitivity and specificity with accuracy consistently at or above 90%
- LLMs outperformed traditional ML tools (ASReview, Abstrackr) in sensitivity and AUC
- LLMs show promise as autonomous AI reviewers or collaborative partners with human experts
Can Large Language Models Replace Humans in Systematic Reviews? Evaluating GPT-4's Efficacy in Screening and Extracting Data
Khraisha Q et al.
- GPT-4 achieved human-level accuracy in full-text screening with highly reliable prompts
- Cross-language screening capability validated across peer-reviewed and grey literature
- Performance at title-abstract level was affected by chance agreement and dataset imbalance
- Full-text screening showed more robust, 'human-like' performance levels
Data Extraction
Data extraction — pulling study characteristics, outcomes, and population details from full-text papers — is traditionally one of the most labor-intensive phases. Recent studies show LLMs can serve as reliable second raters with accuracy above 88%.
ChatGPT-4o Can Serve as the Second Rater for Data Extraction in Systematic Reviews
Jensen MM, Danielsen MB, Riis J, Kristjansen KA, Andersen S, Okubo Y, Jorgensen MG
- GPT-4o extracted 484 data points across 11 papers with 92.4% accuracy (95% CI: 89.5-94.5%)
- Fabricated data appeared in only 5.2% of extractions (95% CI: 3.4-7.4%)
- Reproducibility between two sessions was 94.1%
- Qualified as a reliable second reviewer for systematic review data extraction
Novel AI Applications in Systematic Review: GPT-4 Assisted Data Extraction, Analysis, Review of Bias
Oami T et al.
- GPT-4 showed 88.6% concordance with original human review for data extraction
- Less than 5% of discrepancies were due to inaccuracies or omissions
- Four specialized GPT-4 models developed for study characteristics, outcomes, bias domains, and RoB evaluation
- Risk of bias assessment showed fair-to-moderate intra-rater agreement (ICC = 0.518)
Risk of Bias Assessment
Risk of bias assessment requires nuanced judgment about study design and conduct. LLMs show moderate agreement with human reviewers — strongest for identifying low-risk studies and weaker for detecting high-risk domains. AI-assisted RoB works best as a starting point for human review.
Using a Large Language Model (ChatGPT-4o) to Assess the Risk of Bias in Randomized Controlled Trials of Medical Interventions
Rose CJ, Bidonde J, Ringsten M, Glanville J, Potrebny T, Cooper C, Muller AE, Bergsund HB, Meneses-Echavez JF, Berg RC
- Analyzed 75 RCTs from Cochrane reviews using RoB 2.0
- Human-ChatGPT agreement was 50.7% overall, significantly above the 37.1% expected by chance
- Strongest domain: randomization process (78.7% agreement, kappa = 0.537)
- ChatGPT intrarater reliability: 74.7% agreement (kappa = 0.621), showing consistent self-agreement
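The kappa values above correct raw agreement for what two raters would agree on by chance alone. A minimal sketch of Cohen's kappa, using illustrative RoB-style labels rather than data from the study:

```python
# Cohen's kappa: chance-corrected agreement between two raters.
# The example labels are illustrative, not taken from Rose et al.
def cohens_kappa(labels_a: list[str], labels_b: list[str]) -> float:
    """Kappa = (p_o - p_e) / (1 - p_e), where p_o is observed agreement
    and p_e is the agreement expected by chance, estimated from each
    rater's marginal label frequencies."""
    n = len(labels_a)
    categories = set(labels_a) | set(labels_b)
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    p_e = sum(
        (labels_a.count(c) / n) * (labels_b.count(c) / n) for c in categories
    )
    return (p_o - p_e) / (1 - p_e)

# Two raters agreeing on 4 of 6 hypothetical RoB judgments:
human = ["low", "low", "low", "some", "some", "high"]
model = ["low", "low", "some", "some", "high", "high"]
kappa = cohens_kappa(human, model)  # 4/6 observed agreement corrects to 0.5
```

This is why a kappa of 0.62 alongside 74.7% raw agreement is a meaningful result: the correction strips out agreement that follows from both raters favoring the same labels.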
End-to-End Automation
The most ambitious studies evaluate LLMs across the entire systematic review pipeline — from search to analysis. The results suggest that AI-assisted workflows can compress months of work into days while maintaining or improving quality.
Automation of Systematic Reviews with Large Language Models
Cao L, Arora A et al.
- otto-SR achieved 96.7% sensitivity and 97.9% specificity in screening (vs. human: 81.7% sensitivity, 98.1% specificity)
- Data extraction accuracy of 93.1% (vs. human: 79.7%)
- Reproduced and updated 12 Cochrane reviews in 2 days — approximately 12 work-years of traditional effort
- High interrater reliability for risk of bias: ROB2 (0.98), Newcastle-Ottawa (0.95), QUADAS-2 (0.74)
What This Means for Your Reviews
The evidence is clear: LLMs are not a replacement for human judgment, but they are a powerful accelerant. The strongest results come from multi-model ensemble approaches — using multiple independent AI models with human oversight — which is exactly how ReviewBrewery works.
ReviewBrewery uses a three-model ensemble (GPT-4o, Claude Sonnet, and Gemini Flash) with tiebreaker resolution and full human-in-the-loop control. Every AI decision can be reviewed, overridden, and audited. This approach is directly informed by the research above.
Multi-model ensemble: why it matters
Single-model approaches are vulnerable to systematic blind spots. By screening each paper with three independent models and using a tiebreaker for conflicts, ensemble methods achieve higher sensitivity than any single model alone — reducing the risk of missing relevant studies.
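The voting scheme described above can be sketched in a few lines. This is a minimal illustration of majority voting with a tiebreaker, not ReviewBrewery's actual implementation; the function name, model labels, and the include-leaning tiebreaker rule are all assumptions:

```python
from collections import Counter

def ensemble_screen(votes: dict[str, str], tiebreaker: str = "include") -> str:
    """Majority vote over per-model include/exclude decisions.
    With three models a strict tie is impossible, but the tiebreaker
    keeps the rule well defined if a model abstains or errors out."""
    tally = Counter(votes.values())
    (top, top_n), *rest = tally.most_common()
    if rest and rest[0][1] == top_n:  # tie between the leading decisions
        return tiebreaker  # err toward inclusion to protect sensitivity
    return top

# Two of three hypothetical models vote to include, so the paper advances
# to full-text review.
decision = ensemble_screen(
    {"model_a": "include", "model_b": "include", "model_c": "exclude"}
)
```

Defaulting the tiebreaker to "include" reflects the asymmetry in screening: a wrongly included paper costs a few minutes at full-text review, while a wrongly excluded paper is lost from the evidence base.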
We are committed to building on peer-reviewed evidence and contributing to the research ourselves. As new studies are published, we will update this page to reflect the latest findings.
Try ReviewBrewery free