The Evidence Behind AI-Assisted Reviews

Can large language models reliably screen, extract, and assess systematic review literature? Here is what the research says.

Try ReviewBrewery free

Overview

Systematic reviews are the gold standard of evidence synthesis — but they typically take 12 to 18 months to complete. The most time-consuming stages are title-abstract screening, full-text screening, data extraction, and risk of bias assessment.

Since 2023, a growing body of research has evaluated whether large language models (LLMs) like GPT-4, Claude, and Gemini can reliably automate these stages. The findings are encouraging: LLMs consistently achieve high sensitivity and accuracy, often matching or exceeding human performance — especially when multiple models are used in ensemble.

Below we summarize eight studies (six peer-reviewed, two preprints) spanning screening, data extraction, risk of bias assessment, and end-to-end automation.

Key results at a glance:

  • 96.7% screening sensitivity (Cao et al. 2025)
  • 93.1% extraction accuracy (Cao et al. 2025)
  • 92.4% data point accuracy (Jensen et al. 2025)
  • 12 reviews in 2 days, vs. ~12 work-years of traditional effort (Cao et al. 2025)

Screening Studies

Title-abstract and full-text screening is the most studied application of LLMs in systematic reviews. Across multiple independent studies, GPT-4 class models achieve 90%+ accuracy, outperform traditional ML screening tools, and can reduce manual screening workload by orders of magnitude.
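
Concretely, the setup these papers evaluate is simple: give the model the review's eligibility criteria plus each title and abstract, and ask for an include/exclude decision. A minimal sketch, assuming a hypothetical `call_llm` hook standing in for whichever model provider you use:

```python
# Minimal sketch of LLM title-abstract screening as evaluated in the
# studies above. `call_llm` is a hypothetical stand-in for any
# chat-completion API (GPT-4, Claude, Gemini, ...).

SCREENING_PROMPT = """You are screening studies for a systematic review.

Inclusion/exclusion criteria:
{criteria}

Title: {title}
Abstract: {abstract}

Answer with exactly one word: INCLUDE or EXCLUDE."""


def call_llm(prompt: str) -> str:
    """Hypothetical hook: send the prompt to a chat model, return its reply."""
    raise NotImplementedError("wire this up to your model provider")


def screen_abstract(criteria: str, title: str, abstract: str) -> bool:
    """Return True if the model votes to include the study."""
    reply = call_llm(SCREENING_PROMPT.format(
        criteria=criteria, title=title, abstract=abstract))
    return reply.strip().upper().startswith("INCLUDE")
```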

Screening · medRxiv, 2024

Evaluating the Efficacy of Large Language Models for Systematic Review and Meta-Analysis Screening

Luo R, Sastimoglu Z, Faisal AI, Deen MJ

  • Evaluated GPT-3.5 Turbo across 24,534 studies in three systematic reviews
  • Title-abstract screening accuracy ranged from 80% to 95%
  • Full-text screening showed strongest performance gains, with improved sensitivity and Matthews Correlation Coefficient
  • Demonstrated reduction from year-long review timelines to hours while maintaining accuracy

Screening · PNAS, 2025

Transforming Literature Screening: The Emerging Role of Large Language Models in Systematic Reviews

Delgado-Chaves FM et al.

  • Evaluated 18 different LLMs across three systematic reviews
  • LLMs enable substantial reduction in manual screening workload
  • Performance depends on the interplay between inclusion/exclusion criteria and the specific LLM
  • Refining criteria formulation with LLM support improves screening performance

Screening · Systematic Reviews (Springer), 2024

Evaluating the Effectiveness of Large Language Models in Abstract Screening: A Comparative Analysis

Li M, Sun J, Tan X

  • Compared GPT-4, GPT-3.5, Gemini, Llama, and Claude across three benchmark datasets
  • GPT-4 achieved balanced sensitivity and specificity with accuracy consistently at or above 90%
  • LLMs outperformed traditional ML tools (ASReview, Abstrackr) in sensitivity and AUC
  • LLMs show promise as autonomous AI reviewers or collaborative partners with human experts

Screening · Research Synthesis Methods (Wiley), 2024

Can Large Language Models Replace Humans in Systematic Reviews? Evaluating GPT-4's Efficacy in Screening and Extracting Data

Khraisha Q et al.

  • GPT-4 achieved human-level accuracy in full-text screening with highly reliable prompts
  • Cross-language screening capability validated across peer-reviewed and grey literature
  • Performance at title-abstract level was affected by chance agreement and dataset imbalance
  • Full-text screening showed more robust, 'human-like' performance levels

Data Extraction

Data extraction — pulling study characteristics, outcomes, and population details from full-text papers — is traditionally one of the most labor-intensive phases. Recent studies show LLMs can serve as reliable second raters with accuracy above 88%.
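
A typical evaluation prompts the model to return the target fields as structured output, which is then compared against the human extraction. A minimal sketch along the same lines (again, `call_llm` is a hypothetical provider hook, and the field names are illustrative):

```python
import json

EXTRACTION_PROMPT = """Extract the following fields from the study below.
Respond with only a JSON object with keys: sample_size, population,
intervention, primary_outcome. Use null for anything not reported.

Study text:
{full_text}"""


def call_llm(prompt: str) -> str:
    """Hypothetical hook: send the prompt to a chat model, return its reply."""
    raise NotImplementedError("wire this up to your model provider")


def extract_study_data(full_text: str) -> dict:
    """Ask the model for structured fields and parse the JSON reply."""
    reply = call_llm(EXTRACTION_PROMPT.format(full_text=full_text))
    # Real replies may need fence-stripping and schema validation
    # before they can be trusted as a "second rater".
    return json.loads(reply)
```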

Data Extraction · PLOS ONE, 2025

ChatGPT-4o Can Serve as the Second Rater for Data Extraction in Systematic Reviews

Jensen MM, Danielsen MB, Riis J, Kristjansen KA, Andersen S, Okubo Y, Jorgensen MG

  • GPT-4o extracted 484 data points across 11 papers with 92.4% accuracy (95% CI: 89.5-94.5%)
  • Produced false data in only 5.2% of cases (95% CI: 3.4-7.4%)
  • Reproducibility between two sessions was 94.1%
  • Qualified as a reliable second reviewer for systematic review data extraction

Data Extraction · BMJ Evidence-Based Medicine, 2025

Novel AI Applications in Systematic Review: GPT-4 Assisted Data Extraction, Analysis, Review of Bias

Oami T et al.

  • GPT-4 showed 88.6% concordance with original human review for data extraction
  • Less than 5% of discrepancies were due to inaccuracies or omissions
  • Four specialized GPT-4 models developed for study characteristics, outcomes, bias domains, and RoB evaluation
  • Risk of bias assessment showed fair-moderate intra-rater agreement (ICC = 0.518)

Risk of Bias Assessment

Risk of bias assessment requires nuanced judgment about study design and conduct. LLMs show moderate agreement with human reviewers — strongest for identifying low-risk studies and weaker for detecting high-risk domains. AI-assisted RoB works best as a starting point for human review.

Risk of Bias · Cochrane Evidence Synthesis and Methods, 2025

Using a Large Language Model (ChatGPT-4o) to Assess the Risk of Bias in Randomized Controlled Trials of Medical Interventions

Rose CJ, Bidonde J, Ringsten M, Glanville J, Potrebny T, Cooper C, Muller AE, Bergsund HB, Meneses-Echavez JF, Berg RC

  • Analyzed 75 RCTs from Cochrane reviews using RoB 2.0
  • Human-ChatGPT agreement: 50.7% overall, significantly above the 37.1% expected by chance
  • Strongest domain: randomization process (78.7% agreement, kappa = 0.537)
  • ChatGPT intrarater reliability: 74.7% agreement (kappa = 0.621), showing consistent self-agreement
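
For context on the kappa values above: Cohen's kappa rescales observed agreement by the agreement expected from chance alone. Applying the formula to the two overall percentages reported by Rose et al. gives a kappa of roughly 0.22 (our arithmetic from the reported figures, not a number quoted from the paper):

```python
def cohens_kappa(observed: float, chance: float) -> float:
    """Cohen's kappa: excess agreement over chance, as a fraction of
    the maximum possible excess."""
    return (observed - chance) / (1.0 - chance)

# Overall human-ChatGPT agreement vs. chance agreement (Rose et al.):
print(round(cohens_kappa(0.507, 0.371), 3))  # 0.216 -> "fair" agreement
```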

End-to-End Automation

The most ambitious studies evaluate LLMs across the entire systematic review pipeline — from search to analysis. The results suggest that AI-assisted workflows can compress months of work into days while maintaining or improving quality.

End-to-End · medRxiv, 2025

Automation of Systematic Reviews with Large Language Models

Cao L, Arora A et al.

  • otto-SR achieved 96.7% sensitivity and 97.9% specificity in screening (vs. human: 81.7% sensitivity, 98.1% specificity)
  • Data extraction accuracy of 93.1% (vs. human: 79.7%)
  • Reproduced and updated 12 Cochrane reviews in 2 days — approximately 12 work-years of traditional effort
  • High interrater reliability for risk of bias: RoB 2 (0.98), Newcastle-Ottawa (0.95), QUADAS-2 (0.74)

What This Means for Your Reviews

The evidence is clear: LLMs are not a replacement for human judgment, but they are a powerful accelerant. The strongest results come from multi-model ensemble approaches — using multiple independent AI models with human oversight — which is exactly how ReviewBrewery works.

ReviewBrewery uses a three-model ensemble (GPT-4o, Claude Sonnet, and Gemini Flash) with tiebreaker resolution and full human-in-the-loop control. Every AI decision can be reviewed, overridden, and audited. This approach is directly informed by the research above.

Multi-model ensemble: why it matters

Single-model approaches are vulnerable to systematic blind spots. By screening each paper with three independent models and using a tiebreaker for conflicts, ensemble methods achieve higher sensitivity than any single model alone — reducing the risk of missing relevant studies.
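
A minimal sketch of that voting logic, assuming each model exposes a simple include/exclude call (the names here are illustrative, not ReviewBrewery's actual API):

```python
from typing import Callable, Literal

Vote = Literal["include", "exclude"]
ScreeningModel = Callable[[str], Vote]  # abstract text in, one vote out


def ensemble_screen(abstract: str,
                    models: list[ScreeningModel]) -> tuple[Vote, bool]:
    """Collect one independent vote per model.

    Returns (majority decision, needs_human_review): any split vote is
    flagged so a human makes the final call.
    """
    votes = [model(abstract) for model in models]
    includes = votes.count("include")
    decision: Vote = "include" if 2 * includes > len(votes) else "exclude"
    unanimous = includes in (0, len(votes))
    return decision, not unanimous
```

Under a rule like this, a study is auto-excluded only when every model agrees to exclude it, so one model's blind spot cannot silently drop a relevant paper.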

We are committed to building on peer-reviewed evidence and contributing to the research ourselves. As new studies are published, we will update this page to reflect the latest findings.

Try ReviewBrewery free