Introduction

Evidence syntheses require independent reviewers to extract data and assess the risk of bias (ROB)1, resulting in a labor-intensive and time-consuming process2, particularly in complementary and alternative medicine (CAM) research3,4. CAM has gained prominence due to its efficacy and safety profiles, leading to increased adoption by both clinicians and patients5,6. However, the lack of high-quality evidence necessitates efficient syntheses to support clinical practice7. Challenges include CAM’s complex, discipline-specific terminology and multilingual literature, which complicate data extraction8.

Recent advances in generative artificial intelligence have produced powerful large language models (LLMs) capable of analyzing vast text corpora, capturing complex contexts, and adapting to specialized domains, making them suitable for evidence synthesis9. Preliminary studies suggest LLMs’ potential in systematic reviews and meta-analyses10,11,12,13. However, their application in CAM is limited due to difficulties in creating CAM-specific prompts, maintaining precision in terminology, and handling diverse languages and study designs14. Moreover, the potential of LLM-assisted methods, in which AI and human expertise work in tandem, remains largely unexplored. This study aimed to develop structured prompts for guiding LLMs in extracting both basic and CAM-specific data and assessing ROB in randomized controlled trials (RCTs) on CAM interventions. We compared LLM-only and LLM-assisted methods with conventional approaches, seeking to enhance efficiency and quality in data extraction and ROB assessment, ultimately supporting clinical practice and guidelines.

We randomly selected 107 RCTs (Supplementary Table 1) from 12 Cochrane reviews15,16,17,18,19,20,21,22,23,24,25,26, spanning 1979–2024, with 27.1% in English and 72.9% in Chinese; 44.9% were published post-2013. Studies focused on mind-body practices (41.1%), herbal decoctions (34.6%), and natural products (24.3%). Based on OCR recognizability, 94.4% of RCTs had higher recognizability (≥70% of text and data accurately detected), while 5.6% had lower recognizability (<70%).

Two LLMs—Claude-3.5-sonnet and Moonshot-v1-128k—were employed to extract data and assess ROB. Supplementary Notes 1 to 4 document all responses from the two models. Supplementary Tables 2 to 7 present the analyses of the LLM-only and LLM-assisted extractions and assessments.

Across the 107 RCTs, the two models produced 12,814 extractions. As shown in Fig. 1, Claude-3.5-sonnet showed higher overall accuracy (96.2%, 95% CI: 95.8–96.5%) than Moonshot-v1-128k (95.1%, 95% CI: 94.7–95.5%), with a statistically significant difference (RD: 1.1%, 95% CI: 0.6–1.6%; p < 0.001). Claude-3.5-sonnet outperformed Moonshot-v1-128k in the Baseline Characteristics domain, while both models had similar accuracy across other domains. For Moonshot-v1-128k, the highest correctness rate was in the Outcomes domain (97.6%), and the lowest was in the Methods domain (90.9%). Errors in Moonshot-v1-128k’s extractions often resulted from incorrectly labeling data as “Not reported.” Commonly missed information included start/end dates (44 RCTs), baseline balance descriptions (39), number analyzed (40), demographics (22), theoretical basis (21), treatment frequency (8), all outcomes in 3 RCTs, and specific outcome data in 45 RCTs. However, Moonshot-v1-128k successfully extracted CAM-specific data, such as traditional Chinese medicine terminology. The inter-model agreement rate between Claude and Moonshot-v1-128k was 93.8%, with 83.3% of Claude’s errors also present in Moonshot-v1-128k’s results.

Fig. 1: Comparison of accuracy between Moonshot-v1-128k and Claude-3.5-sonnet in extracting data and assessing ROB.

This figure compares the accuracy of two language models, Claude-3.5-sonnet and Moonshot-v1-128k, in data extraction and risk-of-bias (ROB) assessments across multiple domains in 107 RCTs. Claude-3.5-sonnet demonstrated higher overall accuracy in data extraction (96.2%, 95% CI: 95.8% to 96.5%) than Moonshot-v1-128k (95.1%, 95% CI: 94.7% to 95.5%), with a statistically significant difference of 1.1% (p < 0.001). In the ROB assessment, Claude-3.5-sonnet also achieved slightly higher accuracy (96.9% vs. 95.7%), though the difference was not statistically significant. The greatest difference in domain-specific accuracy was observed in the Baseline Characteristics domain for data extraction, where Claude-3.5-sonnet outperformed Moonshot-v1-128k.

Investigators refined Moonshot-v1-128k’s extractions, achieving a corrected accuracy of 97.9% (95% CI: 97.7–98.2%), higher than the expected 95.3% for conventional methods (RD: 2.6%, 95% CI: 2.2–3.1%; p < 0.001; Fig. 2). The RD between LLM-assisted and LLM-only extractions was 2.8% (95% CI: 2.4–3.2%; p < 0.001; Table 1). Accuracy improvements were most notable in the Methods domain (+7.4%) and the Data and Analysis domain (+6.0%). Subgroup analyses revealed that higher PDF recognizability positively impacted Moonshot-v1-128k’s accuracy (p interaction = 0.023) but had no significant effect on LLM-assisted accuracy (p interaction = 0.100). Claude-3.5-sonnet achieved higher accuracy when extracting data from English RCTs than from Chinese RCTs (p interaction < 0.001).

Fig. 2: Comparison of accuracy and efficiency of conventional, LLM-only, and LLM-assisted methods in extracting data and assessing ROB using Moonshot-v1-128k.

This figure presents a comparison of the accuracy (correct rate) and efficiency (time spent) of three methods for data extraction and risk of bias (ROB) assessment: conventional, LLM-only, and LLM-assisted. For data extractions, the conventional method had an estimated accuracy of 95.3% and took 86.9 min per RCT. The LLM-only method achieved an accuracy of 95.1% and took only 96 s per RCT, while the LLM-assisted method had the highest accuracy at 97.9% and took 14.7 min per RCT. For ROB assessments, the conventional method had an estimated accuracy of 90.0% and took 10.4 min per RCT. The LLM-only method achieved an accuracy of 95.7% and took only 42 s per RCT, while the LLM-assisted method had the highest accuracy at 97.3% and took 5.9 min per RCT. These results demonstrate that LLM-assisted methods can achieve higher accuracy than conventional methods while being substantially more efficient.

Table 1 Accuracy of LLM-only and LLM-assisted data extractions

Both models conducted 1,070 ROB assessments. As shown in Fig. 1, Claude achieved 96.9% accuracy (95% CI: 95.7–97.9%), slightly higher than Moonshot-v1-128k’s 95.7% (95% CI: 94.3–96.8%), though the difference was not statistically significant (RD: 1.2%, 95% CI: −0.4 to 2.8%). Moonshot-v1-128k’s lowest accuracy was in the Sequence generation domain (87.9%), while other domains ranged from 94.4% to 100.0%. Sensitivities in Selective outcome reporting and Other bias were relatively low (0.50 and 0.40), with corresponding F-scores of 0.67 and 0.44, but other domains had F-scores between 0.97 and 1.00. Of the 46 incorrect assessments, 62.1% were due to missing supporting information, while 37.9% involved correct data extraction but erroneous judgments. Cohen’s kappa values indicated substantial to almost perfect agreement in most domains, except for Selective outcome reporting (0.66) and Other bias (0.42), likely due to high true negative rates (>93%). The inter-model agreement between Claude-3.5-sonnet and Moonshot-v1-128k was almost perfect (Cohen’s kappa = 0.88), with 66.7% of Claude’s errors also present in Moonshot-v1-128k’s results.

After refinement based on Moonshot-v1-128k’s assessments, the mean correctness rate of LLM-assisted ROB assessments increased to 97.3% (95% CI: 96.1–98.2%), significantly surpassing the expected 90.0% accuracy of conventional methods (RD: 7.3%, 95% CI: 6.2–8.3%; p < 0.001; Fig. 2). The RD between LLM-assisted and LLM-only assessments was 1.6% (95% CI: 0.0–3.2%; p = 0.05; Table 2), suggesting that human review corrected some errors and modestly improved accuracy. The prevalence-adjusted bias-adjusted kappa (PABAK) among the four investigators was 0.88, indicating almost perfect agreement. The Sequence generation domain exhibited the greatest improvement in accuracy (+8.4%), and all errors in the Allocation sequence concealment domain were rectified, achieving a 100% correctness rate. Subgroup analysis revealed that Claude-3.5-sonnet achieved significantly higher accuracy in assessing ROB for English-language RCTs than for Chinese-language RCTs (p interaction < 0.001). Conversely, LLM-assisted assessments showed higher accuracy for RCTs published in Chinese (p interaction = 0.023), suggesting that the investigators’ native language influenced assessment accuracy.

Table 2 Accuracy of LLM-only and LLM-assisted risk-of-bias assessments

For both data extraction and ROB assessment, the LLMs demonstrated substantial time savings compared to conventional methods. Data extraction took an average of 96 s per RCT with Moonshot-v1-128k and 82 s with Claude-3.5-sonnet, while refinement extended Moonshot-v1-128k-assisted extractions to 14.7 min per RCT, still much faster than the 86.9 min required by traditional approaches. Similarly, ROB assessments averaged 42 s per RCT with Moonshot-v1-128k and 41 s with Claude, with Moonshot-v1-128k-assisted assessments, including refinement, taking just 5.9 min per RCT compared to 10.4 min for conventional methods.

Overall, both Claude-3.5-sonnet and Moonshot-v1-128k demonstrated high accuracy, with LLM-assisted methods significantly outperforming conventional approaches in both accuracy and efficiency (Fig. 2). Claude-3.5-sonnet achieved slightly higher accuracy than Moonshot-v1-128k for data extraction (96.2% vs. 95.1%) and ROB assessment (96.9% vs. 95.7%), though the difference was statistically significant only for data extraction. Errors in both models often stemmed from failing to identify reported data, such as start/end dates and participant numbers, rather than misinterpreting extracted information. ROB assessment errors were most frequent in the Sequence generation domain, where inconsistent judgments arose despite correct justifications. For example, Moonshot-v1-128k accurately identified the randomization method in Mao, 2014 (Supplementary Note 2) but incorrectly classified it, suggesting challenges in applying rule-based criteria.

LLM-assisted methods proved particularly effective in addressing these issues. Human reviewers identified and corrected common error patterns, significantly improving accuracy, especially in the Methods domain (+7.4%) for data extraction and the Sequence generation domain (+8.4%) for ROB assessment. By addressing these recurring issues, reviewers not only enhanced the reliability of individual assessments but also provided insights into systematic weaknesses in LLM outputs. This process highlighted the critical role of human expertise in working with LLMs, as reviewers could identify specific areas needing improvement and ensure the conclusions were accurate and reliable.

Efficiency gains were substantial. For data extraction, time per RCT decreased from 86.9 min to 14.7 min, while ROB assessment times dropped from 10.4 min to 5.9 min. The combination of time savings and improved accuracy significantly enhances evidence synthesis, particularly in complex domains like CAM that demand specialized knowledge.

Subgroup analyses revealed that higher PDF recognizability improved Moonshot-v1-128k’s extraction accuracy (p interaction = 0.023), while Claude-3.5-sonnet extracted data more accurately from English-language RCTs than from Chinese-language ones (p interaction < 0.001). These subgroup differences may be attributed to variations in the models’ training datasets, underscoring the need to consider document quality and language characteristics when applying LLMs to evidence synthesis.

This study aligns with previous research demonstrating high accuracy for LLMs like Claude 2 and GPT-4 in data extraction when guided by structured prompts27,28,29. Our findings build on earlier work by optimizing prompts to enhance domain-specific judgment logic, step-wise reasoning, and few-shot learning, leading to significantly higher accuracy in ROB assessments compared to prior studies30. Additionally, incorporating confidence estimates and justifications for each domain allowed investigators to more effectively identify errors.

Strengths of this study include validated prompts, a diverse sample of RCTs, and the inclusion of less experienced reviewers to estimate practical effects. Limitations include potential language-dependent biases, as all reviewers were native Chinese speakers, and reliance on benchmark estimates for conventional methods, which may not fully reflect current practices across diverse settings. The RCTs included in our study had first authors affiliated with institutions in 12 countries and regions, but 83 studies (77.6%) were from mainland China. Although we analyzed the publication language of the RCTs, the primary language background of the original researchers may have been predominantly Chinese, which may limit the generalizability of these findings. Finally, although this study demonstrated the feasibility and effectiveness of LLM-assisted methods using Claude-3.5-sonnet, future research should explore other high-ranking models to ensure broader generalizability. Incorporating models from authoritative rankings or widely recognized sources would help confirm whether the approach is robust across different architectures and training datasets, enhancing its applicability in diverse research contexts.

Methods

This study was conducted between November 3, 2023, and September 30, 2024, in adherence to the AAPOR reporting guideline31. The Medical Ethics Review Committee of Lanzhou University’s School of Public Health exempted the study from ethical approval because all data originated from published studies.

This study employed two LLMs: Moonshot-v1-128k, an open-access model developed by Moonshot AI32, and Claude-3.5-sonnet, developed by Anthropic33. We hypothesized that a two-step process, comprising (1) data extraction and ROB assessment by LLMs guided by structured prompts, followed by (2) verification and refinement by a single researcher, would yield results non-inferior in accuracy and superior in efficiency to conventional methods requiring two independent researchers.

Tools and prompts for data extraction and risk-of-bias assessment

We drafted prompts for data extraction and ROB assessment using a few-shot learning strategy. For data extraction, we followed the Cochrane Handbook (Supplementary Table 8)34, aiming to include general core items in the following domains: Methods, Participants, Intervention groups, Outcomes, Data and analysis, and Others. Additionally, we extended these prompts to include CAM-specific elements following Xia et al.35, such as Chinese diagnostic patterns, therapeutic principles and methods, herbal formula compositions, non-pharmacological therapy details, and theoretical basis.

For ROB assessment, we drafted a prompt following a modified version of the Cochrane 1.0 instrument (ROB 1; Supplementary Note 5)36. It comprises ten domains: Sequence generation, Allocation sequence concealment, Blinding (assessed separately for patients, healthcare providers, data collectors, outcome assessors, and data analysts), Missing outcome data, Selective outcome reporting, and Other bias. Each domain can be rated as “definitely or probably yes” (low risk of bias) or “probably or definitely no” (high risk of bias). We refined the prompts through iterative pilot tests on five selected RCTs until both LLMs consistently produced fully correct extractions and ROB assessments. Supplementary Notes 6 and 7 present the finalized prompts, which consist of five core components: instruction and role setting, general guidelines, specific guidelines, output formatting, and supplementary guidelines.
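
To make the prompt structure concrete, the following is a minimal, hypothetical skeleton of the five core components; the wording is illustrative only and greatly abridged, and the finalized prompts appear in Supplementary Notes 6 and 7.

```r
# Hypothetical, abridged skeleton of the five prompt components;
# not the wording actually used (see Supplementary Notes 6 and 7).
prompt_skeleton <- paste(
  "## Instruction and role setting",
  "You are a systematic-review methodologist extracting data from an RCT of a CAM intervention.",
  "## General guidelines",
  "Extract only information reported in the article; label missing items as 'Not reported'.",
  "## Specific guidelines",
  "For herbal formulas, list each component and dose; record Chinese diagnostic patterns and therapeutic principles verbatim.",
  "## Output formatting",
  "Return one row per item: item | extracted value | supporting quotation | confidence.",
  "## Supplementary guidelines",
  "Keep traditional terminology in the original language, adding an English gloss where helpful.",
  sep = "\n"
)
cat(prompt_skeleton)
```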

Selection of samples and investigators

We estimated the accuracy of human reviewers for data extraction to be 95.3%, with a mean time per RCT of 86.9 min (inclusive of time for data extraction, verification, and consensus) according to Buscemi et al.37. The accuracy of human reviewers using the ROB 1 tool for assessment was assumed to be 90% according to Arno et al.38, with a mean time per RCT of 10.4 min. Assuming an expected accuracy of 95% for both LLM-assisted extractions and ROB assessments, and a non-inferiority threshold of 0.10, we determined a sample size of 104 RCTs, based on a 2.5% type I error rate and 90% power39. We searched the Cochrane Database of Systematic Reviews using MeSH terms for complementary therapies and keywords related to randomized controlled trials to identify Cochrane reviews (Supplementary Box 1). We included Cochrane reviews that included at least 5 RCTs with evidence synthesis, provided detailed data extraction, and performed ROB assessments, with no language restrictions. Retracted studies or those with unavailable full texts were excluded. Using Excel software, we generated random numbers and selected reviews from the eligible pool. For each selected review, we employed a stratified random sampling approach. Reviews including 10 or fewer RCTs had all their studies included, while for those with more than 10 RCTs, we randomly selected 10 using an Excel-generated randomization sequence. This process continued sequentially across reviews until we attained our target sample of 104 RCTs.
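
As a transparency aid, the reported sample size can be reproduced with a standard normal-approximation formula for non-inferiority of two proportions; the sketch below is a reconstruction under the stated parameters, not the original calculation code.

```r
# Reconstruction of the sample-size calculation (normal approximation for
# non-inferiority of two proportions); parameter values taken from the text.
p_llm  <- 0.95    # expected accuracy of LLM-assisted extraction
p_conv <- 0.953   # benchmark accuracy of conventional data extraction
margin <- 0.10    # non-inferiority threshold
alpha  <- 0.025   # one-sided type I error rate
power  <- 0.90

z <- qnorm(1 - alpha) + qnorm(power)
n <- z^2 * (p_llm * (1 - p_llm) + p_conv * (1 - p_conv)) /
     (p_llm - p_conv + margin)^2
ceiling(n)  # 104 RCTs
```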

Four Chinese-speaking investigators with limited experience (1 to 1.5 years in evidence synthesis) participated in the manual reviews. Before starting, they underwent one month of standardized training to ensure consistency.

Study procedures

Figure 3 illustrates the main study process, led and supervised by a senior researcher (LG).

Fig. 3: Flow diagram of the study process.

The study process began with the random selection of 107 randomized controlled trials (RCTs) on complementary and alternative medicine interventions. An investigator checked the PDF files of these RCTs to ensure their eligibility and completeness. Two large language models (LLMs)—Moonshot-v1-128k and Claude-3.5-sonnet—were employed, with tailored prompts iteratively tested and refined until both LLMs could achieve continuous, correct extractions and assessments across five consecutive RCTs. A single investigator then independently conducted extractions and risk-of-bias (ROB) assessments for each RCT using these finalized prompts. Simultaneously, four reviewers, after one month of training, used the lower-accuracy LLM outputs as initial references. Each reviewer independently checked and, if necessary, modified the LLM results to create four separate LLM-assisted assessments and extractions per RCT. To validate these results, two methodologists compared the LLM-only, LLM-assisted, and conventional manual assessments and reached a consensus to establish a benchmark reference. The study was conducted between November 2023 and September 2024, adhering to AAPOR guidelines and exempt from ethics approval by Lanzhou University’s Medical Ethics Review Committee due to its use of publicly available data.

Independent extraction and assessment by the large language models

Two investigators (HHL and LYH) conducted extractions and assessments for all RCTs using Moonshot-v1-128k in mainland China from March 5 to 15, 2024, and Claude-3.5-sonnet in Canada from September 15 to 30, 2024. For each RCT, the investigator checked the recognizability of the portable document format (PDF) file, using optical character recognition (OCR) software to convert it to text and quantifying the proportion of unrecognizable text and data. The investigator uploaded each PDF together with the corresponding prompt to both LLMs and then exported their complete outputs. We accessed both models through their respective APIs, set the temperature parameter to 0 to ensure the models strictly followed the prompt, used a wired 100 Mbps network, and reran any queries whose outputs were invalidated by unrelated issues, such as network or server failures.
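
For illustration, the sketch below shows the kind of API request described above, written in R with the httr package. The endpoint version, model identifier, file names, and the use of OCR-derived text in place of the uploaded PDF are assumptions made for this example rather than the exact pipeline used in the study.

```r
library(httr)

# Illustrative only: model identifier, API version, and input files are
# placeholders; the study supplied the PDFs directly via the vendors' APIs.
prompt       <- paste(readLines("rob_assessment_prompt.txt"), collapse = "\n")
article_text <- paste(readLines("rct_ocr_text.txt"), collapse = "\n")

resp <- POST(
  "https://api.anthropic.com/v1/messages",
  add_headers(
    "x-api-key"         = Sys.getenv("ANTHROPIC_API_KEY"),
    "anthropic-version" = "2023-06-01"
  ),
  body = list(
    model       = "claude-3-5-sonnet-20240620",
    max_tokens  = 4096,
    temperature = 0,  # minimizes randomness so the model strictly follows the structured prompt
    messages    = list(list(role = "user",
                            content = paste(prompt, article_text, sep = "\n\n")))
  ),
  encode = "json"
)
cat(content(resp, as = "parsed")$content[[1]]$text)
```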

Extraction and assessment by reviewers with large language model assistance

Based on the outputs from the LLM with relatively lower accuracy (either Moonshot-v1-128k or Claude-3.5-sonnet), four reviewers (CYB, WLZ, JYL, and DNX) independently extracted data and assessed ROB for all RCTs. The reviewers first examined the LLM-derived results and then either agreed with them or modified them to form four LLM-assisted extraction and assessment results. The reviewers had the option to consult the original articles at their discretion throughout this process. The reviewers recorded the total time spent on each RCT, which included the time taken by the LLMs to generate the initial results plus the time spent by the reviewers to verify and modify those results.

Outcome measures

We evaluated the performance of both LLM-only and LLM-assisted methods in terms of accuracy and efficiency, comparing them against conventional manual methods as well as assessing the accuracy improvement from LLM-only to LLM-assisted methods. Two methodologists (HHL, LG) independently compared the results from three sources: the direct output from the LLMs (LLM-only), the results after human review and modification of the LLM output (LLM-assisted), and their own manual extraction and assessment. Through careful comparison and discussion of these three sets of results, the methodologists collaboratively established a reference standard for each extraction and assessment. Any discrepancies were resolved through consensus, ensuring a robust and accurate benchmark against which to evaluate the LLM-based methods.

For data extraction, we considered extractions incorrect if they contained substantial omissions or erroneous content, while minor differences in phrasing or formatting that did not affect accuracy were disregarded. For ROB assessment, we categorized “definitely/probably yes” as low risk of bias and “definitely/probably no” as high risk of bias, considering the overall intent and implications of assessors’ judgments rather than strictly adhering to semantic distinctions in the response options. To assess efficiency, we measured the total time spent on each RCT, which included the LLM’s generation time and the investigators’ verification and modification time.

Data analysis

We conducted data analysis using R version 4.3.340. We quantified accuracy as the percentage of correct evaluations (correctness rate). We calculated the rate difference (RD) with 95% confidence intervals (CIs) to compare the overall correctness rates between LLM-assisted and conventional methods, considering an RD > 0.10 as an indication of superiority and an RD < −0.10 as an indication of inferiority. The RDs between LLM-assisted and LLM-only methods were also calculated both overall and for each domain. For ROB, we also calculated domain-specific and overall sensitivity, specificity, and F-score, considering “high risk” as positive and “low risk” as negative. To assess consistency, we calculated the agreement rates between the LLM-only extractions across models. For ROB assessments, Cohen’s kappa measured agreement between LLM-only results from each model and between these results and the reference standard. For both data extraction and ROB assessment, we utilized the prevalence-adjusted bias-adjusted kappa (PABAK) to assess inter-rater agreement among the four investigators’ LLM-assisted results. We chose PABAK over Fleiss’ kappa due to the high prevalence of correct extractions and assessments, which can yield paradoxically low kappa values despite high observed agreement.
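
For reference, the core quantities can be computed with a few lines of base R. The functions below are a minimal sketch, and the counts and agreement value used in the example calls are hypothetical, not the study data.

```r
# Rate difference between two correctness rates with a Wald 95% CI.
rd_ci <- function(x1, n1, x2, n2, conf = 0.95) {
  p1 <- x1 / n1; p2 <- x2 / n2
  rd <- p1 - p2
  se <- sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
  z  <- qnorm(1 - (1 - conf) / 2)
  c(rd = rd, lower = rd - z * se, upper = rd + z * se)
}
rd_ci(x1 = 980, n1 = 1000, x2 = 950, n2 = 1000)  # hypothetical counts

# F-score for ROB judgments, treating "high risk" as the positive class.
f_score <- function(tp, fp, fn) {
  precision <- tp / (tp + fp)
  recall    <- tp / (tp + fn)
  2 * precision * recall / (precision + recall)
}

# Two-category PABAK reduces to 2 * observed agreement - 1.
pabak <- function(observed_agreement) 2 * observed_agreement - 1
pabak(0.94)  # hypothetical observed agreement
```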

We conducted subgroup analyses of potential influencing factors, including PDF recognizability (dichotomized as higher recognizability, with ≥70% of text and data elements accurately detected by OCR in their original layout and context, versus lower recognizability, with <70% accurately detected), publication language (English versus non-English), and year of publication (prior to 2013 versus 2013 and later). These subgroup analyses were performed based on a priori hypotheses that studies with higher PDF recognizability, published in English, and more recently published would be associated with greater accuracy.
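
One standard way to obtain such interaction p values is sketched below, assuming item-level data and a logistic model with an interaction term; the data-frame and variable names are hypothetical, and this is an illustration rather than necessarily the exact procedure used.

```r
# Hypothetical data frame 'extraction_items' with one row per extracted item:
#   correct          0/1 indicator of a correct extraction
#   method           "LLM-only" or "LLM-assisted"
#   recognizability  "higher" or "lower" PDF recognizability
fit_full    <- glm(correct ~ method * recognizability,
                   family = binomial, data = extraction_items)
fit_reduced <- update(fit_full, . ~ . - method:recognizability)

# Likelihood-ratio test of the interaction term: does the effect of the method
# differ between recognizability subgroups?
anova(fit_reduced, fit_full, test = "Chisq")
```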