Introduction

Evidence syntheses require independent reviewers to extract data and assess the risk of bias (ROB)1, resulting in a labor-intensive and time-consuming process2, particularly in complementary and alternative medicine (CAM) research3,4. CAM has gained prominence due to its efficacy and safety profiles, leading to increased adoption by both clinicians and patients5,6. However, the lack of high-quality evidence necessitates efficient syntheses to support clinical practice7. Challenges include CAM’s complex, discipline-specific terminology and multilingual literature, which complicate data extraction8.

Recent advances in generative artificial intelligence have produced powerful large language models (LLMs) capable of analyzing vast text corpora, capturing complex contexts, and adapting to specialized domains, making them suitable for evidence synthesis9. Preliminary studies suggest LLMs’ potential in systematic reviews and meta-analyses10,11,12,13. However, their application in CAM is limited due to difficulties in creating CAM-specific prompts, maintaining precision in terminology, and handling diverse languages and study designs14. Moreover, the potential of LLM-assisted methods, in which AI and human expertise work in tandem, remains largely unexplored. This study aimed to develop structured prompts for guiding LLMs in extracting both basic and CAM-specific data and assessing ROB in randomized controlled trials (RCTs) on CAM interventions. We compared LLM-only and LLM-assisted methods with conventional approaches, seeking to enhance efficiency and quality in data extraction and ROB assessment, ultimately supporting clinical practice and guidelines.

We randomly selected 107 RCTs (Supplementary Table 1) from 12 Cochrane reviews15,16,17,18,19,20,21,22,23,24,25,26, spanning 1979–2024, with 27.1% in English and 72.9% in Chinese; 44.9% were published post-2013. Studies focused on mind-body practices (41.1%), herbal decoctions (34.6%), and natural products (24.3%). Based on OCR recognizability, 94.4% of RCTs had higher recognizability (≥70% of text and data accurately detected), while 5.6% had lower recognizability (<70%).

Two LLMs—Claude-3.5-sonnet and Moonshot-v1-128k—were employed to extract data and assess ROB. Supplementary Notes 1 to 4 document all responses from the two models. Supplementary Tables 2 to 7 present the analyses of the LLM-only and LLM-assisted extractions and assessments.

Across the 107 RCTs, the two models produced 12,814 extractions. As shown in Fig. 1, Claude-3.5-sonnet showed higher overall accuracy (96.2%, 95% CI: 95.8–96.5%) than Moonshot-v1-128k (95.1%, 95% CI: 94.7–95.5%), with a statistically significant difference (RD: 1.1%, 95% CI: 0.6–1.6%; p < 0.001). Claude-3.5-sonnet outperformed Moonshot-v1-128k in the Baseline Characteristics domain, while both models had similar accuracy across other domains. For Moonshot-v1-128k, the highest correctness rate was in the Outcomes domain (97.6%), and the lowest was in the Methods domain (90.9%). Errors in Moonshot-v1-128k’s extractions often resulted from incorrectly labeling data as “Not reported.” Commonly missed information included start/end dates (44 RCTs), baseline balance descriptions (39), number analyzed (40), demographics (22), theoretical basis (21), treatment frequency (8), all outcomes in 3 RCTs, and specific outcome data in 45 RCTs. However, Moonshot-v1-128k successfully extracted CAM-specific data, such as traditional Chinese medicine terminology. The inter-model agreement rate between Claude and Moonshot-v1-128k was 93.8%, with 83.3% of Claude’s errors also present in Moonshot-v1-128k’s results.

Fig. 1: Comparison of accuracy between Moonshot-v1-128k and Claude-3.5-sonnet in extracting data and assessing ROB.

This figure compares the accuracy of two language models, Claude-3.5-sonnet and Moonshot-v1-128k, in data extraction and risk-of-bias (ROB) assessments across multiple domains in 107 RCTs. Claude-3.5-sonnet demonstrated higher overall accuracy in data extraction (96.2%, 95% CI: 95.8% to 96.5%) than Moonshot-v1-128k (95.1%, 95% CI: 94.7% to 95.5%), with a statistically significant difference of 1.1% (p < 0.001). In the ROB assessment, Claude-3.5-sonnet also achieved slightly higher accuracy (96.9% vs. 95.7%), though the difference was not statistically significant. The greatest difference in domain-specific accuracy was observed in the Baseline Characteristics domain for data extraction, where Claude-3.5-sonnet outperformed Moonshot-v1-128k.

Investigators refined Moonshot-v1-128k’s extractions, achieving a corrected accuracy of 97.9% (95% CI: 97.7–98.2%), higher than the expected 95.3% for conventional methods (RD: 2.6%, 95% CI: 2.2–3.1%; p < 0.001; Fig. 2). The RD between LLM-assisted and LLM-only extractions was 2.8% (95% CI: 2.4–3.2%; p < 0.001; Table 1). Accuracy improvements were most notable in the Methods domain (+7.4%) and the Data and Analysis domain (+6.0%). Subgroup analyses revealed that higher PDF recognizability positively impacted Moonshot-v1-128k’s accuracy (p interaction = 0.023) but had no significant effect on LLM-assisted accuracy (p interaction = 0.100). Claude-3.5-sonnet achieved higher accuracy when extracting data from English RCTs than from Chinese RCTs (p interaction < 0.001).

Fig. 2: Comparison of accuracy and efficiency of conventional, LLM-only, and LLM-assisted methods in extracting data and assessing ROB using Moonshot-v1-128k.

This figure presents a comparison of the accuracy (correct rate) and efficiency (time spent) of three methods for data extraction and risk of bias (ROB) assessment: conventional, LLM-only, and LLM-assisted. For data extractions, the conventional method had an estimated accuracy of 95.3% and took 86.9 min per RCT. The LLM-only method achieved an accuracy of 95.1% and took only 96 s per RCT, while the LLM-assisted method had the highest accuracy at 97.9% and took 14.7 min per RCT. For ROB assessments, the conventional method had an estimated accuracy of 90.0% and took 10.4 min per RCT. The LLM-only method achieved an accuracy of 95.7% and took only 42 s per RCT, while the LLM-assisted method had the highest accuracy at 97.3% and took 5.9 min per RCT. These results demonstrate that LLM-assisted methods can achieve higher accuracy than conventional methods while being substantially more efficient.

Table 1 Accuracy of LLM-only and LLM-assisted data extractions

Both models conducted 1,070 ROB assessments. As shown in Fig. 1, Claude achieved 96.9% accuracy (95% CI: 95.7–97.9%), slightly higher than Moonshot-v1-128k’s 95.7% (95% CI: 94.3–96.8%), though the difference was not statistically significant (RD: 1.2%, 95% CI: −0.4 to 2.8%). Moonshot-v1-128k’s lowest accuracy was in the Sequence generation domain (87.9%), while other domains ranged from 94.4% to 100.0%. Sensitivities in Selective outcome reporting and Other bias were relatively low (0.50 and 0.40), with corresponding F-scores of 0.67 and 0.44, but other domains had F-scores between 0.97 and 1.00. Of the 46 incorrect assessments, 62.1% were due to missing supporting information, while 37.9% involved correct data extraction but erroneous judgments. Cohen’s kappa values indicated substantial to almost perfect agreement in most domains, except for Selective outcome reporting (0.66) and Other bias (0.42), likely due to high true negative rates (>93%). The inter-model agreement between Claude-3.5-sonnet and Moonshot-v1-128k was almost perfect (Cohen’s kappa = 0.88), with 66.7% of Claude’s errors also present in Moonshot-v1-128k’s results.

After refinement based on Moonshot-v1-128k’s assessments, the mean correctness rate of LLM-assisted ROB assessments increased to 97.3% (95% CI: 96.1–98.2%), significantly surpassing the expected 90.0% accuracy of conventional methods (RD: 7.3%, 95% CI: 6.2–8.3%; p < 0.001; Fig. 2). The RD between LLM-assisted and LLM-only assessments was 1.6% (95% CI: 0.0–3.2%; p = 0.05; Table 2), suggesting that human review corrected some errors and modestly improved accuracy. The prevalence-adjusted bias-adjusted kappa (PABAK) among the four investigators was 0.88, indicating almost perfect agreement. The Sequence generation domain exhibited the greatest improvement in accuracy (+8.4%), and all errors in the Allocation sequence concealment domain were rectified, achieving a 100% correctness rate. Subgroup analysis revealed that Claude-3.5-sonnet achieved significantly higher accuracy in assessing ROB for English-language RCTs than for Chinese-language RCTs (p interaction < 0.001). Conversely, LLM-assisted assessments showed higher accuracy for RCTs published in Chinese (p interaction = 0.023), suggesting that the investigators’ native language influenced assessment accuracy.

Table 2 Accuracy of LLM-only and LLM-assisted risk-of-bias assessments

For both data extraction and ROB assessment, the LLMs demonstrated substantial time savings compared to conventional methods. Data extraction took an average of 96 s per RCT with Moonshot-v1-128k and 82 s with Claude-3.5-sonnet, while refinement extended Moonshot-v1-128k-assisted extractions to 14.7 min per RCT, still much faster than the 86.9 min required by traditional approaches. Similarly, ROB assessments averaged 42 s per RCT with Moonshot-v1-128k and 41 s with Claude, with Moonshot-v1-128k-assisted assessments, including refinement, taking just 5.9 min per RCT compared to 10.4 min for conventional methods.

Overall, both Claude-3.5-sonnet and Moonshot-v1-128k demonstrated high accuracy, with LLM-assisted methods significantly outperforming conventional approaches in both accuracy and efficiency (Fig. 2). Claude-3.5-sonnet achieved slightly higher accuracy than Moonshot-v1-128k for data extraction (96.2% vs. 95.1%) and ROB assessment (96.9% vs. 95.7%), though the difference was statistically significant only for data extraction. Errors in both models often stemmed from failing to identify reported data, such as start/end dates and participant numbers, rather than misinterpreting extracted information. ROB assessment errors were most frequent in the Sequence generation domain, where inconsistent judgments arose despite correct justifications. For example, Moonshot-v1-128k accurately identified the randomization method in Mao, 2014 (Supplementary Note 2) but incorrectly classified it, suggesting challenges in applying rule-based criteria.

LLM-assisted methods proved particularly effective in addressing these issues. Human reviewers identified and corrected common error patterns, significantly improving accuracy, especially in the Methods domain (+7.4%) for data extraction and the Sequence generation domain (+8.4%) for ROB assessment. By addressing these recurring issues, reviewers not only enhanced the reliability of individual assessments but also provided insights into systematic weaknesses in LLM outputs. This process highlighted the critical role of human expertise in working with LLMs, as reviewers could identify specific areas needing improvement and ensure the conclusions were accurate and reliable.

Efficiency gains were substantial. For data extraction, time per RCT decreased from 86.9 min to 14.7 min, while ROB assessment times dropped from 10.4 min to 5.9 min. The combination of time savings and improved accuracy significantly enhances evidence synthesis, particularly in complex domains like CAM that demand specialized knowledge.

Subgroup analyses revealed that higher PDF recognizability improved Moonshot-v1-128k’s extraction accuracy (p interaction = 0.023), while Claude-3.5-sonnet extracted data more accurately from English-language RCTs than from Chinese-language ones (p interaction < 0.001). These subgroup differences may be attributed to variations in the models’ training datasets, underscoring the need to consider document quality and language characteristics when applying LLMs to evidence synthesis.

This study aligns with previous research demonstrating high accuracy for LLMs like Claude 2 and GPT-4 in data extraction when guided by structured prompts27,28,29. Our findings build on earlier work by optimizing prompts to enhance domain-specific judgment logic, step-wise reasoning, and few-shot learning, leading to significantly higher accuracy in ROB assessments compared to prior studies30. Additionally, incorporating confidence estimates and justifications for each domain allowed investigators to more effectively identify errors.

Strengths of this study include validated prompts, a diverse sample of RCTs, and the inclusion of less experienced reviewers to estimate practical effects. Limitations include potential language-dependent biases, as all reviewers were native Chinese speakers, and reliance on benchmark estimates for conventional methods, which may not fully reflect current practices across diverse settings. The RCTs included in our study had first authors affiliated with institutions in 12 countries and regions, but 83 studies (77.6%) were from mainland China. Although we analyzed the publication language of the RCTs, the primary language background of the original researchers may have been predominantly Chinese, which may limit the generalizability of these findings. Finally, although this study demonstrated the feasibility and effectiveness of LLM-assisted methods using Claude-3.5-sonnet, future research should explore other high-ranking models to ensure broader generalizability. Incorporating models from authoritative rankings or widely recognized sources would help confirm whether the approach is robust across different architectures and training datasets, enhancing its applicability in diverse research contexts.

Methods

This study was conducted between November 3, 2023, and September 30, 2024, in adherence to the AAPOR reporting guideline31. The Medical Ethics Review Committee of Lanzhou University’s School of Public Health exempted the study from ethical approval because all data originated from published studies.

This study employed two LLMs: Moonshot-v1-128k, an open-access model developed by Moonshot AI32, and Claude-3.5-sonnet, developed by Anthropic33. We hypothesized that a two-step process, comprising (1) data extraction and ROB assessment by LLMs guided by structured prompts, followed by (2) verification and refinement by a single researcher, would yield results non-inferior in accuracy and superior in efficiency to conventional methods requiring two independent researchers.

Tools and prompts for data extraction and risk-of-bias assessment

We drafted prompts for data extraction and ROB assessment using a few-shot learning strategy. For data extraction, we followed the Cochrane Handbook (Supplementary Table 8)34, aiming to include general core items in the following domains: Methods, Participants, Intervention groups, Outcomes, Data and analysis, and Others. Additionally, we extended these prompts to include CAM-specific elements following Xia et al.35, such as Chinese diagnostic patterns, therapeutic principles and methods, herbal formula compositions, non-pharmacological therapy details, and theoretical basis.

For ROB assessment, we drafted a prompt following a modified version of the Cochrane 1.0 instrument (ROB 1; Supplementary Note 5)36. It comprises ten domains: Sequence generation, Allocation sequence concealment, Blinding (assessed separately for patients, healthcare providers, data collectors, outcome assessors, and data analysts), Missing outcome data, Selective outcome reporting, and Other bias. Each domain can be rated as “definitely or probably yes” (low risk of bias) or “probably or definitely no” (high risk of bias). We refined the prompts through iterative pilot tests on five selected RCTs until both LLMs consistently produced fully correct extractions and ROB assessments. Supplementary Notes 6 and 7 present the finalized prompts, which consist of five core components: instruction and role setting, general guidelines, specific guidelines, output formatting, and supplementary guidelines.
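
To make the prompt structure concrete, the following is a minimal, hypothetical skeleton of the five core components; the wording is illustrative only and greatly abridged, and the finalized prompts appear in Supplementary Notes 6 and 7.

```r
# Hypothetical, abridged skeleton of the five prompt components;
# not the wording actually used (see Supplementary Notes 6 and 7).
prompt_skeleton <- paste(
  "## Instruction and role setting",
  "You are a systematic-review methodologist extracting data from an RCT of a CAM intervention.",
  "## General guidelines",
  "Extract only information reported in the article; label missing items as 'Not reported'.",
  "## Specific guidelines",
  "For herbal formulas, list each component and dose; record Chinese diagnostic patterns and therapeutic principles verbatim.",
  "## Output formatting",
  "Return one row per item: item | extracted value | supporting quotation | confidence.",
  "## Supplementary guidelines",
  "Keep traditional terminology in the original language, adding an English gloss where helpful.",
  sep = "\n"
)
cat(prompt_skeleton)
```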

Selection of samples and investigators

We estimated the accuracy of human reviewers for data extraction to be 95.3%, with a mean time per RCT of 86.9 min (inclusive of time for data extraction, verification, and consensus) according to Buscemi et al.37. The accuracy of human reviewers using the ROB 1 tool for assessment was assumed to be 90% according to Arno et al.38, with a mean time per RCT of 10.4 min. Assuming an expected accuracy of 95% for both LLM-assisted extractions and ROB assessments, and a non-inferiority threshold of 0.10, we determined a sample size of 104 RCTs, based on a 2.5% type I error rate and 90% power39. We searched the Cochrane Database of Systematic Reviews using MeSH terms for complementary therapies and keywords related to randomized controlled trials to identify Cochrane reviews (Supplementary Box 1). We included Cochrane reviews that included at least 5 RCTs with evidence synthesis, provided detailed data extraction, and performed ROB assessments, with no language restrictions. Retracted studies or those with unavailable full texts were excluded. Using Excel software, we generated random numbers and selected reviews from the eligible pool. For each selected review, we employed a stratified random sampling approach. Reviews including 10 or fewer RCTs had all their studies included, while for those with more than 10 RCTs, we randomly selected 10 using an Excel-generated randomization sequence. This process continued sequentially across reviews until we attained our target sample of 104 RCTs.
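
As a transparency aid, the reported sample size can be reproduced with a standard normal-approximation formula for non-inferiority of two proportions; the sketch below is a reconstruction under the stated parameters, not the original calculation code.

```r
# Reconstruction of the sample-size calculation (normal approximation for
# non-inferiority of two proportions); parameter values taken from the text.
p_llm  <- 0.95    # expected accuracy of LLM-assisted extraction
p_conv <- 0.953   # benchmark accuracy of conventional data extraction
margin <- 0.10    # non-inferiority threshold
alpha  <- 0.025   # one-sided type I error rate
power  <- 0.90

z <- qnorm(1 - alpha) + qnorm(power)
n <- z^2 * (p_llm * (1 - p_llm) + p_conv * (1 - p_conv)) /
     (p_llm - p_conv + margin)^2
ceiling(n)  # 104 RCTs
```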

Four Chinese-speaking investigators with limited experience (1 to 1.5 years in evidence synthesis) participated in the manual reviews. Before starting, they underwent one month of standardized training to ensure consistency.

Study procedures

Figure 3 illustrates the main study process, led and supervised by a senior researcher (LG).

Fig. 3: Flow diagram of the study process.

The study process began with the random selection of 107 randomized controlled trials (RCTs) on complementary and alternative medicine interventions. An investigator checked the PDF files of these RCTs to ensure their eligibility and completeness. Two large language models (LLMs)—Moonshot-v1-128k and Claude-3.5-sonnet—were employed, with tailored prompts iteratively tested and refined until both LLMs could achieve continuous, correct extractions and assessments across five consecutive RCTs. A single investigator then independently conducted extractions and risk-of-bias (ROB) assessments for each RCT using these finalized prompts. Simultaneously, four reviewers, after one month of training, used the lower-accuracy LLM outputs as initial references. Each reviewer independently checked and, if necessary, modified the LLM results to create four separate LLM-assisted assessments and extractions per RCT. To validate these results, two methodologists compared the LLM-only, LLM-assisted, and conventional manual assessments and reached a consensus to establish a benchmark reference. The study was conducted between November 2023 and September 2024, adhering to AAPOR guidelines and exempt from ethics approval by Lanzhou University’s Medical Ethics Review Committee due to its use of publicly available data.

Independent extraction and assessment by the large language models

Two investigators (HHL and LYH) conducted extractions and assessments for all RCTs using Moonshot-v1-128k in mainland China from March 5 to 15, 2024, and Claude-3.5-sonnet in Canada from September 15 to 30, 2024. For each RCT, the investigator checked the recognizability of the portable document format (PDF) file, using optical character recognition (OCR) software to convert it to text and quantifying the proportion of unrecognizable text and data. The investigator uploaded each PDF together with the corresponding prompt to both LLMs and then exported their complete outputs. We accessed both models through their respective APIs, set the temperature parameter to 0 to ensure the models strictly followed the prompt, used a wired 100 Mbps network, and reran any queries whose outputs were invalidated by unrelated issues, such as network or server failures.
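
For illustration, the sketch below shows the kind of API request described above, written in R with the httr package. The endpoint version, model identifier, file names, and the use of OCR-derived text in place of the uploaded PDF are assumptions made for this example rather than the exact pipeline used in the study.

```r
library(httr)

# Illustrative only: model identifier, API version, and input files are
# placeholders; the study supplied the PDFs directly via the vendors' APIs.
prompt       <- paste(readLines("rob_assessment_prompt.txt"), collapse = "\n")
article_text <- paste(readLines("rct_ocr_text.txt"), collapse = "\n")

resp <- POST(
  "https://api.anthropic.com/v1/messages",
  add_headers(
    "x-api-key"         = Sys.getenv("ANTHROPIC_API_KEY"),
    "anthropic-version" = "2023-06-01"
  ),
  body = list(
    model       = "claude-3-5-sonnet-20240620",
    max_tokens  = 4096,
    temperature = 0,  # minimizes randomness so the model strictly follows the structured prompt
    messages    = list(list(role = "user",
                            content = paste(prompt, article_text, sep = "\n\n")))
  ),
  encode = "json"
)
cat(content(resp, as = "parsed")$content[[1]]$text)
```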

Extraction and assessment by reviewers with large language model assistance

Based on the outputs from the LLM with relatively lower accuracy (either Moonshot-v1-128k or Claude-3.5-sonnet), four reviewers (CYB, WLZ, JYL, and DNX) independently extracted data and assessed ROB for all RCTs. The reviewers first examined the LLM-derived results and then either agreed with them or modified them to form four LLM-assisted extraction and assessment results. The reviewers had the option to consult the original articles at their discretion throughout this process. The reviewers recorded the total time spent on each RCT, which included the time taken by the LLMs to generate the initial results plus the time spent by the reviewers to verify and modify those results.

Outcome measures

We evaluated the performance of both LLM-only and LLM-assisted methods in terms of accuracy and efficiency, comparing them against conventional manual methods as well as assessing the accuracy improvement from LLM-only to LLM-assisted methods. Two methodologists (HHL, LG) independently compared the results from three sources: the direct output from the LLMs (LLM-only), the results after human review and modification of the LLM output (LLM-assisted), and their own manual extraction and assessment. Through careful comparison and discussion of these three sets of results, the methodologists collaboratively established a reference standard for each extraction and assessment. Any discrepancies were resolved through consensus, ensuring a robust and accurate benchmark against which to evaluate the LLM-based methods.

For data extraction, we considered extractions incorrect if they contained substantial omissions or erroneous content, while minor differences in phrasing or formatting that did not affect accuracy were disregarded. For ROB assessment, we categorized “definitely/probably yes” as low risk of bias and “definitely/probably no” as high risk of bias, considering the overall intent and implications of assessors’ judgments rather than strictly adhering to semantic distinctions in the response options. To assess efficiency, we measured the total time spent on each RCT, which included the LLM’s generation time and the investigators’ verification and modification time.

Data analysis

We conducted data analysis using R version 4.3.340. We quantified accuracy as the percentage of correct evaluations (correctness rate). We calculated the rate difference (RD) with 95% confidence intervals (CIs) to compare the overall correctness rates between LLM-assisted and conventional methods, considering an RD > 0.10 as an indication of superiority and an RD < −0.10 as an indication of inferiority. The RDs between LLM-assisted and LLM-only methods were also calculated both overall and for each domain. For ROB, we also calculated domain-specific and overall sensitivity, specificity, and F-score, considering “high risk” as positive and “low risk” as negative. To assess consistency, we calculated the agreement rates between the LLM-only extractions across models. For ROB assessments, Cohen’s kappa measured agreement between LLM-only results from each model and between these results and the reference standard. For both data extraction and ROB assessment, we utilized the prevalence-adjusted bias-adjusted kappa (PABAK) to assess inter-rater agreement among the four investigators’ LLM-assisted results. We chose PABAK over Fleiss’ kappa due to the high prevalence of correct extractions and assessments, which can yield paradoxically low kappa values despite high observed agreement.
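
For reference, the core quantities can be computed with a few lines of base R. The functions below are a minimal sketch, and the counts and agreement value used in the example calls are hypothetical, not the study data.

```r
# Rate difference between two correctness rates with a Wald 95% CI.
rd_ci <- function(x1, n1, x2, n2, conf = 0.95) {
  p1 <- x1 / n1; p2 <- x2 / n2
  rd <- p1 - p2
  se <- sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
  z  <- qnorm(1 - (1 - conf) / 2)
  c(rd = rd, lower = rd - z * se, upper = rd + z * se)
}
rd_ci(x1 = 980, n1 = 1000, x2 = 950, n2 = 1000)  # hypothetical counts

# F-score for ROB judgments, treating "high risk" as the positive class.
f_score <- function(tp, fp, fn) {
  precision <- tp / (tp + fp)
  recall    <- tp / (tp + fn)
  2 * precision * recall / (precision + recall)
}

# Two-category PABAK reduces to 2 * observed agreement - 1.
pabak <- function(observed_agreement) 2 * observed_agreement - 1
pabak(0.94)  # hypothetical observed agreement
```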

We conducted subgroup analyses of potential influencing factors, including PDF recognizability (dichotomized as higher recognizability, with ≥70% of text and data elements accurately detected by OCR in their original layout and context, versus lower recognizability, with <70% accurately detected), publication language (English versus non-English), and year of publication (prior to 2013 versus 2013 and later). These subgroup analyses were performed based on a priori hypotheses that studies with higher PDF recognizability, published in English, and more recently published would be associated with greater accuracy.
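
One standard way to obtain such interaction p values is sketched below, assuming item-level data and a logistic model with an interaction term; the data-frame and variable names are hypothetical, and this is an illustration rather than necessarily the exact procedure used.

```r
# Hypothetical data frame 'extraction_items' with one row per extracted item:
#   correct          0/1 indicator of a correct extraction
#   method           "LLM-only" or "LLM-assisted"
#   recognizability  "higher" or "lower" PDF recognizability
fit_full    <- glm(correct ~ method * recognizability,
                   family = binomial, data = extraction_items)
fit_reduced <- update(fit_full, . ~ . - method:recognizability)

# Likelihood-ratio test of the interaction term: does the effect of the method
# differ between recognizability subgroups?
anova(fit_reduced, fit_full, test = "Chisq")
```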