
Evaluating the agreement between ChatGPT-4 and validated questionnaires in screening for anxiety and depression in college students: a cross-sectional study

Abstract

Background

The Chat Generative Pre-trained Transformer (ChatGPT), an artificial intelligence-based web application, has demonstrated substantial potential across various knowledge domains, particularly medicine. This cross-sectional study assessed the validity and potential usefulness of ChatGPT-4 in screening for anxiety and depression by comparing a ChatGPT-4-adapted questionnaire with validated instruments.

Methods

This study tasked ChatGPT-4 with generating a structured interview questionnaire based on the validated Patient Health Questionnaire-9 (PHQ-9) and Generalized Anxiety Disorder Scale-7 (GAD-7); the resulting measures are referred to as GPT-PHQ-9 and GPT-GAD-7. Spearman correlation analysis, intra-class correlation coefficients (ICC), Youden’s index, receiver operating characteristic (ROC) curve analysis, and Bland–Altman plots were used to evaluate the consistency between scores from the ChatGPT-4-adapted questionnaires and those from the validated questionnaires.

Results

A total of 200 college students participated. Cronbach’s α indicated acceptable reliability for both the GPT-PHQ-9 (α = 0.75) and the GPT-GAD-7 (α = 0.76). ICC values were 0.80 for the PHQ-9 and 0.70 for the GAD-7. Spearman’s correlation showed moderate associations with the PHQ-9 (rho = 0.63) and the GAD-7 (rho = 0.68). ROC curve analysis revealed optimal cutoffs of 9.5 for depressive symptoms and 6.5 for anxiety symptoms, both with high sensitivity and specificity.

Conclusions

The questionnaire adapted by ChatGPT-4 demonstrated good consistency with the validated questionnaires. Future studies should investigate the usefulness of the ChatGPT-designed questionnaire in different populations.


Introduction

Large language models (LLMs), such as the Chat Generative Pre-trained Transformer (ChatGPT), have become prominent tools in the field of artificial intelligence. Designed to generate text that approximates human language, these models use advanced natural language processing (NLP) and deep learning techniques to process large datasets and produce nuanced, context-aware output [12]. ChatGPT, developed by OpenAI, is one of the largest publicly accessible autoregressive language models. It gained significant traction after its public release on November 30, 2022, reaching approximately 100 million monthly active users within two months and setting a record as the fastest-growing consumer application to date [22].

ChatGPT goes far beyond general-purpose conversation. As a chatbot powered by LLMs, it is designed to support human-like conversational interaction and has been applied in domains ranging from customer service to medical research; its ability to generate high-quality, human-like text opens new avenues of exploration [20]. Recent studies have examined its utility in answering examination questions for medical students [20, 21, 23, 24] and in composing basic medical reports [13, 25, 30, 32]. ChatGPT is currently available in three principal versions: the widely accessible ChatGPT-3.5, the advanced subscription-based ChatGPT-4, and the newly introduced, cost-efficient ChatGPT-4o. ChatGPT-4, the flagship version, offers superior linguistic fluency and an enhanced user experience compared with ChatGPT-3.5 [10], while ChatGPT-4o provides comparable functionality optimized for responsiveness and cost-effectiveness, making it particularly suitable for real-time and large-scale applications [33]. Among these versions, ChatGPT-4 performs best at processing complex text and generating detailed, high-quality responses, making it the preferred choice for tasks requiring in-depth analysis and nuanced understanding. Although other generative artificial intelligence tools, such as Bard and Claude, have emerged in recent years, ChatGPT remains dominant in user acceptance and market reach [4], and studies have shown that ChatGPT-4 identifies mental health-related issues more accurately than these alternatives [15]. Given its superior performance in processing complex text and its demonstrated effectiveness in healthcare applications, ChatGPT-4 was selected as the research tool for this study.

In healthcare, ChatGPT has begun to show great potential, especially in the development and administration of medical questionnaires. Questionnaires are widely recognized as an effective tool for screening mental health conditions because of their simplicity and ease of use. However, traditional questionnaires often take a “one-size-fits-all” approach, limiting their validity across cultural and demographic contexts; research has demonstrated the need to validate these tools for specific age groups [17, 34] or clinical populations [7, 40]. The ability of ChatGPT to dynamically generate and adapt questionnaires provides an opportunity to address these limitations. By leveraging its natural language processing capabilities, ChatGPT can tailor questionnaires based on scales with established reliability and validity, potentially improving the accuracy and personalization of assessments. Preliminary studies have demonstrated the feasibility of using ChatGPT-adapted questionnaires in clinical settings. For example, one study explored the use of ChatGPT to design a questionnaire for low back pain assessment [11]; compared with validated tools, including the Oswestry Disability Index (ODI) and the Quebec Back Pain Disability Scale (QBPDS), it showed significant correlations in domains such as quality of life and medical counseling. Furthermore, that study highlighted that ChatGPT-adapted questionnaires offer a standardized method of assessment that minimizes the variability associated with manual administration. This standardization improves the uniformity of assessment criteria and supports the collection of reliable data over time, facilitating early detection of potential problems and more timely diagnosis and intervention across diverse populations.

However, the application of ChatGPT in healthcare remains underexplored, particularly in mental health screening. Studies have shown that between 21.6% and 37.6% of Chinese university students suffer from depression [37, 42], while more than 30% of students in Asia as a whole have anxiety symptoms [14, 18, 31]. Early identification and intervention are therefore crucial for managing these symptoms, yet most existing screening tools are general-purpose scales, with few designed specifically to screen for anxiety and depression in college students. The ability to dynamically customize questionnaires with ChatGPT offers a novel direction for questionnaire-based screening. It has the potential to create culturally sensitive and user-friendly assessment tools, which may increase student acceptance of and engagement in mental health screening. By generating questionnaires that meet the unique needs of specific populations, ChatGPT could facilitate early identification of anxiety and depression and help ensure that college students receive timely interventions.

Concerns nonetheless remain about ChatGPT’s ability to comprehend questions, as well as about possible inaccuracies [41] and biases in generated content [12], which may reduce the accuracy of questionnaire-based assessment. The aim of this study was therefore to use ChatGPT-4, incorporating the items and scoring criteria of existing anxiety and depression questionnaires, to create a structured interview questionnaire tailored to university students. By comparing the scores from the ChatGPT questionnaire with those from validated standardized questionnaires, this study aimed to assess the validity of an artificial intelligence-assisted assessment tool and to open new possibilities for future questionnaire assessment methods.

Methods

Study design

This was an observational cross-sectional study. College students were recruited from two universities in China between November 2023 and March 2024. All data were collected anonymously, and the questionnaire investigators were blinded to the participants’ responses. The study was conducted in accordance with the principles of the Declaration of Helsinki, and each participant provided written informed consent.

Questionnaire development

In this study, we developed a structured interview questionnaire designed to assess anxiety and depression levels among college students in their daily lives. The questionnaire was created using ChatGPT-4, with the items and scoring criteria of the Patient Health Questionnaire-9 (PHQ-9) and Generalized Anxiety Disorder-7 (GAD-7) scales, which are widely used to screen for depression and anxiety, as the basis for generation. The PHQ-9 and GAD-7 items and their scoring criteria were entered into ChatGPT-4 by the researcher under the guidance of a computer expert, whose role was to help optimize the interaction prompts. ChatGPT-4 then generated questions, grounded in the input items, that fit the daily-life scenarios of college students, covering academic studies, social interactions, dormitory life, and career planning. Most of these contextual questions were based on PHQ-9 and GAD-7 items, and the questions were written in Chinese to ensure the questionnaire was applicable to participants with a Chinese language background.
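The prompting step described above can be sketched as follows. This is a hypothetical illustration only: the prompt wording, the abbreviated English item texts, and the `build_prompt` helper are assumptions for demonstration, not the study's actual Chinese-language instructions (those appear in Supplementary Document 1).

```python
# Hypothetical sketch of assembling the adaptation prompt from scale items.
# Item texts are abbreviated English PHQ-9 examples; the study's actual
# Chinese-language instructions appear in its Supplementary Document 1.
PHQ9_ITEMS = [
    "Little interest or pleasure in doing things",
    "Feeling down, depressed, or hopeless",
]
SCORING = "0 = not at all, 1 = several days, 2 = more than half the days, 3 = nearly every day"

def build_prompt(items, scoring, contexts):
    """Combine scale items, scoring criteria, and target contexts into one prompt."""
    lines = [
        "Adapt each screening item below into questions set in a college "
        "student's daily-life contexts (" + ", ".join(contexts) + "), "
        "keeping the original 4-point scoring criteria (" + scoring + ").",
        "",
    ]
    lines += [f"{i}. {item}" for i, item in enumerate(items, 1)]
    return "\n".join(lines)

prompt = build_prompt(
    PHQ9_ITEMS, SCORING,
    ["academic studies", "social interactions", "dormitory life", "career planning"],
)
print(prompt)
```

Keeping the scoring criteria verbatim inside the prompt is what lets the model map its contextual questions back onto the original 0–3 item scores.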

Participants answered these ChatGPT-adapted contextual questions via the ChatGPT webpage on their computers, and ChatGPT then automatically scored each item according to the PHQ-9 and GAD-7 criteria. To ensure comparability of the assessment results, all participants also completed the traditional PHQ-9 and GAD-7 online questionnaires so that the ChatGPT-based assessment could be compared with the validated scale results. To avoid confusion with the traditional questionnaire results, all ChatGPT-adapted questionnaire scores were labeled with the prefix “GPT-”. A complete example of the ChatGPT-adapted screening questionnaire, along with its English version, can be found in Supplementary Document 1.

Scale validation

The PHQ-9 effectively screens for depressive symptoms, with satisfactory sensitivity and specificity [27, 35]. The scale has been validated in Chinese college students, with a Cronbach’s α coefficient of 0.874, indicating high reliability. Its structural validity was confirmed through exploratory and confirmatory factor analyses, with a two-factor model (affective and somatic symptoms) yielding excellent fit indices (CFI = 0.984, TLI = 0.974, RMSEA = 0.060) [39]. The questionnaire comprises 9 items; respondents rate the frequency of each symptom over the past 2 weeks on a 4-point Likert scale (0 = not at all, 1 = several days, 2 = more than half the days, 3 = nearly every day). Total PHQ-9 scores range from 0 to 27, with higher scores indicating greater severity of depressive symptoms. The GAD-7 is a reliable self-report scale for assessing anxiety severity, demonstrating satisfactory psychometric properties in adults (Zhang et al., 2021). It has been validated in Chinese medical postgraduate students, with a Cronbach’s α coefficient of 0.93, indicating excellent internal consistency. Confirmatory factor analysis supported a unidimensional structure, with strong fit indices (CFI = 0.97, NFI = 0.96, RMSEA = 0.05). The scale comprises seven items evaluating symptoms of worry and somatic tension over the previous two weeks, rated on the same 4-point Likert scale. Total scores range from 0 to 21, with higher scores indicating greater anxiety severity.
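The scoring rule above is a simple sum of 4-point Likert responses. A minimal sketch, using illustrative response vectors rather than study data:

```python
# Minimal scoring sketch for the PHQ-9 and GAD-7 as described above.
# The response vectors are illustrative, not data from the study.
def total_score(responses, n_items):
    """Sum 4-point Likert responses (0-3) after validating count and range."""
    if len(responses) != n_items:
        raise ValueError(f"expected {n_items} items, got {len(responses)}")
    if any(r not in (0, 1, 2, 3) for r in responses):
        raise ValueError("each response must be 0, 1, 2, or 3")
    return sum(responses)

phq9_responses = [1, 0, 2, 1, 0, 0, 1, 0, 0]  # 9 items, total range 0-27
gad7_responses = [0, 1, 1, 0, 0, 1, 0]        # 7 items, total range 0-21
print(total_score(phq9_responses, 9))  # 5
print(total_score(gad7_responses, 7))  # 3
```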

Statistical analysis

Analyses were performed using RStudio 2023.12, with the significance threshold set at p < 0.05. After college students completed the four-scenario questionnaire adapted by ChatGPT-4, scores for each item were generated by the model. Spearman correlation analysis was conducted to explore the relationship between the ChatGPT questionnaires and the validated questionnaires. Additionally, Cronbach’s alpha coefficient was calculated to assess the internal consistency of the scales based on these scores. Intra-class correlation coefficients (ICC) were computed for each item and for the total score to evaluate the agreement between the ChatGPT questionnaire and the validated questionnaire. ICC estimates, reported with 95% confidence intervals (CIs), range from 0 to 1, where 0 indicates no agreement and 1 indicates perfect agreement. ICC values are interpreted as follows: less than 0.5 signifies poor reliability, 0.5 to 0.75 moderate reliability, 0.75 to 0.9 good reliability, and greater than 0.9 excellent reliability [26]. To evaluate the screening accuracy of the model, we used receiver operating characteristic (ROC) analysis. This graphical method illustrates a classifier’s prediction accuracy, with the area under the ROC curve (AUC) representing the likelihood that the model ranks positive cases higher than negative ones; a higher AUC indicates better performance, and an AUC greater than 0.7 is considered clinically acceptable [38]. Optimal cutoff values were determined using Youden’s index (sensitivity + specificity − 1) [9]. Bland–Altman plots were used to visually evaluate the agreement between scores from the ChatGPT-4-adapted PHQ-9 and GAD-7 questionnaires and those from the validated versions. In these plots, the differences between paired scores (y-axis) are plotted against the means of the paired scores (x-axis) to assess bias and variability [19]. The mean difference (bias) quantifies the systematic error between the two measurements, while the 95% limits of agreement (mean difference ± 1.96 × SD of the differences) estimate the range within which most differences between scores are expected to fall [19]. The width of the 95% limits of agreement was used to evaluate the degree of variability and to judge the interchangeability of the two methods; wider limits indicate greater variability and less agreement.
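These analyses can be sketched in a few lines. The snippet below is illustrative only: it runs on synthetic paired scores (the study's data are not public), and the "ground truth" caseness threshold of 10 on the validated scale is an assumption for demonstration, not a value from the study.

```python
# Sketch of the agreement analyses described above, run on synthetic paired
# scores (the study's data are not public; the arrays below are illustrative).
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
validated = rng.integers(0, 20, size=200).astype(float)   # stand-in PHQ-9 totals
adapted = validated + rng.normal(0.9, 2.5, size=200)      # stand-in GPT-adapted totals

# Spearman rank correlation between the paired questionnaires
rho, p_value = stats.spearmanr(adapted, validated)

# Bland-Altman: bias and 95% limits of agreement
diff = adapted - validated
bias = diff.mean()
sd = diff.std(ddof=1)
loa_low, loa_high = bias - 1.96 * sd, bias + 1.96 * sd

# Youden's index (sensitivity + specificity - 1) scanned over candidate cutoffs,
# treating a validated-score threshold (assumed >= 10 here) as ground truth
truth = validated >= 10
def youden(cutoff):
    sensitivity = (adapted[truth] >= cutoff).mean()
    specificity = (adapted[~truth] < cutoff).mean()
    return sensitivity + specificity - 1
best_cutoff = max(np.arange(0.5, 25.5), key=youden)

print(f"rho={rho:.2f} bias={bias:.2f} LoA=({loa_low:.2f}, {loa_high:.2f}) cutoff={best_cutoff}")
```

Scanning half-integer cutoffs mirrors how ROC thresholds fall between adjacent integer scale scores, which is why the study reports cutoffs of 9.5 and 6.5.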

Results

Demographics and questionnaire results

A total of 200 college students were recruited. The mean age was 21.07 years (SD = 1.31), and 139 (67.50%) of the participants were women. All of the students were able to complete the administered questionnaires without assistance. The median questionnaire results, with interquartile ranges and observed score ranges, were as follows: GPT-PHQ-9: 4 (IQR 3–6 [range 0–20]); PHQ-9: 4 (IQR 2–5 [range 0–16]); GPT-GAD-7: 3 (IQR 2–5 [range 0–15]); GAD-7: 3 (IQR 2–4 [range 0–13]).

Correlation measures

Spearman correlation analysis between the ChatGPT questionnaires and the validated anxiety and depression questionnaires revealed positive intra- and inter-correlations. Total scores showed moderate correlations between the ChatGPT questionnaires and the validated questionnaires: PHQ-9 (Spearman’s rho = 0.63, p < 0.05) and GAD-7 (Spearman’s rho = 0.68, p < 0.001). However, other items on the ChatGPT questionnaires did not demonstrate significant correlations with the validated questionnaires. Moreover, the correlation for the PHQ-9 was observed to be stronger than that for the GAD-7 (Fig. 1).

Fig. 1

Spearman correlation matrix of the ChatGPT-adapted questionnaires and the validated anxiety and depression questionnaires. a Heatmap of Spearman correlation coefficients between GPT-PHQ-9 and PHQ-9 scores. b Heatmap of Spearman correlation coefficients between GPT-GAD-7 and GAD-7 scores. Each cell in the heatmap represents the Spearman correlation coefficient between two variables, with the color intensity and the value indicating the strength and direction of the correlation. Asterisks indicate significant values: * p < 0.05, *** p < 0.001

Reliability measures

Cronbach’s α indicated acceptable internal consistency for both the GPT-PHQ-9 (α = 0.75) and the GPT-GAD-7 (α = 0.76). ICC estimates and corresponding 95% CIs are shown in Table 1. The ICC for the PHQ-9 was 0.80 (95% CI: 0.73–0.85; p < 0.001), indicating good consistency, while the GAD-7 had an ICC of 0.70 (95% CI: 0.60–0.77; p < 0.001), suggesting moderate consistency. These findings indicate moderate to good agreement between the total scores of the ChatGPT questionnaires and those of the validated questionnaires.

Table 1 Intraclass correlation coefficient (ICC) to evaluate the consistency between ChatGPT-adapted questionnaires and validated questionnaires

Validity measures

ROC curve analysis for the GPT-PHQ-9 identified 9.5 as the optimal cutoff for detecting depressive symptoms, with a sensitivity of 96.17% and a specificity of 88.23%; the area under the GPT-PHQ-9 ROC curve was 0.9585 (95% CI: 0.927–0.989; p < 0.001). For the GPT-GAD-7, the optimal cutoff for identifying anxiety symptoms was 6.5, yielding a sensitivity of 77.54% and a specificity of 84.61%; the area under the GPT-GAD-7 ROC curve was 0.8582 (95% CI: 0.765–0.952; p < 0.001).

Agreement analysis

The Bland–Altman analysis revealed that the PHQ-9 (95% limits of agreement: −3.962 to 5.772) and the GAD-7 (95% limits of agreement: −4.665 to 5.482) did not demonstrate sufficient agreement (Fig. 2), indicating that these scales are not interchangeable. A slight positive bias was observed between the GPT-PHQ-9 and PHQ-9, with a mean difference of 0.90 (SD: 2.483), and between the GPT-GAD-7 and GAD-7, with a mean difference of 0.41 (SD: 2.589); scores on the adapted versions tended to be marginally higher than those on the validated questionnaires. Although the observed bias is small, the relatively wide 95% limits of agreement suggest residual variability between the two measures, so the ChatGPT-adapted questionnaires may not be a complete substitute for the validated questionnaires.
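The reported limits follow from the Bland–Altman formula in the Methods (bias ± 1.96 × SD of the differences); a quick arithmetic check against the GPT-PHQ-9 figures:

```python
# Recompute the 95% limits of agreement from the reported GPT-PHQ-9 bias
# (0.90) and SD of the differences (2.483): bias +/- 1.96 * SD.
bias, sd = 0.90, 2.483
low, high = bias - 1.96 * sd, bias + 1.96 * sd
print(round(low, 3), round(high, 3))  # -3.967 5.767
```

The small discrepancy from the reported −3.962 to 5.772 is consistent with rounding of the published bias and SD.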

Fig. 2

Bland–Altman plots for the agreement between ChatGPT-adapted and validated questionnaires. a Bland–Altman plot showing the difference in total scores between GPT-PHQ-9 and PHQ-9. b Bland–Altman plot showing the difference in total scores between GPT-GAD-7 and GAD-7. The x-axis represents the mean of the total scores for the two questionnaires, and the y-axis represents the difference between them. The solid line indicates the mean difference, while the dashed lines represent the 95% limits of agreement

Discussion

This study assessed the concordance between a ChatGPT-4-adapted questionnaire and established depression and anxiety questionnaires among college students. The ChatGPT-4-adapted questionnaire demonstrated good consistency with the validated instruments, suggesting that it could be used to assess anxiety and depression in college students and may play a useful role in clinical practice.

To the best of our knowledge, this study is the first to combine validated questionnaires with ChatGPT-4, adapting structured questionnaires tailored to the characteristics of college students. The results indicated good consistency between the ChatGPT-adapted questionnaires and the validated instruments. A previous cross-sectional study showed that ChatGPT questionnaires exhibit acceptable significant correlations with validated questionnaires [11], a finding consistent with our results. However, a pilot study investigating the impact of ChatGPT on user experience found that ChatGPT-adapted questionnaires did not draw the anticipated positive responses from participants [44]. The discrepancy between those findings and this study could be attributed to several factors. First, the ChatGPT questionnaire in our study was refined with the help of computer professionals to better reflect contemporary college life, potentially making it more relatable for students and encouraging more genuine responses. Moreover, perceived usefulness is an important factor shaping students’ trust in ChatGPT and appears critical in predicting and encouraging its successful adoption [1]. In the context of AI-powered chat technology like ChatGPT, students are generally open to embracing new technologies [36]. Research has shown that college students are highly accepting of ChatGPT, which may lead them to hold a positive attitude toward ChatGPT-based psychological assessment and is conducive to adoption of the new technology [3, 5]. Finally, a quantitative study using online survey methods suggests that people who have never used an AI chatbot may still show greater willingness to do so, viewing it as a convenient and effective option for mental health support; this expectation of convenience, especially among those unfamiliar with the technology, may increase initial engagement [30]. In this study, a large portion of the college student population, while aware of ChatGPT, had not previously used it, which might have enhanced their compliance and the accuracy of their responses.

According to the questionnaire scores, anxiety and depression levels among the college students were predominantly minimal or mild. Studies have demonstrated that a depressive episode assessment protocol adapted by ChatGPT-4 aligns closely with the recommendations of primary care physicians, indicating that ChatGPT-4 may meaningfully enhance clinical decision-making in mild cases of depression [29]. ChatGPT has also been more accurate in identifying mild depression than moderate depression, which may explain its accuracy on the predominantly mild depression scores of the college students in this study. A recent case study compared mental health assessments generated by ChatGPT with norms set by mental health professionals and found that, in most cases, ChatGPT’s scores for mental resilience were generally lower than professionals’ assessments [16]. This suggests potential differences in how AI-based assessments and human experts interpret and evaluate psychological constructs beyond diagnostic categories such as depression and anxiety. There have been numerous prior reports of applying natural language processing to detect mental health conditions, but these studies focused on detecting pre-selected diseases [43]. When training a comprehensive model to detect and classify psychiatric disorders into different diagnoses, the heterogeneous presentations of these disorders and the poor reliability of psychiatric diagnosis may pose major challenges [2]. Model performance is also not robust when training on datasets from different sources, owing to diverse writing styles and semantic heterogeneity [43]. As a result, such detection models often perform suboptimally in complex scenarios.

Although this study found that ChatGPT-adapted questionnaire scores and traditional questionnaire scores are reasonably consistent and can assist in screening for depression, there is also a concerning risk that ChatGPT may misinterpret users’ responses [8]. LLMs have been shown to produce erroneous or misleading information, and these errors, combined with the authoritative tone of chatbot responses, may lead clinicians or patients to inadvertently believe inaccurate information, resulting in under- or overestimation of symptom severity [6]. ChatGPT should therefore be used in conjunction with professional clinical judgment, as an adjunct to rather than a substitute for traditional methods, to ensure the reliability of mental health assessments. Future research should focus on enhancing NLP techniques to improve ChatGPT’s ability to recognize semantic nuances in user responses, particularly in complex mental health assessment scenarios.

Nevertheless, it is anticipated that both medical professionals and patients will increasingly utilize Chatbots [28]. The integration of ChatGPT as an artificial intelligence tool in healthcare systems is a trend, as it significantly enhances precision and reduces the time required for numerous healthcare processes [12]. The long-term benefits of incorporating ChatGPT in various aspects of healthcare can lead to improved efficiency and accuracy in the sector. As ChatGPT technology advances, future research could explore its efficacy in handling more complex patient scenarios, providing new ways for clinical staff and patients to conduct scale assessments.

Limitations

While these findings indicate the promise of AI-adapted versions of the PHQ-9 and GAD-7, the participant group was limited to college students, so the sample is not broadly representative. Moreover, ChatGPT’s ability to accurately understand and adapt psychological concepts may vary across cultural and linguistic environments; this uncertainty about how it interprets concepts such as depression and anxiety in different cultural contexts may affect validity in different groups. Future research should extend these findings to other demographic groups, including those from different cultural backgrounds, to validate the effectiveness of ChatGPT-adapted questionnaires in a broader population. Additionally, ChatGPT-4 sits behind a paywall, and the Pro version with Bing BETA internet browsing requires a paid subscription (US $20 per month); this raises concerns about the financial accessibility of LLMs and the subscription costs that enable access to accuracy-improving features.

Conclusions

This cross-sectional study demonstrated substantial concordance between the ChatGPT-adapted questionnaires and validated instruments for assessing anxiety and depression among college students. ChatGPT can be employed to explore new subfields in the clinical evaluation of patients, owing to its ability to use various types of data, and new questionnaires could in future be developed around the characteristics of other populations. As the technology advances, directly using the ChatGPT website for patient questionnaire assessments in clinical settings may become feasible, which could significantly reduce the workload of healthcare personnel. However, further research is required to evaluate how this application might benefit patients and enhance clinical workflows.

Data availability

The datasets adapted and/or analyzed in the study are not currently publicly available but are available from the corresponding authors of this study upon reasonable request.

References

  1. Abdaljaleel M, Barakat M, Alsanafi M, Salim NA, Abazid H, Malaeb D, Mohammed AH, Hassan B, Wayyes AM, Farhan SS, Khatib SE, Rahal M, Sahban A, Abdelaziz DH, Mansour NO, Alzayer R, Khalil R, Fekih-Romdhane F, Hallit R, Hallit S, Sallam M. A multinational study on the factors influencing university students’ attitudes and usage of ChatGPT. Sci Rep-Uk. 2024;14:1983. https://doi.org/10.1038/s41598-024-52549-8.

  2. Aboraya A, Rankin E, France C, El-Missiry A, John C. The Reliability of psychiatric diagnosis revisited: the clinician’s guide to improve the reliability of psychiatric diagnosis. Psychiatry (Edgmont). 2006;3:41–50.

  3. Ajlouni A, Wahba F, Almahaireh A. Students’ attitudes towards using ChatGPT as a learning tool: the case of the University of Jordan. Int J Interact Mobile Technol. 2023;18:99–117.

  4. Al Naqbi H, Bahroun Z, Ahmed V. Enhancing work productivity through generative artificial intelligence: a comprehensive literature review. Sustainability-Basel. 2024;16:1166.

  5. Alfadda HA, Mahdi HS. Measuring Students’ Use of Zoom Application in Language Course Based on the Technology Acceptance Model (TAM). J Psycholinguist Res. 2021;50:883–900. https://doi.org/10.1007/s10936-020-09752-1.

  6. Alkaissi H, Mcfarlane SI. Artificial Hallucinations in ChatGPT: Implications in Scientific Writing. Cureus J Med Science. 2023;15:e35179. https://doi.org/10.7759/cureus.35179.

  7. Apputhurai P, Palsson OS, Bangdiwala SI, Sperber AD, Mikocka-Walus A, Knowles SR. Confirmatory validation of the patient health questionnaire - 4 (PHQ-4) for gastrointestinal disorders: a large-scale cross-sectional survey. J Psychosom Res. 2024;180:111654. https://doi.org/10.1016/j.jpsychores.2024.111654.

  8. Blease C, Torous J. ChatGPT and mental healthcare: balancing benefits with risks of harms. Bmj Ment Health. 2023;26, https://doi.org/10.1136/bmjment-2023-300884.

  9. Brehaut E, Neupane D, Levis B, Wu Y, Sun Y, Ioannidis J, Markham S, Cuijpers P, Patten SB, Benedetti A, Thombs BD. “Optimal” cutoff selection in studies of depression screening tool accuracy using the PHQ-9, EPDS, or HADS-D: a meta-research study. Int J Meth Psych Res. 2023;32:e1956. https://doi.org/10.1002/mpr.1956.

  10. Brown T, Mann B, Ryder N, Subbiah M, Kaplan JD, Dhariwal P, Neelakantan A, Shyam P, Sastry G, Askell A. Language models are few-shot learners. Adv Neural Inf Process Syst. 2020;33:1877–901.

  11. Coraci D, Maccarone MC, Regazzo G, Accordi G, Papathanasiou JV, Masiero S. ChatGPT in the development of medical questionnaires. The example of the low back pain. Eur J Transl Myol. 2023;33, https://doi.org/10.4081/ejtm.2023.12114.

  12. Dave T, Athaluri SA, Singh S. ChatGPT in medicine: an overview of its applications, advantages, limitations, future prospects, and ethical considerations. Front Artif Intell. 2023;6:1169595. https://doi.org/10.3389/frai.2023.1169595.

  13. Dergaa I, Chamari K, Zmijewski P, Ben SH. From human writing to artificial intelligence generated text: examining the prospects and potential threats of ChatGPT in academic writing. Biol Sport. 2023;40:615–22. https://doi.org/10.5114/biolsport.2023.125623.

  14. Dessauvagie AS, Dang HM, Nguyen T, Groen G. Mental health of university students in southeastern asia: a systematic review. Asia-Pac J Public He. 2022;34:172–81. https://doi.org/10.1177/10105395211055545.

    Article  Google Scholar 

  15. Elyoseph Z, Levkovich I. Beyond human expertise: the promise and limitations of ChatGPT in suicide risk assessment. Front Psychiatry. 2023;14:1213141. https://doi.org/10.3389/fpsyt.2023.1213141.

    Article  PubMed  PubMed Central  Google Scholar 

  16. Elyoseph Z, Levkovich I, Shinan-Altman S. Assessing prognosis in depression: comparing perspectives of AI models, mental health professionals and the general public. Fam Med Community He.2024; 12, https://doi.org/10.1136/fmch-2023-002583.

  17. Fonseca-Pedrero E, Diez-Gomez A, Perez-Albeniz A, Al-Halabi S, Lucas-Molina B, Debbane M. Youth screening depression: Validation of the Patient Health Questionnaire-9 (PHQ-9) in a representative sample of adolescents. Psychiat Res. 2023;328:115486. https://doi.org/10.1016/j.psychres.2023.115486.

    Article  Google Scholar 

  18. Ghrouz AK, Noohu MM, Dilshad MM, Warren SD, Bahammam AS, Pandi-Perumal SR. Physical activity and sleep quality in relation to mental health among college students. Sleep Breath. 2019;23:627–34. https://doi.org/10.1007/s11325-019-01780-z.

    Article  PubMed  Google Scholar 

  19. Giavarina D. Understanding Bland Altman analysis. Biochem Med. 2015;25:141–51. https://doi.org/10.11613/BM.2015.015.

    Article  Google Scholar 

  20. Gilson A, Safranek CW, Huang T, Socrates V, Chi L, Taylor RA, Chartash D. How Does ChatGPT Perform on the United States Medical Licensing Examination (USMLE)? The Implications of Large Language Models for Medical Education and Knowledge Assessment. Jmir Med Educ. 2023;9:e45312. https://doi.org/10.2196/45312.

    Article  PubMed  PubMed Central  Google Scholar 

  21. Hoch CC, Wollenberg B, Luers JC, Knoedler S, Knoedler L, Frank K, Cotofana S, Alfertshofer M. ChatGPT’s quiz skills in different otolaryngology subspecialties: an analysis of 2576 single-choice and multiple-choice board certification preparation questions. Eur Arch Oto-Rhino-L. 2023;280:4271–8. https://doi.org/10.1007/s00405-023-08051-4.

    Article  Google Scholar 

  22. Hu K. ChatGPT sets record for fastest-growing user base—analyst note.2023. https://www.reuters.com/technology/chatgpt-sets-record-fastest-growing-user-base-analyst-note-2023-02-01/.

  23. Huh S. Are ChatGPT’s knowledge and interpretation ability comparable to those of medical students in Korea for taking a parasitology examination?: a descriptive study. J Educ Eval Health Prof. 2023;20:1. https://doi.org/10.3352/jeehp.2023.20.1.

    Article  PubMed  PubMed Central  Google Scholar 

  24. Humar P, Asaad M, Bengur FB, Nguyen V. ChatGPT Is Equivalent to First-Year Plastic Surgery Residents: Evaluation of ChatGPT on the Plastic Surgery In-Service Examination. Aesthet Surg J. 2023;2023(43):NP1085–9. https://doi.org/10.1093/asj/sjad130.

    Article  Google Scholar 

  25. Jeblick K, Schachtner B, Dexl J, Mittermeier A, Stuber AT, Topalis J, Weber T, Wesp P, Sabel BO, Ricke J, Ingrisch M. ChatGPT makes medicine easy to swallow: an exploratory case study on simplified radiology reports. Eur Radiol. 2023. https://doi.org/10.1007/s00330-023-10213-1.

    Article  PubMed  PubMed Central  Google Scholar 

  26. Koo TK, Li MY. A Guideline of selecting and reporting intraclass correlation coefficients for reliability research. J Chiropr Med. 2016;15:155–63. https://doi.org/10.1016/j.jcm.2016.02.012.

    Article  PubMed  PubMed Central  Google Scholar 

  27. Kroenke K, Spitzer RL, Williams JB. The PHQ-9: validity of a brief depression severity measure. J Gen Intern Med. 2001;16:606–13. https://doi.org/10.1046/j.1525-1497.2001.016009606.x.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  28. Lee P, Bubeck S, Petro J. Benefits, Limits, and Risks of GPT-4 as an AI Chatbot for Medicine. New Engl J Med. 2023;388:1233–9. https://doi.org/10.1056/NEJMsr2214184.

    Article  PubMed  Google Scholar 

  29. Levkovich I, Elyoseph Z. Identifying depression and its determinants upon initiating treatment: ChatGPT versus primary care physicians. Fam Med Community He. 2023;11, https://doi.org/10.1136/fmch-2023-002391.

  30. Li L, Peng W, Rheu M. Factors Predicting Intentions of Adoption and Continued Use of Artificial Intelligence Chatbots for Mental Health: Examining the Role of UTAUT Model, Stigma, Privacy Concerns, and Artificial Intelligence Hesitancy. Telemed E-Health. 2024;30:722–30. https://doi.org/10.1089/tmj.2023.0313.

    Article  Google Scholar 

  31. Li W, Zhao Z, Chen D, Peng Y, Lu Z. Prevalence and associated factors of depression and anxiety symptoms among college students: a systematic review and meta-analysis. J Child Psychol Psyc. 2022;63:1222–30. https://doi.org/10.1111/jcpp.13606.

    Article  Google Scholar 

  32. Lin KC, Chen TA, Lin MH, Chen YC, Chen TJ. Integration and Assessment of ChatGPT in Medical Case Reporting: a multifaceted approach. Eur J Invest Health. 2024;14:888–901. https://doi.org/10.3390/ejihpe14040057.

    Article  Google Scholar 

  33. Openai, ChatGPT: Optimizing language models for Dialogue. 2024. https://openai.com/blog/chatgpt.

  34. Phelan E, Williams B, Meeker K, Bonn K, Frederick J, Logerfo J, Snowden M. A study of the diagnostic accuracy of the PHQ-9 in primary care elderly. Bmc Fam Pract. 2010;11:63. https://doi.org/10.1186/1471-2296-11-63.

    Article  PubMed  PubMed Central  Google Scholar 

  35. Spitzer RL, Kroenke K, Williams JB. Validation and utility of a self-report version of PRIME-MD: the PHQ primary care study. Primary Care Evaluation of Mental Disorders. Patient Health Questionnaire. Jama-J Am Med Assoc. 1999;282:1737–44. https://doi.org/10.1001/jama.282.18.1737.

    Article  CAS  Google Scholar 

  36. Strzelecki A, Artur SUKP, Https OO, Information VFA, To use or not to use ChatGPT in higher education? A study of students' acceptance and use of technology.

  37. Tang H, Ding LL, Song XL, Huang ZW, Qi Q, He LP, Yao YS. Meta-analysis of detection rate of depressed mood among Chinese college students from 2002 to 2011. J Jilin Univ. 2013;39:965–9.

    Google Scholar 

  38. Thormundson B. Usage of ChatGPT by demographic 2023, by age and gender. 2023. https://www.statista.com/statistics/1384324/chat-gpt-demographic-usage/.

  39. Wang Y, Liang L, Sun Z, Liu R, Wei Y, Qi S, Ke Q, Wang F. Factor structure of the patient health questionnaire-9 and measurement invariance across gender and age among Chinese university students. Medicine. 2023;102:e32590. https://doi.org/10.1097/MD.0000000000032590.

    Article  PubMed  PubMed Central  Google Scholar 

  40. Woldetensay YK, Belachew T, Tesfaye M, Spielman K, Biesalski HK, Kantelhardt EJ, Scherbaum V. Validation of the Patient Health Questionnaire (PHQ-9) as a screening tool for depression in pregnant women: Afaan Oromo version. PLoS ONE. 2018;13:e191782. https://doi.org/10.1371/journal.pone.0191782.

    Article  CAS  Google Scholar 

  41. Yeo YH, Samaan JS, Ng WH, Ting PS, Trivedi H, Vipani A, Ayoub W, Yang JD, Liran O, Spiegel B, Kuo A. Assessing the performance of ChatGPT in answering questions regarding cirrhosis and hepatocellular carcinoma. Clin Mol Hepatol. 2023;29:721–32. https://doi.org/10.3350/cmh.2023.0089.

    Article  PubMed  PubMed Central  Google Scholar 

  42. Yiqiu HU, Zhenghua L. An Intervention Study of Psychological Health of Depressed College Students:the Different Effects of Different Types of School Support. J Educ Sci Hunan Norm Univ. 2023

  43. Zhang T, Schoene AM, Ji S, Ananiadou S. Natural language processing applied to mental illness detection: a narrative review. Npj Digit Med. 2022;5:46. https://doi.org/10.1038/s41746-022-00589-7.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  44. Zou Z, Mubin O, Alnajjar F, Ali L. A pilot study of measuring emotional response and perception of LLM-generated questionnaire and human-generated questionnaires. Sci Rep-Uk. 2024;14:2781. https://doi.org/10.1038/s41598-024-53255-1.

    Article  CAS  Google Scholar 

Download references

Acknowledgements

Not applicable.

Funding

This study was supported by the Traditional Chinese Medicine Innovation and Development Joint Fund [Grant Number: 2023AFD160].


Author information


Contributions

Conceptualization and methodological guidance: Fen Yang. Data collection: Jiali Liu, Juan Gu, MengJie Tong and YaKe Yue. Data analysis: Jiali Liu. Writing guidance: Fen Yang, Lijuan Zeng, Yufei Qiu, Yiqing Yu and Shuyan Zhao.

Corresponding authors

Correspondence to Fen Yang or Shuyan Zhao.

Ethics declarations

Ethics approval and consent to participate

Our study did not require further ethics committee approval because it did not involve animal experiments or human clinical trials. In accordance with the ethical principles outlined in the Declaration of Helsinki, all participants provided informed consent before taking part in the study. Participant anonymity and confidentiality were guaranteed, and participation was completely voluntary.

Consent for publication

Not Applicable.

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.


About this article


Cite this article

Liu, J., Gu, J., Tong, M. et al. Evaluating the agreement between ChatGPT-4 and validated questionnaires in screening for anxiety and depression in college students: a cross-sectional study. BMC Psychiatry 25, 359 (2025). https://doi.org/10.1186/s12888-025-06798-0
