Introduction

Major depressive disorder (MDD) is a prevalent and debilitating condition characterized by persistent feelings of sadness, hopelessness, and loss of interest or pleasure in daily activities [1]. An estimated 322 million individuals worldwide meet criteria for MDD, representing a substantial public health challenge impacting quality-of-life, productivity, and overall well-being [2]. While various pharmacological, psychotherapeutic, and neuromodulatory treatments are available, 30–40% of individuals with MDD experience treatment-resistant depression (TRD), defined as non-response to two or more trials of first-line interventions [3]. Because MDD is a highly heterogeneous disorder, with 277 possible symptom combinations meeting DSM-5 criteria [4], it is possible that different subtypes or symptom clusters of MDD may respond differentially to various treatments [5]. Therefore, aligning optimal treatment strategies to symptom presentations may be valuable for addressing TRD. This first requires understanding the mechanisms and predictors of treatment response in MDD, as this knowledge can inform the development of more personalized treatments that are optimal for different individuals based on their unique symptom profiles.

Repetitive transcranial magnetic stimulation (rTMS) is an established clinical intervention for the treatment of TRD, with response rates of approximately 40–60%, providing a valuable treatment option for patients not responding to first line psychotherapy and pharmacotherapy [6, 7]. rTMS involves the application of repetitive magnetic pulses to the cortex, with the dorsolateral prefrontal cortex (DLPFC) a common target for MDD due to its role in mood regulation, cognitive control, executive functioning, and emotion processing [8]. The DLPFC is also implicated in the pathophysiology of depression due to its connections with the limbic system, while acute elevations in mood following rTMS provide further evidence of DLPFC dysfunction in MDD [9,10,11,12]. Ultimately, rTMS is thought to exert its therapeutic effects through modulation of cortical excitability, neurotransmitter systems, and neuroplastic mechanisms [13], but understanding the predictors of treatment response is crucial for tailoring treatment strategies and enhancing clinical outcomes in individuals with TRD.

In recent years, advances in machine learning (ML) have provided new opportunities for leveraging large-scale datasets, such as electronic medical records (EMR), to improve healthcare delivery and inform clinical decision-making [14]. For instance, ML prediction models can be used as structured tools to more accurately stratify patients into subgroups or individualized treatment pathways [15]. In psychiatry, particularly in TRD and rTMS, ML can reveal latent multifactorial relationships that conventional methods might overlook. Recent studies utilizing large clinical registries have demonstrated the effectiveness of predictive and inferential models of treatment response and remission, highlighting factors such as the influence of treatment protocols, patient demographics, and psychiatric comorbidities on outcomes [16, 17]. Despite the rapid growth of ML in psychiatric research, the generalizability and clinical utility of many ML models published to date remain poor [18]. Motivated by the potential of ML to enhance understanding of rTMS treatment for patients with TRD, this study performed a comprehensive retrospective analysis to investigate demographic, clinical, and treatment-related factors associated with treatment response and remission.

This retrospective approach offers several strengths, including a moderately sized sample, the ability to capture real-world clinical practices, and identify long-term treatment outcomes [19] without the resource demands of a clinical trial. Specifically, this study explores 1) demographic and clinical characteristics of TRD patients receiving rTMS treatment, 2) various protocols and parameters utilized in rTMS administration, and 3) pre-treatment and treatment-related factors predictive of treatment response or remission. This paper aims to address the following questions: whether ML can predict response or remission following rTMS based on pre-treatment patient features and treatment-related factors better than chance; which features distinguish individuals responding favorably to rTMS from those who do not; and whether the ML models may offer clinical benefit. In generating ML models to predict rTMS treatment response and identifying the features that drive both positive and negative outcomes, this study aims to advance understanding of TRD, optimize treatment strategies, and ultimately improve patient outcomes.

Materials and methods

Design

A retrospective study was conducted, including EMR (Epic, Verona, WI, USA) of TRD patients, collected from the clinical database of the UC San Diego Interventional Psychiatry Program (UCSD-IPP). All contributing procedures comply with the ethical standards of the relevant national and international committees on human subjects’ research and the Helsinki Declaration and were approved by the UCSD Human Research Protections Program.

Patients

Included patients were prescribed rTMS treatment from UCSD psychiatrists credentialed in delivering rTMS. rTMS was initiated in outpatients who met criteria for TRD, did not present with any treatment contraindications, had adequate insurance coverage, and provided consent. As illustrated in Fig. 1, EMR records for 419 patients seeking rTMS in 2017–2023 at UCSD were screened for the following inclusion criteria: (i) primary diagnosis of MDD; (ii) aged 18 years and older; (iii) completion of at least 15 sessions by the date of final PHQ-9; (iv) clinically meaningful depressive symptoms, as indicated by PHQ-9 total score >4; (v) failure to respond to at least 2 antidepressant pharmacotherapy trials; and (vi) completion of clinical evaluations prior to and following the acute treatment phase.

Fig. 1
figure 1

Flowchart of patient inclusion.

A total of 246 patients began a rTMS treatment course and 239 completed at least 15 sessions before their final PHQ-9 assessment. Two participants had baseline PHQ-9 scores <5, indicating no clinically significant depressive symptoms, and were excluded from analyses. Five patients were excluded for not having received an MDD diagnosis, yielding a final analytical sample of 232 MDD patients. All patients had previously failed to respond to at least two antidepressant treatments and 90% reported having an MDD diagnosis >10 years.

Treatment procedure

All rTMS treatment was performed in the UCSD-IPP clinic by technicians trained in protocol administration with oversight by an attending psychiatrist. Overseeing psychiatrists reviewed the treatment plans after each session. rTMS was delivered using either a Magventure MagPro with a B70 coil or a BrainsWay H1-coil system. For all patients, the resting motor threshold (rMT) was identified by finding the motor hotspot and utilizing the lowest level of stimulation intensity to elicit movement in the opposite abductor pollicis brevis (APB) muscle on 50% of stimulation attempts. Once rMT was obtained, treatment session parameters were determined. For Magventure treatments, the coil position was localized using the Beam F3 method, and treatment was delivered per standard practice [20]. Patients who were initiated on the iTBS protocol (at 120% rMT), involving 600 total pulses in bursts of 3 with 8-s inter-train-intervals, were transitioned to either BiTBS (90% rMT; 3600 total pulses, 1800 delivered via 3-iTBS and 1800 delivered cTBS 120% rMT protocol to the left and right hemispheres, respectively) or 3-iTBS (iTBS delivered three times consecutively, per Li et al. [21] if there was deemed poor efficacy or issues with tolerability). High-frequency left (HFL; 19 min) and low-frequency right (LFR; 10 min) protocols were used less commonly, both at 120% rMT. The obsessive-compulsive disorder (OCD) protocol uses a B70 coil and delivers 2000 pulses per session via 50, 2-s trains to dorsomedial prefrontal cortex (DMPFC), at 100% rMT of the foot with the Magventure B70 coil. For the BrainsWay H1-coil system either iTBS or the standard FDA-approved 18 Hz protocol was used at 120% rMT. The end of each patient’s acute treatment period was clinician-determined and coincided with reduced frequency of rTMS treatment to <2 sessions per week. Sequencing of TMS protocols across all patients included in the analysis is depicted in Supplementary Fig. 1.

Clinical assessment

The PHQ-9 is a nine-item self-report scale used to assess the presence and severity of depressive symptoms based on the Diagnostic and Statistical Manual (DSM) criteria for a Major Depressive Episode (MDE). Questions are on a four-point scale related to the frequency of symptoms presented on each item, including “not at all,” “several days,” “more than half the days,” and “nearly every day.” Total scores are calculated by adding scores to all nine items (total range: 0–27), and previously established cutoff scores for severity are: 0–4 (no symptoms), 5–9 (mild symptoms), 10–14 (moderate symptoms), 15–19 (moderately severe symptoms), and 20–27 (severe symptoms). The PHQ-9 was administered pre-treatment and at rTMS visits throughout the acute treatment phase to monitor changes in depressive symptoms. Only pre-treatment and post-acute treatment phase scores were used in the present analyses.

Predictive features

Pre-treatment features

A total of 32 EMR-derived features were extracted from patient records, including: age, gender, ethnicity, religious affiliation, educational attainment, employment status, relationship status, body mass index (BMI) category (obesity: ≥ 30 kg/m2; overweight: 25 ≤ kg/m2 < 30; normal 18.5 < kg/m2 < 25), alcohol, tobacco, and cannabis use (each operationalized as 3-level factor: current, former, or never), family history of mood disorders, 1st degree relatives with mood or psychiatric disorder diagnoses, prior ECT treatment, prior rTMS treatment, history of psychological trauma, non-suicidal self-injury (NSSI), history of suicidal ideation (SI) or suicide attempts, psychiatric hospitalizations, and duration of current MDE. Psychiatric comorbidities were also recorded, including anxiety disorders, trauma-related disorders, substance use disorders, OCD, somatic symptom and related disorders, personality disorders, autism spectrum disorders, eating disorders, and neurocognitive disorders. Comorbid diagnoses were extracted from physician notes. Psychiatric medication use, including benzodiazepines, non-benzodiazepine sedatives, anxiolytics, antipsychotics, and psychostimulants, was recorded, and antidepressant medication burden was translated into daily defined dose (DDD) equivalents. A detailed description of the coding for each drug category can be found in the Supplement. Certain continuous variables were categorized to align with clinically relevant thresholds, enhancing interpretability in the context of TRD treatment. Age was grouped into age brackets (<45, 45–64, ≥65 years), PHQ-9 scores were categorized by established severity levels, BMI was grouped by WHO classifications (normal, overweight, obese), and MDE duration was categorized as <6 months, 6–12 months, and >12 months to capture differences in chronicity. While this approach may reduce statistical power, it provides a meaningful, clinically interpretable framework for evaluating treatment outcomes. Features were extracted by manual search through individual patient’s EMR. Variables were primarily collected via initial treatment consultation note completed by the prescribing psychiatrist. In the case where specific data points were not immediately available in the consultation note, searches were conducted to identify the nearest timepoint in which that information was available.

Treatment-related features

To determine the role of treatment-related factors in depression outcomes, the following five parameters were recorded: protocol type (deep TMS, iTBS, bilateral TBS, or others that included HFL, LFR, 3-iTBS, or OCD protocols); number of total rTMS sessions; cumulative number of pulses received throughout the treatment course; and, whether patients switched rTMS protocols during treatment.

Statistical analysis

R statistical software (v.4.3.1) was used for data analysis. Independent sample t-tests and chi-square tests were used to compare demographic and clinical characteristics between rTMS non-responders and responders, defined as patients with ≥50% decrease in post-treatment PHQ-9 scores from pre-treatment scores. Pairwise post-hoc tests for factors with more than two levels were false discovery rate (FDR)-corrected. To mitigate low variance within categorical variables, variables were combined and/or operationalized as follows: rTMS protocols: 18 Hz deep TMS (dTMS), BiTBS, or iTBS, made up the majority (86.6%) of rTMS protocols in the sample; therefore, other protocols, including 3-iTBS, 3-iTBS/OCD, OCD, and HFL were combined into a single factor level (“Other”). Race/ethnicity: patients were coded as non-Hispanic, white, or “Other”. Psychiatric comorbidities: patients were coded with 1 if they had at least one comorbid diagnosis for each individual comorbidity of interest. Duration of current MDE: patients were coded as <6 months, 6–12 months, or >12 months in duration. Patients were binarized indicating completion of a 4-year degree or not. Relationship status was coded as: currently single/unpartnered, partnered/married, or separated/divorced/widowed. Non- antidepressant psychiatric medication use was coded 1 or 0 for each drug category.

To identify pre-treatment patient characteristics and treatment-associated features that best discriminate response and remission groups, a series of classification models were trained using the caret package: (i) regularized logistic regression (elastic net regression; ENet); (ii) support vector machine (SVM) with radial basis function; (iii) random forest (RF); and (iv) stochastic gradient boosting machines (GBM). All features were filtered for near-zero-variance (<5%) and linear combinations. Nested cross-validation was used for hyperparameter tuning and model performance evaluation using nestedcv with 10 inner folds and 20 outer folds. RF parameters optimized in the nested grid search included mtry (number of variables to split), splitrule (quality of each node split), and min.node.size (minimum number of samples in a node), and GBM parameters included interaction.depth (highest interaction per tree), n.minobsinnode (minimum number of samples in a node to split), n.trees (total number of trees), and shrinkage (learning rate), whereas the default tuning grids in caret were applied for ENet and SVM. To mitigate class imbalance in the dataset, Synthetic Minority Over-sampling TEchnique (SMOTE) was applied to equalize the class ratio (1:1) [22]. Tree-based classification models, such as gradient boosted trees and random forest, detect interactions by optimizing decision tree paths in which the influence of a given variable is conditioned on the values of preceding variables in the path. It is possible to visualize these relationships from the global model via SHAP interaction values.

Feature selection via a univariate filter was performed on each outer fold prior to hyperparameter tuning, and a Relief-based algorithm (ReliefFbestK in CORElearn package; k = 10) was applied in parallel for comparison with the univariate filter. Generally, performance was superior for models trained after univariate filtering (Supplementary Fig. 2), with 10 features selected on average (range: 7–13) for all models. Optimization criteria for each model’s hyperparameter tuning procedure were receiver operating characteristic area under the curve (ROC AUC). Continuous features were standardized, and categorical features were one-hot encoded, with the first level dropped as reference. Missing variables were treated as missing completely at random and nonparametric random forest-based imputation was performed using missForest.

Generalization performance and 95% confidence intervals were generated by averaging AUC across the best tune of all outer folds. Predicted probabilities across all outer folds were merged to generate additional performance metrics (F-score, sensitivity, specificity) and models with the best performance for an outcome were selected for downstream analysis. Significance of AUC values were determined by permutation testing (B = 99) of class labels, re-running the procedure for each model, and calculating the proportion of permuted models with AUC values greater than the observed data. Alpha level was set at 0.05. The contributions of predictive features were determined using SHapley Additive exPlanations (SHAP) analysis in the fastshap R package, and directionality of associations between features and outcome variables was ascertained by visualizing individual SHAP values and their dependencies with the shapviz package in R. SHAP values explain the output of machine learning models by quantifying the contribution of each feature to a specific prediction, helping to identify which factors positively or negatively influence the model’s outcome.

Using the best performing models, the predicted probabilities of treatment response and remission were derived. Model calibration was assessed with the Hosmer-Lemeshow (HL) goodness-of-fit test by binning predicted probabilities into deciles. Probability estimates were updated using beta regression, which has been shown to outperform other calibration methods [23], and recalibrated probabilities were reassessed using HL test. Decision curve analysis (DCA) was applied using dcurves to assess the clinical utility of prediction models through estimation of net benefit across thresholds of predicted risks using recalibrated probabilities from both “response” and “remission” classification models. Net benefit is equivalent to the percentage of individuals appropriately treated (“true positives”, responders and remitters) minus a weighted percentage of those inappropriately treated (“false positives”).

Results

Patient and treatment-related characteristics

Sociodemographic and clinical characteristics of patients, and treatment-related characteristics included in the analysis are outlined in Table 1. Across all patients, post-treatment PHQ-9 scores were significantly reduced compared to pre-treatment values (mean difference = −7.56, t231 = −17.6, d = 1.16, 95% CI = 0.99–1.32). Eighteen patients had baseline PHQ-9 scores < 10, indicative of minimal depressive symptoms, and response rates within this subgroup were lower than patients with PHQ-9 ≥ 10 at baseline (38.9 vs. 43.9%) but did not significantly differ (X2 = 0.17, p = 0.68). In total, 101 patients (44%) responded to rTMS, (mean difference = −13.0, t100 = −29.4, d = 2.94, 95% CI = 2.48–3.39; Fig. 2) and 53 patients (23%) remitted. The average patient age was 55.1 ± 16.9 and 54.1 ± 17.0 years in the responder and non-responder groups, respectively. Patient sex and ethnicity did not differ between groups; however, responders were more likely to have been former smokers than non-responders, and less likely to have never used alcohol. Most patients were naïve to rTMS (88.4%) and ECT (83.3%) treatment, and the majority (60%) had MDEs lasting longer than 12 months at treatment outset. Response and remission rates for patients in this study with a history of rTMS were 48.1 and 18.5%, respectively. Relative to patients without a history of rTMS, there were no differences with response to response (X2 = 0.09, df = 1, p = 0.76) or remission (X2 = 0.11, df = 1, p = 0.74). Baseline PHQ-9 scores did not differ between responders (17.9 ± 5.4) and non-responders (17.6 ± 5.3). Responders were more likely to have a history of trauma and to be diagnosed with trauma disorder compared to non-responders, but were less likely to have an anxiety disorder or be prescribed benzodiazepines or antipsychotic medications. The ratio of patients receiving sequential iTBS-BiTBS to iTBS-only treatment was higher among non-responders than responders (padj = 0.04), and non-responders received fewer rTMS sessions on average compared to responders.

Table 1 Sociodemographic and clinical characteristics of patient sample.
Fig. 2: Depression scores pre- and post-rTMS treatment course.
figure 2

Each point represents an individual patient’s PHQ-9 score at baseline and the conclusion of their acute treatment phase, grouped by response status. Horizontal dashed line at PHQ-9 = 5 indicates the threshold for depression remission. Each point represents an individual patient’s PHQ-9 score at baseline and the conclusion of their acute treatment phase, grouped by response status. Horizontal dashed line at PHQ-9 = 5 indicates the threshold for depression remission.

Classification results

Treatment response

Across the four models tested, the SVM model predicted treatment response significantly above chance (p < 0.01; Supplementary Fig. 3) with modest discrimination performance (mean AUC = 0.689, 95% CI = 0.638–0.740), outperforming RF (0.654, 0.594–0.715), ENet (0.668, 0.616–0.719), and GBM (0.617, 0.563–0.671) models (Supplementary Fig. 2). Thus, the SVM model was evaluated further. The sensitivity and specificity of the model to discriminate treatment responders from non-responders were 72.3 and 53.4%, respectively, with an F-score of 0.64. Misclassification rates varied across rTMS protocols: BiTBS had a misclassification rate of 27.8%, iTBS 47.8%, dTMS 41.7%, and Other protocols 41.9%. Following recalibration, observed and predicted probabilities of treatment response were similar (HL test: \({\chi }_{8}^{2}\) = 5.51, p = 0.70, RMSE = 1.73; Supplementary Fig. 4). In descending order, the most important variables in prediction of treatment response were anxiety disorder, former tobacco use, obesity, trauma disorder, antipsychotic use, benzodiazepine use, iTBS-BiTBS protocol, >12-month duration of current MDE, history of trauma, number of rTMS sessions, and iTBS protocol (Fig. 3A). Partial dependence plots illustrating the relationship between the top-3 features, the total number of rTMS sessions, and treatment response are shown in Fig. 3B, and the expected response probability for a representative patient is shown in Supplementary Fig. 5). DCA indicated that for threshold probabilities between 25 and 55%, corresponding to a number needed-to-treat (NNT) between 4 and 2 respectively, the net benefit of applying the model to aid in determining whether a patient should undergo rTMS is higher than either a “treat all” or “treat none” approach (Fig. 4A).

Fig. 3: Feature Importance and Partial Dependence Plots for Treatment Response Prediction.
figure 3

A Ranked feature importance for treatment response classification model. Grey bars indicate negative association with treatment response (worse outcome), and black bars indicate positive association (better outcome). B Partial dependence plots illustrating directional associations between selected features and treatment response. Larger positive SHAP values indicate greater probabilities of treatment response, whereas more negative values indicate lower probabilities of response.

Fig. 4: Decision Curve Analysis for Treatment Response and Depression Remission Models.
figure 4

Decision curve analysis for recalibrated (A) treatment response and (B) depression remission models. Threshold probability represents the point at which a positive response is equally valued to avoiding unnecessary treatment. Decision curves for each model are shown in blue, compared to a “treat all” approach shown in red and a “treat none” approach shown in green.

Depression remission

Across the four models tested, the GBM model predicted remission following rTMS treatment significantly above chance (p < 0.01; Supplementary Fig. 3) with good discrimination performance (mean AUC = 0.745, 95% CI = 0.692–0.797), outperforming RF (0.735, 0.680–0.791), ENet (0.740, 0.668–0.813), and SVM (0.683, 0.627–0.738) models (Supplementary Fig. 2). Thus, the GBM model was evaluated further. The sensitivity and specificity of the model to discriminate remitters from non-remitters were 58.5 and 77.7%, respectively, with an F-score of 0.50. Misclassification rates for remission also differed by rTMS protocol: BiTBS had a rate of 33.3%, iTBS 30.4%, dTMS 23.8%, and Other protocols 29.0%. Following recalibration, the observed and predicted probabilities of treatment response remained dissimilar (HL test: \({\chi }_{8}^{2}\) = 19.1, p = 0.01, RMSE = 2.73; Supplementary Fig. 4). In descending order, the most important variables in the prediction of depression remission were pre-treatment PHQ-9 score, obesity, antipsychotic use, current MDE greater than 12 months, benzodiazepine use, anxiety, prior ECT, and iTBS-BiTBS protocol (Fig. 5A). Partial dependence plots illustrating the relationship between the top-four features and treatment response are shown in Fig. 5B. DCA indicated that for threshold probabilities between 15 and 40%, corresponding to a NNT between approximately 7 and 3, respectively, using the model to determine whether a patient should undergo rTMS yielded greater net benefit than employing either a “treat all” or “treat none” strategy (Fig. 4).

Fig. 5: Feature Importance and Partial Dependence Plots for Depression Remission Prediction.
figure 5

A Ranked feature importance for depression remission classification model. Grey bars indicate negative association with remission (worse outcome). B Partial dependence plots illustrating directional associations between selected features and depression remission, indicated by SHAP values.

Discussion

Findings presented in this study underscore the substantial variability in depression responses and remission, despite rTMS efficacy, and the promise of ML to predict treatment outcomes using EMR data alone. Importantly, both ML models for response and remission identified psychiatric comorbidities, depressive episode duration, and specific rTMS protocols as predictive features, with trauma history also associated with response outcomes.

Predictive features of treatment outcomes

Past research investigating the relationship between psychiatric comorbidities and treatment response has yielded valuable insights into the complex interplay between mental health conditions and their respective treatments [24]. This literature presents mixed results regarding impact of trauma history on rTMS. Experiences of childhood trauma and PTSD diagnosis have been reported to negatively affect rTMS outcomes in MDD patients [25, 26] or to have no significant impact [27]. Positive findings, as presented here, may point to a consequence of trauma-induced neuroplastic changes that sensitize the brain to neurostimulation [28]. For example, the therapeutic impact of rTMS could be modulated by individual differences in neural circuitry affected by trauma, which are not always uniformly engaged by standard rTMS protocols, such as 10 Hz rTMS to the left DLPFC as done in Hu et al. [26] and Yesavage et al. [25].

Former tobacco use was identified as a positive predictor of treatment response. Past literature on tobacco use in rTMS outcomes is mixed [29, 30], but nicotine’s complex neurobiological effects may enhance efficacy. However, the lack of association with current tobacco use suggests that other unaccounted factors may be influencing this finding, or it may be a statistical anomaly. Further investigation is needed to determine whether former tobacco use truly enhances rTMS response, if it serves as a proxy for another predictor, or if it is not directly related to treatment outcomes. Given the current findings, and coupled with trauma history, these associations suggest past experiences and lifestyle modifications may prime the brain’s receptivity to rTMS.

Features including comorbid anxiety disorder, concomitant benzodiazepine or antipsychotic medications, and a longer duration of current MDE were linked to lower probabilities of treatment response and remission. Anxiety is highly comorbid with depression, and this specific comorbidity may reflect distinct neurobiological underpinnings, affecting the response to rTMS [31, 32]. For instance, hyperactivity in the amygdala and its connectivity with the PFC, can be more pronounced in individuals with comorbid anxiety [33]. An extensive body of research supports that comorbid anxiety can complicate the course of depression, often leading to a more chronic and treatment-resistant condition [34].

Concurrent use of benzodiazepines or antipsychotics was associated with less favorable responses. These medications exert sedative effects and reduce cortical excitability, potentially impeding operating mechanisms of rTMS. Benzodiazepines, for instance, enhance GABAergic inhibition, which may diminish excitatory processes [33]. Antipsychotics, particularly those blocking dopamine receptors, have been shown to attenuate rTMS outcomes [34], potentially disrupting the dopaminergic modulation crucial for rTMS-induced neuroplasticity. These findings suggest patients prescribed benzodiazepines or antipsychotics may require adjusted protocols, perhaps with increased intensity or frequency to overcome medication-induced reductions in cortical excitability, or timing treatment sessions relative to medication dosing.

The observation that longer depressive episodes correlated with poorer outcomes is consistent research indicating depression chronicity can negatively impact the efficacy of various treatments. Chronic depression is associated with more severe neurobiological changes, including alterations in the hippocampus and PFC [35]. Earlier intervention might enhance response by addressing such changes before they become entrenched. Fava (2003) highlighted the predictive role of untreated depression duration in treatment resistance [36] and underscoring the possibility that adjusting standard protocols, or combining with other interventions, may increase efficacy for these patients.

The observed absence of sociodemographic variables, such as age, sex, and ethnicity, impacting outcomes aligns with past studies [37, 38], yet diverges from others. For example, Hanlon and colleagues (2022) found younger age and female sex were associated with better rTMS response rates, suggesting demographic characteristics may influence treatment success [39]. Slan and colleagues (2024) recently found iTBS was more effective for males, while 10 Hz favored females, and rTMS overall being most effective in patients over 50, particularly females [40]. The contrasting evidence indicates a complex relationship between sociodemographic and rTMS efficacy, warranting further investigation to clarify their roles in treatment personalization and optimization.

Accrued effects from patient history, such as trauma and anxiety, and concurrent benzodiazepine use may influence rTMS outcomes. The potential cumulative effect of these factors on response could be linked to their disparate impact on brain neuroplasticity and neurotransmission. For instance, a history of trauma and relatively lower anxiety might enhance receptivity to rTMS via induced neuroplastic changes, whereas the suppressive effect of benzodiazepines on cortical excitability could counteract these benefits, requiring adjustments in treatment protocols to optimize efficacy. Recent studies demonstrate that while rTMS produces generalized improvement across depressive symptoms, anxiety-related symptoms are less responsive, with outcomes influenced more by baseline symptom profiles than protocol variations [41, 42]. Further investigation into the interaction of these factors may provide deeper insights into personalized rTMS strategies for patients with complex clinical profiles.

Regarding treatment-related factors, the number of rTMS sessions was a positive predictor, with SHAP values indicating a positive linear dose-response relationship. This relationship aligns with notions that additional sessions sustain antidepressant effects, possibly through cumulative neuroplastic changes. However, this is not consistently upheld, as a meta-analysis reported higher doses of rTMS, measured by total pulses, were not always associated with depression improvement [38]. Adherence to the iTBS protocol was also among the strongest predictors of response. Notably, iTBS demonstrated superiority over iTBS-BiTBS in our sample, challenging some of the existing literature that suggests bilateral approaches can be advantageous for certain patient populations [43]. Despite this, the current findings likely relate to iTBS-BiTBS patients experiencing greater treatment-resistance at baseline, with providers transitioning protocols from unilateral to bilateral in cases of inadequate symptom relief.

Machine learning, electronic medical records, and clinical decision-making

Recent reviews of ML prediction models in psychiatric research have found that despite relatively ‘good’ discrimination performance (AUC~0.65–0.80), most studies had methodological weaknesses that limited their generalizability and utility in clinical decision-making [18, 44]. While AUC values below 0.7 are generally considered modest, they can still offer meaningful insights in complex psychiatric conditions like TRD. These models help identify relevant predictors and inform future strategies for refining and validating predictive approaches [44, 45]. The authors recommended more comprehensive reporting of ML methods and results, more robust validation methods, model calibration, and formal assessment of clinical utility, such as DCA [44, 46, 47]. In our analyses, DCA demonstrated that the response and remission models outperformed “treat all” and “treat none” strategies within probability threshold ranges of 25–55 and 15–40%, or NNT between 2 and 4 patients, and 3 and 7 patients, respectively. In other words, the NNT results suggest that for every 2–4 patients treated based on the response model, or every 3–7 patients treated based on the remission model, one additional patient would experience a superior outcome from rTMS compared to treating all patients or none. While immediately applying this analysis to real-world clinical-decision making is not warranted at this time, we included it here to advocate for the use of DCA in precision psychiatry prediction and prognostic models, aiming to illustrate its potential future utility. Both models indicate net clinical benefit within their respective threshold ranges, though more refined models incorporating additional patient features, such as neurobiological markers, polygenic risk scores, and psychosocial determinants, may confer additional benefit. Considering the use of internal cross-validation and novel methods, this model is best positioned as a preliminary predictive tool, offering foundational insights that can inform the development of more refined predictive models with future external validation. Despite demonstrating good discrimination performance, the remission model failed recalibration, as evidenced by a significant Hosmer-Lemeshow test (p = 0.01), indicating a mismatch between predicted and observed probabilities. This lack of calibration reduces the model’s reliability in clinical settings, as it may overestimate or underestimate remission probabilities, ultimately limiting its immediate utility for decision-making.

The implications of these findings are multi-fold. Firstly, they re-affirm the clinical utility of rTMS in managing TRD, especially when considering individual patient histories and comorbidities. Secondly, these results lend support for the predictive power of ML algorithms derived from EMR for refining treatment approaches, suggesting that these tools could become integral to psychiatric precision medicine. Future research should aim to validate these predictive models in diverse clinical settings, ensuring that findings are replicable and robust.

Limitations

The retrospective study design provides a robust dataset but holds inherent limitations. Firstly, while our tree-based models implicitly account for interaction effects, we did not explicitly analyze these due to sample size constraints, which may have limited our ability to detect and interpret nuanced interactions among predictors. Secondly, this study is susceptible to selection bias since it involves patients who initiated and completed treatment, which may exclude those with different response trajectories, possibly skewing findings towards patients more likely to complete treatment. Moreover, the sample in this study was predominantly highly educated and non-Hispanic White, limiting generalization to other demographics and pointing more towards the need to address disparities in access to TMS [48]. Third, information bias may arise from EMR data availability, accuracy, and completeness, with additional variability in the timing of data collection vis-à-vis treatment, possibly creating discrepancies between recorded and actual patient conditions. Reliance on EMR-extracted data presents challenges particularly in terms of incomplete entries, inconsistent documentation, and potential coding errors, which may affect the robustness of the model’s predictions. Further, our predictive models rely on several key assumptions. Tree-based models inherently capture non-linear relationships and interactions among predictors, eliminating the need for explicit interaction terms. Missing data were addressed through single imputation, assuming that data were missing at random. Overfitting was mitigated through feature selection, with checks for highly correlated predictors during preprocessing. These assumptions and limitations underscore the preliminary nature of this model, which serves as a foundation for future refinements. Future research should address these limitations by implementing prospective designs, standardizing rTMS protocols, timely data collection, and diverse populations.

Conclusion

This study elucidates predictors of treatment response to rTMS in TRD patients and highlights the importance of treatment approaches tailored to individual characteristics. By leveraging EMR and ML techniques, this study identified demographic, clinical, and treatment-related factors associated with outcomes, providing valuable insights for clinical practice and research.