Introduction

Peritumoral edema (PTE), characterized by pathological fluid accumulation in the brain parenchyma, is a prevalent and serious complication of metastatic brain tumors that adversely affects both neurological function and survival outcomes1. As the most frequent malignant intracranial neoplasms in adults, metastatic brain tumors exhibit an incidence rate approximately tenfold higher than primary intracranial tumors2. Mechanistically, PTE elevates intracranial pressure, which induces cerebral hypoxia and may progress to life-threatening herniation, ultimately resulting in irreversible neurological deficits or mortality3. The current treatment strategy involves the use of mannitol, diuretics, and steroid medications to alleviate the symptoms of cerebral edema. Previous studies have shown that the control rate of these drugs for refractory cerebral edema ranges from 27 to 39%4,5,6. It is noteworthy that these medications do not eradicate the underlying causative factors and are often accompanied by various adverse reactions7. For instance, long-term use of steroid medications can lead to severe systemic side effects, including but not limited to immunosuppression and avascular necrosis8. Similarly, the administration of mannitol may also trigger systemic hypotension, reduced cerebral perfusion, and even acute renal failure9.

Vascular endothelial growth factor A (VEGF-A) is a potent inducer of angiogenesis and enhancer of vascular permeability, which is considered to play a pivotal role in the pathogenesis of brain tumor-associated edema10. As a result, bevacizumab (BEV), a monoclonal antibody targeting VEGF-A, has emerged as an effective therapeutic option for addressing cerebral edema11,12,13,14. However, despite the demonstrated efficacy of BEV in treating cerebral edema, its use is accompanied by numerous concerns, such as the potential risks of tumor hemorrhage, ischemic stroke, and induction of abnormal tumor angiogenesis, as well as potential effects on patients’ cognitive function15. Furthermore, as a high-cost medication, the economic factor must also be considered in the use of BEV. Given the potential risks and its high cost, the ability to accurately predict BEV efficacy in treating cerebral edema holds significant value for clinical decision-making. It can aid in optimizing treatment plans, ensuring therapeutic outcomes while alleviating the financial burden on patients.

Artificial intelligence (AI) includes machine learning (ML) and deep learning (DL), each with its own structure and applications. Through the application of advanced algorithms for deep analysis of diverse data, ML technology is capable of precisely capturing and interpreting the intricate relationships between various clinical features and outcomes. This characteristic endows ML technology with the potential to surpass the limitations of traditional statistical methods in risk prediction, thereby offering clinicians more precise and comprehensive insights to inform decision-making16. ML models exhibit significant advantages over other prediction methods, primarily in their ability to automatically learn and identify optimal predictive features from training data. This process, devoid of manual intervention, substantially enhances the accuracy and efficiency of predictive models. Additionally, ML models possess robust generalization capabilities, which mean that the knowledge and patterns learned from training data can be effectively applied to new, unseen patient cohorts. This further strengthens their applicability and value in real-world clinical scenarios17.

Recent advances in neuro-oncology highlight the transformative potential of AI and ML. In this field, ML algorithms extract quantitative imaging features—including shape, grayscale, and texture characteristics—from medical images, which are subsequently processed by classifiers such as support vector machines, random forests (RF), and k-nearest neighbors to significantly improve the efficiency and accuracy of brain tumor detection18,19. Large-scale clinical validations underscore AI’s prognostic value in neuro-oncology. Recent studies involving 79,638 participants established robust performance in predicting survival outcomes for high-grade glioma20, while comprehensive meta-analyses confirmed ML’s reliability in forecasting tumor recurrence21. Beyond diagnostic applications, ML has shown comparable promise in neurovascular disorders, particularly in predicting poor functional outcomes following aneurysmal subarachnoid hemorrhage22. The predictive utility of ML extends to cerebral edema management. Multiple studies have developed MRI radiomics-based models that demonstrate good accuracy in predicting post-stroke cerebral edema23,24,25 , with texture features showing particular value in identifying symptomatic edema after endovascular treatment in high-risk populations26. Notably, ML approaches for post-stroke edema prediction have consistently achieved accuracy rates exceeding 70% in independent validations27,28,29.

Despite these advances, current research lacks investigations into predicting BEV treatment efficacy for PTE in metastatic brain tumors. This study aims to develop the first ML-based predictive model integrating quantitative imaging biomarkers with clinical parameters. By identifying pretreatment predictors of BEV response, our approach could provide clinicians with two critical decision-support functions: (1) avoiding BEV administration in patients unlikely to benefit, thereby reducing treatment risks and costs, and (2) optimizing therapeutic selection for potential responders. These findings could address an urgent need in neuro-oncology by providing an objective tool for personalizing BEV therapy in metastatic brain tumor patients.

Methods

Data sources

This study conducted a retrospective analysis of clinical data from patients with metastatic brain tumors who underwent BEV therapy from September 2013 to January 2024. The dataset consisted of clinical information from a total of 300 patients, including 18 features and 1 label. Among the 18 features, there were 13 clinical characteristics (demographics, medical history), plus 5 radiological features (tumor location, maximum tumor diameter, tumor volume, edema volume, and edema index). Figure 1 illustrates the workflow of this study. This retrospective study was conducted using existing, fully anonymized clinical data without any patient intervention. The Institutional Review Board (IRB) of Peking Union Medical College Hospital, Chinese Academy of Medical Sciences waived the requirement for informed consent in accordance with Article 39 of China’s Ethical Review Measures for Biomedical Research Involving Humans (2016), and the study protocol was approved by the hospital’s Academic Ethics Committee. The research strictly adhered to the ethical principles of the Declaration of Helsinki and received formal ethical approval exemption from the IRB (Exemption No. ZS-3931). The study design, ethical review process, and informed consent waiver procedures were fully compliant with Article 39 of the aforementioned Ethical Review Measures and Article 13 of China’s Personal Information Protection Law (2021).

Fig. 1
figure 1

Workflow diagram of machine learning model training in this study.

Efficacy evaluation

The tumor volume and maximum tumor diameter were measured on the T1-weighted contrast-enhanced sequence, while the edema volume was assessed on the T2/FLAIR sequence. Accurate tumor and edema volumes were obtained by summing the segmented volumes from each axial image30. The edema index was calculated based on the edema volume and tumor volume, using the formula: Edema Index = (Edema Volume + Tumor Volume) / (Tumor Volume)31. To evaluate treatment efficacy, we defined a quantitative criterion for edema remission: a reduction in edema volume exceeding 10% post-treatment compared to pre-treatment was considered as edema remission; conversely, a reduction below this threshold was defined as non-remission32. Based on this criterion, patients included in the study were categorized into edema remission and non-remission groups according to the change in edema volume before and after BEV treatment, to further analyze treatment outcomes and potential clinical correlations.

Data analysis

Continuous variables were presented as medians (interquartile ranges), while categorical variables were expressed as numbers/percentages. When analyzing the baseline characteristics of patients, continuous variables were assessed using logistic regression analysis, and categorical variables were analyzed with Chi-square tests. To facilitate the calculation of odds ratios when examining the relationship between independent variables and the dependent variable, continuous variables were first converted into categorical variables (e.g., age > 60 years). Subsequently, univariate logistic regression analysis was performed. Variables with a P-value < 0.05 were selected for inclusion in the multivariate logistic regression analysis. All statistical analyses were conducted using SPSS (IBM, Version 26.0).

Model construction and evaluation

The dataset of 300 patients was divided into a training set (70%) and a test set (30%) using stratified sampling. This split ratio was chosen to ensure a robust training set while maintaining a sufficiently large test set for reliable evaluation, particularly given the moderate size of our dataset. Preliminary experiments with alternative split ratios (e.g., 80–20%) showed no significant impact on model performance. Stratified sampling ensured that the distribution of patients in the edema remission group (n = 220) and the non-remission group (n = 80) was maintained in both subsets. Specifically, the training set consisted of 210 patients (154 from the edema remission group and 56 from the non-remission group), while the test set included 90 patients (66 from the edema remission group and 24 from the non-remission group). (Table 1).

Table 1 Dataset distribution for model development.

Before training, we conducted extensive data preprocessing to ensure the robustness of our predictive models. Missing values were addressed using the MICE (Multiple Imputation by Chained Equations) method with 10 iterations. Logistic regression (LR) was used as the imputation model for categorical variables, which is particularly effective for handling missing data in both continuous and categorical variables. Outliers were detected using the Interquartile Range (IQR) method and adjusted or removed based on clinical relevance and predefined statistical rules. Feature scaling was performed to normalize the data. Continuous variables were standardized using Z-score normalization, which transforms the data to have a mean of 0 and a standard deviation of 1. This step is particularly important for algorithms like LR and Gradient Boosting Trees (GBT), which are sensitive to the scale of input features. By normalizing the data, we ensured that no single feature dominated the model training process due to its larger scale, while also maintaining robustness to potential outliers.

To further optimize the feature set and improve model performance, we performed feature selection and dimensionality reduction. Feature selection was conducted using Recursive Feature Elimination (RFE) with LR as the underlying model, which iteratively removes the least important features based on model performance evaluated through cross-validation. Dimensionality reduction was performed using Principal Component Analysis (PCA), retaining components that explained 95% of the variance in the data. These steps helped us reduce the feature set from 18 to 13 features, thereby enhancing the predictive performance and generalization ability of the models.

In this study, we selected four distinct ML algorithms to train binary classification models aimed at predicting the efficacy of BEV treatment on PTE in patients with metastatic brain tumors. The choice of these algorithms was based on their unique strengths and potential complementarity, aiming to achieve optimal predictive performance. Specifically, the RF algorithm, with its ensemble learning characteristics, demonstrates robustness in handling large-scale, high-dimensional, and complex medical datasets. The LR model is suitable for dealing with linearly or approximately linearly separable data patterns, offering clear interpretability. The GBT algorithm is renowned for its powerful ability to mine nonlinear relationships. Lastly, the Naive Bayes (NB) model, grounded in solid probability theory, has advantages in handling small sample sizes and high-dimensional feature spaces. To further ensure model robustness and generalizability, a tenfold cross-validation strategy was implemented during model training, iteratively training on 9 subsets and validating on the remaining subset, with the final performance metrics averaged across all folds.

To mitigate the risk of overfitting and ensure the generalizability of our predictive models, we implemented several techniques tailored to the characteristics of each ML algorithm. For LR, we applied L2 regularization (Ridge regularization) and optimized the regularization strength (C) through grid search over the range [0.01, 0.1, 1, 10], while setting the class_weight parameter to ‘balanced’ to address class imbalance. For RF and GBT, we optimized key hyperparameters, including the number of trees and maximum depth of trees, through grid search, and set class_weight to ‘balanced’. For NB, we adjusted the Laplace smoothing parameter and applied the Synthetic Minority Over-sampling Technique (SMOTE) with k_neighbors = 5 to the training set, increasing the number of minority class samples to balance class distribution. SMOTE was applied exclusively to the training set, while the test set remained unchanged to ensure unbiased evaluation. Additionally, we employed tenfold cross-validation to further ensure model robustness and generalizability.

In the model performance evaluation phase, various evaluation metrics were employed, including accuracy, precision, recall, and F1-score, to comprehensively and objectively reflect the predictive efficacy of the models33. To further analyze the models’ performance under imbalanced sample scenarios, we calculated the F1-score, precision, and recall for each class ("non-remission" and “remission”) separately. Model performance was assessed using a tenfold cross-validation strategy on the training set, followed by evaluation on an independent test set. Furthermore, Receiver Operating Characteristic (ROC) curves and Precision-Recall (PR) curves were plotted, and the corresponding Area Under the Curve (AUC) was calculated to further quantify the models’ ability to distinguish between different treatment response groups. Finally, this study leveraged the Python programming language for data preprocessing, visualization analysis, and the training and evaluation of the models.

Results

Retrospective data analysis

This study included a total of 300 patients with metastatic brain tumors who received BEV treatment. Through retrospective analysis of the clinical and radiological characteristics of these two groups, we identified five variables that showed statistically significant differences between the two groups (P < 0.05): tumor origin, history of previous surgery, interval between BEV treatments, edema volume, and edema index. This finding suggests that these five variables may be crucial factors influencing the efficacy of BEV treatment on PTE in patients with metastatic brain tumors. The detailed results of the retrospective data analysis are presented in Table 2.

Table 2 Retrospective analysis of clinical and imaging features of all patients.

Univariate and multivariate logistic regression analysis

In this study, we performed a univariate logistic regression analysis on 18 variables to initially screen for potential key factors influencing the efficacy of BEV treatment on PTE in patients with metastatic brain tumors. The analysis revealed that five variables—tumor origin, surgical history, maximum tumor diameter, edema volume, and edema index—were statistically significantly associated with the treatment outcome, and thus were considered potential predictors of efficacy. To further validate the independent effects and relative importance of these factors, we subsequently conducted a multivariate logistic regression analysis on the five selected variables. The results indicated that edema volume and edema index emerged as independent risk factors for BEV treatment efficacy. Notably, the multivariate analysis highlighted edema volume as the most critical factor influencing treatment outcomes. Table 3 details the specific results of both the univariate and multivariate logistic regression analyses.

Table 3 Univariate and multivariate logistic regression analysis.

Machine learning prediction models

In this study, we employed four ML algorithms, RF, LR, GBT, and NB, to construct prediction models. Following training and cross-validation, these models were evaluated on an independent test set comprising 90 patients. Table 4 presents the performance metrics of each model in detail. Regarding model accuracy, the RF model achieved the highest accuracy of 0.89, significantly outperforming the LR and NB models, which had accuracies of 0.79 and 0.69, respectively. In terms of precision, the GBT model performed best with a precision of 0.97, while the RF model followed closely with a precision of 0.94. The RF model also attained the highest recall of 0.91, compared with the LR and GBT models, which both had a recall of 0.83. In terms of F1 score, the RF model scored the highest at 0.92, demonstrating its ability to balance precision and recall effectively. Figure 2 showcases the ROC curves and PR curves of the four ML models, providing an in-depth analysis of their classification performance. In Fig. 2A, the ROC curve analysis demonstrates that the RF model performs most effectively, achieving an AUC of 0.91, which is the highest among the four models. Figure 2B presents the PR curves of the models, focusing on evaluating the model’s performance on positive class samples. Here, the RF and GBT models demonstrate significant superiority, both achieving an Area Under the PR Curve (AP) of 0.91. Considering all performance metrics comprehensively, the RF model was identified as the best-performing model due to its superior performance in accuracy, recall, F1 score, AUC-ROC, and AUC-PR. Figure 3 illustrates the confusion matrices of the four models, further supporting the aforementioned conclusions.

Table 4 Machine learning model performance.
Fig. 2
figure 2

ROC curves and PR curves of all machine learning models in this study. Subfigure (A) ROC curves of four machine learning models. Subfigure (B) PR curves of four machine learning models.

Fig. 3
figure 3

Confusion Matrices of All Machine Learning Models Utilized in This Study. Subfigure (A) Confusion Matrix of Random Forest Model. Subfigure (B) Confusion Matrix of logistic regression Model. Subfigure (C) Confusion Matrix of Gradient Boosting Tree Model. Subfigure (D) Confusion Matrix of Naive Bayes Model.

To further evaluate the models’ performance under imbalanced sample scenarios, we analyzed their precision, recall, and F1-scores for the remission and non-remission groups, as shown in Table 5. The RF model demonstrated strong performance across all metrics, achieving high precision (0.95 for remission, 0.92 for non-remission), recall (0.93 for remission, 0.88 for non-remission), and F1-scores (0.93 for remission, 0.89 for non-remission), indicating its robustness in handling class imbalance. The GBT model also performed well, particularly in precision, though its recall for the non-remission group was slightly lower. In contrast, the Logistic Regression (LR) model showed a notable bias towards the remission group, with significantly lower recall and F1-scores for the non-remission group. The NB model performed the weakest across all metrics, indicating its limited effectiveness in imbalanced scenarios. Overall, the RF and GBT models outperformed the other models, demonstrating their superior ability to handle imbalanced datasets. These findings align with the earlier results from the ROC and PR curves, further confirming the RF model’s effectiveness in balancing precision and recall.

Table 5 Model performance on overall, remission, and non-remission groups.

In addition, we also conducted feature importance analysis using the RF model and the GBT model. For the RF model, the analysis revealed that edema volume, edema index, age, tumor volume, and maximum tumor diameter were the top five most important features. Among these, edema volume stood out as the most critical factor, with an importance score of 0.28. Furthermore, in the feature importance assessment of the GBT model, edema volume maintained its dominant position, achieving an impressive importance score of 0.61. This finding underscores the pivotal role of edema volume in the predictive models. Figure 4 visually presents the ranking of feature importance for both the RF and GBT models.

Fig. 4
figure 4

Feature Importance Analysis Based on Machine Learning Models. Subfigure (A) Feature Importance Plot of Random Forest Model. Subfigure (B) Feature Importance Plot of Gradient Boosting Tree Model.

To delve deeper into the potential associations among various features, we employed ML techniques to conduct a feature correlation analysis and plotted a feature correlation heatmap. The heatmap in Fig. 5 shows the correlation between various features. The analysis revealed a strong positive correlation between tumor volume and maximum tumor diameter, with a correlation coefficient of 0.81. Additionally, we observed a moderate positive correlation (r = 0.62) between tumor origin and gender. The edema index and edema volume exhibited a positive correlation of 0.57. Furthermore, a significant negative correlation was observed between the edema index and both tumor volume and maximum tumor diameter, with correlation coefficients of − 0.65 and − 0.59, respectively.

Fig. 5
figure 5

Heatmap of the correlation between features based on the machine learning model.

Discussion

The present study aimed to predict the efficacy of BEV in treating PTE in patients with metastatic brain tumors by integrating clinical characteristics, imaging features, and ML models. Univariate and multivariate logistic regression analyses identified edema volume and edema index as the main factors affecting BEV’s efficacy, a finding further supported by feature importance analyses in both RF and GBT models, where edema volume emerged as the most significant predictor. Edema volume serves not only as an imaging biomarker of mechanical compression but also reflects tumor biological behavior, particularly VEGF-A-driven angiogenic activity and vascular permeability—a dual role supported by prior studies linking VEGF-A to edema severity34,35. The strong positive correlation between tumor volume and maximum diameter (r = 0.82) reflects their shared role in quantifying tumor burden, suggesting that either parameter could suffice for clinical assessment. Additionally, the correlation between tumor origin and gender (r = 0.62) likely stems from the cohort composition: all breast cancer patients were female, while lung cancer patients were predominantly male. This observation warrants further research into gender-specific differences in VEGF signaling or angiogenic activity, which may guide personalized BEV therapy.

Furthermore, by applying various ML algorithms, including RF, LR, GBT, and NB, we successfully developed a model capable of accurately predicting the efficacy of BEV in treating PTE in patients with metastatic brain tumors. Among these models, the RF model excelled in multiple performance metrics, such as the AUC value, accuracy, precision, and F1 score, indicating its robust predictive power and stability. Notably, the RF model demonstrated exceptional performance in handling imbalanced data, which can be attributed to its ensemble learning mechanism, inherent randomness36, and adaptive weighting that balances precision and recall, making it robust for clinical applications37. This characteristic highlights RF’s suitability for medical studies, where class imbalance is common and accurate identification of minority classes (e.g., the non-remission group) is clinically critical. Our results confirm that imbalance significantly impacts model performance, with RF and GBT outperforming LR and NB in such scenarios. Future studies could further improve performance via advanced sampling (e.g., ADASYN), cost-sensitive learning, or ensemble techniques like stacking38,39. In conclusion, the results of this study not only validate the superiority of RF model in imbalanced scenarios but also provide important references for model selection and optimization under similar data distributions.

The predictive model developed in this study offers significant clinical utility by stratifying patients with PTE into probable responders and non-responders to BEV treatment based on integrated clinical and imaging characteristics. This stratification enables clinicians to optimize therapeutic strategies: prioritizing BEV for likely responders to accelerate edema remission, while considering alternative approaches for predicted non-responders to avoid unnecessary exposure to BEV’s risks (e.g., hemorrhage, hypertension) and reduce financial burdens—particularly valuable in resource-limited settings40. Unlike traditional clinical judgment alone, the AI model leverages advanced ML to analyze complex, multi-dimensional data, providing more objective and reproducible predictions. The model’s ability to guide targeted use of this high-cost biologic addresses a critical need for cost-effective precision medicine in neuro-oncology. The AI model’s implementation holds transformative potential across multidisciplinary clinical practice. For neurosurgeons involved in the long-term management of PTE, the model offers data-driven guidance to optimize medical therapy (e.g., BEV initiation or alternative strategies), supporting clinical decisions during patient follow-up. Neuro-oncologists can leverage its predictions to personalize treatment regimens, minimizing BEV-related risks for non-responders while optimizing outcomes for responders. Radiologists benefit from the model’s standardized quantification of imaging biomarkers (e.g., edema volume), reducing inter-observer variability and improving consistency in tumor board discussions. Through translating complex ML outputs into actionable clinical insights and integrating predictive analytics into routine practice, this approach not only bridges the gap between computational analytics and clinical decision-making in neuro-oncology but also fosters a collaborative framework for PTE management across multidisciplinary teams (neurologists, oncologists, and radiologists).

It is important to acknowledge the limitations of this study. Firstly, as a retrospective study, there is a potential risk of data bias. Secondly, despite selecting four representative algorithms, the range of ML algorithms employed in this study was limited; there may be other, more suitable algorithms for this particular research that were not included. Lastly, some subjectivity in the feature selection process cannot be overlooked. These limitations underscore the need for future studies to address these issues by incorporating larger, prospective datasets, exploring a wider range of algorithms, and adopting more rigorous and objective feature selection methods in order to further validate and refine the predictive models.

In future research, we plan to expand the scope of our predictive model by incorporating additional imaging features, such as perfusion and diffusion-weighted MRI parameters, which could provide deeper insights into the pathophysiology of PTE and further enhance predictive performance. For instance, perfusion MRI could offer valuable information on tumor vascularity and blood–brain barrier integrity, while diffusion-weighted imaging (DWI) might reveal microstructural changes in the brain parenchyma associated with edema formation41. Additionally, integrating molecular biomarkers and gene expression data, such as VEGF-A and other angiogenesis-related genes, could reveal underlying biological mechanisms driving edema formation and response to BEV42. By combining imaging and molecular data, we hope to develop a more comprehensive and multi-modal predictive framework. The integration of this model into clinical practice has the potential to significantly enhance patient management. To ensure successful implementation, we would collaborate with multidisciplinary teams, including neuro-oncologists, radiologists, and clinical researchers, to develop standardized protocols for data collection, model application, and outcome evaluation. Regular feedback from clinicians would be incorporated to refine the model and ensure its alignment with real-world clinical needs. By bridging the gap between predictive modeling and clinical decision-making, this approach has the potential to not only improve treatment outcomes but also optimize resource utilization and enhance the overall quality of care for patients with metastatic brain tumors.

Conclusion

This study pioneers a predictive model integrating imaging features and ML to forecast BEV efficacy in treating PTE in metastatic brain tumor patients. Beyond demonstrating high predictive accuracy, our model identifies edema volume as the most critical predictor, offering neurosurgeons and radiologists a data-driven tool to optimize therapeutic strategies. The clinical relevance lies in stratifying patients into likely responders and non-responders before BEV initiation, addressing two major neuro-oncology challenges: avoiding unnecessary treatment risks and reducing the economic burden of ineffective therapy. For clinicians, this model provides actionable insights—enabling personalized treatment plans that prioritize BEV for high-probability responders, while exploring alternatives for predicted non-responders. Integration of such AI-driven decision support systems could significantly enhance precision medicine, particularly in resource-constrained settings where cost-effectiveness is paramount. Future efforts should focus on prospective validation and real-world implementation to refine the model’s utility in diverse clinical scenarios.