1 Introduction

Breast cancer (BC) has become the most commonly diagnosed cancer worldwide [1], and in America alone there are over 260,000 new diagnoses every year [2]. With rapid advances in adjuvant therapies, including endocrine therapy, targeted therapy, immune checkpoint inhibitors (anti-programmed cell death protein-1 [anti-PD1] therapy), and platinum-based chemotherapeutic agents, the risk of BC-related death has been greatly reduced. Rising BC incidence, together with earlier diagnosis and major improvements in cancer treatment [3], has led to a larger survivor population and therefore a larger population at increased risk of comorbidities, especially cardiac disease. Compared with healthy controls, patients with BC are more likely to develop cardiac disease, and those who do have a significantly worse prognosis [4, 5]. A large cohort study found that the incidence of major cardiac events related to BC therapy was 4.1% at 5 years [6]. Cardiac disease accounted for about 16.3% of deaths among all BC patients and, for patients with pre-existing cardiovascular risk factors, exceeded the mortality due to cancer itself at 10-year follow-up [5]. In general, BC survivors who develop cardiac disease have a 3.8-fold higher all-cause mortality than those who do not [4]. Therefore, more attention should be paid to cardiac disease-specific death (CDSD) in patients with BC.

Older age is an established risk factor for cardiac disease [6], so most studies have focused on cardiac events in older cohorts. CDSD risk in young women with BC remains a major challenge, not only because of its rarity but also because of the lack of modifiable risk factors and the shortage of screening for young patients. Young patients also tend to receive more intensive treatment than older women, and more aggressive therapies entail higher toxicity. The incidence of cancer treatment-associated cardiac dysfunction during or soon after the completion of cancer therapy is 9–26% for BC patients who received doxorubicin, 13–17% for those who received trastuzumab, and 27–34% for those who received combination therapies [7,8,9,10,11]. Although young BC patients face a high risk of developing cardiac disease, the risk of CDSD in specific subgroups defined by cancer stage, treatment, race, and other characteristics remains uncertain.

This study aims to predict the risk of CDSD in young BC patients after various treatments using machine learning models [12, 13], and thereby to draw attention to the clinical care and monitoring of this group of patients. We hope these findings will serve as a credible epidemiological foundation for tailored management of young BC patients and help healthcare systems reduce the burden of CDSD among these patients.

2 Materials and methods

2.1 Data source and study design

Data on BC patients analyzed in this study were collected from the SEER database [SEER 17 Regs study data, (changes 2000–2021); version 8.4.3], which is openly accessible. Inclusion criteria: (1) female patients with BC; (2) histopathological and morphological confirmation according to the International Classification of Diseases for Oncology, Third Edition (ICD-O-3); (3) age less than 50 years. Exclusion criteria: (1) BC that was not a primary tumor according to international rules; (2) unknown survival time. Follow-up ended at death, loss to follow-up, or December 31, 2021, whichever came first.
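The inclusion and exclusion criteria above can be read as a simple record filter. The following Python sketch is illustrative only: the field names (`sex`, `histology_confirmed`, `age_at_diagnosis`, `primary_breast_cancer`, `survival_months`) are hypothetical and do not correspond to actual SEER variable names, and the three records are made-up examples.

```python
def is_eligible(record):
    """Apply the stated inclusion and exclusion criteria to one patient record.

    All field names are hypothetical placeholders, not SEER variable names.
    """
    return (
        record["sex"] == "female"                  # inclusion (1)
        and record["histology_confirmed"]          # inclusion (2), ICD-O-3 evidence
        and record["age_at_diagnosis"] < 50        # inclusion (3)
        and record["primary_breast_cancer"]        # exclusion (1)
        and record["survival_months"] is not None  # exclusion (2)
    )


# Made-up example records for illustration.
records = [
    {"sex": "female", "histology_confirmed": True, "age_at_diagnosis": 44,
     "primary_breast_cancer": True, "survival_months": 62},
    {"sex": "female", "histology_confirmed": True, "age_at_diagnosis": 55,
     "primary_breast_cancer": True, "survival_months": 30},
    {"sex": "female", "histology_confirmed": True, "age_at_diagnosis": 39,
     "primary_breast_cancer": True, "survival_months": None},
]

cohort = [r for r in records if is_eligible(r)]
print(len(cohort), "of", len(records), "records eligible")
```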

2.2 Model construction

Feature selection: univariate Fine-Gray competing risk analyses were performed on clinical characteristics. Characteristics that were statistically significant in the univariate Fine-Gray analysis, including age at diagnosis, median household income, rural–urban continuum code, race, hormone receptor (HR) subtype, marital status, grade, T stage, N stage, M stage, radiotherapy, chemotherapy, and surgery, were incorporated into machine learning models for CDSD risk prediction in young BC patients. Before training, a binary response variable was derived from the cause-of-death information, in which “1” = death from cardiac disease and “0” = alive or death from other causes. Patients were randomly divided into a training set and a test set in a 7:3 ratio. We compared the area under the curve (AUC) of logistic regression (LR), support vector machine (SVM), random forest (RF), decision tree (Iterative Dichotomiser 3, ID3), and Extreme Gradient Boosting (XGBoost) models on the training and test sets. LR, also known as logit (log-odds) regression, is an interpretable classification algorithm and a hallmark of classical predictive modeling. SVM is a binary classification model that seeks the hyperplane separating the samples with the maximum margin. In ID3, internal nodes represent tests on input factors and leaves denote decision outcomes. RF is an ensemble learning method based on multiple decision trees; it reduces overfitting by introducing randomness, training each tree on a random sample of the data and a random subset of features. Finally, XGBoost is a gradient boosting algorithm that builds decision trees iteratively, with each new tree fitted to the gradient of the loss function so that the loss is reduced step by step. It also adds a regularisation term to the objective to limit model complexity and overfitting.
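As an illustration only (the original analysis was carried out in R), the outcome coding and the 7:3 split described above can be sketched in Python. The status labels, the stratification by outcome, the random seed, and the toy cohort are assumptions made for this sketch; the text states only that patients were randomized in a 7:3 ratio.

```python
import random


def encode_outcome(status):
    """Return 1 for death from cardiac disease, 0 for alive or death from other causes."""
    return 1 if status == "cardiac_death" else 0


def stratified_split(labels, train_fraction=0.7, seed=0):
    """Split sample indices into train and test sets within each outcome class.

    Stratification is an assumption of this sketch; it keeps the rare event
    represented at a similar rate in both sets.
    """
    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        cut = round(len(idx) * train_fraction)
        train_idx.extend(idx[:cut])
        test_idx.extend(idx[cut:])
    return sorted(train_idx), sorted(test_idx)


# Hypothetical toy cohort: 10 cardiac deaths, 60 alive, 30 deaths from other causes.
statuses = ["cardiac_death"] * 10 + ["alive"] * 60 + ["other_death"] * 30
labels = [encode_outcome(s) for s in statuses]
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))
```

Because the event of interest is rare, a purely random split can leave very few events in the test set; splitting within each outcome class is one common way to avoid this.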

Further, receiver operating characteristic (ROC) analysis, AUC values, calibration curves, decision curves, and confusion matrices were employed to evaluate our model. Sensitivity, specificity, and accuracy are the primary assessment parameters derived from the confusion matrix. Our XGBoost model was interpreted using SHapley Additive exPlanations (SHAP) values. SHAP is a post hoc model interpretation method; its core idea is to compute the marginal contribution of each feature to the model output, which allows a "black box" model to be interpreted both globally and locally.
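The idea of a marginal contribution can be made concrete with a small exact Shapley value computation. The sketch below is not the SHAP implementation used for tree models; it enumerates all feature coalitions for a hypothetical three-feature toy model and, as a simplifying assumption, replaces "absent" features with a fixed baseline value.

```python
from itertools import combinations
from math import factorial


def shapley_values(value_fn, n_features):
    """Exact Shapley values: weighted average marginal contribution over all coalitions."""
    phi = [0.0] * n_features
    for i in range(n_features):
        others = [j for j in range(n_features) if j != i]
        for size in range(len(others) + 1):
            weight = (factorial(size) * factorial(n_features - size - 1)
                      / factorial(n_features))
            for subset in combinations(others, size):
                coalition = frozenset(subset)
                # Marginal contribution of feature i when added to this coalition.
                phi[i] += weight * (value_fn(coalition | {i}) - value_fn(coalition))
    return phi


# Hypothetical toy model with one interaction term (not the model from this study).
x = [1.0, 2.0, 3.0]          # the sample being explained
baseline = [0.0, 0.0, 0.0]   # assumed reference values for "absent" features


def model(z):
    return 2.0 * z[0] + 1.0 * z[1] + 0.5 * z[1] * z[2]


def value_fn(coalition):
    z = [x[j] if j in coalition else baseline[j] for j in range(len(x))]
    return model(z)


phi = shapley_values(value_fn, len(x))
print(phi, sum(phi), model(x) - model(baseline))
```

The contributions sum to the difference between the prediction for the sample and the prediction at the baseline, which is the additivity property that lets per-feature values be read as a decomposition of a single prediction.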

2.3 Statistical analysis

Categorical variables were expressed as frequencies and percentages, and continuous variables were expressed as means and standard deviations (M ± SD). Between-group comparisons of categorical data were performed using the χ2 test or Fisher’s exact test. To explore the association between clinicopathological characteristics and the risk of CDSD, a univariate Fine-Gray competing risk model was used. To identify independent risk factors for CDSD, multivariate Fine-Gray analyses were performed on the variables that were statistically significant in the univariate analysis. In the competing risk analysis, deaths from cardiac disease were treated as the event of interest, and deaths from other causes were treated as competing events. All statistical calculations were performed using the R programming language (version 4.0.2). Statistical significance was defined as a two-sided P value of less than 0.05.
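To illustrate why deaths from other causes are treated as competing events rather than as censoring, the sketch below computes a nonparametric cumulative incidence function for one cause. It is written in Python for illustration (the analysis itself was done in R), it is not the Fine-Gray regression model, and the event coding (0 = censored, 1 = cardiac death, 2 = death from other causes) and the toy follow-up data are assumptions.

```python
def cumulative_incidence(times, events, cause=1):
    """Nonparametric cumulative incidence for one cause under competing risks.

    events: 0 = censored, 1 = cardiac death, 2 = death from other causes
    (this coding is an assumption of the sketch).
    Returns a list of (time, cumulative incidence) at each observed event time.
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    surv = 1.0   # probability of being free of any event just before the current time
    cif = 0.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        tied = [e for (tt, e) in data if tt == t]
        d_cause = sum(1 for e in tied if e == cause)
        d_any = sum(1 for e in tied if e != 0)
        if d_any > 0:
            cif += surv * d_cause / at_risk     # only event-free patients can fail
            surv *= 1.0 - d_any / at_risk       # any death removes a patient
            curve.append((t, cif))
        at_risk -= len(tied)
        i += len(tied)
    return curve


# Hypothetical follow-up times (years) and event codes for eight patients.
times = [2, 3, 3, 5, 7, 8, 10, 12]
events = [1, 2, 0, 1, 0, 2, 1, 0]
curve = cumulative_incidence(times, events, cause=1)
print(curve[-1])
```

In this toy example the cumulative incidence of cardiac death at the last event time is lower than one minus a Kaplan-Meier estimate that censors the other-cause deaths, because patients who have already died of another cause can no longer die of cardiac disease.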

3 Results

3.1 Clinical characteristics of young BC patients

A total of 253,362 patients aged younger than 50 years were included in this study, of whom 209,067 survived, 1275 died from cardiac disease, and 43,320 died from other causes. The detailed clinicopathological information of young patients with BC is shown in Table 1 and summarized below. The mean age at diagnosis of patients who experienced CDSD was 44.29 ± 4.47 years, significantly higher than that of those who were alive (42.75 ± 5.35 years) and those who died of other causes (42.01 ± 5.77 years). Among the patients who experienced CDSD, the most common categories were white race (801 cases, 62.8%), unmarried status (657 cases, 51.5%), HR-positive subtype (HR+; 195 cases, 15.3%), grade III/IV (591 cases, 46.4%), IDC histological type (963 cases, 75.5%), and residence in metropolitan counties with a population of over 1 million (658 cases, 51.7%).

Table 1 Baseline characteristics of younger patients with breast cancer

As for median household income, lower household income was associated with a higher risk of CDSD. A similar trend was found for rural–urban residence: patients living in metropolitan areas with larger populations had a lower risk of CDSD, while those from nonmetropolitan counties had the greatest risk. Moreover, among all races and marital statuses, black patients and unmarried patients had the highest probability of dying from cardiac disease.

3.2 Competing risk analysis of cardiac disease-specific death

In this study, univariate and multivariate Fine-Gray competing risk analyses were performed to analyze CDSD in young BC patients (Table 2). The univariate Fine-Gray analysis showed that age at diagnosis, median household income, rural–urban residential environment, hormone receptor subtype, grade, TNM stage, radiotherapy, chemotherapy, and surgery were significantly associated with CDSD (P < 0.05). However, human epidermal growth factor receptor 2 (HER2) status and histological type were not associated with CDSD (P > 0.05). The factors that were statistically significant in the univariate analysis (P < 0.05) were entered into the multivariate Fine-Gray competing risk model. Consistent with the baseline characteristics, age remained an independent risk factor for CDSD in young BC patients (hazard ratio [HR] = 1.077; 95% CI = 1.062–1.093, P < 0.001). The black population (vs white: HR = 2.391, 95% CI = 2.046–2.794, P < 0.001) had the highest CDSD risk among patients of all races. Patients with higher household income (HR = 0.927; 95% CI = 0.887–0.968, P < 0.001) and married patients (vs unmarried: HR = 0.523, 95% CI = 0.457–0.598, P < 0.001) had a lower CDSD risk, while a nonmetropolitan residential environment (vs counties in metropolitan areas with over 1 million population: HR = 1.785; 95% CI = 1.428–2.232, P < 0.001) and the hormone receptor-positive subtype (vs hormone receptor-negative: HR = 1.428; 95% CI = 1.104–1.675, P < 0.005) were associated with a higher CDSD risk. A higher T stage was also associated with a higher CDSD risk: young BC patients with T2 (vs T1: HR = 1.372, 95% CI = 1.175–1.610, P < 0.001), T3 (vs T1: HR = 1.452, 95% CI = 1.130–1.866, P < 0.01), and T4 (vs T1: HR = 1.363, 95% CI = 1.101–1.562, P < 0.05) disease had a significantly higher CDSD risk than those with T1 disease. Surgery (vs no surgery: HR = 0.840, 95% CI = 0.735–0.959, P < 0.05) was associated with a reduced risk of CDSD in young BC patients. In contrast, chemotherapy (vs no chemotherapy: HR = 1.322, 95% CI = 1.102–1.689, P < 0.05) was associated with a significantly increased risk of CDSD. N stage, M stage, tumor grade, and radiotherapy were not independent risk factors for CDSD in young BC patients.
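For readers less familiar with the Fine-Gray approach, the hazard ratios reported here are ratios of subdistribution hazards. The standard formulation is sketched below in LaTeX; the symbols (T for time to death, ε for cause of death with ε = 1 denoting cardiac death, x for the covariate vector, β for the coefficients) are notation introduced for this sketch and are not taken from the original analysis.

```latex
% Subdistribution hazard for cause 1 (cardiac death), Fine-Gray model
\begin{align}
\lambda_1(t \mid x) &= \lim_{\Delta t \to 0} \frac{1}{\Delta t}\,
  P\bigl\{ t \le T < t + \Delta t,\ \varepsilon = 1 \;\big|\;
  T \ge t \ \text{or}\ (T < t,\ \varepsilon \ne 1),\ x \bigr\} \\
\lambda_1(t \mid x) &= \lambda_{1,0}(t)\, \exp\bigl(\beta^{\top} x\bigr) \\
F_1(t \mid x) &= 1 - \exp\Bigl(-\int_0^{t} \lambda_1(s \mid x)\, ds\Bigr) \\
\mathrm{HR}_k &= \exp(\beta_k)
\end{align}
```

Under this model, patients who die from other causes remain in the risk set, so each hazard ratio describes the effect of a covariate on the cumulative incidence F_1 of cardiac death. As a worked example, if age entered the model as a continuous covariate in years (an assumption, since the coding is not stated), the reported HR of 1.077 per year would correspond to 1.077^5 ≈ 1.45 for a five-year difference in age at diagnosis.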

Table 2 Univariate and multivariate Fine-Gray competing risk analysis of characteristics

3.3 Establishment and evaluation of predictive models for estimating the cardiac disease-specific death of young BC patients

To better identify patients at high risk of CDSD, machine learning models were constructed to predict the CDSD risk of young BC patients. The patients were divided into training and test sets in a 7:3 ratio. To improve the stability and reliability of our model, ten-fold cross-validation was performed in the training set for iterative testing and tuning, yielding the optimal model (gamma = 0.1, min_child_weight = 500, scale_pos_weight = 90, subsample = 0.5, max_delta_step = 6, alpha = 2, max_depth = 5, eta = 0.1, nround = 100) (Table 3). To evaluate our model, receiver operating characteristic (ROC) curves were plotted for the training and test sets, and the corresponding areas under the ROC curve (AUC) were calculated (Fig. 1). The XGBoost model showed good discrimination in predicting CDSD risk for young BC patients (training set: AUC = 0.846; test set: AUC = 0.836). To put this performance in context, the XGBoost model was compared with several traditional machine learning algorithms, including LR (training set: AUC = 0.755; test set: AUC = 0.746), RF (training set: AUC = 0.820; test set: AUC = 0.803), SVM (training set: AUC = 0.735; test set: AUC = 0.644), and ID3 (training set: AUC = 0.799; test set: AUC = 0.785); the XGBoost model had the highest AUC values (Table 4).
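The AUC values compared above have a direct probabilistic reading: the AUC is the probability that a randomly chosen patient with the event receives a higher predicted score than a randomly chosen patient without it. A minimal Python sketch of this pairwise definition follows; the labels and scores are made-up toy values, and the quadratic loop is chosen for clarity rather than efficiency.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy example: two patients with the event (1) and two without (0).
toy_labels = [0, 0, 1, 1]
toy_scores = [0.10, 0.40, 0.35, 0.80]
toy_auc = roc_auc(toy_labels, toy_scores)
print(toy_auc)
```

Because this definition depends only on the ranking of scores, the AUC is unaffected by class imbalance weighting that rescales predicted probabilities monotonically, which is one reason it is a convenient metric for comparing models on a rare outcome.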

Table 3 Main parameters of the XGBoost model

Fig. 1 ROC curves of the XGBoost model’s predicted results in the train and test set. A ROC curve on the train data; B ROC curve on the test data. ROC, receiver operating characteristic; AUC, area under the curve

Table 4 Performance of prognostic models built by machine learning algorithms on train and test data (area under the ROC curve)

The closer the calibration curve lies to the reference line, the better the predicted probabilities match the observed event rates. The calibration curves show that, in both the training set and the test set, the predicted probability of our XGBoost model agrees well with the actual risk (Fig. 2). Further, decision curve analysis was performed to assess the clinical utility of our model [14]. The decision curves show a higher net benefit than the treat-all or treat-none strategies across a range of threshold probabilities in the training (Fig. 3A) and test (Fig. 3B) sets.
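The net benefit plotted in a decision curve can be computed from the counts of true and false positives at a given threshold probability. The Python sketch below uses the standard net benefit formula; the labels and predicted probabilities are made-up toy values and do not come from this study.

```python
def net_benefit(labels, probs, threshold):
    """Net benefit of intervening on patients whose predicted risk is >= threshold."""
    n = len(labels)
    tp = sum(1 for y, p in zip(labels, probs) if p >= threshold and y == 1)
    fp = sum(1 for y, p in zip(labels, probs) if p >= threshold and y == 0)
    # False positives are weighted by the odds of the threshold probability.
    return tp / n - fp / n * threshold / (1.0 - threshold)


def net_benefit_treat_all(labels, threshold):
    """Net benefit of intervening on every patient regardless of predicted risk."""
    prevalence = sum(labels) / len(labels)
    return prevalence - (1.0 - prevalence) * threshold / (1.0 - threshold)


# Made-up toy data: 3 events among 10 patients.
toy_labels = [1, 0, 0, 1, 0, 0, 0, 0, 1, 0]
toy_probs = [0.9, 0.2, 0.1, 0.7, 0.4, 0.05, 0.3, 0.1, 0.6, 0.2]

nb_model = net_benefit(toy_labels, toy_probs, threshold=0.5)
nb_all = net_benefit_treat_all(toy_labels, threshold=0.5)
nb_none = 0.0  # intervening on no one has zero net benefit by definition
print(nb_model, nb_all, nb_none)
```

A decision curve repeats this calculation over a range of thresholds; a model is considered useful at a given threshold when its net benefit exceeds both the treat-all and treat-none values.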

Fig. 2 Calibration curves of the XGBoost model’s predicted results in the train and test set. A Calibration curve on the train data; B Calibration curve on the test data

Fig. 3 Decision curves of the XGBoost model’s predicted results in the train and test set. A Decision curve on the train data; B Decision curve on the test data

The contribution of each variable to the final prediction was illustrated by SHAP values, which help explain and evaluate model predictions for each individual patient. The final prediction score corresponds to the combined effect of the SHAP values of all factors. Each point on the graph represents the SHAP value of one sample; a color closer to purple indicates a higher feature value, while a color closer to yellow indicates a lower feature value. The more dispersed the points for a variable, the stronger the effect of that variable on the model. The results showed that median household income, marital status, race, and age at diagnosis were the four strongest predictors (Fig. 4).

Fig. 4 SHAP values of clinical characteristics, ranked by importance in the XGBoost model

From the confusion matrix of the XGBoost model (Fig. 5), the sensitivity, specificity, and accuracy were 0.81, 0.94, and 0.94 for the training set, and 0.82, 0.95, and 0.96 for the test set, respectively. These results indicate that our model performed well in predicting the CDSD risk of young BC patients.
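For reference, the three metrics reported above are simple ratios of the confusion matrix cells. The Python sketch below shows the calculation; the counts are hypothetical and are not the values in Fig. 5.

```python
def confusion_metrics(tp, fp, tn, fn):
    """Return (sensitivity, specificity, accuracy) from confusion matrix counts."""
    sensitivity = tp / (tp + fn)                 # events correctly flagged
    specificity = tn / (tn + fp)                 # non-events correctly cleared
    accuracy = (tp + tn) / (tp + fp + tn + fn)   # all correct predictions
    return sensitivity, specificity, accuracy


# Hypothetical counts for a rare outcome: 50 events among 1000 patients.
sens, spec, acc = confusion_metrics(tp=40, fp=50, tn=900, fn=10)
print(round(sens, 3), round(spec, 3), round(acc, 3))
```

When the event is rare, accuracy is dominated by the large number of non-events and therefore tracks specificity closely, so sensitivity is the more informative figure for judging how many high-risk patients are actually identified.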

Fig. 5 Confusion matrix of the XGBoost model’s predicted results in the train and test set. A Confusion matrix on the train set; B Confusion matrix on the test set

4 Discussion

BC patients have a higher risk of developing cardiac disease than non-cancer controls [15]. A previous population-based study with over 40 years of follow-up assessed the risk of cardiac death for patients with cancers at 21 sites. It identified cardiac death as the most clinically significant competing risk for most non-metastatic cancers, and in BC, CDSD had gradually surpassed the primary tumor as the leading cause of death [16]. Once cardiac events occur, patients face significantly worse overall survival, and in certain BC populations, CDSD rates exceed cancer death rates. Therefore, it is imperative to recognize high-risk populations and develop effective cardioprotection strategies.

Cardiac disease is usually associated with older patients, so most previous studies have focused on this population and neglected young patients. Several studies have documented that young women are more likely to develop aggressive subtypes of BC with poor prognostic features and to present with more advanced disease stages [17, 18]. Young BC patients conventionally receive more intensive therapies than older women, but they still have a higher risk of BC recurrence and death [19]. The longer survival time of young patients, combined with the higher toxicity of more aggressive therapies, highlights the non-negligible risk of CDSD in this group. This study aims to assess the risk of CDSD for young BC patients across specific clinicopathological and psychosocial subgroups.

Our baseline characteristic analysis and competing risk analysis together indicate that young BC patients with older age, low household income, a non-metropolitan residential environment, black race, unmarried status, HR+ subtype, higher T stage (T2-4), chemotherapy, or no surgery are at higher risk of CDSD. The racial disparity in CDSD risk may arise because hypertension rates among black patients are estimated to be the highest in the world, and hyperaldosteronism is significantly correlated with cardiovascular risk [20, 21]. Household income, which reflects economic stability [22], was found to be the strongest risk factor for CDSD. This supports previous findings linking low socioeconomic status with atherogenesis and a proinflammatory state [23, 24]. Taken together, these social determinants, including race, marital status, household income, and residential environment, call for strengthened management of patients from rural and low-income families and highlight the indispensable role of family support.

The explanation for the increased risk of CDSD is probably multifactorial. First, smoking, hypertension, diabetes, nutritional factors, micronutrient deficiencies, and other risk factors shared by cardiac disease and cancer may exacerbate the CDSD risk [25,26,27]. The prolonged survival afforded by both advances in BC therapies and the young age (< 50 years) of these patients means longer exposure to the above-mentioned risk factors and consequently a higher CDSD risk. This may also explain the higher CDSD risk of patients with the HR+ subtype, who have a better prognosis and longer survival. Second, the cardiotoxicity of cancer treatments must not be neglected [28], especially that of chemotherapy agents such as anthracyclines [29, 30] and of HER-2 antagonists (e.g. trastuzumab) [31]. This may explain the higher CDSD risk of BC patients with a higher T stage and of those who did not receive surgery. Patients with advanced-stage disease may no longer be candidates for surgery and are therefore exposed to longer periods of chemotherapy, which aggravates cardiotoxicity. Finally, a growing number of studies suggest that the tumor itself can damage the systemic vasculature and the heart. Neutrophil extracellular traps, a component of cancer-induced inflammation, may accumulate in the vasculature and heart, leading to cardiovascular dysfunction [32, 33].

Although tumor management should be the top priority for patients with BC, our data highlight the importance of competing risks. Besides the tumor, BC patients have a substantial chance of dying from other causes, chief among which is cardiovascular disease. Therefore, to better identify patients at high risk of CDSD, we constructed an XGBoost model for young BC patients. Overall, the model demonstrated good performance, suggesting potential clinical value. Moreover, our findings highlight the need to take competing risks into account, particularly in the development of risk assessment tools. Our confirmation of the important role of cardiac disease as a competing risk among young BC patients also supports the growth of the new field of cardio-oncology. The limitations of this study are its retrospective nature and the deficiencies inherent in the SEER database.

5 Conclusion

We identified independent CDSD risk factors for young BC patients and constructed machine learning prognostic models to predict their CDSD risk. Our validation results indicate that the predicted probability of our XGBoost model agrees well with the actual CDSD risk, and the model may help recognize high-risk populations and guide the development of effective cardioprotection strategies. We hope that our findings can support the growth of the new field of cardio-oncology.