MINGLE: a mutual information-based interpretable framework for automatic cell type annotation in single-cell chromatin accessibility data

Li, Siyu; Huang, Yifan; Chen, Shengquan

doi:10.1186/s13059-025-03603-9

Methodology
Open access
Published: 11 June 2025

Genome Biology volume 26, Article number: 162 (2025)

Abstract

Single-cell chromatin accessibility sequencing (scCAS) has proven invaluable for investigating the intricate landscape of epigenomic heterogeneity. We propose MINGLE, a mutual information-based interpretable framework that leverages cellular similarities and topological structures for accurate cell type annotation of scCAS data. Additionally, we introduce a convex hull-based strategy to effectively identify novel cell types. Extensive experiments demonstrate MINGLE’s superior annotation performance, particularly for rare and novel cell types, delivering valuable biological insights compared to existing methods. Moreover, MINGLE excels in cross-batch, cross-tissue, and cross-species scenarios, showing robustness to data imbalance and size, highlighting its versatility for complex annotation tasks.

Background

Over the past decade, single-cell sequencing technologies have achieved remarkable advancements, significantly enhancing our understanding of cellular heterogeneity across diverse biological systems [1]. Notably, single-cell chromatin accessibility sequencing (scCAS) technologies, which profile chromatin accessibility to capture the chromatin regulatory landscape that governs transcription, have developed rapidly, providing new insights into the landscape of epigenomic heterogeneity and the intricate mechanisms underlying gene regulation at the single-cell resolution [2,3,4].

Many downstream analyses of scCAS data, such as differentially accessible peak analysis, cell-type-specific motif discovery, and transcription factor activity inference, require accurate cell type annotation beforehand. Traditionally, a prevalent cell type annotation technique involves grouping cells into clusters and then manually labeling these clusters based on peaks associated with marker genes [5,6,7]. While widely used, this approach suffers from several notable limitations. First, this clustering-based method often struggles to handle rare cell populations. Small or subtle clusters may be overlooked or incorrectly merged into larger, more prominent groups, resulting in the loss of important biological insights. Second, the manual process is inherently complex and highly subjective, often leading to inconsistent annotations that depend heavily on the expertise and judgment of biologists. Third, as the scale of scCAS data continues to expand with high-throughput sequencing technologies, manual annotation becomes prohibitively labor-intensive and impractical for large datasets. These challenges underscore the pressing need for automatic cell type annotation methods that can efficiently leverage well-annotated scCAS datasets to accurately annotate cells in newly generated datasets.

Although numerous automatic cell type annotation methods have been developed for single-cell transcriptomic data [8,9,10,11,12], the effectiveness of these methods is significantly diminished when applied to scCAS data due to the high noise and extreme sparsity inherent in scCAS data [13, 14]. To address the unique challenges posed by scCAS data, researchers have recently developed several tailored methods. For example, Chen et al. proposed the first method tailored to scCAS data, EpiAnno, which leverages a Bayesian neural network to embed cells into a latent space and perform cell type annotation [14]. More recently, Zeng et al. developed SANGO, a state-of-the-art method for annotating cells in scCAS data by integrating DNA sequence information [15]. Additionally, conventional machine learning methods such as support vector machine (SVM), random forest (RF), and K-nearest neighbors (KNN) have also shown excellent and robust performance in cell type annotation [16, 17].

However, there are still significant limitations to existing methods: (1) With the growing proliferation of scCAS datasets spanning multiple tissues and leveraging diverse sequencing technologies, existing methods face significant challenges in consistently achieving high performance across these heterogeneous datasets, particularly struggling with annotating rare cell types in the dataset. (2) Existing methods fail to fully exploit the intricate cellular similarities and topological structures inherent in scCAS data, which are crucial for capturing subtle cell-to-cell relationships and accurately annotating cell types. (3) It is common to encounter novel cell types that are not present in the training set but appear in the test set during cell type annotation, which can provide valuable insights into previously unrecognized cellular differentiation and lineage relationships. However, existing methods designed for scCAS data lack the capability to identify novel cell types. (4) Many of the existing cell type annotation methods lack interpretability, which impedes researchers from deriving meaningful insights into the underlying cellular functions and dynamic processes reflected in the annotations.

To fill these gaps, we propose MINGLE, a mutual information-based interpretable framework that leverages similarities and topological structures among cells for accurate cell type annotation. To accurately annotate rare cell types in scCAS datasets, MINGLE first implements a masking-based class balancing strategy, which is inspired by the idea of masked autoencoders (MAE) [18]. Subsequently, MINGLE utilizes contrastive learning and graph convolutional networks (GCN) to perform cell type annotation based on the similarities and topological structures among cells. We also introduce a convex hull-based approach for identifying novel cell types that appear only in the test set. Finally, to enhance the interpretability of the annotation results and extract biologically meaningful insights, we incorporate a mutual information-based strategy [19]. By optimizing a binary feature selector based on mutual information between selected peaks and the corresponding annotation results, MINGLE can identify cell-type-specific peaks, providing a deeper understanding of the regulatory features associated with each cell type. Using a diverse range of scCAS datasets across various tissues, species, and sequencing platforms, we demonstrate that MINGLE significantly outperforms existing methods in annotating known cell types, especially rare cell types. We also show that MINGLE can identify novel cell types in scCAS datasets effectively, facilitating the discovery of previously unknown biological entities and the identification of new therapeutic targets or biomarkers. Comprehensive downstream analyses, including partitioned heritability analysis, tissue-specific expression enrichment, and Genomic Regions Enrichment of Annotations Tool (GREAT) analysis, further confirm that MINGLE is not only accurate but also interpretable, enabling a deeper understanding of cell-type-specific regulatory mechanisms and disease-associated genetic variants. More importantly, MINGLE excels in cross-batch, cross-tissue, and cross-species annotations, and demonstrates robustness to datasets with varying imbalance degrees and data sizes, highlighting its versatility and reliability in handling complex real-world scenarios.

Results

Overview of MINGLE

The main workflow of MINGLE encompasses data preprocessing, model training, novel cell type identification, and model interpretation (Fig. 1). Given raw scCAS training data with cell type labels, MINGLE first performs data preprocessing and implements a masking-based class balancing strategy to handle rare cell types in scCAS datasets. Specifically, we apply downsampling to major cell types, while applying oversampling based on a masking strategy to rare cell types (Methods). Subsequently, MINGLE utilizes contrastive learning to derive low-dimensional representations of cells. The core principle is to train a multi-layer perceptron (MLP) to generate low-dimensional representations by maximizing the similarity of cells from the same cell type while minimizing the similarity of cells from different cell types. The trained MLP is then used to generate the final low-dimensional representations of cells in the training and test sets. We perform the first-round cell type annotation based on the similarities between cells in the training and test sets in the low-dimensional space. To further capture the topological structures among training and test cells, MINGLE constructs a K-nearest neighbors graph, where each node represents a cell and each edge represents the similarity between cells in the low-dimensional space. MINGLE then applies graph convolutional networks (GCN) for semi-supervised training, from which we obtain the second-round annotation results. Finally, MINGLE integrates the results from both rounds of annotation to produce the final annotations (Methods).
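The masking-based class balancing step can be illustrated with a minimal sketch. The exact procedure is given in Methods and is not reproduced here, so the function name `balance_by_masking`, the per-type `target` size, the `mask_ratio` value, and the choice to set masked entries to zero are all assumptions made for illustration only.

```python
import random
from collections import defaultdict


def balance_by_masking(cells, labels, target, mask_ratio=0.2, seed=0):
    """Return a class-balanced (cells, labels) pair.

    cells: list of accessibility vectors (lists of 0/1 values or counts).
    target: hypothetical number of cells to keep per cell type.
    mask_ratio: hypothetical fraction of entries zeroed in each synthetic cell.
    """
    rng = random.Random(seed)
    by_type = defaultdict(list)
    for vec, lab in zip(cells, labels):
        by_type[lab].append(vec)

    out_cells, out_labels = [], []
    for lab, vecs in by_type.items():
        if len(vecs) >= target:
            # Major cell type: random downsampling without replacement.
            chosen = rng.sample(vecs, target)
        else:
            # Rare cell type: keep the originals, then add masked copies.
            chosen = [list(v) for v in vecs]
            while len(chosen) < target:
                copy = list(rng.choice(vecs))
                n_mask = int(round(mask_ratio * len(copy)))
                for j in rng.sample(range(len(copy)), n_mask):
                    copy[j] = 0  # assumption: a masked peak is set to zero
                chosen.append(copy)
        out_cells.extend(chosen)
        out_labels.extend([lab] * len(chosen))
    return out_cells, out_labels
```

In this sketch every cell type ends up with exactly `target` cells, and each synthetic rare cell differs from its source cell only in the randomly masked entries.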

To identify novel cell types that appear only in the test set, which are crucial for uncovering previously unrecognized cellular entities and advancing our understanding of cellular differentiation, we introduce a convex hull-based approach. We construct multiple convex hulls for each known cell type in the low-dimensional subspaces and classify test cells as novel if they lie outside the convex hulls of all known cell types (Methods).
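The geometric idea behind the convex hull test can be sketched in two dimensions. This is a simplified illustration, not the procedure from Methods: it assumes a single 2D embedding rather than multiple subspaces, and the helper names `convex_hull`, `inside`, and `flag_novel` are hypothetical.

```python
def cross(o, a, b):
    """Z-component of the cross product (a - o) x (b - o)."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_hull(points):
    """Andrew's monotone chain; returns hull vertices in counter-clockwise order."""
    pts = sorted(set(map(tuple, points)))
    if len(pts) <= 2:
        return pts
    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]


def inside(hull, p):
    """True if p lies inside or on the boundary of a counter-clockwise convex polygon."""
    n = len(hull)
    if n < 3:
        return False  # degenerate hull: treated as empty in this sketch
    return all(cross(hull[i], hull[(i + 1) % n], p) >= 0 for i in range(n))


def flag_novel(train_embed, train_labels, test_embed):
    """Flag a test cell as novel when it falls outside the hull of every known type."""
    hulls = {}
    for lab in set(train_labels):
        pts = [e for e, l in zip(train_embed, train_labels) if l == lab]
        hulls[lab] = convex_hull(pts)
    return [not any(inside(h, p) for h in hulls.values()) for p in test_embed]
```

A test cell embedded between two well-separated known cell types falls outside both hulls and is flagged, whereas a cell inside either hull keeps a known label.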

Finally, to interpret the annotation results of MINGLE and extract biologically meaningful insights, we devise a mutual information-based strategy. By optimizing a binary feature selector based on the mutual information between selected peaks and their corresponding annotation results, MINGLE can derive importance scores for each peak reflecting its contribution to prediction. Peaks with the top K highest importance scores in each cell type are identified as cell-type-specific peaks, which can offer deeper insights into the regulatory features associated with each cell type (Methods).
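To give a sense of what mutual information measures in this setting, the sketch below computes the empirical mutual information between each individual binary peak and the annotation labels and ranks peaks by it. This is a deliberately simplified stand-in: it is not MINGLE's optimized binary feature selector, and the function names are hypothetical.

```python
import math
from collections import Counter


def mutual_information(peak, labels):
    """Empirical mutual information (in nats) between a binary peak and discrete labels."""
    n = len(labels)
    joint = Counter(zip(peak, labels))
    px = Counter(peak)
    py = Counter(labels)
    mi = 0.0
    for (x, y), c in joint.items():
        # p(x, y) * log( p(x, y) / (p(x) * p(y)) ), written with raw counts
        mi += (c / n) * math.log(c * n / (px[x] * py[y]))
    return mi


def rank_peaks_by_mi(matrix, labels):
    """matrix: cells x peaks (binary). Returns peak indices, most informative first."""
    n_peaks = len(matrix[0])
    scores = [
        mutual_information([row[j] for row in matrix], labels)
        for j in range(n_peaks)
    ]
    return sorted(range(n_peaks), key=lambda j: scores[j], reverse=True)
```

A peak that is open in exactly one of two equally sized cell types reaches the maximum value of ln 2, while a peak whose accessibility is independent of the labels scores zero.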

MINGLE achieves accurate cell type annotation for scCAS data

To evaluate the annotation performance of MINGLE on scCAS data, we first conducted five-fold cross-validation on six scCAS datasets derived from different species and tissues and obtained using various sequencing methods. The datasets included Melanoma, SpleenA, ThymusA, ThymusB, Liver, and Heart (Additional file 1: Table S1). We compared MINGLE against the recently proposed and high-performing method SANGO [15], the first cell type annotation method specifically designed for scCAS data, EpiAnno [14], and four conventional machine learning methods, namely SVM, RF, and KNN with 9 or 50 neighbors (KNN9, KNN50), which were recommended by recent benchmark studies [16, 17] (Methods). Specifically, for each dataset, we randomly split all cells into five folds and iteratively annotated cell types for cells in each fold using the model trained with cells in the remaining four folds. As suggested by recent benchmark studies [16, 17], we adopted four metrics to evaluate the annotation performance comprehensively: accuracy (Acc), macro F1 score (Macro-F1), Cohen’s kappa value (Kappa), and the Jaccard index (Jaccard) (Additional file 1: Text S1).
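The four metrics can be computed from true and predicted labels as in the following sketch. The precise definitions used in the study are in Additional file 1: Text S1, which is not reproduced here; in particular, treating the multi-class Jaccard score as the macro-average of per-class Jaccard indices is an assumption, and `annotation_metrics` is a hypothetical helper name.

```python
from collections import Counter


def annotation_metrics(y_true, y_pred):
    """Return accuracy, macro-F1, Cohen's kappa, and macro-averaged Jaccard."""
    n = len(y_true)
    pairs = list(zip(y_true, y_pred))
    classes = sorted(set(y_true) | set(y_pred))
    acc = sum(t == p for t, p in pairs) / n

    f1s, jaccards = [], []
    for c in classes:
        tp = sum(t == c and p == c for t, p in pairs)
        fp = sum(t != c and p == c for t, p in pairs)
        fn = sum(t == c and p != c for t, p in pairs)
        denom = tp + fp + fn
        f1s.append(2 * tp / (2 * tp + fp + fn) if denom else 0.0)
        jaccards.append(tp / denom if denom else 0.0)

    # Cohen's kappa: observed agreement corrected for chance agreement.
    true_counts, pred_counts = Counter(y_true), Counter(y_pred)
    pe = sum(true_counts[c] * pred_counts[c] for c in classes) / n ** 2
    kappa = (acc - pe) / (1 - pe) if pe < 1 else 0.0

    return {
        "Acc": acc,
        "Macro-F1": sum(f1s) / len(f1s),
        "Kappa": kappa,
        "Jaccard": sum(jaccards) / len(jaccards),
    }
```

Because Macro-F1 and Jaccard average over cell types rather than over cells, a cell type that is missed entirely lowers them sharply even when it contains very few cells.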

As shown in Fig. 2A, MINGLE achieved the best or near-best performance across all six datasets, particularly excelling in the metrics of Macro-F1, Kappa, and Jaccard. SANGO, the existing state-of-the-art method, consistently showed the second-best performance, aligning with the results from the original study [15]. To determine whether MINGLE achieved significantly higher metrics than the baseline methods on the six datasets, we further performed one-sided paired Wilcoxon signed-rank tests (Additional file 1: Fig. S1). The results demonstrated that MINGLE significantly outperformed the baseline methods on all metrics. We nevertheless observed that MINGLE did not exhibit a significant advantage in the Acc metric on several datasets. This is because scCAS datasets often contain multiple rare cell types, leading to a high degree of imbalance in many datasets. The imbalance degree of a dataset is defined by estimating the normalized entropy of the cell type size distribution (Methods). For example, the ThymusA, ThymusB, and Liver datasets exhibit imbalance degrees exceeding 70% (Additional file 1: Table S1). The Acc metric, which measures overall prediction accuracy, fails to adequately capture the model’s ability to accurately annotate rare cell types. In contrast, metrics like Macro-F1 and Kappa provide a more balanced evaluation of performance across different cell types, making them more suitable for assessing the model’s effectiveness in identifying rare cell types.
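One way to turn normalized entropy into an imbalance score is sketched below. The exact formula is given in Methods and is not reproduced here; taking one minus the normalized entropy, so that larger values mean stronger imbalance, is an assumption, as is the helper name `imbalance_degree`.

```python
import math


def imbalance_degree(counts):
    """1 - H / log(K) for K cell types with the given cell counts.

    Returns 0.0 for a perfectly balanced dataset and approaches 1.0
    as a single cell type dominates. (Assumed form, not taken from Methods.)
    """
    k = len(counts)
    if k < 2:
        return 0.0
    total = sum(counts)
    probs = [c / total for c in counts if c > 0]
    entropy = -sum(p * math.log(p) for p in probs)
    return 1.0 - entropy / math.log(k)
```

Under this assumed form, four equally sized cell types give a score of zero, while a two-type dataset split 99:1 scores above 0.9.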

To examine the advantage of MINGLE in annotating rare cell types in more detail, we further calculated evaluation metrics that focus only on cells from the rare cell type in each of the six datasets. Here, we defined the rare cell type as the one with the smallest proportion in the dataset, and treated cells belonging to the rare cell type as positive samples and all other cells as negative samples in the test set to assess binary classification performance. We employed three commonly used binary classification metrics, namely recall, precision, and F1-score, and compared MINGLE with the state-of-the-art method SANGO. We observed that while the overall annotation performance of MINGLE on some datasets, such as the Heart dataset (Fig. 2A), is comparable to that of SANGO, MINGLE shows a significant advantage over SANGO in annotating rare cell types (Additional file 1: Fig. S2). This advantage can likely be attributed to the masking-based class balancing strategy employed by MINGLE, which makes it particularly effective in addressing the challenges posed by imbalanced datasets and accurately identifying rare cell types. We have also demonstrated that MINGLE’s performance is robust to variations in the masking ratio across most datasets (Additional file 1: Fig. S3).

In summary, MINGLE showcases outstanding performance in accurately annotating cell types across diverse scCAS datasets, spanning various species, tissues, and levels of imbalance. MINGLE is particularly superior in annotating rare cell types, highlighting its remarkable capacity to discern subtle variations within complex datasets.

MINGLE effectively identifies novel cell types in scCAS data

In the task of cell type annotation, it is common to encounter cell types that are not present in the training set but appear in the test set, namely novel cell types. These novel cell types can reveal insights into cellular differentiation and lineage relationships previously unrecognized. Hence, when novel cell types appear in the test set, we expect the model to accurately identify the cells belonging to these novel cell types and annotate them as “Novel” rather than assigning them labels with existing cell types in the training set.

To assess the ability of MINGLE to identify novel cell types, we again used the six scCAS datasets from the previous section. Specifically, to simulate the presence of a novel cell type in the test set, we first divided the dataset into a training set (80%) and a test set (20%). Then, we removed one cell type from the training set, allowing it to appear only in the test set, and treated this cell type as a novel cell type. We repeated this process iteratively for each cell type in each dataset. Note that most existing cell type annotation methods tailored to scCAS data (e.g., EpiAnno and SANGO) are not able to identify novel cell types. We therefore compared MINGLE with SVM with rejection option (SVMrejection), which was shown to accurately identify novel cell types in single-cell transcriptomic data in a recent benchmark study [17]. To better assess the model's performance in identifying novel cell types, we treated this task as a binary classification task, where cells belonging to the novel type were considered positive samples and all other cells were considered negative samples. We used three common binary classification metrics, namely recall, precision, and F1-score, for evaluation. As shown in Fig. 2B, when different cell types were considered as novel cell types, MINGLE showed high recall across all datasets, demonstrating that MINGLE can successfully identify the vast majority of cells belonging to the novel cell type. However, we observed that in terms of precision, defined as the ratio of true positive predictions to the total number of positive predictions made by the model, MINGLE did not always have an advantage. This indicates that MINGLE may occasionally label some cells of known types as novel. Nevertheless, in terms of F1-score, which balances recall and precision, MINGLE consistently demonstrated superior overall performance.
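The hold-out simulation and its binary evaluation can be sketched as follows. This is an illustrative outline under stated assumptions: the split is a plain random 80/20 split, the function names are hypothetical, and cells predicted as novel are assumed to carry the tag "Novel".

```python
import random


def holdout_novel_split(labels, novel_type, test_frac=0.2, seed=0):
    """Random train/test split in which novel_type is removed from training only."""
    rng = random.Random(seed)
    idx = list(range(len(labels)))
    rng.shuffle(idx)
    n_test = int(round(test_frac * len(idx)))
    test_idx = idx[:n_test]
    train_idx = [i for i in idx[n_test:] if labels[i] != novel_type]
    return train_idx, test_idx


def novel_detection_scores(true_labels, predicted, novel_type, novel_tag="Novel"):
    """Precision, recall, and F1 with the held-out cell type as the positive class."""
    pairs = list(zip(true_labels, predicted))
    tp = sum(t == novel_type and p == novel_tag for t, p in pairs)
    fp = sum(t != novel_type and p == novel_tag for t, p in pairs)
    fn = sum(t == novel_type and p != novel_tag for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```

For example, if both held-out cells are flagged as "Novel" but one cell of a known type is flagged as well, recall is 1.0 while precision drops to 2/3, which mirrors the recall and precision pattern described above.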

Moreover, an effective annotation method should not only accurately identify novel cell types but also ensure correct annotation of other cells belonging to known cell types. To delve deeper into this aspect, we further analyzed the annotation results of MINGLE and SVMrejection (Additional file 1: Fig. S4). Taking the ThymusA dataset as an example, when a cell type such as “antigen presenting cells” was regarded as the novel cell type, MINGLE not only successfully identified cells belonging to the novel cell type but also accurately annotated the majority of cells from the other cell types present in the training set. In contrast, SVMrejection was unable to achieve both tasks simultaneously (Fig. 2C).

In conclusion, MINGLE shows superior performance in identifying novel cell types and maintaining accurate annotation across existing cell types in the training set, facilitating the discovery of previously unrecognized disease biomarkers and novel drug targets.

MINGLE enables interpretable cell type annotation

In MINGLE, we devise a mutual information-based interpretation strategy to interpret the model predictions. By optimizing a binary feature selector based on the mutual information between selected peaks and their corresponding annotation results, MINGLE can derive importance scores for each peak based on its contribution to prediction. Peaks with the top K highest importance scores in each cell type are identified as cell-type-specific peaks. To further demonstrate that MINGLE-identified cell-type-specific peaks can provide valuable insights into biological mechanisms, we conducted extensive downstream analyses on the cell-type-specific peaks identified by MINGLE in the ThymusA dataset. The ThymusA dataset includes four cell types: thymocytes, vascular endothelial cells, antigen presenting cells, and thymic epithelial cells, all of which are closely associated with immune system diseases. We selected the top 2000 cell-type-specific peaks for each cell type based on the importance scores provided by MINGLE, and formed the set of background peaks by excluding the union of these cell-type-specific peaks from the complete peak set. The term “complete peaks” refers to the full set of peaks after initial filtering.
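The selection of cell-type-specific peaks and the construction of the background set amount to a top-K ranking followed by a set difference. The sketch below assumes that importance scores are available as one list per cell type; the function names and the toy scores in the usage note are hypothetical.

```python
def top_k_peaks(scores, k):
    """scores: dict mapping cell type -> list of per-peak importance scores.

    Returns a dict mapping each cell type to the index set of its k
    highest-scoring peaks.
    """
    return {
        cell_type: set(
            sorted(range(len(s)), key=lambda j: s[j], reverse=True)[:k]
        )
        for cell_type, s in scores.items()
    }


def background_peaks(n_peaks, specific):
    """All peak indices that are not cell-type-specific for any cell type."""
    union = set().union(*specific.values())
    return set(range(n_peaks)) - union
```

With five peaks, scores of `[0.9, 0.1, 0.8, 0.0, 0.2]` for one cell type and `[0.1, 0.7, 0.9, 0.0, 0.3]` for another, and `k=2`, the specific sets are `{0, 2}` and `{1, 2}`, leaving peaks 3 and 4 as background.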

First, we examined the overlap of the MINGLE-identified cell-type-specific peaks using a heatmap. As shown in Additional file 1: Fig. S5A, the pairwise overlaps among cell types range from 23.5 to 28.5% (calculated as the proportion of shared peaks relative to the union of peaks). Furthermore, we quantified the Jaccard similarity index among these peaks to rigorously assess their uniqueness. As shown in Additional file 1: Fig. S5B, the Jaccard indices confirm that less than 17% of peaks are shared between any two cell types, reinforcing the biological specificity of the prioritized peaks. The 20–30% overlap in cell-type-specific peaks across distinct cell types (Additional file 1: Fig. S5A) is biologically expected, as these cell types do not function in isolation but rather interact within complex functional networks. For example, antigen presenting cells activate mature T cells derived from thymocytes, and during this immune response process, both may share regulatory regions of genes related to immune signaling (such as cytokines or co-stimulatory molecules) [20, 21]. Thymic epithelial cells, which shape the thymic microenvironment for thymocyte development, collaborate closely with thymocytes, and the two share regulatory peaks for genes involved in thymic development [22, 23]. Even vascular endothelial cells, though primarily managing nutrient transport, assist in moving thymocytes to the thymus and antigen presenting cells to immune sites, which involves shared regulation of genes related to cell adhesion molecules [24].

Second, the cell-type-specific peaks identified by MINGLE effectively capture and quantify heritability enrichment. We quantified the enrichment of heritability for four different fundamental diseases of the immune system within each set of cell-type-specific peaks and the set of background peaks using partitioned linkage disequilibrium score regression (LDSC) [25]. As shown in Fig. 3A, the cell-type-specific peaks in the ThymusA dataset exhibited strong enrichment of heritability for immune diseases compared to the background peaks. Specifically, the enrichment results suggest that systemic lupus erythematosus (Lupus) may be linked to the dysregulation or hyperactivation of antigen presenting cells and thymic epithelial cells. The high enrichment in antigen presenting cells could imply an increased frequency of self-antigen presentation to T cells, potentially contributing to the overactivation of autoimmune responses. Moreover, the results also showed a strong correlation between antigen presenting cells and ulcerative colitis.

Third, the cell-type-specific peaks identified by MINGLE provide tissue-specific expression enrichment. We utilized the cell-type-specific peaks identified by MINGLE and the background peaks to perform SNPsea analysis with default settings [26]. The enrichments of tissue-specific expression in profiles across 79 tissues were quantified. We illustrated the top 30 significantly enriched tissues in Fig. 3B, and observed that tissues related to thymus showed significant enrichment within the MINGLE-identified cell-type-specific peaks, while the background peaks exhibited much less enrichment. This suggests that the cell-type-specific peaks identified by MINGLE offer clear tissue specificity and better capture the cell heterogeneity in related tissues.

Fourth, the cell-type-specific peaks identified by MINGLE can reveal the functional implications of cell subpopulations. We used the cell-type-specific peaks to perform GREAT analysis to identify significant biological processes associated with each of the cell types [27]. As shown in Fig. 3C, GREAT analysis revealed that the cell-type-specific peaks of thymocytes were associated with biological processes including regulation of gene expression and epigenetics, and DNA replication-dependent nucleosome assembly, which align with the known biological functions of thymocytes [28, 29]. The relationship between antigen presenting cells and the biological processes of nucleosome organization and chromatin silencing was also uncovered, which is consistent with the critical role of antigen presenting cells in activating T cells and their involvement in immune regulatory processes [30, 31]. The results also revealed the regulatory capacity of thymic epithelial cells under various stress conditions, including regulation of DNA-templated transcription in response to stress and regulation of transcription from RNA polymerase II promoter in response to stress [32,33,34].

We also conducted additional comparative analyses with EpiScanpy [35], a widely used tool for detecting differentially accessible peaks [36,37,38] (Additional file 1: Text S2). The results demonstrated the superior performance of MINGLE over EpiScanpy in identifying biologically meaningful cell-type-specific peaks, particularly for rare or less abundant cell types (Additional file 1: Fig. S6). Taken together, these findings suggest that MINGLE offers valuable insights into genetic and cellular mechanisms, thereby enhancing our understanding of cell-type-specific complex biological processes.

MINGLE is superior in cross-batch, cross-tissue, and cross-species annotation

In the previous sections, we have thoroughly demonstrated the advantages of MINGLE from different perspectives. In this section, we further evaluate the model performance using independent training and test sets, which is a more practical scenario.

We first consider the performance of MINGLE in cross-batch annotation. Batch effects are a common challenge in single-cell data analysis, where technical variations between different experiments can lead to discrepancies in cell type annotations. Therefore, it is crucial to develop methods that can robustly annotate cell types when the training and test sets are from different batches, effectively accounting for these technical differences. Here, we utilized the ThymusA dataset, which includes three distinct batches derived from different donors, to conduct cross-batch annotation experiments. Specifically, we trained the model on two batches and tested it on the remaining batch. As shown in Fig. 4A, MINGLE exhibited superior performance in annotating cell types when the training and test sets originated from different batches. Similarly, SANGO, EpiAnno, and SVM also achieved relatively good performance, indicating their robustness to cross-batch variations. However, the other three methods struggled significantly, exhibiting poor accuracy in cross-batch annotations, which highlights their limitations in handling datasets with pronounced batch effects.
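The cross-batch protocol can be expressed as a leave-one-batch-out split. In this sketch, rotating the held-out batch over all batches is an assumption (the text states only that two batches were used for training and the remaining one for testing), and the generator name is hypothetical.

```python
def leave_one_batch_out(batches):
    """Yield (held_out_batch, train_idx, test_idx) for every distinct batch label.

    batches: list with one batch label per cell, e.g. a donor identifier.
    """
    for held_out in sorted(set(batches)):
        train_idx = [i for i, b in enumerate(batches) if b != held_out]
        test_idx = [i for i, b in enumerate(batches) if b == held_out]
        yield held_out, train_idx, test_idx
```

With three donors this yields three train/test configurations, each training on the cells of two donors and testing on the cells of the third.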

In addition, given the potential challenges in obtaining well-labeled training sets from the same tissue as the test set, it is crucial to develop methods capable of training on well-labeled datasets from one tissue and accurately annotating cell types in another tissue, a task referred to as cross-tissue annotation. To assess the model’s performance in cross-tissue annotation, we utilized four scCAS datasets from three different mouse tissues (LungA, LungB, Spleen, and ThymusB). The three tissues interact with one another within the immune system, collaboratively ensuring the body’s overall health. To enable cross-tissue annotation, we retained the cell types common to the training and test sets, following SANGO [15]. Upon further examination of the common cell types across these tissues, we found that key immune cells such as T cells, B cells, and macrophages are present in each tissue (Fig. 4B). This indicates that there exists a shared core population of cells across different immune-related tissues, which may drive similar immune responses and provide a reliable foundation for cross-tissue prediction. Specifically, we trained the models separately on the first three datasets and then used the three trained models to annotate the ThymusB dataset. As shown in Fig. 4C, MINGLE consistently maintained superior annotation performance even when the training and test sets were derived from different tissues. Additionally, SVM demonstrated robust performance, ranking as the second most effective method in cross-tissue annotation. Notably, methods specifically designed for scCAS data, such as EpiAnno and SANGO, failed to deliver comparable results, highlighting the limitations of existing cell type annotation methods in cross-tissue annotation scenarios.

Furthermore, we explored the model's performance in a cross-species scenario while focusing on the same tissue. To conduct cross-species annotation experiments, we additionally collected new datasets from both human and mouse brains [5, 39], where the human brain dataset contains three separate batches (Methods). Since the human brain dataset contains a larger number of cells and captures broader biological diversity due to its clinical relevance and comprehensive annotations, we trained the model on each of the three batches of the human brain dataset and tested it on the mouse brain dataset. Notably, EpiAnno encountered memory limitations when training set sizes exceeded 20,000 cells, preventing its evaluation under these conditions. The results indicated that MINGLE consistently outperformed the baseline methods across the three cross-species experiments, achieving higher Acc, Macro-F1, and Jaccard scores (Fig. 4D). Nevertheless, the cross-species prediction scores for all methods are relatively low, suggesting that direct transfer learning does not generalize effectively between single-cell data from human and mouse brains. This observation underscores the fundamental challenges inherent in cross-species analysis and highlights the necessity of developing domain-specific adaptation strategies to address these challenges.

In summary, the superior performance of MINGLE across cross-batch, cross-tissue, and cross-species experiments highlights its exceptional adaptability and effectiveness in real-world annotation scenarios.

MINGLE demonstrates robustness to imbalance degrees and data sizes

scCAS datasets often exhibit varying degrees of imbalance, making the cell type annotation task highly challenging. We have shown the performance of MINGLE on datasets with various imbalance degrees in the previous section. To further explore the robustness of MINGLE in handling scCAS datasets with different imbalance degrees, we conducted an in-depth analysis using a specific dataset. Taking the ThymusA dataset as an example, we randomly removed different numbers of cells belonging to a randomly selected cell type to generate artificial datasets with varying degrees of imbalance. In total, we generated five artificial datasets with imbalance degrees ranging from 0.153 to 0.887. We then evaluated model performance on each of these artificial datasets, allowing us to systematically investigate whether MINGLE performs well under different degrees of imbalance.

As shown in Fig. 5A, MINGLE demonstrated consistently stable performance across various degrees of imbalance and excelled over other methods, particularly when facing high degrees of imbalance. Additionally, we observed that as the degree of imbalance increased, all methods exhibited a downward trend in metrics other than accuracy. This decline can be attributed to the increasing difficulty of annotating cell types as the imbalance becomes more pronounced. More specifically, when the imbalance degree exceeded 0.5, all methods except MINGLE showed a sharp decline across the three metrics, with the state-of-the-art method SANGO experiencing a particularly noticeable drop, indicating that SANGO struggles to provide accurate annotations on highly imbalanced datasets. Conversely, we noted that the Acc metric generally increased as the imbalance degree increased, which is an anticipated outcome given the imbalance between the major and rare cell types: Acc often provides an overly optimistic assessment of model performance in such imbalanced scenarios and fails to accurately capture the model's ability to handle minority classes.

Moreover, single-cell datasets typically contain varying numbers of cells. Existing methods specifically designed for scCAS data are often based on deep learning and generally rely on large training datasets to achieve optimal performance. However, the sizes of single-cell datasets can vary significantly across experiments, with some datasets having abundant cells while others have only a limited number. Therefore, assessing the robustness of methods across datasets of varying sizes is necessary. We again used the ThymusA dataset, which contains 21,499 cells, as an example to explore the robustness of MINGLE in handling scCAS datasets with varying numbers of cells. Specifically, we generated artificial datasets by sampling the dataset with sampling rates ranging from 20 to 80%. A lower sampling rate means fewer cells in the artificial dataset. We performed sampling for each cell type individually to preserve the original degree of imbalance. As shown in Fig. 5B, with varying numbers of cells in the dataset, MINGLE achieved significantly better performance than the baseline methods in almost all cases, demonstrating its enhanced generalization capabilities. We also provide an explanation for the decline in performance observed with more training data (Additional file 1: Text S3 and Additional file 1: Fig. S7).
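Sampling each cell type individually, so that the original proportions are preserved, can be sketched as follows. The function name is hypothetical, and keeping at least one cell per type is an added safeguard for this illustration rather than a detail taken from the study.

```python
import random
from collections import defaultdict


def stratified_subsample(labels, rate, seed=0):
    """Return sorted indices of a per-cell-type random subsample.

    labels: one cell type label per cell.
    rate: fraction of each cell type to keep, e.g. 0.2 for a 20% sample.
    """
    rng = random.Random(seed)
    by_type = defaultdict(list)
    for i, lab in enumerate(labels):
        by_type[lab].append(i)

    keep = []
    for idx in by_type.values():
        # Assumption: never drop a cell type completely.
        n_keep = max(1, int(round(rate * len(idx))))
        keep.extend(rng.sample(idx, n_keep))
    return sorted(keep)
```

Because every cell type is reduced by the same factor, the cell type proportions, and hence the imbalance degree, stay essentially unchanged across sampling rates.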

Furthermore, as the data size increases, a significant challenge is the reduction in computational efficiency, especially for deep learning-based methods. An ideal model should achieve high annotation performance while maintaining computational efficiency. To address this, we further compared the running time of MINGLE with the two existing deep learning methods tailored to scCAS data (i.e., EpiAnno and SANGO) across varying data sizes. The results showed that MINGLE exhibited superior computational efficiency compared to the existing methods across different data sizes (Additional file 1: Fig. S8).

Taken together, MINGLE demonstrates superior robustness and high computational efficiency in handling scCAS datasets with varying degrees of imbalance and data sizes, highlighting its potential as a critical tool for cell type annotation in complex biological datasets.

The integration strategy in MINGLE is effective

In MINGLE, we incorporate an integration strategy which combines the two rounds of annotations from contrastive learning and GCN to yield improved final prediction results. To demonstrate the effectiveness of this integration strategy, we further conducted ablation experiments by comparing three different settings: straightforward annotation using the first-round results based on contrastive learning (CL), straightforward annotation using the second-round results based on GCN, and annotation through an integration of contrastive learning and GCN (MINGLE). Note that the “straightforward annotation using the second-round results based on GCN” refers to a scenario where the cell type annotations are determined solely based on the output of the GCN model $\widehat{\mathbf Y}^{\mathrm{test}\_ \text{GCN}}$, without integrating the first-round results obtained through contrastive learning. Here we again took the dataset ThymusA as an example to compare the model performance under the three settings.

As shown in Fig. 5C, CL exhibited the poorest performance across the three settings, which can be primarily attributed to the limitations of contrastive learning when used in isolation, as it may struggle to handle the complexity and variability inherent in the scCAS datasets. Conversely, GCN, when used without integrating it with contrastive learning annotation results, performed relatively better in annotation tasks. This improvement is largely due to the capability of GCN to effectively capture and utilize the topological structures inherent in the dataset, thereby enhancing its predictive accuracy and providing robustness against diverse characteristics of the dataset. After combining the contrastive learning and GCN results, we observed that the annotation performance was further improved, particularly in the Kappa and Jaccard metrics. This indicated the efficacy of combining contrastive learning with GCN in boosting model performance, demonstrating a synergistic effect that leverages the strengths of both approaches. We also explored an adaptive weighting scheme to integrate the results from CL and GCN, and assessed the impact of different weighting strategies on model performance. The results showed that equal weighting yields the most reliable and robust results, providing a balanced representation of both techniques (Additional file 1: Text S4 and Additional file 1: Fig. S9).

Discussion

The advent of single-cell chromatin accessibility sequencing (scCAS) has opened new avenues for understanding the intricate epigenomic landscape and the regulatory mechanisms underlying gene expression at the single-cell level. However, accurate cell type annotation for scCAS data is still a crucial yet challenging task. Existing methods often fail to fully capitalize on the similarities and topological relationships between cells, struggle to identify novel cell types, and lack the interpretability required for deeper biological insights. To efficiently and automatically annotate cell types in scCAS data, we propose MINGLE, a mutual information-based interpretable framework leveraging the inherent similarities and topological structures of cells for accurate annotation. With comprehensive experiments across a variety of datasets from multiple tissues, species, and sequencing technologies, we have shown the superiority of MINGLE in annotating cell types for scCAS data.

While MINGLE has shown great promise, there are several aspects where it could be further enhanced. Firstly, we can leverage the extensive existing scRNA-seq datasets to provide foundational prior knowledge for the model. Secondly, the method could be extended to annotate cell types across other epigenomic data types, such as single-cell DNA methylation data. Finally, incorporating continuous learning would enable the model to continuously integrate newly generated datasets.

Conclusions

MINGLE is a mutual information-based interpretable framework that leverages similarities and topological structures among cells for accurate cell type annotation. We also devise an innovative convex hull-based identification approach for identifying novel cell types. Using a diverse range of scCAS datasets across various tissues, species, and sequencing platforms, we demonstrate that MINGLE significantly outperforms existing methods in annotating known cell types, especially rare cell types. We also show that MINGLE can identify novel cell types in scCAS datasets effectively, enabling the discovery of previously unknown biological entities and the identification of new therapeutic targets or biomarkers. Comprehensive downstream analyses further confirm that MINGLE is not only accurate but also interpretable, proving its utility as a powerful tool for advancing single-cell epigenomics and contributing to broader biological discoveries. More importantly, MINGLE excels in cross-batch, cross-tissue, and cross-species annotations, and is robust to datasets with varying imbalance degrees and data sizes, highlighting its versatility and reliability in handling complex real-world scenarios.

MINGLE offers the following key advantages: (1) MINGLE introduces a masking-based class balancing strategy, which is inspired by the idea of masked autoencoders (MAE), to handle rare cell types in scCAS data. (2) MINGLE utilizes contrastive learning and graph convolutional networks (GCN) to perform cell type annotation based on the similarities and topological structures among cells, which can capture subtle heterogeneity between cell types effectively. (3) MINGLE introduces a convex hull-based identification approach for identifying novel cell types only appearing in the test set, which are crucial for uncovering previously unrecognized cellular entities and advancing our understanding of cellular differentiation. (4) MINGLE incorporates a mutual information-based interpretation scheme to interpret the annotation results and extract biologically meaningful insights.

Methods

Data preprocessing

Given a cell-by-peak scCAS count matrix $\mathbf X^{\mathrm{raw}}\in\mathbb{R}^{n_0\times p_0}$, where $n_0$ represents the number of cells and $p_0$ represents the number of peaks, we first filter out the peaks accessible in fewer than 1% of the cells to reduce noise [38, 40, 41], and obtain the filtered count matrix $\mathbf X^{\mathrm{filtered}}\in\mathbb{R}^{n_0\times p}$. Afterwards, we utilize the term frequency-inverse document frequency (TF-IDF) transformation to reweight peaks, which has been commonly used in scCAS data analysis [14, 36, 42, 43]. The $j$-th peak of the $i$-th cell will be processed by TF-IDF to:

$$x_{ij}'=\frac{x_{ij}^{\mathrm{filtered}}}{\sum_{j=1}^{p}x_{ij}^{\mathrm{filtered}}}\log\left(\frac{n_0}{\sum_{i=1}^{n_0}x_{ij}^{\mathrm{filtered}}}\right),$$

then, it is normalized via:

$$x_{ij}^{\mathrm{processed}}=\frac{x_{ij}'}{\sqrt{\sum_{j=1}^{p}\left(x_{ij}'\right)^2}}$$

The TF-IDF transformation serves two key purposes in scCAS data analysis: (1) it normalizes sequencing depth, and (2) it emphasizes the informative features by down-weighting those that are less variable across cells, thereby enhancing the model's ability to capture meaningful biological signals. Since peaks with higher variance across different cell types in scCAS data often capture key distinctions between cell populations, assigning them higher weights improves separation between these populations. We have also validated the necessity of the TF-IDF transformation in MINGLE by comparing models trained with raw data, TF-IDF transformed data, and normalized data (Additional file 1: Fig. S10).

Using the aforementioned approach, we obtain the preprocessed set $\mathbf X^{\mathrm{processed}}\in\mathbb{R}^{n_0\times p}$. Specifically, given a raw training set with cell type labels and a raw test set without cell type labels, we first combine the two sets as a combined set to perform data preprocessing. Subsequently, we split the preprocessed combined set and obtain the preprocessed training and test sets.
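The preprocessing steps of this section (peak filtering, TF-IDF reweighting, and per-cell normalization) can be sketched as follows. This is a minimal NumPy illustration rather than the MINGLE implementation; the function name `tfidf_l2` and the toy matrix are hypothetical, and the sketch assumes that every cell retains at least one accessible peak so that no division by zero occurs.

```python
import numpy as np

def tfidf_l2(counts, min_frac=0.01):
    """Filter rare peaks, apply TF-IDF, then L2-normalize each cell (row)."""
    counts = np.asarray(counts, dtype=float)
    n_cells = counts.shape[0]
    # Keep peaks accessible in at least `min_frac` of the cells.
    keep = (counts > 0).sum(axis=0) >= min_frac * n_cells
    x = counts[:, keep]
    tf = x / x.sum(axis=1, keepdims=True)     # term frequency within each cell
    idf = np.log(n_cells / x.sum(axis=0))     # inverse document frequency per peak
    x_prime = tf * idf
    norm = np.sqrt((x_prime ** 2).sum(axis=1, keepdims=True))
    return x_prime / norm, keep

# Toy binary matrix: 4 cells x 4 peaks; the last peak is never accessible.
toy_counts = np.array([[1, 0, 1, 0],
                       [1, 1, 0, 0],
                       [0, 1, 0, 0],
                       [1, 0, 0, 0]])
processed, kept_peaks = tfidf_l2(toy_counts)
```

In the toy example the all-zero peak is removed by the filter, and each remaining row has unit Euclidean norm after the second step.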

The model of MINGLE

For the preprocessed training and test sets, we first implement class balancing based on a masking strategy to handle rare cell types. Then we utilize contrastive learning to learn low-dimensional representations of each cell and use these representations to perform a first-round annotation based on the similarity between training and test cells in low-dimensional space. To utilize topological structures among cells for accurate annotation, we further employ the low-dimensional representations to construct a K nearest neighbors graph and train a GCN model to perform a second-round annotation. Subsequently, we integrate the results from the two rounds of annotation to obtain the final predictions. Moreover, MINGLE provides a convex hull-based strategy to identify novel cell types that only appear in the test set by constructing convex hull space for each cell type in the training set. Finally, to further assist researchers in exploring the cell-type-specific regulatory mechanisms based on the predictions, we incorporate a mutual information-based strategy to identify the subset of features that most influence each prediction, thereby determining cell-type-specific peaks. The interpretation phase lays a foundation for further investigation into the underlying biological processes.

The masking-based class balancing strategy of MINGLE

To address the common issue of class imbalance in scCAS datasets, we apply a class balancing technique to the training set. Specifically, we set a threshold K of 300. For cell types with a cell count exceeding K, we perform downsampling to reduce the number of cells to K. For cell types with a cell count fewer than K, we apply an oversampling technique using a masking strategy. In this strategy, we randomly set 15% of the non-zero elements to zero to generate synthetic cells. Afterwards, we obtain the final processed training set $\mathbf X^{\mathrm{train}}\in\mathbb{R}^{n_{\mathrm{train}}\times p}$ and test set $\mathbf X^{\mathrm{test}}\in\mathbb{R}^{n_{\mathrm{test}}\times p}$.
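A possible reading of the masking-based balancing step is sketched below. It is only an illustration under stated assumptions: the function name `balance_by_masking` is hypothetical, synthetic cells are generated from real cells drawn with replacement, and rare types are oversampled until they reach the threshold. None of these details are confirmed by the text above.

```python
import numpy as np

def balance_by_masking(x, labels, threshold=300, mask_rate=0.15, seed=0):
    """Downsample abundant types and oversample rare types with masked copies."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    labels = np.asarray(labels)
    out_x, out_y = [], []
    for cell_type in np.unique(labels):
        idx = np.flatnonzero(labels == cell_type)
        if idx.size >= threshold:
            # Abundant type: keep exactly `threshold` randomly chosen cells.
            out_x.append(x[rng.choice(idx, size=threshold, replace=False)])
        else:
            # Rare type: keep the real cells and add masked synthetic cells.
            # Assumption: templates are drawn with replacement from the real cells.
            templates = x[rng.choice(idx, size=threshold - idx.size, replace=True)]
            for row in templates:
                nonzero = np.flatnonzero(row)
                n_mask = int(round(mask_rate * nonzero.size))
                row[rng.choice(nonzero, size=n_mask, replace=False)] = 0.0
            out_x.append(np.vstack([x[idx], templates]))
        out_y.extend([cell_type] * threshold)
    return np.vstack(out_x), np.array(out_y)

# Toy data: 8 cells of type "A", 2 cells of type "B", 20 peaks, threshold of 5.
toy_x = np.ones((10, 20))
toy_y = np.array(["A"] * 8 + ["B"] * 2)
x_bal, y_bal = balance_by_masking(toy_x, toy_y, threshold=5)
```

In the toy run both types end up with five cells; the three synthetic "B" cells each have 15% of their non-zero entries set to zero, while the two real "B" cells are unchanged.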

Low-dimensional representations learning with MINGLE

To effectively capture patterns specific to different cell types, we utilize contrastive learning to learn low-dimensional representations of cells and perform the first-round annotation.

Construction of the sample pairs

Contrastive learning works by using positive sample pairs, composed of cells from the same cell type, and negative sample pairs, composed of cells from different cell types, to effectively train the model to distinguish patterns between different cell types. Therefore, we first construct the positive and negative sample pairs. Specifically, for a specific cell ${\mathbf x}_{\mathit i}\boldsymbol\;\in\mathbb{R}^p$ in the processed training set $\mathbf X^{\mathrm{train}}$, we first divide all the remaining cells in $\mathbf X^{\mathrm{train}}$ into two subsets: one set denoted as $\mathbf X^{\mathrm{positive}}$, containing cells that belong to the same cell type as ${\mathbf x}_i$; and the other set denoted as $\mathbf X^{\mathrm{negative}}$, containing cells from all other cell types. Subsequently, we randomly select one cell $\mathbf x_i^{\mathrm{positive}}$ from $\mathbf X^{\mathrm{positive}}$, to construct a positive sample pair with the cell ${\mathbf x}_i$. The positive pair is denoted as $({\mathbf x}_{\mathit i},\mathbf x_i^{\mathrm{positive}})$. Similarly, we construct a negative sample pair $({\mathbf x}_i,\mathbf x_i^{\mathrm{negative}})$, where $\mathbf x_i^{\mathrm{negative}}$ is a cell randomly selected from $\mathbf X^{\mathrm{negative}}$. For each cell in the training set, we construct one positive sample pair and one negative sample pair following the steps outlined above, ultimately obtaining a total of $n_{\mathrm{train}}$ positive sample pairs and $n_{\mathrm{train}}$ negative sample pairs.
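The pair construction can be sketched as follows, assuming NumPy. The function name `build_pairs` and the toy labels are hypothetical; the sketch further assumes that every cell type contains at least two cells and that at least two cell types are present, so both candidate sets are non-empty.

```python
import numpy as np

def build_pairs(labels, seed=0):
    """For each cell i, draw one same-type partner and one different-type partner."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    all_idx = np.arange(labels.size)
    pos, neg = [], []
    for i in all_idx:
        same_type = labels == labels[i]
        # The positive candidates exclude cell i itself.
        pos.append(rng.choice(all_idx[same_type & (all_idx != i)]))
        neg.append(rng.choice(all_idx[~same_type]))
    return np.array(pos), np.array(neg)

toy_labels = np.array([0, 0, 0, 1, 1, 2, 2])
pos_idx, neg_idx = build_pairs(toy_labels)
```

Each of the seven toy cells receives exactly one positive and one negative partner, giving as many positive and negative pairs as there are training cells.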

Supervised training based on contrastive learning

After constructing the sample pairs, MINGLE utilizes an MLP to learn the low-dimensional representations of cells based on the contrastive learning strategy. The detailed structure of the MLP is provided in Additional file 1: Text S5. The reason for using an MLP here is that the MLP effectively captures complex non-linear relationships and has been successfully employed in various contrastive learning-based algorithms [44, 45]. The fundamental principle of the contrastive learning strategy is to minimize the distance between cells in positive pairs while maximizing the distance between cells in negative pairs within the low-dimensional space. For this purpose, we design a special loss function to optimize the MLP, as follows:

$$\mathrm{Loss}=\;-\log\frac{\sum_{i=1}^{n_{\mathrm{train}}\;}\left(\exp\;\left(\cos\left({\widetilde{\mathbf x}}_i,\widetilde{\mathbf x}_i^{\mathrm{positive}}\right)\right)\;+\exp\;\left(\cos\left({\overset{\boldsymbol\sim}{\mathbf x}}_i,\widetilde{\mathbf x}_i^{\mathrm{negative}}\right)\right)\;\right)}{\sum_{i=1}^{n_{\mathrm{train}}\;}\exp\;\left(\cos\left({\overset{\boldsymbol\sim}{\mathbf x}}_i,\widetilde{\mathbf x}_i^{\mathrm{negative}}\right)\right)}$$

where $\overset{\boldsymbol\sim}{\mathbf x}_i$ represents the low-dimensional representation of the $i$-th cell, and $\cos\left(\cdot,\cdot\right)$ represents the cosine similarity, which can be calculated as follows:

$$\cos\left(\alpha,\;\beta\right)\;=\;\frac{\alpha^T\beta}{\left\|\alpha\right\|\cdot\left\|\beta\right\|}$$
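To make the loss concrete, the following sketch evaluates it exactly as written above for given embeddings. It assumes NumPy and only computes the loss value; an actual training loop would need an automatic differentiation framework, which is outside this illustration. The function names and toy embeddings are hypothetical.

```python
import numpy as np

def cosine_rows(a, b):
    """Row-wise cosine similarity between two matrices of equal shape."""
    return (a * b).sum(axis=1) / (np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1))

def contrastive_loss(z, z_pos, z_neg):
    """-log( sum_i [exp(cos(z_i, z_i^pos)) + exp(cos(z_i, z_i^neg))] / sum_i exp(cos(z_i, z_i^neg)) )."""
    pos = np.exp(cosine_rows(z, z_pos))
    neg = np.exp(cosine_rows(z, z_neg))
    return -np.log((pos + neg).sum() / neg.sum())

z = np.array([[1.0, 0.0], [0.0, 1.0]])
aligned = z.copy()                # same direction: cosine similarity 1
orthogonal = z[::-1].copy()       # orthogonal direction: cosine similarity 0
loss_good = contrastive_loss(z, z_pos=aligned, z_neg=orthogonal)
loss_bad = contrastive_loss(z, z_pos=orthogonal, z_neg=aligned)
```

The loss is lower when positive pairs are aligned and negative pairs are orthogonal (`loss_good`) than in the reversed configuration (`loss_bad`), which matches the stated goal of pulling positive pairs together and pushing negative pairs apart.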

Following the aforementioned process, we can obtain a trained MLP that can effectively learn the low-dimensional representation of each cell. Within the low-dimensional space, cells of the same cell type cluster together, while cells from different types are distinctly separated. Utilizing the trained MLP, we perform the first-round annotation. Specifically, we feed the processed set $\mathbf X^{\mathrm{train}}$ and $\mathbf X^{\mathrm{test}}$ to the trained MLP, and obtain the low-dimensional representations $\overset{\boldsymbol\sim}{\mathbf X}^{\mathrm{train}}\in\mathbb{R}^{n_{\mathrm{train}}\times d}$ and $\overset{\boldsymbol\sim}{\mathbf X}^{\mathrm{test}}\in\mathbb{R}^{n_{\mathrm{test}}\times d}$ where d represents the dimension of the output of the MLP. Subsequently, we compute the algebraic centroid $\overset{\boldsymbol\sim}{\mathbf x}_k^c$ for the $k$-th cell type using $\overset{\boldsymbol\sim}{\mathbf X}^{\mathrm{train}}$ and stack the $K$ centroid vectors $\overset{\boldsymbol\sim}{\mathbf x}_k^c$ into a centroid matrix $\overset{\boldsymbol\sim}{\mathbf X}^c$ :

$$\begin{array}{c}\overset{\boldsymbol\sim}{\mathbf x}_k^c\;=\frac1{\left|J_k\right|}\sum_{i\in J_k}{\overset{\boldsymbol\sim}{\mathbf x}}_{ i},\\\overset{\boldsymbol\sim}{\mathbf X}^c\;=\begin{pmatrix}\overset{\boldsymbol\sim}{\mathbf x}_1^c\\\vdots\\\overset{\boldsymbol\sim}{\mathbf x}_K^c\end{pmatrix}\;\in\mathbb{R}^{K\times d}\end{array}$$

where $J_k$ represents the indices of cells in $\overset{\boldsymbol\sim}{\mathbf X}^{\mathrm{train}}$ that belong to the $k$-th cell type, and $K$ represents the number of cell types in training set. Afterwards, for each test cell in $\overset{\boldsymbol\sim}{\mathbf X}^{\mathrm{test}}$, we calculate the cosine similarity $s_{ik}$ between the test cell and each centroid in $\overset{\boldsymbol\sim}{\mathbf X}^{\mathrm{c}}$, as follows:

$$s_{ik}=\cos\left(\overset{\boldsymbol\sim}{\mathbf x}_i^{\mathrm{test}},\overset{\boldsymbol\sim}{\mathbf x}_k^{\mathrm c}\right),$$

then $s_{ik}$ is normalized via:

$${\widetilde s}_{ik}=\frac{s_{ik}}{\sum_{k'=1}^{K}s_{ik'}}$$

We regard ${\widetilde s}_{ik}$ as the probability that the $i$-th test cell belongs to the $k$-th cell type. This process yields a normalized cosine similarity matrix $\overset{\boldsymbol\sim}{\mathbf S}$, and we denote it as $\widehat{\mathbf Y}^{\mathrm{test}\_\mathrm{CL}}$, which serves as a first-round soft label annotation based on the similarity in the low-dimensional space.
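The centroid-based first-round annotation can be sketched as follows, assuming NumPy and integer-coded cell types. The function name `first_round_annotation` and the toy embeddings are hypothetical. Note that the row normalization only yields values interpretable as probabilities when the cosine similarities are non-negative, which the sketch assumes.

```python
import numpy as np

def first_round_annotation(z_train, y_train, z_test, n_types):
    """Soft labels from normalized cosine similarity to per-type centroids."""
    centroids = np.vstack([z_train[y_train == k].mean(axis=0) for k in range(n_types)])
    zt = z_test / np.linalg.norm(z_test, axis=1, keepdims=True)
    c = centroids / np.linalg.norm(centroids, axis=1, keepdims=True)
    s = zt @ c.T                                # s[i, k] = cos(test cell i, centroid k)
    return s / s.sum(axis=1, keepdims=True)     # each row sums to one

z_train = np.array([[1.0, 0.1], [0.9, 0.0], [0.0, 1.0], [0.1, 0.9]])
y_train = np.array([0, 0, 1, 1])
z_test = np.array([[1.0, 0.2], [0.2, 1.0]])
y_cl = first_round_annotation(z_train, y_train, z_test, n_types=2)
```

In this toy case the first test cell receives the larger soft label for type 0 and the second for type 1.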

GCN model training with MINGLE

In the contrastive learning phase, we derive annotation results for each cell in the test set based on their similarities to the training cells in the low-dimensional space using a relatively straightforward approach. To the best of our knowledge, GCN has demonstrated remarkable performance in single-cell analysis, providing accurate and robust results even in highly heterogeneous datasets [46,47,48]. Therefore, to utilize the topological structures in scCAS data for accurate cell type annotation, we further construct a GCN model using the low-dimensional representations obtained in the contrastive learning phase to conduct a second-round annotation.

Construction of the graph

In the GCN model, a crucial step involves constructing an effective graph that incorporates both the training and test sets. In the graph, nodes represent individual cells, and an edge is formed between two nodes if the two corresponding cells are neighbors. Such a graph can be represented using an adjacency matrix for subsequent computations. Here, we utilize a K nearest neighbors graph construction strategy to construct the graph. Specifically, we first construct the training graph $\mathbf A^{\mathrm{train}}\boldsymbol\;\in\mathbb{R}^{n_{\mathrm{train}}\times n_{\mathrm{train}}}$ based on the low-dimensional representations $\overset{\boldsymbol\sim}{\mathbf X}^{\mathrm{train}}$. Within the training set, we find the $m$ nearest cells to a specific cell $\overset{\boldsymbol\sim}{\mathbf x}_i^{\mathrm{train}}$ based on the cosine similarity. These $m$ nearest cells are denoted as $\left\{\overset{\boldsymbol\sim}{\mathbf x}_{i,1}^{\mathrm{train}},\;\overset{\boldsymbol\sim}{\mathbf x}_{i,2}^{\mathrm{train}},\dots,\;\overset{\boldsymbol\sim}{\mathbf x}_{i,m}^{\mathrm{train}}\right\}$. We treat these $m$ cells as the neighbors of cell $i$. Based on the results of the $m$ nearest neighbors, we create the adjacency matrix $\mathbf A^{\mathrm{train}}$, defined as:

$$A_{ij}^{\mathrm{train}}=\begin{cases}1,&\text{if cell }j\text{ is a neighbor of cell }i\\0,&\text{otherwise}\end{cases}$$

To explore the topological structures between the training and test sets, we also construct a graph using all cells from both the training and test sets. Similarly, we apply the K nearest neighbors graph construction strategy to the union of the two sets and obtain another adjacency matrix $\mathbf A^{\mathrm{all}}\in\mathbb{R}^{\left(n_{\mathrm{train}}+n_{\mathrm{test}}\right)\times\left(n_{\mathrm{train}}+n_{\mathrm{test}}\right)}$. Subsequently, we pad the adjacency matrix $\mathbf A^{\mathrm{train}}$ with zeros in the additional rows and columns so that its shape matches that of $\mathbf A^{\mathrm{all}}$. After that, we add the two matrices together and then binarize the result to obtain the final adjacency matrix $\mathbf A^{\mathrm{final}}\in\mathbb{R}^{\left(n_{\mathrm{train}}+n_{\mathrm{test}}\right)\times\left(n_{\mathrm{train}}+n_{\mathrm{test}}\right)}$.
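The graph construction can be sketched as follows, assuming NumPy and dense matrices (a real dataset would likely need sparse matrices). The function names, the value of $m$, and the random toy embeddings are hypothetical. Following the definition of $\mathbf A^{\mathrm{train}}$, the sketch keeps the neighbor relation directed and does not symmetrize it.

```python
import numpy as np

def knn_adjacency(z, m):
    """Directed adjacency: adj[i, j] = 1 if j is among the m most cosine-similar cells to i."""
    zn = z / np.linalg.norm(z, axis=1, keepdims=True)
    sim = zn @ zn.T
    np.fill_diagonal(sim, -np.inf)            # a cell is not its own neighbor
    nbrs = np.argsort(-sim, axis=1)[:, :m]
    adj = np.zeros(sim.shape, dtype=int)
    adj[np.arange(z.shape[0])[:, None], nbrs] = 1
    return adj

def final_adjacency(z_train, z_test, m):
    """Zero-pad A^train to the shape of A^all, add the two, then binarize."""
    n_train = z_train.shape[0]
    a_all = knn_adjacency(np.vstack([z_train, z_test]), m)
    a_pad = np.zeros_like(a_all)
    a_pad[:n_train, :n_train] = knn_adjacency(z_train, m)
    return ((a_all + a_pad) > 0).astype(int)

rng = np.random.default_rng(0)
z_tr, z_te = rng.normal(size=(6, 3)), rng.normal(size=(4, 3))
a_final = final_adjacency(z_tr, z_te, m=2)
```

The result contains every edge of the joint graph plus any training-only neighbor edges that the joint neighbor search displaced in favor of test cells.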

Semi-supervised training with GCN

The goal of GCN is to perform accurate cell type annotation on the test set by effectively leveraging both the training-test graph structure $\mathbf A^{\mathrm{final}}$ and the node features $\mathbf X^{\mathrm{input}}$. The node feature matrix is the stack of the training and test sets, represented as follows:

$$\mathbf X^{\mathrm{input}}=\begin{bmatrix}\mathbf X^{\mathrm{train}}\\\mathbf X^{\mathrm{test}}\end{bmatrix}\;\in\mathbb{R}^{\left(n_{\mathrm{train}}+n_{\mathrm{test}}\right)\times p}.$$

The adjacency matrix $\mathbf A^{\mathrm{final}}$, the node features $\mathbf X^{\mathrm{input}}$ and the cell type annotations of training set $\mathbf Y^{\mathrm{train}}$, are fed into GCN to perform semi-supervised learning. Specifically, we first standardize the adjacency matrix $\mathbf A^{\mathrm{final}}$ via:

$$\begin{array}{c} \widehat{\mathbf A} = \overset{\boldsymbol\sim}{\mathbf D}^{-\frac{1}{2}} \overset{\sim}{\mathbf A} \overset{\sim}{\mathbf D}^{-\frac{1}{2}}, \\ \overset{\boldsymbol\sim}{\mathbf A} = \mathbf A^{\mathrm{final}} + \mathbf{I},\end{array}$$

where $\overset{\sim}{\mathbf D}$ is the diagonal degree matrix of the adjacency matrix $\overset{\boldsymbol\sim}{\mathbf A}$, with $\overset{\sim}{D}_{ii}=\sum_j\overset{\sim}{A}_{ij}$, and $\mathbf I$ is the identity matrix. The GCN model is constructed with three convolutional layers. In each layer of GCN, the hidden representation is computed as:

$$\begin{array}{c}\mathbf H^{(l+1)}=f\left(\mathbf H^{(l)},\widehat{\mathbf A}\right)=\sigma\left(\widehat{\mathbf A}\mathbf H^{(l)}\mathbf W^{(l+1)}\right),\\\mathbf H^{(0)}=\mathbf X^{\mathrm{input}},\end{array}$$

where $\mathbf H^{(l)}$ is the output of the $l$-th layer, $\mathbf W^{(l)}$ is the weight matrix of the $l$-th layer, and $\sigma\left(\cdot\right)$ is a non-linear activation function:

$$\sigma\left(x\right)=\mathrm{ReLU}\left(x\right)=\max\left(0,x\right)$$

Finally, we apply the softmax function in the final layer output, which is defined as:

$$\begin{array}{c}\widehat{\mathbf Y}=\mathrm{softmax}\left(\mathbf H^{\left(3\right)}\right),\\\mathrm{softmax}\left(\cdot\right)=\frac{\exp\left(\cdot\right)}{\sum\exp\left(\cdot\right)},\end{array}$$

where $\widehat{\mathbf Y}\in\mathbb{R}^{\left({\mathrm n}_{\mathrm{train}}+{\mathrm n}_{\mathrm{test}}\right)\times K}$ is the soft label annotation of both training set and test set. The soft label of the training set is denoted as $\widehat{\mathbf Y}^{\mathrm{train}}\in\mathbb{R}^{n_{\mathrm{train}}\times K}$, and the soft label of the test set is denoted as $\widehat{\mathbf Y}^{\mathrm{test}}\in\mathbb{R}^{n_{\mathrm{test}}\times K}$.
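The forward pass defined by the equations above can be sketched as follows, assuming NumPy and dense matrices. The function names, hidden layer sizes, and random toy weights are hypothetical; the sketch applies ReLU in all three layers before the softmax, as the equations are written.

```python
import numpy as np

def normalize_adjacency(a_final):
    """A_hat = D^{-1/2} (A + I) D^{-1/2}, with D the degree matrix of A + I."""
    a_tilde = a_final + np.eye(a_final.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(a_tilde.sum(axis=1))
    return d_inv_sqrt[:, None] * a_tilde * d_inv_sqrt[None, :]

def softmax_rows(h):
    e = np.exp(h - h.max(axis=1, keepdims=True))   # subtract the row max for stability
    return e / e.sum(axis=1, keepdims=True)

def gcn_forward(a_hat, x, weights):
    """Apply H <- ReLU(A_hat H W) for each weight matrix, then a row-wise softmax."""
    h = x
    for w in weights:
        h = np.maximum(0.0, a_hat @ h @ w)
    return softmax_rows(h)

rng = np.random.default_rng(0)
a_toy = np.array([[0, 1, 0, 1], [1, 0, 1, 0], [0, 1, 0, 1], [1, 0, 1, 0]])
x_toy = rng.random((4, 5))                          # 4 cells, 5 features
w_toy = [rng.normal(size=(5, 8)), rng.normal(size=(8, 8)), rng.normal(size=(8, 3))]
y_hat = gcn_forward(normalize_adjacency(a_toy), x_toy, w_toy)
```

Each row of `y_hat` is a soft label over the three toy cell types; the first $n_{\mathrm{train}}$ rows would correspond to training cells and the remaining rows to test cells.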

Given the cell type annotations of the training set, we optimize the GCN by minimizing the loss function:

$$Loss=-\sum\limits_{i=1}^{n_{\mathrm{train}}}\sum\limits_{k=1}^Ky_{i,k}^{\mathrm{train}}\log\;\widehat y_{i,k}^{\mathrm{train}}$$

where $y_{i,k}^{\mathrm{train}}$ represents the true label of the $i$-th cell of the training set, that is, if the $i$-th cell belongs to the $k$-th cell type, $y_{i,k}^{\mathrm{train}}=1$; otherwise, $y_{i,k}^{\mathrm{train}}=0$. $\widehat y_{i,k}^{\mathrm{train}}$ represents the predicted probability that the $i$-th cell belongs to the $k$-th cell type. After completing GCN training, we obtain the final second-round soft label annotation $\widehat{\mathbf Y}^{\mathrm{test}\_ \text{GCN}}$.
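The semi-supervised character of the loss, in which only training cells contribute although all cells pass through the network, can be sketched as follows. This assumes NumPy, integer-coded labels, and that training cells occupy the first rows of the soft label matrix; the small constant `eps` is an added numerical safeguard and not part of the formula above.

```python
import numpy as np

def train_cross_entropy(y_hat, y_train, eps=1e-12):
    """Cross-entropy summed over the training rows of the soft label matrix."""
    n_train = len(y_train)
    p = y_hat[:n_train]                          # test rows are ignored by the loss
    one_hot = np.eye(p.shape[1])[y_train]
    return -(one_hot * np.log(p + eps)).sum()

y_hat_toy = np.array([[0.9, 0.1],    # training cell, true type 0
                      [0.2, 0.8],    # training cell, true type 1
                      [0.5, 0.5]])   # test cell, no label
loss = train_cross_entropy(y_hat_toy, y_train=np.array([0, 1]))
```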

The integration approach of MINGLE

To improve the accuracy and robustness of the model, we combine contrastive learning and GCN through an integration approach to determine the final prediction results. Specifically, after obtaining the two rounds of annotation results, we first sum the two soft label annotation matrices $\widehat{\mathbf Y}^{\mathrm{test\_CL}}$ and $\widehat{\mathbf Y}^{\mathrm{test}\_ \text{GCN}}$, to obtain the final soft label annotation matrix $\widehat{\mathbf Y}^{\mathrm{test}}\in\mathbb{R}^{n_{\mathrm{test}}\times K}$. For each cell in the test set, we assign its final label by the cell type corresponding to the maximum value within $\widehat{\mathbf y}_i^{\mathrm{test}}\in\mathbb{R}^K$. This maximum value indicates the most probable cell type for this cell.
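The integration step amounts to an element-wise sum followed by a row-wise argmax, as in the following NumPy sketch. The function name `integrate`, the type names, and the toy soft labels are hypothetical.

```python
import numpy as np

def integrate(y_cl, y_gcn, type_names):
    """Sum the two soft label matrices and pick the highest-scoring type per cell."""
    y_final = y_cl + y_gcn
    return y_final, np.asarray(type_names)[y_final.argmax(axis=1)]

y_cl_toy = np.array([[0.9, 0.1], [0.45, 0.55]])
y_gcn_toy = np.array([[0.4, 0.6], [0.2, 0.8]])
y_final, final_labels = integrate(y_cl_toy, y_gcn_toy, ["T cell", "B cell"])
```

For the first toy cell the two rounds disagree, and the more confident contrastive learning score decides the final label; for the second cell both rounds agree.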

The convex hull-based strategy to identify novel cell types

To effectively identify novel cell types that do not exist in the training set but appear in the test set, MINGLE incorporates a novel strategy which leverages the low-dimensional representations acquired from the training set during the contrastive learning phase to derive the convex hull of the existing cell types. By assessing whether cells in the test set fall within the convex hull, MINGLE determines whether the cells belong to novel cell types.

Here, we provide a more detailed explanation of how the convex hull is derived. After obtaining low-dimensional representations of training set $\overset{\boldsymbol\sim}{\mathbf X}^{\mathrm{train}}\in\mathbb{R}^{n_{\mathrm{train}}\times d}$, we first derive several low-dimensional subspaces. The low-dimensional subspaces are obtained by partitioning the low-dimensional space learned through contrastive learning into multiple subspaces, each of which consists of a subset of dimensions from the original low-dimensional space. For each cell type $\tau$, MINGLE calculates the convex hulls in the low-dimensional subspaces using:

$$\mathrm{Conv}\left(\left\{\mathbf x_{\tau,j}\right\}_{j=1}^{n_\tau}\right)=\left\{\sum_{j=1}^{n_\tau}\lambda_j\mathbf x_{\tau,j}\right\},$$

where $\lambda_j$ is the coefficient that creates a convex combination of points, $n_\tau$ is the number of training cells of cell type $\tau$, and ${\mathbf x}_{\tau,j}$ represents the low-dimensional representation of the $j$-th cell in the subspace associated with cell type $\tau$. Theoretically, $\lambda_j$ can be any value that satisfies the conditions:

$$\lambda_j\;\geq0,\sum\limits_{j=1}^{n_{\mathrm\tau}}\lambda_j=1.$$

Note that the coefficients $\lambda_j$ are not explicitly computed or set during convex hull construction. Instead, they represent the mathematical definition of the convex hull as the set of all possible convex combinations of the training points.

Specifically, the fundamental objective in constructing the convex hull of a given point set is to identify the subset of extremal points that define the minimal convex boundary enclosing all points. This methodology employs an iterative refinement approach: Initially, a provisional boundary is established using points exhibiting maximal coordinate values across dimensions, thereby defining a preliminary convex polytope. Subsequent iterations evaluate points residing outside this provisional boundary, with the farthest exterior point relative to the current convex surface being incrementally incorporated into the boundary definition. Each integration necessitates geometric reconfiguration to preserve convexity, achieved through the elimination of redundant interior regions and the redefinition of boundary facets. This approach avoids explicit computation of convex combination coefficients $\lambda_j$, instead leveraging geometric properties and efficient spatial queries.

For any cell in the test set $\overset{\boldsymbol\sim}{\mathbf X}^{\mathrm{test}}$, we utilize its low-dimensional representation to determine whether it lies within the convex hull of any known cell type. If it lies within the convex hulls of existing cell types, it is regarded as not novel. Otherwise, MINGLE marks this cell as belonging to a novel cell type.
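The membership test can be illustrated for a single two-dimensional subspace. The sketch below is not the MINGLE implementation: it uses Andrew's monotone chain algorithm rather than the incremental construction described above, it assumes each known type has at least three non-collinear training cells in the subspace, and it does not address how decisions from several subspaces are combined. All function names and toy coordinates are hypothetical.

```python
import numpy as np

def cross(o, a, b):
    """Z-component of (a - o) x (b - o); positive for a counter-clockwise turn."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

def convex_hull_2d(points):
    """Andrew's monotone chain; returns hull vertices in counter-clockwise order."""
    pts = sorted(set(map(tuple, points)))
    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]

def inside_hull(hull, q, tol=1e-12):
    """True if q lies inside or on the boundary of the counter-clockwise hull."""
    n = len(hull)
    return all(cross(hull[i], hull[(i + 1) % n], q) >= -tol for i in range(n))

def is_novel(z_train_2d, y_train, q):
    """Flag q as novel if it lies outside the hull of every known cell type."""
    hulls = [convex_hull_2d(z_train_2d[y_train == k]) for k in np.unique(y_train)]
    return not any(inside_hull(h, q) for h in hulls)

# Toy subspace: type 0 spans the unit square, type 1 spans a square near (3.5, 3.5).
z2d = np.array([[0, 0], [1, 0], [1, 1], [0, 1], [0.5, 0.5],
                [3, 3], [4, 3], [4, 4], [3, 4]], dtype=float)
y2d = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1])
known_cell = is_novel(z2d, y2d, (0.5, 0.2))
novel_cell = is_novel(z2d, y2d, (2.0, 2.0))
```

The point (0.5, 0.2) falls inside the hull of type 0 and is therefore not flagged, whereas (2.0, 2.0) lies outside both hulls and is flagged as novel.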

The mutual information-based interpretation strategy of MINGLE

To enhance user confidence in the model’s predictions and to assist researchers in understanding the relationships and distinctions between cell types, potentially uncovering cell-type-specific biological mechanisms or gaining new scientific insights, we have developed an interpretation phase in MINGLE. For this purpose, we incorporate a mutual information-based strategy designed to explain the predictions made by the GCN, providing a deeper understanding of the underlying decision process [19].

Given a trained GCN model and the corresponding prediction results, MINGLE will generate an explanation by identifying a subset of node features that are most influential for the model’s prediction. Specifically, to determine which node features are most important for prediction, MINGLE learns a feature selector $F$ for nodes in the graph. The feature subset $\mathbf X^{\mathrm F}$ is defined by the binary feature selector $F\in\left\{0,1\right\}^p$, which selects relevant features:

$$\mathbf x_i^F=\left[x_{i,t_1},x_{i,t_2},\dots,x_{i,t_p}\right],\;\mathrm{for}\;F_{t_j}=1,$$

where $\mathbf x_{\mathit i}^{\mathit F}$ represents the features of the $i$-th cell that have not been masked out by $F$. MINGLE optimizes the feature selector $F$ by maximizing the mutual information (MI):

$$\max_F\mathrm{MI}\left(\widehat{\mathbf Y},\mathbf X^F\right)=H\left(\widehat{\mathbf Y}\right)-H\left(\widehat{\mathbf Y}\mid\mathbf X=\mathbf X^F\right).$$

To further illustrate, we denote $\widehat{\mathbf Y}=F\left(\mathbf X\right)$, where $F\left(\cdot\right)$ is the prediction function of the GCN model; $\mathbf X^F={\mathbf X}_s\odot F$, where ${\mathbf X}_{\mathrm s}$ represents all node features. MI quantifies the change in the probability of prediction $F\left(\mathbf X^F\right)$ when the node features are limited to those selected by the binary feature selector $F$. In this case, the conditional entropy is defined as:

$$H\left(\widehat{\mathbf Y}\mid\mathbf X=\mathbf X^F\right)=-\mathrm E_{\widehat{\mathbf Y}\mid\mathbf X^F}\left[\log P_F\left(\widehat{\mathbf Y}\mid\mathbf X=\mathbf X^F\right)\right].$$

To maximize MI, we marginalize over all feature subsets and use a Monte Carlo estimate to sample from empirical marginal distribution for nodes in $\mathbf X_s$ during training. Furthermore, we use a reparametrization trick to backpropagate gradients to the feature selector $F$. In particular, we reparametrize $\mathbf X$ as:

$$\mathbf{X}=\mathbf{Z}+(\mathbf{X}_s-\mathbf{Z})\odot F,\quad\mathrm{s.t.}\;\sum\nolimits_jF_j\leq K_F,$$

where $\mathbf Z$ is a random variable sampled from the empirical distribution, and $K_F$ is a parameter representing the maximum number of features to be kept in the explanation.

Finally, MINGLE applies the sigmoid function to the final feature selector, which compresses the values to a range between 0 and 1, making it easier to understand which node features are more important. We treat the normalized values as importance scores for peaks, and select the top K peaks with the highest importance scores within each cell type as cell-type-specific peaks.
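The reparametrization and the final scoring step can be sketched as follows, assuming NumPy. The sketch treats the selector as a vector of real-valued logits passed through a sigmoid, which is a common continuous relaxation and an assumption here; the optimization of the logits by gradient descent is not shown. Function names and toy values are hypothetical.

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def masked_features(x_s, z, mask_logits):
    """X = Z + (X_s - Z) * F, with F relaxed to sigmoid(mask_logits) in (0, 1)."""
    f = sigmoid(mask_logits)
    return z + (x_s - z) * f

def top_k_peaks(mask_logits, k):
    """Importance scores in (0, 1) and the indices of the k highest-scoring peaks."""
    scores = sigmoid(mask_logits)
    return np.argsort(-scores)[:k], scores

x_s = np.array([[1.0, 2.0, 3.0, 4.0]])       # original features of one cell
z = np.zeros_like(x_s)                       # stand-in for a sample from the empirical distribution
logits = np.array([-2.0, 3.0, 0.5, 1.0])     # toy selector values after optimization
x_masked = masked_features(x_s, z, logits)
top_idx, scores = top_k_peaks(logits, k=2)
```

A feature with a score close to 1 keeps its original value, whereas a feature with a score close to 0 is replaced by the sampled value in `z`; in the toy case the peaks with indices 1 and 3 receive the highest scores.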

Data collection

We collected 11 scCAS datasets generated from different species with different protocols, various sizes, dimensions, degrees of imbalance, and proportions of major types for systematic benchmarking.

Firstly, we collected the dataset Melanoma, which consists of 598 cells from melanoma cell lines post-SOX10 knockdown, derived from two short-term patient cultures [49]. For further evaluation across different human tissues, we collected SpleenA and ThymusA datasets, both generated from human fetal samples using sci-ATAC-seq3 [6]. Notably, these two datasets encompass a greater number of cells and peaks, enabling a more comprehensive assessment of annotation performance within complex datasets. Additionally, to investigate the applicability of methods to datasets from different species and sources, we further collected 6 datasets (ThymusB, Liver, Heart, LungA, LungB, and SpleenB) in Mouse sci-ATAC-seq Atlas [5]. These datasets were profiled by a combinatorial indexing assay (sci-ATAC-seq), varying in terms of cell counts and degrees of imbalance, providing a robust testing ground for our evaluation.

To conduct cross-species annotation experiments, we additionally collected datasets from both human and mouse brains. Specifically, the human brain dataset contains 130,418 cells from postmortem human brain tissue, including samples from Alzheimer’s disease patients and cognitively healthy controls [39]. The dataset encompasses all major brain cell types: excitatory neurons (EX), inhibitory neurons (IN), astrocytes (AC), microglia (MG), oligodendrocytes (OC), and oligodendrocyte progenitor cells (OPC). Additionally, the human brain dataset includes three different batches. Furthermore, the mouse brain dataset is derived from the Mouse sci-ATAC-seq Atlas, containing cells from the cerebellum of 8-week-old mice [5]. We merged the subtypes of excitatory neurons and inhibitory neurons in the mouse brain dataset, as these undefined subtypes are not directly comparable across the human and mouse brain datasets. To enable cell type annotation between datasets from different species, we unified the peaks of the reference and query datasets. Specifically, for cases where the training and test datasets have different genomes, we used the tool liftover [50] to map the genomes by converting genome coordinates between assemblies. After mapping to the same genome, we used Signac [51] to unify the peaks between the two datasets, which treats overlapping peaks as equivalent features and adjusts their genomic boundaries to enable alignment.

A summary of the collected 11 scCAS datasets for benchmarking is provided in Additional file 1: Table S1. The imbalance degree of a dataset is defined by estimating the normalized entropy of the cell type size distribution as follows:

$$I=1+\frac{1}{\log C}\sum\limits_{c=1}^{C}\frac{n_c}{N}\log\frac{n_c}{N},$$

where $C$ denotes the number of cell types in the dataset, $n_c$ denotes the number of cells in the $c$-th cell type, and $N$ denotes the total number of cells in the dataset. The imbalance degree equals 1 if a single cell type contains all cells and 0 if all cell types have the same number of cells. The proportion of the major type refers to the proportion of cells belonging to the most abundant cell type in the dataset.
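As a worked illustration of the imbalance degree, the following minimal Python sketch evaluates the formula from a list of per-type cell counts. The function name `imbalance_degree` and the handling of empty cell types (taking $0\log 0=0$) are assumptions of this sketch rather than details given in the text.

```python
import math


def imbalance_degree(cell_counts):
    """Compute I = 1 + (1 / log C) * sum_c (n_c / N) * log(n_c / N).

    cell_counts holds n_c for each of the C cell types. Types with zero
    cells contribute nothing to the sum (0 * log 0 is taken as 0, an
    assumption of this sketch).
    """
    n_types = len(cell_counts)
    total = sum(cell_counts)
    if n_types < 2 or total <= 0:
        raise ValueError("need at least two cell types and one cell")
    weighted_log_sum = sum(
        (n / total) * math.log(n / total) for n in cell_counts if n > 0
    )
    return 1.0 + weighted_log_sum / math.log(n_types)


balanced = imbalance_degree([100, 100, 100])  # all types equally sized
degenerate = imbalance_degree([300, 0, 0])    # one type holds every cell
skewed = imbalance_degree([90, 10])           # between the two extremes
```

The balanced case evaluates to 0 (up to floating-point error) and the degenerate case to 1, matching the two boundary values stated above.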

Model evaluation

In experiments where the test set contains no novel types, we assessed annotation performance using four metrics to provide a comprehensive view: accuracy (Acc), macro F1 score (Macro-F1), Cohen’s kappa value (Kappa), and the Jaccard index (Jaccard). In experiments where the test set contains novel types, we adopted recall, precision, and F1-score for binary classification evaluation. Detailed mathematical formulas for these metrics are provided in Additional file 1: Text S1.
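For orientation, the four multi-class metrics can be sketched in plain Python using their standard textbook definitions. The exact formulas used in this study are those in Additional file 1: Text S1; in particular, averaging the per-class Jaccard index over classes is an assumption of this sketch, and all function names are illustrative.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of cells whose predicted label equals the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def _per_class_counts(y_true, y_pred):
    """Return labels plus true-positive, false-positive, false-negative counts."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1
            fn[t] += 1
    return labels, tp, fp, fn


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 = 2TP / (2TP + FP + FN)."""
    labels, tp, fp, fn = _per_class_counts(y_true, y_pred)
    scores = []
    for c in labels:
        denom = 2 * tp[c] + fp[c] + fn[c]
        scores.append(2 * tp[c] / denom if denom else 0.0)
    return sum(scores) / len(scores)


def macro_jaccard(y_true, y_pred):
    """Unweighted mean of per-class Jaccard = TP / (TP + FP + FN) (assumed averaging)."""
    labels, tp, fp, fn = _per_class_counts(y_true, y_pred)
    scores = []
    for c in labels:
        denom = tp[c] + fp[c] + fn[c]
        scores.append(tp[c] / denom if denom else 0.0)
    return sum(scores) / len(scores)


def cohen_kappa(y_true, y_pred):
    """(p_o - p_e) / (1 - p_e); undefined when chance agreement p_e equals 1."""
    n = len(y_true)
    p_o = accuracy(y_true, y_pred)
    true_freq, pred_freq = Counter(y_true), Counter(y_pred)
    p_e = sum(true_freq[c] * pred_freq[c] for c in true_freq) / (n * n)
    return (p_o - p_e) / (1 - p_e)


# Hypothetical toy labels, for illustration only.
y_true = ["T", "T", "B", "B", "NK", "NK"]
y_pred = ["T", "T", "B", "NK", "NK", "NK"]
```

Macro-averaged scores weight every cell type equally regardless of its size, which is why they are informative for rare cell types in imbalanced datasets.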

Baseline methods

We compared MINGLE with six baseline methods: two methods specifically designed for scCAS data and four conventional machine learning methods. The two scCAS-specific methods are SANGO [15], a recently proposed, high-performing method, and EpiAnno [14], the first method specifically designed for scCAS data. The four conventional machine learning methods are support vector machine (SVM), random forest (RF), and K-nearest neighbors with 9 or 50 neighbors (KNN9, KNN50), which were recommended by recent benchmark studies [16, 17]. Additionally, to assess performance in identifying novel cell types, we compared MINGLE with SVM with a rejection option (SVMrejection), which was shown to accurately identify novel cell types in a recent benchmark study [17].

SANGO

SANGO encodes the genome sequences of peaks into low-dimensional embeddings, which are then iteratively used to reconstruct the peak statistics of cells through a fully connected network [15]. Subsequently, SANGO annotates query cells via a graph transformer network. We followed the tutorial on GitHub (https://github.com/biomed-AI/SANGO#Tutorial) and executed it with default parameters.

EpiAnno

EpiAnno is a probabilistic generative model integrated with a Bayesian neural network to annotate scCAS data automatically in a supervised manner [14]. We followed the steps provided on GitHub (https://github.com/xy-chen16/EpiAnno/blob/master/code/demo.ipynb) and conducted the experiments using default parameters.

SVM

SVM is a supervised learning algorithm that seeks the optimal hyperplane separating the data into distinct classes. We used the sklearn.svm module in the scikit-learn package and selected the linear kernel, as suggested by a recent benchmark study [17].

RF

RF is an ensemble learning method that constructs multiple decision trees during training and outputs the mode of the classes predicted by the individual trees. We conducted experiments using the RandomForestClassifier class from the sklearn.ensemble module in the scikit-learn package with default parameters.

KNN

KNN is a simple, instance-based learning algorithm that classifies a data point based on the majority label among its closest neighbors in the feature space. We conducted experiments using the KNeighborsClassifier class from the sklearn.neighbors module in the scikit-learn package, with the n_neighbors parameter set to 9 and 50 for the KNN9 and KNN50 models, respectively.

SVMrejection

SVMrejection is an extension of the traditional SVM algorithm. The rejection option allows the model to identify and exclude uncertain or ambiguous cases during classification. A threshold of 0.7 on the posterior probabilities was used to assign cells as “Novel”, as recommended by a recent benchmark study [17].
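The rejection rule can be sketched as follows, assuming that per-class posterior probabilities are already available for each query cell and that a cell is rejected when its largest probability falls below the threshold. How the probabilities are obtained from the SVM is not specified here, and the function name and the strict inequality are assumptions of this sketch.

```python
def predict_with_rejection(class_probabilities, class_names, threshold=0.7):
    """Assign the most probable class, or "Novel" when confidence is too low.

    class_probabilities: one list of posterior probabilities per cell,
    ordered like class_names. A cell is labelled "Novel" when its largest
    probability is below the threshold (assumed rejection rule).
    """
    predictions = []
    for probs in class_probabilities:
        best_index = max(range(len(probs)), key=probs.__getitem__)
        if probs[best_index] < threshold:
            predictions.append("Novel")
        else:
            predictions.append(class_names[best_index])
    return predictions


# Hypothetical posterior probabilities for three query cells.
posteriors = [
    [0.90, 0.05, 0.05],  # confident: first class
    [0.40, 0.35, 0.25],  # ambiguous: rejected
    [0.10, 0.75, 0.15],  # confident: second class
]
labels = predict_with_rejection(posteriors, ["T cell", "B cell", "NK cell"])
```

With the default threshold of 0.7, the second cell is rejected because no class reaches the required confidence.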

Implementation details of downstream analyses

LDSC

Partitioned linkage disequilibrium score regression (LDSC) is a statistical method used to quantify the contribution of genetic variants to the heritability of traits or diseases across the genome [25]. Developed to interpret genome-wide association study (GWAS) results, LDSC leverages linkage disequilibrium information, the non-random association of alleles at different loci, to differentiate the signal due to polygenic traits from confounding biases such as population stratification and cryptic relatedness. We quantified the enrichment of heritability for immune-related phenotypes within cell-type-specific peaks for each cell type using partitioned LDSC with default settings.

SNPsea

SNPsea is an enrichment algorithm designed for analyzing single-nucleotide polymorphisms (SNPs) to pinpoint specific cell types, tissues, and biological pathways that are influenced by risk loci associated with traits [26]. It tests trait-associated genomic loci for enrichment of specificity to conditions (cell types, tissues and pathways). We quantified the enrichments of cell-type-specific peaks in tissue-specific accessibility profiles across 79 tissues. The top 30 significantly enriched tissues are illustrated.

GREAT

Genomic Regions Enrichment of Annotations Tool (GREAT) is a bioinformatics approach used to link genomic regions to known biological pathways and functions [27]. We submitted the cell-type-specific peaks identified by MINGLE to the GREAT server with the default settings to identify significant biological processes associated with these peaks and thus obtain functional insights for the corresponding cell subpopulation. The top 10 significant biological processes are illustrated.

Data availability

The Melanoma dataset is available at GEO with the accession number GSE114557 [https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE114557] [52]. The SpleenA and ThymusA datasets are available at GEO with the accession number GSE149683 [https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE149683] [53]. The ThymusB, Liver, Heart, LungA, LungB, SpleenB and mouse brain datasets are available at [https://atlas.gs.washington.edu/mouse-atac/data] [54]. The human brain dataset is available at GEO with the accession number GSE174367 [https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE174367] [55].

The source code is publicly available in the GitHub repository at https://github.com/BioX-NKU/MINGLE [56] and on Zenodo [https://doi.org/10.5281/zenodo.15221734] [57] under an MIT license. We have released MINGLE as a Python package for easy installation and provided detailed tutorials on GitHub.

References

Zhu C, Preissl S, Ren B. Single-cell multimodal omics: the power of many. Nat Methods. 2020;17:11–4.
Buenrostro JD, Corces MR, Lareau CA, Wu B, Schep AN, Aryee MJ, Majeti R, Chang HY, Greenleaf WJ. Integrated single-cell analysis maps the continuous regulatory landscape of human hematopoietic differentiation. Cell. 2018;173:1535–1548.e16.
Buenrostro JD, Wu B, Litzenburger UM, Ruff D, Gonzales ML, Snyder MP, Chang HY, Greenleaf WJ. Single-cell chromatin accessibility reveals principles of regulatory variation. Nature. 2015;523:486–90.
Klemm SL, Shipony Z, Greenleaf WJ. Chromatin accessibility and the regulatory epigenome. Nat Rev Genet. 2019;20:207–20.
Cusanovich DA, Hill AJ, Aghamirzaie D, Daza RM, Pliner HA, Berletch JB, Filippova GN, Huang X, Christiansen L, DeWitt WS. A single-cell atlas of in vivo mammalian chromatin accessibility. Cell. 2018;174:1309–1324.e18.
Domcke S, Hill AJ, Daza RM, Cao J, O’Day DR, Pliner HA, Aldinger KA, Pokholok D, Zhang F, Milbank JH. A human cell atlas of fetal chromatin accessibility. Science. 2020;370: eaba7612.
Preissl S, Fang R, Huang H, Zhao Y, Raviram R, Gorkin DU, Zhang Y, Sos BC, Afzal V, Dickel DE. Single-nucleus analysis of accessible chromatin in developing mouse forebrain reveals cell-type-specific transcriptional regulation. Nat Neurosci. 2018;21:432–9.
Chen J, Xu H, Tao W, Chen Z, Zhao Y, Han J-DJ. Transformer for one stop interpretable cell type annotation. Nat Commun. 2023;14:223.
Fischer F, Fischer DS, Mukhin R, Isaev A, Biederstedt E, Villani A-C, Theis FJ. scTab: Scaling cross-tissue single-cell annotation models. Nat Commun. 2024;15:6611.
Kiselev VY, Yiu A, Hemberg M. scmap: projection of single-cell RNA-seq data across data sets. Nat Methods. 2018;15:359–62.
Pliner HA, Shendure J, Trapnell C. Supervised classification enables rapid annotation of cell atlases. Nat Methods. 2019;16:983–6.
Xie P, Gao M, Wang C, Zhang J, Noel P, Yang C, Von Hoff D, Han H, Zhang MQ, Lin W. SuperCT: a supervised-learning framework for enhanced characterization of single-cell transcriptomic profiles. Nucleic Acids Res. 2019;47:e48–e48.
Chen H, Lareau C, Andreani T, Vinyard ME, Garcia SP, Clement K, Andrade-Navarro MA, Buenrostro JD, Pinello L. Assessment of computational methods for the analysis of single-cell ATAC-seq data. Genome Biol. 2019;20:1–25.
Chen X, Chen S, Song S, Gao Z, Hou L, Zhang X, Lv H, Jiang R. Cell type annotation of single-cell chromatin accessibility data via supervised Bayesian embedding. Nat Mach Intell. 2022;4:116–26.
Zeng Y, Luo M, Shangguan N, Shi P, Feng J, Xu J, Chen K, Lu Y, Yu W, Yang Y. Deciphering cell types by integrating scATAC-seq data with genome sequences. Nat Comput Sci. 2024;4:1–14.
Abdelaal T, Michielsen L, Cats D, Hoogduin D, Mei H, Reinders MJ, Mahfouz A. A comparison of automatic cell identification methods for single-cell RNA sequencing data. Genome Biol. 2019;20:1–19.
Ma W, Su K, Wu H. Evaluation of some aspects in supervised cell type identification for single-cell RNA-seq: classifier, feature selection, and reference construction. Genome Biol. 2021;22:1–23.
He K, Chen X, Xie S, Li Y, Dollár P, Girshick R. Masked autoencoders are scalable vision learners. Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2022:16000–16009.
Ying Z, Bourgeois D, You J, Zitnik M, Leskovec J. Gnnexplainer: generating explanations for graph neural networks. Adv Neur Inform Process Syst. 2019;32:9240–51.
Hilligan KL, Ronchese F. Antigen presentation by dendritic cells and their instruction of CD4+ T helper cell responses. Cell Mol Immunol. 2020;17:587–99.
Huppa JB, Davis MM. T-cell-antigen recognition and the immunological synapse. Nat Rev Immunol. 2003;3:973–83.
Gameiro J, Nagib P, Verinaud L. The thymus microenvironment in regulating thymocyte differentiation. Cell Adh Migr. 2010;4:382–90.
Wang H-X, Pan W, Zheng L, Zhong X-P, Tan L, Liang Z, He J, Feng P, Zhao Y, Qiu Y-R. Thymic epithelial cells contribute to thymopoiesis and T cell development. Front Immunol. 2020;10: 3099.
Surh CD, Ernst B, Sprent J. Growth of epithelial cells in the thymic medulla is under the control of mature T cells. J Exp Med. 1992;176:611–6.
Finucane HK, Bulik-Sullivan B, Gusev A, Trynka G, Reshef Y, Loh P-R, Anttila V, Xu H, Zang C, Farh K. Partitioning heritability by functional annotation using genome-wide association summary statistics. Nat Genet. 2015;47:1228–35.
Slowikowski K, Hu X, Raychaudhuri S. SNPsea: an algorithm to identify cell types, tissues and pathways affected by risk loci. Bioinformatics. 2014;30:2496–7.
McLean CY, Bristor D, Hiller M, Clarke SL, Schaar BT, Lowe CB, Wenger AM, Bejerano G. GREAT improves functional interpretation of cis-regulatory regions. Nat Biotechnol. 2010;28:495–501.
Duke-Cohan JS, Akitsu A, Mallis RJ, Messier CM, Lizotte PH, Aster JC, Hwang W, Lang MJ, Reinherz EL. Pre-T cell receptor self-MHC sampling restricts thymocyte dedifferentiation. Nature. 2023;613:565–74.
Siamishi I, Iwanami N, Clapes T, Trompouki E, O’Meara CP, Boehm T. Lymphocyte-specific function of the DNA polymerase epsilon subunit Pole3 revealed by neomorphic alleles. Cell Rep. 2020;31:31.
den Haan JM, Arens R, van Zelm MC. The activation of the adaptive immune system: cross-talk between antigen-presenting cells, T cells and B cells. Immunol Lett. 2014;162:103–12.
Kinner-Bibeau LB, Sedlacek AL, Messmer MN, Watkins SC, Binder RJ. HSPs drive dichotomous T-cell immune responses via DNA methylome remodelling in antigen presenting cells. Nat Commun. 2017;8: 15648.
St-Pierre C, Morgand E, Benhammadi M, Rouette A, Hardy M-P, Gaboury L, Perreault C. Immunoproteasomes control the homeostasis of medullary thymic epithelial cells by alleviating proteotoxic stress. Cell Rep. 2017;21:2558–70.
Sun L, Luo H, Li H, Zhao Y. Thymic epithelial cell development and differentiation: cellular and molecular regulation. Protein Cell. 2013;4:342–55.
Žuklys S, Handel A, Zhanybekova S, Govani F, Keller M, Maio S, Mayer C, Teh H, Hafen K, Gallone G. Foxn1 regulates in postnatal thymic epithelial cells key target genes essential for T cell development. Nat Immunol. 2016;17:1206–15.
Danese A, Richter ML, Chaichoompu K, Fischer DS, Theis FJ, Colomé-Tatché M. EpiScanpy: integrated single-cell epigenomic analysis. Nat Commun. 2021;12:5228.
Cao Y, Zhao X, Tang S, Jiang Q, Li S, Li S, Chen S. scButterfly: a versatile single-cell cross-modality translation method via dual-aligned variational autoencoders. Nat Commun. 2024;15:2973.
Chen X, Li K, Wu X, Li Z, Jiang Q, Cui X, Gao Z, Wu Y, Jiang R. Descart: a method for detecting spatial chromatin accessibility patterns with inter-cellular correlations. Genome Biol. 2024;25:322.
Li S, Li Y, Sun Y, Li Y, Chen X, Tang S, Chen S. EpiCarousel: memory-and time-efficient identification of metacells for atlas-level single-cell chromatin accessibility data. Bioinformatics. 2024;40:btae191.
Morabito S, Miyoshi E, Michael N, Shahin S, Martini AC, Head E, Silva J, Leavy K, Perez-Rosendahl M, Swarup V. Single-nucleus chromatin accessibility and transcriptomic characterization of Alzheimer’s disease. Nat Genet. 2021;53:1143–55.
Jia Y, Li S, Jiang R, Chen S. Accurate annotation for differentiating and imbalanced cell types in single-cell chromatin accessibility data. IEEE/ACM Trans Comput Biol Bioinform. 2024;21:461-71.
Li S, Tang S, Wang Y, Li S, Jia Y, Chen S. Accurate cell type annotation for single-cell chromatin accessibility data via contrastive learning and reference guidance. Quantitative Biology. 2024;12:85–99.
Cui X, Chen X, Li Z, Gao Z, Chen S, Jiang R. Discrete latent embedding of single-cell chromatin accessibility sequencing data for uncovering cell heterogeneity. Nat Comput Sci. 2024;4:1–14.
Tang S, Cui X, Wang R, Li S, Li S, Huang X, Chen S. scCASE: accurate and interpretable enhancement for single-cell chromatin accessibility sequencing data. Nat Commun. 2024;15:1629.
Chen T, Kornblith S, Norouzi M, Hinton G. A simple framework for contrastive learning of visual representations. International conference on machine learning. 2020:1597–1607.
Kalantidis Y, Sariyildiz MB, Pion N, Weinzaepfel P, Larlus D. Hard negative mixing for contrastive learning. Adv Neural Inf Process Syst. 2020;33:21798–809.
Song Q, Su J, Zhang W. scGCN is a graph convolutional networks algorithm for knowledge transfer in single cell omics. Nat Commun. 2021;12:3826.
Yang J, Wang W, Zhang X. scSemiGCN: boosting cell-type annotation from noise-resistant graph neural networks with extremely limited supervision. Bioinformatics. 2024;40:btae091.
Yuan Z, Li Y, Shi M, Yang F, Gao J, Yao J, Zhang MQ. SOTIP is a versatile method for microenvironment modeling with spatial omics data. Nat Commun. 2022;13:7330.
Bravo González-Blas C, Minnoye L, Papasokrati D, Aibar S, Hulselmans G, Christiaens V, Davie K, Wouters J, Aerts S. cisTopic: cis-regulatory topic modeling on single-cell ATAC-seq data. Nat Methods. 2019;16:397–400.
Hinrichs AS, Karolchik D, Baertsch R, Barber GP, Bejerano G, Clawson H, Diekhans M, Furey TS, Harte RA, Hsu F. The UCSC genome browser database: update 2006. Nucleic Acids Res. 2006;34:D590–8.
Stuart T, Srivastava A, Madad S, Lareau CA, Satija R. Single-cell chromatin state analysis with Signac. Nat Methods. 2021;18:1333–41.
Bravo González-Blas C, Minnoye L, Papasokrati D, Aibar S, Hulselmans G, Christiaens V, Davie K, Wouters J, Aerts S. cisTopic: cis-regulatory topic modeling on single-cell ATAC-seq data. Datasets: Gene Expression Omnibus; 2019. https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE114557.
Domcke S, Hill AJ, Daza RM, Cao J, O’Day DR, Pliner HA, Aldinger KA, Pokholok D, Zhang F, Milbank JH. A human cell atlas of fetal chromatin accessibility. Datasets: Gene Expression Omnibus; 2020. https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE149683.
Cusanovich DA, Hill AJ, Aghamirzaie D, Daza RM, Pliner HA, Berletch JB, Filippova GN, Huang X, Christiansen L, DeWitt WS. A single-cell atlas of in vivo mammalian chromatin accessibility. Datasets. 2018. https://atlas.gs.washington.edu/mouse-atac/data.
Morabito S, Miyoshi E, Michael N, Shahin S, Martini AC, Head E, Silva J, Leavy K, Perez-Rosendahl M, Swarup V. Single-nucleus chromatin accessibility and transcriptomic characterization of Alzheimer’s disease. Datasets: Gene Expression Omnibus; 2021. https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE174367.
Li S, Huang Y, Chen S. MINGLE: a mutual information-based interpretable framework for automatic cell type annotation in single-cell chromatin accessibility data. Github. 2025. https://github.com/BioX-NKU/MINGLE.
Li S, Huang Y, Chen S. MINGLE: a mutual information-based interpretable framework for automatic cell type annotation in single-cell chromatin accessibility data. Zenodo. 2025. https://doi.org/10.5281/zenodo.15221734.

Acknowledgements

Not applicable.

Peer review information

Claudia Feng was the primary editor of this article and managed its editorial process and peer review in collaboration with the rest of the editorial team. The peer-review history is available in the online version of this article.

Funding

This work was supported by the National Natural Science Foundation of China [62203236, 62473212], and the Young Elite Scientists Sponsorship Program by CAST [2023QNRC001].

Author information

Siyu Li and Yifan Huang contributed equally.

Authors and Affiliations

School of Mathematical Sciences and LPMC, Nankai University, Tianjin, 300071, China
Siyu Li, Yifan Huang & Shengquan Chen

Contributions

S.C. conceived and supervised the project. S.L., Y.H. and S.C. designed, implemented, and validated MINGLE and wrote the manuscript. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Shengquan Chen.

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Additional file 1: Contains supplementary texts, supplementary figures and supplementary Table 1.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.

About this article

Cite this article

Li, S., Huang, Y. & Chen, S. MINGLE: a mutual information-based interpretable framework for automatic cell type annotation in single-cell chromatin accessibility data. Genome Biol 26, 162 (2025). https://doi.org/10.1186/s13059-025-03603-9

Received: 17 January 2025
Accepted: 02 May 2025
Published: 11 June 2025
DOI: https://doi.org/10.1186/s13059-025-03603-9
