Introduction

Gene regulation orchestrates the underlying biological processes during normal physiology and disease conditions. Gene regulatory factors such as transcription factors (TF) and microRNAs (miRNA) play a significant role in gene regulation at the transcriptional and post-transcriptional levels, respectively. TFs are proteins that bind to a specific position in the DNA sequence to inhibit or induce transcription. miRNAs are a type of short noncoding RNA that bind to RNAs to inhibit their translation. TFs and miRNAs can regulate the same gene and interfere with each other’s activity as well1. Dysregulation of gene regulation can lead to diseases such as cancer through disruption in normal physiological processes. Since a biological process is carried over by a group of genes and a gene can be co-regulated by TFs and miRNAs, it is an important problem to unveil the complex relationship among TFs, miRNAs, genes, and biological processes.

Inference of gene regulatory networks of TFs and miRNAs have been studied extensively by employing different approaches and objectives2,3,4,5,6,7,8,9,10. Chen et al. generated a cross-adjacency matrix for coregulation pairs (specifically TF-TF, TF-miRNA, and miRNA-miRNA pairs) based on the common target genes of the regulators utilizing Gene Ontology (GO) distribution4. They showed that combinatorial regulation of TFs and miRNAs could activate/deactivate some biological processes and breaking that co-regulation could be associated with cancer. Baur et al. proposed a new algorithm using copy number aberration and DNA methylation data to infer gene regulatory networks based on dynamic Bayesian network analysis. They achieved improvement using those data instead of gene expression data9. Bose et al. proposed a tool to infer copy number derived miRNA – gene interactions in cancer and found some novel miRNA-gene interactions. They were also able to find some miRNAs that are related to tumor survival and progression11,12. Moreover, some studies focused on active subnetwork inference or module detection6,13,14,15. FGMD is a tool that uses statistical methods to find modules in cancer15. Some studies focus on other disease-specific regulatory networks such as brain cancer and Alzheimer’s disease7,8. However, identifying disease-specific subnetworks and modules is still challenging. Jiang et al. identified active TF and miRNA regulatory pathways in Alzheimer’s disease8based on expression profiles of mRNA and miRNA from multiple datasets. Based on genes that are differentially expressed (DE) consistently in multiple datasets, differentially expressed miRNAs, and their first-hop neighbors as main factors, they generated active subnetworks and pathways. Pan et al. identified TF-miRNA coregulation modules applying a biclique modularity approach to a cancer-specific regulator-target network6. Starting with a limited number of genes with mutation evidence in previous studies, Sun et al. generated the first miRNA-TF regulatory network for Glioblastoma multiforme (GBM), which consists of known human TFs, GBM-related miRNAs, and their target GBM-related genes7. Bozdag et al. identified GBM subtype-specific-master regulators and gene regulatory networks based on pathway enrichment analysis and gene regulatory network inference10. Even though these studies consider the combinatorial effect of TF and miRNA regulation, there is still a gap to improve existing studies. First, assigning a gene to only one module is a limitation for most studies since genes could have multiple functionalities. Also, existing studies mostly focus on known interactions, however, the curated regulation databases are far away from being complete, therefore we can add some flexibility to include target genes in a module even though they are not the immediate target of all regulators in that module. Moreover, disease-specific modules mostly consider the common target genes between TFs and miRNAs, however, we can consider the similarities of target genes in biological functions as an additional requirement to have more functionally-relevant modules.

To decipher disease-specific interplay between regulators (TFs and miRNAs) and their targets with similar biological functions, we previously published a computational tool, CanMod, which builds disease-specific gene regulatory modules that consist of TF and miRNA regulators and their co-regulated targets with similar biological functions16. CanMod is different from other studies by inferring modules that consist of a group of regulators that controls a group of genes with similar biological functions. CanMod also considered the effect of DNA methylation and copy number aberration while obtaining potential regulators of genes.

In this study, we introduce CanMod2, which follows a similar pipeline as CanMod with several algorithmic enhancements to ensure more robust results. CanMod2 starts with clustering DE genes based on GO-based similarity. Integrating transcriptional and post-transcriptional regulation could provide a comprehensive regulation for complex diseases like cancer. Thus, for each gene, CanMod2 applies LASSO17 to compute regulators (i.e., TFs and miRNAs) that associate with the gene’s expression. During LASSO, CanMod2 considers DNA methylation and copy number aberration as additional independent variables. Then CanMod2 clusters the regulators based on their shared target similarity and merges regulator and gene clusters to generate the candidate modules. After obtaining those modules, CanMod2 goes through an iterative process of refining and merging the modules until a stopping criterion is met. CanMod2 allows a gene to be present in multiple modules considering that genes could have multiple biological functions. However, it also minimizes the risk of generating repetitive modules with high overlaps with other modules.

There are several algorithmic improvements in CanMod2 compared to CanMod. Specifically,

  • To ensure consistent results in the clustering steps, we framed the clustering problem as community detection problem in a biological network. We built gene/regulator similarity networks and employed Walktrap algorithm18 to detect Gene Clusters (GCs), Regulator Clusters (RCs), and to merge modules with overlapping regulators and target genes. In CanMod, we utilized Affinity Propagation (AP) clustering algorithm to get GCs and agglomerative hierarchical clustering algorithm to generate RCs and merge modules.

  • CanMod utilizes Sparse Canonical Correlation (SCCA) to select regulators and targets that are co-expressed in each candidate module. We observed that SCCA analysis is highly stochastic with different results for each run. Thus, in CanMod2, when refining the modules, we replaced the SCCA step with a biclustering technique19 to make the results more consistent.

  • We made the module filtration and merging (step 5 and 6) steps run iteratively until reaching to a certain criterion. In CanMod, these two steps are applied only once.

In addition, we applied CanMod2 on five different cancer datasets, namely Breast invasive carcinoma (BRCA), Kidney renal clear cell carcinoma (KIRC), Lung squamous cell carcinoma (LUSC), Prostate adenocarcinoma (PRAD) and Uterine Corpus Endometrial Carcinoma (UCEC), and performed extensive evaluations to show its robustness. The modules identified by CanMod2 had more uniqueness and complementarity in the enrichment of GO terms as compared to other tools in most cases for all five cancer types. The percentages of significantly enriched modules with GO terms, KEGG20,21,22 pathway and CH terms for CanMod2 were highest compared to CanMod and FGMD in all five cancer types. Additionally, CanMod2 demonstrated higher consistency and less overlap among modules as compared to CanMod and FGMD. We showed that regulators inferred in the CanMod2 modules were highly associated with cancer. Inferred modules were highly enriched in experimentally-validated interactions. The source code for CanMod2 and documentation are available at https://github.com/bozdaglab/CanMod2 under Creative Commons Attribution Non-Commercial 4.0 International Public License.

Materials and methods

CanMod2 is a computational tool to identify gene regulatory modules, which consist of a group of regulators and their target genes that are involved in similar biological processes. To identify gene regulatory modules, CanMod2 first clusters DE genes based on their functional similarity. Then for each gene, CanMod2 computes its TF and miRNA regulators and clusters the regulators based on their shared target genes. Candidate gene regulatory modules are built by combining gene and regulator clusters, and they are further refined in an iterative process. In what follows, each step of CanMod2 is explained in detail. The CanMod2 pipeline is shown in Fig. 1.

Fig. 1
figure 1

CanMod2 Pipeline. Step 1: Identify gene clusters (GCs) by clustering genes based on GO annotation similarity. Step 2: Select regulators based on expression change for each gene. Step 3: Identify regulatory clusters (RCs) by clustering regulators based on their target similarity. Step 4: Generate modules combing GCs and RCs. Step 5: Refine modules using a biclustering technique. Step 6: Merge modules whose regulators have similar targets. Steps 5 and 6 are executed iteratively until there is a negligible change between iterations. [CNA: Copy Number Aberration, DM: DNA Methylation, TF: transcription factor].

Step 1: group genes with similar biological functions

To find the disease-specific modules, CanMod2 utilizes DE genes between normal and disease samples. To find the group of genes involved in similar biological processes, we clustered them based on their Gene Ontology Biological Process (GO-BP) terms. We computed a GO-BP-based similarity matrix between all pairs of DE genes by employing the R package GoSemSim23. Based on this similarity matrix, we built a gene network where each vertex was a gene, and two genes were connected if they had ≥\(\:s\%\) similarity. In our work, we set \(\:s=45\)based on some empirical tests (Supplementary Methods). We applied a community detection algorithm named Walktrap to this network to cluster the genes with similar functionalities18.

Walktrap is a random walk-based algorithm to compute structures in a graph. Let us consider a graph \(\:G=\:(V,\:E)\) where \(\:V\) is the set of vertices and \(\:E\) is the set of edges. A community structure in graph \(\:G=\:(V,\:E)\) is a partition \(\:\{{C}_{1},\dots\:,{C}_{k}\}\) of the vertices \(\:V\) \(\:(\forall\:i\:{C}_{i}\subseteq\:V)\) where the proportion of internal edges in \(\:{C}_{i}\) is higher compared to the proportion of edges between other partitions. At each time step, a walker is on one of the vertices of \(\:V\)and randomly and uniformly chooses its next step among its neighbors. The sequence of visited vertices forms a Markov chain where the states of the Markov chain are the vertices of the graph16. Let us consider an adjacency matrix, \(\:A\in\:\:{\mathbb{R}}^{n\times\:n}\) of \(\:G\) where \(\:n\:\)is the number of nodes, \(\:{A}_{uv}=1\:\)if vertices \(\:u\) and \(\:v\:\)are connected, \(\:{A}_{uv}=0\) otherwise and the degree \(\:d\left(u\right)\:\)of vertex \(\:u\) is the number of its neighbors (including itself). The transition probability from vertex \(\:u\) to \(\:v\:\)can be defined as \(\:{P}_{uv}=\frac{{A}_{uv}}{d\left(u\right)}\). The transition matrix \(\:P\) of the random walk process can also be written as \(\:P\:=\:{D}^{-1}A\), where \(\:D\) is the diagonal matrix of the degrees \(\:(\:{D}_{uu}=\:d(u)\:\)and\(\:\:{D}_{uv}=\:0\:\)if\(\:\:u\:\ne\:\:v)\). The probability of going from \(\:u\) to\(\:\:v\) through a random walk of length \(\:t\) is \(\:{P}_{uv}^{t}\). The information about vertex \(\:u\) is encoded in \(\:{P}^{t}\) resides in the \(\:n\) probabilities \(\:{\left({P}_{uk}^{t}\right)}_{1\le\:k\le\:n}\), that is \(\:{u}^{th}\) row of the matrix Pt, denoted by \(\:{P}_{u}^{t}\). We can define the distance between two vertices as

$$\:{r}_{uv}=\sqrt{\sum\:_{k=1}^{n}\frac{{(P}_{uk}^{t}-{{P}_{vk}^{t})}^{2}\:}{d\left(k\right)}}={\left|\right|\:D}^{-\frac{1}{\:2\:}}{P}_{u}^{t}.\:-{D}^{-\frac{1}{\:2\:}}{P}_{v}^{t}.\left|\right|$$
(1)

where \(\left|\right|\cdot\left|\right|\) is the Euclidean norm of \(\:{R}^{n}\).

The distance of vertices (Eq. 1) can be generalized to the distance of communities. Let us consider that a random walker starts its walk from a node in the community, \(\:C\). The starting vertex is chosen randomly and uniformly from the vertices in \(\:C\). The probability of going from community \(\:C\) to vertex \(\:v\) in \(\:t\) steps:

$$\:{P}_{Cv}^{t}=\frac{1}{\left|C\right|}\sum\:_{u\epsilon\:C}{P}_{uv}^{t}$$
(2)

Let \(\:{C}_{1}\) and \(\:{C}_{2}\) be two communities where \(\:{C}_{1},\:{C}_{2}\subset\:V\). We define the distance between these two communities as:

$$\:{r}_{{C}_{1}{C}_{2}}=\sqrt{\sum\:_{k=1}^{n}\frac{{(P}_{{C}_{1}k}^{t}-{{P}_{{C}_{2}k}^{t})}^{2}\:}{d\left(k\right)}}={\left|\right|\:D}^{-\frac{1}{\:2\:}}{P}_{{C}_{1}}^{t}.\:-{D}^{-\frac{1}{\:2\:}}{P}_{{C}_{2}}^{t}.\left|\right|$$
(3)

Using the distance in Eq. 3, vertices are grouped into communities based on their attributes. In our case, each vertex was a gene, and each edge was an attribute that indicated the similarity between genes. Since clustering is a technique to group similar objects based on their attributes, finding communities in this gene network is equivalent to gene clustering. We used the Walktrap implementation in the R package igraph for detecting and analyzing the communities24. At the end of this step, we had several gene clusters (GCs) with similar biological functions.

Step 2: select regulators associated with the expression change of each target

In this step, we aim to compute the regulators that are associated with the expression change of genes. Hence, we employed a LASSO-based regression method where potential regulators of a gene (i.e., TF, miRNA, CNA, and DNA Methylation) were used as independent variables and the expression of the target gene was used as the dependent variable. CanMod2 employed the R package HDCI to perform LASSO25. LASSO was applied for each gene 100 times. For each gene, we considered the variables as the regulator of that gene if they were selected more than 75% of the time with non-zero coefficients. As there is no statistical measurement associated with the LASSO selected variables, we employed the bootstrap procedure introduced in our earlier work26 to build confidence intervals of the selected variables’ coefficients. We kept only TFs and miRNAs as the regulator of the genes for further steps. Based on the outcome of this step, we cluster the regulators in Step 3 and assign them to modules in Step 4.

Step 3: group regulators with similar targets

In this step, we aim to find groups of regulators called regulator clusters (RCs) that co-regulate a group of target genes. These RCs are building blocks of the modules as explained in later steps. To compute RCs, we perform community detection on a regulatory network. For this, we constructed a regulatory-target association matrix \(\:Z\in\:\:{\mathbb{R}}^{rxt}\) where \(\:r\) is the total number of regulators (i.e., TFs and miRNAs) and \(\:t\) is the total number of target genes. Based on Step 2 results, the value of each cell \(\:{Z}_{ij}=\:1\) if regulator \(\:i\) is the regulator of gene \(\:j\) and \(\:0\) otherwise. Then we built a graph of regulators where each node is a regulator and weighted edges represent the Jaccard similarity between the regulators based on matrix \(\:Z\). To build communities in this network, we first eliminated the edges whose weights were \(\:<\:0.1\). The justification for choosing this value is described in Supplementary Methods. Then we applied the Walktrap algorithm on that regulator network, similar to the gene clustering in the first step. Each community from the Walktrap algorithm was kept as an RC.

Step 4: obtain candidate modules for each RC by recruiting co-expressed target genes in the same GCs

To obtain the candidate modules, GCs and RCs were merged in this step. For each RC, we computed the union of the target genes of each regulator in that RC based on Step 2. These target genes were named seed genes. Stratifying the seed genes based on their GC membership, we merged all the regulators in the RC and the group of seed genes for each GC as a candidate module. In other words, for each RC, we generated multiple candidate modules, one per GC that had at least one seed gene.

In the candidate modules, initially only regulators and their target genes computed in step 2 are present. However, there might be other genes that might belong to candidate modules even though they failed the LASSO step. To add more potential target genes to the candidate modules that were co-expressed and had similar biological functions as the seed genes in that module, we considered the remaining genes in the corresponding GCs. For each module, additional target genes called partner genes were added if they met two conditions: the partner genes need to be present in the same GC as the seed genes in that candidate module (to ensure that these genes are involved in the same biological process as the other target genes in the module) and the absolute expression correlation between the partner and any seed gene or regulator must be in the top 99% of the absolute correlation distribution (to ensure that there is a high co-expression pattern). For the correlation between the partner gene and seed gene, we used the correlation threshold from all gene pairs, while between the partner gene and regulator we used the correlation threshold from the pairs of all genes and regulators. After adding the partner genes, we had candidate modules that consist of regulators and co-expressed targets with similar biological functions.

Step 5: select the regulators and the targets that are co-expressed in each candidate module

The candidate modules computed in Step 4 could have spurious regulators or targets. In this step, we eliminated the regulators and the targets that do not contribute to the co-expression. To do this, we first computed an expression correlation matrix, \(\:E\in\:\:{\mathbb{R}}^{txr}\)where \(\:t\:\)is the number of target genes, \(\:r\:\)is the number of regulators and \(\:{\text{E}}_{\text{p}\text{q}}\) is Pearson’s correlation between the gene \(\:p\) and the regulator \(\:q\). Then, utilizing a biclustering technique19, we selected the regulators and the target genes that were important for co-expression. Here, biclustering means finding one or multiple submatrices of \(\:E\) by adding or deleting genes or regulators with a high mean squared residue score. Given \(\:{\text{E}}_{\text{P}\text{Q}},\:\)a submatrix of \(\:E\) with the genes \(\:P\) and the regulators \(\:Q\), the mean squared residue score of \(\:{\text{E}}_{\text{P}\text{Q}}\), that is S(\(\:{\text{E}}_{\text{P}\text{Q}}\)), is computed as:

$$\:\text{S}\left({\text{E}}_{\text{P}\text{Q}}\right)=\frac{1}{\left|P\right|\left|Q\right|}\sum\:_{p\in\:P,q\in\:Q}{\left({\text{E}}_{\text{p}\text{q}}-{\text{E}}_{\text{p}\text{Q}}-{\text{E}}_{\text{P}\text{q}}\:+\:{\text{E}}_{\text{P}\text{Q}}\right)}^{2}$$
(4)

where

$$\:{\text{E}}_{\text{p}\text{Q}}=\frac{1}{\left|Q\right|}\sum\:_{q\in\:Q}{E}_{pq},$$
(5)
$$\:{\text{E}}_{\text{P}\text{q}}=\frac{1}{\left|P\right|}\sum\:_{p\in\:P}{E}_{pq},$$
(6)

and

$$\:{\text{E}}_{\text{P}\text{Q}}=\frac{1}{\left|P\right|\left|Q\right|}\sum\:_{p\in\:P,q\in\:Q}{E}_{pq}=\frac{1}{\left|P\right|}\sum\:_{p\in\:P}{E}_{pQ}=\:\frac{1}{\left|Q\right|}\sum\:_{q\in\:Q}{E}_{Pq}.$$
(7)

If we define \(\:\delta\:\) as the maximum acceptable mean squared residue score, a submatrix \(\:{\text{E}}_{\text{P}\text{Q}}\) is called a \(\:\delta\:\) bicluster if \(\:S\left({\text{E}}_{\text{P}\text{Q}}\right)\:\le\:\:\delta\:\) for some \(\:\delta\:\:\ge\:\:0\). After this step, we obtained submatrices, computed their mean correlation distribution, and selected the submatrices that have a mean correlation of at least the first quartile of the mean correlation distribution across all submatrices. We repeated this analysis for refining targets and regulators by separate biclustering runs. Then we refined the modules by removing the module members that were not in any selected submatrices.

Step 6: merge modules with overlapping regulators and target genes

There could be some modules with a high number of overlapping targets and regulators. In this step, we merged such modules to reduce redundancy in modules. To merge modules, we built a graph, where each node represents a module and the weighted edges represent the overlap score of modules. The overlap score of two modules \(\:a\) and \(\:b\) is defined as \(\:\frac{\left|M\left(a\right)\:\cap\:\:M\left(b\right)\right|}{\left|M\left(a\right)\:\cup\:\:M\left(b\right)\right|}\) where \(\:M(\cdot)\) is a function that returns the set of all regulators and target genes in the module. After filtering out the edges whose weight is < 0.8, we applied the Walktrap algorithm to compute communities of modules. The modules in each community were merged into a single module.

Steps 5 and 6 were executed iteratively until the drop rate of the total number of regulators and target genes across all modules is less than or equal to \(\:x\%\), where \(\:x\) is a hyperparameter. In this study, we set \(\:x\) to 5.

Dataset description and preprocessing

We downloaded gene expression and miRNA expression data including normal and tumor samples of five cancer types, namely BRCA, KIRC, LUSC, PRAD, and UCEC from TCGA (The Cancer Genome Atlas) database using the R package TCGAbiolinks27. The RNA-seq gene expression data was downloaded as read count. The genes whose read count was ≤ 10 for all the samples were filtered out. We computed DE genes between normal and tumor samples using DESeq2(adjusted p-value ≤ 0.05 and absolute log 2 of fold change ≥ 1)28. We also downloaded miRNA expression data as read count from TCGA. We filtered the miRNAs whose read count is 0 for all the samples. We performed the DE analysis for miRNAs between normal and tumor samples using DESeq2 (adjusted p-value ≤ 0.05). We downloaded FPKM values for the DE genes and RPM values for the DE miRNAs to use for further analysis. The number of DE genes, DE miRNAs, and samples are given in Table 1.

Table 1 Number of differentially expressed genes, differentially expressed miRNAs, TFs, and samples in five cancer types. [TF: transcription factor].

We retrieved CNA data by Affymetrix SNP Array 6.0 platform from TCGA. We used maskedCNA where Y chromosome and probesets associated with frequent germline variation were filtered out. To be able to utilize the CNA signal as a regulatory factor for its own gene in Step 2 of CanMod2, we computed the gene-centric copy number value compatible with hg38 using the R package CNTools29.

We downloaded DNA Methylation by Infinium HumanMethylation27 Bead-Chip (27 K) and Infinium HumanMethylation450 Bead-Chip (450 K) platforms from TCGA. As for CNA, we computed gene-centric DNA methylation values to use them as a potential regulatory factor in CanMod2. Gene-specific DNA methylation calculation was performed for each platform separately. To compute gene-centric DNA methylation, for the 450 K platform, average beta values for promoter-specific probes were calculated as these probes are known for their role in transcriptional silencing30. For the 27 K platform, since the number of probes per gene was few, we used all the probes to compute the gene-centric DNA methylation. Then we used the common genes for both sample sets for further analysis.

Putative miRNA-gene interactions were obtained from StarBase v2.031and TargetScan 7.232databases and putative TF-target interactions were obtained from TRED33and TRRUST (version 2)34 databases.

Results

We ran CanMod2 on five cancer types from TCGA: BRCA, KIRC, LUSC, PRAD, and UCEC. CanMod2-inferred modules had regulators that were enriched in cancer-related genes and miRNAs, had experimentally validated regulator-target interactions, and had co-expression target genes enriched in GO terms, KEGG pathways, and CH terms.

We reported the descriptive statistics of CanMod2 output after each step in Table 2. In the first step, we eliminated genes that did not appear in any GC and never considered them in successive steps. In the second step, approximately 54 − 60% of the remaining genes had no TF or miRNA regulators. Thus, these genes did not initiate new modules but were considered to be added to the existing modules. We observed that most of the genes had few regulators and most of the regulators had few target genes based on LASSO (Fig. 2). In Step 3, we clustered the regulators that had at least one target in Step 2. A regulator that had > 10% common targets with at least one regulator was considered for the clustering. Approximately, 23 − 31% of the regulators did not satisfy this criterion and were discarded in this step. In Step 4, candidate modules were computed by combining GCs and RCs. Partner genes (i.e., the genes that were not selected in the LASSO step) that satisfied the inclusion criteria were added to the candidate modules. We had 200 to 485 modules generated in this step. The total number of target genes (i.e., seed and partner genes) varied between 1,013 and 3,305, whereas the number of regulators varied between 149 and 367 across the five cancer types. Steps 5 and 6 were executed iteratively to refine and merge the modules, respectively. For all cancer datasets, the number of genes, regulators, and modules remained the same or dropped slightly after the second round of these iterative steps, thus we stopped after the first iteration. The distribution of the number of miRNAs, TFs, and targets in each module and the module occurrence of miRNA, TF, and the target are shown in Fig. 3. Our results show that most modules had a similar number of miRNA and TFs, except for some modules, which had a high number of TFs (Fig. 3A). On the other hand, as expected the number of target genes was more than the regulators (Fig. 3A). In addition, our tool’s assumption on multiple functionality of regulators and genes was supported by our results showing most of the miRNAs, TFs, and genes occurred in multiple modules (Fig. 3B).

Table 2 Number of genes, miRNAs, and samples in each step of CanMod2 for five cancer types. At start column indicates the initial number of genes and regulators that CanMod2 starts with. The number of regulators after step 1 is the number of all regulators CanMod2 starts with. The number of genes after step 2 includes the genes that had at least one TF or miRNA regulator. For the steps before module generation in step 4, the number of modules refers to the number of gene clusters (denoted by *) or regulator clusters (denoted by **).
Fig. 2
figure 2

Density plot for (A) Number of genes per regulator (B) Number of regulators per gene after applying LASSO in Step 2 of CanMod2.

Fig. 3
figure 3

Boxplot for (A) the number of miRNAs, TFs, and targets in each module (B) the number of modules a miRNA, a TF, or a target occurs in. [TF: transcription factor].

Hub regulators were associated with cancer

To test if the TF and miRNA regulators that occurred in many modules were associated with cancer, we performed enrichment analysis of TFs and miRNAs in known cancer-related genes and miRNAs. We sorted TFs and miRNAs separately based on their module degree (i.e., the number of modules they were present). We took the top 20% of TFs and miRNAs and named them hub TFs and hub miRNAs, respectively. The rest of TFs and miRNAs were named non-hub TFs and non-hub miRNAs, respectively. We performed a hypergeometric test for enrichment analysis of both hub and non-hub TFs and miRNAs in known cancer-related genes and miRNAs. The cancer-related genes were obtained from Cancer Gene Census in COSMIC v91, Bushman’s cancer gene list v3 and human oncogenes were collected from ONGene and Network of Cancer Genes 6.0. The cancer-related miRNAs were obtained from oncomiRDB35. For comparison, we also checked the enrichment of non-inferred TFs and miRNAs (i.e., TFs and miRNAs that were discarded in one of the steps of CanMod2, and thus do not occur in any module).

Our enrichment results showed that hub TFs were significantly enriched in cancer-related genes in BRCA, LUSC, and PRAD datasets, whereas non-hub TFs and not-inferred TFs were not significantly enriched (Table 3). Hub miRNAs were significantly enriched for BRCA, LUSC, and UCEC datasets, and non-hub miRNAs were significantly enriched in BRCA, LUSC, and KIRC datasets. On the other hand, non-inferred miRNAs were not significantly enriched for any cancer datasets. These results suggest that regulators that were excluded in any step of CanMod2 are mostly not cancer-related ones, while the inferred regulators were highly associated with cancer.

Table 3 Enrichment results of the hub, non-hub, and not-inferred TFs and miRNAs in cancer-related genes and miRNAs, respectively. P-values were calculated based on the hypergeometric test. Significant p-values are shown in bold. [TF: transcription factor].

CanMod2-computed TF-gene interactions had a higher proportion of high confidence interactions

To check whether TF-target interactions computed by CanMod2 were among the experimentally validated interactions, we utilized experimentally validated TF-target from Dorothea36. In Dorothea, there are five categories ranging from level A (the highest confidence) to level E (the lowest confidence). To compute the percentage of high-confidence interaction in a list of interactions \(\:L\), we define TF confidence as:

$$\:TF\_confidence\left(L\right)=\frac{\left|{L}_{A}\right|+\left|{L}_{B}\right|}{\left|{L}_{E}\right|}$$

where \(\:{|L}_{A}|\), \(\:\left|{L}_{B}\right|\), and \(\:\left|{L}_{E}\right|\) denotes the number of interactions in \(\:L\) having levels A, B, and E in Dorothea, respectively. To assess whether CanMod2-computed interactions had a higher proportion of high-confidence interactions, for each CanMod2 step, we computed the TF confidence ratio as:

$$\:TF\_confidence\_ratio\:=\frac{TF\_confidence\left({L}_{CanMod2}\right)}{TF\_confidence\left({L}_{Baseline}\right)}$$

where \(\:{L}_{CanMod2}\) and \(\:{L}_{Baseline}\) represents the set of CanMod2-computed interactions in a step and the set of all possible interactions between the initial TFs and DE genes used in CanMod2. We observed that CanMod2’s TF confidence ratio (i.e. high-confidence interaction ratio between CanMod2 and baseline) was more than 1 for all the cancer types except for PRAD, indicating that CanMod2 tends to select among high-confidence interactions from the initial list of putative interactions (Table 4). For PRAD, we had a small number of TF-target interactions to start with, which made the TF confidence ratio value very sensitive to small changes in interactions. In Fig. 4, we show that for all cancer types except for PRAD, the TF Confidence Ratio has continuously increased during CanMod2 steps, which indicate the selection processes in each step of CanMod2 were likely to include high-confidence TF-target interactions.

Table 4 TF confidence ratio for each cancer type. We computed the proportion of high-confidence interactions at the final CanMod2 modules and the baseline. The TF confidence ratio \(\:>1\) indicates that CanMod2 tends to select among high-confidence interactions from the baseline. TF confidence ratio \(\:>1\) cases are shown in bold. [TF: transcription factor].
Fig. 4
figure 4

TF Confidence Ratio change at the beginning, after Step 2, after Step 4, and at the final. To compute the TF Confidence Ratio of CanMod2 at the beginning, we used TRED and TRRUST interactions between TF and DE genes. To compute TF Confidence Ratio after Step 2, we used TF-gene interactions computed by LASSO. After Step 4, all possible interactions between TFs and target genes in each module was considered. [TF: transcription factor].

We also performed a hypergeometric test between the inferred TF-target interactions and the high confidence interactions (Dorothea level A) and observed that inferred interactions had significant enrichment for all cancer types (p-values < 8.8e-15).

CanMod2-computed miRNA-target gene interactions were enriched in experimentally-validated interactions

To evaluate miRNA-target interactions computed by CanMod2, we first evaluated each miRNA separately by checking if their computed targets were enriched in experimentally-validated targets of that miRNA in miRTarBase database37. Since the CanMod2-computed targets could also be indirect targets of miRNA (i.e., the miRNA regulates a TF, which regulates the target), we also checked the enrichment in an extended target set, which included the direct and indirect targets as in miRDriver11. Briefly, if a miRNA has a TF as a target then we included the experimentally-validated targets of this TF as indirect targets of the miRNA. The enrichment test with direct targets and extended (i.e., direct and indirect) targets showed that for most cancer types at least 24% of miRNAs had significant enrichment in the experimentally-validated target set (Table 5). We observed that CanMod2 discovered several indirect targets as the percentage of significantly enriched miRNAs increased for the extended target test for all cancer types except for PRAD. We observed the largest increase, which was 10% increase, in KIRC. We observed the lowest percentage for PRAD (15% for direct targets and 10% for extended targets), which was possibly due to the small number of DE genes, regulators, and samples in this dataset (Table 1).

We also performed the enrichment of miRNA-target interactions for both extended miRTarBase and only miRTarBase. For this, we run the hypergeometric test to measure the significance of the overlap between CanMod2-computed interactions and experimentally-validated direct and extended interactions. We observed that for all cancer types, our computed interactions were significantly enriched in experimentally-validated direct and extended interactions (Table 5).

Table 5 Enrichment of miRNA-target interactions. Enrichment was computed separately for each miRNA and collectively in the experimentally-validated targets in MiRTarBase. The significant miRNAs columns show the percentage of miRNAs whose targets computed by CanMod2 were significantly enriched in the experimentally-validated targets for that miRNA in each cancer type (p-value ≤ 0.1). P-value columns show the hypergeometric test p-value between CanMod2-inferred miRNA-target interactions and experimentally-validated interactions in MiRTarBase. Significant p-values are shown in bold. [DT: direct targets, IT: indirect targets].

We also compared our interactions with interactions of random modules. For each CanMod2 module that had at least one experimentally-validated regulator-target interactions (i.e., Dorothea level A and B for TF-gene interactions and miRTarBase for miRNA-gene interactions), we created a random module, which consisted of the same regulators but a different set of target genes from the DE gene list. There were 58, 43, 58, 11, 75 such modules in BRCA, KIRC, LUSC, PRAD, and UCEC, respectively. We made sure that the size of the random module was the same as the real module size. Then, we counted how many of the regulator-target pairs in the random module were among the experimentally-validated interactions. We repeated this process 1000 times and then computed an empirical p-value as the percentage of cases where the number of experimentally-validated interactions in the random module was higher than the number of experimentally-validated interactions in the real module. A module was called significant if its empirical p-value ≤ 0.05. We found that among the modules that had at least one experimentally-validated regulator-target interaction, 57%, 74%, 64%, 91%, and 63% of them were significant for BRCA, KIRC, LUSC, PRAD, and UCEC, respectively. We observed the highest performance in PRAD, which suggests that although the PRAD modules were smaller compared to other cancer type modules, PRAD modules maintained high precision in terms of co-expression. PRAD modules had the highest mean correlation among genes across modules, too (Supplemental Fig. S1).

Expression of the target genes in the CanMod2 modules were significantly correlated

CanMod2 aims to compute modules where target genes in each module are co-expressed. In order to verify this, for each module, we computed the mean Pearson’s correlation between the expression profiles of all target gene pairs. For comparison, we created a random module for each module by randomly picking the same number of genes from DE genes without including any genes that existed in the real module and computed the same metric. For each module, we created random modules 1000 times and computed the percentage of these cases where the CanMod2 modules’ mean gene pair correlation was less than the random modules’ mean gene pair correlation and termed it as an empirical p-value for our calculation. If the empirical p-value is ≤ 0.05 then we considered the module significant. We observed that 95.9%, 94.7%, 96.3%, 90.7%, and 93.9% of modules from BRCA, KIRC, LUSC, PRAD, and UCEC were significant, respectively. We also observed that CanMod2-inferred modules had a higher mean correlation than the random ones for all cancer types (Fig. 5 and Supplemental Fig. S1). These results show that CanMod2 was able to maintain high co-expression in its computed modules.

Fig. 5
figure 5

Distributions of the mean absolute correlation among genes across the modules in BRCA. The red vertical solid line indicates the mean of this distribution, while the purple dashed one indicates the corresponding mean obtained from 1000 random modules.

CanMod2 modules were significantly enriched in GO, KEGG, and CH terms

To verify if CanMod2 modules had target genes in similar biological functionalities, we performed an enrichment analysis of target genes in each CanMod2 module for GO terms, KEGG pathways, and CH terms. We only considered the modules with > 10 genes as required by the clusterProfiler R package. We considered a GO, KEGG pathway, or CH term significant if its adjusted p-value is ≤ 0.05. We calculated the percentage of modules that contain at least one GO term, KEGG pathway, and CH term for CanMod, CanMod2 and FGMD across all five cancer types. The results show that compared to CanMod and FGMD, CanMod2 modules had higher percentage of modules with significantly enriched terms (Table 6).

Table 6 Percentage of modules significantly enriched terms. Percentage of modules with at least one GO term, KEGG pathway, and CH term for CanMod2, CanMod, and FGMD across all five cancer types. NA: not available as CanMod was unable to process the KIRC dataset due to elimination of all interactions.

The distribution of the number of enriched GO terms, KEGG pathways, and CH terms in modules for each cancer type are shown in Fig. 6. The results show that our inferred modules had target genes that were highly involved in GO terms, KEGG pathways, and CH terms, showing their roles in known biological functions. We observed that the number of enriched terms per CanMod2 modules were higher than the number of enriched terms for the FGMD modules except for LUSC (Supplemental Fig. S2). We excluded K-Means clustering and hierarchical clustering (HC) from this analysis because they produced inadequate module characteristics, such as generating very few large clusters while the remaining modules were excessively small.

To evaluate the uniqueness and complementarity of modules in terms of their enriched GO terms, we introduced a term called semantic uniqueness. To compute the semantic uniqueness of a module \(\:M\), we measure its average similarity to top k modules in terms of semantic similarity of their enriched GO terms using the formula:

$$\:U\left(M\right)=1-\frac{1}{k}{\sum\:}_{{M}_{j}{\epsilon}Top\_k}^{}\text{S}\:(\text{M},{\text{M}}_{j})\:$$

where \(\:S(M,{M}_{j})\) denotes semantic similarity of GO terms enriched for modules M and \(\:{M}_{j}\) computed using GoSemSim package (e.g., Wang similarity in mgoSim()), \(\:Top\_k\) denotes the set of most similar k modules to \(\:M\) based on semantic similarity. Intuitively, a module has a high uniqueness score when it is enriched in highly unique set of GO terms, thereby it has low similarity to other modules. Here, we used \(\:k=5\). For this analysis, we only considered the modules that have at least one significantly enriched GO term as if a module does not have any enriched GO term, their uniqueness would not be applicable. The uniqueness scores of CanMod, CanMod2 and FGMD modules for all cancer types are shown in Fig. 7. The figure shows that the median semantic uniqueness scores for CanMod2 modules are greater than CanMod and FGMD modules in every cancer type except for LUSC. This trend is also shown for \(\:k=6\) and \(\:k=7\) (Supplemental Fig. S3).

We also computed uniqueness of modules in terms of enriched KEGG pathways. We used the same uniqueness formula except that the S function was Jaccard similarity (i.e., the ratio between the number of common enriched KEGG pathways to the number of union of enriched KEGG pathways). We refer to this score as Jaccard uniqueness score. Based on \(\:k=5\), we computed Jaccard uniqueness score for CanMod, CanMod2, and FGMD across all five cancer types (Fig. 8). The results show that CanMod2 modules were more unique than FGMD modules in terms of KEGG pathways and slightly less unique than CanMod modules. Similar trends were observed for \(\:k=6\) and \(\:k=7\), as shown in Supplemental Fig. S4. CanMod modules were possibly more unique than CanMod2 modules due the number of enriched KEGG pathways in CanMod modules (Supplemental Fig. S5).

Fig. 6
figure 6

Distribution of the number of enriched GO terms, KEGG pathways, and CH terms in CanMod2 modules for each cancer type (adjusted p-value ≤ 0.05).

Fig. 7
figure 7

Distribution of Semantic Uniqueness Score. Semantic uniqueness score, which measures uniqueness of a module in terms of its enriched GO terms was calculated for modules computed using CanMod2, CanMod, and FDMD across all modules. Since CanMod was unable to run on KIRC dataset, its boxplot is missing.

Fig. 8
figure 8

Distribution of Jaccard Uniqueness Score. Jaccard uniqueness score, which measures uniqueness of a module in terms of its enriched KEGG pathways was calculated for modules computed using CanMod2, CanMod, and FGMD across all modules. Since CanMod was unable to run on KIRC dataset, its boxplot is missing.

CanMod2 exhibited higher consistency and less overlap among modules compared to CanMod and FGMD

To assess the consistency of modules computed by CanMod and CanMod2, we ran both tools multiple times and computed the similarity between multiple runs of each tool. We ran CanMod and identified 427, 635, 130, and 567 modules in BRCA, LUSC, PRAD, and UCEC, respectively. CanMod was unable to process the KIRC dataset due to elimination of all interactions, resulting in no remaining data for further analysis.

To compute similarity between two runs, we computed the intersection proportion of modules in two runs (Supplemental Fig. S6). The intersection proportion of two modules is the ratio between the number of common genes and regulators to the number of genes and regulators in each of these modules. This metric evaluates the overlap between modules obtained from two independent executions of the same tool. In an ideal scenario, a module is expected to have high similarity to one or few modules of the second run and have low similarity to the rest. We observed that CanMod2 modules in the first run had high similarity to only few modules in the second run, suggesting robust and reproducible clustering with consistent identification of similar modules in specific instances (Supplemental Fig. S6 A). On the other hand, CanMod modules had high similarity to a larger number of modules in the second run, indicating that CanMod has more overlapping modules and captures similar elements more broadly (Supplemental Fig. S6B). For instance, in PRAD a module in CanMod had high similarity to almost 50% of the modules in the second run. (Supplemental Fig. S6B). To evaluate if CanMod2 modules were consistent across more than two runs, we ran CanMod2 ten times and performed pairwise comparison for all cancer types (see Supplemental Fig. S7-S11 for BRCA, KIRC, LUSC, PRAD, and UCEC, respectively). The results show that CanMod2 runs were consistent where a module in a run was similar to only few modules in another run.

We also applied the same metric to assess module consistency within a single run (Supplemental Fig. S12). Specifically, we compared the intersection proportion of each module with all other modules from the same run. A higher proportion indicates a greater overlap between modules. Ideally, each module should have low overlap with other modules, highlighting distinct modules with minimal similarity. We observed that CanMod2 demonstrated high consistency, with only a few modules showing high similarity to each other, while the remaining modules had minimal gene set overlap. This highlights the distinct and reproducible nature of CanMod2’s clustering (Supplemental Fig. S12 A). In contrast, CanMod modules showed substantial similarity with many other modules within the same run, suggesting a tendency to capture overlapping or repeated gene sets (Supplemental Fig. S12B). Also, we observed that FGMD shows increased gene sharing, resulting in high module overlap (see Supplemental Fig. S12 C), highlighting FGMD’s repetitive nature. FGMD enriched terms span more modules for GO, KEGG, and CH enrichments across five cancer types, except for GO enrichment in KIRC, where there are fewer modules. In all other cases, FGMD demonstrates a clear difference in module distribution.

Overall, these findings underscore CanMod2’s strength in producing unique and reproducible modules both within a single run and across multiple runs. In contrast, CanMod and FGMD’s broader overlaps may reflect its sensitivity to identifying related modules more broadly.

Discussion

In this study, we proposed a computational gene regulatory module tool, CanMod2 that identifies gene regulatory modules where genes in the same module have similar biological functionalities and are co-regulated. We performed an extensive analysis of five cancer datasets from TCGA. Our results showed that CanMod2 had more unique and complementary modules in the enrichment of GO terms, KEGG pathways, and CH terms as compared to the other methods. CanMod2 showed higher consistency and less overlap among modules as compared to its earlier version, CanMod. Our hub regulators were significantly enriched in cancer-related genes and miRNAs for most cancer types, expression of target genes in modules had a high correlation, and regulatory interactions of CanMod2 were significantly enriched in ground truth interactions.

Compared to other methods, CanMod2 has fewer hyperparameters needed to be tuned. One of the parameters is the drop.thr, which controls the stopping criterion for iterating steps 5 and 6. The default value is 5%, which means that if the module size drops ≤ 5% in a consecutive run of steps 5 and 6, the iterative process is terminated. The other two hyperparameters are GC_thr and RC_thr, which indicates minimum similarity among genes and the regulators during clustering steps, respectively. The default value for GC_thr is 0.45 and RC_thr is 0.10.

At the GO (Biological Process) enrichment analysis of CanMod2-inferred modules, there were 2,191 unique GO terms enriched in BRCA, 2,549 in KIRC, 3,344 in LUSC, 1,368 in PRAD, and 2,746 in UCEC. Among them, 363, 411, 324, 340, and 466 terms were uniquely enriched in only one module in BRCA, KIRC, LUSC, PRAD, and UCEC, respectively. We calculated the frequency of modules for each enriched GO term and checked the top 10 most frequent GO terms for five cancer types, which were terms related to cell cycle, chromosome organization, catabolic process, and metabolic process. For instance, 43 BRCA modules were enriched in ‘GO:0007049’ (cell cycle) term, 61 LUSC and 67 UCEC modules were enriched in ‘GO:0022402’ (cell cycle process) term, and 60 UCEC modules were enriched in ‘GO:0010564’ (regulation of cell cycle process) term. These results support the known importance of cell cycle proteins and their regulation in cancer therapy38,39,40,41. We also found some significantly enriched GO terms that are common among all cancer types such as ‘GO:0006281’ (DNA repair), ‘GO:0006974’ (DNA damage response), ‘GO:000910’ (cytokinesis), ‘GO:0006338’ (chromatin remodeling), ‘G0:0007094’ (spindle assembly checkpoint), ‘GO:0001525’ (angiogenesis), ‘GO:0006260’ (DNA replication), and ‘GO:0072089’ (stem cell proliferation) in our modules. These GO terms are related to one or multiple cancer types and have some implications in cancer and cancer treatments. For example, defects of ‘GO:0006281’ (DNA repair) can cause genomic instability and contribute to tumorigenesis and ‘GO:000910’ (cytokinesis) failure is shown to lead to tumorigenesis in experimental model systems42,43. Specific genes involved in the ‘GO:0006974’ (DNA damage response) such as BRCA1/2 and P53 are mutated during prostate cancer progression and also associated with various subtypes of breast cancer44,45. Dysfunctions of ‘GO:0006338’ (chromatin remodeling) proteins, mutations and other genetic alternations of chromatin remodelers exist in almost all human tumor types46. We found the gene ‘BUB1B’ in the ‘G0:0007094’ (Spindle assembly checkpoint (SAC)) in our modules which is essential in breast cancer survival and SAC components are potential targets for breast cancer therapies as well47,48. Angiogenesis is the process of forming new blood vessels and this can become dysregulated in breast cancer due to an imbalance between pro-angiogenic and anti-angiogenic factors49. DNA replication stress is a hallmark of cancer that plays a critical role in driving tumorigenesis and progression50,51,52. Notably, our modules across five cancer types consistently identified this as a significant GO term that underscores its pervasive relevance in cancer biology. Our modules identified the GO term ‘GO:0072089’ (stem cell proliferation), which is strongly linked to cancer. This term involves key transcription factors such as OCT4, SOX2, and KLF4, which play critical roles in tumor initiation, progression, and therapy resistance in cancers like breast, prostate, and oral squamous cell carcinoma53.

We also observed unique and common KEGG pathways for each cancer type. Specifically, 6, 27, 3, and 20 KEGG pathways exist as significantly enriched only for BRCA, LUSC, PRAD, and UCEC, respectively. Among the significantly enriched KEGG pathways, 45, 34, 32, 33, and 26 pathways were significantly enriched in only one module of BRCA, KIRC, LUSC, PRAD, and UCEC, respectively. The frequency of modules for each KEGG pathway was calculated for all five cancer types and among the top 10 most frequently enriched KEGG pathways, the ‘hsa04110’ (cell cycle) pathway was enriched for BRCA, LUSC, and UCEC, consistently with GO enrichment results. Moreover, the ‘hsa04611’ (Platelet activation) pathway was among the top 10 enriched pathways for LUSC only. Studies found that platelet activation is predominant in lung squamous cell carcinoma (LUSC) and lung adenocarcinoma (LUAD)54. We had ‘hsa04022’ (cGMP-PKG signaling pathway) pathway among the highly occurring 10 pathways of CanMod2-computed modules in PRAD, which shows that our modules were able to involve disease-specific modules55,56,57.

There were 31, 35, 41, 12, and 37 unique CH terms enriched in CanMod2-inferred modules for BRCA, KIRC, LUSC, PRAD, and UCEC, respectively. There were 8, 6, 7, 4, and 9 CH terms that were specific to only one module for BRCA, KIRC, LUSC, PRAD, and UCEC, respectively. We also calculated the frequency of modules for each CH term for five cancer types and among the enriched CH terms, the ‘HALLMARK_G2M_CHECKPOINT’, the ‘HALLMARK_E2 F_TARGETS’, the ‘HALLMARK_EPITHELIAL_MESENCHYMAL_TRANSITION’, the ‘HALLMARK_ALLOGRAFT_REJECTION’, and the ‘HALLMARK_MITOTIC_SPINDLE’ terms were common for the top 10 most frequently enriched terms of five cancer types. These analyses suggest that CanMod2-computed modules are functionally relevant and are enriched with biological processes and pathways that have known roles in cancer.

The key differences between CanMod and CanMod2 are the algorithms that we use in various steps for clustering and selecting regulators and genes that are crucial for co-expression, updating the criteria for adding partner genes and iteratively running step 5 & 6 until converging to a certain threshold. These differences help us find more hub regulators related to cancer, more experimentally validated regulator- target interactions and more unique and complementary modules that are significantly enriched in GO, KEGG and CH terms. In our earlier work, CanMod was applied on only one cancer type (BRCA) while in this work CanMod2 is applied on five cancer types (BRCA, KIRC, LUSC, PRAD and UCEC).

We made CanMod2 as a publicly available and user-friendly tool for researchers and clinicians to utilize. Inferring modules in a specific disease using tools like CanMod2, we could help clarify the multiple biological roles that a gene could have and contribute to the comprehensive understanding of complex diseases like cancer. In the future, more data modalities such as ChIP-Seq and ATAC-Seq could be utilized to find more precise and context-specific regulator-target interactions.