Introduction

Somatically acquired mutations are one of the major sources driving tumorigenesis1. Computational approaches have been developed to assign pathogenicity scores to given mutations, indicating their phenotypic effects on an organism2,3,4,5,6,7,8. Complementary to these approaches, understanding the mechanisms driven by each mutation—from altering genomic sequences to changing key amino acid residues to dysregulating relevant cellular pathways—is key to developing effective therapeutic strategies. Efforts have been made to interpret the effects of mutations at specific scales9,10,11,12,13,14,15,16,17. Some studies focus on the molecular effects and look for spatial clustering of mutations within critical regions of proteins9,10,11,12,13,14,15,18,19. Others focus on cancer pathways and look for significantly mutated subnetworks of proteins16,17,20. Studies at the molecular and pathway levels offer complementary insights into the underlying mechanisms of cancer.

At the 3D protein structural level, the spatial clustering of mutations on 3D protein structures can reveal functionally important protein regions and can thus assist in identifying cancer driver mutations9,10,11,12,13,14,15,19. Given that the overwhelming majority of somatic mutations in cancer are non-functional passengers21, 3D clustering analysis narrows down potential driver mutations and thus significantly boost the signal-to-noise ratio. However, previous 3D clustering algorithms either limit their scope to the experimentally-determined structures9,11,12,15, or specifically focus on single proteins10,11,12,14 or protein-protein interaction (PPI) interfaces22,23. No approach yet fully examines the 3D structures of every single protein as well as the binding interfaces of all known PPIs in humans, leaving many spatial clusters yet to be identified. The bottleneck has been the limited coverage of 3D structural information: only ~36% of single proteins and ~6% of known PPIs in humans have experimentally-determined structures24. Nonetheless, recent breakthroughs in deep learning technologies for highly accurate 3D structure prediction, covering both single proteins25,26,27,28,29 and multi-protein complexes30,31,32,33,34, are rapidly filling these gaps.

At the PPI network level, various methods have been developed to identify significantly mutated subnetworks by integrating genetic mutation data with network topology16,17,35,36. These strategies have revealed many key pathways and protein complexes in cancer. Furthermore, sophisticated analyses can construct a hierarchy of altered subnetworks20,37,38, offering a nuanced, multi-layered perspective on the cancer-related biological processes across various subnetwork levels.

The insights gained from 3D protein structural level and PPI network level methodologies are largely non-overlapping, thereby offering complementary strengths. Integrating these methodologies is key to comprehensively delineate cancer mechanisms. In this work, we establish NetFlow3D, a unified framework that integrates methodologies across 3D structural and PPI network levels to systematically map the multiscale functional effects of somatic mutations across atomic, residue, protein and network scales. To enable this integration, we compile the Human Protein Structurome, a comprehensive repository encompassing the 3D structures of every single protein as well as the binding interfaces of all known protein interactions in humans. NetFlow3D initially identifies potential driver mutations through 3D clustering analysis applied to the Human Protein Structurome, and exclusively propagates these clustering signals across the PPI network, significantly enhancing the signal-to-noise ratio. It then accounts for the fact that a protein often interacts with different partners via distinct 3D structural interfaces, and accordingly weights the impact of 3D clusters at a specific PPI interface on different interaction partners differently. This end-to-end integration of protein structure and network topology leads to the identification of a much greater number of likely functional mutations and a more extensive range and larger scale of disease-associated network modules, which demonstrate molecular, cellular, and clinical significance. The NetFlow3D tool39, the Human Protein Structurome, and the results40 from applying NetFlow3D to TCGA pan-cancer data, can be accessed through our interactive web server (http://netflow3d.yulab.org/).

Results

NetFlow3D maps the functional effects of somatic mutations across multiple scales

We compiled and processed a TCGA pan-cancer dataset of 1,038,899 somatic protein-altering mutations across 9,946 tumor samples spanning 33 cancer types (Fig. 1a; Methods). Of these mutations, 82% were expected to change only one or a few amino acid residues in the encoded proteins (i.e. missense mutations and in-frame indels), and are thus collectively referred to as in-frame mutations. Without further biological contexts, it is particularly difficult to interpret the varying downstream functional effects based on these subtle changes to the protein sequences.

Fig. 1: Framework of mapping the multiscale functional effects of somatic mutations.
figure 1

a Preprocessed TCGA (The Cancer Genome Atlas) pan-cancer mutation dataset, consisting of in-frame and loss-of-function (LOF) mutations. b Overview of the Human Protein Structurome, which incorporates three-dimensional (3D) structures of 20,431 canonical isoforms (as shown in the figure) and 165,328 non-canonical isoforms (not shown), as well as the binding interfaces for 146,137 known protein-protein interactions (PPIs). c The first part of NetFlow3D, a 3D clustering algorithm for identifying both intra- and inter-protein 3D clusters of in-frame mutations. d The second part of NetFlow3D, a network propagation model for identifying interconnected modules. All 3D protein structures in this figure were visualized using PyMOL.

Mounting evidence has demonstrated the efficaciousness of identifying functional in-frame mutations by detecting spatial clusters on 3D protein structures9,10,11,12,13,14,15,18,19. In order to achieve full-coverage spatial mapping of mutations on 3D protein structures, we compiled a comprehensive repository that contains the structures of all human proteins as well as the binding interfaces of all known human PPIs and available multi-protein complex structures, which we named “the Human Protein Structurome” (Fig. 1b; Methods). Importantly, the 3D structural data of 64% of canonical human proteins and 94% of known human PPIs were generated by recent deep-learning approaches, including AlphaFold225 and PIONEER24, which were not available to previous 3D clustering algorithms.

The first part of NetFlow3D is a 3D clustering algorithm that identifies spatial clusters of in-frame mutations throughout the entire Human Protein Structurome (Fig. 1c; Methods). Our algorithm looks for both 3D clusters within single proteins (intra-protein 3D clusters) and 3D clusters spanning interacting proteins (inter-protein 3D clusters). Unlike most existing 3D clustering algorithms, (i) our method models the varying local background mutation rate across the genome by accounting for replication timing, mRNA expression level, HiC chromatin compartment, local GC content, and local gene density, an approach adapted from MutSigCV41 (Methods). This differs from the common practice in many 3D clustering algorithms that determine the significance of 3D clusters by randomly shuffling mutations within the same protein structure. (ii) Our method determines the physical contact between every pair of amino acid residues by accounting for their varying 3D distances across all available structures instead of solely based on a single snapshot represented by one structure (Methods).

The second part of NetFlow3D employs a heat diffusion model adapted from HotNet216 to propagate 3D clustering signals (“heat”) through the PPI network (“diffusion”) (Fig. 1d; Methods). Importantly, our method goes beyond traditional PPI network analyses by incorporating 3D structural information in two crucial aspects: (i) NetFlow3D assigns an initial heat score to each node (protein) based on the 3D clustering signals on that protein, unlike traditional PPI network analyses that rely on gene mutation frequency or other gene-level statistics, thereby significantly boosting the signal-to-noise ratio. (ii) When NetFlow3D propagates heat from one node to its neighbors (representing the impact of 3D mutation clusters), it assigns additional propagation weight to the edges (PPIs) that have 3D mutation clusters on their corresponding PPI interfaces (i.e., anisotropic) (Supplementary Fig. 1). This strategy is grounded in the “edgetic effect” of functional missense mutations, indicating that mutations at the interface are more likely to disrupt the corresponding PPI than non-interface mutations. This effect has been observed in both germline42,43,44 and somatic mutations (Supplementary Fig. 2; Supplementary Data 1). NetFlow3D’s weighted propagation strategy differs from traditional PPI network analyses that typically treat all edges connected to a given node as equal. Subsequently, NetFlow3D identifies “interconnected modules” within the network, i.e., subnetworks characterized by densely interconnected 3D clusters. To be in the same module, two proteins, \(u\) and \(v\), should both have substantial 3D clustering signals that significantly impact each other. This method is designed to prevent the formation of “star graphs”, which are centered around well-studied cancer proteins but include surrounding proteins with minimal 3D clustering signals and biological relevance.

As a complement to the first and second parts that focus on in-frame mutations, NetFlow3D also accounts for loss-of-function (LOF) mutations. These mutations, which drastically alter protein sequences, are generally less specific about where they occur within protein structures to exert their effect. Therefore, NetFlow3D evaluates the enrichment of LOF mutations scattered across the entire sequence of each protein, and incorporates these protein-specific LOF enrichment signals as additional initial heat scores into the heat diffusion model in the second part (Fig. 1a and d; Methods).

Overall, NetFlow3D maps the functional effects of somatic mutations across multiple scales: from atomic-resolution 3D clustering of mutations, to perturbations of key proteins/PPIs, to the dysregulation of network modules and cellular pathways. As a coherent and unified framework, NetFlow3D integrates information across all these levels, thereby reinforcing confidence in discoveries at each scale: For example, 3D clustering of mutations across atomic and residue levels allows network propagation of only likely driver mutations and pinpoints their specific impacts on different interaction partners; While network propagation and topological analysis further boost confidence in those 3D mutation clusters that are significantly interconnected within the same module, and shed light on complex biological processes underlying disease etiology.

Significant intra- and inter-protein 3D clusters throughout the Human Protein Structurome

We applied the 3D clustering algorithm in NetFlow3D to the 849,690 somatic in-frame mutations in the TCGA pan-cancer dataset. This analysis led to the identification of 7,634 significant intra-protein 3D clusters and 6,810 significant inter-protein 3D clusters throughout the Human Protein Structurome (Fig. 2a; Supplementary Data 2). Notably, 60% of intra-protein clusters and 50% of inter-protein clusters were identified using 3D structural data from deep learning databases. For example, within the 3D structure of PPP2R5B protein generated by AlphaFold 2, we identified an intra-protein 3D cluster composed exclusively of rarely mutated residues (i.e., mutated in no more than two tumor samples) (Fig. 2b). These residues would not have been identified through individual analysis. Impressively, 99.1% of residues in our significant 3D clusters do not exhibit significant recurrent mutations when analyzed individually (Supplementary Fig. 3a). However, these infrequently mutated residues demonstrate a significant enrichment for catalytic residues (Supplementary Fig. 3b). The use of AlphaFold 2-generated structures was crucial in identifying these potentially functional, yet infrequently mutated residues in proteins without experimentally-resolved structures. Moreover, single protein structures alone (even if covering every human protein) are still not enough for the comprehensive identification of all 3D clusters. This is because many driver mutations accumulate at the binding interfaces of cancer-related PPIs22,45,46. Only looking at individual proteins will split inter-protein 3D clusters into smaller fragments on individual proteins, making them harder to identify. This is demonstrated by the fact that, among the identified residues within our significant inter-protein 3D clusters, 55.8% would not have been identified if we only searched for significant intra-protein 3D clusters. Such situations are exemplified by an inter-protein cluster on the PPI interface between RHOC and ARHGAP1 proteins, as revealed by PIONEER (Fig. 2c). These results highlight the importance of knowing PPI interfaces, which are mostly generated by our deep learning framework PIONEER, in identifying potential driver mutations. Overall, 91.6% of TCGA tumor samples with somatic in-frame mutations have at least one mutation incorporated by our significant 3D clusters, demonstrating the thoroughness of our 3D cluster identification.

Fig. 2: Significant 3D clusters identified by NetFlow3D and performance evaluation.
figure 2

a Summary of intra- and inter-protein clusters identified by NetFlow3D. b, c Examples of significant 3D clusters identified using deep-learning-generated 3D structural data. Significance is determined by adjusted p-values (<0.05), derived from Bonferroni correction of raw p-values calculated using one-sided Poisson tests (Methods). The 3D protein structures are visualized using Python NGLview package. b An intra-protein cluster identified using AlphaFold 2-generated structure of PPP2R5B. All mutations incorporated by this cluster are on the residues with “very high” or “confident” model confidence. c An inter-protein cluster identified using PIONEER-generated interaction interface between RHOC and ARHGAP1. For visualization purposes, a 3D structure of this protein complex is generated using AlphaFold Multimer. d, e Performance comparison between NetFlow3D and state-of-the-art 3D clustering algorithms. Performance curves are drawn for the top 1–500 genes, ranked by each algorithm based on the highest scoring 3D cluster on each gene. Source data are provided as a Source Data file. d Intra-protein 3D clustering results. e Inter-protein 3D clustering results. f Enrichment was calculated as the ratio of the observed fraction of catalytic residues among the residues under investigation over the fraction of catalytic residues on corresponding proteins (expected fraction). The error bars indicate standard error, calculated using the delta method. P values for each bar were calculated using two-sided Z-tests (****P < 0.0001). Residues in significant 3D clusters: n = 101,704; Other mutated residues: n = 682,471. P-value for comparing the observed fraction of catalytic residues between the two groups was calculated using a two-sided two-proportion Z-test. Source data are provided as a Source Data file.

We then evaluated the performance of NetFlow3D and compared it with four state-of-the-art 3D clustering algorithms9,10,11,13 which represent major sources of 3D cluster identification (Methods). We applied each algorithm to the same TCGA pan-cancer dataset, and compared the 3D clusters identified by different algorithms. Considering that (i) some algorithms only focus on intra-protein clusters10,11 while some others identify both9,13, and (ii) some algorithms only use experimentally-determined structures9,11 while some others also include comparative protein structure models10,13, we therefore make coherent comparisons by (i) assessing the intra- and inter-protein clusters separately, and (ii) limiting the comparisons to the 3D clusters identified on experimentally-resolved structures. Genes were ranked by each algorithm according to the highest score obtained from all the 3D clusters present on them. As a result, within the same number of top genes ranked by each algorithm, NetFlow3D-ranked genes consistently include a higher number of known cancer genes listed by the Cancer Gene Census (CGC)47,48 (Supplementary Data 3) as well as a lower number of non-cancer-associated genes49,50,51 (Supplementary Data 4), demonstrating our advanced sensitivity and specificity (Fig. 2d-e). This was further validated using an independent pan-cancer dataset from the Catalogue of Somatic Mutations in Cancer (COSMIC)48, where NetFlow3D maintained its leading performance (Supplementary Fig. 4; Methods).

Beyond 3D clustering algorithms, we benchmarked NetFlow3D against other methods for identifying cancer driver mutations, including single-residue-based (“hotspot”) and whole-gene-based methods. The test unit size of 3D clustering algorithms falls between these two extremes. Notably, NetFlow3D outperforms these methods, demonstrating the highest precision and recall (Supplementary Fig. 5). While the hotspot method is highly precise, it lacks power when background mutation rates are low or sample sizes are small. The whole-gene-based method, which considers the entire gene as the test unit, can dilute statistical power and lacks precision when only specific regions within the gene are responsible for driving cancer. In contrast, our 3D clustering algorithm in NetFlow3D provides flexible test unit sizes at submolecular resolution, achieving a balance of higher precision and better power.

Overall, the 3D clusters identified by NetFlow3D demonstrate a significant enrichment for catalytic residues52, while mutated residues outside these clusters exhibit a significant depletion (Fig. 2f; Methods). This pattern remains robust and is not sensitive to variations in p-value cutoffs (Supplementary Fig. 6). Notably, this robust pattern is consistent across 3D clusters identified from both experimentally-determined structures and deep-learning-generated 3D structural data (Supplementary Fig. 7a). Considering the intrinsic bias of inter-protein clusters towards functional residues, as PPI interface residues are known to be enriched for such residues22,23,24,45,53,54,55, we specifically excluded these inter-protein clusters from our analysis and strictly focused on intra-protein clusters. Our refined analysis shows that the previously identified pattern persists (Supplementary Fig. 7b). Moreover, proteins involved in our 3D clusters demonstrate a significant enrichment for known cancer genes, whereas proteins not involved in any 3D clusters show a significantly depletion (Fig. 3a). This pattern is robust, remaining consistent across a range of p-value cutoffs (Supplementary Fig. 8).

Fig. 3: The advantages of integrating 3D structural information and PPI network topology over using either alone.
figure 3

a Enrichment was calculated as the ratio of the observed fraction of known cancer genes among the genes under investigation over the fraction of known cancer genes among all genes covered by the TCGA dataset (expected fraction). The error bars indicate standard error, calculated using the delta method. P values for each bar were calculated using two-sided Z-tests (****P  <  0.0001). NetFlow3D-identified significantly interconnected modules: n = 561 genes; Significant 3D clusters: n = 5698 genes; Other mutated genes: n = 8738 genes. P-values for comparisons between the observed fractions of known cancer genes in different groups were calculated using two-sided two-proportion Z-tests. Source data are provided as a Source Data file. b Consistency with established biological processes was evaluated for NetFlow3D-identified significantly interconnected modules (n = 26), random groups of proteins with significant 3D clusters matched in number and size (n = 26 × 10 replicates), and randomly-selected connected components in the network with matched number and sizes (n = 26 × 10 replicates). The box plots indicate the medians (centerlines), first and third quartiles (bounds of boxes) and 1.5× interquartile range (whiskers). Any data point outside this range is considered an outlier and plotted individually. P-values for comparisons between groups were calculated using two-sided Mann-Whitney U test. Source data are provided as a Source Data file. c Results from systematically removing two key strategies that NetFlow3D used to incorporate 3D structural information via nodes and edges. The color scale for nodes represents their initial heat score. In the first and second rows, node color intensity reflects the sum of the -log10-transformed p-value of the protein’s most significant 3D cluster and the -log10-transformed p-value of the protein’s LOF enrichment (see Methods). In the third row, node color intensity corresponds to the number of tumor samples in which the protein has mutations.

Importance of our end-to-end integration of 3D structural information and PPI network topology

The critical innovation of NetFlow3D over previous methods lies in its seamless, end-to-end integration of 3D structural information with PPI network topology. To underscore the additional insights this integration provides, we compared the outcomes of NetFlow3D with those from methods that use only information in either 3D protein structures or PPI network topology.

The advantage of our end-to-end integration over solely relying on 3D protein structural information manifests in two key aspects. First, the dense interconnections among 3D clusters within the same module further reinforce their validity, bolstering confidence in molecular-level discoveries. This is evidenced by the observation that proteins within NetFlow3D-identified significantly interconnected modules (Supplementary Data 5) contain a significantly higher fraction of known cancer genes compared to those identified solely by significant 3D clusters, even though the latter already demonstrate significant enrichment (Fig. 3a). Second, by extending the analysis beyond identifying crucial 3D structural regions within proteins, the propagation of 3D mutation clustering signals throughout the PPI network provides deeper insights into the dysregulated biological processes underlying tumorigenesis. This is demonstrated by the observation that significantly interconnected modules identified by NetFlow3D align more closely with established biological processes56,57,58,59,60 than do random groups of those proteins with significant 3D clusters which were organized to match the NetFlow3D-identified modules in number and sizes (Fig. 3b). However, this closer alignment is not just an outcome of the PPI network’s topology, as randomly selected connected components with matched number and sizes show significantly lower consistency with established biological processes (Fig. 3b). Thus, it’s the effective integration of molecular-level 3D clustering information and the PPI network’s topology that plays a key role in uncovering critical biological processes that are potentially central to cancer development.

The advantage of our end-to-end integration over the methods relying solely on PPI network topology is the significant improvement in statistical power. This improvement is demonstrated by the outcomes of systematically removing the two key strategies that NetFlow3D used to incorporate 3D structural information via nodes and edges (Fig. 3c; Methods). Initially, the edge weight in NetFlow3D, determined by 3D clustering signals on PPI interfaces, was removed, leading to uniform propagation from each node to all its neighbors. As a result, the significantly interconnected modules identified thereafter contain ~¼ of the proteins found in the original NetFlow3D-identified significantly interconnected modules. Next, the initial heat scores assigned to each node, determined by the 3D clustering signals on each protein, was replaced by gene mutation frequency. This further change fully reverted the original NetFlow3D framework to a standard PPI network approach. Consequently, the significantly interconnected modules identified thereafter contain only ~1/8 of the proteins identified by the original NetFlow3D framework.

Biological significance of NetFlow3D-identified significantly interconnected modules

We benchmarked NetFlow3D-identified significantly interconnected modules (hereafter called “NetFlow3D modules”) against well-established cancer signaling pathways61 (positive controls) (Supplementary Data 6) and Gene Ontology (GO) biological processes (BPs)56,57,58,59,60 (background reference) (Supplementary Data 7). Enrichment analysis for known cancer genes demonstrated that NetFlow3D modules exhibit enrichment levels comparable to those of well-established cancer pathways and significantly surpass those found in GO BPs (Fig. 4a). Furthermore, we analyzed mutation patterns within each entity—whether a NetFlow3D module, a well-known cancer pathway, or a GO BP—by calculating enrichment for two distinct mutation categories: (i) mutations within significant 3D clusters, and (ii) all mutations (Methods). Consequently, well-known cancer pathways and NetFlow3D modules consistently demonstrate pronounced enrichment trends for both mutation categories, with a particularly striking increase when switching from all mutations to the mutations within significant 3D clusters (Fig. 4b). In contrast, GO BPs exhibit no obvious trend of enrichment for all mutations and a much compromised enrichment for those within significant 3D clusters, with only a minor increase when contrasting the two mutation categories. Notably, upon splitting NetFlow3D modules into two groups based on whether they contain known cancer genes, the mutation patterns across the two groups are strikingly consistent (Fig. 4c), both resembling well-known cancer pathways (Fig. 4b). In contrast, GO BPs present a different picture: even those GO BPs that include known cancer genes exhibit much weaker mutation enrichment trends for both mutation categories. Meanwhile, GO BPs lacking known cancer genes display virtually no trend of mutation enrichment at all (Fig. 4c).

Fig. 4: Evaluating the biological significance of NetFlow3D modules.
figure 4

a Enrichment comparison for known cancer genes among well-known cancer pathways (n = 748 genes), NetFlow3D modules (n = 561 genes), and Gene Ontology (GO) biological processes (n = 12,523 genes). Enrichment was calculated as the ratio of the observed fraction of known cancer genes in each group over the fraction of known cancer genes among all genes covered by the TCGA dataset (expected fraction). The error bars indicate standard error, calculated using the delta method. P values for each bar were calculated using two-sided Z-tests (****P  <  0.0001; *P  <  0.05). P-values for comparisons of the observed fractions between different groups were calculated using two-sided two-proportion Z-tests. Source data are provided as a Source Data file. b, c Enrichment was calculated as the ratio of the observed fraction of mutations within each pathway/module/process to the expected fraction, determined by the relative length of their proteins compared to the total length of all proteins covered by the TCGA dataset. The box plots indicate the medians (centerlines), first and third quartiles (bounds of boxes) and 1.5 × interquartile range (whiskers). Fold change for each pathway/module/process was calculated as the ratio of their enrichment for mutations within significant 3D clusters compared to all mutations. P-values for comparisons of fold changes between groups were calculated using two-sided Mann-Whitney U tests. Source data are provided as a Source Data file. b n = 32 well-known cancer pathways; n = 26 NetFlow3D modules; n = 7524 GO biological processes. c NetFlow3D modules: with known cancer genes: n = 14; without known cancer genes: n = 12. GO biological processes: with known cancer genes: n = 5493; without known cancer genes: n = 2031. d Association of NetFlow3D-identified mutations with patients’ survival. P-values and coefficients were derived from the Cox model (see Methods). Hazard ratios (HR) were calculated by exponentiating the coefficients. Source data are provided as a Source Data file.

To demonstrate downstream molecular consequences of NetFlow3D-identified mutations, we evaluated their statistical association with protein abundance. Initially, we performed multiple linear regression analysis, controlling for gene-specific and tissue-specific baseline expression levels, as well as clinical covariates including sex, age, tumor stage, and TMB (Supplementary Note 1). This analysis revealed significant associations between the presence of NetFlow3D-identified mutations and protein abundance (t-test: t(171109) = 6.0, p = 1.6e-9, coefficient = 0.073, 95% CI = [0.049, 0.096]). In contrast, no significant association was observed when conducting the same analysis using other mutations not identified by NetFlow3D (t-test: t(1034939) = 0.38, p = 0.70, coefficient = 0.0026, 95% CI = [-0.011, 0.016]). For a more detailed perspective, we conducted a fine-grained analysis comparing protein abundance for each gene in each cancer type, under scenarios with and without NetFlow3D-identified mutations. As a control, we repeated the analysis using other mutations. Our results revealed a significantly higher proportion of cases with differential protein abundance for NetFlow3D-identified mutations than for other mutations (Supplementary Fig. 9).

To further evaluate the impact of genes with NetFlow3D-identified mutations on cellular fitness, we utilized core fitness (CF) genes identified from genome-scale CRISPR-Cas9 screens in 324 human cancer cell lines spanning 30 cancer types62. We analyzed the enrichment of these core fitness genes in NetFlow3D results. Our results consistently showed that, across various cancer types, genes with NetFlow3D-identified mutations are most enriched for core fitness genes (Supplementary Fig. 10). Additionally, genes with mutations identified in isolation by our 3D clustering analysis also showed significant enrichment in every cancer type, whereas genes without mutations in 3D clusters did not exhibit significant enrichment in any cancer type.

To demonstrate the clinical significance of NetFlow3D findings, we compared the overall survival between patients with somatic in-frame mutations in our preprocessed TCGA dataset, grouping them by whether their mutations were identified by NetFlow3D (Methods). We used a Cox regression model to evaluate the statistical association between NetFlow3D-identified mutations and patient survival, controlling for clinical covariates including age, sex, tumor stage, and tumor mutational burden (TMB). Our analysis revealed significant negative survival associations across multiple cancer types, including Thyroid carcinoma (THCA), Kidney renal clear cell carcinoma (KIRC), Adrenocortical carcinoma (ACC), and Brain Lower Grade Glioma (LGG) (Fig. 4d). The hazard ratios (HR) derived from the Cox model coefficients were consistently >1.5 across all four cancer types (Supplementary Data 8).

Next, we assessed NetFlow3D’s capability to uncover additional insights beyond known cancer genes. Remarkably, 80% (447 out of 559) of the proteins identified within NetFlow3D modules are not encoded by known cancer genes listed in the CGC. Moreover, even after removing 3D clustering and LOF enrichment signals from known cancer genes and subsequently re-applying our 3D structurally-informed network propagation framework, the resulting significantly interconnected modules still cover 23 out of the 26 original NetFlow3D modules (Supplementary Fig. 11; Supplementary Note 2).

A pan-cancer functional map of somatic mutations across scales

Applying NetFlow3D to the TCGA pan-cancer dataset has yielded a multiscale functional map of somatic mutations in cancer (Fig. 5). From a biological perspective, this map encompasses a broad spectrum of cellular processes and functions, spanning well-established cancer pathways, components that are increasingly recognized through recent evidence, and biological entities with less-characterized roles in cancer (Supplementary Data 9). (i) Well-established cancer pathways. Examples include p53 signaling, regulation of apoptosis, regulation of E2F-dependent transcription, and intracellular signaling cascades like Ras, PI3K, mTOR, and TGF-β. (ii) Increasingly recognized pathways and protein complexes. Examples include Rho GTPase signal transduction63, chromatin remodeling (e.g. PRC264, MLL complex65), immune processes (e.g. antigen processing and presentation66), and DNA repair mechanisms67 (e.g. interstrand cross-link repair and DNA non-homologous end joining). (iii) Biological entities with less-characterized roles in cancer. Examples include protein K11-linked ubiquitination68, eIF2 activity69,70, TRiC71, TFIID complex72, and calcineurin73. Notably, 46% of the NetFlow3D modules do not contain any known cancer genes listed in CGC. These modules are particularly intriguing as they represent unexplored opportunities for cancer pathway identification. Their significance has been underscored by the observation that these modules show mutation patterns strikingly similar to those of well-established cancer pathways (Fig. 4b, c).

Fig. 5: A multiscale functional map of somatic mutations in cancer.
figure 5

The significantly interconnected modules identified by NetFlow3D are displayed here. The largest module is divided into 11 core biological entities, connected by gray arrows indicating mutation impacts between them. Importantly, red stars on the edges and the bolding of these edges indicate the presence of significant inter-protein 3D mutation clusters on the binding interfaces between interacting protein pairs. Drug targets are labeled based on the full list of U.S. Food and Drug Administration (FDA)-approved drugs, as detailed in Supplementary Data 10.

Shifting the focus to the methodological advancements, this map generated by NetFlow3D not only aligns with key discoveries from traditional PPI network analyses, but also provides additional insights achieved through integrating 3D structural information. Specifically, the map’s composition is threefold: (i) Components that are also identifiable via traditional PPI network approaches. For example, our map includes a vast majority of biological entities identified by HotNet216, such as pathways like p53, PI3K, and KEAP1-NFE2L2. It also encompasses protein complexes such as MLL, cohesin, and SWI/SNF, as well as “linker genes” including regulators of Ras signaling and elements of MAPK signaling. Furthermore, our map also includes the complement system, as identified by Olcina et al.74 in their analysis of 69 cancer mutation datasets using HotNet2. (ii) Components emerging from combining PPI network topology and orthogonal data/analyses. For example, Wang et al.75 integrated PPI network topology with GWAS and identified the spliceosome. Gupta et al.76 integrated PPI network topology with gene co-expression network and external pathway annotations such as KEGG/Reactome/GO/IPA and identified Rho GTPase signal transduction. (iii) Components uniquely identified by NetFlow3D through integrating PPI topology with 3D structural information, such as PP2A, CCR4-NOT complex, mitochondrial ribosome, and calcineurin, etc. This distinct category highlights the additional insights provided by the end-to-end integration of the local spatial organization of mutations on 3D protein structures and their global topological organization in the network.

Integrator-PP2A complex

To showcase how NetFlow3D reveals deeper insights into cancer biology, we presented one NetFlow3D module as an example, which corresponds to two established biological entities: the integrator complex77 and the PP2A complex78 (Fig. 6a). These two biological entities work collaboratively: PP2A is recruited to transcription sites by the integrator complex, where PP2A functionally counteracts CDKs-driven cell-cycle progression, thereby maintaining cell homeostasis79,80,81,82 (Fig. 6b).

Fig. 6: Example of a NetFlow3D module highlighting biological insights with proof-of-concept experimental validation.
figure 6

a The biological entities incorporated by this module include the integrator complex and the protein phosphatase 2A (PP2A) complex. b Functionality of these entities: PP2A is recruited to transcription sites by the integrator complex, where PP2A functionally counteracts cell-cycle progression, thereby maintaining cell homeostasis. c, d A potential driver mutation highlighted in this module, PPP2R1A p.Arg258Cys, identified based on the significant 3D clusters at the binding interfaces between PPP2R2A and PPP2R3A. Significance is determined by adjusted P < 0.05, derived from Bonferroni correction of raw P-values calculated using one-sided Poisson tests (Methods). The 3D protein structures are visualized using the Python NGLview package. e TMT-IP-MS-based quantitative proteomics analysis confirmed that PPP2R1A p.Arg258Cys diminished PPP2R1A’s interactions with almost all of its interactors in HEK 293 T cells. The volcano plot summarizes the quantitative results for the identified interactors that co-purify with PPP2R1A p.Arg258Cys compared to PPP2R1A wildtype (WT) (n = 3 biologically independent experiments with similar results). One-sided two-sample t-tests were used to calculate raw P-values, which were then adjusted using the Benjamini-Hochberg method. Interactors were considered significantly depleted if they had a fold change <½ and an adjusted P < 0.05. Source data are provided as a Source Data file. f Co-immunoprecipitation (Co-IP) confirming that PPP2R1A p.Arg258Cys disrupted PPP2R1A–PPP2R3A interaction in HEK 293 T cells (n = 3 biologically independent experiments with similar results). Uncropped western blot images are provided as a Source Data file. g Summary of experimentally validated disrupted PPP2R1A subnetwork.

Towards the identification of cancer driver mutations, we focused on the p.Arg258Cys mutation in the PPP2R1A protein within this module. NetFlow3D identified this mutation as being part of the significant 3D clusters at the binding interfaces between PPP2R1A/PPP2R2A proteins and PPP2R1A/PPP2R3A proteins (Fig. 6c-d). In our TCGA study, this mutation originated from a patient with uterine corpus endometrial carcinoma (UCEC). Additionally, mutations at the PPP2R1A codon 258 have been observed in serous and endometrioid carcinomas as reported in several non-TCGA studies83,84,85. Our quantitative-proteomics-based TMT-IP-MS and co-immunoprecipitation experiments showed that PPP2R1A p.Arg258Cys mutation diminished PPP2R1A’s interactions with almost all of its interactors (Fig. 6e). Particularly, it disrupted the interactions with other PP2A subunits within this module (Fig. 6e-g). Therefore, it is plausible to speculate that PPP2R1A p.Arg258Cys mutation diminished PP2A function by disrupting its subunit interactions. Given the previous evidence showing that the inactivation or inhibition of PP2A promotes cancer development86,87,88, our experimental validation of the disrupted PPP2R1A subnetwork (Fig. 6g) caused by the PPP2R1A p.Arg258Cys mutation underscores the value of NetFlow3D in identifying cancer driver mutations and illuminating potential tumorigenic mechanisms.

Importantly, NetFlow3D’s ability to identify the PP2A complex hinges upon our end-to-end integration of 3D structural information with PPI network topology. On one hand, PP2A was completely overlooked using the standard PPI network approach. On the other hand, >90% of PP2A subunits do not contain any single-residue hotspots, indicating that relying solely on mutation recurrence fails to capture this full biological entity. In contrast, by utilizing 3D structural insights from our Human Protein Structurome, NetFlow3D successfully identified significant 3D clusters on every PP2A subunit within this module, affirming their significant association with cancer individually. Furthermore, the dense interconnectivity among these significant 3D clusters, as revealed by NetFlow3D, further reinforces the overall functional significance of the PP2A complex in cancer biology.

Discussion

Our work demonstrates the effective integration of 3D protein structural information with PPI network topology as achieved by NetFlow3D, our end-to-end 3D structurally-informed network propagation framework. This integration provides additional insights that can not be gained from each component in isolation. NetFlow3D applied 3D clustering analysis across the entire Human Protein Structurome, which not only identified > 100-fold more potentially functional residues than using the single-residue-based hotspot method, but also discovered over twice as many significant 3D clusters compared to traditional 3D clustering analysis using only experimentally-resolved structures. Moreover, our strategy of 3D-structurally-informed network propagation led to the identification of a much higher number of significantly interconnected modules. These modules not only incorporated ~ 8 times more proteins than those identified by standard PPI network analyses, but also demonstrated a 2.6-fold greater enrichment of known cancer genes compared to solely leveraging 3D structural information, thereby revealing many aspects of cancer biology that were poorly understood.

In addition to pan-cancer studies, NetFlow3D is also applicable to studies focusing on specific cancer types. It enables users to not only input somatic mutation data, but transcriptome and interactome data tailored to a particular cancer tissue context. NetFlow3D then applies its 3D clustering algorithm to a subset of 3D structural data in the Human Protein Structurome, filtered based on the context-specific expression profile. Given that our Structurome contains the 3D structures of all human protein isoforms, it offers a great capacity to adapt to a variety of cellular contexts. Following this, NetFlow3D propagates 3D clustering signals through a context-specific PPI network. Considering the current limitations in interactome data, where most PPIs are mapped in generic contexts such as using yeast or HEK293 cell lines, NetFlow3D addresses this by filtering the general human PPI network with context-specific transcriptome data, thus focusing on the subnetwork of genes that are actually expressed. Looking ahead, as experimentally-determined cell-type-specific interactome data become available, we anticipate further improvement in NetFlow3D’s performance for these targeted applications.

Furthermore, the core principles of NetFlow3D are not confined to somatic mutations in cancer, but can be extended to understanding germline variants in various diseases. Recent studies have demonstrated that permutation-based 3D clustering analysis, when applied to neurodevelopmental disorders89,90 and the Human Gene Mutation Database (HGMD)19, can effectively identify rare disease-associated variants. Adapting NetFlow3D to utilize the latest genome-wide models of germline mutation rates at base pair resolution91,92 represents an advancement to these approaches. Additionally, NetFlow3D’s context-specific analyses are particularly well suited for studying diseases that manifest in specific tissues or cell types.

Despite these strengths, NetFlow3D’s performance is limited by the quality of available 3D structural data, especially those generated by advanced deep learning algorithms. Our Human Protein Structurome now contains atomic-resolution 3D structures for all individual human protein isoforms. However, for most PPIs, the Structurome is limited to interface residue data. Advanced deep learning algorithms, including various AlphaFold-based methods (such as AlphaFold-Multimer32, AF2Complex31, and others30,33,34) have begun producing atomic-resolution 3D structures for multi-protein complexes. Yet, these methods are currently capable of producing high-confidence models for only a very limited subset of PPIs. Therefore, updating the PPI interfaces in our existing Structurome with these atomic-resolution structural models is still a considerable challenge. Continued advancements in these techniques are expected to extend their coverage, and we foresee further enhancement in NetFlow3D’s performance as we integrate these evolving resources.

NetFlow3D also has limitations in fully accounting for all types of driver mutations. This includes in-frame mutations that, despite not clustering on 3D protein structures, are still functional in cancer. For example, mutations impacting protein stability often occur within the core of proteins, altering function without targeting specific residues. Similarly, mutations in intrinsically disordered regions (IDRs) can markedly disrupt overall protein flexibility. Moreover, copy number variations (CNVs), structural variants (SVs), and noncoding mutations – especially those affecting regulatory elements93,94,95,96,97,98,99, contribute to altering gene dosage or expression, thereby diversifying cancer mechanisms. Expanding NetFlow3D to integrate these mutation types would improve its ability to offer a more complete understanding of cancer biology, representing a crucial area for future development.

Methods

Data collection

The Cancer Genome Atlas (TCGA)

3.6 M somatic mutations across 10,295 tumor samples and 33 cancer types were obtained from the standard MC3 analysis100. We included an additional 178 tumor samples in the current TCGA program (https://portal.gdc.cancer.gov/), not covered by the MC3 dataset but provided by Chang et al.1. RNA-seq data were obtained from Repository on the GDC data portal101 (https://portal.gdc.cancer.gov/).

The Catalogue of Somatic Mutations in Cancer (COSMIC)

We obtained coding point mutations from genome-wide screens (including whole exome sequencing) under genome assembly GRCh37, along with sample data and cancer classification information from COSMIC release v9848 (https://cancer.sanger.ac.uk). All TCGA tumor samples were excluded from this dataset to ensure independence. We further filtered the dataset to retain only primary tumor samples, i.e., retaining those labeled as “primary” in the “TUMOUR_SOURCE” column and excluding “cell-line”, “xenograft”, “organoid culture”, or “short-term culture” in the “SAMPLE_TYPE” column. To eliminate redundancy in COSMIC, we used the drop_duplicates() function in pandas, with key identifiers including “CHROMOSOME”, “GENOME_START”, “GENOMIC_WT_ALLELE”, “GENOMIC_MUT_ALLELE”, and “TUMOUR_ID”. The cancer classification information in this dataset was aligned with TCGA projects using subject matter expertize. Tumor samples that could not align with any TCGA projects were categorized as having an unknown cancer type.

Data preprocessing

VEP annotation

A single canonical effect per mutation was reported using Variant Effect Predictor (VEP) version 107102, following the approach used by Chang et al.1. Additionally, to evaluate the consequence of accounting for proteoform diversity, we conducted analysis on the same TCGA mutation dataset, but mapping each mutation to all possible protein isoforms. Details on this are provided in the Supplementary Note 3. According to VEP annotations, we only retained protein-altering mutations, including LOF mutations (“Consequence” column: frameshift_variant, stop_gained, stop_lost, start_lost, splice_acceptor_variant, splice_donor_variant, splice_donor_5th_base_variant) and in-frame mutations (“Consequence” column: missense_variant, inframe_deletion, inframe_insertion).

Excluding germline variants

According to VEP annotations, we removed mutations with non-zero allele frequencies in gnomAD103 (“gnomADe_AF” column), which were identified as germline variants present in the general population. The consequences of applying this filter are detailed in the Supplementary Fig. 12.

Excluding mutations in unexpressed genes

We defined expressed genes of a specific cancer type as those with RNA expression levels ≥1 FPKM in ≥80% of tumor samples within that cancer type. We only retained the mutations in those expressed genes of their cancer types. For the tumor samples of unknown cancer types, we only retained their mutations in the genes that are expressed in ≥ 80% of TCGA cancer types. Following the approach by Leiserson et al.16, mutations in 18 well-known cancer genes (AR, CDH4, EGFR, EPHA3, ERBB4, FGFR2, FLT3, FOXA1, FOXA2, MECOM, MIR142, MSH4, PDGFRA, SOX1, SOX9, SOX17, TBX3, WT1) that have low transcript detection levels were exempted from the aforementioned RNA expression filter. The consequences of applying this filter are detailed in the Supplementary Fig. 13.

UniProt ID mapping

We obtained the ID mapping data from UniProt104, which incorporates the mapping between UniProt IDs and VEP-annotated Ensembl gene, transcript, and protein IDs. We mapped each mutation to UniProt entries, initially based on their annotated Ensembl protein IDs, then sequentially using Ensembl transcript and gene IDs if protein IDs are not available.

After data preprocessing, the TCGA dataset yielded 1,038,899 somatic protein-altering mutations across 9,946 tumor samples in 33 cancer types. The COSMIC dataset yielded 571,789 somatic protein-altering mutations across 12,352 tumor samples that were aligned to 27 TCGA cancer types.

Construction of the human protein structurome

Data collection

Experimentally-determined structures were obtained from the Protein Data Bank105,106 (PDB, http://www.rcsb.org/), specifically focusing on asymmetric units. Predicted 3D structures of all human protein isoforms were obtained from the AlphaFold Protein Structure Database25,107 (AlphaFold DB), encompassing both 20,431 canonical isoforms (Fig. 1b), and 165,328 non-canonical isoforms. Interface residue data for 146k known human PPIs were obtained from PIONEER24.

Processing of experimentally-determined structures

Residue-level mapping between UniProt and PDB entries were obtained from the Structure Integration with Function, Taxonomy and Sequences108,109 (SIFTS). Based on the PDB structures, we constructed two undirected graphs \({G}_{1}=\left({V}_{1},{E}_{1}\right)\) and \({G}_{2}=\left({V}_{2},{E}_{2}\right)\). \({G}_{1}\) describes the physical contacts between residues within the same polypeptide chains, while \({G}_{2}\) describes the physical contacts between residues across different polypeptide chains. \({V}_{1}\) includes the UniProt residues covered by at least one PDB structure. \({E}_{1}\) is the set of residue pairs in the same polypeptide chains whose minimal three-dimensional (3D) distances among all relevant PDB structures are no larger than 6 Å. The 3D distance between two residues in a given PDB structure is defined as the euclidean distance between their closest atoms in that structure. \({E}_{2}\) is the set of inter-chain residue pairs whose minimal 3D distances are no larger than 9 Å. \({V}_{2}\) is the set of residues involved in \({E}_{2}\). \({G}_{1}\) and \({G}_{2}\) were added to the Human Protein Structurome.

Processing of 3D structural data from deep learning algorithms

Using the atomic-resolution 3D protein structures from AlphaFold DB, we constructed \({G}_{3}=\left({V}_{3},{E}_{3}\right)\) following the same procedures used for constructing \({G}_{1}\). Residues with all levels of model confidence in these structures were taken into account. \({G}_{3}\) was added to the Human Protein Structurome. For the interface residue data from PIONEER, we used “very high” confidence predictions. This dataset was also added to the Human Protein Structurome.

NetFlow3D: Identifying 3D clusters of mutated residues

3D cluster identification based on atomic-resolution 3D protein structures

(i) Intra-protein clusters: In-frame mutations were mapped to \({G}_{1}\) and \({G}_{3}\) respectively. Vertices affected by these mutations and the edges between them were extracted as subgraph \({g}_{1}\) and \({g}_{3}\). \({C}_{1}\) and \({C}_{3}\) are the sets of connected components in \({g}_{1}\) and \({g}_{3}\), which were considered intra-protein clusters identified based on PDB and AlphaFold DB structures, respectively. We removed the intra-protein clusters in \({C}_{3}\) that have at least one residue overlapping with any 3D clusters in \({C}_{1}\) to avoid reporting redundant 3D clusters. (ii) Inter-protein clusters: We obtained 119,526 high-quality binary PPIs for Homo Sapiens from HINT110 (http://hint.yulab.org/), a dataset released in August 2021. We restricted our focus to these PPIs (denoted as \(H\)) for the inter-protein 3D cluster identification. \({e}_{2}\) is a subset of edges in \({G}_{2}\) whose endpoints are both affected by in-frame mutations. For a PPI between protein \(A\) and \(B\), \({g}_{1A}=\left({v}_{1A},{e}_{1A}\right)\) and \({g}_{1B}=\left({v}_{1B},{e}_{1B}\right)\) are subgraphs extracted from \({g}_{1}\) incorporating the vertices and edges in \(A\) and \(B\), respectively. \({e}_{2{AB}}\) is a subset of \({e}_{2}\) where each edge connects one vertex in \({v}_{1A}\) and one vertex in \({v}_{1B}\). In the merged graph \({g}_{2{AB}}=\left({v}_{1A}\cup {v}_{1B},{e}_{1A}\cup {e}_{1B}\cup {e}_{2{AB}}\right)\), \({C}_{2{AB}}\) is the set of connected components having at least one edge in \({e}_{2{AB}}\). \({C}_{2}={\bigcup}_{\left\{A,B\right\}\in H}{C}_{2{AB}}\) represents all inter-protein clusters identified based on PDB structures. Overall, \({C}_{{{{\rm{structure}}}}}={C}_{1}\cup {C}_{2}\cup {C}_{3}\) represents the set of 3D clusters identified based on atomic-resolution 3D protein structures.

3D cluster identification based on PPI interface residue data from PIONEER

For a PPI between protein \(A\) and \(B\), in-frame mutations were mapped to the interface residues, and the set of mutated interface residues was defined as an inter-protein 3D cluster, denoted as \({C}_{4{AB}}\). \({C}_{{{{\rm{interface}}}}}=\left\{{C}_{4{AB}}|\left\{A,B\right\}\in H\right\}\) represents the set of inter-protein 3D clusters identified based on the PPI interface residue data from PIONEER.

NetFlow3D: Background mutability model

To accurately model the background mutation rate (BMR) that varies extensively across the genome, we used a model that includes five covariates of mutation tendency: mRNA expression level, DNA replication timing, chromatin status as indicated by HiC mapping, local GC content, and gene density. The fundamental concept of this model originated from MutSigCV41: each gene \(g\) was positioned in a high-dimensional covariate space, estimating its local BMR based on its own silent and noncoding mutations, and, if necessary, those of its closest neighbors in this covariate space. Here, \({x}_{g}^{{{{\rm{SNV}}}}}\) denotes the sum of silent and noncoding single nucleotide variants (SNVs) in gene \(g\) and its neighbors, and \({X}_{g}^{{{{\rm{SNV}}}}}\) represents the total number of sequenced bases where silent and noncoding SNVs can occur in gene \(g\) and its neighbors. Consequently, the local BMR of coding SNVs in gene \(g\) is calculated as:

$${{{{\rm{BMR}}}}}_{g}^{{{{\rm{SNV}}}}}=\frac{{x}_{g}^{{{{\rm{SNV}}}}}}{{X}_{g}^{{{{\rm{SNV}}}}}}$$
(1)

Similarly, \({x}_{g}^{{{{\rm{indel}}}}}\) accounts for insertions and deletions (indels) within gene \(g\) and its neighbors, and \({X}_{g}^{{{{\rm{indel}}}}}\) represents the total bases sequenced in the same regions. The local BMR for coding indels in gene \(g\) is calculated as:

$${{{{\rm{BMR}}}}}_{g}^{{{{\rm{indel}}}}}=\frac{{x}_{g}^{{{{\rm{indel}}}}}}{{X}_{g}^{{{{\rm{indel}}}}}}$$
(2)

To estimate the expected number of in-frame and LOF mutations in gene \(g\), we calculate the total number of covered bases in the coding region where mutation type \(t\) (\(t={{{\rm{missense}}}},{{{\rm{nonsense}}}},{{{\rm{splice\; site}}}}\)) can occur, denoted as \({N}_{g}^{t}\). One base may contribute fractionally to multiple mutation types. For example, a covered C base might count 2/3 toward missense and 1/3 toward nonsense if mutations to A or G change the amino acid, while a mutation to T creates a stop codon. The probability of a random SNV in gene \(g\) falling into mutation type \(t\) is calculated as:

$${A}_{g}^{t}=\frac{{N}_{g}^{t}}{{N}_{g}^{{{{\rm{coding}}}}}}\,(t={{{\rm{missense}}}},{{{\rm{nonsense}}}},{{{\rm{splice\; site}}}})$$
(3)

where \({N}_{g}^{{{{\rm{coding}}}}}\) is the coding length of gene \(g\) in base pairs. Given that \(\alpha=9\%\) (51,164 out of 56,031) of coding indels are in-frame and the rest are frameshift, we calculated the expected number of in-frame and LOF mutations in gene \(g\) as:

$${E}_{g}^{{{{\rm{in}}}}-{{{\rm{frame}}}}}={{A}_{g}^{{{{\rm{missense}}}}}\cdot {{{\rm{BMR}}}}}_{g}^{{{{\rm{SNV}}}}}+{\alpha \cdot {{{\rm{BMR}}}}}_{g}^{{{{\rm{indel}}}}}$$
(4)
$${E}_{g}^{{{{\rm{LOF}}}}}={\left({A}_{g}^{{{{\rm{nonsense}}}}}+{A}_{g}^{{{{\rm{splice\; site}}}}}\right)\cdot {{{\rm{BMR}}}}}_{g}^{{{{\rm{SNV}}}}}+{\left(1-\alpha \right) \cdot {{{\rm{BMR}}}}}_{g}^{{{{\rm{indel}}}}}$$
(5)

To avoid false positives due to exceedingly small local BMR in some genes, we set lower thresholds for \({E}_{g}^{{{{\rm{in}}}}-{{{\rm{frame}}}}}\) and \({E}_{g}^{{{{\rm{LOF}}}}}\) at the 0.01 quantile (1st percentile) of all \(\left\{{E}_{g}^{{{{\rm{in}}}}-{{{\rm{frame}}}}}\right\}\) and all \(\left\{{E}_{g}^{{{{\rm{LOF}}}}}\right\}\), respectively. For a UniProt entry \(u\), its expected number of in-frame and LOF mutations are calculated as:

$${E}_{u}^{{{{\rm{in}}}}-{{{\rm{frame}}}}}={\sum}_{g\in U}{E}_{g}^{{{{\rm{in}}}}-{{{\rm{frame}}}}}$$
(6)
$${E}_{u}^{{{{\rm{LOF}}}}}={\sum}_{g\in U}{E}_{g}^{{{{\rm{LOF}}}}}$$
(7)

where \(U\) is the set of genes encoding this UniProt entry \(u\). In cases where \({E}_{u}^{{{{\rm{in}}}}-{{{\rm{frame}}}}}\) (or \({E}_{u}^{{{{\rm{LOF}}}}}\)) is absent, we adopted the median \(\left\{{E}_{g}^{{{{\rm{in}}}}-{{{\rm{frame}}}}}\right\}\) (or \(\left\{{E}_{g}^{{{{\rm{LOF}}}}}\right\}\)) of all genes as default for that UniProt entry \(u\).

NetFlow3D: Determination of cluster significance

For an intra-protein 3D cluster \(C\) composed of \(k\) residues in UniProt entry \(u\), the expected number of in-frame mutations across \({n}_{p}\) patients in \(C\) is calculated as:

$${E}_{C}={E}_{u}^{{{{\rm{in}}}}-{{{\rm{frame}}}}}\cdot \frac{k}{{l}_{u}}\cdot {n}_{p}$$
(8)

\({l}_{u}\) is the length of UniProt entry \(u\) in amino acids.

For an inter-protein 3D cluster \(C\) spanning across the PPI interface of UniProt entry \(u\) and \(v\), incorporating \({k}_{u}\) residues in \(u\) and \({k}_{v}\) residues in \(v\), the expected number of in-frame mutations across \({n}_{p}\) patients in \(C\) is calculated as:

$${E}_{C}=\left({E}_{u}^{{{{\rm{in}}}}-{{{\rm{frame}}}}}\cdot \frac{{k}_{u}}{{l}_{u}}+{E}_{v}^{{{{\rm{in}}}}-{{{\rm{frame}}}}}\cdot \frac{{k}_{v}}{{l}_{v}}\right)\cdot {n}_{p}$$
(9)

\({O}_{C}\) denotes the observed number of in-frame mutations across \({n}_{p}\) patients in \(C\). The significance of \(C\) is determined by the one-sided p-value from Poisson test:

$${p}_{C}=P\left(X\ge {O}_{C}\right)=1-P\left(X < {O}_{C}\right)=1-{\sum }_{x=0}^{{O}_{C}-1}\frac{{{E}_{C}}^{x}}{x!}{e}^{-{E}_{C}}$$
(10)

The Poisson test was applied to all 3D clusters. Bonferroni correction was separately applied to the 3D clusters in \({C}_{{{{\rm{structure}}}}}\) and \({C}_{{{{\rm{interface}}}}}\). 3D clusters with adjusted \({p}_{C} < 0.05\) were considered significant. Additionally, we benchmarked these p-values via permutation tests (Supplementary Note 4), and observed a strong correlation between these p-values from NetFlow3D and those from permutation tests, with \({R}^{2}=0.75\) (Supplementary Fig. 14).

NetFlow3D: Protein-specific LOF enrichment signals

For a UniProt entry \(u\), the expected number of LOF mutations across \({n}_{p}\) patients is calculated as:

$${E}_{u}={E}_{u}^{{{{\rm{LOF}}}}}\cdot {n}_{p}$$
(11)

The significance of LOF enrichment in \(u\) is determined by the one-sided p-value from Poisson test:

$${p}_{u}=P\left(X\ge {O}_{u}\right)=1-P\left(X < {O}_{u}\right)=1-{\sum }_{x=0}^{{O}_{u}-1}\frac{{{E}_{u}}^{x}}{x!}{e}^{-{E}_{u}}$$
(12)

\({O}_{u}\) denotes the observed number of LOF mutations across \({n}_{p}\) patients in \(u\). Bonferroni correction was applied, and the UniProt entries with adjusted \({p}_{u} \, < \, 0.05\) were considered significantly enriched for LOF mutations.

NetFlow3D: Network propagation model

Construction of the PPI network

The initial PPI network was built out of the aforementioned high-quality binary human PPIs from HINT. We filtered this network to encompass only genes expressed in any of the input cancer types. The aforementioned 18 well-known cancer genes with low transcript detection levels were considered expressed. The resulting PPI network was represented by an undirected graph \({G}_{{{{\rm{PPI}}}}}=\left({V}_{{{{\rm{PPI}}}}},{E}_{{{{\rm{PPI}}}}}\right)\).

Heat definition

The initial amount of heat assigned to protein \(u\) was calculated as:

$${h}_{u}={h}_{u}^{{{{\rm{in}}}}-{{{\rm{frame}}}}}+{h}_{u}^{{{{\rm{LOF}}}}}$$
(13)

where

$${h}_{u}^{{{{\rm{in}}}}-{{{\rm{frame}}}}}=-{\log }_{10}\left(\min \left({p}_{C}{|C}\in {C}_{u}\right)\right)$$
(14)
$${h}_{u}^{{{{\rm{LOF}}}}}=-{\log }_{10}\left({p}_{u}\right)$$
(15)

\({C}_{u}\) denotes the set of 3D clusters that contain at least one residue in protein \(u\). Both \({h}_{u}^{{{{\rm{in}}}}-{{{\rm{frame}}}}}\) and \({h}_{u}^{{{{\rm{LOF}}}}}\) are constrained to a maximum value of 300 to prevent infinite heat scores that could have resulted from zero p-values. The initial heat distribution is described by a diagonal matrix \({D}_{h}\) where the \(\left(i,i\right)\) entry is the amount of heat placed on protein \(i\).

Heat transfer weight

At each time step, proteins in the PPI network pass to and receive heat from their neighbors, while retaining a fraction \(\beta\) of their heat. Notably, when a protein transfers its remaining \(1-\beta\) fraction of heat to its neighbors, the heat is unevenly distributed. The amount of heat transferred along the edge between protein \(i\) and \(j\) is proportional to the weighting factor defined as:

$${w}_{i,j}=-{\log }_{10}\left(\min \left({p}_{C}{|C}\in {C}_{i,j}\right)\right)+{w}_{0}$$
(16)

\({C}_{i,j}\) denotes the set of inter-protein 3D clusters that are specific to the PPI between protein \(i\) and \(j\). \({w}_{0}=1\) is a baseline value, ensuring no edge has zero weight. \({w}_{i,j}\) is also constrained to a maximum value of 300.

Heat diffusion and identification of interconnected modules

Once the initial heat assigned to each protein is determined, and the heat transfer weight along each edge is determined, the model is run until steady state is reached. If a unit of heat is placed on protein \(j\), the net heat transferred from protein \(j\) to protein \(i\) is given by the \(\left(i,j\right)\) entry of the diffusion matrix \({{{\boldsymbol{F}}}}\) defined by:

$${{{\boldsymbol{F}}}}={\beta \left({{{\boldsymbol{I}}}}-\left(1-\beta \right){{{\boldsymbol{W}}}}\right)}^{-1}$$
(17)

where

$${W}_{i,j}=\frac{{w}_{i,j}}{{\sum}_{k\in {Z}_{j}}{w}_{k,j}}$$
(18)

\({Z}_{j}\) refers to the neighbors of protein \(j\). The initial heat distribution is described by a diagonal matrix \({{{{\boldsymbol{D}}}}}_{{{{\boldsymbol{h}}}}}\) where the \(\left(i,i\right)\) entry is the amount of heat placed on protein \(i\). The exchanged heat matrix \({{{\boldsymbol{E}}}}\) is then defined by:

$${{{\boldsymbol{E}}}}={{{\boldsymbol{F}}}}{{{{\boldsymbol{D}}}}}_{{{{\boldsymbol{h}}}}}$$
(19)

\(E(i,j)\) is the net heat transferred from protein \(j\) to protein \(i\), given the initial heat \({h}_{j}\) at protein \(j\). We constructed a weighted directed graph based on \({{{\boldsymbol{E}}}}\): If \(E\left(i,j\right) \, > \, \delta\), there was a directed edge from protein \(j\) to protein \(i\) of weight \(E\left(i,j\right)\). We then identified strongly connected components in this graph, which we term “interconnected modules”. A strongly connected component \(C\) in a directed graph is a set of vertices such that for every pair \(u,v\) of vertices in \(C\) there is a directed path from \(u\) to \(v\) and a directed path from \(v\) to \(u\). Leiserson et al.16 have demonstrated that the identification of strongly connected components within a directed graph substantially reduced reporting “star graphs”, which are centered around well-studied, highly mutated cancer proteins, but include surrounding proteins with few mutations and little biological relevance. Our method strictly aligns with this principle, ensuring that the identified “interconnected modules” do not present with any one-way configurations. Thus, proteins in our “interconnected modules” are not merely passive recipients of influence from others’ 3D mutation clusters but also act as significant sources of influence.

Parameter determination

(i) Insulating parameter \(\beta\): We used \(\beta=0.5\), as used by HotNet216. (ii) Edge weight parameter: The rationale behind selecting a \(\delta\) is based on the fact that randomized data will typically not yield large “interconnected modules”. Therefore, choosing an appropriate value of \(\delta\) can help identify “interconnected modules” that are statistically significant and likely to be biologically relevant. To generate a random undirected graph \({G}_{{{{\rm{random}}}}}\), we randomly swapped \(\left|{E}_{{{{\rm{PPI}}}}}\right|\) edge pairs in \({G}_{{{{\rm{PPI}}}}}\) while keeping the initial amount of heat on each protein fixed. The weighting factors \(\left\{{w}_{i,j}\right\}\) were then randomly assigned to the newly swapped edges. Edge swapping was used to maintain the degree of each protein constant during the randomization process. We then applied the aforementioned heat diffusion model to \({G}_{{{{\rm{random}}}}}\) and identified the minimum \(\delta\) such that all “interconnected modules” had size \(\le 5\). We generated 20 such random directed graphs and identified a \(\delta\) for each of them. We used the smallest value among these \(\delta\)’s as the final value of \(\delta\). “Interconnected modules” exceeding a size of 5 were deemed significant, termed as “significantly interconnected modules” in our study.

Implementing state-of-the-art 3D clustering algorithms

We applied four state-of-the-art 3D clustering algorithms to our pre-processed TCGA and COSMIC dataset, namely HotSpot3D9, 3D hotspot11, HOTMAPS13, and PSCAN10. Given that these approaches compiled protein structures from different resources but they all used PDB, we restricted the focus to the 3D clusters identified based on PDB structures to make fair comparisons. For all four algorithms, default parameters were used if not specified. We applied HotSpot3D to our mutation data through the HotSpot3D web server111. For PSCAN, we tested both the mean and variance of the genetic effects within each scan window using the PSCAN R package, ultimately plotting the performance curve using the variance test results because they yielded a much better performance curve than the mean test. PSCAN input files were generated from SCORE-Seq112 as suggested. For the mutations in each gene, SCORE-Seq was applied to the corresponding genotypes in the affected tumor samples and their matched normal samples. When specifying SCORE-Seq parameters, we set the minor allele frequency (MAF) upper bound to 1 and minor allele count (MAC) lower bound to 0 to include all mutations. For HotMAPS, Tippett’s method was employed to aggregate the p-values of hotspot residues within each 3D cluster.

Standard PPI network analyses

The initial amount of heat placed on each protein in \({G}_{{{{\rm{PPI}}}}}\) was determined by the number of tumor samples where the protein had mutation(s). Both in-frame and LOF mutations were accounted for. The weighting factors \(\left\{{w}_{i,j}\right\}\) were all set to 1. The remaining settings were identical to those used in the heat diffusion model described earlier.

Compiling catalytic residues

Catalytic residues were obtained from M-CSA52 (https://www.ebi.ac.uk/thornton-srv/m-csa/). We used the dataset that incorporates both the manually curated catalytic residues and their sequence homologs.

Compiling benchmark gene sets

Known cancer genes

A list of 738 known cancer genes (tier 1 + tier 2) was obtained from CGC47,48 (https://cancer.sanger.ac.uk/census, 10/4/2023 release).

Non-cancer-associated genes

Non-cancer-associated genes were compiled from three sources: (i) 1,297 genes from Reva et al.49 in their category iv, i.e., genes with no functional mutations and no available associations with cancer; (ii) 129 genes annotated as “nonfunctional” by Saito et al.50, including genes frequently affected by passenger hotspot mutations and olfactory genes; (iii) 194 genes confidently under neutral selection in human cancers identified by Hess et al.51. By combining these three datasets we got a total of 1574 unique genes, from which we removed 47 genes in CGC. The remaining 1,527 genes were considered non-cancer-associated.

Compiling well-known cancer pathways and GO biological processes

Cancer signaling pathways

32 manually curated cancer signaling pathways were obtained from NetSlim61 (http://www.netpath.org/netslim). Specifically, we extracted DataNode from the GPML file of each pathway, from which we excluded DataNode without type or whose type is “Metabolite” or “Complex”.

General biological processes

7,530 Gene Ontology (GO) biological processes were obtained from the Molecular Signatures Database (MSigDB) C5 collection56,57,58,59,60 (http://www.gsea-msigdb.org/gsea/msigdb). We excluded GO biological processes that did not contain any protein-coding genes.

Consistency with established biological processes

For every significantly interconnected module identified by NetFlow3D, we assessed the overlap between the module’s proteins and the genes of each GO BP, computing a Jaccard similarity coefficient. The alignment of a NetFlow3D-identified significantly interconnected module with established biological processes is determined by the highest Jaccard similarity coefficient between this module and any GO BPs. This criterion was also employed for evaluating randomly selected connected components and random groups of protein with significant 3D clusters.

Mutation pattern analysis

Mutation enrichment was determined by the ratio of observed fraction of mutations over the relative length of proteins within each NetFlow3D module, well-known cancer pathway or GO BP. Relative length is defined as the sum of protein sequence lengths within each set divided by the total sequence length of all proteins with in-frame mutations.

Definition of NetFlow3D-identified potential driver mutations

Within the significantly interconnected modules identified by NetFlow3D, we identified potential driver mutations: For each protein, we identified its most significant 3D cluster that surpasses the significance threshold and consider mutations within this cluster as potential driver mutations. This aligns with how we determine the initial heat score at each node. For each PPI, we identified the most significant 3D cluster that surpasses the significance threshold at its binding interface. The mutations within this cluster are designated as potential driver mutations. This aligns with how we determine heat propagation weight along each edge.

Survival analysis

Patient clinical data were obtained from the TCGA Pan-Cancer Clinical Data Resource113 (TCGA-CDR). Patients without valid tumor status were excluded from the analysis. The overall survival (OS) data was used as the clinical outcome endpoint. Our analysis focused on the patients with in-frame mutations in our preprocessed TCGA pan-cancer dataset. We compared the overall survival between patients grouped by whether their mutations were identified as potential driver mutations by NetFlow3D. Kaplan-Meier estimation was used to generate survival curves for both groups. Cox regression was used to evaluate the statistical association between the presence of NetFlow3D-identified mutations and OS, with tumor stage, age, sex, and tumor mutation burden (TMB) included as covariates. For brain lower grade glioma (LGG), tumor stage was excluded from the Cox regression analysis due to unavailable data. The regression coefficients of NetFlow3D-identified mutations indicate their impact on hazard, with their exponential values representing the hazard ratio (HR) and p-values indicating the significance of the association.

Cloning and plasmid construction

Single-colony-derived mutant clones were constructed using Clone-seq43. Wild-type PPP2R1A clones, sourced from the hORFeome v8.1 collection114, were used as the template for site-directed mutagenesis conducted by Eurofins Scientific. Mutagenesis of the PPP2R1A c.772 C > T (p.Arg258Cys) mutation was performed at 96-well scales using site-specific mutagenesis primers and full-length human ORF templates. Primers for mutagenesis were designed using the webtool http://primer.yulab.org, and a list of all primers used in this study is provided in Supplementary Data 13. PCR product was digested overnight using DpnI (NEB) without a ligation step to maximize throughput and then transformed directly into competent cells to isolate single colonies. Then, 4 colonies per mutagenesis reaction were hand-picked and arrayed into 96-well plates. After 21 h incubation at 37 °C, glycerol stocks were generated and then clones were pooled into 4 respective bacterial pools. Maxiprepped DNAs from each of the 4 pools were then combined through multiplexing (NEBNext) and then sequenced in a single 1 × 100 single-end Illumina HiSeq run. Properly mutated clones were then identified by next-generation sequencing analysis and recovered from single-colony glycerol stocks. We employed the Gateway cloning technology to insert the PPP2R1A or its mutant form into the pHAGE-CMV-GAW-3xFlag-IRES-PURO vector, and PPP2R3A into the pHAGE-CMV-GAW-3xMyc-IRES-PURO vector for subsequent analyses.

Affinity purification

HEK293T cells (Catalog Number: CRL-3216) were obtained from ATCC. HEK 293T cells were maintained in DMEM medium supplemented 10% Fetal Bovine Serum. 8 µg of PPP2R1A or PPP2R1A c.772 C > T were transfected into the cells with 40 µl of 1 mg/ml-1 PEI (Polysciences, 23966) and 1.2 ml OptiMEM (Gibco, 31085-062). After 48 hrs incubation, cells were washed three times in 10 ml DPBS (VWR, 14190144), resuspended in 500 µl of RIPA buffer (50 mM Tris pH7.5, 150 Mm NaCl, 5 mM EDTA, 1.0% NP-40, 0.25% Sodium Deoxycholate) and incubated on the ice for 30 min. The whole lysate is subjected to 120 s of 40% amplitude sonication using a sonifier cell disruptor (BRANSON,500-220-180). Centrifugation was used for 15 min at 16,100 g and 4 °C to separate the extracts. 500 µl of cell lysate per sample reaction was incubated with 15 µl of EZ view Red Anti-FLAG M2 Affinity Gel (Sigma, F2426) at 4 °C overnight using a nutator in order to facilitate co-immunoprecipitation. Following incubation, bound proteins were eluted in 200 µl of elution solution (10 mM Tris-Cl pH 8.0, 1% SDS) at 65 °C for 15 min after being washed three times in cell RIPA buffer.

Cell culture, co-immunoprecipitation and western blotting

HEK293T cells were cultured in 10 cm plates until they reached 40-50% confluency. 4 µg of bait construct (PPP2R1A or PPP2R1A c.772 C > T), 4 µg of prey construct (PPP2R3A), 40 µl of 1 mg/ml-1 PEI (Polysciences, 23966), and 1.2 ml of OptiMEM (Gibco, 31085-062) were used to transfect the cells. After 48 hrs incubation, cells were washed three times in 10 ml DPBS (VWR, 14190144), resuspended in 500 µl RIPA buffer (50 mM Tris pH7.5, 150Mm NaCl, 5 mM EDTA, 1.0% NP-40, 0.25% Sodium Deoxycholate) and incubated on the ice for 30 min. Whole lysate is sonicated on a sonifier cell disruptor (BRANSON,500-220-180) for 120 s at 40% amplitude. Extracts were cleared by centrifugation for 15 min at 16,100 g at 4 °C. 500 µl of cell lysate per sample reaction was incubated with 15 µl of EZ view Red Anti-FLAG M2 Affinity Gel (Sigma, F2426) at 4 °C overnight using a nutator. After incubation, bound proteins were eluted in 200 µl of elution solution (10 mM Tris-Cl pH 8.0, 1% SDS) at 65 °C for 15 min after being washed three times in cell RIPA buffer. Following an 8% SDS-PAGE gel run on FLAG-co-purified samples, the proteins were transferred to PVDF membranes. Anti-FLAG (Sigma, F1804, M2), and Anti-MYC (Invitrogen, 13-2500, 9E10) at both 1:5000 dilutions were used for immunoblotting analysis. Uncropped and unprocessed scans are supplied in the Source Data40 file.

Proteomic sample preparation

IP eluates were subjected to reduction with 200 mM TCEP for 1 h at 55 °C. Subsequently, alkylation was performed for 30 min at room temperature in darkness using 375 mM iodoacetamide. The samples were then digested using Trypsin Gold, mass spectrometry grade (catalog no. V5280; Promega), at an enzyme-to-substrate ratio of 1:100. The samples were incubated overnight at 37 °C. Following this, the concentrations of peptides were quantified using the Pierce Quantitative Colorimetric Peptide Assay (catalog no. 23275; Thermo Scientific). For TMT tests, samples were resuspended and normalized using 1 M triethylammonium bicarbonate (catalog no. 90114; Thermo Scientific). Samples were labeled using TMT10plex Isobaric Mass Tagging Kit (catalog no. 90113; Thermo Scientific) at a (w/w) label-to-peptide ratio of 20:1 for 1 h at room temperature. Labeling reactions were quenched by the addition of 5% hydroxylamine for 15 min and pooled and dried using a SpeedVac. Labeled peptides were enriched and fractionated using Pierce High pH Reversed-Phase Peptide Fractionation Kit according to the manufacturer’s protocol (catalog no. 84868; Thermo Scientific). Liquid chromatography–tandem mass spectrometry Fractions were analyzed using an EASY-nLC 1200 System (catalog no. LC140; Thermo Scientific) equipped with an in-house 3 μm C18 resin-(Michrom BioResources) packed capillary column (125 μm × 25 cm) coupled to an Orbitrap Fusion Lumos Tribrid Mass Spectrometer (catalog no. IQLAAEGAAPFADBMBHQ; Thermo Scientific). The mobile phase and elution gradient used for peptide separation were as follows: 0.1% formic acid in water as buffer A and 0.1% formic acid in 80% acetonitrile as buffer B; 0–5 min, 5%-8% B; 5–65 min, 8–45% B; 65–66 min, 45%–95% B; 66–80 min, 95% B; with a flow rate set to 300 nl min-1. MS1 precursors were detected at m/z = 375–1500 and resolution = 120,000. A CID-MS2-HCD-MS3 method was used for MSn data acquisition. Precursor ions with charge of 2+ to 7+ were selected for MS2 analysis at resolution = 50,000, isolation width = 0.7 m/z, maximum injection time = 50 ms and CID collision energy at 35%. 6 SPS precursors were selected for MS3 analysis and ions were fragmented using HCD collision energy at 65%. Spectra were recorded using Thermo Xcalibur Software v.4.4 (catalog no. OPTON-30965; Thermo Scientific) and Tune application v.3.4 (Thermo Scientific). Raw data were searched using Proteome Discoverer Software 2.3 (Thermo Scientific) against an UniProtKB human database.

Downstream proteomic analysis

We employed our computational tool Magma115 to analyze mass spectrometry proteomics data. Magma quantifies the differences in protein abundance between two experimental conditions by calculating fold-change (FC) and p-values. By comparing each bait protein (PPP2R1A or PPP2R1A c.772 C > T) against untransfected HEK293T cells, we identified the bait protein’s interactors using criteria of fold change (FC) > 2, adjusted p-value < 0.05, and peptide-spectrum matches (PSM) > 10. Our analysis was then narrowed to the combined set of interactors for both PPP2R1A and its mutant form PPP2R1A c.772 C > T. To elucidate the specific effects of the c.772 C > T mutation, we generated a volcano plot using the FC and adjusted p-values derived from the comparison between the mutant variant PPP2R1A c.772 C > T and the wild-type PPP2R1A. Known contaminants in AP-MS experiments, including keratin (KRT), myosins (MYO), small ribosomal subunit proteins (RPS), heat shock-related 70 kDa proteins (HSPA), and large ribosomal subunit proteins (RPL), were excluded from the analysis.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.