- Review
- Open access
Algorithms and tools for data-driven omics integration to achieve multilayer biological insights: a narrative review
Journal of Translational Medicine volume 23, Article number: 425 (2025)
Abstract
Systems biology is a holistic approach to biological sciences that combines experimental and computational strategies, aimed at integrating information from different scales of biological processes to unravel pathophysiological mechanisms and behaviours. In this scenario, high-throughput technologies have played a major role in providing huge amounts of omics data, whose integration offers unprecedented possibilities for gaining insights into diseases and identifying potential biomarkers. In the present review, we focus on strategies that have been applied in the literature to integrate genomics, transcriptomics, proteomics, and metabolomics between 2018 and 2024. Integration approaches were divided into three main categories: statistical-based approaches, multivariate methods, and machine learning/artificial intelligence techniques. Among them, statistical approaches (mainly based on correlation) were slightly more prevalent, followed by multivariate approaches and machine learning techniques. Integrating multiple biological layers has shown great potential in uncovering molecular mechanisms, identifying putative biomarkers, and aiding classification, in most cases performing better than single-omics analyses. However, significant challenges remain. The high-throughput nature of omics platforms introduces issues such as variable data quality, missing values, collinearity, and high dimensionality. These challenges grow when multiple omics datasets are combined, as the complexity and heterogeneity of the data increase with integration. We report different strategies found in the literature to cope with these challenges, but some open issues remain and should be addressed to unlock the full potential of omics integration.
Background
Systems biology is an interdisciplinary approach to science that tackles the complexity of biological structures through a comprehensive study of the different molecular layers of living systems [1]. Even though no single definition of systems biology has yet been agreed upon, it is often characterized by the use of computational and mathematical modelling to analyse interactions between diverse components of biological systems [2]. Research in this field often focuses on networks of genes, proteins, or metabolites to investigate the ‘omics cascade’ [3]. This cascade represents the sequential flow of biological information, where genes encode the potential phenotypic traits of an organism, but the regulation of proteins and metabolites is further influenced by physiological or pathological stimuli [4], as well as environmental factors such as diet, lifestyle, pollutants and toxic agents [5, 6]. This multilayered regulation makes biological systems challenging to disentangle into their individual components (Fig. 1A). By examining variations at different levels of biological regulation [7], researchers can deepen their understanding of pathophysiological processes and the interplay between omics layers.
Fig. 1. A Schematic overview of the systems biology approach. Environmental factors influence the so-called omics cascade, constituted by genes, transcripts, proteins, and metabolites. Through appropriate experimental setups, omics data can be investigated and assessed by the integrative methods discussed in this review. B Graphic representation of a typical omics data matrix. Rows represent samples, columns represent features. Different submatrix colours represent the phenotype of a group of samples. Blank cells represent missing values
Omics integration offers unprecedented possibilities to unravel biological functions, interpret diseases, identify biomarkers, and uncover hidden associations among omics variables [8,9,10]. As a result, it has become a cornerstone of modern biological research, driven by the development of advanced tools and strategies. However, the term ‘omics integration’ encompasses a wide spectrum of methodological approaches. An important distinction is the level at which integration occurs. In some cases, each omics dataset is analyzed independently, with individual findings being combined for biological interpretation. Alternatively, all datasets may be analyzed simultaneously, typically by assessing the relationships between them or by combining the omics matrices together. Another consideration is whether the integration process is driven by existing knowledge, such as known molecular interactions or biological pathways, or if it is driven entirely by the data itself.
In this review, we address data-driven omics integration, defined as integration strategies that are not driven by prior biological insights. In contrast to previous reviews, which have explored potential tools available in the literature, we exclusively examined methods that have actually been used for integration purposes. This approach provides a comprehensive perspective on practical applications and current trends in the field of omics integration. We have categorized the integration strategies into three main groups: statistical-based methods, multivariate methods, and machine learning and artificial intelligence. We also explored the major challenges that researchers encounter in omics integration and highlighted strategies that have been employed to address these issues. Omics data are assumed to be matrices in which each row represents a sample (also referred to as an individual or patient throughout the paper) and each column represents an omics feature, such as a transcript, protein, or metabolite. Samples are divided into different groups, which usually correspond to different phenotypes (e.g. disease versus control) (Fig. 1B).
Data-driven approaches to integrate proteomics data
A comprehensive search of the PubMed electronic database was conducted using the keywords detailed in the Supplementary Materials to identify studies on omics integration published between 2018 and 2024. The included studies used data-driven methods, such as statistical methods, multivariate analyses, or machine learning/artificial intelligence models, to analyze omics data without relying on prior knowledge of biological relationships. Approaches incorporating external knowledge, such as interactome or pathway databases (e.g., GO or KEGG), as well as hybrid strategies combining data-driven and knowledge-based methods, were excluded from this review. A detailed workflow of the decision process for inclusion in this review is depicted in Fig. 2A. We followed most of the PRISMA 2020 checklist [11] to define the inclusion/exclusion criteria and structure the paper. The review includes 64 research papers, with their number and proportion relative to the total retrieved papers per year shown in Fig. 2B. Figure 2C illustrates the number of papers retrieved for each omics combination, highlighting the proportion of papers employing statistical, multivariate and ML/AI methods. Table 1 presents an overview of the main packages employed, accompanied by their respective references.
Fig. 2. A Descriptive workflow of the decisional process for inclusion in the present review. The workflow is divided into four main sequential steps: retrieval of all the papers, first screening, eligibility assessment and final decision on inclusion. B Histogram representing the total number of papers retrieved by the search string (teal bars) and the number of included papers (orange bars) per year. C Upset plot representing the number of papers retrieved for each omics combination. The proportion of employed methods is also depicted
Statistical and correlation-based methods
Correlation is a statistical measure that quantifies the degree to which two variables are related to each other. A straightforward approach to assessing the relationship between two omics datasets involves visualizing their correlation and computing the correlation coefficient and its statistical significance. For instance, a simple scatterplot can facilitate the analysis of expression patterns, leading to the identification of consistent or divergent trends [24, 25]. In Zheng et al. [24], the scatterplot was divided into four regions associated with different colors: the red area indicated higher transcription efficiency rates, the green area represented lower transcription efficiency rates, and the gray regions highlighted protein–transcript pairs with consistent expression patterns. Similarly, in Gao et al. [25], the transcript-to-protein ratios were investigated in scatterplot quadrants representing discordant or concordant up- or down-regulation of genes. Pearson’s or Spearman’s correlation analysis, or generalizations such as the RV coefficient (a multivariate generalization of the squared Pearson correlation coefficient), were employed to test correlations between whole sets of differentially expressed genes in different biological contexts [26,27,28,29,30,31,32,33,34,35,36]. Computing the correlation coefficient yields different biological insights, including the determination of the extent and nature of the interaction between sets of differentially expressed proteins/metabolites [26], the assessment of whether up-regulated proteins exhibit a significant correlation with abundantly increased metabolites and vice versa [26, 32], the identification of molecular regulatory pathways of correlated genes and proteins [27], or the assessment of transcript-protein correspondence [28, 30, 31, 34,35,36].
Pearson’s correlation analysis has also proved effective in identifying a time delay between the release of the mRNA molecule and the production and secretion of the protein, as outlined in [29]. In the study described in [37], Spearman’s correlation coefficient was computed to integrate three omics datasets (transcriptomics, proteomics, and metabolomics). Cutoff thresholds were defined on the correlation coefficient and the p-value (0.9 and 0.05, respectively) for the pairwise correlations between differentially expressed proteins (DEPs) and differential metabolites; between differentially expressed genes (DEGs) and differentially expressed miRNAs; and between DEPs and DEGs. The objective of this approach was to identify the major relationships between the three platforms by visualizing the first 100 correlations. In another case [34], Pearson’s correlation analysis was complemented with Procrustes analysis, a form of statistical shape analysis. Procrustes analysis aligns datasets through scaling, rotation, and translation of the data in a common coordinate space to assess their geometric similarity and correspondence.
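As an illustration of the RV coefficient mentioned in this section, the following minimal sketch computes it for two omics blocks measured on the same samples. It assumes the common definition based on the sample cross-product matrices of the column-centred blocks; the function name `rv_coefficient` and the synthetic data are hypothetical and are not taken from any of the reviewed studies.

```python
import numpy as np

def rv_coefficient(X, Y):
    """RV coefficient between two blocks that share the same samples (rows)."""
    Xc = X - X.mean(axis=0)          # column-centre each block
    Yc = Y - Y.mean(axis=0)
    Sx = Xc @ Xc.T                   # n x n sample cross-product matrices
    Sy = Yc @ Yc.T
    num = np.trace(Sx @ Sy)
    den = np.sqrt(np.trace(Sx @ Sx) * np.trace(Sy @ Sy))
    return num / den

# Synthetic data, for illustration only (assumption, not real omics data).
rng = np.random.default_rng(0)
transcripts = rng.normal(size=(10, 50))   # 10 samples x 50 transcripts
proteins = rng.normal(size=(10, 30))      # same 10 samples x 30 proteins
rv = rv_coefficient(transcripts, proteins)
```

The coefficient lies between 0 and 1, and a block compared with a rescaled copy of itself gives a value of 1, which is why it is read as a multivariate analogue of a squared correlation.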
Correlation networks are a widely employed application of correlation that extends pairwise analysis by transforming the associations into graphical representations. In such networks, nodes represent individual biological entities (e.g., genes, proteins, or metabolites), and edges are constructed based on correlation thresholds, typically determined by metrics such as R² or p-value. This methodological framework facilitates the visualization and analysis of complex relationships within and between datasets, thereby enabling the identification of highly interconnected components and their roles within biological systems. In Gong et al. [38], edges were retained according to specific thresholds on R² and p-values to construct a multi-omics co-expression network. This network was then integrated with a cancer-related network to enrich the analysis with known interactions in cancer-related pathways, facilitating a deeper understanding of the molecular interactions involved in cancer biology.
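The edge-construction step described above can be sketched as follows. This is a minimal example under stated assumptions: only a threshold on R² is applied (a p-value filter would need an additional statistical test), and the function name `cross_omics_edges`, the feature names, and the toy data are hypothetical.

```python
import numpy as np

def cross_omics_edges(X, Y, x_names, y_names, r2_min=0.8):
    """Return (feature_x, feature_y, r) for cross-block pairs with r^2 >= r2_min."""
    p = X.shape[1]
    # Correlate all columns of both blocks, then keep the X-vs-Y sub-block.
    r = np.corrcoef(np.hstack([X, Y]), rowvar=False)[:p, p:]
    edges = []
    for i, j in zip(*np.where(r ** 2 >= r2_min)):
        edges.append((x_names[i], y_names[j], float(r[i, j])))
    return edges

# Toy data: 6 samples, 2 genes and 2 metabolites (illustrative values only).
genes = np.array([[1, 2], [2, 1], [3, 4], [4, 3], [5, 6], [6, 5]], dtype=float)
metabolites = np.column_stack([2 * genes[:, 0] + 1, [1, -1, 1, -1, 1, -1]])
edges = cross_omics_edges(genes, metabolites,
                          ["gene_A", "gene_B"], ["metab_A", "metab_B"])
```

Each returned tuple corresponds to one edge of a bipartite correlation network; nodes are the features and the stored coefficient can serve as the edge weight.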
A further step in correlation networks is Weighted Gene Co-expression Network Analysis (WGCNA) [14]. This method is employed to identify clusters of co-expressed, highly correlated genes, which are referred to as modules. By constructing a scale-free network, WGCNA assigns weights to gene interactions, emphasizing strong correlations while reducing the impact of weaker or spurious connections. These modules can be summarized by their module eigengenes, which are frequently linked to clinically relevant traits, thereby facilitating the identification of functional relationships. In Ding et al. [39], WGCNA was conducted separately on the joint transcriptomics/proteomics and metabolomics datasets, and correlation was computed to uncover associations between gene/protein and metabolite modules.
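Two building blocks of this kind of analysis can be sketched in a few lines: the soft-thresholded (weighted) adjacency and the summary of a module by its first principal component. This is an illustrative sketch in Python, not the WGCNA R package itself; the power `beta=6`, the function names, and the random data are assumptions, and the module detection step (clustering of the network) is omitted.

```python
import numpy as np

def soft_threshold_adjacency(X, beta=6):
    """Unsigned weighted adjacency: |correlation| raised to the power beta."""
    r = np.corrcoef(X, rowvar=False)
    return np.abs(r) ** beta         # strong correlations kept, weak ones shrunk

def module_eigengene(X_module):
    """First principal component (sample scores) of a module's standardized features."""
    Z = (X_module - X_module.mean(axis=0)) / X_module.std(axis=0, ddof=1)
    U, s, Vt = np.linalg.svd(Z, full_matrices=False)
    eigengene = U[:, 0]
    # Orient the component so that it follows the module's average profile.
    if np.dot(eigengene, Z.mean(axis=1)) < 0:
        eigengene = -eigengene
    return eigengene

rng = np.random.default_rng(1)                  # synthetic data, illustration only
expression = rng.normal(size=(12, 8))           # 12 samples x 8 genes
adjacency = soft_threshold_adjacency(expression)
eigengene = module_eigengene(expression[:, :4])  # pretend genes 0-3 form a module
```

The resulting eigengene has one value per sample and can be correlated with clinical traits or with the eigengenes of modules built from another omics layer.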
Another approach that we have found in the literature is xMWAS [13]. xMWAS is an online tool developed in R that performs correlation and multivariate analyses. It performs a pairwise association analysis on omics data organized in matrices. The correlation coefficients are determined by combining Partial Least Squares (PLS) components and regression coefficients. Subsequently, the obtained coefficients are employed to generate a multi-data integrative network graph. Correlation networks are created by connecting the nodes whose edges meet the requirements in terms of association score and statistical significance. Clusters of highly interconnected nodes, known as communities, can be identified by means of the multilevel community detection method [40], which consists of two iteratively repeated phases. In the first phase of the algorithm, every single network node i is considered, together with its neighbours j. A measure of how well the network is divided into communities, called modularity, is employed to assess the extent to which nodes within a module exhibit higher levels of connectivity with each other compared to those outside the module. The change in this metric is computed by removing node i from its community and assigning it to the community of node j. If the gain in modularity is positive, node i is moved to the community that yields the maximum gain. Conversely, if there is no gain in modularity, it remains in its original community. When a local maximum of modularity is reached, the second phase of the algorithm begins. In this phase, a new network is constructed, with nodes representing the communities identified during the first phase. Then, the same algorithm employed in the first phase is applied to the resulting network, and this whole process is iterated until the maximum modularity is reached.
The xMWAS method was able to uncover omics interconnections by identifying the biological pathways associated with highly correlated communities in the following studies [41, 42]. In a recent study by Na and colleagues [43], the integration of multiple omics with xMWAS was successful in identifying a clear pathophysiological pathway that had not been identified in the single-omics analysis.
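The first (local moving) phase of the multilevel community detection method described above can be sketched as follows. This is a simplified illustration for an unweighted graph: modularity is recomputed from scratch for every trial move, which is inefficient but easy to follow, and the second (aggregation) phase is omitted. The function names and the toy graph are hypothetical.

```python
def modularity(edges, community):
    """Modularity of a partition of an unweighted, undirected graph."""
    m = len(edges)
    degree, internal = {}, {}
    for u, v in edges:
        for node in (u, v):
            c = community[node]
            degree[c] = degree.get(c, 0) + 1          # total degree per community
        if community[u] == community[v]:
            c = community[u]
            internal[c] = internal.get(c, 0) + 1      # edges inside the community
    return sum(internal.get(c, 0) / m - (d / (2 * m)) ** 2
               for c, d in degree.items())

def local_moving(edges, nodes):
    """Phase 1: move each node to the neighbouring community with the best gain."""
    neighbours = {n: set() for n in nodes}
    for u, v in edges:
        neighbours[u].add(v)
        neighbours[v].add(u)
    community = {n: n for n in nodes}                 # start with singletons
    moved = True
    while moved:                                      # repeat until a local maximum
        moved = False
        for i in nodes:
            current_q = modularity(edges, community)
            best_gain, best_comm = 1e-12, community[i]
            candidates = {community[j] for j in neighbours[i]} - {community[i]}
            for c in sorted(candidates):
                trial = dict(community)
                trial[i] = c
                gain = modularity(edges, trial) - current_q
                if gain > best_gain:
                    best_gain, best_comm = gain, c
            if best_comm != community[i]:
                community[i] = best_comm
                moved = True
    return community

# Toy graph: two triangles joined by a single edge.
nodes = [0, 1, 2, 3, 4, 5]
edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]
community = local_moving(edges, nodes)
```

On this toy graph the two triangles end up as separate communities. In the full method, each community would then be collapsed into a single node and the same procedure repeated on the coarser network.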
Canonical Correlation Analysis (CCA) and its variant for high-dimensional or multicollinear data, known as Regularised Canonical Correlation Analysis (rCCA) [44], are two integrative dimensionality reduction methods that highlight correlations between two omics datasets. CCA and rCCA are included in the Bioconductor R package mixOmics [12]. The CCA strategy involves the calculation of canonical variates, defined as linear combinations of variables from each dataset. Each pair of canonical variates is associated with a canonical correlation value, which represents the correlation between the two new components. rCCA is the regularised counterpart of CCA, and it must be employed when the total number of variables from both datasets is much larger than the number of samples. rCCA adds an l2 penalty, also known as the ridge penalty, to the diagonals of the covariance matrices of the omics datasets, thereby rendering them invertible. This approach effectively overcomes the collinearity issues inherent to standard CCA. rCCA can be used to create relevance networks, in which only pairs of variables belonging to different datasets are connected. These networks reveal relationships between omics variables and can be enriched with biological insights. In the literature, rCCA has been employed to identify the most significant correlations through relevance networks [36, 45] and to identify nodes with high connectivity, as they might indicate a key role in the disease [46].
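The effect of the ridge penalty can be illustrated with a minimal NumPy sketch. It is not the mixOmics implementation: the function name `rcca`, the penalty values, and the synthetic data are assumptions chosen for illustration. The penalties are added to the diagonals of the within-block covariance matrices, which makes them invertible even when there are more variables than samples.

```python
import numpy as np

def rcca(X, Y, lam_x=0.1, lam_y=0.1, n_comp=2):
    """Ridge-regularised CCA: returns canonical correlations and variates."""
    n = X.shape[0]
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    # Ridge penalty on the diagonals of the within-block covariance matrices.
    Cxx = Xc.T @ Xc / (n - 1) + lam_x * np.eye(X.shape[1])
    Cyy = Yc.T @ Yc / (n - 1) + lam_y * np.eye(Y.shape[1])
    Cxy = Xc.T @ Yc / (n - 1)
    Lx = np.linalg.cholesky(Cxx)
    Ly = np.linalg.cholesky(Cyy)
    # Whitened cross-covariance: Lx^{-1} Cxy Ly^{-T}.
    K = np.linalg.solve(Ly, np.linalg.solve(Lx, Cxy).T).T
    U, s, Vt = np.linalg.svd(K, full_matrices=False)
    a = np.linalg.solve(Lx.T, U[:, :n_comp])       # canonical weights for X
    b = np.linalg.solve(Ly.T, Vt.T[:, :n_comp])    # canonical weights for Y
    return s[:n_comp], Xc @ a, Yc @ b

rng = np.random.default_rng(2)           # synthetic data, illustration only
X = rng.normal(size=(8, 20))             # 8 samples, 20 variables (p > n)
Y = rng.normal(size=(8, 15))             # same 8 samples, 15 variables
corrs, x_variates, y_variates = rcca(X, Y)
```

Without the penalty terms the Cholesky factorisation would fail here, because both covariance matrices are rank deficient. In practice the penalties are tuned, for example by cross-validation, rather than fixed in advance.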
Similarity Network Fusion (SNF) [15] is another graph-based approach, with implementations available in both R and MATLAB. Unlike xMWAS and (r)CCA, SNF builds networks where nodes are samples (e.g., patients) instead of omics features. For each omics dataset, a pairwise distance matrix is calculated by using statistical correlation or other distance measures, such as Euclidean distance. Patient similarity networks are built for each omics matrix and then combined by adopting a nonlinear combination method based on graph overlapping. The algorithm for network fusion derives from message-passing theory [47], and it iteratively updates each network such that it becomes more similar to the others at every iteration. In this way, low-weight edges are eliminated if present in a single omics matrix but are maintained if present in all networks, while high-weight edges present in one or more networks are added to the others. In addition, SNF is able to detect clusters of samples, as outlined in [48], and to predict labels for new samples on the basis of the constructed network. In a recent study, SNF demonstrated that a combination of multiple omics data can achieve a higher classification performance than single or fewer omics [49]. In another study, SNF also outperformed single omics datasets in identifying two clusters of patients based on their plasma omics profile [50]. Ruan et al. [51] proposed a variation of SNF that they named spectral clustering SNF (scSNF): with the aim of identifying molecular subtypes of idiopathic pulmonary fibrosis, SNF was first applied to the proteomics, miRNA, and RNA expression datasets. Then, spectral clustering was applied to the fused network, leveraging the eigenvectors of the graph Laplacian to project the subjects into a lower-dimensional space, thereby facilitating the grouping of subjects.
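The iterative fusion idea can be sketched as follows. This is a deliberately simplified illustration, not the reference SNF implementation: it uses a plain Gaussian kernel with a fixed bandwidth, simple row normalisation, and a k-nearest-neighbour local kernel, and all function names, parameters, and data are assumptions.

```python
import numpy as np

def affinity(X, sigma=1.0):
    """Gaussian similarity between samples (rows of X)."""
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq / (2.0 * sigma ** 2))

def row_normalize(W):
    return W / W.sum(axis=1, keepdims=True)

def knn_kernel(W, k):
    """Keep each sample's k strongest similarities (itself included)."""
    S = np.zeros_like(W)
    idx = np.argsort(-W, axis=1)[:, :k]
    rows = np.arange(W.shape[0])[:, None]
    S[rows, idx] = W[rows, idx]
    return row_normalize(S)

def fuse(views, k=2, n_iter=10, sigma=1.0):
    """Iteratively diffuse each view's network through the other views."""
    W = [affinity(X, sigma) for X in views]
    P = [row_normalize(w) for w in W]      # full similarity networks
    S = [knn_kernel(w, k) for w in W]      # sparse local networks
    m = len(views)
    for _ in range(n_iter):
        P_next = []
        for v in range(m):
            others = sum(P[u] for u in range(m) if u != v) / (m - 1)
            P_next.append(row_normalize(S[v] @ others @ S[v].T))
        P = P_next
    fused = sum(P) / m
    return (fused + fused.T) / 2.0         # symmetric fused network

# Two toy "omics" views of the same 6 samples (two obvious groups of three).
view_1 = np.array([[0.0], [0.2], [0.4], [10.0], [10.2], [10.4]])
view_2 = np.array([[1.0], [1.1], [1.3], [-9.0], [-9.2], [-9.1]])
fused = fuse([view_1, view_2])
```

The fused matrix can then be passed to a clustering method, such as spectral clustering, to group the samples.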
Correlation methods encompass a wide array of strategies. The majority of the studies employ simple Pearson’s correlation coefficients to disclose gene-transcript relationships with regard to transcriptional efficiency, post-translational modifications, and transcription delays. However, more sophisticated methods have emerged as reliable tools to elucidate the molecular mechanisms and patterns that characterize diverse phenotypes. An overview of the aforementioned papers and the integration strategies that have been adopted in real data studies is provided in Table 2.
Multivariate methods
Multivariate methods represent the most extensive and most variegated category of multi-omics integration strategies. These approaches frequently rely on algebraic decompositions of datasets, leveraging latent variables to extract the most relevant underlying information. Latent variables are algebraic coordinates inferred from data, that represent shared patterns between datasets and reduce their dimensionality. For this reason, they enable the identification of significant relationships and shared patterns and therefore simplify the integration problem.
A number of multivariate methods exist for integration; some of these are adaptations or extensions of widely used dimensionality reduction techniques, such as Principal Component Analysis (PCA). PCA is a technique that simplifies complex matrices by transforming them into a new coordinate system defined by principal components, which are the directions of maximum variance in the data. These principal components (PCs), which are linear combinations of the original features, serve as uncorrelated variables. They can be used for further analysis, enabling deeper insights and reducing redundancy. A simple, popular extension of PCA for the multi-block scenario is SUM-PCA. A multi-block data set can come either from a multi-platform analysis of the same samples or from the combination of chemical measurements with non-analytical data generated from sensory or consumer sciences. In both cases, the data are not simply multivariate but multi-modal, i.e., multivariate and multi-source.
SUM-PCA is an approach that applies PCA to a fused data block obtained by concatenating the omics matrices so that each row (sample) spans the variables of all blocks. In this method, all data blocks share the same set of super scores (Tsup), while retaining unique block-specific loadings (Pb) and residuals (Eb). The super scores (Tsup) serve as a comprehensive summary that captures the shared characteristics across all blocks and represents a consensus score. The relationship between the consensus scores and the combined block score matrix is described by the block weight matrix W, which quantifies the contribution of each block to the consensus for each principal component, as expressed by the equation Tsup = T⋅W. SUM-PCA has been employed to obtain a first overview of the similarity in behaviour between cell cultures [52].
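A minimal sketch of this decomposition is given below. It assumes that each block is centred and scaled to unit total variance before concatenation (a common but not universal choice), and it does not compute the block weight matrix W. The function name `sum_pca` and the random data are hypothetical.

```python
import numpy as np

def sum_pca(blocks, n_comp=2):
    """PCA on concatenated blocks: shared super scores, block-specific loadings."""
    scaled = []
    for X in blocks:
        Xc = X - X.mean(axis=0)
        scaled.append(Xc / np.linalg.norm(Xc))   # equal total variance per block
    X_cat = np.hstack(scaled)                    # samples x (all variables)
    U, s, Vt = np.linalg.svd(X_cat, full_matrices=False)
    T_sup = U[:, :n_comp] * s[:n_comp]           # super scores shared by all blocks
    splits = np.cumsum([X.shape[1] for X in blocks])[:-1]
    loadings = np.split(Vt[:n_comp].T, splits, axis=0)   # P_b for each block
    residuals = [Xb - T_sup @ Pb.T for Xb, Pb in zip(scaled, loadings)]  # E_b
    return T_sup, loadings, residuals

rng = np.random.default_rng(3)                   # synthetic data, illustration only
blocks = [rng.normal(size=(6, 4)), rng.normal(size=(6, 5))]
T_sup, loadings, residuals = sum_pca(blocks, n_comp=2)
```

Each scaled block is reconstructed as the product of the shared super scores and its own loadings, plus a block-specific residual, which mirrors the roles of Tsup, Pb and Eb described above.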
Multi-Omics Factor Analysis (MOFA) is another generalization of PCA that has been proposed for omics integration [53,54,55,56,57,58]. MOFA is a data-driven approach that uses a set of hidden factors to identify the underlying causes of variability in multi-omics data sets. By leveraging these factors, MOFA enables the identification of the principal sources of variation in multi-omics data sets, and the determination of axes of heterogeneity that are either shared or unique across the different omics datasets. The algebraic principle of MOFA involves the decomposition of each original omics data matrix into the product of two matrices: the latent factor matrix, common to all data matrices, and a weight matrix specific to each data platform. An added residual noise term is also considered. Once the model is trained, the R package MOFAtools, which is included in the MOFA package, can be employed as a semi-automated pipeline to identify the latent factors. The variation explained by each factor is computed, and then the main contributors to sample heterogeneity can be visualized in a low-dimensional space. Finally, the features with the highest weights can be inspected. The package also supports the imputation of missing data by calculating the missing values directly from the model equation. In a recent study, transcriptomics, proteomics, metabolomics, and lipidomics data from blood samples collected from patients affected by Alzheimer’s disease were used to identify analytes that discriminated different groups by setting a threshold on their normalized absolute loading value from MOFA [53]. MOFA has also been employed to examine the extent to which mRNA and protein regulation correlate in aggregative multicellular organisms [57]. As Armenteros et al. [55] have previously proposed, the latent factors identified by MOFA can also be associated with clinical variables and covariates.
The authors found a correlation between the secretion of C-peptide and clinical benefits in type 1 diabetes. In Aydin et al. [56], MOFA factors enabled the identification of novel target and mediator genes of known quantitative trait loci hotspots, as well as additional loci that were found to drive variation in the three integrated omics datasets. More recently, a new version of MOFA, called MOFA+ [18], has been developed to extend MOFA’s application to single-cell analysis. MOFA+ improves the scalability of MOFA and is able to manage side information regarding the structure between cells. Its capacity to analyze datasets comprising data from millions of cells makes it particularly well-suited for single-cell analysis. Park et al. [59] employed MOFA+ to perform unsupervised classification of genomics, transcriptomics, proteomics and blood biomarkers.
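The decomposition underlying MOFA, as described above, can be written compactly. The symbols are introduced here for illustration and follow one common notation, with \(M\) omics matrices measured on the same \(N\) samples:

\[
Y^{(m)} = Z\,{W^{(m)}}^{T} + \varepsilon^{(m)}, \qquad m = 1,\dots,M
\]

Here \(Y^{(m)}\) is the \(N \times D_m\) data matrix of the \(m\)-th omics layer, \(Z\) is the \(N \times K\) matrix of latent factors shared by all layers, \(W^{(m)}\) is the \(D_m \times K\) weight matrix specific to layer \(m\), and \(\varepsilon^{(m)}\) is the residual noise term. Under this model, a missing entry of \(Y^{(m)}\) can be imputed from the corresponding entry of \(Z\,{W^{(m)}}^{T}\).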
MEFISTO (Method for the Functional Integration of Spatial and Temporal Omics data) is an extension of MOFA [60] that was developed to address the temporal dimension. It has been employed to analyse temporal relationships in proteomics and transcriptomics data and to identify pro-thrombotic signal factors that changed over time from baseline during the convalescence of COVID-19 patients [61]. Finally, among the PCA extensions for omics integration, we can list Multiple Factor Analysis (MFA), which is implemented in the R package FactoMineR [19]. The strength of MFA is the possibility to analyse data by taking into account a partition of the variables into groups (j = 1,…,J groups of variables). In MFA, PCA is performed on weighted variables: the same weight is assigned to every variable belonging to the same group j (j = 1,…,J), and this weight is set equal to the inverse of the first eigenvalue of the PCA on group j. This weighting balances the global analysis because the maximum axial inertia (i.e., total variance of the group of variables projected onto a principal component) for each group is 1. Factorial axes have potential applications in pattern recognition [62]. These axes can facilitate the comprehension of the contribution of each omics dataset to the distance between samples, and the identification of omics matrices that provide similar or discordant information [63].
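The group weighting used in MFA can be sketched as follows. This is an illustrative NumPy example rather than the FactoMineR implementation. Dividing a standardized group by its first singular value is equivalent, up to a constant factor common to all groups, to weighting its variables by the inverse of the first eigenvalue, so that no single group dominates the global PCA. The function name `mfa_scores` and the random data are assumptions.

```python
import numpy as np

def mfa_scores(groups, n_comp=2):
    """Global PCA on groups that are each rescaled by their first singular value."""
    weighted = []
    for X in groups:
        Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)   # standardize variables
        s1 = np.linalg.svd(Z, compute_uv=False)[0]         # first singular value
        weighted.append(Z / s1)                            # leading inertia becomes 1
    U, s, Vt = np.linalg.svd(np.hstack(weighted), full_matrices=False)
    return U[:, :n_comp] * s[:n_comp], weighted

rng = np.random.default_rng(4)                   # synthetic data, illustration only
groups = [rng.normal(size=(10, 40)), rng.normal(size=(10, 5))]
scores, weighted = mfa_scores(groups)
```

After weighting, the large group (40 variables) and the small group (5 variables) both have a leading singular value of 1, which is the balancing property described in the text.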
Another adaptation of a dimensionality reduction technique for integration is the Data Integration Analysis for Biomarker discovery using Latent cOmponents (DIABLO) [64]. DIABLO is a tool included in the mixOmics package that has been successfully used for the data-driven, holistic, and hypothesis-free identification of robust biomarkers and disease mechanisms, and for sample prediction [54, 65,66,67,68,69,70,71,72]. The algorithm consists of a supervised extension of the sparse Generalized Canonical Correlation Analysis (sGCCA) method [73], which is a multivariate dimension reduction method based on the singular value decomposition (SVD). SVD is a matrix factorization technique that decomposes a matrix \(M\) as \(M=U\Sigma {V}^{T}\), i.e. as the product of a left singular matrix \(U\), a diagonal matrix of singular values \(\Sigma\), and the transpose of a right singular matrix \(V\). sGCCA maximizes the covariance between linear combinations of the variables, referred to as latent component scores, and projects the data into a lower-dimensional subspace spanned by these components. To select the associated variables across omics levels, sGCCA internally applies an l1 penalization, also known as the least absolute shrinkage and selection operator (LASSO), to the variable coefficient vector, analogously to the regularization approach used in rCCA. These coefficients are the factors associated with the different phenotypes across the different omics data [54]. To extend sGCCA to a classification framework, DIABLO takes into consideration a dummy indicator matrix that indicates the class membership of each sample. Moreover, it replaces the l1 penalty parameter with the number of variables to select in each dataset and each component, as there is a direct correspondence between the two parameters.
DIABLO has been applied in several research works and for different purposes. These include the identification of multi-omics features to discriminate different phenotypes [66, 68, 69, 71], the determination of the correlation between different omics data sets [71, 72], the assessment of which data set yields the most discriminative power [70], the determination of feature contributions to the latent variables [65], and the prediction of categories of interest [67].
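Two ingredients mentioned above, the dummy indicator matrix and the selection of a fixed number of variables through soft-thresholding, can be illustrated on a single omics block. This is a simplified, single-block, single-step sketch and not the DIABLO algorithm itself; the function names `dummy_matrix` and `sparse_loading` and the toy data are hypothetical.

```python
import numpy as np

def dummy_matrix(labels):
    """One column per class, 1 where the sample belongs to that class."""
    classes = sorted(set(labels))
    return np.array([[1.0 if lab == c else 0.0 for c in classes] for lab in labels])

def sparse_loading(X, labels, keep):
    """First loading vector of X against the class indicators, with `keep` nonzeros."""
    Xc = X - X.mean(axis=0)
    Y = dummy_matrix(labels)
    Yc = Y - Y.mean(axis=0)
    # Leading left singular vector of the cross-product between X and the classes.
    U, s, Vt = np.linalg.svd(Xc.T @ Yc, full_matrices=False)
    u = U[:, 0]
    abs_sorted = np.sort(np.abs(u))[::-1]
    threshold = abs_sorted[keep] if keep < len(u) else 0.0
    # Soft-thresholding: shrink all coefficients and zero out the smallest ones.
    u_sparse = np.sign(u) * np.maximum(np.abs(u) - threshold, 0.0)
    return u_sparse / np.linalg.norm(u_sparse)

# Toy block: features 0 and 1 differ between the classes, features 2-4 do not.
X = np.array([[ 3, -2,  1,  0,  1],
              [ 3, -2, -1,  1,  0],
              [ 3, -2,  0, -1, -1],
              [-3,  2,  0, -1,  1],
              [-3,  2,  1,  0,  0],
              [-3,  2, -1,  1, -1]], dtype=float)
labels = ["control", "control", "control", "disease", "disease", "disease"]
loading = sparse_loading(X, labels, keep=2)
```

Choosing `keep` plays the role of the l1 penalty parameter: a smaller value corresponds to a stronger penalty and a sparser loading vector.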
Another multivariate approach for omics integration is the Projection to Latent Structures or Partial Least Squares (PLS)-based method, which is also available in the mixOmics package. In general, PLS is a multivariate projection-based method that explores and explains the relationship between two or more continuous variables. It achieves this by projecting the original data onto a set of latent variables or components that maximize the covariance between the predictors and the response variables. While PLS is specifically designed for regression purposes, its variant PLS-Discriminant Analysis (PLS-DA) performs sample classification: instead of containing continuous variables, the response vector contains categorical ones. Moreover, the method focuses on maximizing the separation between predefined classes while simultaneously capturing the variance in the predictors. This makes it particularly suitable for analyzing complex and high-dimensional omics data. Both these algorithms have been extended to a sparse version to cope with high-dimensional data through the implementation of an l1 penalization to reduce the number of variables. One of the main advantages of PLS-DA is that it associates each variable with a Variable Importance in Projection (VIP) score. The VIP score is a metric that assesses the contribution of the variable in explaining the variance of both the predictors (X) and the response variables (Y). As an example of application, features selected by PLS-DA on the basis of their importance have highlighted crucial relationships between metabolites and proteins in COVID-19 patients, by visualizing bipartite omics connections with a relevance network [74]. A variation of PLS-DA, named backward elimination PLS-DA (BE PLS-DA), was implemented by Benedetto and coworkers to select the best discriminant model able to separate the classes under study [62].
BE PLS-DA uses an iterative approach to refine the regression model: it performs variable selection based on the VIP scores, discarding the least important variables in each cycle to enhance the discriminative power of the model.
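For reference, a commonly used definition of the VIP score of variable \(j\) in a PLS model with \(A\) components and \(p\) predictors is given below. The notation is introduced here for illustration and is not taken from the cited studies:

\[
\mathrm{VIP}_j = \sqrt{\, p \; \frac{\sum_{a=1}^{A} \mathrm{SSY}_a \left( w_{ja} / \lVert w_a \rVert \right)^{2}}{\sum_{a=1}^{A} \mathrm{SSY}_a} \,}
\]

where \(w_{ja}\) is the weight of variable \(j\) in component \(a\), \(w_a\) is the weight vector of component \(a\), and \(\mathrm{SSY}_a\) is the sum of squares of the response explained by component \(a\). With this definition the squared VIP scores average to 1 across variables, so variables with a score above 1 are often regarded as influential.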
Co-Inertia Analysis (CIA) [16] and Multiple Co-Inertia Analysis (MCIA) [17] represent another class of multivariate analysis. These methods were developed to assess relationships and trends across multiple datasets, and they have been proposed for integrative omics analysis [36, 46, 75,76,77]. CIA is included in the R-Bioconductor package made4. By simultaneously generating ordinations (dimension reduction diagrams), it identifies successive orthogonal axes with the maximum squared covariance between the datasets, thereby effectively representing joint similarities and trends. This method has made it possible, for instance, to identify the patterns of co-expression associated with the maximum covariance between proteins and genes in brain ischemia [36]. CIA has also been performed to assess co-variability between proteomics and lipidomics data from lung tissue samples of insulin-deficient diabetes mellitus pigs and wild type pigs [76].
MCIA is a generalization of CIA to integrate more than two omics datasets, and it is implemented in the R-Bioconductor omicade4 package. The MCIA algorithm transforms each omics dataset separately into a comparable lower-dimensional space, maximizing the sum of the squared covariance between the scores of each dataset through synthetic axes. The different datasets are then projected into the same dimensional space, such that features that share similar trends are projected close together, highlighting relationships among samples and the overall consistency of the datasets. MCIA has been employed to determine the co-relationships and to visualise the similarity and the divergence of datasets from patients affected by ischemic stroke and mevalonate kinase deficiency [46, 75]. In the first study, MCIA highlighted an overall dissimilarity in the structure of the gene and protein datasets, which was confirmed by a low RV coefficient [46]. In the second one [75], the projection of exome, transcriptome and proteome on the same space demonstrated different transcriptomics and proteomics profiles in healthy and pathological conditions. Finally, MCIA was employed by Ichikawa et al. [77] to identify co-inertia drivers, using the R package MiBiOmics [78].
Multivariate methods represent the most heterogeneous class of integration methods. They comprise two categories of tools: algorithms that have been specifically designed for omics integration (e.g., DIABLO, MOFA and its extensions), and algorithms that have been adapted to the scope (e.g., PLS-DA, MCIA and MFA). These methods have been increasingly used in recent years and have provided meaningful insights about omics datasets relationships, correspondences, and discrepancies. An overview of the above-mentioned papers and the integration strategies is provided in Table 3.
Machine learning and artificial intelligence
Machine learning (ML) is a powerful data science tool that enables systems to analyze complex data, identify patterns, and make informed predictions or decisions automatically. These algorithms are broadly categorized into supervised and unsupervised learning, depending on whether the data includes labeled outcomes (i.e., known classes for each sample) or not. In supervised learning, the model is trained to map inputs to their corresponding labels and can then be employed to predict labels for new, unseen data. Popular supervised ML techniques include linear and logistic regression, decision trees, random forests, support vector machines, and neural networks. In contrast, unsupervised learning involves training models to uncover underlying structures or patterns in unlabeled data. Common unsupervised algorithms include clustering methods such as k-means, as well as techniques for dimensionality reduction and anomaly detection.
Clustering is an approach that groups samples based on a predefined distance metric, such that samples within the same cluster are more similar to each other than to those in different clusters. Traditional clustering methods are typically run once, which makes the robustness and reproducibility of their results variable. To address this, consensus clustering offers a more reliable approach by aggregating results from multiple clustering iterations. One implementation of consensus clustering is ConsensusClusterPlus [20], available as a Bioconductor package. The strategy is an unsupervised approach in which a subsample of both samples and features is repeatedly partitioned into k groups by a clustering algorithm. The proportion of repetitions in which two items are clustered together (among those in which both are sampled) is defined as the pairwise consensus value. For each k, pairwise consensus values are calculated and stored in a consensus matrix; then, the final agglomerative hierarchical consensus clustering is obtained using a distance of 1 minus the consensus value and pruned to k consensus clusters. This approach has been employed by Liu et al. [79] to delineate a comprehensive characterization of esophageal squamous cell carcinomas. They identified four distinct molecular subtypes, each associated with potential therapeutic targets and diagnostic biomarkers. In another study, which aimed at identifying tumor molecular subtypes by integrating data from transcriptomics, proteomics, and phosphoproteomics [80], this strategy was applied both to each individual omics dataset and to the integrated multi-omics data. The integration of omics data performed better than the single-omics analyses: a higher silhouette score, i.e., an index of clustering quality that measures how well a sample fits in its assigned cluster, suggests that integrating the three types of omics data better classifies cancer subtypes.
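The construction of the consensus matrix can be sketched as follows. This is an illustration under stated assumptions, not the ConsensusClusterPlus implementation: only samples are subsampled (not features), the base algorithm is a small k-means with a farthest-point initialisation chosen here for simplicity, and the final hierarchical clustering step on 1 minus the consensus is omitted. All function names and parameters are hypothetical.

```python
import numpy as np

def kmeans_labels(X, k, rng, n_iter=50):
    """Small k-means with farthest-point initialisation (a simplification)."""
    centers = [X[rng.integers(len(X))]]
    for _ in range(1, k):
        d = np.min([np.sum((X - c) ** 2, axis=1) for c in centers], axis=0)
        centers.append(X[np.argmax(d)])
    centers = np.array(centers)
    for _ in range(n_iter):
        dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        new_centers = centers.copy()
        for j in range(k):
            if np.any(labels == j):
                new_centers[j] = X[labels == j].mean(axis=0)
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels

def consensus_matrix(X, k, n_reps=50, frac=0.8, seed=0):
    """Fraction of co-sampled repetitions in which two items co-cluster."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    together = np.zeros((n, n))      # times i and j fell in the same cluster
    sampled = np.zeros((n, n))       # times i and j were both subsampled
    m = max(k, int(round(frac * n)))
    for _ in range(n_reps):
        idx = rng.choice(n, size=m, replace=False)
        labels = kmeans_labels(X[idx], k, rng)
        same = (labels[:, None] == labels[None, :]).astype(float)
        sampled[np.ix_(idx, idx)] += 1.0
        together[np.ix_(idx, idx)] += same
    return np.divide(together, sampled,
                     out=np.zeros_like(together), where=sampled > 0)
```

Entries close to 0 or 1 indicate stable assignments, whereas intermediate values flag pairs of samples whose co-clustering depends on the subsample.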
Another clustering-based integration strategy involves the clustering in a latent variable space, like in the case of integrative clustering framework. The iClusterBayes package [21] employs a Bayesian latent variable model to integrate multiple genomic data types measured in the same set of samples. This method provides an integrated cluster assignment through joint inference across data types, while identifying features that drive the formation of these clusters. Integrative clustering has been employed mainly to identify groups of patients with significant distinct clinical and disease profiles [81, 82].
One of the latest clustering-based strategies is subspace clustering, as described in Gillenwater et al. [83]. Subspace clustering is implemented within the R package MineClus (Mining Non-Empty clusters) [22] and consists in identifying clusters in subspaces of high-dimensional data. The reduction of proteomics, transcriptomics, and metabolomics data is performed by autoencoders (AE) prior to clustering. AE are deep neural networks consisting of layers of interconnected nodes: an encoder compresses the input into a reduced representation, and a decoder tries to reconstruct the original input from it. The nodes use activation functions to process inputs and produce outputs; training an AE entails adjusting the network weights to minimize the difference between the input and the reconstructed output. After reducing the datasets with AE, embeddings from all omics layers were horizontally concatenated for subspace clustering of the integrated data. In the work of Gillenwater et al. [83], the analysis did not produce a satisfactory clustering, since it did not achieve the aim of determining molecular-based clusters with distinct clinical phenotypes. More consistent results in terms of patients’ clinical characterization were in fact obtained by performing subspace clustering on each distinct omics dataset.
In contrast, clustering on a latent representation of the stacked multi-omics features was successfully applied in the works of Wang et al. [84] and Khadirnaikar et al. [85]. In the former [84], k-means clustering on the AE embeddings was employed to identify differentially altered pathways associated with different phenotypes of long COVID. In the latter [85], consensus k-means clustering on the latent representation identified labels that are associated with pan-cancer subgroups with distinct clinical characteristics.
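The reduce-then-concatenate workflow described above can be illustrated with a deliberately small autoencoder written in NumPy. This is a sketch under several assumptions: a single tanh hidden layer, full-batch gradient descent, and standardized inputs, whereas the cited studies used deeper networks trained with dedicated frameworks. The names `train_autoencoder` and `integrate_embeddings` and all hyperparameters are hypothetical.

```python
import numpy as np

def train_autoencoder(X, n_latent, n_epochs=500, lr=0.05, seed=0):
    """One-hidden-layer autoencoder trained by gradient descent.

    Returns an encoder function and the reconstruction loss per epoch.
    """
    rng = np.random.default_rng(seed)
    n, p = X.shape
    W_enc = rng.normal(scale=0.1, size=(p, n_latent))
    b_enc = np.zeros(n_latent)
    W_dec = rng.normal(scale=0.1, size=(n_latent, p))
    b_dec = np.zeros(p)
    losses = []
    for _ in range(n_epochs):
        Z = np.tanh(X @ W_enc + b_enc)       # encoder: reduced representation
        X_hat = Z @ W_dec + b_dec            # decoder: reconstruction
        err = X_hat - X
        losses.append(float(np.mean(err ** 2)))
        # Backpropagate the mean squared reconstruction error.
        g_out = 2.0 * err / err.size
        g_W_dec = Z.T @ g_out
        g_b_dec = g_out.sum(axis=0)
        g_pre = (g_out @ W_dec.T) * (1.0 - Z ** 2)   # tanh derivative
        g_W_enc = X.T @ g_pre
        g_b_enc = g_pre.sum(axis=0)
        W_dec -= lr * g_W_dec
        b_dec -= lr * g_b_dec
        W_enc -= lr * g_W_enc
        b_enc -= lr * g_b_enc

    def encode(X_new):
        return np.tanh(X_new @ W_enc + b_enc)

    return encode, losses

def integrate_embeddings(embeddings):
    """Horizontally concatenate per-omics embeddings (same samples in rows)."""
    return np.hstack(embeddings)
```

The concatenated matrix returned by `integrate_embeddings` would then be passed to the clustering algorithm of choice, for example k-means or consensus clustering.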
As regards supervised models, regression is one of the approaches commonly employed in omics data analysis. These models are designed to capture the linear relationship between one or more independent variables (predictors or explanatory variables) and a dependent variable (the outcome or response). Once fitted, the regression line can be used to predict the outcome value of new input data. In the context of omics integration, regression models have not been utilized as widely as unsupervised clustering-based methods: we found only one example, in which regression was employed to integrate proteins and metabolites to predict disc herniation development in dogs [86]. Horvatić et al. [86] used a version of linear regression called elastic-net, in which a regularization term is added to the equation to avoid overfitting. Depending on the structure of the regularization term, regularization can be defined as LASSO, ridge, or elastic-net, which differ in penalty term, shrinkage, and feature selection. Elastic-net is a compromise between LASSO and ridge, as the penalty term of its loss function is a combination of both. In the study described in [86], different elastic-net regression models were fitted with different feature subsets, selected either through recursive feature elimination (RFE) or minimum redundancy-maximal relevance (mRMR) algorithms. The final model was built from the features that were repeatedly selected, and it correctly classified all the samples in the test set.
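For reference, the relationship between the three penalties can be made explicit with one common parametrization of the elastic-net objective for a squared-error loss. The source does not state which formulation was used in [86], so the symbols below (n samples, response y_i, feature vector x_i, penalty strength λ, mixing parameter α) are assumptions introduced for illustration.

```latex
\hat{\beta} = \arg\min_{\beta_0,\,\beta}\;
\frac{1}{2n}\sum_{i=1}^{n}\bigl(y_i - \beta_0 - x_i^{\top}\beta\bigr)^2
+ \lambda\Bigl[\alpha\,\lVert\beta\rVert_1
+ \frac{1-\alpha}{2}\,\lVert\beta\rVert_2^2\Bigr]
```

With α = 1 the penalty reduces to LASSO, which can shrink coefficients exactly to zero and therefore performs feature selection; with α = 0 it reduces to ridge, which shrinks coefficients without removing them; intermediate values combine the two behaviours.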
Machine learning (ML) classification models have been gaining increasing prominence in the field of multi-omics integration. This is particularly true for ensemble learning approaches such as Random Forest (RF) [87,88,89,90], Adaptive Boosting (AdaBoost) [87] and Gradient Boosting Machine [91]. RF combines multiple decision trees to make more accurate predictions. In decision trees, each node represents a query on one or more input features, and each branch represents the outcome of the decision. By training each tree on a random subset of data and performing classification based on the decision of multiple trees, the prediction becomes more stable and less prone to overfitting. Moreover, RF is often chosen for its capability to deal with high dimensionality and missing values [89]. Huang et al. [87] have employed RF for multi-class classification by combining metagenomics, metatranscriptomics, metabolomics, proteomics, and viromics, reaching an Area Under the Curve (AUC) above 0.83. Li et al. [88] have integrated proteomics and metabolomics through RF to identify prediction biomarkers. Finally, RF was used to build a prognostic model including genomics, transcriptomics, proteomics, and histopathological image features [89], which outperformed the single-omics models. The combination of multiple proteins and metabolites has provided better results in terms of AUC, as illustrated in [90]. However, it is important to note that using a whole set of omics data does not always yield better results than using a subset. In the work of Wang et al. [92], a panel of two proteins and two metabolites was employed to build several models, and these molecules were able to discriminate among the different conditions.
AdaBoost is another ensemble learning method that achieved a good AUC in a multi-class classification problem [87]. Like random forests, AdaBoost combines decision trees, but it builds them sequentially: it first assigns an equal weight to all training samples, fits a tree, calculates the errors, and increases the weights of the misclassified data points. Another tree is then fitted on the same dataset with the updated weights, and the process is repeated iteratively until all the trees are fitted.
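The reweighting loop can be sketched with the classical two-class AdaBoost algorithm, using decision stumps (one-split trees) as base learners. This is an illustration only: the study in [87] addressed a multi-class problem and its implementation details are not given here. Labels are assumed to be coded as +1 and -1, each feature is assumed to take at least two distinct values, and all function names are hypothetical.

```python
import numpy as np

def stump_predict(X, feature, threshold, polarity):
    """Predict `polarity` below the threshold and `-polarity` above it."""
    return np.where(X[:, feature] < threshold, polarity, -polarity)

def fit_stump(X, y, w):
    """Find the stump with the lowest weighted misclassification error."""
    best = (np.inf, 0, 0.0, 1)
    for j in range(X.shape[1]):
        values = np.unique(X[:, j])
        for thr in (values[:-1] + values[1:]) / 2.0:
            for polarity in (1, -1):
                pred = stump_predict(X, j, thr, polarity)
                err = np.sum(w[pred != y])
                if err < best[0]:
                    best = (err, j, thr, polarity)
    return best

def adaboost_fit(X, y, n_rounds):
    n = len(y)
    w = np.full(n, 1.0 / n)                  # equal initial weights
    ensemble = []
    for _ in range(n_rounds):
        err, j, thr, polarity = fit_stump(X, y, w)
        err = np.clip(err, 1e-10, 1 - 1e-10)
        alpha = 0.5 * np.log((1 - err) / err)    # weight of this stump
        pred = stump_predict(X, j, thr, polarity)
        w = w * np.exp(-alpha * y * pred)    # up-weight misclassified points
        w = w / w.sum()
        ensemble.append((alpha, j, thr, polarity))
    return ensemble

def adaboost_predict(X, ensemble):
    score = sum(a * stump_predict(X, j, thr, p) for a, j, thr, p in ensemble)
    return np.where(score >= 0, 1, -1)
```

On a toy one-dimensional dataset whose labels alternate in blocks, no single stump separates the classes, but a weighted vote of a few stumps can.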
Another ensemble learning method is the Gradient Boosting Machine. This algorithm builds an additive model iteratively: at each step, a new regression model is fitted by least squares to the residuals of the current ensemble. Gradient Boosting Machines have been successfully applied to construct predictive models for responders and non-responders to a low-caloric diet using transcriptomics, lipidomics, and metabolomics data, achieving an AUC of 0.75 [91].
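The residual-fitting loop can be sketched for a squared-error loss, where the residuals coincide with the negative gradient of the loss. The example assumes a single continuous feature and regression stumps as base learners, which is a simplification: the model in [91] was a classifier trained on many features. Function names, the learning rate, and the number of rounds are hypothetical choices.

```python
import numpy as np

def fit_regression_stump(x, residuals):
    """Best single split of a 1-D feature by least squares on the residuals."""
    values = np.unique(x)
    best = None
    for thr in (values[:-1] + values[1:]) / 2.0:
        left = x < thr
        left_mean = residuals[left].mean()
        right_mean = residuals[~left].mean()
        fitted = np.where(left, left_mean, right_mean)
        sse = np.sum((residuals - fitted) ** 2)
        if best is None or sse < best[0]:
            best = (sse, thr, left_mean, right_mean)
    return best[1], best[2], best[3]

def gradient_boost_fit(x, y, n_rounds=50, learning_rate=0.1):
    base = float(np.mean(y))                 # initial constant model
    pred = np.full(len(y), base)
    stumps = []
    for _ in range(n_rounds):
        residuals = y - pred                 # what the ensemble still misses
        thr, left_mean, right_mean = fit_regression_stump(x, residuals)
        pred = pred + learning_rate * np.where(x < thr, left_mean, right_mean)
        stumps.append((thr, left_mean, right_mean))
    return {"base": base, "learning_rate": learning_rate, "stumps": stumps}

def gradient_boost_predict(x, model):
    pred = np.full(len(x), model["base"])
    for thr, left_mean, right_mean in model["stumps"]:
        pred = pred + model["learning_rate"] * np.where(
            x < thr, left_mean, right_mean)
    return pred
```

The learning rate shrinks the contribution of each stump, trading a larger number of rounds for a smoother fit.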
Other classification models that have been successfully applied in multi-omics integration include Support Vector Machines (SVM) [93] and deep learning approaches [87]. SVMs address binary classification problems by identifying an optimal hyperplane that separates two classes in a high-dimensional feature space. This hyperplane is selected to maximize the margin between the two classes, ensuring robust performance even with complex or high-dimensional datasets. In contrast, deep learning models, e.g. the feedforward neural networks used in the work of Huang et al. [87], employ multiple layers of interconnected neurons to capture complex, non-linear relationships in the data. These models are trained iteratively on the input data to minimize prediction error, allowing them to learn patterns and features relevant for accurate diagnosis.
Leveraging the potential of ML models, Bai et al. [94] employed a Python package called AutoGluon-Tabular [23] to linearly combine the results of several models. AutoGluon-Tabular is an automated ML framework designed to build predictive models from unprocessed tabular datasets, such as CSV files. It simplifies the modeling process by automatically recognizing data types in each column, including text, and optimizing hyperparameters and feature engineering. The framework trains various base models, including Random Forests, LightGBM, CatBoost, ExtraTrees, XGBoost, and neural networks (e.g., NeuralNetMXNet and NeuralNetFastAI). The predictions of these base models are then used as features to train a final ensemble model, which combines the strengths of the base models.
In general, when compared with single-omics models, multi-omics classification performs better in terms of accuracy and AUC [89, 93]. This confirms the power of investigating several layers of biological processes in order to achieve a better understanding of diseases and phenotypes. An overview of the above-mentioned papers and the integration strategies that have been put in practice is provided in Table 4. Table 5 reports the accuracy metrics of the cited papers for both the multi-omics and the single omics cases.
Challenges and future directions in omics integration
As discussed throughout this review, the integration of different omics datasets holds much potential for uncovering complex biological insights that cannot be achieved by analyzing individual datasets alone. This approach is increasingly being employed across diverse contexts, and it has demonstrated its ability to provide new insights on disease biomarkers and mechanisms. However, the processing and analysis of omics data matrices pose a variety of challenges, and merging multiple omics data matrices further exacerbates these problems.
First, the quality of each dataset should be assessed to guarantee data reproducibility [95]. This is important because some common data analysis techniques are highly affected by the presence of outliers, either as single analytes or as whole samples [96]. Moreover, in computer science, the expression “garbage in, garbage out” is often used to express the concept that the quality of the input determines the quality of the output. Therefore, the result of the analyses strongly depends on the robustness of the initial input.
Another issue is that omics data matrices differ in terms of data types, size, noise, correspondence and correlation between measurements from different technologies [7]. When combining additional information, e.g. clinical data, dissimilarities in data types may include differences in the order of magnitude, measurement unit or variance–covariance structure [97], mismatched distribution, and diverse data modalities, i.e. continuous signals, discrete counts, intervals, categorical variables, pathways, etc. [98]. Finally, each omics dataset is characterized by a different number of features, typically hundreds to thousands for transcripts, several tens or hundreds for proteins, and tens to thousands for metabolites [99]. The omics layer with the highest number of features may dominate the others, possibly adding annotation bias and enrichment of noise if the employed model is not robust enough [99].
In the following paragraphs, we provide an overview of the main issues that can be encountered when dealing with and integrating high-throughput omics datasets. We also describe the strategies that have been adopted in literature to cope with these challenges, which are also depicted in Fig. 3.
Missing data
High-throughput platforms often produce matrices with a high percentage of missing data, especially in proteomics and metabolomics. More specifically, missing values can be classified as missing at random (MAR), missing completely at random (MCAR), and missing not at random (MNAR) [100]. MAR data occur when the probability of a missing value depends on other variables in the dataset, but not on the missing value itself; MCAR data occur when the probability of missing data is the same for all observations and is not related to any other variable in the dataset; MNAR data occur when the probability of missing data is related to the missing value itself. The approach to missing values depends on the category they belong to; however, most statistical methods do not allow missing data, and it is seldom possible to remove entire columns or rows of the data matrix. Imputation is often performed to cope with these problems, but its performance depends on the percentage of missing values with respect to the total amount of information, and it can strongly affect the results of downstream analyses [101, 102]. Moreover, imputing missing values is not a valid methodology in some cases: reasons for the absence of data are manifold, from the actual absence of a protein or metabolite, to low coverage or low sensitivity of the instrument [98]. If the absence of data is due to the actual lack of a protein in a sample, its value should not be replaced by a fictitious value. On the other hand, if imputation is considered reasonable, missing values can be replaced by a measure of central tendency, such as the mean, mode, or median of the feature; alternatively, they can be replaced by constant or even random values.
Other approaches that we found in the literature consist in filling the missing values with either half of the minimum value of each feature [87] or one fifth of the minimum value recorded in the dataset for that molecule [52].
These strategies fall into the category of univariate methods, as they impute missing values using only the non-missing values of that same feature. Alternatively, multivariate imputation algorithms use the entire set of available features to estimate the missing values. For instance, the K-Nearest Neighbour (KNN) method has been suggested to fill in missing values by using the values of the K most similar data points in the dataset [80, 85, 86, 90]. The underlying idea is that if there is a missing value in the dataset, the values of the closest data points can be used to estimate it.
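The difference between univariate and multivariate imputation can be illustrated with a short sketch, assuming a matrix with samples in rows, features in columns, and missing values coded as NaN. The function names are hypothetical, and the KNN variant shown here searches for neighbours among samples using a root-mean-square distance over co-observed features; published tools may instead search among features or use other distances.

```python
import numpy as np

def impute_fraction_of_min(X, divisor=2.0):
    """Univariate: replace NaN by the feature minimum divided by `divisor`."""
    X = np.array(X, dtype=float)             # work on a copy
    fill = np.nanmin(X, axis=0) / divisor    # one fill value per feature
    rows, cols = np.where(np.isnan(X))
    X[rows, cols] = fill[cols]
    return X

def knn_impute(X, k=3):
    """Multivariate: average the k nearest samples that observed the feature."""
    X = np.array(X, dtype=float)
    missing = np.isnan(X)
    out = X.copy()
    for i in np.where(missing.any(axis=1))[0]:
        # Distance from sample i to every sample over co-observed features.
        shared = ~missing[i] & ~missing
        diff = np.where(shared, X - X[i], 0.0)
        n_shared = shared.sum(axis=1)
        dist = np.full(X.shape[0], np.inf)
        ok = n_shared > 0
        dist[ok] = np.sqrt((diff[ok] ** 2).sum(axis=1) / n_shared[ok])
        dist[i] = np.inf                     # a sample is not its own neighbour
        for j in np.where(missing[i])[0]:
            candidates = np.where(~missing[:, j] & np.isfinite(dist))[0]
            if candidates.size == 0:
                continue                     # nothing to borrow from
            nearest = candidates[np.argsort(dist[candidates])[:k]]
            out[i, j] = X[nearest, j].mean()
    return out
```

Setting `divisor=2.0` corresponds to the half-minimum strategy and `divisor=5.0` to the one-fifth strategy mentioned above.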
Another solution is imputation with the probabilistic minimum [57], which is implemented in the imputeLCMD R package [103]. This approach replaces missing values with a minimal value, estimated from the lowest values detected in the entire dataset or within each sample; in the probabilistic variant, the imputed values are drawn at random from a distribution centred on this minimal value.
Finally, another way to cope with missing data is to employ ML methods that can handle missing values directly, for instance random forest [8, 104]. However, not all machine learning models are robust or perform well in the presence of missing data.
Collinearity
Multicollinearity is a high degree of linear correlation between explanatory variables [105], and it often occurs in omics data since features are the result of biological mechanisms that are physically interconnected. In the context of regression analysis, multicollinearity can be assessed by the Variance Inflation Factor (VIF), the condition index (CI), and the Variance Decomposition Proportion (VDP). Mathematically, the VIF is calculated by regressing a predictor i against all the other predictors in the model and calculating the ratio $1/(1 - R_i^2)$, where $R_i^2$ is the regression $R^2$ for that specific predictor i. This index quantifies the extent to which the variance of a coefficient is increased as a consequence of the multicollinearity in a regression analysis. The CI is calculated by performing SVD and computing the square root of the ratio between the largest eigenvalue and the eigenvalue associated with the predictor variable of interest. Finally, the VDP represents the degree of variance inflation due to multicollinearity, and it makes it possible to determine the variables contributing to it. The principle is that each variable has variance decomposition proportions that are associated with each CI. By summing up these proportions for two or more condition indices and checking whether the sum exceeds a predefined threshold, it is possible to conclude that there is multicollinearity between the explanatory variables that correspond to these proportions. Strategies to deal with multicollinearity include increasing the sample size, combining multicollinear variables into a single one [105], or deleting strongly correlated variables (e.g. Pearson $R^2$ > 0.95 [91] or > 0.90 [87]) by keeping only one of the two. Another approach could be to substitute correlated features with their linear combination, such as in PCA [91].
However, deletion or combination strategies are not always applicable in systems biology contexts because biological phenomena are highly interconnected; therefore, eliminating some features could hide their associations and their involvement in certain pathways. When performing pathway analysis, the more a pathway is enriched, the lower its p-value. Since the enrichment strongly depends on the number of proteins/metabolites and on how they are connected, deleting some features could affect the whole analysis.
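The variance inflation factor and the condition index described above can be computed with a few lines of NumPy. The sketch follows the regression-based definition of the VIF given in the text; for the condition indices it uses the eigenvalues of the correlation matrix of the predictors, which is one common convention and an assumption here, since other scalings of the design matrix are also used. Function names are hypothetical.

```python
import numpy as np

def variance_inflation_factors(X):
    """VIF_i = 1 / (1 - R_i^2), regressing predictor i on all the others."""
    n, p = X.shape
    vifs = np.empty(p)
    for i in range(p):
        y = X[:, i]
        others = np.column_stack([np.ones(n), np.delete(X, i, axis=1)])
        coef, *_ = np.linalg.lstsq(others, y, rcond=None)
        resid = y - others @ coef
        r2 = 1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2)
        vifs[i] = 1.0 / (1.0 - r2)
    return vifs

def condition_indices(X):
    """sqrt(largest eigenvalue / each eigenvalue) of the correlation matrix."""
    corr = np.corrcoef(X, rowvar=False)
    eigvals = np.linalg.eigvalsh(corr)[::-1]     # descending order
    return np.sqrt(eigvals[0] / eigvals)
```

With two nearly identical predictors and one independent predictor, the first two receive large VIF values while the third stays close to 1, and the largest condition index grows accordingly.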
Dimensionality
Biological experiments usually produce matrices characterized by a small number of samples n and a high number of features p. For statistical methods, a common rule of thumb is that n/p should be equal to or greater than 5; otherwise, statistical power might be limited [106]. More generally, high dimensionality leads most models to overfit [107], obtaining good classification performance on the training set but poor generalization on the test set. This is the so-called curse of dimensionality: increasing the number of features improves the performance up to a certain limit, after which the model starts to perform worse. Indeed, according to a systematic review on data pre-processing in the medical domain [108], data reduction is the most frequent task (55% of the considered papers), followed by data cleaning (29%), transformation (9%), balancing (5%) and integration (2%).
Among the most implemented strategies to reduce dimensionality, we found several simple ones, such as: (i) keeping features detected in a defined number of samples [38, 80, 86]; (ii) retaining features on the basis of a fold change threshold [36]; (iii) deleting features having near-zero variance [91]; (iv) selecting only statistically significant features [27, 71, 109] or features significantly associated with clinical variables [83]; (v) taking genes that are quantified in multiple omics datasets [57]; (vi) performing WGCNA and discarding all the molecules that were not assigned to any group [39]. However, relying solely on statistical tests or correlations between variables to identify the most important ones is a naïve approach due to its univariate nature. This approach focuses on one variable at a time and fails to consider all the relationships between variables, which hinders the ability to identify multivariate patterns underlying biological phenomena.
Among the more complex methods to reduce the dataset, we found strategies that select features based on their coefficient in regression models such as Partial Least Squares regression (PLS) and their categorical derivatives [97]: Linear Discriminant Analysis (LDA), PLS-DA, and Orthogonal Projection on Latent Structures Discriminant Analysis (OPLS-DA). Regression analysis with elastic-net regularization can be performed separately for each omics platform to select relevant features [53]. Alternatively, variable selection can be based on the VIP score of PLS-DA models [63]. However, it must be recalled that these models are not always accurate and could overfit, since they can always find a projection that separates phenotypes, even with random data [110, 111]. Another example of a multivariate method for dimensionality reduction is minimum Redundancy Maximum Relevance (mRMR) [86]. The mRMR algorithm identifies features based on their relevance, i.e., how much a feature is correlated with the target variable; at the same time, it tries to reduce the redundancy, which is a measure of how much a feature is correlated with the other ones. By combining these two criteria, mRMR makes it possible to identify the most discriminative features of a dataset.
Approaches to cope with the challenge of matrix dimensionality also comprise ML or AI models for variable reduction, for instance through random forest [80] or autoencoders [83]. Other feature selection methods, defined as wrappers, use ML algorithms to evaluate the performance of the model trained and tested with random subsets of features, in an iterative procedure [112]. Recursive Feature Elimination (RFE) has been employed to select features by recursively removing columns of the data matrix and building a model on the remaining ones [86]. Recursive Feature Elimination Cross Validation (RFECV) was adopted to remove different subsets of features and evaluate the performance of a model using cross-validation [87].
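The trade-off between relevance and redundancy can be sketched as a greedy selection. This illustration uses the absolute Pearson correlation as a proxy for both criteria and a difference score (relevance minus mean redundancy); the original mRMR formulation is based on mutual information, so this is a simplified variant, not the algorithm used in [86]. It assumes a numeric target and no constant features; the function name is hypothetical.

```python
import numpy as np

def mrmr_select(X, y, n_select):
    """Greedy selection maximising relevance minus mean redundancy."""
    n = len(y)
    Xs = (X - X.mean(axis=0)) / X.std(axis=0)    # standardise features
    ys = (y - y.mean()) / y.std()
    relevance = np.abs(Xs.T @ ys) / n            # |corr(feature, target)|
    corr = np.abs(Xs.T @ Xs) / n                 # |corr(feature, feature)|
    selected = [int(np.argmax(relevance))]       # start from the most relevant
    while len(selected) < n_select:
        remaining = [j for j in range(X.shape[1]) if j not in selected]
        redundancy = corr[np.ix_(remaining, selected)].mean(axis=1)
        scores = relevance[remaining] - redundancy
        selected.append(remaining[int(np.argmax(scores))])
    return selected
```

When two features carry nearly the same information, only one of them tends to be retained, and the next pick favours a feature that adds complementary information about the target.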
Finally, ML/AI models can be used directly without feature reduction. Indeed, some models like Random Forest, Support Vector Machines and Classification and Regression Trees (CART) do not require feature reduction. However, given the high number of variables, results are often difficult to interpret [113].
Model interpretability
Multivariate methods and machine learning models are powerful tools. However, some of the most powerful multivariate algorithms are based on transformations of the input features into another dimensional space where features are projected. This could be an issue if the biochemical meaning of the model needs to be assessed, since information on the importance of individual features and their mutual relationships could be lost [97].
In machine learning, the simplest models, such as linear ones, are inherently interpretable: they have the advantage of being transparent and easy to interpret, but this often comes at the expense of reduced predictive accuracy. When dealing with more complex models, there are various strategies to achieve interpretability, ranging from feature importance analysis to more sophisticated techniques that incorporate explainability into the model architecture. In recent years, model explainability has emerged as a pivotal research subject, often referred to as explainable AI. The rationale is that understanding how the model predicts the outcome is a means to trust the prediction [114]. An example of an algorithm that can explain classifier predictions is Local Interpretable Model-Agnostic Explanations (LIME), which explains the prediction of a black-box model by learning a simpler, interpretable model that agrees with the black-box one locally, around that prediction. Another popular algorithm for explaining the output of machine learning models is SHAP (SHapley Additive exPlanations), which comes from game theory and has been increasingly employed in several fields [115].
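The game-theoretic idea behind SHAP can be illustrated by computing exact Shapley values for a model with a handful of features. The sketch enumerates all feature coalitions and, as a simplifying assumption, represents an absent feature by a fixed baseline value; the SHAP library relies on approximations and background datasets instead, so this is not its implementation. The function name and arguments are hypothetical.

```python
from itertools import combinations
from math import factorial

def shapley_values(predict, x, baseline):
    """Exact Shapley values of each feature for the prediction at x.

    predict: function mapping a list of feature values to a number.
    baseline: values used for features that are absent from a coalition.
    The cost grows exponentially with the number of features.
    """
    n = len(x)

    def value(coalition):
        z = [x[i] if i in coalition else baseline[i] for i in range(n)]
        return predict(z)

    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            # Weight of coalitions of this size in the Shapley average.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi
```

The resulting contributions sum to the difference between the prediction at `x` and the prediction at the baseline, which is the additivity property that makes these explanations easy to read.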
Computational power
The huge amount of data generated by high-throughput technologies requires an increasing computational power and storage capability of computer systems [7, 99, 116]. Processing such large-scale datasets comes with significant computational challenges, including the high cost of data processing, the need for efficient algorithms, and the requirement for robust infrastructure capable of handling complex computations. Some advanced analytical techniques, such as deep learning and graphical models, demand substantial memory and processing capabilities, which may limit their accessibility for researchers with limited resources. Fortunately, the advent of optimization algorithms, online machine learning, parallelization of workflows, and cloud computing has made large-scale analyses more feasible by improving efficiency and scalability [98]. However, the trade-offs between computational cost and analytical depth remain a key consideration in omics data integration.
Future directions in omics studies and integration
The future of multi-omics integration hinges on addressing the current limitations and exploiting emerging technological and analytical advancements. While existing methods have laid a strong foundation, the field must evolve to handle the increasing volume and complexity of multi-omics data. In recent years, single-cell omics and spatial omics have been increasingly recognized as promising techniques to revolutionize our understanding of biological systems. In contrast with bulk tissue sequencing, which simultaneously analyzes thousands of cells from a tissue [117], single-cell omics captures the heterogeneity of the tissue by resolving the unique role of individual cells, offering deeper insights into specific cell function and behavior [118]. Recognized by Nature as one of the top emerging technologies in 2022 [120], spatial omics adds another layer of complexity by mapping gene expression patterns within their spatial context, enabling the study of tissue architecture and cell-to-cell communication at resolutions down to the subcellular level [119]. However, the volume of data generated by these technologies still poses significant computational and analytical challenges. The need to efficiently store, manage, and analyze these datasets continues to outpace available computational resources, making scalability a key concern. Additionally, existing integrative tools lack the analytical capacity to perform crucial functions, requiring further methodological advancements [121]. Nevertheless, advancements in single-cell and spatial multi-omics will continue to drive innovation, offering a more comprehensive view of cellular biology.
Although ML and AI approaches have not been the most common methods for integrating omics data, advancements in high-throughput technologies are likely to make them increasingly crucial in data analysis. These models can overcome challenges associated with high dimensionality, noise, and data heterogeneity. Feature selection will be crucial for the optimal application of these techniques, helping to mitigate challenges posed by high dimensionality, redundancy, and noise in large-scale datasets. Traditional statistical approaches remain widely used [122], but they often struggle with the complexity of multi-omics datasets. Recent advancements in machine learning (ML) and deep learning (DL)-based feature selection methods offer a more scalable and adaptive solution [123]. Even more advanced feature selection algorithms have emerged that shift the single-objective viewpoint to a multi-objective perspective, leveraging quantum computing [124] and opening a whole new field of research with further potential.
Another promising advancement in multi-omics is multiscale integration, which provides a holistic understanding of biological systems by linking gene and protein expression data with imaging modalities and clinical metadata [125]. This allows for the identification of disease markers with higher specificity, leading to improved diagnostics, prognostic predictions, and therapeutic interventions [125]. However, this integration introduces new challenges, including batch effects, computational complexity, and standardization issues. Developing robust methods to harmonize and analyze such diverse datasets is crucial for future progress. Finally, the establishment of community-driven initiatives for data sharing and analysis will accelerate the translation of multi-omics findings into clinical applications, such as personalized medicine and drug discovery. Fostering collaboration and idea-sharing among researchers to collectively tackle these complexities is the only way to establish robust pipelines and agreement on data collection and analysis [126].
Conclusions
Omics integration has become a popular topic in systems biology, as it gives the potential to unravel pathophysiological mechanisms at multiple levels, joining together complementary information from different omics platform. This is particularly important in those diseases whose clinical phenotypes and genotypes are not enough to provide neither an understanding of the underlying mechanisms nor the diagnosis and prognosis. Moreover, multiple biological layers are intricately interconnected in human diseases. For example, disruptions in DNA repair processes contribute to various diseases [127], therefore, it is essential to consider interactions with repair molecules to fully understand disease mechanisms. In tumors, the literature is increasingly highlighting the interplay between genetic mechanisms and molecular pathways involved in immunity [128]. Integration of different omics datasets could also lead to huge progress in the context of personalized medicine, which aims at having both molecular and clinical profiles of patients to build individualized health care models with tailored treatment and management [129, 130].
Relationships between omics layers are not usually causal, and statistical associations cannot capture complex relationships such as post-translational modifications or non-linear reaction kinetics [95]. Moreover, correlations do not establish causality and can even arise by chance. Multivariate methods have the potential to discover hidden patterns and relationships. Indeed, they have been gaining attention for their ability to integrate different datasets explicitly and to provide insights into their contribution, similarity, and dissimilarity. On the other hand, interpretability plays a central role in systems biology: while correlation approaches can be too reductive, multivariate methods are more challenging to apply and interpret.
ML/AI models are typically designed to maximize classification performance, often at the expense of understanding the importance of individual features in discriminating phenotypes.
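The caveat that correlations can arise by chance can be illustrated with a minimal simulation. The sketch below is not taken from any of the reviewed studies: it assumes two purely random, unrelated data blocks (hypothetically labelled "transcripts" and "metabolites"), and the function name `max_chance_correlation` and all sample and feature counts are illustrative choices. It computes every pairwise Pearson correlation between the two blocks and reports the largest absolute value.

```python
import numpy as np

def max_chance_correlation(n_samples, n_features, seed=0):
    """Largest |Pearson r| between two independent random blocks."""
    rng = np.random.default_rng(seed)
    # Two unrelated blocks: by construction there is no true association.
    x = rng.standard_normal((n_samples, n_features))  # hypothetical "transcripts"
    y = rng.standard_normal((n_samples, n_features))  # hypothetical "metabolites"
    # Standardize each column (population standard deviation, ddof=0).
    xz = (x - x.mean(axis=0)) / x.std(axis=0)
    yz = (y - y.mean(axis=0)) / y.std(axis=0)
    # Pearson correlation for all feature pairs at once.
    corr = xz.T @ yz / n_samples
    return float(np.abs(corr).max())

# Same number of features, different numbers of samples.
small = max_chance_correlation(n_samples=10, n_features=200)
large = max_chance_correlation(n_samples=200, n_features=200)
print(f"max |r| with  10 samples: {small:.2f}")
print(f"max |r| with 200 samples: {large:.2f}")
```

Under these assumptions, the strongest spurious correlation is expected to be much larger when few samples are available, which is one reason why multiple-testing correction and adequate sample sizes matter when correlation-based integration is applied to thousands of feature pairs.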
When dealing with omics integration, an important issue is the need to work with appropriate datasets. It is therefore important to define an adequate study design and to make high-quality datasets publicly available, in order to foster research and collaboration towards a successful data integration process. Moreover, although several public databases are available, they are still limited to single omics [131,132,133].
In summary, although the integration of multiple datasets has yielded encouraging results in terms of understanding molecular mechanisms, several challenges remain. Each dataset carries its own difficulties because of the high-throughput nature of omics platforms: data quality, missing data, and collinearity. When different omics are integrated, the dimensionality of the problem increases and the data become even more heterogeneous. We explored various strategies to address these challenges, emphasizing that robust pre-processing and fine-tuned approaches are essential for unlocking the full potential of omics integration. As integration strategies improve, multi-omics integration will likely become even more relevant for biomedical research in the years to come.
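Collinearity, one of the per-dataset difficulties noted above, can be screened with the variance inflation factor (VIF). The following is a minimal sketch under stated assumptions, not the procedure of any specific study: the helper `variance_inflation_factors` and the three simulated features are hypothetical, and the VIF of feature j is computed as 1 / (1 − R²), where R² comes from an ordinary least-squares regression of feature j on all remaining features.

```python
import numpy as np

def variance_inflation_factors(X):
    """VIF of each column of X, regressing it on all other columns."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    vifs = np.empty(p)
    for j in range(p):
        y = X[:, j]
        # Design matrix: intercept plus every feature except column j.
        A = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ coef
        ss_res = resid @ resid
        ss_tot = ((y - y.mean()) ** 2).sum()
        # 1 / (1 - R^2) is algebraically equal to ss_tot / ss_res.
        vifs[j] = ss_tot / ss_res
    return vifs

rng = np.random.default_rng(42)
a = rng.standard_normal(100)             # hypothetical feature
b = a + 0.1 * rng.standard_normal(100)   # nearly a copy of `a` (collinear)
c = rng.standard_normal(100)             # unrelated feature
vifs = variance_inflation_factors(np.column_stack([a, b, c]))
print(np.round(vifs, 1))
```

In this toy setting the two near-duplicate features should receive a high VIF and the unrelated one a VIF close to 1. Values above roughly 5 to 10 are often quoted as a rule of thumb for problematic collinearity, although the appropriate cut-off depends on the application.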
Availability of data and materials
Data sharing is not applicable to this article as no datasets were generated or analysed during the current study.
Abbreviations
- AE: AutoEncoders
- AUC: Area under the curve
- CART: Classification and regression trees
- CI: Condition Index
- DEGs: Differentially expressed genes
- DEPs: Differentially expressed proteins
- DIABLO: Data Integration Analysis for Biomarker Discovery using Latent cOmponents
- KNN: K-nearest neighbour
- LDA: Linear discriminant analysis
- LIME: Local interpretable model-agnostic explanations
- MAR: Missing at random
- MCAR: Missing completely at random
- MEFISTO: Method for the functional integration of spatial and temporal omics data
- MFA: Multiple factor analysis
- (M)CIA: Multiple co-inertia analysis
- ML: Machine learning
- MNAR: Missing not at random
- MOFA: Multi-omics factor analysis
- mRMR: Minimal redundancy-maximal relevance
- PCA: Principal component analysis
- PLS(-DA): Partial least squares (discriminant analysis)
- RFE: Recursive feature elimination
- RFE(CV): Recursive feature elimination cross validation
- (r)CCA: (Regularized) canonical correlation analysis
- scSNF: (Spectral Clustering) similarity network fusion
- sGCCA: Sparse generalized canonical correlation analysis
- SVD: Singular value decomposition
- SVM: Support vector machines
- VDP: Variance decomposition proportion
- VIF: Variance inflation factor
- VIP: Variable importance in projection
- WGCNA: Weighted gene correlation network analysis
References
Breitling R. What is systems biology? Front Physiol. 2010. https://doi.org/10.3389/fphys.2010.00009/abstract.
Papakonstantinou E, Pierouli K, Eliopoulos E, Vlachakis D. Introductory Chapter: Systems Biology Consolidating State of the Art Genetics and Bioinformatics. In: Vlachakis D, editor. Systems Biology [Internet]. IntechOpen; 2019 [cited 2023 Jan 4]. Available from: https://www.intechopen.com/books/systems-biology/introductory-chapter-systems-biology-consolidating-state-of-the-art-genetics-and-bioinformatics.
Hillmer RA. Systems biology for biologists. PLoS Pathog. 2015;11(5):e1004786.
Li W, Shao C, Zhou H, Du H, Chen H, Wan H, et al. Multi-omics research strategies in ischemic stroke: a multidimensional perspective. Ageing Res Rev. 2022;81: 101730.
Bermingham KM, Brennan L, Segurado R, Barron RE, Gibney ER, Ryan MF, et al. Genetic and environmental contributions to variation in the stable urinary NMR metabolome over time: a classic twin study. J Proteome Res. 2021;20(8):3992–4000.
Gruzieva O, Jeong A, He S, Yu Z, de Bont J, Pinho MGM, et al. Air pollution, metabolites and respiratory health across the life-course. Eur Respir Rev. 2022;31(165): 220038.
Ritchie MD, Holzinger ER, Li R, Pendergrass SA, Kim D. Methods of integrating data to uncover genotype–phenotype interactions. Nat Rev Genet. 2015;16(2):85–97.
Picard M, Scott-Boyer MP, Bodein A, Périn O, Droit A. Integration strategies of multi-omics data for machine learning analysis. Comput Struct Biotechnol J. 2021;19:3735–46.
Gurke R, Bendes A, Bowes J, Koehm M, Twyman RM, Barton A, et al. Omics and multi-omics analysis for the early identification and improved outcome of patients with psoriatic arthritis. Biomedicines. 2022;10(10):2387.
Ryan CJ, Cimermančič P, Szpiech ZA, Sali A, Hernandez RD, Krogan NJ. High-resolution network biology: connecting sequence with function. Nat Rev Genet. 2013;14(12):865–79.
Page MJ, McKenzie JE, Bossuyt PM, Boutron I, Hoffmann TC, Mulrow CD, et al. The PRISMA 2020 statement: an updated guideline for reporting systematic reviews. BMJ. 2021;29(372): n71.
Rohart F, Gautier B, Singh A, Lê Cao KA. mixOmics: an R package for ‘omics feature selection and multiple data integration. PLoS Comput Biol. 2017;13(11): e1005752.
Uppal K, Ma C, Go YM, Jones DP. xMWAS: a data-driven integration and differential network analysis tool. Bioinformatics. 2018;34(4):701–2.
Langfelder P, Horvath S. WGCNA: an R package for weighted correlation network analysis. BMC Bioinformatics. 2008;9(1):559.
Wang B, Mezlini AM, Demir F, Fiume M, Tu Z, Brudno M, et al. Similarity network fusion for aggregating data types on a genomic scale. Nat Methods. 2014;11(3):333–7.
Culhane AC, Perrière G, Higgins DG. Cross-platform comparison and visualisation of gene expression data using co-inertia analysis. BMC Bioinformatics. 2003;4(1):59.
Meng C, Kuster B, Culhane AC, Gholami AM. A multivariate approach to the integration of multi-omics datasets. BMC Bioinformatics. 2014;15(1):162.
Argelaguet R, Arnol D, Bredikhin D, Deloro Y, Velten B, Marioni JC, et al. MOFA+: a statistical framework for comprehensive integration of multi-modal single-cell data. Genome Biol. 2020;21(1):111.
Lê S, Josse J, Husson F. FactoMineR: an R package for multivariate analysis. J Stat Softw. 2008;18(25):1–18.
Wilkerson MD, Hayes DN. ConsensusClusterPlus: a class discovery tool with confidence assessments and item tracking. Bioinformatics. 2010;26(12):1572–3.
Mo Q, Shen R, Guo C, Vannucci M, Chan KS, Hilsenbeck SG. A fully Bayesian latent variable model for integrative clustering analysis of multi-type omics data. Biostatistics. 2018;19(1):71–86.
Yiu ML, Mamoulis N. Frequent-Pattern based Iterative Projected Clustering. In: Proceedings of the Third IEEE International Conference on Data Mining. USA: IEEE Computer Society; 2003. p. 689. (ICDM ‘03).
Erickson N, Mueller J, Shirkov A, Zhang H, Larroy P, Li M, et al. AutoGluon-Tabular: Robust and Accurate AutoML for Structured Data [Internet]. arXiv; 2020 [cited 2025 Jan 12]. Available from: http://arxiv.org/abs/2003.06505.
Zheng W, Zhang Y, Sun C, Ge S, Tan Y, Shen H, et al. A multi-omics study of human testis and epididymis. Molecules. 2021;26(11).
Gao YN, Yang X, Wang JQ, Liu HM, Zheng N. Multi-omics reveal additive cytotoxicity effects of aflatoxin B1 and aflatoxin M1 toward intestinal NCM460 cells. Toxins (Basel). 2022;14(6).
Yang F, Zhao LY, Yang WQ, Chao S, Ling ZX, Sun BY, et al. Quantitative proteomics and multi-omics analysis identifies potential biomarkers and the underlying pathological molecular networks in Chinese patients with multiple sclerosis. BMC Neurol. 2024;24(1):423.
Dong W, Chen Y, Zhang Q, Zhao X, Liu P, He H, et al. Effects of lipoteichoic and arachidonic acids on the immune-regulatory mechanism of bovine mammary epithelial cells using multi-omics analysis. Front Vet Sci. 2022;9: 984607.
Elstner M, Olszewski K, Prokisch H, Klopstock T, Murgia M. Multi-omics approach to mitochondrial DNA damage in human muscle fibers. Int J Mol Sci. 2021;22(20).
Johansson M, Ulfenborg B, Andersson CX, Heydarkhan-Hagvall S, Jeppsson A, Sartipy P, et al. Multi-omics characterization of a human stem cell-based model of cardiac hypertrophy. Life (Basel). 2022;12(2).
Kechavarzi BD, Wu H, Doman TN. Bottom-up, integrated -omics analysis identifies broadly dosage-sensitive genes in breast cancer samples from TCGA. PLoS ONE. 2019;14(1): e0210910.
Cziesielski MJ, Liew YJ, Cui G, Schmidt-Roach S, Campana S, Marondedze C, et al. Multi-omics analysis of thermal stress response in a zooxanthellate cnidarian reveals the importance of associating with thermotolerant symbionts. Proc Biol Sci. 2018;285(1877).
Zhang H, Zhao C, Zhang Y, Lu L, Shi W, Zhou Q, et al. Multi-omics analysis revealed NMBA induced esophageal carcinoma tumorigenesis via regulating PPARα signaling pathway. Environ Pollut. 2023;1(324): 121369.
Xu Y, Zhang Y, Qin Y, Gu M, Chen R, Sun Y, et al. Multi-omics analysis of functional substances and expression verification in cashmere fineness. BMC Genomics. 2023;24(1):720.
Jiang B, Yang J, He R, Wang D, Huang Y, Zhao G, et al. Integrated multi-omics analysis for lung adenocarcinoma in Xuanwei, China. Aging. 2023;15(23):14263–91.
Leo IR, Aswad L, Stahl M, Kunold E, Post F, Erkers T, et al. Integrative multi-omics and drug response profiling of childhood acute lymphoblastic leukemia cell lines. Nat Commun. 2022;13(1):1691.
Ramiro L, García-Berrocoso T, Briansó F, Goicoechea L, Simats A, Llombart V, et al. Integrative Multi-omics analysis to characterize human brain ischemia. Mol Neurobiol. 2021;58(8):4107–21.
Wang Z, Xie Z, Zhang Z, Zhou W, Guo B, Li M. Multi-platform omics sequencing dissects the atlas of plasma-derived exosomes in rats with or without depression-like behavior after traumatic spinal cord injury. Prog Neuropsychopharmacol Biol Psychiatry. 2024;8(132): 110987.
Gong TQ, Jiang YZ, Shao C, Peng WT, Liu MW, Li DQ, et al. Proteome-centric cross-omics characterization and integrated network analyses of triple-negative breast cancer. Cell Rep. 2022;38(9): 110460.
Ding Z, Fu L, Tie W, Yan Y, Wu C, Dai J, et al. Highly dynamic, coordinated, and stage-specific profiles are revealed by a multi-omics integrative analysis during tuberous root development in cassava. J Exp Bot. 2020;71(22):7003–17.
Blondel VD, Guillaume JL, Lambiotte R, Lefebvre E. Fast unfolding of communities in large networks. J Stat Mech. 2008;2008(10):P10008.
Lee H, Gao Y, Ko E, Lee J, Lee HK, Lee S, et al. Nonmonotonic response of type 2 diabetes by low concentration organochlorine pesticide mixture: Findings from multi-omics in zebrafish. J Hazard Mater. 2021;15(416): 125956.
Lee H, Sung EJ, Seo S, Min EK, Lee JY, Shim I, et al. Integrated multi-omics analysis reveals the underlying molecular mechanism for developmental neurotoxicity of perfluorooctanesulfonic acid in zebrafish. Environ Int. 2021;157: 106802.
Na AY, Lee H, Min EK, Paudel S, Choi SY, Sim H, et al. Novel time-dependent multi-omics integration in sepsis-associated liver dysfunction. Genom Proteom Bioinform. 2023;21(6):1101–16.
González I, Déjean S, Martin PGP, Baccini A. CCA: an R package to extend canonical correlation analysis. J Stat Softw. 2008;17(23):1–14.
Liang S, Lu Z, Cai L, Zhu M, Zhou H, Zhang J. Multi-Omics analysis reveals molecular insights into the effects of acute ozone exposure on lung tissues of normal and obese male mice. Environ Int. 2024;1(183): 108436.
Simats A, Ramiro L, García-Berrocoso T, Briansó F, Gonzalo R, Martín L, et al. A mouse brain-based multi-omics integrative approach reveals potential blood biomarkers for ischemic stroke. Mol Cell Proteomics. 2020;19(12):1921–36.
Pearl J. Probabilistic Reasoning in intelligent systems: networks of plausible inference. Morgan Kaufmann; 1988.
Picard D, Felsberg J, Langini M, Stachura P, Qin N, Macas J, et al. Integrative multi-omics reveals two biologically distinct groups of pilocytic astrocytoma. Acta Neuropathol. 2023;146(4):551–64.
Li CX, Wheelock CE, Sköld CM, Wheelock ÅM. Integration of multi-omics datasets enables molecular classification of COPD. Eur Respir J. 2018;51(5):1701930.
Li S, Dragan I, Tran VDT, Fung CH, Kuznetsov D, Hansen MK, et al. Multi-omics subgroups associated with glycaemic deterioration in type 2 diabetes: an IMI-RHAPSODY Study. Front Endocrinol (Lausanne). 2024;15:1350796.
Ruan P, Todd JL, Zhao H, Liu Y, Vinisko R, Soellner JF, et al. Integrative multi-omics analysis reveals novel idiopathic pulmonary fibrosis endotypes associated with disease progression. Respir Res. 2023;24(1):141.
Scisciola L, Chianese U, Caponigro V, Basilicata MG, Salviati E, Altucci L, et al. Multi-omics analysis reveals attenuation of cellular stress by empagliflozin in high glucose-treated human cardiomyocytes. J Transl Med. 2023;21(1):662.
Clark C, Dayon L, Masoodi M, Bowman GL, Popp J. An integrative multi-omics approach reveals new central nervous system pathway alterations in Alzheimer’s disease. Alzheimers Res Ther. 2021;13(1):71.
Titz B, Szostak J, Sewer A, Phillips B, Nury C, Schneider T, et al. Multi-omics systems toxicology study of mouse lung assessing the effects of aerosols from two heat-not-burn tobacco products and cigarette smoke. Comput Struct Biotechnol J. 2020;18:1056–73.
Armenteros JJA, Brorsson C, Johansen CH, Banasik K, Mazzoni G, Moulder R, et al. Multi-omics analysis reveals drivers of loss of β-cell function after newly diagnosed autoimmune type 1 diabetes: an INNODIA multicenter study. Diabetes Metab Res Rev. 2024;40(5): e3833.
Aydin S, Pham DT, Zhang T, Keele GR, Skelly DA, Paulo JA, et al. Genetic dissection of the pluripotent proteome through multi-omics data integration. Cell Genomics. 2023;3(4). Available from: https://www.cell.com/cell-genomics/abstract/S2666-979X(23)00043-5.
Edelbroek B, Westholm JO, Bergquist J, Söderbom F. Multi-omics analysis of aggregative multicellularity. iScience. 2024;27(9): 110659.
Argelaguet R, Velten B, Arnol D, Dietrich S, Zenz T, Marioni JC, et al. Multi-Omics Factor Analysis—a framework for unsupervised integration of multi-omics data sets. Mol Syst Biol. 2018. https://doi.org/10.15252/msb.20178124.
Park JC, Barahona-Torres N, Jang SY, Mok KY, Kim HJ, Han SH, et al. Multi-omics-based autophagy-related untypical subtypes in patients with cerebral amyloid pathology. Adv Sci (Weinh). 2022;9(23): e2201212.
Velten B, Braunger JM, Argelaguet R, Arnol D, Wirbel J, Bredikhin D, et al. Identifying temporal and spatial patterns of variation from multimodal data using MEFISTO. Nat Methods. 2022;19(2):179–86.
Gisby JS, Buang NB, Papadaki A, Clarke CL, Malik TH, Medjeral-Thomas N, et al. Multi-omics identify falling LRRC15 as a COVID-19 severity marker and persistent pro-thrombotic signals in convalescence. Nat Commun. 2022;13(1):7775.
Benedetto A, Robotti E, Belay MH, Ghignone A, Fabbris A, Goggi E, et al. Multi-omics approaches for freshness estimation and detection of illicit conservation treatments in sea bass (Dicentrarchus Labrax): data fusion applications. Int J Mol Sci. 2024;25(3):1509.
Faugere J, Brunet TA, Clément Y, Espeyte A, Geffard O, Lemoine J, et al. Development of a multi-omics extraction method for ecotoxicology: investigation of the reproductive cycle of Gammarus fossarum. Talanta. 2022;28(253): 123806.
Singh A, Shannon CP, Gautier B, Rohart F. DIABLO: an integrative approach for identifying key molecular drivers from multi-omics assays.
Li S, Alfaro AC, Nguyen TV, Young T, Lulijwa R. An integrated omics approach to investigate summer mortality of New Zealand Greenshell™ mussels. Metabolomics. 2020;16(9):100.
Chappell K, Manna K, Washam CL, Graw S, Alkam D, Thompson MD, et al. Multi-omics data integration reveals correlated regulatory features of triple negative breast cancer. Mol Omics. 2021;17(5):677–91.
Poussin C, Titz B, Xiang Y, Baglia L, Berg R, Bornand D, et al. Blood and urine multi-omics analysis of the impact of e-vaping, smoking, and cessation: from exposome to molecular responses. Sci Rep. 2024;14(1):4286.
Rushing BR. Unlocking the molecular secrets of antifolate drug resistance: a multi-omics investigation of the NCI-60 cell line panel. Biomedicines. 2023;11(9):2532.
Ivanova L, Rangel-Huerta OD, Tartor H, Dahle MK, Uhlig S, Fæste CK. Metabolomics and multi-omics determination of potential plasma biomarkers in PRV-1-infected atlantic salmon. Metabolites. 2024;14(7):375.
Ribeiro DM, Palma M, Salvado J, Hernández-Castellano LE, Capote J, Castro N, et al. Goat mammary gland metabolism: an integrated Omics analysis to unravel seasonal weight loss tolerance. J Proteomics. 2023;30(289): 105009.
Chepy A, Vivier S, Bray F, Ternynck C, Meneboo JP, Figeac M, et al. Effects of immunoglobulins g from systemic sclerosis patients in normal dermal fibroblasts: a multi-omics study. Front Immunol. 2022;13: 904631.
Khalyfa A, Marin JM, Sanz-Rubio D, Lyu Z, Joshi T, Gozal D. Multi-omics analysis of circulating exosomes in adherent long-term treated OSA patients. Int J Mol Sci. 2023;24(22):16074.
Tenenhaus A, Philippe C, Guillemot V, Le Cao KA, Grill J, Frouin V. Variable selection for generalized canonical correlation analysis. Biostatistics. 2014;15(3):569–83.
Spick M, Campbell A, Baricevic-Jones I, von Gerichten J, Lewis HM, Frampas CF, et al. Multi-omics reveals mechanisms of partial modulation of COVID-19 dysregulation by glucocorticoid treatment. Int J Mol Sci. 2022;23(20).
Carapito R, Carapito C, Morlon A, Paul N, Vaca Jacome AS, Alsaleh G, et al. Multi-OMICS analyses unveil STAT1 as a potential modifier gene in mevalonate kinase deficiency. Ann Rheum Dis. 2018;77(11):1675–87.
Shashikadze B, Flenkenthaler F, Kemter E, Franzmeier S, Stöckl JB, Haid M, et al. Multi-omics analysis of diabetic pig lungs reveals molecular derangements underlying pulmonary complications of diabetes mellitus. Dis Models Mech. 2024;17(7):dmm050650.
Ichikawa A, Miki D, Hayes CN, Teraoka Y, Nakahara H, Tateno C, et al. Multi-omics analysis of a fatty liver model using human hepatocyte chimeric mice. Sci Rep. 2024;14(1):3362.
Zoppi J, Guillaume JF, Neunlist M, Chaffron S. MiBiOmics: an interactive web application for multi-omics data exploration and integration. BMC Bioinform. 2021;22(1):6.
Liu Z, Zhao Y, Kong P, Liu Y, Huang J, Xu E, et al. Integrated multi-omics profiling yields a clinically relevant molecular classification for esophageal squamous cell carcinoma. Cancer Cell. 2023;41(1):181-195.e9.
Chong W, Zhu X, Ren H, Ye C, Xu K, Wang Z, et al. Integrated multi-omics characterization of KRAS mutant colorectal cancer. Theranostics. 2022;12(11):5138–54.
Eteleeb AM, Novotny BC, Tarraga CS, Sohn C, Dhungel E, Brase L, et al. Brain high-throughput multi-omics data reveal molecular heterogeneity in Alzheimer’s disease. PLoS Biol. 2024;22(4): e3002607.
Anwar MY, Highland H, Buchanan VL, Graff M, Young K, Taylor KD, et al. Machine learning-based clustering identifies obesity subgroups with differential multi-omics profiles and metabolic patterns. Obesity. 2024;32(11):2024–34.
Gillenwater LA, Helmi S, Stene E, Pratte KA, Zhuang Y, Schuyler RP, et al. Multi-omics subtyping pipeline for chronic obstructive pulmonary disease. PLoS ONE. 2021;16(8): e0255337.
Wang K, Khoramjoo M, Srinivasan K, Gordon PMK, Mandal R, Jackson D, et al. Sequential multi-omics analysis identifies clinical phenotypes and predictive biomarkers for long COVID. CR Med [Internet]. 2023;4(11). Available from: https://www.cell.com/cell-reports-medicine/abstract/S2666-3791(23)00431-7.
Khadirnaikar S, Shukla S, Prasanna SRM. Integration of pan-cancer multi-omics data for novel mixed subgroup identification using machine learning methods. PLoS ONE. 2023;18(10):e0287176.
Horvatić A, Gelemanović A, Pirkić B, Smolec O, Beer Ljubić B, Rubić I, et al. Multi-omics approach to elucidate cerebrospinal fluid changes in dogs with intervertebral disc herniation. Int J Mol Sci. 2021;22(21):11678.
Huang Q, Zhang X, Hu Z. Application of artificial intelligence modeling technology based on multi-omics in noninvasive diagnosis of inflammatory bowel disease. J Inflamm Res. 2021;14:1933–43.
Li Y, Hou G, Zhou H, Wang Y, Tun HM, Zhu A, et al. Multi-platform omics analysis reveals molecular signature for COVID-19 pathogenesis, prognosis and drug target discovery. Signal Transduct Target Ther. 2021;6(1):155.
Zeng H, Chen L, Zhang M, Luo Y, Ma X. Integration of histopathological images and multi-dimensional omics analyses predicts molecular features and prognosis in high-grade serous ovarian cancer. Gynecol Oncol. 2021;163(1):171–80.
Fontanilles M, Heisbourg JD, Daban A, Fiore FD, Pépin LF, Marguet F, et al. Metabolic remodeling in glioblastoma: a longitudinal multi-omics study. Acta Neuropathol Commun. 2024;12(12):162.
Valsesia A, Chakrabarti A, Hager J, Langin D, Saris WHM, Astrup A, et al. Integrative phenotyping of glycemic responders upon clinical weight loss using multi-omics. Sci Rep. 2020;10(1):9236.
Wang Y, Huang X, Li F, Jia X, Jia N, Fu J, et al. Serum-integrated omics reveal the host response landscape for severe pediatric community-acquired pneumonia. Crit Care. 2023;27(1):79.
Han Y, Zeng X, Hua L, Quan X, Chen Y, Zhou M, et al. The fusion of multi-omics profile and multimodal EEG data contributes to the personalized diagnostic strategy for neurocognitive disorders. Microbiome. 2024;12(1):12.
Bai W, Li C, Li W, Wang H, Han X, Wang P, et al. Machine learning assists prediction of genes responsible for plant specialized metabolite biosynthesis by integrating multi-omics data. BMC Genomics. 2024;25(1):418.
Eicher T, Kinnebrew G, Patt A, Spencer K, Ying K, Ma Q, et al. Metabolomics and multi-omics integration: a survey of computational methods and resources. Metabolites. 2020;10(5):202.
Kumar N, Hoque MdA, Sugimoto M. Robust volcano plot: identification of differential metabolites in the presence of outliers. BMC Bioinform. 2018;19(1):128.
Sperisen P, Cominetti O, Martin FPJ. Longitudinal omics modeling and integration in clinical metabonomics research: challenges in childhood metabolic health research. Front Mol Biosci. 2015. https://doi.org/10.3389/fmolb.2015.00044/abstract.
Mirza B, Wang W, Wang J, Choi H, Chung NC, Ping P. Machine learning and integrative analysis of biomedical big data. Genes. 2019;10(2):87.
Krassowski M, Das V, Sahu SK, Misra BB. State of the field in multi-omics research: from computational needs to data mining and sharing. Front Genet. 2020;10(11): 610798.
Panda BS, Kumar Adhikari R. A Method for Classification of Missing Values using Data Mining Techniques. In: 2020 International Conference on Computer Science, Engineering and Applications (ICCSEA) [Internet]. Gunupur, India: IEEE; 2020 [cited 2023 Jan 13]. p. 1–5. Available from: https://ieeexplore.ieee.org/document/9132935/.
Taylor SL, Ruhaak LR, Kelly K, Weiss RH, Kim K. Effects of imputation on correlation: implications for analysis of mass spectrometry data from multiple biological matrices. Brief Bioinform. 2016;bbw010.
Hughes RA, Heron J, Sterne JAC, Tilling K. Accounting for missing data in statistical analyses: multiple imputation is not always the answer. Int J Epidemiol. 2019;48(4):1294–304.
Gardner ML, Freitas MA. Multiple imputation approaches applied to the missing value problem in bottom-up proteomics. Int J Mol Sci. 2021;22(17):9650.
Zitnik M, Nguyen F, Wang B, Leskovec J, Goldenberg A, Hoffman MM. Machine learning for integrating data in biology and medicine: principles, practice, and opportunities. Information Fusion. 2019;50:71–91.
Kim JH. Multicollinearity and misleading statistical results. Korean J Anesthesiol. 2019;72(6):558–69.
Johnstone IM, Titterington DM. Statistical challenges of high-dimensional data. Phil Trans R Soc A. 2009;367(1906):4237–53.
Defernez M, Kemsley EK. The use and misuse of chemometrics for treating classification problems. TrAC, Trends Anal Chem. 1997;16(4):216–21.
Idri A, Benhar H, Fernández-Alemán JL, Kadi I. A systematic map of medical data preprocessing in knowledge discovery. Comput Methods Programs Biomed. 2018;162:69–85.
Li M, Hameed I, Cao D, He D, Yang P. Integrated omics analyses identify key pathways involved in petiole rigidity formation in sacred lotus. Int J Mol Sci. 2020;21(14):5087.
Rodríguez-Pérez R, Fernández L, Marco S. Overoptimism in cross-validation when using partial least squares-discriminant analysis for omics data: a systematic study. Anal Bioanal Chem. 2018;410(23):5981–92.
Brereton RG, Lloyd GR. Partial least squares discriminant analysis: taking the magic away: PLS-DA: taking the magic away. J Chemometrics. 2014;28(4):213–25.
Lualdi M, Fasano M. Statistical analysis of proteomics data: a review on feature selection. J Proteomics. 2019;198:18–26.
Lê Cao KA, Boitard S, Besse P. Sparse PLS discriminant analysis: biologically relevant feature selection and graphical displays for multiclass problems. BMC Bioinformatics. 2011;12(1):253.
Ribeiro MT, Singh S, Guestrin C. “Why Should I Trust You?”: explaining the predictions of any classifier [Internet]. arXiv; 2016 [cited 2023 May 5]. Available from: http://arxiv.org/abs/1602.04938.
Cakiroglu C, Demir S, Hakan Ozdemir M, Latif Aylak B, Sariisik G, Abualigah L. Data-driven interpretable ensemble learning methods for the prediction of wind turbine power incorporating SHAP analysis. Expert Syst Appl. 2024;1(237): 121464.
Stein LD. The case for cloud computing in genome informatics. Genome Biol. 2010;11(5):207.
Paolillo C, Londin E, Fortina P. Single-cell genomics. Clin Chem. 2019;65(8):972–85.
Jehan Z. Chapter 1—single-cell omics: an overview. In: Barh D, Azevedo V, editors. Single-cell omics. Academic Press; 2019. p. 3–19.
Vandereyken K, Sifrim A, Thienpont B, Voet T. Methods and applications for single-cell and spatial multi-omics. Nat Rev Genet. 2023;24(8):494–515.
Eisenstein M. Seven technologies to watch in 2022. Nature. 2022;601(7894):658–61.
Ma A, McDermaid A, Xu J, Chang Y, Ma Q. Integrative methods and practical challenges for single-cell multi-omics. Trends Biotechnol. 2020;38(9):1007–22.
Borah K, Das HS, Seth S, Mallick K, Rahaman Z, Mallik S. A review on advancements in feature selection and feature extraction for high-dimensional NGS data analysis. Funct Integr Genomics. 2024;24(5):139.
Got A, Zouache D, Moussaoui A, Abualigah L, Alsayat A. Improved manta ray foraging optimizer-based SVM for feature selection problems: a medical case study. J Bionic Eng. 2024;21(1):409–25.
Zouache D, Got A, Alarabiat D, Abualigah L, Talbi EG. A novel multi-objective wrapper-based feature selection method using quantum-inspired and swarm intelligence techniques. Multimed Tools Appl. 2024;83(8):22811–35.
Phan JH, Quo CF, Cheng C, Wang MD. Multiscale integration of -omic, imaging, and clinical data in biomedical informatics. IEEE Rev Biomed Eng. 2012;5:74–87.
Saki N, Haybar H, Aghaei M. Subject: motivation can be suppressed, but scientific ability cannot and should not be ignored. J Transl Med. 2023;21(1):520.
Eftekhar Z, Aghaei M, Saki N. DNA damage repair in megakaryopoiesis: molecular and clinical aspects. Expert Rev Hematol. 2024;17(10):705–12.
Aghapour SA, Torabizadeh M, Bahreiny SS, Saki N, Jalali Far MA, Yousefi-Avarvand A, et al. Investigating the dynamic interplay between cellular immunity and tumor cells in the fight against cancer: an updated comprehensive review. Iran J Blood Cancer. 2024;16(2):84–101.
Goetz LH, Schork NJ. Personalized medicine: motivation, challenges, and progress. Fertil Steril. 2018;109(6):952–63.
Costello JC, Heiser LM, Georgii E, Gönen M, Menden MP, et al. A community effort to assess and improve drug sensitivity prediction algorithms. Nat Biotechnol. 2014;32(12):1202–12.
Vizcaíno JA, Deutsch EW, Wang R, Csordas A, Reisinger F, Ríos D, et al. ProteomeXchange provides globally coordinated proteomics data submission and dissemination. Nat Biotechnol. 2014;32(3):223–6.
Haug K, Cochrane K, Nainala VC, Williams M, Chang J, Jayaseelan KV, et al. MetaboLights: a resource evolving in response to the needs of its scientific community. Nucleic Acids Res. 2020;48(D1):D440–4.
Sud M, Fahy E, Cotter D, Azam K, Vadivelu I, Burant C, et al. Metabolomics Workbench: an international repository for metabolomics data and metadata, metabolite standards, protocols, tutorials and training, and analysis tools. Nucleic Acids Res. 2016;44(D1):D463-470.
Acknowledgements
Not applicable.
Funding
Not applicable.
Author information
Contributions
AM was in charge of paper retrieval, conceptualization, and writing of the original manuscript. LB and MF also contributed to conceptualization and supervised the manuscript drafting and organization. All the authors read, revised and approved the final manuscript.
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Morabito, A., De Simone, G., Pastorelli, R. et al. Algorithms and tools for data-driven omics integration to achieve multilayer biological insights: a narrative review. J Transl Med 23, 425 (2025). https://doi.org/10.1186/s12967-025-06446-x
DOI: https://doi.org/10.1186/s12967-025-06446-x