Algorithms and tools for data-driven omics integration to achieve multilayer biological insights: a narrative review

Abstract

Systems biology is a holistic approach to biological sciences that combines experimental and computational strategies, aimed at integrating information from different scales of biological processes to unravel pathophysiological mechanisms and behaviours. In this scenario, high-throughput technologies have been playing a major role in providing huge amounts of omics data, whose integration would offer unprecedented possibilities in gaining insights on diseases and identifying potential biomarkers. In the present review, we focus on strategies that have been applied in the literature to integrate genomics, transcriptomics, proteomics, and metabolomics in the year range 2018–2024. Integration approaches were divided into three main categories: statistical-based approaches, multivariate methods, and machine learning/artificial intelligence techniques. Among them, statistical approaches (mainly based on correlation) were the ones with a slightly higher prevalence, followed by multivariate approaches and machine learning techniques. Integrating multiple biological layers has shown great potential in uncovering molecular mechanisms, identifying putative biomarkers, and aiding classification, most of the time resulting in better performance when compared to single-omics analyses. However, significant challenges remain. The high-throughput nature of omics platforms introduces issues such as variable data quality, missing values, collinearity, and dimensionality. These challenges grow further when combining multiple omics datasets, as the complexity and heterogeneity of the data increase with integration. We report different strategies that have been found in the literature to cope with these challenges, but some open issues remain and should be addressed to unlock the full potential of omics integration.

Background

Systems biology is an interdisciplinary approach to science, which tackles the complexity of biological structures by a comprehensive study of different molecular layers of living systems [1]. Even though a single definition of systems biology has not yet been agreed upon, it is often characterized by the use of computational and mathematical modelling to analyse interactions between diverse components of biological systems [2]. Research in this field often focuses on networks of genes, proteins, or metabolites to investigate the ‘omics cascade’ [3]. This cascade represents the sequential flow of biological information, where genes encode the potential phenotypic traits of an organism, but the regulation of proteins and metabolites is further influenced by physiological or pathological stimuli [4], as well as environmental factors such as diet, lifestyle, pollutants and toxic agents [5, 6]. This multilayered regulation makes biological systems challenging to disentangle into their individual components (Fig. 1A). By examining variations at different levels of biological regulation [7], researchers can deepen their understanding of pathophysiological processes and the interplay between omics layers.

Fig. 1

A Schematic overview of the systems biology approach. Environmental factors influence the so-called omics cascade, constituted by genes, transcripts, proteins, and metabolites. Through appropriate experimental setups, omics data can be investigated and assessed by the integrative methods discussed in this review. B Graphic representation of a typical omics data matrix. Rows represent samples, columns represent features. Different submatrix colours represent the phenotype of a group of samples. Blank cells represent missing values

Omics integration offers unprecedented possibilities to unravel biological functions, interpret diseases, identify biomarkers, and uncover hidden associations among omics variables [8,9,10]. As a result, it has become a cornerstone of modern biological research, driven by the development of advanced tools and strategies. However, the term ‘omics integration’ encompasses a wide spectrum of methodological approaches. An important distinction is the level at which integration occurs. In some cases, each omics dataset is analyzed independently, with individual findings being combined for biological interpretation. Alternatively, all datasets may be analyzed simultaneously, typically by assessing the relationships between them or by combining the omics matrices together. Another consideration is whether the integration process is driven by existing knowledge, such as known molecular interactions or biological pathways, or if it is driven entirely by the data itself.

In this review, we address data-driven omics integration, defined as integration strategies that are not driven by prior biological insights. In contrast to previous reviews, which have explored potential tools available in the literature, we exclusively examined methods that have been utilized for integration purposes. This approach provides a comprehensive perspective on practical applications and current trends in the field of omics integration. We have categorized the integration strategies into three main groups: statistical-based methods, multivariate methods, and machine learning/artificial intelligence techniques. We also explored the major challenges that researchers encounter in omics integration and highlighted strategies that have been employed to address these issues. Omics data are assumed to be organized as matrices in which each row represents a sample (also referred to as individual or patient throughout the paper) and each column represents an omics feature, such as a transcript, protein, or metabolite. Samples are divided into different groups, which usually correspond to different phenotypes (e.g. disease versus control) (Fig. 1B).

Data-driven approaches to integrate proteomics data

A comprehensive search of the PubMed electronic database was conducted using the keywords detailed in the Supplementary Materials to identify studies on omics integration published between 2018 and 2024. These included studies utilized data-driven methods such as statistical methods, multivariate analyses, or machine learning/artificial intelligence models to analyze omics data without relying on prior knowledge of biological relationships. Approaches incorporating external knowledge, such as interactome or pathways databases (e.g., GO or KEGG), as well as hybrid strategies combining data-driven and knowledge-based methods, were excluded from this review. A detailed workflow of the decisional process for inclusion in this review is depicted in Fig. 2A. We followed most of the PRISMA 2020 checklist [11] to define the inclusion/exclusion criteria and structure the paper. The review includes 64 research papers, with their number and proportion relative to the total retrieved papers per year shown in Fig. 2B. Figure 2C illustrates the number of papers retrieved for each omics combination, highlighting the proportion of papers employing statistical, multivariate and ML/AI methods. Table 1 presents an overview of the main employed packages, accompanied by their respective references.

Fig. 2

A Descriptive workflow of the decisional process for inclusion in the present review. The workflow is divided into four main sequential steps: retrieval of all the papers, first screening, eligibility assessment and final decision on inclusion. B Histogram representing the total number of papers retrieved by the search string (teal bars) and the number of included papers (orange bars) per year. C Upset plot representing the number of papers retrieved for each omics combination. The proportion of employed methods is also depicted

Table 1 Overview of packages employed for integration accompanied by their respective references (in square brackets)

Statistical and correlation-based methods

Correlation is a statistical measure that quantifies the degree to which two variables are related to each other. A straightforward approach to assessing the relationship between two omics datasets involves visualizing their correlation and computing the correlation coefficient and its statistical significance. For instance, a simple scatterplot can facilitate the analysis of expression patterns, leading to the identification of consistent or divergent trends [24, 25]. In Zheng et al. [24], the scatterplot was divided into four regions associated with different colors: the red area indicated higher transcription efficiency rates, the green area represented lower transcription efficiency rates, and the gray regions highlighted protein–transcript pairs with consistent expression patterns. Similarly, in Gao et al. [25], the transcript-to-protein ratios were investigated in scatterplot quadrants representing discordant or concordant up- or down-regulation of genes. Pearson’s or Spearman’s correlation analysis, or their generalizations such as the RV coefficient (a multivariate generalization of the squared Pearson correlation coefficient), were employed to test correlations between whole sets of differentially expressed genes in different biological contexts [26,27,28,29,30,31,32,33,34,35,36]. Computing the correlation coefficient yields different biological insights, including the determination of the extent and nature of the interaction between sets of differentially expressed proteins/metabolites [26], the assessment of whether up-regulated proteins exhibit a significant correlation with abundantly increased metabolites and vice versa [26, 32], the identification of molecular regulatory pathways of correlated genes and proteins [27], or the assessment of transcript-protein correspondence [28, 30, 31, 34,35,36]. 
Pearson’s correlation analysis has also been demonstrated to be effective in identifying a time delay between the release of an mRNA molecule and the production and secretion of the corresponding protein, as outlined in [29]. In [37], Spearman’s correlation coefficient was computed to integrate three omics datasets (transcriptomics, proteomics, and metabolomics). Cutoff thresholds were defined on the correlation coefficient and p-value (0.9 and 0.05, respectively) for the pairwise correlations between differentially expressed proteins (DEPs) and differential metabolites, between differentially expressed genes (DEGs) and differentially expressed miRNAs, and between DEPs and DEGs. The objective of this approach was to identify the major relationships between the three platforms by visualizing the first 100 correlations. In another case [34], Pearson’s correlation analysis was complemented with Procrustes analysis, a form of statistical shape analysis. Procrustes analysis involves the alignment of datasets through scaling, rotation, and translation of the data in a common coordinate space to assess their geometric similarity and correspondence.
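To make the RV coefficient mentioned above more concrete, the following is a minimal sketch, assuming two omics blocks measured on the same samples (rows) and using NumPy. The function name `rv_coefficient` and the simulated `transcripts`, `proteins` and `noise` matrices are hypothetical and serve only as an illustration; they are not taken from any of the cited studies.

```python
import numpy as np

def rv_coefficient(X, Y):
    """RV coefficient between two blocks measured on the same samples (rows)."""
    Xc = X - X.mean(axis=0)          # column-centre each block
    Yc = Y - Y.mean(axis=0)
    Sx = Xc @ Xc.T                   # n x n sample cross-product matrices
    Sy = Yc @ Yc.T
    num = np.trace(Sx @ Sy)
    den = np.sqrt(np.trace(Sx @ Sx) * np.trace(Sy @ Sy))
    return num / den

# Simulated example (assumption): 20 samples, 50 transcripts, 30 proteins
rng = np.random.default_rng(0)
transcripts = rng.normal(size=(20, 50))
proteins = transcripts[:, :30] + 0.3 * rng.normal(size=(20, 30))  # related block
noise = rng.normal(size=(20, 30))                                  # unrelated block

print("RV(transcripts, proteins):", rv_coefficient(transcripts, proteins))
print("RV(transcripts, noise):   ", rv_coefficient(transcripts, noise))
```

Note that with few samples and many features the RV coefficient of unrelated blocks can be far from zero, so it is more informative when compared against a reference (for example, a permutation baseline) than when read in isolation.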

Correlation networks are a broad and widely employed application of correlation. Correlation networks extend correlation analysis by transforming these pairwise associations into graphical representations. In such networks, nodes represent individual biological entities (e.g., genes, proteins, or metabolites), and edges are constructed based on correlation thresholds, typically determined by metrics such as R² or p-value. This methodological framework facilitates the visualization and the analysis of complex relationships within and between datasets, thereby enabling the identification of highly interconnected components and their roles within biological systems. In Gong et al. [38], edges were retained according to specific thresholds on R² and p-values to construct a multi-omics co-expression network. This network was then integrated with a cancer-related network to enrich the analysis with known interactions in cancer-related pathways, facilitating a deeper understanding of the molecular interactions involved in cancer biology.
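The construction of such a network can be sketched as follows, under simplifying assumptions: only cross-omics Pearson correlations are considered, edges are kept on an R² cutoff alone (the p-value filter is omitted), and the cutoff value, the function names and the simulated protein/metabolite data are hypothetical.

```python
import numpy as np
from collections import Counter

def cross_correlation_edges(X, Y, x_names, y_names, r2_cutoff=0.8):
    """Return edges (x_name, y_name, r) whose squared Pearson r passes the cutoff."""
    Xz = (X - X.mean(axis=0)) / X.std(axis=0)
    Yz = (Y - Y.mean(axis=0)) / Y.std(axis=0)
    R = Xz.T @ Yz / X.shape[0]               # p x q matrix of Pearson correlations
    edges = []
    for i, j in zip(*np.where(R ** 2 >= r2_cutoff)):
        edges.append((x_names[i], y_names[j], float(R[i, j])))
    return edges

def node_degrees(edges):
    """Count how many retained edges touch each node."""
    degrees = Counter()
    for a, b, _ in edges:
        degrees[a] += 1
        degrees[b] += 1
    return degrees

# Simulated example (assumption): one metabolite tracks one protein closely
rng = np.random.default_rng(3)
n = 30
proteins = rng.normal(size=(n, 5))
metabolites = rng.normal(size=(n, 4))
metabolites[:, 0] = proteins[:, 0] + 0.05 * rng.normal(size=n)
p_names = [f"P{i}" for i in range(5)]
m_names = [f"M{j}" for j in range(4)]

edges = cross_correlation_edges(proteins, metabolites, p_names, m_names)
degrees = node_degrees(edges)
print(edges)
print(dict(degrees))
```

Node degree is used here as the simplest indicator of how interconnected a feature is; other centrality measures could be substituted.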

A further step in correlation networks is Weighted Gene Co-expression Network Analysis (WGCNA) [14]. This method is employed to identify clusters of co-expressed, highly correlated genes, which are referred to as modules. By constructing a scale-free network, WGCNA assigns weights to gene interactions, emphasizing strong correlations while reducing the impact of weaker or spurious connections. These modules can be summarized by their module eigengenes, which are frequently linked to clinically relevant traits, thereby facilitating the identification of functional relationships. In Ding et al. [39], WGCNA was conducted separately on the joint transcriptomics/proteomics and metabolomics datasets, and correlation was computed to uncover associations between gene/protein modules and metabolite modules.
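Two ingredients of this approach can be illustrated with a short sketch: the soft-thresholded (weighted) adjacency, which raises absolute correlations to a power, and the summary of a module by its first principal component. This is not the WGCNA package itself; the power `beta=6`, the function names and the simulated data are illustrative assumptions, and module detection by hierarchical clustering is omitted.

```python
import numpy as np

def soft_threshold_adjacency(X, beta=6):
    """Unsigned weighted adjacency: a_ij = |cor(x_i, x_j)| ** beta."""
    R = np.corrcoef(X, rowvar=False)     # feature-by-feature correlations
    return np.abs(R) ** beta             # weak correlations shrink towards zero

def module_eigengene(X_module):
    """First principal component scores of the standardised module features."""
    Z = (X_module - X_module.mean(axis=0)) / X_module.std(axis=0)
    U, S, Vt = np.linalg.svd(Z, full_matrices=False)
    return U[:, 0] * S[0]

# Simulated example (assumption): 5 co-expressed features plus 5 noise features
rng = np.random.default_rng(2)
n = 40
z = rng.normal(size=n)                               # shared latent signal
module = z[:, None] + 0.3 * rng.normal(size=(n, 5))
background = rng.normal(size=(n, 5))
X = np.hstack([module, background])

A = soft_threshold_adjacency(X)
eigengene = module_eigengene(X[:, :5])
print("within-module weight:", A[0, 1])
print("module-to-noise weight:", A[0, 7])
```

The eigengene can then be correlated with clinical traits or with eigengenes of modules from another omics layer, as done in the studies cited above.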

Another approach that we have found in the literature is xMWAS [13]. xMWAS is an online tool developed in R that performs correlation and multivariate analyses. It performs a pairwise association analysis on omics data organized in matrices. The correlation coefficients are determined by combining Partial Least Squares (PLS) components and regression coefficients. Subsequently, the obtained coefficients are employed to generate a multi-data integrative network graph. Correlation networks are created by connecting the nodes whose edges meet the requirements in terms of association score and statistical significance. Clusters of highly interconnected nodes, known as communities, can be identified by means of the multilevel community detection method [40], which consists of two iteratively repeated phases. In the first phase of the algorithm, every single network node i is considered, together with its neighbours j. A measure of how well the network is divided into communities, called modularity, is employed to assess the extent to which nodes within a community exhibit higher levels of connectivity with each other compared to those outside it. The gain in modularity is computed by removing node i from its community and assigning it to the community of node j. If the gain in modularity is positive, node i is moved to the community that yields the maximum gain. Conversely, if there is no gain in modularity, it remains in its original community. When a local maximum of modularity is reached, the second phase of the algorithm begins. In this phase, a new network is constructed, with nodes representing the communities identified during the first phase. Then, the same algorithm employed in the first phase is applied to the resulting network, and this whole process is iterated until the maximum modularity is reached. 
The xMWAS method was able to uncover omics interconnections by identifying the biological pathways associated with highly correlated communities in [41, 42]. In a recent study by Na and colleagues [43], the integration of multiple omics with xMWAS was successful in identifying a clear pathophysiological pathway that had not been identified in the single-omics analysis.
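The first phase of the multilevel community detection method described above can be sketched as follows. This is a deliberately naive illustration, not the implementation used by xMWAS: modularity is recomputed from scratch for every trial move, the second (aggregation) phase is omitted, and the toy graph of two triangles joined by one edge is an assumption chosen for clarity.

```python
import numpy as np

def modularity(A, labels):
    """Newman modularity Q of a partition of an undirected weighted graph."""
    m2 = A.sum()                              # equals 2m for a symmetric adjacency
    k = A.sum(axis=1)                         # weighted node degrees
    same = labels[:, None] == labels[None, :]
    return float(((A - np.outer(k, k) / m2) * same).sum() / m2)

def local_moving_phase(A, labels):
    """Phase 1: move each node to the neighbouring community with the largest gain."""
    labels = labels.copy()
    improved = True
    while improved:
        improved = False
        for i in range(A.shape[0]):
            best_q, best_c = modularity(A, labels), labels[i]
            for j in np.flatnonzero(A[i]):    # neighbours of node i
                trial = labels.copy()
                trial[i] = labels[j]
                q = modularity(A, trial)
                if q > best_q + 1e-12:        # accept strictly positive gains only
                    best_q, best_c = q, labels[j]
            if best_c != labels[i]:
                labels[i] = best_c
                improved = True
    return labels

# Toy graph (assumption): two triangles connected by a single edge
edge_list = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
A = np.zeros((6, 6))
for i, j in edge_list:
    A[i, j] = A[j, i] = 1.0

communities = local_moving_phase(A, np.arange(6))   # start with singletons
print(communities, modularity(A, communities))
```

In the full method, each community found here would become a single node of a new network and the same phase would be repeated on it.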

Canonical Correlation Analysis (CCA) and its variant for high-dimensional or multicollinear data, known as Regularised Canonical Correlation Analysis (rCCA) [44], are two integrative and dimensionality reduction methods that highlight correlations between two omics datasets. CCA and rCCA are included in the Bioconductor R package mixOmics [12]. The CCA strategy involves the calculation of canonical variates, defined as linear combinations of variables from each dataset. Each pair of canonical variates is associated with a canonical correlation value, which represents the correlation between the two new components. rCCA is the regularised counterpart of CCA, and it must be employed when the total number of variables from both datasets is much larger than the number of samples. rCCA adds an l2 penalty, also known as the Ridge penalty, to the diagonals of the covariance matrices of the omics datasets, thereby rendering them invertible. This approach effectively overcomes the collinearity issues inherent to standard CCA. rCCA can be used to create relevance networks, in which only pairs of variables belonging to different datasets are connected. These networks reveal relationships between omics variables and can be enriched with biological insights. In the literature, rCCA has been employed to identify the most significant correlations through relevance networks [36, 45] and to identify nodes with high connectivity, as they might indicate a key role in the disease [46].
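The effect of the ridge penalty can be illustrated with a minimal NumPy sketch, assuming two column-centred blocks with more variables than samples. This is not the mixOmics implementation; the function name `rcca_correlations`, the penalty values and the simulated data are hypothetical.

```python
import numpy as np

def rcca_correlations(X, Y, lam_x=0.1, lam_y=0.1):
    """Regularised canonical correlations between two blocks (rows = samples)."""
    n = X.shape[0]
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    # Ridge penalty added to the diagonal makes both covariance matrices invertible
    Cxx = Xc.T @ Xc / (n - 1) + lam_x * np.eye(X.shape[1])
    Cyy = Yc.T @ Yc / (n - 1) + lam_y * np.eye(Y.shape[1])
    Cxy = Xc.T @ Yc / (n - 1)
    Lx = np.linalg.cholesky(Cxx)
    Ly = np.linalg.cholesky(Cyy)
    # K = Lx^{-1} Cxy Ly^{-T}; its singular values are the canonical correlations
    K = np.linalg.solve(Lx, Cxy)
    K = np.linalg.solve(Ly, K.T).T
    return np.linalg.svd(K, compute_uv=False)

# Simulated example (assumption): 15 samples, 40 and 30 variables
rng = np.random.default_rng(5)
X = rng.normal(size=(15, 40))
Y = rng.normal(size=(15, 30))

rank_x = np.linalg.matrix_rank(np.cov(X, rowvar=False))
rho = rcca_correlations(X, Y)
print("rank of unregularised covariance of X:", rank_x, "of", X.shape[1])
print("first regularised canonical correlations:", rho[:3])
```

Without the penalty, the covariance matrix of X is rank-deficient here and the Cholesky step would fail, which is the situation rCCA is designed to handle. In practice the penalties are tuned, for example by cross-validation.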

Similarity Network Fusion (SNF) [15] is another graph-based approach that was developed in both the R and MATLAB environments. Unlike xMWAS and (r)CCA, SNF builds networks where nodes are samples (e.g., patients) instead of omics features. For each omics dataset, a pairwise distance matrix is calculated by using statistical correlation or other distance measures, such as Euclidean distance. Patient similarity networks are built for each omics matrix and then combined by adopting a nonlinear combination method based on graph overlapping. The algorithm for network fusion derives from message-passing theory [47], and it iteratively updates each network such that it becomes more similar to the others at every iteration. In this way, low-weight edges are eliminated if present in a single omics matrix but are maintained if present in all networks, while high-weight edges present in one or more networks are added to the others. In addition, SNF is able to detect clusters of samples, as outlined in [48], and to predict labels for new samples on the basis of the constructed network. In a recent study, SNF demonstrated that a combination of multiple omics data can achieve a higher classification performance than single or fewer omics [49]. In another study, SNF also performed better than single omics datasets in identifying two clusters of patients based on their omics plasma profile [50]. Ruan et al. [51] proposed a variation of SNF that they call spectral clustering SNF (scSNF): with the aim of identifying molecular subtypes of idiopathic pulmonary fibrosis, SNF was first applied to the proteomics, miRNA, and RNA expression datasets. Then, spectral clustering was applied to the fused network, leveraging the eigenvectors of the graph Laplacian to project the subjects into a lower-dimensional space, thereby facilitating the grouping of subjects.
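The iterative cross-network update at the core of SNF can be sketched as follows. This is a simplified illustration rather than the reference R/MATLAB implementation: the Gaussian affinity with a median bandwidth, the plain row normalisation, the parameter defaults and the simulated two-group data are all assumptions made for brevity.

```python
import numpy as np

def affinity(X):
    """Gaussian sample-by-sample similarity from squared Euclidean distances."""
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    sigma = np.median(sq[sq > 0])             # bandwidth heuristic (assumption)
    return np.exp(-sq / sigma)

def row_normalise(W):
    return W / W.sum(axis=1, keepdims=True)

def knn_kernel(W, k):
    """Keep each sample's k strongest neighbours (plus itself), then row-normalise."""
    S = np.zeros_like(W)
    idx = np.argsort(-W, axis=1)[:, :k + 1]   # self-similarity is always the largest
    rows = np.arange(W.shape[0])[:, None]
    S[rows, idx] = W[rows, idx]
    return row_normalise(S)

def fuse(views, k=5, n_iter=10):
    """Iteratively diffuse each view's network through the other views' networks."""
    W = [affinity(X) for X in views]
    P = [row_normalise(w) for w in W]         # full similarity of each view
    S = [knn_kernel(w, k) for w in W]         # local (sparse) similarity of each view
    for _ in range(n_iter):
        P_new = []
        for v in range(len(views)):
            others = [P[u] for u in range(len(views)) if u != v]
            avg = sum(others) / len(others)
            Pv = S[v] @ avg @ S[v].T          # cross-network update
            P_new.append(row_normalise((Pv + Pv.T) / 2))
        P = P_new
    fused = sum(P) / len(P)
    return (fused + fused.T) / 2

# Simulated example (assumption): 2 groups of 10 samples seen in two omics views
rng = np.random.default_rng(4)
groups = np.repeat([0, 1], 10)
view1 = 3 * groups[:, None] + rng.normal(size=(20, 20))
view2 = 3 * groups[:, None] + rng.normal(size=(20, 15))

fused = fuse([view1, view2])
print("mean within-group similarity:", fused[:10, :10].mean())
print("mean between-group similarity:", fused[:10, 10:].mean())
```

The fused matrix can then be passed to a clustering step such as spectral clustering, as in the scSNF variant described above.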

Correlation methods encompass a wide array of strategies. The majority of the studies employ simple Pearson’s correlation coefficients to disclose gene-transcript relationships with regard to transcriptional efficiency, post-translational modifications, and transcription delays. However, more sophisticated methods have emerged as reliable tools to elucidate the molecular mechanisms and patterns that characterize diverse phenotypes. An overview of the aforementioned papers and the integration strategies that have been adopted in real data studies is provided in Table 2.

Table 2 List of papers employing statistical methods as integration strategy

Multivariate methods

Multivariate methods represent the most extensive and most variegated category of multi-omics integration strategies. These approaches frequently rely on algebraic decompositions of datasets, leveraging latent variables to extract the most relevant underlying information. Latent variables are algebraic coordinates inferred from data, that represent shared patterns between datasets and reduce their dimensionality. For this reason, they enable the identification of significant relationships and shared patterns and therefore simplify the integration problem.

A number of multivariate methods exist for integration; some of these are adaptations or extensions of widely used dimensionality reduction techniques, such as Principal Component Analysis (PCA). PCA is a technique that simplifies complex matrices by transforming them into a new coordinate system defined by principal components, which are the directions of maximum variance in the data. These principal components (PCs), which are linear combinations of the original features, serve as uncorrelated variables. They can be utilized for further analysis, enabling deeper insights and reducing redundancy. A simple, popular extension of PCA for the multi-block scenario is SUM-PCA. A multi-block dataset can come either from a multi-platform analysis of the same samples or from the combination of chemical measurements with non-analytical data generated from sensory or consumer sciences. In both cases, the data are not simply multivariate but multi-modal, i.e., multivariate and multi-source.

SUM-PCA is an approach that applies PCA to a fused data block obtained by concatenating the omics matrices row-by-row. In this method, all data blocks share the same set of super scores (\(T_{sup}\)), while retaining unique block-specific loadings (\(P_{b}\)) and residuals (\(E_{b}\)). The super scores \(T_{sup}\) serve as a comprehensive summary that captures the shared characteristics across all blocks and represents a consensus score. The relationship between the consensus scores and the combined block score matrix \(T\) is described by the block weight matrix \(W\), which quantifies the contribution of each block to the consensus for each principal component, as expressed by the equation \(T_{sup} = TW\). SUM-PCA has been employed to obtain a first overview of the similarity in behavior among cell cultures [52].
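A minimal sketch of this decomposition is given below, assuming that all blocks share the same samples as rows so that concatenation aligns each sample's features side by side. The block scaling by the square root of the number of features is one common choice and is an assumption here, as are the function name and the simulated blocks; the block weight matrix W is not computed.

```python
import numpy as np

def sum_pca(blocks, n_components=2):
    """PCA on concatenated, block-scaled matrices sharing the same samples (rows)."""
    scaled = []
    for X in blocks:
        Z = (X - X.mean(axis=0)) / X.std(axis=0)
        scaled.append(Z / np.sqrt(Z.shape[1]))      # equal total variance per block
    X_cat = np.hstack(scaled)
    U, S, Vt = np.linalg.svd(X_cat, full_matrices=False)
    T_sup = U[:, :n_components] * S[:n_components]  # super scores shared by all blocks
    P_all = Vt[:n_components].T
    splits = np.cumsum([Z.shape[1] for Z in scaled])[:-1]
    P_blocks = np.split(P_all, splits, axis=0)      # block-specific loadings P_b
    E_blocks = [Z - T_sup @ P.T for Z, P in zip(scaled, P_blocks)]  # residuals E_b
    return T_sup, P_blocks, E_blocks

# Simulated example (assumption): 12 samples, two blocks with 5 and 4 features
rng = np.random.default_rng(6)
blocks = [rng.normal(size=(12, 5)), rng.normal(size=(12, 4))]

T_sup, P_blocks, E_blocks = sum_pca(blocks, n_components=2)
T_full, P_full, E_full = sum_pca(blocks, n_components=9)
print(T_sup.shape, [P.shape for P in P_blocks])
```

Each scaled block is thus modelled as the product of the shared super scores and its own loadings plus a residual, and the residuals vanish when all components are retained.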

Multi-Omics Factor Analysis (MOFA) is another generalization of PCA that has been proposed for omics integration [53,54,55,56,57,58]. MOFA is a data-driven approach that utilizes a set of hidden factors to identify the underlying causes of variability in multi-omics datasets. By leveraging these factors, MOFA enables the identification of the principal sources of variation in multi-omics datasets, and the determination of axes of heterogeneity that are either shared or unique across the different omics datasets. The algebraic principle of MOFA involves the decomposition of each original omics data matrix into the product of two matrices: the latent factor matrix, common to all data matrices, and a weight matrix specific to each data platform. An added residual noise term is also considered. Once the model is trained, the R package MOFAtools, which is included in the MOFA package, can be employed as a semi-automated pipeline to identify the latent factors. The variation explained by each factor is computed, and then the main contributors to sample heterogeneity can be visualized in a low-dimensional space. Finally, the features with the highest weights can be inspected. The package also supports the imputation of missing data by calculating the missing values directly from the model equation. In a recent study, transcriptomics, proteomics, metabolomics, and lipidomics data from blood samples collected from patients affected by Alzheimer’s disease were used to identify analytes that discriminated different groups by setting a threshold on their normalized absolute loading value from MOFA [53]. MOFA has also been employed to examine the extent to which mRNA and protein regulation correlate in aggregative multicellular organisms [57]. As Armenteros et al. [55] have previously proposed, the latent factors identified by MOFA can also be associated with clinical variables and covariates. 
The authors found a correlation between the secretion of C-peptide and clinical benefits in type 1 diabetes. In Aydin et al. [56], MOFA factors allowed the identification of novel target and mediator genes of known quantitative trait loci hotspots, as well as additional loci that were found to drive variation in the three integrated omics datasets. More recently, a new version of MOFA, called MOFA+ [18], has been developed to extend MOFA’s application to single-cell analysis. MOFA+ improves the scalability of MOFA and is able to manage side information regarding the structure between cells. Its capacity to analyze datasets comprising data from millions of cells makes it particularly well-suited for single-cell analysis. Park et al. [59] employed MOFA+ to perform unsupervised classification of genomics, transcriptomics, proteomics and blood biomarkers.
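The decomposition described above can be restated compactly. The symbols below are notational assumptions introduced for illustration and do not appear in the text: for \(M\) omics matrices measured on the same \(N\) samples, with \(D_m\) features in the \(m\)-th matrix and \(K\) latent factors,

\[ Y^{(m)} = Z\,{W^{(m)}}^{T} + \varepsilon^{(m)}, \qquad m = 1,\dots,M \]

where \(Y^{(m)} \in \mathbb{R}^{N \times D_m}\) is the \(m\)-th omics data matrix, \(Z \in \mathbb{R}^{N \times K}\) is the latent factor matrix common to all data matrices, \(W^{(m)} \in \mathbb{R}^{D_m \times K}\) is the platform-specific weight matrix, and \(\varepsilon^{(m)}\) is the residual noise term. Under this notation, imputing a missing entry from the model equation amounts to evaluating \(\hat{y}^{(m)}_{nd} = \sum_{k=1}^{K} z_{nk}\, w^{(m)}_{dk}\) for sample \(n\) and feature \(d\).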

MEFISTO (Method for the Functional Integration of Spatial and Temporal Omics data) is an extension of MOFA [60] that was developed to address the temporal dimension. Indeed, it has been employed to analyse temporal relationships in proteomics and transcriptomics data and to identify pro-thrombotic signal factors that changed over time from baseline conditions during the convalescence of COVID-19 patients [61]. Finally, among the PCA extensions for omics integration, we can list Multiple Factor Analysis (MFA), which is implemented in the R package FactoMineR [19]. The strength of MFA is the possibility to analyse data by taking into account a partition of the variables into groups (j = 1,…,J groups of variables). In MFA, PCA is performed on weighted variables: the same weight is assigned to every variable belonging to the same group j (j = 1,…,J), and this weight is set equal to the inverse of the first eigenvalue of the PCA on group j. This weighting balances the global analysis because the maximum axial inertia (i.e., total variance of the group of variables projected onto a principal component) for each group is then 1. Factorial axes have potential applications in pattern recognition [62]. These axes can facilitate the comprehension of the contribution of each omics dataset to the distance between samples, and the identification of omics matrices that provide similar or discordant information [63].
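The MFA weighting step can be sketched in a few lines, assuming standardised blocks that share the same samples. Dividing a block by its first singular value is equivalent to weighting its variables by the inverse of the first eigenvalue, so that no block can dominate the first global axis merely because it has more features. This is not the FactoMineR implementation; the function name and the simulated blocks are hypothetical.

```python
import numpy as np

def mfa_scores(blocks, n_components=2):
    """Global PCA on blocks rescaled so that each block's first singular value is 1."""
    weighted = []
    for X in blocks:
        Z = (X - X.mean(axis=0)) / X.std(axis=0)
        s1 = np.linalg.svd(Z, compute_uv=False)[0]   # first singular value of the block
        weighted.append(Z / s1)                      # maximum axial inertia becomes 1
    X_global = np.hstack(weighted)
    U, S, Vt = np.linalg.svd(X_global, full_matrices=False)
    return U[:, :n_components] * S[:n_components], weighted

# Simulated example (assumption): a small block and a much larger block
rng = np.random.default_rng(7)
blocks = [rng.normal(size=(15, 6)), rng.normal(size=(15, 20))]

scores, weighted = mfa_scores(blocks)
first_global_eigenvalue = np.linalg.norm(scores[:, 0]) ** 2
print("first global eigenvalue:", first_global_eigenvalue)
```

With this scaling, the first global eigenvalue lies between 1 and the number of blocks; values close to the number of blocks suggest that the blocks share their main direction of variation.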

Another adaptation of dimensionality reduction techniques for integration is the Data Integration Analysis for Biomarker discovery using Latent cOmponents (DIABLO) [64]. DIABLO is a tool included in the mixOmics package that has been successfully used for the data-driven, holistic, and hypothesis-free identification of robust biomarkers and disease mechanisms, and for sample prediction [54, 65,66,67,68,69,70,71,72]. The algorithm consists of a supervised extension of the sparse Generalized Canonical Correlation Analysis (sGCCA) method [73], which is a multivariate dimension reduction method based on the singular value decomposition (SVD). SVD is a matrix factorization technique that decomposes a matrix \(M\) as \(M=U\Sigma {V}^{T}\), i.e. as the product of a left singular matrix \(U\), a diagonal matrix of singular values \(\Sigma\), and the transpose of a right singular matrix \({V}\). In sGCCA, dimension reduction is achieved by maximizing the covariance between linear combinations of the variables, referred to as latent component scores, and projecting the data into a lower-dimensional subspace spanned by these components. To select the associated variables across omics levels, sGCCA internally applies an l1 penalization, also known as the least absolute shrinkage and selection operator (LASSO), on the variable coefficient vectors, analogously to the l2 regularization used in rCCA. These coefficients are the factors associated with the different phenotypes across the different omics data [54]. To extend sGCCA to a classification framework, DIABLO takes into consideration a dummy indicator matrix that indicates the class membership of each sample. Moreover, it replaces the l1 penalty parameter by the number of variables to select in each dataset and each component, as there is a direct correspondence between both parameters. 
DIABLO has been applied in several research works and for different purposes. Among these studies we can list those aimed at the identification of multi-omics features to discriminate different phenotypes [66, 68, 69, 71], the determination of the correlation between different omics datasets [71, 72], the assessment of which dataset yields the most discriminative power [70], the determination of the contribution of features to the latent variables [65], and the prediction of categories of interest [67].
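The two building blocks named above, the SVD and the l1 (LASSO) penalty, can be illustrated on a toy two-block example. The sketch below is not the sGCCA or DIABLO algorithm: it takes the SVD of the cross-product of two centred blocks to obtain the pair of loading vectors with maximal covariance, and then applies soft-thresholding once to show how the penalty zeroes small coefficients. The penalty value and the simulated data are assumptions.

```python
import numpy as np

def soft_threshold(v, lam):
    """l1 (LASSO) shrinkage: move entries towards zero and zero out small ones."""
    return np.sign(v) * np.maximum(np.abs(v) - lam, 0.0)

# Simulated example (assumption): the first two features of each block share a signal
rng = np.random.default_rng(1)
n = 25
latent = rng.normal(size=n)
X = rng.normal(size=(n, 8))
Y = rng.normal(size=(n, 6))
X[:, :2] += 2 * latent[:, None]
Y[:, :2] += 2 * latent[:, None]
Xc, Yc = X - X.mean(axis=0), Y - Y.mean(axis=0)

M = Xc.T @ Yc                               # cross-product matrix between blocks
U, s, Vt = np.linalg.svd(M, full_matrices=False)
u, v = U[:, 0], Vt[0]                       # first pair of loading vectors
t, w = Xc @ u, Yc @ v                       # latent component scores

u_sparse = soft_threshold(u, 0.3)           # hypothetical penalty value
print("leading singular value:", s[0], "score cross-product:", t @ w)
print("dense loadings: ", np.round(u, 2))
print("sparse loadings:", np.round(u_sparse, 2))
```

The cross-product of the two score vectors equals the leading singular value, which is the sense in which the first singular vectors maximise the covariance between the latent components.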

Another multivariate approach for omics integration is the Projection to Latent Structures or Partial Least Squares (PLS)-based method, which is also available in the mixOmics package. In general, PLS is a multivariate projection-based method that explores and explains the relationship between two or more continuous variables. It achieves this by projecting the original data onto a set of latent variables or components that maximize the covariance between the predictors and the response variables. While PLS is specifically designed for regression purposes, its variant PLS-Discriminant Analysis (PLS-DA) performs sample classification: instead of containing continuous variables, the response vector contains categorical ones. Moreover, the method focuses on maximizing the separation between predefined classes while simultaneously capturing the variance in the predictors. This makes it particularly suitable for analyzing complex and high-dimensional omics data. Both these algorithms have been extended to a sparse version to cope with high-dimensional data through the implementation of an l1 penalization to reduce the number of variables. One of the main advantages of PLS-DA is that it assigns each variable a Variable Importance in Projection (VIP) score. The VIP score is a metric that assesses the contribution of the variable in explaining the variance of both the predictors (X) and the response variables (Y). As an example of application, features selected by PLS-DA on the basis of their importance have highlighted crucial relationships between metabolites and proteins in COVID-19 patients, by visualizing bipartite omics connections with a relevance network [74]. A variation of PLS-DA, named backward elimination PLS-DA (BE PLS-DA), was implemented by Benedetto and coworkers to select the best discriminant model able to separate the classes under study [62]. 
BE PLS-DA uses an iterative approach to refine the regression model: it performs variable selection based on the VIP scores, discarding the least important variables in each cycle to enhance the discriminative power of the model.
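To show how VIP scores arise from a fitted model, the following sketch implements a single-response PLS (NIPALS) with a 0/1 class indicator as response and derives the VIP scores from the weight vectors and the response variance explained by each component. It is a minimal illustration using one common VIP definition, not the mixOmics or BE PLS-DA implementation; the function name and the simulated data are assumptions.

```python
import numpy as np

def pls1_vip(X, y, n_components=2):
    """Fit single-response PLS by NIPALS and return one VIP score per feature."""
    Xr = X - X.mean(axis=0)
    yr = y - y.mean()
    p = X.shape[1]
    W = np.zeros((p, n_components))
    ss = np.zeros(n_components)            # response variance explained per component
    for a in range(n_components):
        w = Xr.T @ yr
        w /= np.linalg.norm(w)             # unit-norm weight vector
        t = Xr @ w                         # component scores
        tt = t @ t
        p_load = Xr.T @ t / tt
        q = yr @ t / tt
        Xr = Xr - np.outer(t, p_load)      # deflate predictors
        yr = yr - q * t                    # deflate response
        W[:, a] = w
        ss[a] = q ** 2 * tt
    return np.sqrt(p * (W ** 2 @ ss) / ss.sum())

# Simulated example (assumption): only feature 0 separates the two classes
rng = np.random.default_rng(8)
y = np.repeat([0.0, 1.0], 15)              # dummy-coded class membership
X = rng.normal(size=(30, 10))
X[:, 0] += 3 * y

vip = pls1_vip(X, y)
print(np.round(vip, 2))
```

By construction the squared VIP scores average to 1, which is why a cutoff of 1 is often used as a rule of thumb; a backward elimination scheme would drop the lowest-scoring features and refit.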

Co-Inertia Analysis (CIA) [16] and Multiple Co-Inertia Analysis (MCIA) [17] represent another class of multivariate analysis. These methods were developed to assess relationships and trends across multiple datasets, and they have been proposed for integrative omics analysis [36, 46, 75,76,77]. CIA is included in the R-Bioconductor package made4. By simultaneously generating ordinations (dimension reduction diagrams), it identifies successive orthogonal axes with the maximum squared covariance between the datasets, thereby effectively representing joint similarities and trends. This method has made it possible, for instance, to identify the patterns of co-expression associated with the maximum covariance between proteins and genes in brain ischemia [36]. CIA has also been performed to assess co-variability between proteomics and lipidomics data from lung tissue samples of insulin-deficient diabetes mellitus pigs and wild-type pigs [76].

MCIA is a generalization of CIA to integrate more than two omics datasets, and it is implemented in the R-Bioconductor omicade4 package. The MCIA algorithm transforms each omics dataset separately into a comparable lower-dimensional space, maximizing the sum of the squared covariance between the scores of each dataset through synthetic axes. The different datasets are then projected into the same dimensional space, such that features that share similar trends are closely projected, highlighting relationships among samples and the overall consistency of the datasets. MCIA has been employed to determine the co-relationships and to visualise the similarity and the divergence of datasets from patients affected by ischemic stroke and mevalonate kinase deficiency [46, 75]. In the first study, MCIA highlighted an overall dissimilarity in the structure of the gene and protein datasets, which was confirmed by a low RV coefficient [46]. In the second one [75], the projection of exome, transcriptome and proteome on the same space demonstrated a different transcriptomics and proteomics profile in healthy and pathological conditions. Finally, MCIA was employed by Ichikawa et al. [77] to identify co-inertia drivers, using the R package MiBiOmics [78].

Multivariate methods represent the most heterogeneous class of integration methods. They comprise two categories of tools: algorithms that have been specifically designed for omics integration (e.g., DIABLO, MOFA and its extensions), and algorithms that have been adapted for this purpose (e.g., PLS-DA, MCIA and MFA). These methods have been increasingly used in recent years and have provided meaningful insights into the relationships, correspondences, and discrepancies among omics datasets. An overview of the above-mentioned papers and the integration strategies is provided in Table 3.

Table 3 List of papers employing multivariate methods as integration strategy

Machine learning and artificial intelligence

Machine learning (ML) is a powerful data science tool that enables systems to analyze complex data, identify patterns, and make informed predictions or decisions automatically. These algorithms are broadly categorized into supervised and unsupervised learning, depending on whether the data includes labeled outcomes (i.e., known classes for each sample) or not. In supervised learning, the model is trained to map inputs to their corresponding labels and can then be employed to predict labels for new, unseen data. Popular supervised ML techniques include linear and logistic regression, decision trees, random forests, support vector machines, k-nearest neighbors, and neural networks. In contrast, unsupervised learning involves training models to uncover underlying structures or patterns in unlabeled data. Common unsupervised algorithms include clustering methods, as well as techniques for dimensionality reduction and anomaly detection.

Clustering is an approach that groups samples based on a predefined distance metric, such that samples within the same cluster are more similar to each other than to those in different clusters. Traditional clustering methods are typically run once, which makes the robustness and reproducibility of their results variable. To address this, consensus clustering offers a more reliable approach by aggregating results from multiple clustering iterations. One implementation of consensus clustering is ConsensusClusterPlus [20], a Bioconductor package. The strategy is unsupervised: a subsample of items and features is repeatedly drawn and partitioned into k groups according to a clustering algorithm. The proportion of repetitions in which two items are clustered together is defined as the pairwise consensus value. For each k, pairwise consensus values are calculated and stored in a consensus matrix; then, the final agglomerative hierarchical consensus clustering is obtained using a distance of 1 − consensus value and pruned to k consensus clusters. This approach has been employed by Liu et al. [79] to delineate a comprehensive characterization of esophageal squamous cell carcinomas. They identified four distinct molecular subtypes, each associated with potential therapeutic targets and diagnostic biomarkers. In another study, which aimed at identifying tumor molecular subtypes by integrating data from transcriptomics, proteomics, and phosphoproteomics [80], this strategy was applied both to each individual omics dataset and to the integrated multi-omics data. The integration of omics data provided a better performance than single-omics analyses: a higher silhouette score, i.e., an index of clustering quality that measures how well a sample fits in the assigned cluster, suggests that integration of the three types of omics data better classifies cancer subtypes.

Another clustering-based integration strategy involves clustering in a latent variable space, as in the integrative clustering framework. The iClusterBayes package [21] employs a Bayesian latent variable model to integrate multiple genomic data types measured in the same set of samples. This method provides an integrated cluster assignment through joint inference across data types, while identifying the features that drive the formation of these clusters. Integrative clustering has been employed mainly to identify groups of patients with significantly distinct clinical and disease profiles [81, 82].
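
The consensus matrix at the core of consensus clustering can be sketched in a few lines of Python. This is a simplified illustration, not the ConsensusClusterPlus implementation: it assumes that only items (not features) are subsampled, it uses a plain k-means with a deterministic farthest-first initialisation as the base algorithm, and the function names, the 80% subsampling fraction, and the toy data are hypothetical choices.

```python
import random


def kmeans_labels(points, k, n_iter=20):
    """Plain k-means with deterministic farthest-first initialisation."""
    def dist2(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))

    centroids = [points[0]]
    while len(centroids) < k:
        # Next centroid: the point farthest from all centroids chosen so far.
        centroids.append(max(points, key=lambda p: min(dist2(p, c) for c in centroids)))
    labels = [0] * len(points)
    for _ in range(n_iter):
        labels = [min(range(k), key=lambda j: dist2(p, centroids[j])) for p in points]
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = [sum(c) / len(members) for c in zip(*members)]
    return labels


def consensus_matrix(data, k, n_reps=50, fraction=0.8, seed=0):
    """Proportion of resampling runs in which two items share a cluster."""
    rng = random.Random(seed)
    n = len(data)
    together = [[0] * n for _ in range(n)]
    sampled = [[0] * n for _ in range(n)]
    for _ in range(n_reps):
        idx = rng.sample(range(n), int(fraction * n))
        labels = kmeans_labels([data[i] for i in idx], k)
        for a, i in enumerate(idx):
            for b, j in enumerate(idx):
                sampled[i][j] += 1
                together[i][j] += labels[a] == labels[b]
    return [[together[i][j] / sampled[i][j] if sampled[i][j] else 0.0
             for j in range(n)] for i in range(n)]


# Toy data (hypothetical): two well-separated groups of five samples each.
data = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.3], [0.3, 0.2], [0.0, 0.4],
        [10.0, 10.1], [10.2, 9.9], [9.9, 10.3], [10.3, 10.0], [10.1, 9.8]]
consensus = consensus_matrix(data, k=2)
print(consensus[0][1], consensus[0][5])  # same-group pair vs cross-group pair
```

The matrix of 1 − consensus values would then serve as the distance for the final hierarchical clustering described above.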

One of the latest clustering-based strategies is subspace clustering, as described in Gillenwater et al. [83]. Subspace clustering is implemented within the R package MineClus (Mining Non-Empty clusters) [22] and consists in identifying clusters in subspaces of high-dimensional data. The reduction of proteomics, transcriptomics, and metabolomics data is performed by autoencoders (AE) prior to clustering. AE are deep neural networks consisting of layers of interconnected nodes: an encoder compresses the input into a reduced representation, and a decoder reconstructs the original input from it. The nodes use activation functions to process inputs and produce outputs, and training an AE entails adjusting the network weights to minimize the difference between the input and the reconstructed output. After reducing the datasets with AE, embeddings from all omics layers were horizontally concatenated for subspace clustering of the integrated data. In the work of Gillenwater et al. [83], the analysis did not produce a satisfactory clustering, since it did not achieve the aim of determining molecular-based clusters with distinct clinical phenotypes. More consistent results in terms of patients’ clinical characterization were in fact obtained by performing subspace clustering on each distinct omics dataset.

In contrast, the use of AE to obtain a representation of the stacked multi-omics features was successfully applied in the works of Wang et al. [84] and Khadirnaikar et al. [85]. In the former [84], k-means clustering on the AE embeddings was employed to identify differentially altered pathways associated with different phenotypes of long COVID. In the latter [85], consensus k-means clustering on the latent representation identified labels that are associated with pan-cancer subgroups with distinct clinical characteristics.

As regards supervised models, regression is one of the approaches commonly employed in omics data analysis. These models are designed to capture the linear relationship between one or more independent variables (predictors or explanatory variables) and a dependent variable (the outcome or response). Once fitted, the regression model can be used to predict the outcome value of new input data. In the context of omics integration, regression models have not been utilized to the same extent as unsupervised clustering-based methods: we found only one example of regression analysis, employed for the integration of proteins and metabolites to predict disc herniation development in dogs [86]. Horvatić et al. [86] used a version of linear regression called elastic-net, in which a regularization term is added to the loss function to avoid overfitting. Depending on the structure of the regularization term, regularization can be defined as LASSO, ridge, or elastic-net, which differ in penalty term, shrinkage, and feature selection. Elastic-net is a compromise between LASSO and ridge, as the penalty term of the loss function is a combination of both. In the study described in [86], different elastic-net regression models were fitted with different feature subsets, selected either through recursive feature elimination (RFE) or minimum redundancy-maximal relevance (mRMR) algorithms. The final model was built from the features that were repeatedly selected, and it correctly classified all the samples in the test set.
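
To make the relation between LASSO, ridge, and elastic-net concrete, the following Python sketch writes out the penalised loss. It assumes the common parameterisation in which `lam` sets the overall penalty strength and `alpha` mixes the L1 (LASSO) and squared L2 (ridge) terms; this convention, the function names, and the toy coefficients are assumptions for illustration and are not taken from [86].

```python
def elastic_net_penalty(coefs, lam, alpha):
    """Penalty term: lam * (alpha * L1 + (1 - alpha) / 2 * squared L2)."""
    l1 = sum(abs(b) for b in coefs)
    l2_squared = sum(b * b for b in coefs)
    return lam * (alpha * l1 + 0.5 * (1.0 - alpha) * l2_squared)


def penalised_loss(x, y, intercept, coefs, lam, alpha):
    """Half mean squared error plus the elastic-net penalty."""
    residuals = [
        yi - intercept - sum(b * v for b, v in zip(coefs, row))
        for row, yi in zip(x, y)
    ]
    half_mse = sum(r * r for r in residuals) / (2 * len(y))
    return half_mse + elastic_net_penalty(coefs, lam, alpha)


coefs = [1.0, -2.0]  # hypothetical regression coefficients
print(elastic_net_penalty(coefs, lam=0.5, alpha=1.0))  # pure LASSO penalty
print(elastic_net_penalty(coefs, lam=0.5, alpha=0.0))  # pure ridge penalty
print(elastic_net_penalty(coefs, lam=0.5, alpha=0.5))  # elastic-net mix
```

With `alpha` equal to 1 only the absolute values of the coefficients are penalised, which can shrink some of them exactly to zero and thus performs feature selection; with `alpha` equal to 0 only their squares are penalised, which shrinks coefficients without removing them.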

Machine learning (ML) classification models have been gaining increasing prominence in the field of multi-omics integration. This is particularly true for ensemble learning approaches such as Random Forest (RF) [87, 88, 89, 90], Adaptive Boosting (AdaBoost) [87] and Gradient Boosting Machine [91]. RF combines multiple decision trees to make more accurate predictions. In decision trees, each node represents a query on one or more input features, and each branch represents the outcome of the decision. By training each tree on a random subset of data and performing classification based on the decision of multiple trees, the prediction becomes more stable and less prone to overfitting. Moreover, RF is often chosen for its capability to deal with high dimensionality and missing values [89]. Huang et al. [87] employed RF for multi-class classification by combining metagenomics, metatranscriptomics, metabolomics, proteomics, and viromics, reaching an Area Under the Curve (AUC) above 0.83. Li et al. [88] integrated proteomics and metabolomics through RF to identify prediction biomarkers. RF was also used to build a prognostic model that included genomics, transcriptomics, proteomics, and histopathological image features [89], reaching the highest performance with respect to single-omics models. Similarly, the combination of multiple proteins and metabolites provided better results in terms of AUC, as illustrated in [90]. However, it is important to note that employing a whole set of omics data does not always yield better results than employing a subset. In the work of Wang et al. [92], a panel of two proteins and two metabolites was employed to build several models, and these molecules were able to discriminate among the different conditions.

AdaBoost is another ensemble learning method that reached a good result in terms of AUC in a multi-class classification problem [87]. Similarly to random forests, AdaBoost combines decision trees, but it builds them sequentially: it initially assigns an equal weight to all training data points, then it calculates the errors of the fitted tree and increases the weight of the misclassified data points. Another tree is then fitted on the same dataset with updated weights, and the process is repeated iteratively until all the trees are fitted.
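
A single reweighting round can be sketched as follows. The snippet assumes the classical binary formulation with labels coded as +1 and −1 and a weighted error strictly between 0 and 1; the predictions of the weak learner are hypothetical and no tree is actually fitted here.

```python
import math


def adaboost_round(weights, y_true, y_pred):
    """One AdaBoost round: return (learner weight, updated sample weights)."""
    # Weighted error of the current weak learner (assumed to be in (0, 1)).
    error = sum(w for w, t, p in zip(weights, y_true, y_pred) if t != p)
    alpha = 0.5 * math.log((1.0 - error) / error)
    # Misclassified samples (t * p = -1) are up-weighted, correct ones down-weighted.
    raw = [w * math.exp(-alpha * t * p) for w, t, p in zip(weights, y_true, y_pred)]
    total = sum(raw)
    return alpha, [r / total for r in raw]


y_true = [1, 1, -1, -1, 1]
y_pred = [1, 1, -1, -1, -1]  # hypothetical weak learner: last sample is wrong
weights = [1 / len(y_true)] * len(y_true)  # equal initial weights

alpha, new_weights = adaboost_round(weights, y_true, y_pred)
print(alpha)
print(new_weights)
```

After this round the misclassified sample carries about half of the total weight, so the next tree is pushed to classify it correctly; the final prediction is a vote of all trees weighted by their `alpha` values.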

Another ensemble learning method is the Gradient Boosting Machine. This algorithm constructs an additive regression model iteratively, fitting each new component to the residuals of the current model by least squares. Gradient Boosting Machines have been successfully applied to construct predictive models for responders and non-responders to a low-caloric diet using transcriptomics, lipidomics, and metabolomics data, achieving an AUC of 0.75 [91].
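
The idea of iteratively fitting the residuals can be sketched with one-split regression trees (stumps) on a single feature. This is a toy illustration under simplifying assumptions (squared-error loss, one numeric predictor with at least two distinct values, a fixed learning rate); the function names and data are hypothetical and the sketch does not reproduce the model of [91].

```python
def fit_stump(x, residuals):
    """Best single-split regression stump under squared error."""
    best = None
    for threshold in sorted(set(x))[:-1]:
        left = [r for xi, r in zip(x, residuals) if xi <= threshold]
        right = [r for xi, r in zip(x, residuals) if xi > threshold]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, threshold, lm, rm)
    _, threshold, lm, rm = best
    return lambda xi: lm if xi <= threshold else rm


def gradient_boost(x, y, n_rounds=20, learning_rate=0.5):
    """Additive model: each stump is fitted to the current residuals."""
    base = sum(y) / len(y)
    stumps = []

    def predict(xi):
        return base + learning_rate * sum(s(xi) for s in stumps)

    history = []  # mean squared error before each boosting round
    for _ in range(n_rounds):
        residuals = [yi - predict(xi) for xi, yi in zip(x, y)]
        history.append(sum(r * r for r in residuals) / len(y))
        stumps.append(fit_stump(x, residuals))
    return predict, history


# Toy data (hypothetical): a step-like response with small fluctuations.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [1.0, 1.2, 0.9, 1.1, 3.0, 3.2, 2.9, 3.1]
predict, history = gradient_boost(x, y)
print(history[0], history[-1])  # training error before the first and last round
```

Because each stump is a least-squares fit of the residuals and the learning rate is below 1, the training error cannot increase from one round to the next; in practice the number of rounds and the learning rate are tuned on held-out data to limit overfitting.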

Other classification models that have been successfully applied in multi-omics integration include Support Vector Machines (SVM) [93] and deep learning approaches [87]. SVMs address binary classification problems by identifying an optimal hyperplane that separates two classes in a high-dimensional feature space. This hyperplane is selected to maximize the margin between the two classes, ensuring robust performance even with complex or high-dimensional datasets. In contrast, deep learning models, e.g. the feedforward neural networks used in the work of Huang et al. [87], employ multiple layers of interconnected neurons to capture complex, non-linear relationships in the data. These models are trained iteratively on the input data to minimize prediction error, allowing them to learn patterns and features relevant for accurate diagnosis.

Leveraging the potential of ML models, Bai et al. [94] employed a Python package called AutoGluon-Tabular [23] to linearly combine the results of several models. AutoGluon-Tabular is an automated ML framework designed to build predictive models from unprocessed tabular datasets, such as CSV files. It simplifies the modeling process by automatically recognizing the data type of each column, including text, and by automating feature engineering and hyperparameter optimization. The framework trains various base models, including Random Forests, LightGBM, CatBoost, ExtraTrees, XGBoost, and neural networks (e.g., NeuralNetMXNet and NeuralNetFastAI). The predictions of these base models are then used as features to train a final ensemble model, which combines the strengths of the base models.

In general, when compared with single-omics models, multi-omics classification performs better in terms of accuracy and AUC [89, 93]. This confirms the power of investigating several layers of biological processes in order to achieve a better understanding of diseases and phenotypes. An overview of the above-mentioned papers and the integration strategies that have been put into practice is provided in Table 4. Table 5 reports the accuracy metrics of the cited papers for both the multi-omics and the single-omics cases.

Table 4 List of papers employing machine learning and artificial intelligence methods as integration strategy
Table 5 Evaluation metrics for the papers in which these metrics were reported

Challenges and future directions in omics integration

As discussed in the previous sections, integration of different omics datasets holds much potential for uncovering complex biological insights that cannot be achieved by analyzing individual datasets alone. This approach is increasingly being employed across diverse contexts, and it has demonstrated its ability to provide new insights into disease biomarkers and mechanisms. However, the processing and analysis of omics data matrices pose a variety of challenges, and the merging of multiple omics data matrices further exacerbates these problems.

First, the quality of each dataset should be assessed to guarantee data reproducibility [95]. This is important because some common data analysis techniques are highly affected by the presence of outliers, either as single analytes or as whole samples [96]. Moreover, in computer science, the expression “garbage in, garbage out” is often used to express the concept that the quality of the input determines the quality of the output. Therefore, the result of the analyses strongly depends on the robustness of the initial input.

Another issue is that omics data matrices differ in terms of data types, size, noise, and the correspondence and correlation between measurements from different technologies [7]. When combining additional information, e.g. clinical data, dissimilarities in data types may include differences in the order of magnitude, measurement unit or variance–covariance structure [97], mismatched distributions, and diverse data modalities, i.e. continuous signals, discrete counts, intervals, categorical variables, pathways, etc. [98]. Finally, each omics dataset is characterized by a different number of features, typically hundreds to thousands for transcripts, several tens or hundreds for proteins, and tens to thousands for metabolites [99]. The omics layer with the highest number of features may dominate the other ones, possibly adding annotation bias and enrichment of noise if the employed model is not robust enough [99].

In the following paragraphs, we provide an overview of the main issues that can be encountered when dealing with and integrating high-throughput omics datasets. We also describe the strategies that have been adopted in literature to cope with these challenges, which are also depicted in Fig. 3.

Fig. 3 Overview of the main challenges in omics integration (missing data, collinearity, dimensionality) and strategies that have been employed in the retrieved papers to cope with these issues

Missing data

High-throughput platforms often produce matrices with a high percentage of missing data, especially in proteomics and metabolomics. More specifically, missing values can be classified as missing at random (MAR), missing completely at random (MCAR), and missing not at random (MNAR) [100]. MAR data occur when the probability of a missing value depends on other variables in the dataset, but not on the missing data itself; MCAR data occur when the probability of missing data is the same for all observations, and it is not related to any other variable in the dataset; MNAR data occur when the probability of missing data is related to the missing data itself. The way of approaching missing values changes on the basis of the category they belong to; however, most statistical methods do not allow the presence of missing data, and it is seldom possible to remove entire columns or rows of the data matrix. Imputation is often performed to cope with these problems, but its performance depends on the percentage of missing values with respect to the total amount of information, and it can strongly affect the results of downstream analyses [101, 102]. Moreover, imputing missing values is not a valid methodology in some cases: reasons for the absence of data are manifold, from the actual absence of a protein or metabolite, to low coverage or low sensitivity of the instrument [98]. If the absence of data is due to the actual lack of a protein in a sample, its value should not be replaced by a fictitious value. On the other hand, if imputation is considered to be reasonable, missing values can be replaced by a measure of central tendency, such as the mean, mode, or median of the feature; alternatively, they can be replaced by constant or even random values.

Other approaches that we found in the literature consist of filling the missing values with either half of the minimum value of each feature [87] or one fifth of the minimum value recorded in the dataset for that molecule [52].

These strategies fall into the category of univariate methods, as they impute missing values using only the non-missing values of that same feature. Alternatively, multivariate imputation algorithms use the entire set of available features to estimate the missing values. For instance, the K-Nearest Neighbour (KNN) method has been used to fill in missing values by using the values of the K most similar data points in the dataset [80, 85, 86, 90]. The underlying idea is that if there is a missing value in the dataset, the values of the closest data points can be used to estimate it.
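
The difference between univariate and multivariate imputation can be illustrated with a small Python sketch. It is a simplified illustration, not the procedure of any cited study: missing entries are coded as `None`, rows are samples, every feature is assumed to have at least one observed value, and the function names, the distance (root mean squared difference over shared features), and the toy matrix are hypothetical.

```python
import math


def impute_half_min(matrix):
    """Univariate: replace missing entries with half of the feature minimum."""
    columns = list(zip(*matrix))
    fills = [min(v for v in col if v is not None) / 2 for col in columns]
    return [[fills[j] if v is None else v for j, v in enumerate(row)] for row in matrix]


def impute_knn(matrix, k=2):
    """Multivariate: average the feature over the k most similar samples."""
    def distance(r1, r2):
        shared = [(a, b) for a, b in zip(r1, r2) if a is not None and b is not None]
        if not shared:
            return math.inf
        return math.sqrt(sum((a - b) ** 2 for a, b in shared) / len(shared))

    result = [list(row) for row in matrix]
    for i, row in enumerate(matrix):
        for j, value in enumerate(row):
            if value is None:
                # Donors are the other samples in which feature j was observed.
                donors = [r for m, r in enumerate(matrix) if m != i and r[j] is not None]
                donors.sort(key=lambda r: distance(row, r))
                nearest = donors[:k]
                result[i][j] = sum(r[j] for r in nearest) / len(nearest)
    return result


# Toy matrix (hypothetical): 4 samples x 3 features, one missing value.
data = [
    [1.0, 10.0, 100.0],
    [1.2, None, 110.0],
    [0.9, 9.0, 95.0],
    [5.0, 50.0, 500.0],
]
print(impute_half_min(data)[1][1])  # 4.5, half of the feature minimum (9.0)
print(impute_knn(data, k=2)[1][1])  # 9.5, mean of the two most similar samples
```

The univariate value ignores that the second sample resembles the first and third ones, whereas the KNN estimate borrows information from them.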

Another solution is imputation with the probabilistic minimum [57], which is implemented in the imputeLCMD R package [103]. This approach estimates the lowest detectable value, either in the entire dataset or within each sample, and fills the gaps with small values drawn from a distribution centred around this minimum.

Finally, another way to cope with missing data is to employ ML methods that can handle missing values natively, for instance random forest [8, 104]. However, not all machine learning models are robust or perform well in the presence of missing data.

Collinearity

Multicollinearity means a high degree of linear correlation between explanatory variables [105], and it often occurs in omics data since features are the result of biological mechanisms which are physically interconnected. In the context of regression analysis, multicollinearity can be assessed by the Variance Inflation Factor (VIF), the condition index (CI), and the Variance Decomposition Proportion (VDP). Mathematically, the VIF is calculated by regressing a predictor $i$ against all the other predictors in the model and calculating the ratio $\mathrm{VIF}_i = 1/(1 - R_i^2)$, where $R_i^2$ is the regression $R^2$ for that specific predictor $i$. This index quantifies the extent to which the variance of a coefficient is increased as a consequence of multicollinearity in a regression analysis. The CI is calculated by performing singular value decomposition (SVD) and computing the square root of the ratio between the largest eigenvalue and each of the other eigenvalues. Finally, VDP represents the degree of variance inflation due to multicollinearity, and it makes it possible to determine the variables contributing to it. The principle is that each variable has variance decomposition proportions that are associated with each CI. If the sum of these proportions for two or more condition indices exceeds a predefined threshold, it is possible to conclude that there is multicollinearity between the explanatory variables that correspond to these proportions. Strategies to deal with multicollinearity include increasing the sample size, combining multicollinear variables into a single one [105], or deleting strongly correlated variables (e.g. Pearson $R^2 > 0.95$ [91] or $> 0.90$ [87]) by keeping only one of the two. Another approach could be to substitute correlated features with their linear combination, such as in PCA [91].

However, deletion or combination strategies are not always applicable in systems biology contexts because biological phenomena are highly interconnected; therefore, eliminating some features could hide their associations and their involvement in certain pathways. When performing pathway analysis, the more a pathway is enriched, the lower its p-value. Since the enrichment strongly depends on the number of proteins/metabolites and on how they are connected, deleting some features could affect the whole analysis.
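
The VIF defined above, 1/(1 − R²) for the regression of one predictor on all the others, can be computed with a short Python sketch. It is a didactic illustration that solves the normal equations by Gaussian elimination; the function names and the toy matrix (in which the second predictor is roughly twice the first) are hypothetical, and dedicated statistical libraries would normally be preferred for real data.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination with pivoting."""
    n = len(b)
    m = [row[:] + [bi] for row, bi in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def r_squared(design, target):
    """R^2 of an ordinary least-squares fit of target on design (with intercept)."""
    rows = [[1.0] + list(r) for r in design]
    p = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * t for r, t in zip(rows, target)) for i in range(p)]
    beta = solve(xtx, xty)
    fitted = [sum(b * v for b, v in zip(beta, r)) for r in rows]
    mean = sum(target) / len(target)
    ss_res = sum((t - f) ** 2 for t, f in zip(target, fitted))
    ss_tot = sum((t - mean) ** 2 for t in target)
    return 1.0 - ss_res / ss_tot


def variance_inflation_factors(x):
    """VIF_i = 1 / (1 - R_i^2), regressing predictor i on all the others."""
    vifs = []
    for i in range(len(x[0])):
        target = [row[i] for row in x]
        others = [[v for j, v in enumerate(row) if j != i] for row in x]
        vifs.append(1.0 / (1.0 - r_squared(others, target)))
    return vifs


# Toy predictors (hypothetical): column 2 is roughly twice column 1.
x = [
    [1.0, 2.1, 5.0],
    [2.0, 3.9, 3.0],
    [3.0, 5.8, 8.0],
    [4.0, 8.2, 1.0],
    [5.0, 10.1, 7.0],
    [6.0, 11.9, 2.0],
]
vifs = variance_inflation_factors(x)
print(vifs)  # large values for the first two predictors, close to 1 for the third
```

Commonly used rules of thumb flag predictors with a VIF above 5 or 10, although such cut-offs are conventions rather than formal tests.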

Dimensionality

Biological experiments usually produce matrices characterized by a small number of samples n and a high number of features p. Concerning statistical methods, good practice has established that, as a rule of thumb, the ratio n/p should be at least 5; otherwise, statistical power might be limited [106]. More generally, high dimensionality leads most models to overfit [107], obtaining good classification performance on the training set, but poor generalization on the test set. This is the so-called curse of dimensionality: increasing the number of features improves the performance up to a certain limit, after which the model starts to perform worse. Indeed, according to a systematic review on the pre-processing of data in the medical domain [108], data reduction is the most frequent pre-processing task (55% of the considered papers), followed by data cleaning (29%), transformation (9%), balancing (5%) and integration (2%).

Among the most implemented strategies to reduce dimensionality, we found several simple strategies, such as: (i) keeping features detected in a defined number of samples [38, 80, 86]; (ii) retaining features on the basis of a fold change threshold [36]; (iii) deleting features having near-zero variance [91]; (iv) selecting only statistically significant features [27, 71, 109] or features significantly associated with clinical variables [83]; (v) taking genes that are quantified in multiple omics datasets [57]; (vi) performing WGCNA and discarding all the molecules that were not assigned to any group [39]. However, relying solely on statistical tests or correlations between variables to identify the most important ones is a naïve approach due to its univariate nature. This approach focuses on one variable at a time and fails to consider all the relationships between variables, which hinders the ability to identify multivariate patterns underlying biological phenomena.
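
Two of the simple filters listed above, (i) and (iii), can be sketched as follows. The thresholds (detection in at least 75% of the samples, variance above 1e-8), the function name, and the toy matrix are hypothetical choices for illustration and are not taken from the cited studies; missing values are coded as `None`.

```python
def filter_features(matrix, names, min_detected=0.75, min_variance=1e-8):
    """Keep features detected in enough samples and with non-negligible variance."""
    kept = []
    for name, column in zip(names, zip(*matrix)):
        observed = [v for v in column if v is not None]
        if len(observed) / len(column) < min_detected:
            continue  # strategy (i): detected in too few samples
        mean = sum(observed) / len(observed)
        variance = sum((v - mean) ** 2 for v in observed) / len(observed)
        if variance <= min_variance:
            continue  # strategy (iii): near-zero variance
        kept.append(name)
    return kept


# Toy matrix (hypothetical): 4 samples x 3 features.
names = ["geneA", "geneB", "geneC"]
matrix = [
    [1.0, None, 5.0],
    [2.0, None, 5.0],
    [3.0, 7.0, 5.0],
    [4.0, None, 5.0],
]
print(filter_features(matrix, names))  # ['geneA']
```

Both filters inspect one feature at a time and never use the outcome, which keeps them simple but also makes them blind to the multivariate patterns discussed above.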

Among the more complex methods to reduce the dataset, we found strategies that select features based on their coefficients in regression models such as Partial Least Squares regression (PLS) and their categorical derivatives [97]: Linear Discriminant Analysis (LDA), PLS-DA, and Orthogonal Projections to Latent Structures Discriminant Analysis (OPLS-DA). Regression analysis with elastic-net regularization can be performed separately for each omics platform to select relevant features [53]. Alternatively, variable selection can be based on the VIP score of PLS-DA models [63]. However, it must be recalled that these models are not always accurate and could overfit, since they can always find a projection that separates the phenotypes, even with random data [110, 111]. Another example of a multivariate method for dimensionality reduction is minimum Redundancy Maximum Relevance (mRMR) [86]. The mRMR algorithm identifies features based on their relevance, i.e., how much a feature is correlated with the target variable; at the same time, it tries to reduce the redundancy, which is a measure of how much a feature is correlated with the other ones. By combining these two criteria, mRMR makes it possible to identify the most discriminative features of a dataset.
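
The trade-off between relevance and redundancy can be illustrated with a greedy Python sketch. As a simplifying assumption, both criteria are measured here with the absolute Pearson correlation and combined by a difference, whereas mRMR implementations often rely on mutual information; the function names and toy features are hypothetical.

```python
def pearson(a, b):
    """Pearson correlation coefficient between two equally long sequences."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    sa = sum((x - ma) ** 2 for x in a) ** 0.5
    sb = sum((y - mb) ** 2 for y in b) ** 0.5
    return cov / (sa * sb)


def mrmr(features, target, n_select):
    """Greedy selection maximising relevance minus mean redundancy."""
    relevance = {name: abs(pearson(values, target)) for name, values in features.items()}
    selected = []
    while len(selected) < n_select:
        def score(name):
            if not selected:
                return relevance[name]
            redundancy = sum(
                abs(pearson(features[name], features[s])) for s in selected
            ) / len(selected)
            return relevance[name] - redundancy

        candidates = [name for name in features if name not in selected]
        selected.append(max(candidates, key=score))
    return selected


# Toy data (hypothetical): f2 is almost a copy of f1, f3 carries a weaker
# but less redundant signal about the two classes.
target = [0, 0, 0, 1, 1, 1]
features = {
    "f1": [1.0, 2.0, 3.0, 7.0, 8.0, 9.0],
    "f2": [1.5, 1.5, 3.5, 6.5, 8.5, 8.5],
    "f3": [3.0, 2.0, 1.0, 4.0, 3.0, 2.0],
}
print(mrmr(features, target, n_select=2))
```

In this toy case a ranking by relevance alone would pick `f1` and `f2`, whereas the redundancy term leads the second choice to `f3`, which is less correlated with the feature already selected.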

Approaches to cope with the challenge of matrix dimensionality also comprise ML or AI models for variable reduction, for instance through random forest [80] or autoencoders [83]. Other feature selection methods, defined as wrappers, use ML algorithms to evaluate the performance of a model trained and tested with random subsets of features, in an iterative procedure [112]. Recursive Feature Elimination (RFE) has been employed to select features by recursively removing columns of the data matrix and building a model on the remaining ones [86]. Recursive Feature Elimination with Cross-Validation (RFECV) was adopted to remove different subsets of features and evaluate the performance of a model using cross-validation [87].

Finally, ML/AI models can be used directly without feature reduction. Indeed, some models like Random Forest, Support Vector Machines and Classification and Regression Trees (CART) do not require feature reduction. However, given the high number of variables, results are often difficult to interpret [113].

Model interpretability

Multivariate methods and machine learning models are powerful tools. However, some of the most powerful multivariate algorithms are based on transformations of the input features into another dimensional space where features are projected. This could be an issue if the biochemical meaning of the model needs to be assessed, since information on the importance of individual features and their mutual relationships could be lost [97].

In machine learning, the simplest models, such as linear ones, are inherently interpretable: they have the advantage of being transparent and easy to interpret, but this often comes at the expense of reduced predictive accuracy. When dealing with more complex models, there are various strategies to achieve interpretability, ranging from feature importance analysis to more sophisticated techniques that incorporate explainability into the model architecture. In recent years, model explainability has emerged as a pivotal subject in research, often referred to as explainable AI. The rationale is that understanding how the model predicts the outcome is a means to trust the prediction [114]. An example of an algorithm that can explain classifier predictions is the Local Interpretable Model-Agnostic Explanations (LIME) algorithm, which explains the prediction of a black-box model by learning a simpler, interpretable model that locally agrees with the black-box one. Another popular algorithm for explaining the output of machine learning models is SHAP (SHapley Additive exPlanations), which comes from game theory and has been increasingly employed in several fields [115].
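
The game-theoretic idea behind SHAP can be shown with an exact computation of Shapley values for a tiny model. This sketch is not the SHAP library, which relies on approximations to scale to many features; it assumes that "absent" features are replaced by a fixed baseline value, and the toy model, instance, and baseline are hypothetical.

```python
from itertools import permutations


def shapley_values(model, instance, baseline):
    """Exact Shapley values: average marginal contribution over all orderings."""
    n = len(instance)
    contributions = [0.0] * n
    orderings = list(permutations(range(n)))
    for order in orderings:
        current = list(baseline)
        previous = model(current)
        for feature in order:
            # Switch the feature from its baseline value to its actual value.
            current[feature] = instance[feature]
            value = model(current)
            contributions[feature] += value - previous
            previous = value
    return [c / len(orderings) for c in contributions]


def toy_model(x):
    """Hypothetical black box with an interaction between features 1 and 2."""
    return 2.0 * x[0] + x[1] * x[2]


instance = [1.0, 2.0, 3.0]
baseline = [0.0, 0.0, 0.0]
values = shapley_values(toy_model, instance, baseline)
print(values)  # [2.0, 3.0, 3.0]
```

The contributions add up to the difference between the prediction for the instance (8.0) and the prediction for the baseline (0.0), and the interaction term is shared equally between the two features involved. The number of orderings grows factorially with the number of features, which is why exact enumeration is feasible only for very small models.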

Computational power

The huge amount of data generated by high-throughput technologies requires an increasing computational power and storage capability of computer systems [7, 99, 116]. Processing such large-scale datasets comes with significant computational challenges, including the high cost of data processing, the need for efficient algorithms, and the requirement for robust infrastructure capable of handling complex computations. Some advanced analytical techniques, such as deep learning and graphical models, demand substantial memory and processing capabilities, which may limit their accessibility for researchers with limited resources. Fortunately, the advent of optimization algorithms, online machine learning, parallelization of workflows, and cloud computing has made large-scale analyses more feasible by improving efficiency and scalability [98]. However, the trade-offs between computational cost and analytical depth remain a key consideration in omics data integration.

Future directions in omics studies and integration

The future of multi-omics integration hinges on addressing the current limitations and exploiting emerging technological and analytical advancements. While existing methods have laid a strong foundation, the field must evolve to handle the increasing volume and complexity of multi-omics data. In recent years, single-cell omics and spatial omics have been increasingly recognized as promising techniques to revolutionize our understanding of biological systems. In contrast with bulk tissue sequencing, which simultaneously analyzes thousands of cells from a tissue [117], single-cell omics captures the heterogeneity of the tissue by resolving the unique role of individual cells, offering deeper insights into specific cell function and behavior [118]. Recognized by Nature as one of the top emerging technologies in 2022 [120], spatial omics adds another layer of complexity by mapping gene expression patterns within their spatial context, enabling the study of tissue architecture and cell-to-cell communication at resolutions down to the subcellular level [119]. However, significant computational and analytical challenges posed by the volume of data generated by these technologies remain. The need to efficiently store, manage, and analyze these datasets continues to outpace available computational resources, making scalability a key concern. Additionally, existing integrative tools lack the analytical capacity to perform crucial functions, requiring further methodological advancements [121]. Nevertheless, advancements in single-cell and spatial multi-omics will continue to drive innovation, offering a more comprehensive view of cellular biology.

Although ML and AI approaches have not been the most common methods for integrating omics data, advancements in high-throughput technologies are likely to make them increasingly crucial in data analysis. These models can overcome challenges associated with high dimensionality, noise, and data heterogeneity. Feature selection will be crucial for the optimal application of these techniques, helping to mitigate challenges posed by high dimensionality, redundancy, and noise in large-scale datasets. Traditional statistical approaches remain widely used [122], but they often struggle with the complexity of multi-omics datasets. Recent advancements in machine learning (ML) and deep learning (DL)-based feature selection methods offer a more scalable and adaptive solution [123]. Even more advanced feature selection algorithms have emerged that shift from a single-objective to a multi-objective perspective, leveraging quantum computing [124] and opening a whole new field of research with further potential.

Another promising advancement in multi-omics is multiscale integration, which provides a holistic understanding of biological systems by linking gene and protein expression data with imaging modalities and clinical metadata [125]. This allows for the identification of disease markers with higher specificity, leading to improved diagnostics, prognostic predictions, and therapeutic interventions [125]. However, this integration introduces new challenges, including batch effects, computational complexity, and standardization issues. Developing robust methods to harmonize and analyze such diverse datasets is crucial for future progress. Finally, the establishment of community-driven initiatives for data sharing and analysis will accelerate the translation of multi-omics findings into clinical applications, such as personalized medicine and drug discovery. Fostering collaboration and idea-sharing among researchers to collectively tackle these complexities is the only way to establish robust pipelines and consensus in data collection and analysis [126].

Conclusions

Omics integration has become a popular topic in systems biology, as it offers the potential to unravel pathophysiological mechanisms at multiple levels, joining together complementary information from different omics platforms. This is particularly important in those diseases whose clinical phenotypes and genotypes are not enough to provide either an understanding of the underlying mechanisms or the diagnosis and prognosis. Moreover, multiple biological layers are intricately interconnected in human diseases. For example, disruptions in DNA repair processes contribute to various diseases [127]; therefore, it is essential to consider interactions with repair molecules to fully understand disease mechanisms. In tumors, the literature is increasingly highlighting the interplay between genetic mechanisms and molecular pathways involved in immunity [128]. Integration of different omics datasets could also lead to huge progress in the context of personalized medicine, which aims at having both molecular and clinical profiles of patients to build individualized health care models with tailored treatment and management [129, 130].

Relationships between omics layers are not usually causal, and statistical associations cannot capture complex relationships such as post-translational modifications or non-linear reaction kinetics [95]. Moreover, correlations do not establish causal associations and can even arise by chance. Multivariate methods have the potential to discover hidden patterns and relationships. Indeed, they have been gaining attention for their ability to integrate different datasets and to provide insights into their contribution, similarity, and dissimilarity. On the other hand, interpretability plays a central role in systems biology: while correlation approaches can be too reductive, multivariate methods are more challenging to apply and interpret.
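The remark that correlations can arise by chance can be illustrated with a small simulation. The sketch below is not taken from any of the reviewed studies: it assumes two hypothetical omics blocks made of independent Gaussian noise, measured on the same few samples, and reports the largest absolute Pearson correlation found across all feature pairs. The function name and all parameter values are illustrative assumptions.

```python
import numpy as np


def max_chance_correlation(n_samples, n_features_a, n_features_b, seed=0):
    """Largest |Pearson r| between two blocks of independent Gaussian noise.

    Both blocks are simulated (hypothetical data), so any correlation
    found here is spurious by construction.
    """
    rng = np.random.default_rng(seed)
    block_a = rng.standard_normal((n_samples, n_features_a))
    block_b = rng.standard_normal((n_samples, n_features_b))

    # Standardize each column (population std) so that the scaled
    # cross-product below equals the Pearson correlation matrix.
    block_a = (block_a - block_a.mean(axis=0)) / block_a.std(axis=0)
    block_b = (block_b - block_b.mean(axis=0)) / block_b.std(axis=0)
    cross_corr = block_a.T @ block_b / n_samples

    return float(np.abs(cross_corr).max())


if __name__ == "__main__":
    # With 20 samples, more tested pairs means a larger number of
    # seemingly strong correlations between unrelated features.
    for n_features in (1, 20, 200):
        r_max = max_chance_correlation(20, n_features, n_features)
        print(f"{n_features * n_features:>6} pairs: max |r| = {r_max:.2f}")
```

Under these assumptions, the maximum correlation is expected to grow as more feature pairs are screened, which is why pairwise correlation results between omics layers are usually accompanied by multiple-testing correction and independent validation.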

ML/AI models are typically designed to improve classification performance, often at the expense of understanding the importance of individual features in phenotype discrimination.

When dealing with omics integration, an important issue is the need to work with appropriate datasets. It is therefore important to define an adequate study design and to make high-quality datasets publicly available, in order to foster research and collaboration towards successful data integration. Moreover, although several public databases are available, they are still limited to single omics [131,132,133].

In summary, although the integration of multiple datasets has yielded encouraging results in terms of understanding molecular mechanisms, several challenges remain. Each dataset brings difficulties that stem from the high-throughput nature of omics platforms: data quality, missing data, and collinearity. When integrating different omics, the dimensionality of the problem increases and the data become even more heterogeneous. We explored various strategies to address these challenges, emphasizing that robust pre-processing and fine-tuned approaches are essential for unlocking the full potential of omics integration. With improved integration strategies, multi-omics integration will likely become even more relevant for biomedical research in the years to come.
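As a concrete illustration of the collinearity issue mentioned above, the variance inflation factor (VIF) of a feature can be computed as 1 / (1 - R²), where R² comes from regressing that feature on all the others. The following is a minimal sketch on simulated data, assuming more samples than features (which in omics generally means working on a reduced feature subset). The function name and the example matrix are hypothetical and do not reproduce any specific tool from the reviewed literature.

```python
import numpy as np


def variance_inflation_factors(x):
    """Return the VIF of each column of x (samples x features).

    Each feature is regressed on all remaining features (plus an
    intercept) by ordinary least squares; VIF = 1 / (1 - R^2), which
    equals total sum of squares / residual sum of squares.
    Assumes n_samples > n_features.
    """
    x = np.asarray(x, dtype=float)
    n_samples, n_features = x.shape
    vifs = np.empty(n_features)
    for j in range(n_features):
        target = x[:, j]
        others = np.delete(x, j, axis=1)
        design = np.column_stack([np.ones(n_samples), others])
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        residual = target - design @ coef
        ss_res = residual @ residual
        ss_tot = ((target - target.mean()) ** 2).sum()
        vifs[j] = ss_tot / ss_res
    return vifs


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    f1 = rng.standard_normal(200)
    f2 = rng.standard_normal(200)                # unrelated to f1
    f3 = f1 + 0.05 * rng.standard_normal(200)    # nearly a copy of f1
    print(variance_inflation_factors(np.column_stack([f1, f2, f3])))
```

In this simulated example the two near-duplicate features should receive very large VIF values while the unrelated feature stays close to 1. Thresholds such as 5 or 10 are common rules of thumb rather than fixed standards.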

Availability of data and materials

Data sharing is not applicable to this article as no datasets were generated or analysed during the current study.

Abbreviations

AE: Autoencoders
AUC: Area under the curve
CART: Classification and regression trees
CI: Condition index
DEGs: Differentially expressed genes
DEPs: Differentially expressed proteins
DIABLO: Data Integration Analysis for Biomarker Discovery using Latent cOmponents
KNN: K-nearest neighbour
LDA: Linear discriminant analysis
LIME: Local interpretable model-agnostic explanations
MAR: Missing at random
MCAR: Missing completely at random
MEFISTO: Method for the functional integration of spatial and temporal omics data
MFA: Multiple factor analysis
(M)CIA: (Multiple) co-inertia analysis
ML: Machine learning
MNAR: Missing not at random
MOFA: Multi-omics factor analysis
mRMR: Minimal redundancy-maximal relevance
PCA: Principal component analysis
PLS(-DA): Partial least squares (discriminant analysis)
RFE: Recursive feature elimination
RFE(CV): Recursive feature elimination (with cross-validation)
(r)CCA: (Regularized) canonical correlation analysis
scSNF: (Spectral clustering) similarity network fusion
sGCCA: Sparse generalized canonical correlation analysis
SVD: Singular value decomposition
SVM: Support vector machines
VDP: Variance decomposition proportion
VIF: Variance inflation factor
VIP: Variable importance in projection
WGCNA: Weighted gene correlation network analysis

References

  1. Breitling R. What is systems biology? Front Physiol. 2010. https://doi.org/10.3389/fphys.2010.00009/abstract.

  2. Papakonstantinou E, Pierouli K, Eliopoulos E, Vlachakis D. Introductory Chapter: Systems Biology Consolidating State of the Art Genetics and Bioinformatics. In: Vlachakis D, editor. Systems Biology [Internet]. IntechOpen; 2019 [cited 2023 Jan 4]. Available from: https://www.intechopen.com/books/systems-biology/introductory-chapter-systems-biology-consolidating-state-of-the-art-genetics-and-bioinformatics.

  3. Hillmer RA. Systems biology for biologists. PLoS Pathog. 2015;11(5):e1004786.

  4. Li W, Shao C, Zhou H, Du H, Chen H, Wan H, et al. Multi-omics research strategies in ischemic stroke: a multidimensional perspective. Ageing Res Rev. 2022;81: 101730.

  5. Bermingham KM, Brennan L, Segurado R, Barron RE, Gibney ER, Ryan MF, et al. Genetic and environmental contributions to variation in the stable urinary NMR metabolome over time: a classic twin study. J Proteome Res. 2021;20(8):3992–4000.

  6. Gruzieva O, Jeong A, He S, Yu Z, de Bont J, Pinho MGM, et al. Air pollution, metabolites and respiratory health across the life-course. Eur Respir Rev. 2022;31(165): 220038.

  7. Ritchie MD, Holzinger ER, Li R, Pendergrass SA, Kim D. Methods of integrating data to uncover genotype–phenotype interactions. Nat Rev Genet. 2015;16(2):85–97.

  8. Picard M, Scott-Boyer MP, Bodein A, Périn O, Droit A. Integration strategies of multi-omics data for machine learning analysis. Comput Struct Biotechnol J. 2021;19:3735–46.

  9. Gurke R, Bendes A, Bowes J, Koehm M, Twyman RM, Barton A, et al. Omics and multi-omics analysis for the early identification and improved outcome of patients with psoriatic arthritis. Biomedicines. 2022;10(10):2387.

  10. Ryan CJ, Cimermančič P, Szpiech ZA, Sali A, Hernandez RD, Krogan NJ. High-resolution network biology: connecting sequence with function. Nat Rev Genet. 2013;14(12):865–79.

  11. Page MJ, McKenzie JE, Bossuyt PM, Boutron I, Hoffmann TC, Mulrow CD, et al. The PRISMA 2020 statement: an updated guideline for reporting systematic reviews. BMJ. 2021;372:n71.

  12. Rohart F, Gautier B, Singh A, Lê Cao KA. mixOmics: an R package for ‘omics feature selection and multiple data integration. PLoS Comput Biol. 2017;13(11): e1005752.

  13. Uppal K, Ma C, Go YM, Jones DP. xMWAS: a data-driven integration and differential network analysis tool. Bioinformatics. 2018;34(4):701–2.

  14. Langfelder P, Horvath S. WGCNA: an R package for weighted correlation network analysis. BMC Bioinformatics. 2008;9(1):559.

  15. Wang B, Mezlini AM, Demir F, Fiume M, Tu Z, Brudno M, et al. Similarity network fusion for aggregating data types on a genomic scale. Nat Methods. 2014;11(3):333–7.

  16. Culhane AC, Perrière G, Higgins DG. Cross-platform comparison and visualisation of gene expression data using co-inertia analysis. BMC Bioinformatics. 2003;4(1):59.

  17. Meng C, Kuster B, Culhane AC, Gholami AM. A multivariate approach to the integration of multi-omics datasets. BMC Bioinformatics. 2014;15(1):162.

  18. Argelaguet R, Arnol D, Bredikhin D, Deloro Y, Velten B, Marioni JC, et al. MOFA+: a statistical framework for comprehensive integration of multi-modal single-cell data. Genome Biol. 2020;21(1):111.

  19. Lê S, Josse J, Husson F. FactoMineR: an R package for multivariate analysis. J Stat Softw. 2008;18(25):1–18.

  20. Wilkerson MD, Hayes DN. ConsensusClusterPlus: a class discovery tool with confidence assessments and item tracking. Bioinformatics. 2010;26(12):1572–3.

  21. Mo Q, Shen R, Guo C, Vannucci M, Chan KS, Hilsenbeck SG. A fully Bayesian latent variable model for integrative clustering analysis of multi-type omics data. Biostatistics. 2018;19(1):71–86.

  22. Yiu ML, Mamoulis N. Frequent-Pattern based Iterative Projected Clustering. In: Proceedings of the Third IEEE International Conference on Data Mining. USA: IEEE Computer Society; 2003. p. 689. (ICDM ‘03).

  23. Erickson N, Mueller J, Shirkov A, Zhang H, Larroy P, Li M, et al. AutoGluon-Tabular: Robust and Accurate AutoML for Structured Data [Internet]. arXiv; 2020 [cited 2025 Jan 12]. Available from: http://arxiv.org/abs/2003.06505.

  24. Zheng W, Zhang Y, Sun C, Ge S, Tan Y, Shen H, et al. A multi-omics study of human testis and epididymis. Molecules. 2021;26(11).

  25. Gao YN, Yang X, Wang JQ, Liu HM, Zheng N. Multi-omics reveal additive cytotoxicity effects of aflatoxin B1 and aflatoxin M1 toward intestinal NCM460 cells. Toxins (Basel). 2022;14(6).

  26. Yang F, Zhao LY, Yang WQ, Chao S, Ling ZX, Sun BY, et al. Quantitative proteomics and multi-omics analysis identifies potential biomarkers and the underlying pathological molecular networks in Chinese patients with multiple sclerosis. BMC Neurol. 2024;24(1):423.

  27. Dong W, Chen Y, Zhang Q, Zhao X, Liu P, He H, et al. Effects of lipoteichoic and arachidonic acids on the immune-regulatory mechanism of bovine mammary epithelial cells using multi-omics analysis. Front Vet Sci. 2022;9: 984607.

  28. Elstner M, Olszewski K, Prokisch H, Klopstock T, Murgia M. Multi-omics approach to mitochondrial DNA damage in human muscle fibers. Int J Mol Sci. 2021;22(20).

  29. Johansson M, Ulfenborg B, Andersson CX, Heydarkhan-Hagvall S, Jeppsson A, Sartipy P, et al. Multi-omics characterization of a human stem cell-based model of cardiac hypertrophy. Life (Basel). 2022;12(2).

  30. Kechavarzi BD, Wu H, Doman TN. Bottom-up, integrated -omics analysis identifies broadly dosage-sensitive genes in breast cancer samples from TCGA. PLoS ONE. 2019;14(1): e0210910.

  31. Cziesielski MJ, Liew YJ, Cui G, Schmidt-Roach S, Campana S, Marondedze C, et al. Multi-omics analysis of thermal stress response in a zooxanthellate cnidarian reveals the importance of associating with thermotolerant symbionts. Proc Biol Sci. 2018;285(1877).

  32. Zhang H, Zhao C, Zhang Y, Lu L, Shi W, Zhou Q, et al. Multi-omics analysis revealed NMBA induced esophageal carcinoma tumorigenesis via regulating PPARα signaling pathway. Environ Pollut. 2023;1(324): 121369.

  33. Xu Y, Zhang Y, Qin Y, Gu M, Chen R, Sun Y, et al. Multi-omics analysis of functional substances and expression verification in cashmere fineness. BMC Genomics. 2023;24(1):720.

  34. Jiang B, Yang J, He R, Wang D, Huang Y, Zhao G, et al. Integrated multi-omics analysis for lung adenocarcinoma in Xuanwei, China. Aging. 2023;15(23):14263–91.

  35. Leo IR, Aswad L, Stahl M, Kunold E, Post F, Erkers T, et al. Integrative multi-omics and drug response profiling of childhood acute lymphoblastic leukemia cell lines. Nat Commun. 2022;13(1):1691.

  36. Ramiro L, García-Berrocoso T, Briansó F, Goicoechea L, Simats A, Llombart V, et al. Integrative Multi-omics analysis to characterize human brain ischemia. Mol Neurobiol. 2021;58(8):4107–21.

  37. Wang Z, Xie Z, Zhang Z, Zhou W, Guo B, Li M. Multi-platform omics sequencing dissects the atlas of plasma-derived exosomes in rats with or without depression-like behavior after traumatic spinal cord injury. Prog Neuropsychopharmacol Biol Psychiatry. 2024;8(132): 110987.

  38. Gong TQ, Jiang YZ, Shao C, Peng WT, Liu MW, Li DQ, et al. Proteome-centric cross-omics characterization and integrated network analyses of triple-negative breast cancer. Cell Rep. 2022;38(9): 110460.

  39. Ding Z, Fu L, Tie W, Yan Y, Wu C, Dai J, et al. Highly dynamic, coordinated, and stage-specific profiles are revealed by a multi-omics integrative analysis during tuberous root development in cassava. J Exp Bot. 2020;71(22):7003–17.

  40. Blondel VD, Guillaume JL, Lambiotte R, Lefebvre E. Fast unfolding of communities in large networks. J Stat Mech. 2008;2008(10):P10008.

  41. Lee H, Gao Y, Ko E, Lee J, Lee HK, Lee S, et al. Nonmonotonic response of type 2 diabetes by low concentration organochlorine pesticide mixture: Findings from multi-omics in zebrafish. J Hazard Mater. 2021;15(416): 125956.

  42. Lee H, Sung EJ, Seo S, Min EK, Lee JY, Shim I, et al. Integrated multi-omics analysis reveals the underlying molecular mechanism for developmental neurotoxicity of perfluorooctanesulfonic acid in zebrafish. Environ Int. 2021;157: 106802.

  43. Na AY, Lee H, Min EK, Paudel S, Choi SY, Sim H, et al. Novel time-dependent multi-omics integration in sepsis-associated liver dysfunction. Genom Proteom Bioinform. 2023;21(6):1101–16.

  44. González I, Déjean S, Martin PGP, Baccini A. CCA: an R package to extend canonical correlation analysis. J Stat Softw. 2008;17(23):1–14.

  45. Liang S, Lu Z, Cai L, Zhu M, Zhou H, Zhang J. Multi-Omics analysis reveals molecular insights into the effects of acute ozone exposure on lung tissues of normal and obese male mice. Environ Int. 2024;1(183): 108436.

  46. Simats A, Ramiro L, García-Berrocoso T, Briansó F, Gonzalo R, Martín L, et al. A mouse brain-based multi-omics integrative approach reveals potential blood biomarkers for ischemic stroke. Mol Cell Proteomics. 2020;19(12):1921–36.

  47. Pearl J. Probabilistic Reasoning in intelligent systems: networks of plausible inference. Morgan Kaufmann; 1988.

  48. Picard D, Felsberg J, Langini M, Stachura P, Qin N, Macas J, et al. Integrative multi-omics reveals two biologically distinct groups of pilocytic astrocytoma. Acta Neuropathol. 2023;146(4):551–64.

  49. Li CX, Wheelock CE, Sköld CM, Wheelock ÅM. Integration of multi-omics datasets enables molecular classification of COPD. Eur Respir J. 2018;51(5):1701930.

  50. Li S, Dragan I, Tran VDT, Fung CH, Kuznetsov D, Hansen MK, et al. Multi-omics subgroups associated with glycaemic deterioration in type 2 diabetes: an IMI-RHAPSODY Study. Front Endocrinol (Lausanne). 2024;15:1350796.

  51. Ruan P, Todd JL, Zhao H, Liu Y, Vinisko R, Soellner JF, et al. Integrative multi-omics analysis reveals novel idiopathic pulmonary fibrosis endotypes associated with disease progression. Respir Res. 2023;24(1):141.

  52. Scisciola L, Chianese U, Caponigro V, Basilicata MG, Salviati E, Altucci L, et al. Multi-omics analysis reveals attenuation of cellular stress by empagliflozin in high glucose-treated human cardiomyocytes. J Transl Med. 2023;21(1):662.

  53. Clark C, Dayon L, Masoodi M, Bowman GL, Popp J. An integrative multi-omics approach reveals new central nervous system pathway alterations in Alzheimer’s disease. Alzheimers Res Ther. 2021;13(1):71.

  54. Titz B, Szostak J, Sewer A, Phillips B, Nury C, Schneider T, et al. Multi-omics systems toxicology study of mouse lung assessing the effects of aerosols from two heat-not-burn tobacco products and cigarette smoke. Comput Struct Biotechnol J. 2020;18:1056–73.

  55. Armenteros JJA, Brorsson C, Johansen CH, Banasik K, Mazzoni G, Moulder R, et al. Multi-omics analysis reveals drivers of loss of β-cell function after newly diagnosed autoimmune type 1 diabetes: an INNODIA multicenter study. Diabetes Metab Res Rev. 2024;40(5): e3833.

  56. Aydin S, Pham DT, Zhang T, Keele GR, Skelly DA, Paulo JA, et al. Genetic dissection of the pluripotent proteome through multi-omics data integration. Cell Genomics. 2023;3(4). Available from: https://www.cell.com/cell-genomics/abstract/S2666-979X(23)00043-5.

  57. Edelbroek B, Westholm JO, Bergquist J, Söderbom F. Multi-omics analysis of aggregative multicellularity. iScience. 2024;27(9): 110659.

  58. Argelaguet R, Velten B, Arnol D, Dietrich S, Zenz T, Marioni JC, et al. Multi-Omics Factor Analysis—a framework for unsupervised integration of multi-omics data sets. Mol Syst Biol. 2018. https://doi.org/10.15252/msb.20178124.

  59. Park JC, Barahona-Torres N, Jang SY, Mok KY, Kim HJ, Han SH, et al. Multi-omics-based autophagy-related untypical subtypes in patients with cerebral amyloid pathology. Adv Sci (Weinh). 2022;9(23): e2201212.

  60. Velten B, Braunger JM, Argelaguet R, Arnol D, Wirbel J, Bredikhin D, et al. Identifying temporal and spatial patterns of variation from multimodal data using MEFISTO. Nat Methods. 2022;19(2):179–86.

  61. Gisby JS, Buang NB, Papadaki A, Clarke CL, Malik TH, Medjeral-Thomas N, et al. Multi-omics identify falling LRRC15 as a COVID-19 severity marker and persistent pro-thrombotic signals in convalescence. Nat Commun. 2022;13(1):7775.

  62. Benedetto A, Robotti E, Belay MH, Ghignone A, Fabbris A, Goggi E, et al. Multi-omics approaches for freshness estimation and detection of illicit conservation treatments in sea bass (Dicentrarchus Labrax): data fusion applications. Int J Mol Sci. 2024;25(3):1509.

  63. Faugere J, Brunet TA, Clément Y, Espeyte A, Geffard O, Lemoine J, et al. Development of a multi-omics extraction method for ecotoxicology: investigation of the reproductive cycle of Gammarus fossarum. Talanta. 2022;28(253): 123806.

  64. Singh A, Shannon CP, Gautier B, Rohart F. DIABLO: an integrative approach for identifying key molecular drivers from multi-omics assays.

  65. Li S, Alfaro AC, Nguyen TV, Young T, Lulijwa R. An integrated omics approach to investigate summer mortality of New Zealand Greenshell™ mussels. Metabolomics. 2020;16(9):100.

  66. Chappell K, Manna K, Washam CL, Graw S, Alkam D, Thompson MD, et al. Multi-omics data integration reveals correlated regulatory features of triple negative breast cancer. Mol Omics. 2021;17(5):677–91.

  67. Poussin C, Titz B, Xiang Y, Baglia L, Berg R, Bornand D, et al. Blood and urine multi-omics analysis of the impact of e-vaping, smoking, and cessation: from exposome to molecular responses. Sci Rep. 2024;14(1):4286.

  68. Rushing BR. Unlocking the molecular secrets of antifolate drug resistance: a multi-omics investigation of the NCI-60 cell line panel. Biomedicines. 2023;11(9):2532.

  69. Ivanova L, Rangel-Huerta OD, Tartor H, Dahle MK, Uhlig S, Fæste CK. Metabolomics and multi-omics determination of potential plasma biomarkers in PRV-1-infected atlantic salmon. Metabolites. 2024;14(7):375.

  70. Ribeiro DM, Palma M, Salvado J, Hernández-Castellano LE, Capote J, Castro N, et al. Goat mammary gland metabolism: an integrated Omics analysis to unravel seasonal weight loss tolerance. J Proteomics. 2023;30(289): 105009.

  71. Chepy A, Vivier S, Bray F, Ternynck C, Meneboo JP, Figeac M, et al. Effects of immunoglobulins g from systemic sclerosis patients in normal dermal fibroblasts: a multi-omics study. Front Immunol. 2022;13: 904631.

  72. Khalyfa A, Marin JM, Sanz-Rubio D, Lyu Z, Joshi T, Gozal D. Multi-omics analysis of circulating exosomes in adherent long-term treated OSA patients. Int J Mol Sci. 2023;24(22):16074.

  73. Tenenhaus A, Philippe C, Guillemot V, Le Cao KA, Grill J, Frouin V. Variable selection for generalized canonical correlation analysis. Biostatistics. 2014;15(3):569–83.

  74. Spick M, Campbell A, Baricevic-Jones I, von Gerichten J, Lewis HM, Frampas CF, et al. Multi-omics reveals mechanisms of partial modulation of COVID-19 dysregulation by glucocorticoid treatment. Int J Mol Sci. 2022;23(20).

  75. Carapito R, Carapito C, Morlon A, Paul N, Vaca Jacome AS, Alsaleh G, et al. Multi-OMICS analyses unveil STAT1 as a potential modifier gene in mevalonate kinase deficiency. Ann Rheum Dis. 2018;77(11):1675–87.

  76. Shashikadze B, Flenkenthaler F, Kemter E, Franzmeier S, Stöckl JB, Haid M, et al. Multi-omics analysis of diabetic pig lungs reveals molecular derangements underlying pulmonary complications of diabetes mellitus. Dis Models Mech. 2024;17(7):dmm050650.

  77. Ichikawa A, Miki D, Hayes CN, Teraoka Y, Nakahara H, Tateno C, et al. Multi-omics analysis of a fatty liver model using human hepatocyte chimeric mice. Sci Rep. 2024;14(1):3362.

  78. Zoppi J, Guillaume JF, Neunlist M, Chaffron S. MiBiOmics: an interactive web application for multi-omics data exploration and integration. BMC Bioinform. 2021;22(1):6.

  79. Liu Z, Zhao Y, Kong P, Liu Y, Huang J, Xu E, et al. Integrated multi-omics profiling yields a clinically relevant molecular classification for esophageal squamous cell carcinoma. Cancer Cell. 2023;41(1):181-195.e9.

  80. Chong W, Zhu X, Ren H, Ye C, Xu K, Wang Z, et al. Integrated multi-omics characterization of KRAS mutant colorectal cancer. Theranostics. 2022;12(11):5138–54.

  81. Eteleeb AM, Novotny BC, Tarraga CS, Sohn C, Dhungel E, Brase L, et al. Brain high-throughput multi-omics data reveal molecular heterogeneity in Alzheimer’s disease. PLoS Biol. 2024;22(4): e3002607.

  82. Anwar MY, Highland H, Buchanan VL, Graff M, Young K, Taylor KD, et al. Machine learning-based clustering identifies obesity subgroups with differential multi-omics profiles and metabolic patterns. Obesity. 2024;32(11):2024–34.

  83. Gillenwater LA, Helmi S, Stene E, Pratte KA, Zhuang Y, Schuyler RP, et al. Multi-omics subtyping pipeline for chronic obstructive pulmonary disease. PLoS ONE. 2021;16(8): e0255337.

  84. Wang K, Khoramjoo M, Srinivasan K, Gordon PMK, Mandal R, Jackson D, et al. Sequential multi-omics analysis identifies clinical phenotypes and predictive biomarkers for long COVID. CR Med [Internet]. 2023;4(11). Available from: https://www.cell.com/cell-reports-medicine/abstract/S2666-3791(23)00431-7.

  85. Khadirnaikar S, Shukla S, Prasanna SRM. Integration of pan-cancer multi-omics data for novel mixed subgroup identification using machine learning methods. PLoS ONE. 2023;18(10):e0287176.

  86. Horvatić A, Gelemanović A, Pirkić B, Smolec O, Beer Ljubić B, Rubić I, et al. Multi-omics approach to elucidate cerebrospinal fluid changes in dogs with intervertebral disc herniation. Int J Mol Sci. 2021;22(21):11678.

  87. Huang Q, Zhang X, Hu Z. Application of artificial intelligence modeling technology based on multi-omics in noninvasive diagnosis of inflammatory bowel disease. J Inflamm Res. 2021;14:1933–43.

  88. Li Y, Hou G, Zhou H, Wang Y, Tun HM, Zhu A, et al. Multi-platform omics analysis reveals molecular signature for COVID-19 pathogenesis, prognosis and drug target discovery. Signal Transduct Target Ther. 2021;6(1):155.

  89. Zeng H, Chen L, Zhang M, Luo Y, Ma X. Integration of histopathological images and multi-dimensional omics analyses predicts molecular features and prognosis in high-grade serous ovarian cancer. Gynecol Oncol. 2021;163(1):171–80.

  90. Fontanilles M, Heisbourg JD, Daban A, Fiore FD, Pépin LF, Marguet F, et al. Metabolic remodeling in glioblastoma: a longitudinal multi-omics study. Acta Neuropathol Commun. 2024;12(12):162.

  91. Valsesia A, Chakrabarti A, Hager J, Langin D, Saris WHM, Astrup A, et al. Integrative phenotyping of glycemic responders upon clinical weight loss using multi-omics. Sci Rep. 2020;10(1):9236.

  92. Wang Y, Huang X, Li F, Jia X, Jia N, Fu J, et al. Serum-integrated omics reveal the host response landscape for severe pediatric community-acquired pneumonia. Crit Care. 2023;27(1):79.

  93. Han Y, Zeng X, Hua L, Quan X, Chen Y, Zhou M, et al. The fusion of multi-omics profile and multimodal EEG data contributes to the personalized diagnostic strategy for neurocognitive disorders. Microbiome. 2024;12(1):12.

  94. Bai W, Li C, Li W, Wang H, Han X, Wang P, et al. Machine learning assists prediction of genes responsible for plant specialized metabolite biosynthesis by integrating multi-omics data. BMC Genomics. 2024;25(1):418.

  95. Eicher T, Kinnebrew G, Patt A, Spencer K, Ying K, Ma Q, et al. Metabolomics and multi-omics integration: a survey of computational methods and resources. Metabolites. 2020;10(5):202.

  96. Kumar N, Hoque MdA, Sugimoto M. Robust volcano plot: identification of differential metabolites in the presence of outliers. BMC Bioinform. 2018;19(1):128.

  97. Sperisen P, Cominetti O, Martin FPJ. Longitudinal omics modeling and integration in clinical metabonomics research: challenges in childhood metabolic health research. Front Mol Biosci. 2015. https://doi.org/10.3389/fmolb.2015.00044/abstract.

  98. Mirza B, Wang W, Wang J, Choi H, Chung NC, Ping P. Machine learning and integrative analysis of biomedical big data. Genes. 2019;10(2):87.

  99. Krassowski M, Das V, Sahu SK, Misra BB. State of the field in multi-omics research: from computational needs to data mining and sharing. Front Genet. 2020;11:610798.

  100. Panda BS, Kumar Adhikari R. A Method for Classification of Missing Values using Data Mining Techniques. In: 2020 International Conference on Computer Science, Engineering and Applications (ICCSEA) [Internet]. Gunupur, India: IEEE; 2020 [cited 2023 Jan 13]. p. 1–5. Available from: https://ieeexplore.ieee.org/document/9132935/.

  101. Taylor SL, Ruhaak LR, Kelly K, Weiss RH, Kim K. Effects of imputation on correlation: implications for analysis of mass spectrometry data from multiple biological matrices. Brief Bioinform. 2016;bbw010.

  102. Hughes RA, Heron J, Sterne JAC, Tilling K. Accounting for missing data in statistical analyses: multiple imputation is not always the answer. Int J Epidemiol. 2019;48(4):1294–304.

  103. Gardner ML, Freitas MA. Multiple imputation approaches applied to the missing value problem in bottom-up proteomics. Int J Mol Sci. 2021;22(17):9650.

  104. Zitnik M, Nguyen F, Wang B, Leskovec J, Goldenberg A, Hoffman MM. Machine learning for integrating data in biology and medicine: principles, practice, and opportunities. Information Fusion. 2019;50:71–91.

  105. Kim JH. Multicollinearity and misleading statistical results. Korean J Anesthesiol. 2019;72(6):558–69.

  106. Johnstone IM, Titterington DM. Statistical challenges of high-dimensional data. Phil Trans R Soc A. 2009;367(1906):4237–53.

  107. Defernez M, Kemsley EK. The use and misuse of chemometrics for treating classification problems. TrAC, Trends Anal Chem. 1997;16(4):216–21.

  108. Idri A, Benhar H, Fernández-Alemán JL, Kadi I. A systematic map of medical data preprocessing in knowledge discovery. Comput Methods Programs Biomed. 2018;162:69–85.

  109. Li M, Hameed I, Cao D, He D, Yang P. Integrated omics analyses identify key pathways involved in petiole rigidity formation in sacred lotus. Int J Mol Sci. 2020;21(14):5087.

  110. Rodríguez-Pérez R, Fernández L, Marco S. Overoptimism in cross-validation when using partial least squares-discriminant analysis for omics data: a systematic study. Anal Bioanal Chem. 2018;410(23):5981–92.

  111. Brereton RG, Lloyd GR. Partial least squares discriminant analysis: taking the magic away: PLS-DA: taking the magic away. J Chemometrics. 2014;28(4):213–25.

  112. Lualdi M, Fasano M. Statistical analysis of proteomics data: a review on feature selection. J Proteomics. 2019;198:18–26.

  113. Lê Cao KA, Boitard S, Besse P. Sparse PLS discriminant analysis: biologically relevant feature selection and graphical displays for multiclass problems. BMC Bioinformatics. 2011;12(1):253.

  114. Ribeiro MT, Singh S, Guestrin C. “Why Should I Trust You?”: explaining the predictions of any classifier [Internet]. arXiv; 2016 [cited 2023 May 5]. Available from: http://arxiv.org/abs/1602.04938.

  115. Cakiroglu C, Demir S, Hakan Ozdemir M, Latif Aylak B, Sariisik G, Abualigah L. Data-driven interpretable ensemble learning methods for the prediction of wind turbine power incorporating SHAP analysis. Expert Syst Appl. 2024;1(237): 121464.

  116. Stein LD. The case for cloud computing in genome informatics. Genome Biol. 2010;11(5):207.

  117. Paolillo C, Londin E, Fortina P. Single-cell genomics. Clin Chem. 2019;65(8):972–85.

  118. Jehan Z. Chapter 1—single-cell omics: an overview. In: Barh D, Azevedo V, editors. Single-cell omics. Academic Press; 2019. p. 3–19.

  119. Vandereyken K, Sifrim A, Thienpont B, Voet T. Methods and applications for single-cell and spatial multi-omics. Nat Rev Genet. 2023;24(8):494–515.

  120. Eisenstein M. Seven technologies to watch in 2022. Nature. 2022;601(7894):658–61.

    Article  PubMed  Google Scholar 

  121. Ma A, McDermaid A, Xu J, Chang Y, Ma Q. Integrative methods and practical challenges for single-cell multi-omics. Trends Biotechnol. 2020;38(9):1007–22.

    Article  PubMed  PubMed Central  Google Scholar 

  122. Borah K, Das HS, Seth S, Mallick K, Rahaman Z, Mallik S. A review on advancements in feature selection and feature extraction for high-dimensional NGS data analysis. Funct Integr Genomics. 2024;24(5):139.

    Article  PubMed  Google Scholar 

  123. Got A, Zouache D, Moussaoui A, Abualigah L, Alsayat A. Improved manta ray foraging optimizer-based SVM for feature selection problems: a medical case study. J Bionic Eng. 2024;21(1):409–25.

    Article  Google Scholar 

  124. Zouache D, Got A, Alarabiat D, Abualigah L, Talbi EG. A novel multi-objective wrapper-based feature selection method using quantum-inspired and swarm intelligence techniques. Multimed Tools Appl. 2024;83(8):22811–35.

    Article  Google Scholar 

  125. Phan JH, Quo CF, Cheng C, Wang MD. Multiscale integration of -omic, imaging, and clinical data in biomedical informatics. IEEE Rev Biomed Eng. 2012;5:74–87.

    Article  PubMed  PubMed Central  Google Scholar 

  126. Saki N, Haybar H, Aghaei M. Subject: motivation can be suppressed, but scientific ability cannot and should not be ignored. J Transl Med. 2023;21(1):520.

    Article  PubMed  PubMed Central  Google Scholar 

  127. Eftekhar Z, Aghaei M, Saki N. DNA damage repair in megakaryopoiesis: molecular and clinical aspects. Expert Rev Hematol. 2024;17(10):705–12.

    Article  PubMed  Google Scholar 

  128. Aghapour SA, Torabizadeh M, Bahreiny SS, Saki N, Jalali Far MA, Yousefi-Avarvand A, et al. Investigating the dynamic interplay between cellular immunity and tumor cells in the fight against cancer: an updated comprehensive review. Iran J Blood Cancer. 2024;16(2):84–101.

    Article  Google Scholar 

  129. Goetz LH, Schork NJ. Personalized medicine: motivation, challenges, and progress. Fertil Steril. 2018;109(6):952–63.

    Article  PubMed  PubMed Central  Google Scholar 

  130. Costello JC, Heiser LM, Georgii E, Gönen M, Menden MP, et al. A community effort to assess and improve drug sensitivity prediction algorithms. Nat Biotechnol. 2014;32(12):1202–12.

    Article  PubMed  PubMed Central  Google Scholar 

  131. Vizcaíno JA, Deutsch EW, Wang R, Csordas A, Reisinger F, Ríos D, et al. ProteomeXchange provides globally coordinated proteomics data submission and dissemination. Nat Biotechnol. 2014;32(3):223–6.

    Article  PubMed  PubMed Central  Google Scholar 

  132. Haug K, Cochrane K, Nainala VC, Williams M, Chang J, Jayaseelan KV, et al. MetaboLights: a resource evolving in response to the needs of its scientific community. Nucleic Acids Res. 2020;48(D1):D440–4.

    PubMed  Google Scholar 

  133. Sud M, Fahy E, Cotter D, Azam K, Vadivelu I, Burant C, et al. Metabolomics Workbench: an international repository for metabolomics data and metadata, metabolite standards, protocols, tutorials and training, and analysis tools. Nucleic Acids Res. 2016;44(D1):D463-470.

    Article  PubMed  Google Scholar 

Download references

Acknowledgements

Not applicable.

Funding

Not applicable.

Author information

Contributions

AM was in charge of paper retrieval, conceptualization, and writing of the original manuscript. LB and MF also contributed to conceptualization and supervised the manuscript drafting and organization. All the authors read, revised and approved the final manuscript.

Corresponding author

Correspondence to Aurelia Morabito.

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.

About this article

Cite this article

Morabito, A., De Simone, G., Pastorelli, R. et al. Algorithms and tools for data-driven omics integration to achieve multilayer biological insights: a narrative review. J Transl Med 23, 425 (2025). https://doi.org/10.1186/s12967-025-06446-x

  • DOI: https://doi.org/10.1186/s12967-025-06446-x

Keywords