Introduction

Cardiovascular disease remains a leading cause of morbidity and mortality worldwide [1], necessitating innovative approaches to unravel its etiologies. -Omics approaches enable data-driven strategies that can capture the multifaceted nature of disease drivers. Modern molecular biology has become a largely technology-driven endeavor [2], and with each generation of advancements in -omics technologies, science has achieved deeper insights into disease mechanisms.

Multi-omic approaches, which integrate data across different -omics modalities, offer a strategy to disentangle the processes driving cardiac conditions beyond what a single modality can achieve in isolation. This is especially the case when considering the relative strengths and weaknesses of current technological methods to measure different -omics modalities. This review discusses the information that can be gained from the various modalities of -omics data, with a primary focus on the transcriptome and the proteome, and the benefits of combining them. We highlight current advances in -omic technologies, data integration strategies, and their recent applications in cardiac research.

Moving Beyond the Genome

The initial wave of insights in heart disease from -omics methodologies emerged from the field of genomics, which significantly advanced our understanding of inherited cardiac diseases. Genomic studies have elucidated critical proteins and pathways involved in the etiology of cardiac diseases, providing valuable information from both monogenic disorders and association studies that identify genetic predispositions to various cardiac conditions. For instance, studies of monogenic disease have classified key genes encoding ion channels or their interactors as definitively causative of long QT syndrome [3]. Genome-wide association studies (GWAS) have increased our understanding of heritable disorders of complex etiologies. An early example was the association of variations in BAG3 with dilated cardiomyopathy [4], but there are now insights into many cardiac diseases including heart failure [5] and atrial fibrillation [6]. Many variants associated with cardiac disease, however, remain of unknown significance or lack annotated causal genes, and are thus of limited clinical utility until we can determine their pathogenicity and characterise their mechanistic roles.

For acquired or lifestyle-associated cardiac diseases, insights must be derived from molecular layers beyond the genome. In contrast to the largely static inherited DNA sequence, the downstream transcriptome, and in turn the proteome, change according to the interaction between an individual’s genome and environment. These molecular modalities therefore provide more information with which to elucidate pathophysiological processes (Fig. 1A).

Fig. 1

A. Research scientists commonly aim to explain the molecular mechanisms underpinning phenotypes. The “central dogma” of biology explains how information encoded in the genome is first transcribed and then translated into proteins that ultimately determine phenotypes. However, at each stage additional information is incorporated (with some exemplar mechanisms shown), meaning that measurements made further downstream correspond more directly to phenotypic outcomes. B. The correlation between transcript and protein abundance measured in the heart tissue of a single donor by Wang et al. [13], where each datapoint corresponds to a single gene. For some genes, such as PKP2 in the upper inset, transcript levels strongly correlate with protein abundance (R2 = 0.78), indicating that most regulation occurs at the transcriptional level. For others, such as PRDX1 in the lower inset, transcript levels show little correlation with protein abundance (R2 = 0.04), suggesting that post-transcriptional regulation plays a larger role. C. From samples of cardiac tissue, -omics methods can be applied to bulk samples, or variants now allow measurements to be resolved by cell or within space.

Bulk Transcriptomics and Proteomics

Bulk Transcriptomics

Transcriptomics technologies followed on the back of genomic technologies and are likewise facilitated by the ability to amplify small amounts of starting material using the polymerase chain reaction (PCR). RNAseq has become the mature technology of choice, being relatively affordable and now also widely accessible. Moreover, mature pipelines are available for analysis of the data generated [7, 8]. Bulk approaches are used here to refer to those that measure the total amount of a given transcript within a sample, such as a tissue biopsy, in effect averaging gene expression across all constituent parts.

We can begin to assess cardiac dysregulation in various disease states through bulk transcriptomics data obtained from human cardiac biopsies of diverse patient backgrounds. RNAseq has been used to characterise subtypes within cardiomyopathies and has revealed distinct gene expression signatures unique to heart failure with preserved and reduced ejection fraction [9], as well as specific expression profiles in dilated and ischemic cardiomyopathy patient biopsies [10]. Analysing gene expression patterns associated with specific heart conditions enables us to characterise these diseases and gain insights into their potential underlying causes.

The underlying assumption of profiling mRNA to identify genes with altered expression is that mRNA abundance serves as a good predictor of the corresponding protein levels. However, the correlation between mRNA expression levels and protein abundance is relatively weak, with typical Pearson correlation coefficients around 0.4 [11, 12]. Figure 1B shows the abundance of mRNA plotted against that of the corresponding protein, measured in heart tissue by Wang et al. [13]. For some genes, such as PKP2 in the upper inset, mRNA abundance is excellent at predicting protein abundance, whereas for others, such as PRDX1 in the lower inset, mRNA abundance has little predictive value. Whilst these measurements were taken in steady-state conditions, there fortunately appears to be a better correlation between mRNA and protein levels when evaluating changes across conditions. Nonetheless, one way to address this problem is to measure protein abundance directly, rather than to utilise mRNA abundance as a proxy.
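To make the distinction concrete, the sketch below shows how such correlations could be computed, assuming hypothetical tables of log-transformed abundances; the variable names and data layout are illustrative and do not reflect the processing used by Wang et al. [13].

```python
import pandas as pd
from scipy.stats import pearsonr

# Illustrative only: 'mrna' and 'protein' are hypothetical DataFrames of
# log-transformed abundances (rows = samples, columns = genes), restricted
# to genes quantified in both modalities and with rows in the same order.

def across_gene_correlation(mrna: pd.DataFrame, protein: pd.DataFrame, sample: str) -> float:
    """Correlation across genes within a single sample."""
    r, _ = pearsonr(mrna.loc[sample], protein.loc[sample])
    return r

def per_gene_r_squared(mrna: pd.DataFrame, protein: pd.DataFrame) -> pd.Series:
    """R^2 between mRNA and protein for each gene, computed across samples."""
    shared = mrna.columns.intersection(protein.columns)
    return pd.Series({gene: pearsonr(mrna[gene], protein[gene])[0] ** 2 for gene in shared})
```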

Bulk Proteomics

Proteomics presents a significantly greater technical challenge than transcriptomics. Firstly, there is currently no equivalent of PCR with which to amplify the protein present in a sample. Quantification is also much more difficult due to the higher dynamic range of protein expression in comparison to transcripts, encompassing an additional ~ 3 orders of magnitude [14]. The heart presents a particular challenge here: owing to the specialisation of the tissue for contractile and supporting functions, the 10 most abundant proteins alone comprise almost 20% of the total measured protein abundance.

Mass spectrometry (MS) has emerged as the method of choice for proteomics studies, allowing unbiased and (at least in theory) system wide identification and quantification of proteins. Recent improvements in MS technology have significantly increased the number of proteins that can be quantified, with current capabilities exceeding 10,000 protein groups in the heart [15], while quantification of 6,000–7,000 proteins in cardiac tissue is achievable using routine workflows by specialist groups [16,17,18,19].

Using MS-based proteomics on human cardiac biopsies, researchers can distinguish proteomic profiles of cardiomyopathies and track changes associated with disease progression. This approach has been employed to distinguish cardiac proteome profiles of ischemic from dilated cardiomyopathies [20] and to differentiate amyloidosis subtypes [21]. In the same manner as with transcriptomics data, altered molecular pathways can be identified based upon the relative changes in protein abundance between disease and control groups. For instance, this has led to insights into the involvement of DNA damage processes in arrhythmogenic cardiomyopathy [17] and into inflammation driving the electrical remodeling in sinus node dysfunction in heart failure [16].

Protein abundance is a correlate of function, but a protein's activity is also determined by its suite of post-translational modifications and by the other proteins with which it interacts. Beyond quantifying protein abundance, proteomics can also be used to quantify changes in post-translational modifications to proteins [22, 23]. In cardiac pathologies, quantitative phosphoproteomics has been used to outline how signaling networks in failing hearts are re-wired with beta-blocker and ACE inhibitor treatment [24] and to unveil a mechanism for exercise- or catecholamine-induced arrhythmias in arrhythmogenic cardiomyopathy [25]. The methodology also enables the mapping of phosphorylation-mediated signaling networks in cardiac tissue or cells [26, 27]. Proteomics has likewise been used to outline protein–protein interaction networks of importance for cardiac physiology [28].

Biomarkers

Whilst measurement of molecular abundance within tissue is essential to the study of disease mechanisms, proteomics is increasingly employed in peripheral blood serum or plasma in order to identify biomarkers that are predictive of disease diagnosis or prognosis. Peripheral blood contains factors that are released by the heart, but this signal is confounded by simultaneous release and uptake by other tissues. A canonical approach to biomarker discovery is therefore to identify factors that are secreted into the circulation and whose expression is specific to heart tissue, as was the case, for example, for cardiac troponins [29]. By integrating our existing knowledge in this manner to improve specificity for cardiac disease, we can help to avoid some of the issues of the low statistical power with which biomarker studies have historically been performed [30]. This is especially the case when studies are performed directly in humans, who exhibit higher variation than that seen in controlled experiments utilising model organisms.

Large scale biobanks are changing this paradigm, enabling biomarker discovery at population scale [31]. Analogies can be drawn to how the genomics field addressed the poor reproducibility of small scale genetic association studies [32] to pave the way for modern GWAS with their much greater statistical rigour [33]. The chief challenge of proteomics in blood plasma is the even higher dynamic range than in heart tissue [30]. Again, technological advancements have made proteomics feasible at this scale, through either aptamer-based (e.g. SomaScan (SomaLogic, USA); [34]) or antibody-based proteomics (e.g. Olink (Olink Proteomics, Sweden); [35]), or advances in sample preparation before mass spectrometry based proteomics (e.g. SEER Proteograph (SEER, USA); [36]). The UK Biobank plasma proteome project has led the way, measuring 1463 plasma proteins in 54,306 individuals [31]. Schuermans and colleagues leverage this dataset in order to calculate composite biomarker scores for 4 common cardiac diseases [37]. This approach can be particularly powerful with the integration of genetic sequence data, which allows the use of Mendelian randomisation methods to assess which protein-disease relationships are supported by causal evidence. These technologies can also be employed in targeted patient populations, as exemplified by the examination of the impact of the SGLT2 inhibitor empagliflozin upon the circulating proteome during the EMPEROR heart failure clinical trials [38,39,40].
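As a sketch of the Mendelian randomisation step mentioned above, the snippet below implements a basic fixed-effect inverse-variance-weighted (IVW) estimate from per-instrument summary statistics. The arrays are illustrative, and the sketch omits the instrument selection, allele harmonisation and sensitivity analyses that a real analysis such as that of Schuermans et al. [37] requires.

```python
import numpy as np

def ivw_mr(beta_exposure, beta_outcome, se_outcome):
    """Fixed-effect inverse-variance-weighted Mendelian randomisation estimate.

    Each element corresponds to one genetic instrument (e.g. a pQTL for the
    protein of interest): its effect on the exposure (protein level), its
    effect on the outcome (disease), and the standard error of the latter.
    """
    beta_exposure = np.asarray(beta_exposure, dtype=float)
    beta_outcome = np.asarray(beta_outcome, dtype=float)
    se_outcome = np.asarray(se_outcome, dtype=float)

    wald_ratios = beta_outcome / beta_exposure      # per-instrument causal estimates
    weights = (beta_exposure / se_outcome) ** 2     # inverse variance of each Wald ratio
    estimate = np.sum(wald_ratios * weights) / np.sum(weights)
    se = np.sqrt(1.0 / np.sum(weights))
    return estimate, se
```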

Beyond Assuming Tissue Homogeneity

Historically, bulk approaches as discussed so far have been necessary in order to gather enough input material for -omics methods. They continue to have notable advantages in data quality when quantifying species abundance, minimising technical noise and, particularly in proteomics, increasing measurement “depth”: the number of analytes quantified in an experiment.

However, bulk approaches make the (patently incorrect) assumption that species abundance is homogeneous throughout a tissue sample, averaging signals from mixed cell populations and across different regions in space. Cell type specific changes may therefore be masked, particularly as the signal from rare but pathophysiologically relevant cell types or states is diluted. The interpretation of bulk data thus benefits from complementary cell-specific approaches to provide this insight (Fig. 1C).

Cell-type Resolved Measurements

Single Cell Transcriptomics

Advances in multiplexing in transcriptomics methods have revolutionised cardiac research, enabling gene expression to be measured within single cells [41]. This has allowed refined characterisations of cardiac cell populations, such as outlining transitions in fibroblast states in failing hearts [42]. It has also provided insights beyond cardiomyocytes and fibroblasts in failing hearts [43], as well as expanded our understanding of the role of inflammatory cells in atrial fibrillation [44].

The progression in the field has involved first mapping the cell populations present in the heart, followed by detailed information on regional and anatomical niches, and then studying adaptations in specific cell populations in cardiac disease states. High-quality sample preparation is a prerequisite for single-cell transcriptomics measurements. Due in large part to the fibrous nature of heart tissue, dissociating individual cells across all cell types using a single dissociation approach is challenging. An alternative approach is to isolate nuclei and sequence nuclear transcripts, which however results in the loss of all extranuclear RNA content. Each isolation process will lead to over- and under-representation of certain cell types. Several isolation approaches have been combined to create an atlas describing the cells of the healthy human heart [43, 45, 46]. These datasets represent a major technological achievement and comprise huge numbers of cells and nuclei (704,296 in the current version of the heart cell atlas [46]), which has vastly increased our understanding of the heterogeneity within heart regions and between different cell types. Atlas approaches are also being extended to include disease states in the heart. For example, Reichart et al. present a cellular atlas of dilated cardiomyopathy containing 880,000 nuclei from 61 patients with non-ischaemic cardiomyopathy (with or without annotated pathogenic genetic variants) and 18 controls [47].

Currently, there remains a tradeoff between (very) high degrees of multiplexing, allowing the analysis of many cells or nuclei in parallel, on the one hand, and the introduction of technical artefacts in the data (such as “sparsity”: a disproportionate number of zeros in a dataset) on the other. The number of cells or nuclei sequenced and the number of patients are likely to increase with current technological developments focusing on more convenient sample preparation protocols. These advancements will allow these methods to scale towards bigger and more clinically relevant studies. While a higher number of cells or nuclei is desirable, it is important for statistical rigour that the relevant n remains the biological sample size, and not the number of cells or nuclei in an experiment [48].
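One common way to respect this principle, sketched below under the assumption of a hypothetical long-format count table, is to aggregate single-cell or single-nucleus counts into per-patient “pseudobulk” profiles before any differential testing, so that the patient rather than the cell is treated as the unit of replication.

```python
import pandas as pd

# Illustrative sketch only: 'counts' is a hypothetical long-format table with
# columns patient, cell_type, gene and count. Summing counts per patient and
# cell type yields pseudobulk profiles whose rows correspond to biological,
# not cellular, replicates.
def pseudobulk(counts: pd.DataFrame) -> pd.DataFrame:
    return (
        counts.groupby(["patient", "cell_type", "gene"], as_index=False)["count"]
        .sum()
        .pivot_table(index=["patient", "cell_type"], columns="gene",
                     values="count", fill_value=0)
    )
```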

Single Cell Proteomics

A generational change has recently occurred in mass spectrometers [49, 50], representing an “inflection point” [51] in our ability to perform proteomics with increasingly low amounts of input material. Coupled with sample preparation [52] and cell sorting technologies [53], this is beginning to make single cell proteomics feasible without severe compromises in measurement depth. Single cell proteomics has now, for example, been employed to characterise heterogeneity within induced pluripotent stem cells [54]. Although still emerging, these technologies are beginning to be applied in cardiac research, with Kreimer and colleagues measuring > 1000 protein groups in 92 single cardiomyocytes isolated from a single human donor at a throughput of 96 cells per day [55]. The major technical challenge lies in liberating single cells from cardiac tissue without introducing a selection bias. Beyond the fibrous nature of cardiac tissue, the large size of cardiomyocytes adds a layer of complexity to isolating and sorting cells, as they are too big for traditional FACS or MACS approaches. Downstream sample preparation before MS analysis is becoming increasingly automated, which is crucial to minimise the technical variation that would otherwise be introduced during manual handling [53]. Quantifying protein changes at cellular resolution in cardiac disease states will be the next leap forward, likely within the next couple of years.

Spatially Resolved Measurements

Spatial Transcriptomics

Spatial resolution provides additional value beyond the cellular context of gene expression and facilitates the generation and testing of hypotheses about how the local (micro)environment influences gene expression. This is done under the heuristic that proximity is a good proxy for interaction. The experimental possibilities that these methods have made tractable led to spatial transcriptomics being declared Nature Method of the Year in 2021 [56]. Spatial transcriptomics was first applied in the human heart by Kuppe et al., who used it in combination with single cell transcriptomics to evaluate changes in cell type and cell state composition after myocardial infarction, and how these vary with proximity both to other cell types and to the ischaemic zone [57]. Kanemaru and colleagues use a similar approach to map cells to microanatomical structures such as the sinoatrial node, providing new insight into the compartmentalisation of cells within it [46].

Spatial Proteomics

Spatial proteomics is an emerging field that enables quantitative measurement of protein distribution within tissue sections, analogous to approaches used in spatial transcriptomics. The potential value of this information is again underscored by spatial proteomics being declared Nature Method of the Year, this time in 2024 [58].

Just as the advancement of mass spectrometers and sample preparation workflows has enabled single cell proteomics, the suitability of these methods for low amounts of input material means they can equally be employed to resolve protein abundance in space. This is achieved most simply using the established method of laser microdissection [59] to excise regions of interest from tissue sections before proceeding with MS-based proteomic workflows [60]. Crucially, the increased sensitivity of modern methods means that extensive pooling of tissue areas or samples is no longer required and that spatial resolution can be high enough to deliver genuine insight, even at cellular resolution. This enables, for example, the characterisation of distinct cell types while preserving their spatial relationships, offering valuable insights into cellular interactions within their native environment.

Other strategies for spatial proteomics build upon traditional imaging approaches, utilising antibody-based immunohistochemistry to detect specific proteins. Current advancements are rapidly increasing the number of analytes that can be multiplexed within a single sample. Quantitation can then be performed by a growing range of methods that include cyclic immunofluorescence [61], co-detection by indexing [62], multiplexed ion beam imaging [63] and imaging mass cytometry (IMC), the latter feasible even in three dimensions [64].

A recent development seeks to combine multiplexed imaging with mass spectrometry, coined “deep visual proteomics” (DVP). This first employs a small panel of markers, enabling in silico segmentation of tissue into its constituent cells and microenvironments based upon image analysis using deep neural networks. Laser microdissection then allows subsequent analysis by MS proteomics as discussed previously [65]. DVP has to date only been employed outside of the cardiac field, where it has made previously intractable hypothesis-generating experiments possible. Validated insights have so far been provided into primary melanoma using archived tissue samples [66], and in a single composite case of small lymphocytic and classical Hodgkin lymphoma [67] in order to inform a personalised medicine approach. Having demonstrated proof-of-concept, it is ripe to be exploited within the cardiac field.

Approaches to Integrate Diverse Datasets

There are myriad ways to integrate multiple -omics datasets, and, as is often the case, mature and consensus strategies to robustly analyse the data lag behind our ability to generate the datasets themselves.

Principled Integration

The most common approaches to data integration are principled approaches that integrate different -omics modalities based upon our knowledge, rooted in the central dogma, of which information each modality can best provide: what information, for example, does single cell transcriptomics provide that is orthogonal and complementary to bulk protein abundance? By integrating our prior knowledge in this manner, we are required to make the fewest additional assumptions about our datasets.

Integrating RNAseq and proteomics provides a more comprehensive molecular picture than either modality alone. Due to the costs involved, -omics experiments typically utilise small sample sizes, and so we can have increased confidence in findings that are seen across multiple molecular modalities. We emphasise, however, that the quantification of gene product abundance by transcriptomics and proteomics is not redundant; requiring concordance across modalities therefore carries the cost of a high false-negative rate, as proteins whose abundance is not regulated primarily at the transcriptional level may not show concordant regulation (Fig. 1A-B).

Principled integration also allows flexibility to draw in information from external datasets, including across organisms. For example, we demonstrated that integrating bulk proteomics data with single cell transcriptomics data enables the prediction of the cell populations from which the measured proteins are most likely to originate: generating insight despite the current challenges in measuring cell resolved protein abundance in the heart directly [16, 28].
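As an illustration of the general idea (and not the specific method used in refs. [16, 28]), the sketch below relates a bulk profile to cell type signatures derived from single cell data via non-negative least squares; the matrix names and shapes are assumptions.

```python
import numpy as np
from scipy.optimize import nnls

# Illustrative sketch only: 'signatures' is a hypothetical (genes x cell_types)
# matrix of mean expression per cell type derived from single cell data, and
# 'bulk' is a vector of bulk abundances over the same genes in the same order.
def estimate_composition(signatures: np.ndarray, bulk: np.ndarray) -> np.ndarray:
    weights, _residual = nnls(signatures, bulk)
    return weights / weights.sum()  # normalise to fractional contributions per cell type
```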

Genetic evidence can similarly be integrated via simple intersection [68, 69] (a simple sketch follows this paragraph). There is good evidence for the utility of this practice, with drugs developed with such supporting evidence from human genetics at least twice as likely to be approved as those without [70]. We utilised human GWAS in order to confirm the relevance to the human electrocardiogram of proteins whose abundance is changed in a mouse model of heart failure with concurrent sinus node dysfunction [16], and to prioritise for functional importance among protein interactors of cardiac ion channels [28]. Major initiatives are underway to better curate such datasets into databases and so facilitate their use by researchers without the required technical expertise. Examples focus primarily upon genetic information [71], protein expression [72, 73] or molecular signaling pathways [74]. The advent of databases with such broad scope is beginning to reduce the domain expertise required by researchers to manually curate such information themselves, and efforts to integrate some of this analysis directly into analytic pipelines [75] will help overcome what is commonly a bottleneck in -omics experiments.
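A minimal sketch of such an intersection is shown below, extended with a simple hypergeometric test of whether the overlap is larger than expected by chance; the gene sets are hypothetical, and real prioritisation pipelines are considerably more involved.

```python
from scipy.stats import hypergeom

# Illustrative sketch: intersect differentially abundant proteins with
# GWAS-implicated genes and ask whether the overlap exceeds chance.
# All inputs are hypothetical gene-symbol sets; 'background' is the set of
# all genes quantified in the experiment.
def gwas_overlap(differential: set, gwas_genes: set, background: set):
    differential = differential & background
    gwas_genes = gwas_genes & background
    overlap = differential & gwas_genes
    # Probability of observing at least this many shared genes by chance
    p = hypergeom.sf(len(overlap) - 1, len(background), len(gwas_genes), len(differential))
    return overlap, p
```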

Data Driven Integration Strategies

An alternative approach to principled integration is to rely more heavily on the data themselves and employ data driven integration strategies. The principal strengths and drawbacks of these approaches are captured eloquently by Miao et al.: “In a sense, machine learning methods completely avoid the careful modelling of mechanisms and instead apply a generically complex model to a very large reference dataset to produce a well-performing model with unknown parts” [76].

These will become increasingly powerful as the costs of experimental methods reduce and larger, more statistically powerful studies emerge [77]. -Omics datasets already present a statistical challenge, since the number of analytes measured is vastly greater than the number of statistical replicates (often referred to as the “curse of dimensionality”). This is especially true at the technical bleeding edge, where experimental costs are greatest and hence sample sizes often smallest. Datasets of genuine size are already beginning to present computational challenges, with methods needing to scale, for example, to analyse millions of cells and hence tens of billions of datapoints within a single project [76].

Across Modalities

Data driven approaches to integrate different -omics modalities are nascent, and parallels may be drawn to analytic pipelines for individual -omics datasets, where in many cases it has taken time after technological innovations for us to understand the characteristics of the data produced well enough to analyse them in a manner that reaches real rather than spurious conclusions [33, 78, 79]. Data driven approaches may currently be thought of as generating hypotheses about interactions. These can broadly be categorised into those that integrate different molecular modalities early in the analytic strategy (which provide the most flexibility for models to uncover novel molecular interactions) and those that integrate datasets late in the pipeline [80]. The latter better account for technical variation, the majority of which is assay specific. A theoretical tradeoff must be made between the flexibility of a model and its tendency to “overfit” a dataset: a very flexible model may fit the experimental data so closely that predictions do not generalise to external or future data. Commonly, these methods seek to interpret such highly dimensional input by embedding it into a lower dimensional space. Argelaguet and colleagues introduce what they describe as a “versatile and statistically rigorous generalization of principal component analysis”, in which the lower dimensional space is identified using factor analysis [81]. Allesøe and colleagues use generative deep learning models based upon variational autoencoders to embed multiple datasets into a low dimensional space [82]. The generative nature of these models further allows in silico investigation of the impact of perturbations, helping to identify associations across -omics modalities.
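A minimal sketch of the early-integration idea, not of the factor analysis or variational autoencoder models cited above, is shown below: each modality is scaled separately, the features are concatenated, and a shared low-dimensional embedding is learned. The input matrices are hypothetical and assumed to have rows in the same sample order.

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis
from sklearn.preprocessing import StandardScaler

# Illustrative sketch only: 'rna' and 'protein' are hypothetical
# (samples x features) matrices with matching row order.
def shared_factors(rna: np.ndarray, protein: np.ndarray, n_factors: int = 10) -> np.ndarray:
    # Scale each modality separately so neither dominates the joint embedding
    scaled = [StandardScaler().fit_transform(x) for x in (rna, protein)]
    joint = np.concatenate(scaled, axis=1)
    # Learn a shared low-dimensional representation of the concatenated features
    return FactorAnalysis(n_components=n_factors).fit_transform(joint)
```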

Across Time

The studies discussed so far largely have a cross-sectional design, where two or more groups (such as disease and control) are compared. A cross-sectional design is useful for identifying molecules significantly associated with a disease state and can provide valuable knowledge on what biological processes are associated with the disease. In contrast, if an experiment is designed with repeated measurements over time (such as repeated protocol biopsies following heart transplantation) it is possible to identify how and when molecules change. It is then possible to define clusters of molecules that share the same dynamic behavior and, by application of the principle of “guilt by association”, infer likely functional relationships. By analysing trajectories over time we can address the order in which changes occur, and begin to disentangle cause and effect at the molecular level. Whilst interventional evidence remains the gold standard to infer causality, observing for example that process A occurs after process B means that it is impossible that A causes B. In this manner, time series data help to prioritise candidates for additional follow-up experiments.
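The “guilt by association” step can be sketched as below, assuming a hypothetical matrix of per-timepoint mean abundances: trajectories are z-scored so that shape rather than absolute abundance drives the grouping, then clustered with a correlation-based distance.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

# Illustrative sketch only: 'trajectories' is a hypothetical
# (analytes x timepoints) matrix of mean abundance per timepoint.
def cluster_trajectories(trajectories: np.ndarray, n_clusters: int = 6) -> np.ndarray:
    # z-score each trajectory so that clustering reflects temporal shape
    z = (trajectories - trajectories.mean(axis=1, keepdims=True)) / trajectories.std(axis=1, keepdims=True)
    # hierarchical clustering with a correlation-based distance
    tree = linkage(z, method="average", metric="correlation")
    return fcluster(tree, t=n_clusters, criterion="maxclust")  # cluster label per analyte
```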

The most prevalent approach to analysing -omics data is univariate: to analyse the trajectory of each analyte separately. This is typically done using regression models due to their flexibility to accommodate many experimental designs within a single framework [7, 8, 83]. The specification of the appropriate regression model depends on the study design. For time series data, it is especially important to determine whether there are repeated measurements of the same statistical unit (e.g. patient, animal) or whether each measurement is made independently. If measurements are taken serially from the same subject and this is not taken into account during analysis, the assumption of independence is violated and hence an ordinary regression model is invalid. A typical choice instead is to use linear mixed-effects models to account for within-subject correlation [84, 85]. However, for some experimental designs covariance pattern models might better capture the covariance structures typical of time series data, with implementations such as lmmStar [86] or nlme [87] available for R [88]. If there are no repeated measurements in the study, it is advisable to use a weighted least squares fit, since variance tends to increase over time in longitudinal data [89].
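The cited implementations are for R; purely as an illustration of the same principle, the sketch below fits a random-intercept mixed-effects model for a single analyte with statsmodels in Python, assuming a hypothetical long-format table of repeated measurements.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Illustrative sketch only: 'data' is a hypothetical long-format table with
# columns abundance, time and patient, containing repeated measurements of
# one analyte per patient over time.
def fit_trajectory(data: pd.DataFrame):
    # Random intercept per patient accounts for within-subject correlation,
    # so the fixed effect of time is not biased by repeated sampling.
    model = smf.mixedlm("abundance ~ time", data, groups=data["patient"])
    return model.fit()
```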

Conclusion and Outlook

Here, we have reviewed recent technological developments in our ability to measure the proteome and the transcriptome in bulk, in single cells and across space, and highlighted their recent applications in the study of cardiac diseases. In addition to the focus here, other types of -omics data, such as epigenomics and metabolomics, and derivatives of all of the above, each provide distinct insights into the biological system. Only by integrating multiple complementary -omics modalities, and resolving their measurement across space, time and by cell type of origin, can we begin to understand pathophysiology at a systems level. As a greater number of datasets with increasingly large sample sizes are acquired, the next significant challenge lies in data analysis and interpretation. We must continue to develop strategies to most effectively utilise and integrate these diverse pieces of information in order to understand the dysregulation underlying cardiac disease states and hence develop new therapeutic strategies to ultimately benefit patients.

Key References

  • Schuermans A, Pournamdari AB, Lee J, Bhukar R, Ganesh S, Darosa N, et al. Integrative proteomic analyses across common cardiac diseases yield mechanistic insights and enhanced prediction. Nat Cardiovasc Res. 2024 Nov 21;1–15.

  • Utilising the UK Biobank to combine plasma proteomics with phenotypes, Schuermans and colleagues identify associations between the plasma proteome and 4 common cardiac diseases. Evidence of causality is assessed through the use of Mendelian randomisation in combination with genomic data.

  • Kuppe C, Ramirez Flores RO, Li Z, Hayat S, Levinson RT, Liao X, et al. Spatial multi-omic map of human myocardial infarction. Nature. 2022 Aug;608(7924):766–77.

  • Provide the first use of spatial transcriptomics in the human heart, which is interpreted in concert with single cell gene expression and single cell chromatin accessibility data from the same samples. They provide new insight into interactions between cell types and states after myocardial infarction.

  • Kanemaru K, Cranley J, Muraro D, Miranda AMA, Ho SY, Wilbrey-Clark A, et al. Spatially resolved multiomics of human cardiac niches. Nature. 2023 Jul;619(7971):801–10.

  • Combine single cell and spatial transcriptomics to characterise microanatomic regions of the healthy heart, focusing upon the cardiac conduction system.