A General Framework for Integrative Analysis of Incomplete Multi-omics Data
- Problem: Multi-omics analysis is challenged by missing values (incomplete subject profiling) and detection limits (censored data).
- Method: A general statistical framework based on a joint likelihood function and an Expectation-Maximization (EM) algorithm was developed to rigorously model and integrate multi-omics data while accounting for arbitrary missingness and censoring.
- Impact: Applied to the SPIROMICS cohort, the framework demonstrated superior statistical power and reduced bias compared to ad-hoc imputation methods, particularly in identifying protein quantitative trait loci (pQTLs) and biomarker-phenotype associations.
PubMed: 32691502 DOI: 10.1002/gepi.22328 Overview generated by: Gemini 2.5 Flash, 27/11/2025
Background and Objective
The analysis of modern multi-omics studies is complicated by two major statistical challenges:
1. Missing Data: Not all omics data types (e.g., protein expression, metabolomics) are measured on every subject, often due to cost constraints, leading to partial or incomplete datasets.
2. Detection Limits (Censoring): Quantitative omics measurements frequently involve values that fall below (or above) the detection limit of the instrument, leading to left- (or right-) censoring of the data.
Failing to properly account for these issues can lead to biased parameter estimates and reduced statistical power in integrative analyses. This paper proposes a general statistical framework to rigorously and powerfully handle missing values and detection limits in the integrative analysis of multi-omics data.
Methods: The Integrative Analysis Framework
Modeling Incomplete Data
The framework addresses two main analytic goals common in multi-omics studies:
1. Omics-to-Genetics (xQTL): Relating quantitative omics features (e.g., protein or metabolite levels) to genetic variants (SNPs) and covariates using linear regression models.
2. Omics-to-Phenotype: Relating phenotypes (e.g., disease status) to quantitative omics features and covariates using generalized linear models (GLMs).
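In symbols, the two model classes can be sketched as follows. The notation is assumed here for illustration and is not taken from the paper: \(Y\) is a quantitative omics feature, \(G\) a genotype, \(X\) a covariate vector, \(D\) the phenotype, and \(g\) a link function (e.g., the logit for disease status). The normal error term is likewise an assumption of this sketch.

\[
Y = \alpha_0 + \alpha_G G + \alpha_X^{\top} X + \varepsilon, \qquad \varepsilon \sim N(0, \sigma^2)
\]
\[
g\{E(D \mid Y, X)\} = \beta_0 + \beta_Y Y + \beta_X^{\top} X
\]

Under this notation, \(\alpha_G\) is the xQTL effect and \(\beta_Y\) is the omics-to-phenotype effect.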
Key Innovation: Joint Likelihood and EM Algorithm
The core innovation is the derivation of a joint likelihood function that formally accounts for the incomplete data structure. This joint likelihood allows for:
- Arbitrary Missingness: The model remains valid under complex patterns of missing omics data, provided the data are missing at random.
- Censoring: It directly models the values that are below or above detection limits as censored observations, avoiding the need for simple (and often biased) imputation methods like replacing censored values with \(\frac{1}{2}\) the detection limit.
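For a single omics feature following a normal linear model with mean \(\mu_i\) and standard deviation \(\sigma\), the standard censored-data likelihood contribution of subject \(i\) takes the form below. This is a sketch under assumed notation (\(y_i\) is the measurement, \(L\) and \(U\) are lower and upper detection limits, \(\phi\) and \(\Phi\) are the standard normal density and distribution function); the paper's joint likelihood across several omics features and the phenotype may be more general.

\[
\ell_i(\mu_i, \sigma) =
\begin{cases}
\sigma^{-1}\,\phi\!\left(\dfrac{y_i - \mu_i}{\sigma}\right), & L \le y_i \le U \ \text{(observed)},\\[6pt]
\Phi\!\left(\dfrac{L - \mu_i}{\sigma}\right), & y_i < L \ \text{(left-censored)},\\[6pt]
1 - \Phi\!\left(\dfrac{U - \mu_i}{\sigma}\right), & y_i > U \ \text{(right-censored)}.
\end{cases}
\]

A censored value therefore contributes the probability of falling beyond the limit instead of a substituted number. When the feature was never measured on a subject, the unobserved value is integrated out of any model that involves it (under missing at random) rather than imputed.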
The authors use an Expectation-Maximization (EM) algorithm to obtain maximum likelihood estimates of all model parameters, efficiently handling the latent (unobserved) censored and missing values.
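To make the EM idea concrete, here is a minimal sketch for the simplest special case: one left-censored outcome in a normal linear regression. It is not the authors' implementation, which handles multiple omics features, missingness, and a phenotype model jointly. The function name `em_left_censored_regression` and its arguments are hypothetical. The E-step replaces each censored value by its conditional mean and variance under a truncated normal; the M-step is ordinary least squares plus a variance update.

```python
import numpy as np
from scipy.stats import norm


def em_left_censored_regression(X, y, censored, limit, n_iter=500, tol=1e-8):
    """EM for y = X @ beta + N(0, sigma^2) noise with left-censoring at `limit`.

    X        : (n, p) design matrix, including an intercept column
    y        : (n,) outcomes; entries where `censored` is True are ignored
    censored : (n,) boolean mask, True where y is only known to be <= limit
    limit    : lower detection limit (scalar)
    Illustrative sketch only; names and interface are assumptions.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    censored = np.asarray(censored, dtype=bool)
    n = y.shape[0]

    # Start from the naive fit that plugs the limit in for censored values.
    y_filled = np.where(censored, limit, y)
    beta = np.linalg.lstsq(X, y_filled, rcond=None)[0]
    sigma = max(np.std(y_filled - X @ beta), 1e-6)

    for _ in range(n_iter):
        # E-step: mean and variance of a normal truncated above at `limit`.
        mu_c = X[censored] @ beta
        a = (limit - mu_c) / sigma
        lam = np.exp(norm.logpdf(a) - norm.logcdf(a))  # phi(a) / Phi(a)
        cond_mean = mu_c - sigma * lam
        cond_var = sigma**2 * np.maximum(1.0 - a * lam - lam**2, 0.0)

        # M-step: least squares on the completed outcomes, then update sigma.
        y_filled = np.where(censored, 0.0, y)
        y_filled[censored] = cond_mean
        beta_new = np.linalg.lstsq(X, y_filled, rcond=None)[0]
        resid = y_filled - X @ beta_new
        sigma_new = np.sqrt((resid @ resid + cond_var.sum()) / n)

        change = max(np.max(np.abs(beta_new - beta)), abs(sigma_new - sigma))
        beta, sigma = beta_new, sigma_new
        if change < tol:
            break

    return beta, sigma
```

The conditional variance term in the `sigma` update is what separates this from single imputation: plugging in a fixed value and running least squares would ignore the uncertainty about the censored measurements.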
Application: Emphysema and Blood Biomarkers
The method was applied to data from the SPIROMICS cohort, integrating genetic variants (SNPs), circulating blood biomarkers (proteins), and emphysema status as the phenotype.
Key Results and Findings
Superior Performance Over Imputation
The proposed method consistently outperformed standard practice, namely ad-hoc imputation (e.g., replacing values below the detection limit with \(\frac{1}{2}\) the limit) or simply removing incomplete samples:
- Reduced Bias: The likelihood-based approach yielded less biased estimates for the effects of biomarkers on emphysema.
- Increased Power: The framework substantially increased the statistical power to detect associations, particularly for protein quantitative trait loci (pQTLs), where it identified many variants that imputation-based approaches missed.
- Robustness: The method proved robust in handling high rates of missingness and censoring, which are common in proteomic and metabolomic datasets.
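The bias of the ad-hoc approaches is easy to reproduce on synthetic data. The toy simulation below is an assumption-laden illustration, not a reproduction of the paper's simulations or of the SPIROMICS analysis: it generates a left-censored outcome with a known slope and fits ordinary least squares after (a) substituting half the detection limit and (b) dropping censored samples. The function name `naive_slopes` and all parameter values are hypothetical.

```python
import numpy as np


def naive_slopes(n=5000, limit=0.5, true_slope=0.5, seed=0):
    """Simulate a left-censored outcome and fit two ad-hoc analyses.

    Toy example with made-up parameters; returns the estimated slopes.
    """
    rng = np.random.default_rng(seed)
    x = rng.normal(size=n)
    y = 1.0 + true_slope * x + rng.normal(size=n)
    below = y < limit  # values the instrument could not quantify

    def ols_slope(x_vals, y_vals):
        design = np.column_stack([np.ones_like(x_vals), x_vals])
        return np.linalg.lstsq(design, y_vals, rcond=None)[0][1]

    # Ad-hoc analysis 1: replace censored values with half the limit.
    substituted = ols_slope(x, np.where(below, limit / 2.0, y))
    # Ad-hoc analysis 2: drop every sample with a censored value.
    dropped = ols_slope(x[~below], y[~below])
    return {"true": true_slope, "half_limit": substituted, "dropped": dropped}


if __name__ == "__main__":
    for name, value in naive_slopes().items():
        print(f"{name}: {value:.3f}")
```

In this setup both ad-hoc slopes are attenuated toward zero relative to the true value, because substitution and deletion each distort the lower tail of the outcome distribution. A likelihood that treats those values as censored avoids that distortion.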
Conclusions and Significance
This paper presents a rigorous and powerful statistical solution for overcoming the challenges of missing data and detection limits in the integrative analysis of multi-omics datasets.
By developing a formal joint likelihood framework and using the EM algorithm, the method accurately models the relationships between genetic variants, omics features, and clinical phenotypes. This advancement is critical for improving the quality and interpretability of findings in large-scale multi-omics studies and accelerating the discovery of genetic determinants and molecular mediators of complex diseases.