A fully Bayesian latent variable model for integrative clustering analysis of multi-type omics data
- Method: The paper proposes a fully Bayesian latent variable model for integrative clustering analysis of multi-omics data, building on the iCluster framework.
- Key Innovation: The Bayesian approach incorporates adaptive shrinkage priors to enforce sparsity (feature selection) on the omics-specific loading matrices, which simultaneously identifies robust disease subtypes and their minimal molecular signatures.
- Impact: Applied to TCGA cancer data, the model demonstrated superior performance in identifying clinically relevant, stable subtypes and the specific genes/loci driving the differences across mRNA, methylation, and CNV data.
PubMed: 28541380 DOI: 10.1093/biostatistics/kxx017 Overview generated by: Gemini 2.5 Flash, 27/11/2025
Background and Objective
The identification of clinically relevant disease subtypes and their corresponding molecular signatures is a central goal in precision medicine, particularly in cancer research. Large-scale projects like The Cancer Genome Atlas (TCGA) generate vast amounts of heterogeneous multi-omics data (e.g., gene expression, DNA methylation, copy number variation). However, standard clustering methods applied to individual omics layers often yield conflicting or unstable results, failing to capture the comprehensive biological signal.
This paper introduces a novel statistical framework based on a fully Bayesian latent variable model for integrative clustering analysis of multi-type omics data. The objective is to simultaneously:
1. Identify robust, shared disease subtypes across all omics platforms.
2. Identify the key molecular features (signatures) from each omics platform that characterize these subtypes.
Methods: The Bayesian Integrative Clustering (iCluster) Framework
Core Algorithm: Latent Variable Model
The method is an advancement of the iCluster and iClusterPlus frameworks. It assumes that the variation in each omics dataset (\(\mathbf{X}_k\)) for the \(k\)-th data type can be explained by a set of shared, hidden (latent) variables (\(\mathbf{Z}\)), plus noise:
\[\mathbf{X}_k = \mathbf{Z} \mathbf{W}_k + \mathbf{E}_k\]
where \(\mathbf{Z}\) represents the common latent factors shared by all data types (which define the patient clusters), \(\mathbf{W}_k\) is the omics-specific loading matrix for data type \(k\) (which defines the molecular features), and \(\mathbf{E}_k\) is the corresponding noise term.
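To make the notation concrete, the following sketch simulates data from the model above with NumPy. It assumes samples are stored in rows, so \(\mathbf{X}_k\) is \(n \times p_k\), \(\mathbf{Z}\) is \(n \times q\), and \(\mathbf{W}_k\) is \(q \times p_k\). The sizes, the three simulated subtypes, and the Gaussian noise are illustrative assumptions, not values from the paper. With \(\mathbf{Z}\) treated as known, ordinary least squares stands in for the paper's Bayesian inference, purely to show that every data type is explained by the same latent factors.

```python
import numpy as np

rng = np.random.default_rng(0)

n, q = 120, 2                                   # samples, latent dimensions (illustrative)
p = {"mrna": 50, "methylation": 40, "cnv": 30}  # features per data type (illustrative)

# Shared latent factors Z (n x q): three hypothetical subtypes with shifted means
centers = np.array([[-3.0, 0.0], [3.0, 0.0], [0.0, 4.0]])
labels = np.repeat(np.arange(3), n // 3)
Z = centers[labels] + 0.5 * rng.standard_normal((n, q))

# Omics-specific loadings W_k (q x p_k) and data X_k = Z W_k + E_k (n x p_k)
W = {k: rng.standard_normal((q, pk)) for k, pk in p.items()}
X = {k: Z @ W[k] + rng.standard_normal((n, p[k])) for k in p}

# Given Z, least squares on the column-wise concatenation recovers every W_k at once
X_all = np.hstack([X[k] for k in p])
W_true = np.hstack([W[k] for k in p])
W_hat, *_ = np.linalg.lstsq(Z, X_all, rcond=None)
max_err = np.abs(W_hat - W_true).max()

print({k: X[k].shape for k in p})
print("largest loading error:", round(float(max_err), 3))
```

In the actual method \(\mathbf{Z}\) is unobserved and is inferred jointly with the loadings, so this sketch only illustrates the data-generating assumption, not the estimation procedure.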
Bayesian Implementation and Innovation
The key methodological innovations are the implementation of the model using a fully Bayesian approach (via Gibbs sampling) and the incorporation of sparsity priors:
- Integrative Clustering: The latent variables \(\mathbf{Z}\) are clustered using a Dirichlet process prior, which automatically determines the optimal number of clusters (\(K\)) that best explains the integrated data structure.
- Feature Selection (Sparsity): Adaptive shrinkage priors are placed on the loading matrices (\(\mathbf{W}_k\)). This is crucial because it drives many of the loadings to zero, effectively performing automatic variable selection and identifying a sparse set of molecular features (genes, loci, etc.) that are most relevant to the cluster assignments. This simultaneously resolves the high-dimensionality problem and provides the subtype-specific molecular signatures.
- Handling Multi-type Data: The Bayesian framework naturally accommodates the different distributional properties of multi-omics data (e.g., continuous expression, count data, binary mutation status).
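The effect of shrinking loadings to zero can be illustrated with a small NumPy sketch. It is not the paper's adaptive shrinkage prior or its Gibbs sampler: soft-thresholding of least-squares loadings is swapped in as a simple, non-Bayesian stand-in, \(\mathbf{Z}\) is assumed known, and the threshold `lam`, the sizes, and the simulated sparse loadings are all hypothetical choices made for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

n, q, p = 200, 2, 50
Z = rng.standard_normal((n, q))        # latent factors, assumed known in this sketch

# Sparse ground truth: only the first 5 features load on the latent factors
W_true = np.zeros((q, p))
W_true[:, :5] = rng.choice([-2.0, 2.0], size=(q, 5))
X = Z @ W_true + rng.standard_normal((n, p))

# Unpenalized least-squares loadings are dense: noise makes every entry nonzero
W_ls, *_ = np.linalg.lstsq(Z, X, rcond=None)

def soft_threshold(w, lam):
    """Shrink each entry toward zero by lam; entries smaller than lam become exactly zero."""
    return np.sign(w) * np.maximum(np.abs(w) - lam, 0.0)

W_sparse = soft_threshold(W_ls, lam=0.5)  # lam is a hypothetical tuning value

# A feature counts as selected if any of its loadings survives the shrinkage
selected = np.flatnonzero(np.any(W_sparse != 0.0, axis=0))
print("selected features:", selected.tolist())
```

A Bayesian shrinkage prior plays a similar role but lets the amount of shrinkage adapt to the data and yields posterior uncertainty about which features are selected, which a fixed threshold does not.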
Key Results and Application
Application to Cancer Data
The method was applied to three publicly available cancer datasets (Glioblastoma Multiforme, Lung Squamous Cell Carcinoma, and Endometrial Carcinoma) from TCGA, integrating data types such as:
- mRNA expression
- DNA methylation
- Copy number variation (CNV)
Robust Subtype Identification
The Bayesian integrative clustering approach identified biologically and clinically relevant subtypes more reliably than clustering on single omics layers or simpler integrative methods. The inferred clusters agreed closely with established cancer subtyping schemes and often refined them.
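Agreement between an inferred clustering and a reference subtyping is often quantified with the adjusted Rand index (ARI), which is 1 for identical partitions and close to 0 for chance-level agreement. This overview does not state which agreement measure the paper used, so the sketch below is only an assumption about how such a comparison could be carried out; the function name and the toy label vectors are hypothetical.

```python
from math import comb

import numpy as np

def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand index between two partitions of the same samples."""
    a = np.asarray(labels_a)
    b = np.asarray(labels_b)
    _, a_idx = np.unique(a, return_inverse=True)
    _, b_idx = np.unique(b, return_inverse=True)

    # Contingency table: how many samples fall in cluster i of a and cluster j of b
    table = np.zeros((a_idx.max() + 1, b_idx.max() + 1), dtype=int)
    np.add.at(table, (a_idx, b_idx), 1)

    sum_cells = sum(comb(int(x), 2) for x in table.ravel())
    sum_rows = sum(comb(int(x), 2) for x in table.sum(axis=1))
    sum_cols = sum(comb(int(x), 2) for x in table.sum(axis=0))
    total_pairs = comb(len(a), 2)

    expected = sum_rows * sum_cols / total_pairs
    max_index = 0.5 * (sum_rows + sum_cols)
    if max_index == expected:      # degenerate case, e.g. a single cluster in both
        return 1.0
    return (sum_cells - expected) / (max_index - expected)

# Toy labels: the same partition under renamed cluster labels scores exactly 1
inferred = [0, 0, 1, 1, 2, 2]
reference = [2, 2, 0, 0, 1, 1]
print(adjusted_rand_index(inferred, reference))
```

Because the index is invariant to how clusters are named, it can be applied directly to integer cluster labels from a model and to string-valued subtype annotations.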
Signature Discovery
The sparse loading matrices (\(\mathbf{W}_k\)) successfully identified the specific, highly characteristic molecular signatures for each subtype across the different omics layers, providing a clear biological interpretation of the patient groupings.
Conclusions and Significance
This fully Bayesian latent variable model offers a statistically rigorous approach to the integrative clustering of multi-omics data. Its main contribution is to determine the number of clusters automatically while performing sparse feature selection within the same model. By providing robust disease subtypes together with their minimal molecular signatures, the framework supports precision medicine and translational research in complex diseases.