A fully Bayesian latent variable model for integrative clustering analysis of multi-type omics data

bayesian statistics
bioinformatics
cancer subtyping
clustering
latent variables
multi-omics
  • Method: The paper proposes a fully Bayesian latent variable model for integrative clustering analysis of multi-omics data, building on the iCluster framework.
  • Key Innovation: The Bayesian approach incorporates adaptive shrinkage priors to enforce sparsity (feature selection) on the omics-specific loading matrices, which simultaneously identifies robust disease subtypes and their minimal molecular signatures.
  • Impact: Applied to TCGA cancer data, the model demonstrated superior performance in identifying clinically relevant, stable subtypes and the specific genes/loci driving the differences across mRNA, methylation, and CNV data.
Published

23 January 2026

PubMed: 28541380 DOI: 10.1093/biostatistics/kxx017 Overview generated by: Gemini 2.5 Flash, 27/11/2025

Background and Objective

The identification of clinically relevant disease subtypes and their corresponding molecular signatures is a central goal in precision medicine, particularly in cancer research. Large-scale projects like The Cancer Genome Atlas (TCGA) generate vast amounts of heterogeneous multi-omics data (e.g., gene expression, DNA methylation, copy number variation). However, standard clustering methods applied to individual omics layers often yield conflicting or unstable results, failing to capture the comprehensive biological signal.

This paper introduces a novel statistical framework based on a fully Bayesian latent variable model for integrative clustering analysis of multi-type omics data. The objective is to simultaneously:

  1. Identify robust, shared disease subtypes across all omics platforms.
  2. Identify the key molecular features (signatures) from each omics platform that characterize these subtypes.

Methods: The Bayesian Integrative Clustering (iCluster) Framework

Core Algorithm: Latent Variable Model

The method is an advancement of the iCluster and iClusterPlus frameworks. It assumes that the variation in each omics dataset (\(\mathbf{X}_k\)) for the \(k\)-th data type can be explained by a set of shared, hidden (latent) variables (\(\mathbf{Z}\)), plus noise:

\[\mathbf{X}_k = \mathbf{Z} \mathbf{W}_k + \mathbf{E}_k\]

where \(\mathbf{Z}\) represents the common latent factors (which define the patient clusters) and \(\mathbf{W}_k\) are the omics-specific loading matrices (which define the molecular features).
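To make the generative model concrete, the following is a minimal simulation sketch in plain Python. It assumes an orientation in which \(\mathbf{X}_k\) has one row per sample and one column per feature, so \(\mathbf{Z}\) is samples by factors and \(\mathbf{W}_k\) is factors by features. The function name `simulate_omics`, the standard normal draws, and the noise level are illustrative assumptions, not details taken from the paper.

```python
import random

def simulate_omics(n_samples, n_factors, feature_counts, noise_sd=0.1, seed=0):
    """Draw shared latent factors Z, then X_k = Z W_k + E_k for each data type k.

    All distributions here are illustrative assumptions (standard normal
    factors and loadings, Gaussian noise), not the paper's specification.
    """
    rng = random.Random(seed)
    # Z: n_samples x n_factors, shared across every data type.
    Z = [[rng.gauss(0, 1) for _ in range(n_factors)] for _ in range(n_samples)]
    W, X = [], []
    for p_k in feature_counts:
        # W_k: n_factors x p_k, specific to data type k.
        W_k = [[rng.gauss(0, 1) for _ in range(p_k)] for _ in range(n_factors)]
        # X_k[i][j] = sum_q Z[i][q] * W_k[q][j] + noise
        X_k = [
            [
                sum(Z[i][q] * W_k[q][j] for q in range(n_factors))
                + rng.gauss(0, noise_sd)
                for j in range(p_k)
            ]
            for i in range(n_samples)
        ]
        W.append(W_k)
        X.append(X_k)
    return Z, W, X

# Two hypothetical data types with 5 and 3 features, measured on 6 samples.
Z, W, X = simulate_omics(n_samples=6, n_factors=2, feature_counts=[5, 3])
print(len(X), len(X[0]), len(X[0][0]), len(X[1][0]))
```

Because every \(\mathbf{X}_k\) is generated from the same \(\mathbf{Z}\), structure in the latent space (such as sample clusters) is reflected in all data types at once, which is the property the integrative model exploits.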

Bayesian Implementation and Innovation

The key methodological innovations are the implementation of the model using a fully Bayesian approach (via Gibbs sampling) and the incorporation of sparsity priors:

  • Integrative Clustering: The latent variables \(\mathbf{Z}\) are clustered using a Dirichlet process prior, which automatically determines the optimal number of clusters (\(K\)) that best explains the integrated data structure.
  • Feature Selection (Sparsity): Adaptive shrinkage priors are placed on the loading matrices (\(\mathbf{W}_k\)). This is crucial because it drives many of the loadings to zero, effectively performing automatic variable selection and identifying a sparse set of molecular features (genes, loci, etc.) that are most relevant to the cluster assignments. This simultaneously resolves the high-dimensionality problem and provides the subtype-specific molecular signatures.
  • Handling Multi-type Data: The Bayesian framework naturally accommodates the different distributional properties of multi-omics data (e.g., continuous expression, count data, binary mutation status).
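The way a sparsity prior separates relevant from irrelevant features can be sketched for a single loading. The example below assumes a spike-and-slab prior, one common way to encode sparsity; the overview only says "adaptive shrinkage priors", so the prior form, the single latent factor, the Gaussian noise, and all parameter names (`prior_pi`, `slab_var`, `noise_var`) are assumptions made for illustration. With \(x_i = z_i w + e_i\), the loading \(w\) can be integrated out in closed form, giving the posterior probability that \(w \neq 0\).

```python
import math

def inclusion_probability(z, x, prior_pi=0.1, slab_var=1.0, noise_var=1.0):
    """Posterior probability that one feature's loading is nonzero.

    Assumed model (illustrative, not the paper's exact prior):
      x_i = z_i * w + e_i,  e_i ~ N(0, noise_var)
      w = 0 with probability 1 - prior_pi (spike),
      w ~ N(0, slab_var) with probability prior_pi (slab).
    """
    S = sum(zi * zi for zi in z)               # sum of squared factor scores
    b = sum(zi * xi for zi, xi in zip(z, x))   # cross-product of factor and feature
    # Log Bayes factor of slab versus spike, with w integrated out analytically.
    log_bf = (
        -0.5 * math.log(1.0 + slab_var * S / noise_var)
        + slab_var * b * b / (2.0 * noise_var * (noise_var + slab_var * S))
    )
    log_odds = math.log(prior_pi) - math.log(1.0 - prior_pi) + log_bf
    return 1.0 / (1.0 + math.exp(-log_odds))

z = [-2.0, -1.0, 0.0, 1.0, 2.0]                # latent factor scores for 5 samples
driven = [1.5 * zi for zi in z]                # feature that tracks the factor
unrelated = [1.0, -1.0, 0.0, -1.0, 1.0]        # feature uncorrelated with the factor

print(inclusion_probability(z, driven) > 0.9)      # strong evidence for inclusion
print(inclusion_probability(z, unrelated) < 0.1)   # falls below the prior of 0.1
```

In a full sampler this quantity would be recomputed at every iteration for every feature; the sketch only shows why features that track the latent factor are retained while the rest are shrunk toward zero.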

Key Results and Application

Application to Cancer Data

The method was applied to three publicly available cancer datasets (Glioblastoma Multiforme, Lung Squamous Cell Carcinoma, and Endometrial Carcinoma) from TCGA, integrating data types such as:

  • mRNA expression
  • DNA methylation
  • Copy number variation (CNV)

Robust Subtype Identification

The Bayesian integrative clustering approach demonstrated superior performance in identifying biologically and clinically relevant subtypes compared to methods applied to single omics layers or less sophisticated integrative methods. The inferred clusters showed strong agreement with established cancer subtyping schemes and often refined them.
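Agreement between an inferred clustering and an established subtyping scheme is often quantified with the adjusted Rand index (ARI). The overview does not say which agreement measure the authors used, so the sketch below is only one plausible way to check such a claim; the labels are made-up placeholders, not TCGA results.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand index between two labelings of the same samples.

    1.0 means identical partitions (up to relabeling); values near 0 mean
    agreement no better than chance.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("label lists must have the same length")
    n = len(labels_a)
    total_pairs = comb(n, 2)
    if total_pairs == 0:
        return 1.0
    # Count sample pairs that fall together in both, in A, and in B.
    pairs_joint = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    pairs_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    pairs_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = pairs_a * pairs_b / total_pairs
    max_index = 0.5 * (pairs_a + pairs_b)
    if max_index == expected:
        return 1.0  # both partitions are trivial (e.g. a single cluster each)
    return (pairs_joint - expected) / (max_index - expected)

inferred = [0, 0, 1, 1, 2, 2]                  # hypothetical cluster assignments
reference = ["B", "B", "A", "A", "C", "C"]     # hypothetical established subtypes
print(adjusted_rand_index(inferred, reference))
```

The index ignores the label names themselves, which matters here because cluster numbers produced by a model carry no intrinsic correspondence to named subtypes.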

Signature Discovery

The sparse loading matrices (\(\mathbf{W}_k\)) successfully identified the specific, highly characteristic molecular signatures for each subtype across the different omics layers, providing a clear biological interpretation of the patient groupings.
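Once the loadings are sparse, reading off a signature amounts to listing the features whose loadings are not (numerically) zero. A minimal sketch, assuming each loading matrix is stored as factors by features and that a small tolerance `tol` separates zero from nonzero; the helper name `signature_features` and the feature names are hypothetical.

```python
def signature_features(loadings, feature_names, tol=1e-8):
    """Return the features with at least one loading whose magnitude exceeds tol.

    loadings: list of rows, one per latent factor, each of length len(feature_names).
    """
    for row in loadings:
        if len(row) != len(feature_names):
            raise ValueError("each loading row must match the number of features")
    return [
        name
        for j, name in enumerate(feature_names)
        if any(abs(row[j]) > tol for row in loadings)
    ]

# Hypothetical sparse loading matrix for one data type: 2 factors x 4 features.
W_expr = [
    [0.0, 1.2, 0.0, 0.0],
    [0.0, 0.0, 0.0, -0.7],
]
names = ["feature_1", "feature_2", "feature_3", "feature_4"]
print(signature_features(W_expr, names))  # ['feature_2', 'feature_4']
```

In a Bayesian fit the loadings are usually summarized first (for example by posterior means or inclusion probabilities), so the threshold would be applied to that summary rather than to a single draw.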

Conclusions and Significance

This fully Bayesian latent variable model offers a statistically rigorous approach to the integrative clustering of multi-omics data. Its ability to determine the number of clusters automatically while performing sparse feature selection is its main advance. By providing robust disease subtypes and their corresponding minimal molecular signatures, the framework supports precision medicine and translational research in complex diseases.