Statistical Methods for Integrative Clustering of Multi-omics Data

Multi-omics

Review

Clustering

Cancer Subtyping

Latent Variables

Statistical Methods

Bioinformatics

Published

23 January 2026

PubMed: 36929074

DOI: 10.1007/978-1-0716-2986-4_5

Overview generated by: Gemini 2.5 Flash, 27/11/2025

Background and Objective

The heterogeneity of cancers is driven by alterations across multiple molecular levels (genomics, epigenomics, transcriptomics), so identifying robust molecular subtypes, a crucial step for personalized medicine, requires statistical methods that analyze these levels jointly. Traditional clustering methods applied to a single omics layer often yield unstable results.

This chapter provides an overview of, and a practical guide to, integrative clustering: an unsupervised learning approach that uses multi-omics data to simultaneously identify shared disease subtypes and their associated molecular signatures. It focuses primarily on model-based statistical approaches.

Methods: Integrative Clustering Taxonomy

The chapter classifies and describes the prominent statistical methods used for integrative clustering of multi-omics data:

1. Model-Based Approaches

These methods assume that the variation across different omics datasets can be explained by a set of shared, hidden (latent) variables or factors.

iCluster (Integrative Clustering): A latent variable model that uses a joint likelihood function to integrate multiple omics data types. It incorporates penalized regression to enforce sparsity and perform feature selection, identifying the key molecular features that define the clusters.
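iClusterPlus and iClusterBayes (Extensions): iClusterPlus generalizes iCluster with penalized generalized linear models so that diverse data distributions (e.g., count, binary, continuous) can be modeled jointly; iClusterBayes recasts the model in a Bayesian framework, with feature selection handled through prior distributions. In both, the number of clusters (\(K\)) is chosen by comparing model fit across candidate values rather than being estimated automatically.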
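
For orientation, the latent variable model behind iCluster is usually written along the following lines. The notation (\(X_t\), \(W_t\), \(Z\), \(\Psi_t\), \(\lambda_t\)) is an assumption of this overview and follows the commonly cited formulation of the method, not necessarily the chapter's own equations.

\[
X_t = W_t Z + \varepsilon_t, \qquad t = 1, \dots, m,
\]

where \(X_t\) is the \(p_t \times n\) matrix (features by samples) for omics layer \(t\), \(Z\) is the latent matrix shared by all layers, \(W_t\) holds the layer-specific loadings, and each column of \(\varepsilon_t\) is assumed to be \(N(0, \Psi_t)\) with diagonal \(\Psi_t\). Sparsity is imposed by maximizing a penalized log-likelihood \(\ell\), for example with a lasso penalty:

\[
\max_{W, \Psi} \; \ell(W, \Psi) - \sum_{t=1}^{m} \lambda_t \lVert W_t \rVert_1 .
\]

Samples are then clustered on the estimated \(Z\), and the features with nonzero rows in \(W_t\) form the molecular signature.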

2. Nonparametric Approaches

These methods do not rely on specific distributional assumptions.

Similarity Network Fusion (SNF): This method constructs a similarity network for each individual omics dataset, then iteratively fuses them into a single, comprehensive patient similarity network, which is then clustered to define the final disease subtypes.
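
The fusion steps described above can be sketched in a few lines of NumPy. This is a simplified illustration, not the reference implementation: the kernel form, the defaults `k`, `mu`, and `n_iter`, the synthetic two-layer data, and the two-way split on the Fiedler vector (standing in for a full spectral clustering step) are all assumptions made for this example.

```python
import numpy as np

def affinity(X, k=5, mu=0.5):
    """Scaled exponential similarity for one omics layer (samples x features)."""
    sq = np.sum(X ** 2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2 * X @ X.T, 0.0)
    d = np.sqrt(d2)
    # mean distance from each sample to its k nearest neighbours (self excluded)
    knn_mean = np.sort(d, axis=1)[:, 1:k + 1].mean(axis=1)
    eps = (knn_mean[:, None] + knn_mean[None, :] + d) / 3.0
    return np.exp(-d2 / (mu * eps + 1e-12))

def full_kernel(W):
    """Dense kernel: 1/2 on the diagonal, the other 1/2 shared among other samples."""
    off = W - np.diag(np.diag(W))
    P = off / (2.0 * off.sum(axis=1, keepdims=True))
    np.fill_diagonal(P, 0.5)
    return P

def local_kernel(W, k=5):
    """Sparse kernel: keep each sample's k most similar neighbours, rows sum to 1."""
    off = W - np.diag(np.diag(W))
    S = np.zeros_like(off)
    rows = np.arange(W.shape[0])[:, None]
    idx = np.argsort(-off, axis=1)[:, :k]
    S[rows, idx] = off[rows, idx]
    return S / S.sum(axis=1, keepdims=True)

def snf(views, k=5, mu=0.5, n_iter=20):
    """Fuse two or more omics layers into one symmetric patient similarity matrix."""
    Ws = [affinity(X, k, mu) for X in views]
    Ps = [full_kernel(W) for W in Ws]
    Ss = [local_kernel(W, k) for W in Ws]
    m = len(views)
    for _ in range(n_iter):
        updated = []
        for v in range(m):
            # diffuse the average of the other layers through this layer's local graph
            others = sum(Ps[u] for u in range(m) if u != v) / (m - 1)
            P = Ss[v] @ others @ Ss[v].T
            updated.append(full_kernel((P + P.T) / 2.0))  # symmetrise, re-normalise
        Ps = updated
    fused = sum(Ps) / m
    return (fused + fused.T) / 2.0

def two_way_split(A):
    """Split samples in two by the sign of the Fiedler vector (normalised Laplacian)."""
    d_inv_sqrt = 1.0 / np.sqrt(A.sum(axis=1))
    L = np.eye(len(A)) - d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]
    _, vecs = np.linalg.eigh(L)
    return (vecs[:, 1] * d_inv_sqrt > 0).astype(int)

# Synthetic example: 30 samples, two hypothetical subtypes, two omics layers.
rng = np.random.default_rng(0)
truth = np.repeat([0, 1], 15)
expression = rng.normal(size=(30, 10)) + 3.0 * truth[:, None]
methylation = rng.normal(size=(30, 8)) + 3.0 * truth[:, None]

fused = snf([expression, methylation])
labels = two_way_split(fused)
```

For more than two subtypes, the last step would normally be replaced by spectral clustering of `fused` with the desired number of clusters.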

Conclusions and Significance

Integrative clustering is a key tool for analyzing heterogeneous multi-omics data. By combining information across multiple molecular layers, these methods can produce disease subtypes that are more stable and biologically meaningful than those derived from a single layer, together with their corresponding molecular signatures. The chapter gives researchers the foundation and practical steps needed to apply these techniques in cancer biology and precision medicine, including data preprocessing, model selection, and biological interpretation (e.g., using Kaplan-Meier curves to assess survival differences between subtypes).
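
As an illustration of the survival comparison mentioned above, the sketch below computes a Kaplan-Meier curve and a two-group log-rank statistic in plain Python. The function names and the follow-up times are hypothetical and chosen only for this example; in practice an established survival analysis package would be the safer choice.

```python
import math

def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct event time (event 1 = observed, 0 = censored)."""
    data = sorted(zip(times, events))
    at_risk = len(data)
    surv = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = removed = 0
        # collect everyone whose follow-up ends at time t
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            removed += 1
            i += 1
        if deaths:
            surv *= 1.0 - deaths / at_risk
            curve.append((t, surv))
        at_risk -= removed
    return curve

def logrank_chi2(times_a, events_a, times_b, events_b):
    """Two-group log-rank chi-square statistic (1 degree of freedom)."""
    all_times = list(times_a) + list(times_b)
    all_events = list(events_a) + list(events_b)
    event_times = sorted({t for t, e in zip(all_times, all_events) if e})
    o_minus_e = 0.0
    var = 0.0
    for t in event_times:
        n_a = sum(1 for x in times_a if x >= t)  # at risk in group A just before t
        n_b = sum(1 for x in times_b if x >= t)
        d_a = sum(1 for x, e in zip(times_a, events_a) if x == t and e)
        d_b = sum(1 for x, e in zip(times_b, events_b) if x == t and e)
        n, d = n_a + n_b, d_a + d_b
        o_minus_e += d_a - d * n_a / n           # observed minus expected events in A
        if n > 1:
            var += d * (n_a / n) * (n_b / n) * (n - d) / (n - 1)
    return o_minus_e ** 2 / var

# Hypothetical follow-up times (months) for two subtypes found by clustering.
subtype_a_times = [5, 8, 12, 15, 20, 24]
subtype_a_events = [1, 1, 0, 1, 0, 0]
subtype_b_times = [2, 3, 4, 6, 7, 9]
subtype_b_events = [1, 1, 1, 1, 0, 1]

curve_a = kaplan_meier(subtype_a_times, subtype_a_events)
curve_b = kaplan_meier(subtype_b_times, subtype_b_events)
chi2 = logrank_chi2(subtype_a_times, subtype_a_events,
                    subtype_b_times, subtype_b_events)
p_value = math.erfc(math.sqrt(chi2 / 2.0))  # upper tail of chi-square with 1 df
```

A small `p_value` would suggest that the subtypes differ in survival, although with cohorts this small the result is only illustrative.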