Statistical Methods for Integrative Clustering of Multi-omics Data
PubMed: 36929074
DOI: 10.1007/978-1-0716-2986-4_5
Overview generated by: Gemini 2.5 Flash, 27/11/2025
Background and Objective
The heterogeneity of cancers, driven by complex alterations at multiple molecular levels (genomic, epigenomic, transcriptomic), calls for statistical methods that can analyze these levels jointly. Identifying robust molecular subtypes of cancer is a crucial step toward personalized medicine. Traditional clustering methods applied to a single omics layer often yield unstable results.
This chapter provides an overview of, and practical guide to, integrative clustering: an unsupervised learning approach that uses multi-omics data to simultaneously identify shared disease subtypes and their associated molecular signatures. The chapter focuses primarily on model-based statistical approaches.
Methods: Integrative Clustering Taxonomy
The chapter classifies and describes the prominent statistical methods used for integrative clustering of multi-omics data:
1. Model-Based Approaches
These methods assume that the variation across different omics datasets can be explained by a set of shared, hidden (latent) variables or factors.
- iCluster (Integrative Clustering): A latent variable model that uses a joint likelihood function to integrate multiple omics data types. It incorporates penalized regression to enforce sparsity and perform feature selection, identifying the key molecular features that define the clusters.
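- iClusterPlus and iClusterBayes (Extensions of iCluster): iClusterPlus uses penalized generalized linear models to handle diverse data distributions (e.g., continuous, count, binary). iClusterBayes recasts the latent variable model in a fully Bayesian framework with Bayesian variable selection. In both, the number of clusters (\(K\)) is not estimated automatically; it is chosen by fitting models over a range of \(K\) and comparing them with a model selection criterion (e.g., BIC or the deviance ratio).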
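The shared-latent-variable idea behind these models can be illustrated with a small sketch. In iCluster-style models each omics layer \(X_t\) is written as \(X_t = W_t Z + \varepsilon_t\), with a common latent matrix \(Z\) and layer-specific loadings \(W_t\). The Python code below is not iCluster or iClusterPlus. As a simplifying assumption it replaces the penalized likelihood fit with an unpenalized SVD of the standardized, stacked layers (samples in rows), and it clusters the resulting scores with a minimal 2-means. The simulated data, function names, and parameter values are all hypothetical.

```python
import numpy as np

def simulate_two_layers(n_per_group=30, p1=40, p2=25, seed=0):
    """Simulate two toy omics layers that share one latent subtype signal."""
    rng = np.random.default_rng(seed)
    labels = np.repeat([0, 1], n_per_group)
    z = np.where(labels == 0, -1.0, 1.0)       # shared latent variable, one value per sample
    w1 = np.zeros(p1)
    w1[:8] = 2.0                               # sparse loadings: only 8 informative features
    w2 = np.zeros(p2)
    w2[:8] = 2.0
    x1 = np.outer(z, w1) + rng.normal(size=(labels.size, p1))
    x2 = np.outer(z, w2) + rng.normal(size=(labels.size, p2))
    return [x1, x2], labels

def joint_latent_scores(layers, n_factors=1):
    """Estimate shared latent scores by SVD of the standardized, stacked layers."""
    scaled = []
    for x in layers:
        x = (x - x.mean(axis=0)) / x.std(axis=0)
        scaled.append(x / np.sqrt(x.shape[1]))  # keep wide layers from dominating
    stacked = np.hstack(scaled)
    u, s, vt = np.linalg.svd(stacked, full_matrices=False)
    return u[:, :n_factors] * s[:n_factors], vt[:n_factors]

def two_means(scores, n_iter=50):
    """Minimal 2-cluster k-means, initialized at the extremes of the first factor."""
    centers = scores[[scores[:, 0].argmin(), scores[:, 0].argmax()]]
    for _ in range(n_iter):
        d = ((scores[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        assign = d.argmin(axis=1)
        centers = np.array([scores[assign == k].mean(axis=0) for k in range(2)])
    return assign

layers, truth = simulate_two_layers()
scores, loadings = joint_latent_scores(layers)
clusters = two_means(scores)
# Cluster labels are arbitrary, so compare against both labelings.
agreement = max(np.mean(clusters == truth), np.mean(clusters != truth))
```

In this toy setting the rows of `loadings` play the role of the stacked \(W_t\): features with large absolute loadings are the ones driving the split. A sparsity penalty, as used in the iCluster family, would shrink the remaining loadings toward zero.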
2. Nonparametric Approaches
These methods do not rely on specific distributional assumptions.
- Similarity Network Fusion (SNF): This method constructs a similarity network for each individual omics dataset, then iteratively fuses them into a single, comprehensive patient similarity network, which is then clustered to define the final disease subtypes.
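The fusion step can be sketched in a few lines of Python. This is a simplified illustration rather than the published SNF procedure or any package's API: it assumes a Gaussian affinity with a median-distance bandwidth, plain row normalization, and a two-group spectral split. All function names, parameter values, and the simulated data are hypothetical.

```python
import numpy as np

def affinity(x):
    """Gaussian affinity from Euclidean distances (median-distance bandwidth is an assumption)."""
    sq = ((x[:, None, :] - x[None, :, :]) ** 2).sum(axis=2)
    dist = np.sqrt(sq)
    sigma = np.median(dist[dist > 0])
    return np.exp(-sq / (2.0 * sigma ** 2))

def row_normalize(w):
    return w / w.sum(axis=1, keepdims=True)

def knn_kernel(w, k):
    """Keep each sample's k strongest neighbours (self included), then row-normalize."""
    s = np.zeros_like(w)
    idx = np.argsort(-w, axis=1)[:, :k]
    rows = np.arange(w.shape[0])[:, None]
    s[rows, idx] = w[rows, idx]
    return row_normalize(s)

def fuse(layers, k=10, n_iter=10):
    """Cross-diffuse per-layer networks and return one fused, symmetric similarity matrix."""
    ws = [affinity(x) for x in layers]
    ps = [row_normalize(w) for w in ws]   # dense "global" similarity per layer
    ss = [knn_kernel(w, k) for w in ws]   # sparse "local" similarity per layer
    for _ in range(n_iter):
        new_ps = []
        for v in range(len(ps)):
            others = [ps[u] for u in range(len(ps)) if u != v]
            avg_other = sum(others) / len(others)
            p = ss[v] @ avg_other @ ss[v].T   # diffuse the other layers through this layer's kNN graph
            new_ps.append(row_normalize((p + p.T) / 2.0))
        ps = new_ps
    fused = sum(ps) / len(ps)
    return (fused + fused.T) / 2.0

def spectral_two_groups(sim):
    """Two-group split by the sign of the second eigenvector of the normalized Laplacian."""
    d = sim.sum(axis=1)
    lap = np.eye(len(d)) - sim / np.sqrt(np.outer(d, d))
    _, vecs = np.linalg.eigh(lap)
    return (vecs[:, 1] > 0).astype(int)

# Toy data: two layers measured on the same 50 samples, each carrying the subtype signal.
rng = np.random.default_rng(1)
truth = np.repeat([0, 1], 25)
layer_a = rng.normal(size=(50, 20))
layer_a[:, :5] += 3.0 * truth[:, None]
layer_b = rng.normal(size=(50, 20))
layer_b[:, :5] += 3.0 * truth[:, None]

fused = fuse([layer_a, layer_b])
groups = spectral_two_groups(fused)
agreement = max(np.mean(groups == truth), np.mean(groups != truth))
```

The key design point is that each layer's network is updated using the other layers' networks, so similarities supported by several data types are reinforced while layer-specific noise is damped. For more than two subtypes, the sign split would be replaced by a full spectral clustering step.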
Conclusions and Significance
Integrative clustering is essential for analyzing heterogeneous multi-omics data. By combining information across multiple molecular layers, these methods can produce more stable and biologically meaningful disease subtypes, together with their molecular signatures, than clustering a single layer. The chapter provides the foundation and practical steps researchers need to apply these techniques in cancer biology and precision medicine, including data preprocessing, model selection, and biological interpretation (e.g., using Kaplan-Meier curves to assess survival differences between subtypes).
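As an illustration of the interpretation step, the Kaplan-Meier estimate \(\hat{S}(t) = \prod_{t_i \le t} (1 - d_i / n_i)\) can be computed per subtype, where \(d_i\) is the number of events at time \(t_i\) and \(n_i\) the number of patients still at risk. The sketch below is a minimal Python implementation. The follow-up times, event indicators, and subtype names are made up for the example, and in practice a dedicated survival analysis package and a formal test of the difference between curves would normally be used.

```python
import numpy as np

def kaplan_meier(time, event):
    """Return distinct event times and the Kaplan-Meier survival estimate at each.

    time  : follow-up time per patient
    event : 1 if the event (e.g. death) was observed, 0 if the patient was censored
    """
    time = np.asarray(time, dtype=float)
    event = np.asarray(event, dtype=int)
    event_times = np.unique(time[event == 1])
    surv, s = [], 1.0
    for t in event_times:
        at_risk = np.sum(time >= t)                  # patients still under observation at t
        deaths = np.sum((time == t) & (event == 1))  # events occurring exactly at t
        s *= 1.0 - deaths / at_risk
        surv.append(s)
    return event_times, np.array(surv)

# Hypothetical follow-up data (months) for two subtypes found by clustering.
t_a, s_a = kaplan_meier([5, 8, 12, 20, 24], [1, 1, 0, 1, 0])
t_b, s_b = kaplan_meier([2, 3, 3, 6, 9], [1, 1, 1, 1, 0])

for name, times, surv in [("subtype A", t_a, s_a), ("subtype B", t_b, s_b)]:
    print(name, dict(zip(times.tolist(), np.round(surv, 3).tolist())))
```

Censored patients contribute to the at-risk count until their last follow-up but never trigger a drop in the curve, which is why the two toy subtypes can be compared even though follow-up is incomplete.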