DIABLO: an integrative approach for identifying key molecular drivers from multi-omics assays
- Method: DIABLO (Data Integration Analysis for Biomarker discovery using Latent variable approaches for Omics datasets) is a supervised multi-block PLS/GCCA method for joint analysis of heterogeneous omics data.
- Feature Selection: It uses a sparse penalty (L1) to select a minimal set of key molecular drivers that are highly correlated across omics layers and maximally associated with a specific clinical outcome (e.g., disease status).
- Impact: DIABLO demonstrated superior classification accuracy and biological coherence in identifying integrated biomarkers for complex diseases, such as the molecular drivers distinguishing breast cancer subtypes.
PubMed: 30657866
DOI: 10.1093/bioinformatics/bty1054
Overview generated by: Gemini 2.5 Flash, 27/11/2025
Background and Objective
Analyzing modern multi-omics data poses two challenges: each individual dataset is high-dimensional, and the diverse data types (e.g., transcriptomics, proteomics, metabolomics) are difficult to integrate in a way that accounts for the biological signal they share. Many existing methods analyze each dataset separately and combine the results afterwards, which misses the opportunity to jointly identify the features that drive biological differences.
This paper introduces DIABLO (Data Integration Analysis for Biomarker discovery using Latent variable approaches for Omics datasets), a computational method designed to:
1. Perform supervised integration of multiple heterogeneous omics datasets.
2. Simultaneously identify a minimal set of key molecular features (biomarkers) that are highly correlated across omics types and maximally associated with a specific outcome variable (e.g., disease status, clinical subtype).
Methods: The DIABLO Framework
Core Algorithm
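DIABLO is based on Partial Least Squares (PLS), a family of methods that reduce dimension by projecting data onto a small number of latent components; sparse variants of PLS additionally perform feature selection. Specifically, DIABLO uses a generalization of PLS to more than two datasets, referred to as multi-block PLS (MB-PLS) or Generalised Canonical Correlation Analysis (GCCA).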
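To make the idea of a PLS latent component concrete, the following is a minimal two-block sketch in Python with NumPy. It is not DIABLO itself and not the authors' implementation: it only shows that the first pair of PLS weight vectors, which maximize the covariance between the projected blocks, can be read off the leading singular vectors of the cross-covariance matrix. The function name `first_pls_component` and the simulated data are assumptions made for illustration.

```python
import numpy as np

def first_pls_component(X, Y):
    """Illustrative helper (hypothetical name): unit-norm weights (a, b)
    that maximize cov(X a, Y b), plus the resulting component scores."""
    # Center each feature so that cross-products are covariances.
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    # Cross-covariance matrix between the features of X and Y.
    M = Xc.T @ Yc / (X.shape[0] - 1)
    # Leading singular vectors give the covariance-maximizing directions.
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    a, b = U[:, 0], Vt[0, :]
    return a, b, Xc @ a, Yc @ b

# Simulated data (an assumption): two blocks driven by one shared latent signal.
rng = np.random.default_rng(0)
n = 50
latent = rng.normal(size=n)
X = np.outer(latent, [1.0, 0.5, 0.0, 0.0]) + 0.1 * rng.normal(size=(n, 4))
Y = np.outer(latent, [0.8, 0.0, 0.3]) + 0.1 * rng.normal(size=(n, 3))

a, b, t, u = first_pls_component(X, Y)
print("X weights:", np.round(a, 2))
print("Y weights:", np.round(b, 2))
print("correlation of components:", round(float(np.corrcoef(t, u)[0, 1]), 3))
```

The multi-block setting replaces the single pair of blocks with several, so the weights are typically found by an iterative algorithm rather than a single SVD.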
Supervised Integration
Unlike unsupervised integration methods, DIABLO is supervised: it incorporates an outcome variable (e.g., healthy vs. disease) into the model. The method identifies latent components (similar to principal components) that maximize the covariance between:
1. The different omics datasets.
2. Each omics dataset and the outcome variable.
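The two covariance terms above can be written as a single optimization problem. The formulation below is a sketch in assumed notation, not a quotation from the paper: \(X^{(1)}, \dots, X^{(Q)}\) denote the blocks (the omics datasets plus the outcome coded as a dummy indicator matrix), \(a^{(q)}\) is the weight vector of block \(q\), and \(c_{ij} \ge 0\) are user-chosen design weights stating which pairs of blocks should be linked.

\[
\max_{a^{(1)}, \dots, a^{(Q)}} \; \sum_{\substack{i, j = 1 \\ i \neq j}}^{Q} c_{ij} \, \operatorname{cov}\!\left( X^{(i)} a^{(i)}, \; X^{(j)} a^{(j)} \right)
\quad \text{subject to} \quad \lVert a^{(q)} \rVert_2 = 1, \quad q = 1, \dots, Q
\]

Treating the outcome as one more block is what makes the integration supervised: pairs involving the outcome block contribute term 2 of the list, and the remaining pairs contribute term 1.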
Feature Selection and Biomarker Discovery
DIABLO employs a sparse penalty (using an \(L_1\) penalty, similar to LASSO regression) within its iterative algorithm. This forces the latent components to be linear combinations of only a small number of features. This feature selection step is critical for:
* Dimension Reduction: Reducing the number of irrelevant features.
* Biomarker Identification: Pinpointing the most important “key molecular drivers” that explain the variation in the biological outcome across the different omics layers.
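An \(L_1\) penalty of this kind is commonly applied through soft-thresholding: each weight is shrunk toward zero by a fixed amount, and weights smaller than that amount become exactly zero. The sketch below illustrates that single step in Python with NumPy. It is an assumption-laden illustration rather than the DIABLO algorithm: the function names, the example weights, and the threshold `lam` are all hypothetical.

```python
import numpy as np

def soft_threshold(v, lam):
    """Shrink every entry of v toward zero by lam; small entries become zero."""
    return np.sign(v) * np.maximum(np.abs(v) - lam, 0.0)

def sparse_unit_loading(v, lam):
    """Soft-threshold a weight vector, then rescale it to unit L2 norm."""
    w = soft_threshold(v, lam)
    norm = np.linalg.norm(w)
    return w / norm if norm > 0 else w

# Hypothetical dense weights for five features of one omics block.
loadings = np.array([0.9, -0.4, 0.05, -0.02, 0.3])
sparse = sparse_unit_loading(loadings, lam=0.1)

# Features with non-zero weight are the ones "selected" for this component.
selected = np.flatnonzero(sparse)
print("sparse weights:", np.round(sparse, 3))
print("selected feature indices:", selected.tolist())
```

In practice the amount of shrinkage is a tuning parameter, usually chosen by cross-validation, and it controls how many features each component retains.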
Key Results and Application
Performance
The authors demonstrated that DIABLO outperformed several state-of-the-art multi-omics integration and classification methods (e.g., iClusterPlus, multi-kernel learning) in terms of:
* Classification Accuracy: Achieving higher predictive performance for distinguishing between sample groups.
* Biological Relevance: Identifying smaller, more biologically coherent subsets of features that were consistently selected across different omics blocks.
Case Study: Breast Cancer
DIABLO was applied to a multi-omics breast cancer dataset (transcriptomics, metabolomics, miRNA) to distinguish between clinical subtypes.
* Integrated Signatures: DIABLO identified a signature of features that were highly correlated across the omics layers, including specific genes (mRNA), microRNAs, and metabolites.
* Key Drivers: The method pinpointed known and novel molecular drivers (e.g., genes and pathways related to cell cycle and proliferation) whose concerted changes across the different omics layers were responsible for the differences between cancer subtypes.
Conclusions and Significance
DIABLO is a flexible and scalable tool for supervised integration of multi-omics data. Because it performs dimension reduction, feature selection, and association with a clinical outcome in a single model, it is particularly well-suited for biomarker discovery.
By identifying small, highly relevant, and correlated sets of features across different molecular layers, DIABLO provides valuable insights into the key molecular drivers underlying complex biological states or disease phenotypes, aiding in the transition toward personalized medicine.