Deep profiling of gene expression across 18 human cancers
- Framework: This paper introduces DeepProfile, an unsupervised deep-learning model using an ensemble of Variational Autoencoders (VAEs) to create robust and biologically interpretable low-dimensional latent spaces from 50,211 transcriptomes across 18 human cancers.
- Key Biological Findings: Universal latent variables across all 18 cancers are driven by genes controlling immune cell activation, while cancer-specific variables define molecular subtypes.
- Clinical Associations: DeepProfile linked gene expression to clinical outcomes, finding that Tumour Mutation Burden (TMB) is associated with cell-cycle pathways, and patient survival correlates with DNA-mismatch repair and MHC class II antigen presentation pathway activity.
PubMed: 39690287 | DOI: 10.1038/s41551-024-01290-8 | Overview generated by: Gemini 2.5 Flash, 10/12/2025
Study Goal and DeepProfile Framework
This paper introduces DeepProfile, an unsupervised deep-learning framework designed to generate low-dimensional, robust, and biologically interpretable latent spaces from gene-expression data across multiple human cancers. The framework addresses key challenges in applying deep learning to cancer transcriptomics, namely the risk of overfitting, the non-deterministic nature of model training, and the “black box” problem of biological interpretability.
Methods: Deep Learner and Interpreter Components
The DeepProfile framework is built upon three main components: Data Collector, Deep Learner, and Interpreter.
- Data Collector and Cohort: The analysis utilized a large cohort of 50,211 transcriptomes from 1,098 datasets across 18 human cancer types, primarily sourced from the Gene Expression Omnibus (GEO).
- Deep Learner (Ensemble VAE): The core model is an ensemble of Variational Autoencoders (VAEs). The VAE encodes high-dimensional gene expression signals into a low-dimensional latent space, typically composed of 150 latent variables. The ensemble approach integrates signals from hundreds of different VAE runs and latent dimension sizes to enhance model robustness and stability.
- Interpreter (Integrated Gradients): To ensure biological interpretability, DeepProfile incorporates the Integrated Gradients feature attribution method. This tool quantifies how much each input gene contributes to a specific latent variable, allowing for subsequent pathway enrichment tests to define pathway-level attributions.
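To make the Interpreter step concrete, here is a minimal sketch of Integrated Gradients on a toy encoder. It is not DeepProfile's network: the single tanh unit, the weights, the all-zero baseline, and the midpoint-rule approximation are all assumptions chosen for illustration. The idea is to average the gradient of a latent variable along a straight path from a baseline input to the real input, then scale by the input difference.

```python
import math

def latent(x, w):
    """Toy 'encoder' (assumption): one latent variable = tanh(weighted sum of genes)."""
    return math.tanh(sum(wi * xi for wi, xi in zip(w, x)))

def latent_grad(x, w):
    """Analytic gradient of the toy latent variable with respect to each input gene."""
    z = sum(wi * xi for wi, xi in zip(w, x))
    d = 1.0 - math.tanh(z) ** 2
    return [d * wi for wi in w]

def integrated_gradients(x, w, baseline=None, steps=200):
    """Approximate Integrated Gradients with a midpoint Riemann sum."""
    if baseline is None:
        baseline = [0.0] * len(x)  # all-zero expression baseline (assumption)
    total = [0.0] * len(x)
    for k in range(steps):
        alpha = (k + 0.5) / steps  # midpoint of the k-th sub-interval of [0, 1]
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        grad = latent_grad(point, w)
        total = [t + g for t, g in zip(total, grad)]
    # Scale the path-averaged gradient by the distance from the baseline.
    return [(xi - b) * t / steps for xi, b, t in zip(x, baseline, total)]

weights = [0.8, -0.5, 0.0]  # hypothetical encoder weights for three genes
sample = [1.0, 2.0, 3.0]    # hypothetical expression values for one sample
attributions = integrated_gradients(sample, weights)
```

The attributions sum (up to numerical error) to the difference between the latent value at the sample and at the baseline, which is the completeness property of Integrated Gradients. The third gene has zero weight and therefore receives zero attribution.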
Key Findings and Performance
DeepProfile demonstrated superior performance in capturing biological signals compared to alternative dimensionality-reduction methods, including PCA, ICA, and standard VAEs.
- Interpretability Advantage: Across the 18 cancers, DeepProfile’s latent variables captured a significantly higher average number of known KEGG, BioCarta, and Reactome pathways, as well as Oncogenic Signatures gene sets, than the other methods (e.g., DeepProfile captured pathways in 106 out of 108 test cases).
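One common way to decide whether a latent variable "captures" a pathway is a one-sided hypergeometric (Fisher-style) enrichment test on its top-attributed genes. The sketch below illustrates that general technique. Whether it matches the paper's exact procedure, cutoffs, or multiple-testing correction is an assumption, and the gene identifiers are synthetic placeholders.

```python
from math import comb

def enrichment_p_value(top_genes, pathway_genes, universe):
    """One-sided hypergeometric p-value: P(overlap >= observed) under random draws."""
    universe = set(universe)
    top = set(top_genes) & universe
    pathway = set(pathway_genes) & universe
    n_universe, n_top, n_pathway = len(universe), len(top), len(pathway)
    observed = len(top & pathway)
    # Sum the upper tail of the hypergeometric distribution with exact integers.
    tail = sum(
        comb(n_pathway, k) * comb(n_universe - n_pathway, n_top - k)
        for k in range(observed, min(n_pathway, n_top) + 1)
    )
    return tail / comb(n_universe, n_top)

# Synthetic example: 1,000 measured genes, a 50-gene pathway, 100 top-attributed genes.
universe = [f"G{i}" for i in range(1000)]
pathway = [f"G{i}" for i in range(50)]
top_hit = [f"G{i}" for i in range(20)] + [f"G{i}" for i in range(500, 580)]  # 20 overlap
top_miss = [f"G{i}" for i in range(500, 600)]                                # 0 overlap

p_hit = enrichment_p_value(top_hit, pathway, universe)
p_miss = enrichment_p_value(top_miss, pathway, universe)
```

With about five overlapping genes expected by chance, an overlap of 20 yields a very small p-value, while the list with no overlap yields a p-value of 1. In practice such p-values would be corrected for the number of pathways and latent variables tested.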
Pan-Cancer Commonality and Specificity
Analysis of the learned latent spaces revealed distinct patterns of gene expression variation:
- Universal Patterns: Genes that are universally important in defining the latent spaces across all 18 cancer types primarily control immune cell activation and aspects of the inflammatory response.
- Cancer-Specific Patterns: Cancer-type-specific genes and pathways contribute to the latent spaces for only one particular cancer type, serving to define molecular disease subtypes and reflect tissue-specific biology.
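The distinction between universal and cancer-specific genes can be sketched as a simple set operation over per-cancer lists of important genes. This is an illustration only: how the paper ranks or thresholds genes is not described in this overview, and the cancer labels and gene names below are hypothetical placeholders.

```python
from collections import Counter

def split_universal_and_specific(top_genes_by_cancer):
    """Return (genes important in every cancer, {cancer: genes important only there})."""
    n_cancers = len(top_genes_by_cancer)
    # Count in how many cancer types each gene appears among the important genes.
    counts = Counter(
        gene for genes in top_genes_by_cancer.values() for gene in set(genes)
    )
    universal = {gene for gene, c in counts.items() if c == n_cancers}
    specific = {
        cancer: {gene for gene in set(genes) if counts[gene] == 1}
        for cancer, genes in top_genes_by_cancer.items()
    }
    return universal, specific

# Hypothetical input: top-attributed genes per cancer type.
toy = {
    "cancer_1": ["GENE_A", "GENE_B", "GENE_C"],
    "cancer_2": ["GENE_A", "GENE_D"],
    "cancer_3": ["GENE_A", "GENE_B", "GENE_E"],
}
universal, specific = split_universal_and_specific(toy)
```

Here GENE_A is universal, GENE_C, GENE_D, and GENE_E are each specific to one cancer type, and GENE_B (shared by two of three) falls into neither group.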
Clinical and Mutational Associations
The authors also developed a method to link DeepProfile's latent-variable embeddings to patient- and tumour-level clinical characteristics, specifically patient survival and tumour mutation burden (TMB).
- Tumour Mutation Burden (TMB): TMB was found to be closely associated with the expression of cell-cycle-related pathways across a large majority of cancers.
- Patient Survival: Survival consistently correlated with the activity of the DNA-mismatch repair pathway and the MHC class II antigen presentation pathway.
- Cellular Origin of MHC II: The study further identified that tumour-associated macrophages (TAMs) are a source of the survival-correlated MHC class II transcripts.
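As a rough illustration of linking a latent variable to a tumour-level characteristic, the sketch below computes a Spearman rank correlation between one latent variable and TMB across samples. The choice of statistic is an assumption, since this overview does not describe the paper's actual procedure, and the values are made up. Survival analysis would typically need a time-to-event model and is not shown.

```python
import math

def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation; undefined (division by zero) if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)

def spearman(x, y):
    """Spearman correlation = Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(x), average_ranks(y))

# Made-up values: one latent variable and TMB for five tumours.
latent_values = [0.10, 0.40, 0.35, 0.80, 0.90]
tmb = [1.0, 3.0, 2.5, 10.0, 40.0]
rho = spearman(latent_values, tmb)
```

Because the two toy vectors have identical rank orderings, the correlation is 1 even though the relationship is far from linear, which is why a rank-based statistic is a natural choice for a skewed quantity such as TMB.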
Conclusions
DeepProfile combines unsupervised deep learning with an ensemble approach to produce robust, interpretable gene-expression embeddings for pan-cancer analysis. The resulting latent spaces expose biological patterns shared across cancers (immune-related) as well as patterns specific to individual cancers (subtype-defining). Linking latent variables to clinical phenotypes further identified pathways associated with patient survival (DNA-mismatch repair, MHC class II antigen presentation) and with TMB (cell cycle), demonstrating that deep unsupervised learning can generate novel biological insights from large cancer datasets.