Multi-Omics Factor Analysis: a framework for unsupervised integration of multi-omics data sets

Multi-omics
Unsupervised Learning
Factor Analysis
Latent Variables
Dimension Reduction
Cancer
Single-Cell Genomics
Bioinformatics
Published

23 January 2026

PubMed: 29925568
DOI: 10.15252/msb.20178124
Overview generated by: Gemini 2.5 Flash, 27/11/2025

Background and Objective

The complexity and heterogeneity of multi-omics data (e.g., combined genomics, transcriptomics, and epigenomics) call for dedicated statistical methods for integration. Most existing methods are either limited to two data types or are “supervised,” meaning they require a known outcome variable (such as disease status). This leaves a need for unsupervised integration methods that systematically discover the principal sources of variation in multi-omics data sets without prior knowledge of the relevant biological axes.

This paper introduces MOFA (Multi-Omics Factor Analysis), a computational framework designed to:

  1. Integrate multiple heterogeneous omics data sets in an unsupervised manner.
  2. Infer a set of latent factors (hidden variables) that capture the major axes of biological and technical variability.
  3. Disentangle shared heterogeneity (variation common to multiple omics layers) from modality-specific heterogeneity (variation unique to a single omics layer).

Methods: The MOFA Framework

Core Algorithm

MOFA is based on a general form of Factor Analysis. It models each observed omics data matrix (e.g., RNA expression, DNA methylation) as a linear combination of a small number of latent factors, which are shared across modalities, plus residual noise.
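
As a sketch of what "linear combination of latent factors" means, the decomposition can be written in matrix form. The symbols are notation introduced here for illustration: Y^m is the N x D_m data matrix of modality m (N samples, D_m features), Z is the N x K matrix of factor values shared by all modalities, W^m is the D_m x K loading matrix of modality m, and ε^m is the residual noise. The Gaussian noise term shown is an assumption that fits continuous data only; other data types would need a different noise model.

```latex
\mathbf{Y}^{m} = \mathbf{Z}\,{\mathbf{W}^{m}}^{\top} + \boldsymbol{\varepsilon}^{m},
\qquad m = 1, \dots, M,
\qquad \varepsilon^{m}_{nd} \sim \mathcal{N}\!\left(0, \sigma_{md}^{2}\right)
```

Only Z is common to all M modalities; each modality has its own loadings, which is what allows a factor to be strong in one omics layer and absent in another.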

  • Shared Factors: The factors are inferred jointly from all data modalities. The model automatically learns which factors are active in which omics layer, so each factor is mapped to the modalities it affects.
  • Sparsity: The model employs a sparse prior on the factor loadings. This ensures that each factor is defined by only a small, interpretable subset of molecular features (genes, CpGs, etc.), making the resulting factors easier to interpret biologically.
  • Probabilistic Framework: Being a fully probabilistic model, MOFA can naturally accommodate missing values (and impute them), batch effects, and different data distributions (e.g., binary for mutations, continuous for expression).
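
The structure described in the bullets above can be illustrated with a small simulation. This is a minimal sketch under stated assumptions, not MOFA's inference procedure: the function `simulate_views`, the `activity` matrix, and the random zero-mask standing in for the sparse prior are all hypothetical, and the noise is assumed Gaussian.

```python
import numpy as np

def simulate_views(activity, dims, n_samples=100, sparsity=0.7, noise_sd=0.5, seed=0):
    """Simulate multi-view data Y^m = Z @ W^m.T + noise (illustrative only).

    activity: (n_views, n_factors) 0/1 array; activity[m][k] = 1 means
              factor k is allowed to affect view m (hypothetical input).
    dims:     number of features in each view.
    """
    rng = np.random.default_rng(seed)
    activity = np.asarray(activity, dtype=float)
    n_views, n_factors = activity.shape
    Z = rng.normal(size=(n_samples, n_factors))  # factor values shared by all views
    loadings, views = [], []
    for m in range(n_views):
        W = rng.normal(size=(dims[m], n_factors))
        # Feature-wise sparsity: zero out a random fraction of the loadings.
        W = W * (rng.random(size=W.shape) > sparsity)
        # View-wise sparsity: switch off whole factors in this view.
        W = W * activity[m]
        Y = Z @ W.T + noise_sd * rng.normal(size=(n_samples, dims[m]))
        loadings.append(W)
        views.append(Y)
    return Z, loadings, views

# Factor 0 is shared, factor 1 is specific to view 0, factor 2 to view 1.
activity = [[1, 1, 0],
            [1, 0, 1]]
Z, loadings, views = simulate_views(activity, dims=(50, 40))
for m, W in enumerate(loadings):
    frac = np.round((W != 0).mean(axis=0), 2)
    print(f"view {m}: fraction of non-zero loadings per factor = {frac}")
```

In a real analysis the direction is reversed: only the views are observed, and the factors, loadings, and activity pattern are what the model has to infer.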

Application: Chronic Lymphocytic Leukemia (CLL)

MOFA was applied to a cohort of 200 Chronic Lymphocytic Leukemia (CLL) patient samples, profiled for four molecular modalities: somatic mutations, RNA expression, DNA methylation, and ex vivo drug responses.

Key Results and Unsupervised Disentanglement

Dissecting CLL Heterogeneity

MOFA identified 10 latent factors that captured the major dimensions of disease heterogeneity in CLL. The factors fell into two groups:

  • Shared Factors: Factors shared across RNA expression, DNA methylation, and drug responses were strongly associated with known clinical drivers, such as the immunoglobulin heavy chain variable (IGHV) mutation status, a major prognostic marker in CLL.
  • Modality-Specific Factors: Other factors were specific to a single omics layer, such as a factor that loaded only on somatic mutations and separated samples by known mutations such as SF3B1.
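
One way to tell shared from modality-specific factors is to compute, for every factor and modality, the fraction of variance the factor explains (a coefficient of determination, R²). The sketch below assumes factor values `Z` and loadings `W` are already available from a fitted model. The function names, the toy data, and the 0.02 activity threshold are illustrative assumptions, not values taken from this overview.

```python
import numpy as np

def variance_explained(Z, loadings, views):
    """Return an (n_views, n_factors) array of R^2 values.

    R^2[m, k] is the fraction of the (centered) variance of view m that is
    reproduced by factor k alone. NaNs in a view are ignored.
    """
    r2 = np.zeros((len(views), Z.shape[1]))
    for m, (W, Y) in enumerate(zip(loadings, views)):
        Yc = Y - np.nanmean(Y, axis=0)          # center each feature
        total = np.nansum(Yc ** 2)
        for k in range(Z.shape[1]):
            resid = Yc - np.outer(Z[:, k], W[:, k])
            r2[m, k] = 1.0 - np.nansum(resid ** 2) / total
    return r2

def classify_factors(r2, threshold=0.02):
    """Label each factor by the number of views in which it exceeds the threshold."""
    labels = []
    for k in range(r2.shape[1]):
        n_active = int((r2[:, k] > threshold).sum())
        labels.append("shared" if n_active > 1 else
                      "specific" if n_active == 1 else "inactive")
    return labels

# Toy data: factor 0 affects both views, factor 1 affects view 0 only.
rng = np.random.default_rng(1)
Z = rng.normal(size=(200, 2))
W0 = rng.normal(size=(30, 2))
W1 = rng.normal(size=(20, 2))
W1[:, 1] = 0.0
loadings = [W0, W1]
views = [Z @ W.T + 0.5 * rng.normal(size=(200, W.shape[0])) for W in loadings]

r2 = variance_explained(Z, loadings, views)
print(np.round(r2, 2))
print(classify_factors(r2))
```

Reading the R² table column by column gives the grouping used above: a factor with non-negligible R² in several modalities is shared, and one with R² concentrated in a single modality is specific to it.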

Downstream Applications

The inferred latent factors enabled several downstream analyses:

  • Sample Clustering: The factors were used to robustly identify subgroups of patients that were characterized by distinct multi-omics profiles.
  • Feature Set Identification: The sparse loadings pointed directly to a small set of key molecular features (genes, CpGs, etc.) that defined each factor and its associated biological axis.
  • Prediction and Imputation: The factors could be used to impute missing values and to improve outcome prediction compared to models built on raw data.
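
Two of these downstream uses follow directly from the factor model and can be sketched in a few lines. The code assumes that factor values `Z` and loadings `W` come from an already fitted model and that the data matrix has been centered. The helper names `impute_missing` and `top_features`, and the feature names in the example, are hypothetical.

```python
import numpy as np

def impute_missing(Y, Z, W):
    """Fill NaN entries of a centered view Y with the reconstruction Z @ W.T."""
    Y_hat = Z @ W.T
    return np.where(np.isnan(Y), Y_hat, Y)

def top_features(W, factor, names, n=5):
    """Return the n feature names with the largest absolute loading on a factor."""
    order = np.argsort(-np.abs(W[:, factor]))[:n]
    return [names[i] for i in order]

# Toy example: a noise-free view with two entries masked as missing.
rng = np.random.default_rng(2)
Z = rng.normal(size=(20, 2))
W = rng.normal(size=(6, 2))
Y_true = Z @ W.T
Y_obs = Y_true.copy()
Y_obs[3, 1] = np.nan
Y_obs[10, 4] = np.nan

Y_imputed = impute_missing(Y_obs, Z, W)
names = [f"feature_{i}" for i in range(W.shape[0])]  # placeholder names
print(top_features(W, factor=0, names=names, n=3))
```

Observed entries are left untouched and only the gaps are filled. With noisy real data the imputed values are estimates whose quality depends on how well the factors describe that modality.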

Conclusions and Significance

MOFA is an effective, flexible, and scalable statistical framework for the unsupervised integration of multi-omics data sets.

Its ability to disentangle shared and specific axes of variation is paramount for discovering and interpreting the underlying biological phenomena in complex datasets. By distilling high-dimensional omics data into a small, interpretable set of latent factors, MOFA provides a powerful foundation for precision medicine, subtyping diseases, and gaining mechanistic insights.