Imputation of label-free quantitative mass spectrometry-based proteomics data using self-supervised deep learning

collaborative filtering
deep learning
imputation
mass spectrometry
proteomics
self-supervised learning
variational autoencoder
  • Goal: To develop and validate PIMMS (Proteomics Imputation Modeling Mass Spectrometry), a deep learning-based framework for imputing missing values in label-free quantitative (LFQ) mass spectrometry data.
  • Method: PIMMS leverages three self-supervised models—Collaborative Filtering (CF), Denoising Autoencoder (DAE), and Variational Autoencoder (VAE)—which significantly outperformed 27 other imputation methods, including median imputation and R-based KNN, on simulated Missing Not At Random (MNAR) data.
  • Impact: Applying PIMMS-VAE to a clinical ALD cohort identified 30 additional significantly differentially abundant proteins (+13.2%) compared to non-imputed data, demonstrating that DL imputation can enhance the biological conclusions derived from proteomics analysis.
Published

26 June 2024

PubMed: 38926340 · DOI: 10.1038/s41467-024-48711-5
Overview generated by: Gemini 2.5 Flash, 28/11/2025

Key Findings: PIMMS for Imputation

The central finding is the development and validation of PIMMS (Proteomics Imputation Modeling Mass Spectrometry), a framework utilizing self-supervised deep learning (DL) models for the imputation of missing values in label-free quantification (LFQ) mass spectrometry (MS) proteomics data.

PIMMS demonstrated superior performance compared to conventional and other machine learning methods (e.g., KNN, Random Forest, median imputation) on simulated missing values, achieving Mean Absolute Error (MAE) values between 0.54 and 0.58 on protein groups for the large HeLa dataset, significantly better than median imputation (MAE 1.24).
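
For reference, the reported metric is the standard mean absolute error over the set M of simulated missing entries, computed on log-scale intensities (which the sub-unit magnitudes imply); the notation below is ours:

```latex
\mathrm{MAE} = \frac{1}{|M|} \sum_{(i,j) \in M} \left| \hat{x}_{ij} - x_{ij} \right|
```

where x_ij is the held-out intensity of feature j in sample i and x̂_ij is its imputed value.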

Crucially, when applied to a clinical cohort:

  • PIMMS-VAE (Variational Autoencoder) recovered 15 of 17 differentially abundant protein groups that were lost when comparing a reduced (80%) dataset to the full non-imputed dataset.
  • When analyzing the full Alcohol-related Liver Disease (ALD) dataset, PIMMS-VAE identified 30 additional proteins (+13.2%) that were significantly differentially abundant across disease stages compared to no imputation, and these proteins were found to be predictive of ALD progression.

Study Design and Datasets

The authors employed a dual-dataset approach for model development and application:

Model Development and Evaluation

  • Data Source: A large dataset of HeLa cell line tryptic lysates consisting of 564 runs acquired over roughly two years on a single Q Exactive HF-X Orbitrap. A smaller subset of 50 runs was also used to test performance dependence on sample size.
  • Evaluation Strategy: Missing values were simulated as Missing Not At Random (MNAR) at 25%, 50%, or 75% using the Lazar et al. procedure, which preferentially masks low intensities so that they are sufficiently represented among the held-out values (see the sketch after this list).
  • Comparison: The three DL models (CF, DAE, VAE) were compared against 27 other imputation approaches, including common methods like k-nearest neighbors (KNN) and random forest (RF).
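
As an illustration of MNAR simulation, the sketch below masks a chosen fraction of observed values with probability increasing toward low intensities. The rank-based weighting is one simple way to realize the Lazar et al. idea and is our assumption; the exact PIMMS procedure may differ.

```python
import numpy as np

rng = np.random.default_rng(42)

def simulate_mnar(X: np.ndarray, frac: float = 0.25) -> np.ndarray:
    """Return a boolean mask selecting `frac` of the observed entries of X
    for simulated missingness, preferring low log-intensities (MNAR)."""
    mask = np.zeros(X.size, dtype=bool)
    obs = np.flatnonzero(~np.isnan(X.ravel()))         # indices of observed values
    ranks = X.ravel()[obs].argsort().argsort()         # rank 0 = lowest intensity
    weights = (ranks.max() + 1 - ranks).astype(float)  # low intensities weighted higher
    weights /= weights.sum()
    n_mask = int(frac * obs.size)
    mask[rng.choice(obs, size=n_mask, replace=False, p=weights)] = True
    return mask.reshape(X.shape)
```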

Clinical Application (Case Study)

  • Data Source: Blood plasma proteomics data from an Alcohol-related Liver Disease (ALD) cohort involving 358 individuals.
  • Goal: To assess the impact of DL imputation on downstream biological conclusions, specifically differential abundance analysis and the prediction of disease progression.

Methods: Self-Supervised Deep Learning Models

The PIMMS framework incorporates three self-supervised deep learning architectures to model and impute missing intensities at the precursor, aggregated peptide, or protein group level. All three are trained to reconstruct observed intensities; the VAE is additionally a generative model.

Collaborative Filtering (CF)

  • Mechanism: Assigns a trainable embedding vector to each sample and each feature (e.g., protein group); the intensity prediction is generated by combining the two embeddings (see the sketch below).
  • Loss Function: Mean-Squared Error (MSE) reconstruction loss.
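
A minimal PyTorch sketch of the collaborative-filtering idea follows. The dot-product-plus-biases combination and all sizes are illustrative assumptions, not the exact PIMMS architecture.

```python
import torch
import torch.nn as nn

class CollabFiltering(nn.Module):
    """One trainable embedding per sample and per feature; the predicted
    log-intensity is their dot product plus per-sample and per-feature
    biases (an assumption; PIMMS may combine embeddings differently)."""

    def __init__(self, n_samples: int, n_features: int, dim: int = 32):
        super().__init__()
        self.sample_emb = nn.Embedding(n_samples, dim)
        self.feature_emb = nn.Embedding(n_features, dim)
        self.sample_bias = nn.Embedding(n_samples, 1)
        self.feature_bias = nn.Embedding(n_features, 1)

    def forward(self, sample_idx: torch.Tensor, feature_idx: torch.Tensor) -> torch.Tensor:
        dot = (self.sample_emb(sample_idx) * self.feature_emb(feature_idx)).sum(dim=-1)
        return (dot
                + self.sample_bias(sample_idx).squeeze(-1)
                + self.feature_bias(feature_idx).squeeze(-1))

# Usage: train on observed (sample, feature, log-intensity) triplets with
# nn.MSELoss(), then query the missing (sample, feature) pairs to impute them.
```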

Denoising Autoencoder (DAE)

  • Mechanism: An autoencoder that learns to reconstruct the original data from a corrupted (masked) input, forcing it to learn a useful deterministic latent representation of each sample (see the sketch below).
  • Loss Function: Mean-Squared Error (MSE) reconstruction loss.
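
A minimal sketch of the denoising setup, assuming zero-masking as the corruption and MSE scored on the corrupted positions; layer sizes and the masking rate are illustrative, not the PIMMS defaults.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DAE(nn.Module):
    """Reconstruct a sample's full intensity vector from a masked copy."""

    def __init__(self, n_features: int, latent_dim: int = 64):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(n_features, 256), nn.LeakyReLU(),
            nn.Linear(256, latent_dim),
        )
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 256), nn.LeakyReLU(),
            nn.Linear(256, n_features),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.decoder(self.encoder(x))

def dae_loss(model: DAE, x: torch.Tensor, mask_rate: float = 0.25) -> torch.Tensor:
    """Corrupt the input by zeroing a random subset of entries, then score
    MSE on exactly the corrupted positions the model had to fill in."""
    corrupt = torch.rand_like(x) < mask_rate
    x_in = x.masked_fill(corrupt, 0.0)
    recon = model(x_in)
    return F.mse_loss(recon[corrupt], x[corrupt])
```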

Variational Autoencoder (VAE)

  • Mechanism: A generative model that encodes each sample as a stochastic latent representation, modeled as a multivariate Gaussian distribution. It adds a regularization constraint (Kullback–Leibler divergence loss) on the latent space in addition to reconstruction (see the sketch below).
  • Loss Function: A probabilistic loss that accounts for both reconstruction and the latent-space constraint (the negative evidence lower bound, ELBO).
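
A minimal sketch of the VAE objective, assuming a diagonal Gaussian posterior with the usual reparameterization trick and a standard-normal prior; sizes are illustrative, not the PIMMS defaults.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VAE(nn.Module):
    """Encoder outputs the mean and log-variance of a diagonal Gaussian
    over the latent code; the decoder maps samples of that code back to
    the full intensity vector."""

    def __init__(self, n_features: int, latent_dim: int = 32):
        super().__init__()
        self.encoder = nn.Linear(n_features, 256)
        self.mu = nn.Linear(256, latent_dim)
        self.logvar = nn.Linear(256, latent_dim)
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 256), nn.LeakyReLU(),
            nn.Linear(256, n_features),
        )

    def forward(self, x: torch.Tensor):
        h = F.leaky_relu(self.encoder(x))
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterization
        return self.decoder(z), mu, logvar

def vae_loss(recon, x, mu, logvar):
    """Negative ELBO: reconstruction error plus the KL divergence of the
    approximate posterior from a standard-normal prior."""
    recon_loss = F.mse_loss(recon, x, reduction="sum")
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon_loss + kl
```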

Detailed Results and Performance

Performance by Feature Level and Abundance

  • DL models were effective at all levels of MS data (precursors, peptides, and protein groups). Lower-level data (precursors) were found to be easier to learn due to less aggregation.
  • The imputation accuracy depended on the protein group’s prevalence: MAE was significantly lower (better) for protein groups observed in more than 80% of samples (MAE below 0.4) than for those observed in 25–80% of samples (MAE 0.6–0.8).

Scalability

  • The deep learning methods, unlike some R-based packages evaluated, scaled well to high-dimensional data (large sample size and many features), which is critical for analyzing precursors or peptides rather than just aggregated protein groups.

Conclusions and Recommendations

The study concludes that deep learning models, particularly the VAE component of PIMMS, provide a powerful, holistic, and scalable approach to handling missing values in LFQ MS data.

The traditional reliance on heuristic approaches (like drawing from a down-shifted normal distribution) can be misleading, as not all missing values are due to the limit of detection. Deep learning models avoid this rigid assumption by learning the complex patterns from the data distribution itself.
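
For contrast, this is roughly what the down-shifted-normal heuristic looks like; the shift of 1.8 and width of 0.3 (in units of the observed standard deviation) are the widely used Perseus-style defaults, quoted here as an illustration rather than taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def downshifted_normal_impute(x: np.ndarray, shift: float = 1.8,
                              width: float = 0.3) -> np.ndarray:
    """Fill NaNs in a vector of log-intensities with draws from a normal
    distribution shifted below the observed mean, i.e. the assumption
    that every missing value sits near the limit of detection."""
    x = x.copy()
    obs = x[~np.isnan(x)]
    mu, sd = obs.mean(), obs.std()
    n_missing = int(np.isnan(x).sum())
    x[np.isnan(x)] = rng.normal(mu - shift * sd, width * sd, size=n_missing)
    return x
```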

The authors recommend the use of deep learning for imputation in MS-based proteomics, especially on larger datasets, as it improves the sensitivity of differential abundance analysis and can lead to more robust biological conclusions by identifying more significant findings. The PIMMS workflows and code are publicly shared to facilitate reproducibility and adoption.