Characterization of missing values in untargeted MS-based metabolomics data and evaluation of missing data handling strategies

data analysis
imputation
mass spectrometry
metabolomics
missing data
quality control
  • Objective: The study systematically characterized the sources of missing values (MVs) in untargeted Mass Spectrometry (MS)-based metabolomics data and evaluated various imputation strategies.
  • Missing Value Types: Distinguished between systematic missingness primarily due to Limits of Detection (LOD) and random missingness due to technical issues.
  • Best Strategy: For the prevalent LOD-related MVs, which represent concentrations near zero, simple methods like imputation with half of the minimum observed value were found to be effective and often outperformed complex data-driven methods (e.g., PPCA).
  • Recommendation: Researchers should use targeted imputation strategies based on the nature of the missingness to avoid introducing bias and reducing statistical power.
Published: 23 January 2026

PubMed: 30830398
DOI: 10.1007/s11306-018-1420-2
Overview generated by: Gemini 2.5 Flash, 28/11/2025

Key Focus: The Problem of Missing Values in Untargeted Metabolomics

This study provides a systematic characterization of missing values (MVs) in untargeted mass spectrometry (MS)-based metabolomics data and evaluates various strategies for handling these MVs. Missing data is a common issue that can severely reduce statistical power and introduce bias in downstream biomedical studies.

Characterization of Missing Values

The study distinguished two primary types of MVs based on their origin:

  1. Missing Due to Limits of Detection (LOD): This is the most prevalent form and is often systematic. It occurs when a compound’s concentration is below the instrument’s detection threshold. This systematic pattern can be further influenced by run day-dependent effects (e.g., changes in instrument performance over time).
  2. Missing at Random (MAR) or Missing Completely at Random (MCAR): These MVs occur less frequently and are typically a consequence of random technical errors during sample preparation or measurement (e.g., ionization suppression, inconsistent retention time).
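
The difference between the two mechanisms can be illustrated with a small simulation. This is a minimal sketch rather than the study's procedure: the log-normal intensities, the detection limit of 3.0, the 10% random drop rate, and all function names are assumptions chosen for illustration.

```python
import random

def censor_below_lod(values, lod):
    """Systematic missingness: any value below the detection limit is lost."""
    return [v if v >= lod else None for v in values]

def drop_completely_at_random(values, rate, rng):
    """Random missingness: each value is lost with probability `rate`."""
    return [None if rng.random() < rate else v for v in values]

def mean_of_hidden(true_values, observed):
    """Average true value of the entries that ended up missing."""
    hidden = [t for t, o in zip(true_values, observed) if o is None]
    return sum(hidden) / len(hidden)

rng = random.Random(0)  # fixed seed so the sketch is repeatable
LOD = 3.0               # hypothetical detection limit
true_values = [rng.lognormvariate(2.0, 1.0) for _ in range(1000)]

lod_missing = censor_below_lod(true_values, LOD)
mcar_missing = drop_completely_at_random(true_values, 0.1, rng)

print("mean hidden value, LOD: ", mean_of_hidden(true_values, lod_missing))
print("mean hidden value, MCAR:", mean_of_hidden(true_values, mcar_missing))
```

Under LOD censoring every hidden value lies below the threshold, whereas under MCAR the hidden values are spread across the whole intensity range.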

Evaluation of Imputation Strategies

The study evaluated several common imputation strategies, including:

  • Fixed-Value Imputation: Replacing MVs with a fixed constant, such as zero, the mean, or a value derived from the limit of detection (LOD) (e.g., half the minimum detected value).
  • Data-Driven Imputation: Using statistical or machine learning methods based on the observed data, such as Probabilistic Principal Component Analysis (PPCA) or k-Nearest Neighbors (kNN).
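
To make the two families concrete, the sketch below implements half-minimum imputation for a single feature and a deliberately simple kNN imputation across samples. It is an illustrative approximation under stated assumptions, not the implementation evaluated in the study: missing values are represented as `None`, distances are root-mean-square differences over jointly observed features, the function names are hypothetical, and the toy values are made up.

```python
import math

def impute_half_min(column):
    """Fixed-value imputation: fill gaps with half the smallest observed value."""
    observed = [v for v in column if v is not None]
    if not observed:
        return list(column)  # nothing to derive a fill value from
    fill = min(observed) / 2
    return [fill if v is None else v for v in column]

def knn_impute(rows, k=2):
    """Data-driven imputation: fill a gap with the mean of the k nearest samples.

    rows: list of samples, each a list of feature values (None = missing).
    """
    def distance(a, b):
        shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
        if not shared:
            return math.inf  # no common features, cannot compare
        return math.sqrt(sum((x - y) ** 2 for x, y in shared) / len(shared))

    imputed = [list(r) for r in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            # Candidate donors: other samples that have feature j observed.
            donors = sorted(
                (distance(row, other), other[j])
                for m, other in enumerate(rows)
                if m != i and other[j] is not None
            )
            nearest = [v for d, v in donors[:k] if d < math.inf]
            if nearest:
                imputed[i][j] = sum(nearest) / len(nearest)
    return imputed

print(impute_half_min([4.0, None, 10.0, None]))
samples = [
    [10.0, 20.0, 30.0],
    [11.0, 21.0, None],
    [50.0, 60.0, 70.0],
    [12.0, 24.0, 34.0],
]
print(knn_impute(samples, k=2)[1])
```

Half-minimum imputation uses only the feature's own minimum, while the kNN variant borrows information from similar samples, which is the sense in which it is data-driven.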

Best Performing Strategy

The key finding regarding MV handling was that the best strategy depends on the nature of the missingness:

  • For LOD-related MVs (Systematic Missingness): Simple methods like half-of-the-minimum-observed-value imputation performed surprisingly well and often outperformed more complex data-driven methods. This is because LOD-MVs are not truly random: they stand in for real concentrations that fell below the detection threshold, so a small fixed value is a reasonable approximation.
  • For MAR/MCAR MVs (Random Missingness): Data-driven methods like PPCA and kNN generally performed better than fixed-value methods. However, given that LOD-MVs dominate untargeted metabolomics, the overall benefit of complex methods was limited.
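
A toy calculation shows why a small fill value suits LOD-related MVs. The numbers below are made up for illustration and are not results from the study, and the comparison is against mean imputation (another fixed-value method) rather than against PPCA or kNN.

```python
import math

def rmse(true_values, fill):
    """Root-mean-square error when every hidden value is replaced by `fill`."""
    return math.sqrt(sum((t - fill) ** 2 for t in true_values) / len(true_values))

observed = [4.0, 8.0, 12.0, 20.0]  # made-up values above the detection limit
hidden_true = [0.5, 1.0, 2.0]      # made-up values that fell below it

half_min_fill = min(observed) / 2          # 2.0
mean_fill = sum(observed) / len(observed)  # 11.0

print("RMSE, half-minimum:", rmse(hidden_true, half_min_fill))
print("RMSE, mean:        ", rmse(hidden_true, mean_fill))
```

Because every hidden value sits below the smallest observed one, a fill value derived from the observed bulk of the data overshoots, while half the minimum stays in the right range.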

Conclusions and Recommendations

The study emphasizes that the high proportion of systematic, LOD-related missingness is a characteristic feature of untargeted MS-based metabolomics.

Researchers should:

  1. Characterize Missingness: Visually inspect data to distinguish systematic LOD-MVs from random MVs.
  2. Apply Targeted Imputation: For the high number of LOD-MVs, use a simple, robust method like imputation with a small value (e.g., half the minimum).

Future Development: The authors call for the development of new, tailored imputation methods that can explicitly and simultaneously model both the systematic (LOD) and random (MAR/MCAR) components of missingness in metabolomics data.
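
As a numeric complement to visual inspection, one possible first check is to tabulate, for each feature, the mean observed intensity alongside the fraction of missing values. This is a minimal sketch, not a procedure taken from the study: the function name is hypothetical, missing values are assumed to be stored as `None`, and the toy matrix is made up.

```python
def missingness_by_feature(matrix):
    """For each feature (column), return (mean of observed values, missing fraction)."""
    summary = []
    for column in zip(*matrix):  # iterate over features
        observed = [v for v in column if v is not None]
        missing_fraction = 1 - len(observed) / len(column)
        mean_observed = sum(observed) / len(observed) if observed else None
        summary.append((mean_observed, missing_fraction))
    return summary

# Toy matrix: rows are samples, columns are features (values are made up).
matrix = [
    [2.0, 50.0, 900.0],
    [None, 40.0, 1100.0],
    [None, None, 1000.0],
    [4.0, 60.0, 1000.0],
]
for mean_observed, missing_fraction in missingness_by_feature(matrix):
    print(mean_observed, missing_fraction)
```

Features whose missing fraction rises as their mean observed intensity falls are candidates for LOD-related missingness, while gaps scattered across high-intensity features point toward random missingness.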