Biomarker identification by interpretable maximum mean discrepancy

biomarker discovery

computational biology

machine learning

multi-omics

statistics

two-sample test

Topic: Introduction of SpInOpt-MMD (Sparse, Interpretable, and Optimized Maximum Mean Discrepancy), a novel method for simultaneously performing two-sample testing and biomarker feature selection in high-dimensional omics data.
Method: SpInOpt-MMD integrates sparse and interpretable optimization into the Maximum Mean Discrepancy (MMD) test, allowing it to detect statistically significant group differences and identify the features (biomarkers) responsible in a single step.
Impact: The method remains effective at small sample sizes and, in several experiments, outperforms alternatives such as SHAP (SHapley Additive exPlanations) and univariate association analysis, offering a unified approach to biomarker discovery in multi-omics and other biomedical applications.

Published

23 January 2026

PubMed: 38940158

DOI: 10.1093/bioinformatics/btae251

Overview generated by: Gemini 2.5 Flash, 27/11/2025

Background and Objective

In biomedical applications, researchers frequently work with paired groups of samples (e.g., treated vs. control, or diseased vs. healthy) and aim to identify discriminating features, or biomarkers, from high-dimensional omics data. This is fundamentally a two-sample problem: it requires a statistical test to establish that the groups differ and a method to interpret which features account for that difference.

While the multivariate Maximum Mean Discrepancy (MMD) test can quantify group-level differences, identifying the specific features (biomarkers) usually requires a separate, often less powerful, univariate feature selection step. The objective of this study was to introduce a method that performs two-sample testing and feature selection together in a single experiment.
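
For readers unfamiliar with the standard MMD test mentioned above, the following is a minimal sketch of a kernel two-sample test, not the authors' code. It assumes a Gaussian kernel with a median-heuristic bandwidth, the unbiased MMD² estimate, and a permutation null; the function names and the simulated data are hypothetical and chosen only for illustration.

```python
import numpy as np


def squared_distances(a):
    """Pairwise squared Euclidean distances between the rows of a."""
    norms = np.sum(a**2, axis=1)
    d = norms[:, None] + norms[None, :] - 2.0 * a @ a.T
    return np.maximum(d, 0.0)  # clip tiny negatives from round-off


def mmd2_unbiased(k, n):
    """Unbiased MMD^2 from a pooled kernel matrix; the first n rows are group X."""
    m = k.shape[0] - n
    kxx, kyy, kxy = k[:n, :n], k[n:, n:], k[:n, n:]
    term_x = (kxx.sum() - np.trace(kxx)) / (n * (n - 1))
    term_y = (kyy.sum() - np.trace(kyy)) / (m * (m - 1))
    return term_x + term_y - 2.0 * kxy.mean()


def mmd_permutation_test(x, y, n_permutations=500, seed=0):
    """Return (MMD^2 estimate, permutation p-value) for samples x and y."""
    rng = np.random.default_rng(seed)
    pooled = np.vstack([x, y])
    n = x.shape[0]
    d = squared_distances(pooled)
    # Median heuristic (an assumption): squared bandwidth = median squared distance.
    bandwidth_sq = np.median(d[np.triu_indices_from(d, k=1)])
    k = np.exp(-d / (2.0 * bandwidth_sq))
    observed = mmd2_unbiased(k, n)
    exceed = 0
    for _ in range(n_permutations):
        # Shuffling group labels is the same as permuting rows and columns of k.
        perm = rng.permutation(pooled.shape[0])
        if mmd2_unbiased(k[np.ix_(perm, perm)], n) >= observed:
            exceed += 1
    return observed, (exceed + 1) / (n_permutations + 1)


# Simulated example: 30 samples per group, 50 features, 5 of them shifted.
rng = np.random.default_rng(1)
x = rng.normal(size=(30, 50))
y = rng.normal(size=(30, 50))
y[:, :5] += 2.0
stat, p_value = mmd_permutation_test(x, y)
print(f"MMD^2 = {stat:.4f}, p = {p_value:.4f}")
```

A test like this reports whether the groups differ, but not which of the 50 features are responsible; that is the gap the unified approach aims to close.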

Methods: SpInOpt-MMD

The authors developed SpInOpt-MMD (Sparse, Interpretable, and Optimized MMD test), a novel statistical framework designed to simultaneously test for differences between two high-dimensional datasets and identify the most relevant features contributing to that difference.

Approach: SpInOpt-MMD extends the standard MMD by incorporating sparse and interpretable optimization techniques. This allows the model to quantify the difference between two sample distributions (two-sample test) while simultaneously performing feature selection.
Feature Selection: Unlike methods that rely on subsequent univariate analysis or complex post-hoc interpretation (like SHapley Additive exPlanations, or SHAP), SpInOpt-MMD directly outputs the set of statistically significant and distinguishing features (biomarkers) as part of the core testing process.
Versatility: The method is versatile and was demonstrated on a variety of data types, including gene expression measurements, text data, and images.
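
This overview does not spell out the optimization itself, so the following is only a toy illustration of the general idea of sparse, feature-weighted MMD, not the SpInOpt-MMD algorithm. It assumes a weighted linear kernel k(a, b) = Σ_d w_d a_d b_d, for which the biased MMD² estimate reduces to Σ_d w_d (mean_x_d − mean_y_d)². Maximizing that quantity minus an L1 penalty over non-negative weights of unit L2 norm then has a closed-form, soft-thresholded solution. The function name, penalty value, and simulated data are all hypothetical.

```python
import numpy as np


def sparse_linear_mmd_weights(x, y, penalty):
    """Sparse feature weights for a weighted linear-kernel MMD (toy sketch).

    Solves: max_w  sum_d w_d * delta_d - penalty * sum_d w_d
            s.t.   w >= 0, ||w||_2 <= 1,
    where delta_d is the squared mean difference of feature d.
    """
    delta = (x.mean(axis=0) - y.mean(axis=0)) ** 2
    # Features whose contribution does not exceed the penalty get weight zero.
    gain = np.maximum(delta - penalty, 0.0)
    norm = np.linalg.norm(gain)
    if norm == 0.0:
        return np.zeros_like(gain)
    return gain / norm


# Simulated example: 40 samples per group, 100 features, first 3 shifted.
rng = np.random.default_rng(0)
x = rng.normal(size=(40, 100))
y = rng.normal(size=(40, 100))
y[:, :3] += 2.0

weights = sparse_linear_mmd_weights(x, y, penalty=1.0)
selected = np.flatnonzero(weights)  # indices of features with non-zero weight
print("selected features:", selected.tolist())
```

The non-zero weights play the role of the selected biomarkers: the same quantity that measures the group difference also indicates which features contribute to it. A linear kernel only detects mean shifts; richer kernels would require iterative optimization rather than this closed form.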

Key Results and Conclusion

The evaluation of SpInOpt-MMD highlighted its effectiveness, particularly in challenging scenarios:

Superior Performance: SpInOpt-MMD identified relevant features effectively even at small sample sizes and, in several experiments, outperformed alternative approaches, including the post-hoc attribution method SHAP and univariate association analysis.
Unified Analysis: The method provides a powerful, single-step solution for the two-sample testing and biomarker identification problem in high-dimensional settings, which is crucial for multi-omics research.
Open-Access Resource: The code and links to the public data are made available to promote reproducibility and widespread adoption of the new method.