Machine learning-guided deconvolution of plasma protein levels

biomarker discovery
confounding
genetic variation
machine learning
proteomics
  • Objective: Used machine learning (ML) to systematically identify and quantify the contribution of over 1,800 characteristics (health, genetic, technical) to the variation in approximately 3,000 plasma protein levels across 43,240 UK Biobank individuals.
  • Key Result: A median of 20 factors explained an average of 19.4% of protein variance. Modifiable characteristics (median: 10.0%) were found to explain significantly more variation than genetic factors (median: 3.9%).
  • Implication: The study provides a crucial resource (knowledge graph, R package) and framework for understanding protein origins, clustering proteins by their drivers (e.g., disease, pre-analytical factors), and guiding the identification of biologically relevant biomarkers and drug target engagement markers.
Published: 23 January 2026

PubMed: 41068475
DOI: 10.1038/s44320-025-00158-6
Overview generated by: Gemini 2.5 Flash, 28/11/2025

Key Findings: Deconvoluting the Sources of Plasma Protein Variation

This study used a machine learning (ML) approach to systematically identify and quantify the factors that drive variation in thousands of plasma protein levels. The aim was to address the limited understanding of protein origins, which hampers the translation of candidate biomarkers into practice.

Primary Determinants of Protein Levels

The ML model, which assessed over 1,800 participant and sample characteristics, found that a median of 20 factors (ranging from 1 to 37) jointly explained an average of 19.4% (up to 100.0%) of the variance in approximately 3,000 protein targets.
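To make "variance jointly explained by a set of factors" concrete, the following is a minimal sketch using sequential least squares on toy data. It is not the study's ML pipeline, which this overview does not describe in detail; the function name, the factor names, and all numbers are illustrative assumptions.

```python
from statistics import fmean


def center(values):
    """Subtract the mean so that dot products behave like (co)variances."""
    mean = fmean(values)
    return [v - mean for v in values]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def incremental_variance_explained(protein, factors):
    """Share of protein variance added by each factor, entered in dict order.

    Each factor is orthogonalised against the factors entered before it
    (Gram-Schmidt), so the shares sum to the joint R^2 of a linear model.
    """
    y = center(protein)
    total = dot(y, y)
    if total == 0:
        raise ValueError("protein level has no variance to explain")
    basis, shares = [], {}
    for name, values in factors.items():
        q = center(values)
        for b in basis:
            coef = dot(q, b) / dot(b, b)
            q = [qi - coef * bi for qi, bi in zip(q, b)]
        norm = dot(q, q)
        if norm < 1e-12:  # factor adds nothing beyond earlier factors
            shares[name] = 0.0
            continue
        shares[name] = dot(q, y) ** 2 / (norm * total)
        basis.append(q)
    return shares


# Toy data (made up): the protein is an exact linear mix of two factors.
toy_factors = {
    "factor_a": [0, 1, 0, 1, 0, 1],
    "factor_b": [0, 0, 1, 1, 2, 2],
}
toy_protein = [2 * a + 3 * b for a, b in zip(*toy_factors.values())]
toy_shares = incremental_variance_explained(toy_protein, toy_factors)
joint_r2 = sum(toy_shares.values())
```

When factors are correlated, the share credited to each one depends on the order in which they are entered, which is one reason per-factor attributions should be read with care.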

Crucially, modifiable characteristics (e.g., health metrics, disease status, lifestyle) explained significantly more variance (median: 10.0%) than genetic variation (median: 3.9%). This suggests that dynamic, non-genetic factors are the primary drivers of plasma protein differences between individuals.
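The comparison of medians above can be illustrated with a short sketch that sums each protein's variance shares by factor category and then takes the median across proteins. The category mapping, protein labels, and values below are invented for illustration and are not results from the study.

```python
from statistics import median

# Hypothetical mapping from factor to category (an assumption for this sketch).
FACTOR_CATEGORY = {
    "bmi": "modifiable",
    "smoking": "modifiable",
    "cis_pqtl": "genetic",
}


def median_share_by_category(per_protein_shares, factor_category):
    """Median, across proteins, of the variance share summed per category."""
    categories = set(factor_category.values())
    per_category = {cat: [] for cat in categories}
    for shares in per_protein_shares.values():
        summed = dict.fromkeys(categories, 0.0)
        for factor, share in shares.items():
            summed[factor_category[factor]] += share
        for cat in categories:
            per_category[cat].append(summed[cat])
    return {cat: median(vals) for cat, vals in per_category.items()}


# Made-up variance shares for three placeholder proteins.
toy_shares_by_protein = {
    "PROT_A": {"bmi": 0.10, "cis_pqtl": 0.02},
    "PROT_B": {"bmi": 0.05, "smoking": 0.05, "cis_pqtl": 0.30},
    "PROT_C": {"bmi": 0.20},
}
toy_medians = median_share_by_category(toy_shares_by_protein, FACTOR_CATEGORY)
```

Note that a protein with a strong genetic signal (PROT_B here) barely moves the median, so a low median for genetics does not rule out large genetic effects on individual proteins.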

Segregation and Clustering

Proteins segregated into distinct clusters based on their shared explanatory factors. These clusters revealed proteins primarily driven by:

  • Human Health and Disease: Indicators of health status and disease.
  • Pre-analytical Variation: Technical and sample-handling measures, such as accidental activation of platelets.
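One simple way to group proteins by shared explanatory factors is to link any two proteins whose factor sets overlap strongly and then read off the connected groups. The sketch below uses Jaccard similarity with a hypothetical threshold; the clustering method actually used in the study is not stated in this overview, and the protein and factor names are placeholders.

```python
def jaccard(a, b):
    """Overlap between two factor sets, from 0 (disjoint) to 1 (identical)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster_by_shared_factors(protein_factors, threshold=0.5):
    """Single-linkage grouping: proteins are linked when similarity >= threshold."""
    names = list(protein_factors)
    parent = {name: name for name in names}

    def find(name):
        # Union-find root lookup with path halving.
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    for i, first in enumerate(names):
        for second in names[i + 1:]:
            if jaccard(protein_factors[first], protein_factors[second]) >= threshold:
                parent[find(first)] = find(second)

    groups = {}
    for name in names:
        groups.setdefault(find(name), []).append(name)
    return sorted(sorted(group) for group in groups.values())


# Placeholder proteins: two driven by health factors, two by sample handling.
toy_protein_factors = {
    "PROT_A": {"diabetes", "bmi"},
    "PROT_B": {"diabetes", "bmi", "age"},
    "PROT_C": {"time_to_processing", "platelet_count"},
    "PROT_D": {"platelet_count", "time_to_processing"},
}
toy_clusters = cluster_by_shared_factors(toy_protein_factors)
```

A cluster dominated by sample-handling factors, like the second group here, would flag proteins whose levels may reflect pre-analytical variation rather than biology.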

Ancestry, Sex, and Robustness

The overall explanatory factors were largely consistent across different sexes and ancestral groups. However, the analysis identified specific proteins where the underlying explanatory factors differed by:

  • Sex: 1,374 proteins.
  • Ancestry: 74 proteins.
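As a rough illustration of how a stratified comparison might flag such proteins, the sketch below compares per-factor variance shares between two strata and reports factors whose shares differ by more than a tolerance. The criterion the study used to call a difference is not given in this overview, so the tolerance, factor names, and values are all assumptions.

```python
def factors_differing_between_strata(shares_a, shares_b, tolerance=0.05):
    """Factors whose variance share differs by more than `tolerance`.

    Returns {factor: share_a - share_b}; a factor missing from one stratum
    is treated as explaining no variance there.
    """
    differing = {}
    for factor in sorted(set(shares_a) | set(shares_b)):
        gap = shares_a.get(factor, 0.0) - shares_b.get(factor, 0.0)
        if abs(gap) > tolerance:
            differing[factor] = gap
    return differing


# Made-up shares for one placeholder protein, modelled separately by sex.
toy_female = {"menopause_status": 0.12, "bmi": 0.08}
toy_male = {"bmi": 0.09}
toy_difference = factors_differing_between_strata(toy_female, toy_male)
```

A real analysis would also need to account for sampling uncertainty, for example through resampling, before declaring that a factor is stratum-specific.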

Resource and Application

The study establishes a valuable resource to guide biomarker and drug target discovery, including:

  1. Knowledge Graph: An integrated knowledge graph linking the identified explanatory factors with genetic studies and drug characteristics, intended to guide the identification of drug target engagement markers.
  2. Biomarker Identification: Demonstrated utility by identifying disease-specific biomarkers, such as matrix metalloproteinase 12 (MMP12) for abdominal aortic aneurysm.
  3. Framework: Developed a widely applicable R package and an interactive web portal for researchers to explore all results and integrate the findings into ongoing studies.
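The idea of a knowledge graph linking proteins, explanatory factors, and drugs can be sketched as a set of typed edges. This is a toy structure, not the schema of the study's resource: the class, the relation names, the drug, and every node other than MMP12 and abdominal aortic aneurysm are hypothetical.

```python
from collections import defaultdict


class KnowledgeGraph:
    """Minimal directed graph with typed edges: (source, relation) -> targets."""

    def __init__(self):
        self._edges = defaultdict(set)

    def add(self, source, relation, target):
        self._edges[(source, relation)].add(target)

    def neighbours(self, source, relation):
        # .get avoids creating empty entries for unknown keys.
        return sorted(self._edges.get((source, relation), ()))


kg = KnowledgeGraph()
# The only link taken from the text above:
kg.add("MMP12", "candidate_biomarker_for", "abdominal aortic aneurysm")
# Hypothetical links, for illustration only:
kg.add("hypothetical_drug_X", "targets", "PROT_A")
kg.add("PROT_A", "explained_by", "bmi")
kg.add("PROT_A", "explained_by", "hypothetical_cis_pQTL")


def factors_to_account_for(graph, drug):
    """For each protein a drug targets, list the factors that explain its level."""
    return {
        protein: graph.neighbours(protein, "explained_by")
        for protein in graph.neighbours(drug, "targets")
    }


toy_lookup = factors_to_account_for(kg, "hypothetical_drug_X")
```

A query of this kind shows how such a graph could help judge whether a change in a target protein reflects drug engagement or one of its other known drivers.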

Methods

  • Cohort: 43,240 participants from the UK Biobank.
  • Data: Approximately 3,000 plasma proteins were measured, alongside >1,800 participant and sample characteristics.
  • Analysis: Machine learning was used to identify the factors associated with each protein and to quantify the variance they explain. The resulting explanatory factors were largely consistent across sexes and ancestral groups.