Proteomic signatures improve risk prediction for common and rare diseases

ancestral differences
biomarkers
clinical informatics
common disease
confounding
machine learning
proteomics
rare disease
risk prediction
  • Objective: This large-scale study tested whether plasma proteomic signatures improve prediction of 10-year disease incidence for 218 common and rare diseases in the UK Biobank Pharma Proteomics Project (UKB-PPP) cohort.
  • Result: Sparse proteomic models (using 5–20 proteins) significantly improved the C-index over models based on basic clinical information for 67 diseases, including hard-to-diagnose conditions like multiple myeloma and motor neuron disease.
  • Robustness/Confounding: Residual confounding is a major issue: over 80% of initial associations attenuated after adjusting for demographic and clinical factors. Understanding the determinants of protein levels is therefore necessary to ensure that findings are biologically relevant rather than spurious.
Published

23 January 2026

PubMed: 39039249 DOI: 10.1038/s41591-024-03142-z Overview generated by: Gemini 2.5 Flash, 28/11/2025

Key Findings: Proteomics Enhances Disease Risk Prediction

This large-scale study, utilizing the United Kingdom Biobank Pharma Proteomics Project (UKB-PPP), demonstrates that incorporating plasma proteomic signatures significantly improves the prediction of the 10-year incidence risk for a wide range of common and rare diseases compared to models based solely on clinical information or polygenic risk scores (PRS).

Superior Predictive Performance

The core finding is that sparse prediction models using as few as 5 to 20 plasma proteins were superior to models built on basic clinical information alone for 67 pathologically diverse diseases. The median improvement in predictive performance (delta C-index) was 0.07, with gains as high as 0.31 for certain conditions.
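To make the delta C-index concrete, the following is a minimal sketch of Harrell's concordance index computed for two sets of risk scores on the same follow-up data. The follow-up times, event flags, and risk scores are made-up toy values, not data from the study, and the function names are illustrative. For simplicity the sketch skips pairs with tied follow-up times.

```python
from itertools import combinations


def concordance_index(times, events, risks):
    """Harrell's C-index: share of comparable pairs in which the participant
    with the earlier observed event also has the higher predicted risk."""
    concordant = 0.0
    comparable = 0
    for i, j in combinations(range(len(times)), 2):
        # Order the pair so that `a` has the shorter follow-up time.
        a, b = (i, j) if times[i] < times[j] else (j, i)
        # A pair is comparable only if the shorter time ends in an event.
        if times[a] == times[b] or not events[a]:
            continue
        comparable += 1
        if risks[a] > risks[b]:
            concordant += 1.0
        elif risks[a] == risks[b]:
            concordant += 0.5  # tied predictions count as half
    return concordant / comparable


# Toy follow-up data (years) with event indicator (1 = diagnosed, 0 = censored).
times = [2, 4, 5, 7, 9, 10]
events = [1, 1, 0, 1, 0, 0]

# Hypothetical risk scores from a clinical model and a clinical + protein model.
clinical_risk = [0.3, 0.6, 0.5, 0.2, 0.4, 0.1]
protein_risk = [0.9, 0.7, 0.5, 0.6, 0.2, 0.1]

c_clinical = concordance_index(times, events, clinical_risk)
c_protein = concordance_index(times, events, protein_risk)
delta_c = c_protein - c_clinical
print(f"clinical={c_clinical:.3f} protein={c_protein:.3f} delta={delta_c:.3f}")
```

A C-index of 0.5 corresponds to random ranking and 1.0 to perfect ranking, so a delta of 0.07 means the protein model orders roughly seven more comparable pairs in every hundred correctly.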

Outperformance over Clinical Assays

Crucially, the protein models also outperformed models that combined basic clinical information with data from 37 commonly used clinical assays for 52 diseases. This indicates that the plasma proteome captures unique biological signals not fully reflected in standard clinical blood tests. Diseases where protein models showed significant improvement include:

  • Hematological Cancers: Multiple myeloma and non-Hodgkin lymphoma.
  • Neurodegenerative/Neurological Disorders: Motor neuron disease.
  • Cardiopulmonary Diseases: Pulmonary fibrosis and dilated cardiomyopathy.

Study Design and Methods

The study integrated deep proteomic data with extensive clinical follow-up data from the UK Biobank cohort.

Data and Cohort

  • Cohort: 41,931 individuals from the UKB-PPP.
  • Proteome: Measurements for approximately 3,000 plasma proteins using Olink Proximity Extension Assays.
  • Target Outcomes: 10-year incidence for 218 common and rare diseases.

Statistical Modeling

  • Model Type: Sparse prediction models (using elastic net regression) were developed to predict disease incidence.
  • Baseline Model: All models were compared against a clinical model incorporating basic demographic information (age, sex, BMI, smoking status) and a comorbidity score.
  • Three Tiers of Models:
    1. Clinical model only.
    2. Clinical model + 5 to 20 selected proteins (sparse proteomic signature).
    3. Clinical model + data from 37 clinical assays.
  • Comparison: The performance of these models was quantified using the concordance index (C-index), which generalizes the area under the ROC curve (AUC) to time-to-event data.
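The way an elastic net penalty yields a sparse signature can be sketched with a small coordinate-descent solver. This is an illustration only: it fits an ordinary linear model rather than a time-to-event model, the penalty settings `lam` and `alpha` are arbitrary, and the three "protein" columns are made-up toy data, so it should not be read as the study's pipeline.

```python
def soft_threshold(value, threshold):
    """Shrink value toward zero; return exactly 0.0 inside the threshold band."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def elastic_net(X, y, lam, alpha, n_passes=100):
    """Coordinate descent for the objective
    (1/2n)*||y - Xb||^2 + lam*(alpha*||b||_1 + (1 - alpha)/2*||b||^2).
    X is a list of rows; columns are assumed centered (no intercept is fitted)."""
    n, p = len(X), len(X[0])
    coefs = [0.0] * p
    residual = list(y)  # equals y - Xb while all coefficients are zero
    for _ in range(n_passes):
        for j in range(p):
            column = [row[j] for row in X]
            # Covariance of feature j with the residual, adding back its own fit.
            rho = sum(x * (r + x * coefs[j]) for x, r in zip(column, residual)) / n
            scale = sum(x * x for x in column) / n
            new_coef = soft_threshold(rho, lam * alpha) / (scale + lam * (1 - alpha))
            # Keep the residual in sync with the updated coefficient.
            residual = [r - x * (new_coef - coefs[j]) for x, r in zip(column, residual)]
            coefs[j] = new_coef
    return coefs


# Toy design: three centered, mutually orthogonal "protein" columns.
X = [[1, 1, 1], [1, -1, -1], [-1, 1, -1], [-1, -1, 1]]
# Outcome depends strongly on column 0 and only weakly on column 2.
y = [2.1, 1.9, -2.1, -1.9]

coefs = elastic_net(X, y, lam=0.5, alpha=0.5)
selected = [j for j, b in enumerate(coefs) if b != 0.0]
print(coefs, selected)
```

The L1 part of the penalty sets weak predictors to exactly zero, which is what keeps a signature down to a handful of proteins, while the L2 part shrinks the coefficients that remain.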

Results: Disease-Specific Insights

Multiple Myeloma

For multiple myeloma, a plasma cell cancer, a proteomic signature improved the C-index by 0.28 over the clinical model. The top predictor was Immunoglobulin free light chain kappa (IGFKL), a known marker of plasma cell activity, demonstrating the model’s ability to highlight clinically relevant biomarkers.

Motor Neuron Disease (MND)

For MND, a fatal neurological disorder, the model’s C-index improved by 0.17. The primary predictors were TDP-43 (TAR DNA-binding protein 43) and proteins associated with neuroinflammation, providing evidence that distinct biological pathways are active years before diagnosis.

Robustness, Confounding, and Determinants

The authors stress the importance of understanding the determinants of protein variation to avoid spurious associations and type I errors (false positives).

Impact of Residual Confounding

Prioritization analyses revealed that after regressing out variance explained by demographic and other clinical factors, over 80% of non-null associations attenuated, strongly suggesting the presence of residual confounding in protein-outcome relationships. However, robust associations highlighted well-established clinical markers, such as Prostate-Specific Antigen (PSA), demonstrating the method’s ability to retain true biological signals when accounting for confounders.
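The idea of regressing out covariates before testing an association can be sketched as follows. The example assumes a single confounder (age) and a continuous outcome score, and all numbers are constructed toy values in which the protein and the outcome are linked only through age. The helper names are illustrative and not taken from the study.

```python
from math import sqrt


def mean(values):
    return sum(values) / len(values)


def residualize(values, covariate):
    """Residuals from a least-squares regression of values on one covariate."""
    v_bar, c_bar = mean(values), mean(covariate)
    slope = (sum((c - c_bar) * (v - v_bar) for c, v in zip(covariate, values))
             / sum((c - c_bar) ** 2 for c in covariate))
    return [(v - v_bar) - slope * (c - c_bar) for c, v in zip(covariate, values)]


def pearson(a, b):
    """Pearson correlation between two equal-length sequences."""
    a_bar, b_bar = mean(a), mean(b)
    cross = sum((x - a_bar) * (y - b_bar) for x, y in zip(a, b))
    return cross / sqrt(sum((x - a_bar) ** 2 for x in a)
                        * sum((y - b_bar) ** 2 for y in b))


# Toy cohort: age drives both the protein level and the outcome score.
age = [40, 45, 50, 55, 60, 65, 70, 75]
protein_noise = [1, -1, -1, 1, 1, -1, -1, 1]
outcome_noise = [1, 1, -1, -1, -1, -1, 1, 1]
protein = [0.1 * a + e for a, e in zip(age, protein_noise)]
outcome = [0.05 * a + e for a, e in zip(age, outcome_noise)]

raw_r = pearson(protein, outcome)
adjusted_r = pearson(residualize(protein, age), residualize(outcome, age))
print(f"raw r = {raw_r:.3f}, age-adjusted r = {adjusted_r:.3f}")
```

In this constructed case the raw correlation disappears once age is removed from both variables, which is the pattern expected for an association produced by a shared determinant. An association that survives the adjustment is a stronger candidate for a direct biological signal.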

Ancestral and Sex Differences

  • Ancestry: The variance in protein levels explained by genetic factors (protein quantitative trait loci, pQTLs) differed significantly between ancestral groups. The differences were driven primarily by cis- and trans-pQTL effects, for example through differences in minor allele frequency (MAF), which suggests that the same variant can contribute differently across ancestral groups. They were not attributable solely to differences in sample size (N).
  • Sex: Although few sex-differential genetic effects were found (consistent with findings from large-scale GWASs), approximately one-third of the tested proteins exhibited differences in levels due to varying participant characteristics between males and females (e.g., medication use).
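One way that a difference in MAF can change the variance a pQTL explains, even when the per-allele effect is identical, follows from the standard additive-model expression 2 × MAF × (1 − MAF) × β². The sketch below assumes Hardy-Weinberg equilibrium and a protein level scaled to unit variance. The allele frequencies and effect size are hypothetical and not taken from the study.

```python
def variance_explained(maf, beta):
    """Fraction of trait variance explained by one biallelic variant with
    additive per-allele effect beta (in SD units), assuming Hardy-Weinberg
    equilibrium and a trait standardized to unit variance."""
    return 2 * maf * (1 - maf) * beta ** 2


# Same hypothetical per-allele effect, different allele frequencies in two groups.
common = variance_explained(maf=0.40, beta=0.3)
rare = variance_explained(maf=0.05, beta=0.3)
print(f"MAF 0.40: {common:.4f}   MAF 0.05: {rare:.4f}")
```

With these illustrative inputs the variant explains about five times more variance in the group where it is common, without any difference in its biological effect.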

Comparison to Polygenic Risk Scores (PRS) and MR

The study found that, for most diseases, the proteomic signature improved prediction over the clinical model more than the corresponding PRS did. However, associations between outcomes and genetically imputed plasma protein levels (as used, for example, in Mendelian randomization (MR) analyses) showed very little consistency with associations based on directly measured protein levels, even after adjusting for protein factors. This highlights a limitation of using imputed protein levels for causal inference in this context.

Conclusions and Future Directions

The study strongly supports the use of high-throughput plasma proteomics as an objective, non-invasive method to improve disease risk prediction across a diverse array of conditions. The sparse signatures identified represent potential targets for early intervention and monitoring. The findings underscore the critical importance of characterizing and accounting for protein determinants to distinguish biologically relevant findings from false positives and to ensure accurate risk stratification in clinical applications. The authors suggest that integrating these proteomic biomarkers into clinical practice could facilitate earlier diagnosis and personalized screening programs.