Genetic prediction of complex traits with polygenic scores: A statistical review

GWAS
genetic prediction
polygenic risk scores
polygenic scores
review
statistical genetics
  • This statistical review comprehensively analyzes 46 methods for Polygenic Score (PGS) construction, unifying most of them under a multiple linear regression framework to clarify their assumptions regarding effect size distribution and Linkage Disequilibrium (LD).
  • The review concludes that optimal PGS performance (accuracy) is highly dependent on the trait’s genetic architecture and is significantly improved by using Bayesian/Regularization methods (e.g., LDpred, PRS-CS) that explicitly model LD and incorporate informed prior distributions for SNP effects.
  • A critical challenge highlighted is the significant loss of transferability across ancestral populations, underscoring the need for more diverse training data and methods that better address ancestral heterogeneity and incorporate non-additive and Gene-by-Environment (GxE) effects.
Published

23 January 2026

PubMed: 34243982
DOI: 10.1016/j.tig.2021.06.004
Overview generated by: Gemini 2.5 Flash, 26/11/2025

Key Findings

This statistical review surveys Polygenic Score (PGS) methods, which predict complex traits and disease risk from genetic data. The authors review 46 methods for PGS construction and establish a unifying multiple linear regression framework that connects and categorizes most of them. The core conclusion is that the best-performing PGS method depends strongly on the genetic architecture of the target trait (e.g., polygenicity, effect size distribution) and on the nature of the available training and target data. The review provides both the statistical underpinnings for method developers and practical guidance (including a decision tree) for analysts performing PGS analysis in clinical and research settings.

Categorization and Statistical Framework

The review structures the diverse landscape of PGS methods into a coherent statistical framework, primarily centered on how they estimate the SNP effect sizes, which serve as the weights for the Polygenic Score.

Core PGS Calculation

In its simplest form, the PGS for an individual (\(i\)) is a weighted sum of their genotypes across \(M\) single nucleotide polymorphisms (SNPs): \[\text{PGS}_i = \sum_{j=1}^{M} G_{ij} \hat{\beta}_j\] where \(G_{ij}\) is the genotype of individual \(i\) at SNP \(j\), and \(\hat{\beta}_j\) is the estimated genetic effect size (the weight) for that SNP.
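
As a minimal sketch of this weighted sum, the NumPy snippet below scores three hypothetical individuals at four SNPs. The function name `polygenic_score`, the toy genotypes, and the weights are all illustrative assumptions, not values from the review; genotypes are assumed to be coded as 0/1/2 counts of the effect allele.

```python
import numpy as np

def polygenic_score(genotypes, weights):
    """Return PGS_i = sum_j G_ij * beta_hat_j for every individual i."""
    G = np.asarray(genotypes, dtype=float)
    beta_hat = np.asarray(weights, dtype=float)
    if G.ndim != 2 or G.shape[1] != beta_hat.shape[0]:
        raise ValueError("genotypes must be (N, M) and weights must be (M,)")
    # One matrix-vector product computes the weighted sum for all individuals.
    return G @ beta_hat

# Toy data (assumed): 3 individuals x 4 SNPs, effect-allele counts 0/1/2.
G = np.array([[0, 1, 2, 0],
              [1, 1, 0, 2],
              [2, 0, 1, 1]])
beta_hat = np.array([0.10, -0.05, 0.20, 0.00])

scores = polygenic_score(G, beta_hat)
print(scores)  # one score per individual
```

In practice the weights and the genotypes must refer to the same effect allele at each SNP; the sketch assumes that alignment has already been done.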

Unifying Multiple Linear Regression Framework

The authors classify most modern PGS methods as variants of a general multiple linear regression model: \[\mathbf{Y} = \mathbf{G}\mathbf{\beta} + \mathbf{\epsilon}\] where \(\mathbf{Y}\) is the phenotype vector, \(\mathbf{G}\) is the genotype matrix, \(\mathbf{\beta}\) is the vector of true genetic effects, and \(\mathbf{\epsilon}\) is the error term. Different PGS methods (e.g., LDpred, SBayesR) are distinguished by their assumptions about the effect size vector, \(\mathbf{\beta}\), and by how they account for the correlation structure of the genotypes (Linkage Disequilibrium or LD).
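
To illustrate why the joint model matters under LD, the simulation below is a rough sketch with assumed settings (three standardized SNPs, made-up effect sizes, in-sample LD). It compares marginal, GWAS-style estimates with joint estimates obtained by solving \(\mathbf{R}\mathbf{\beta} = \hat{\mathbf{b}}\), where \(\mathbf{R}\) is the LD correlation matrix and \(\hat{\mathbf{b}}\) the vector of marginal estimates. This is plain least squares, not the estimator of any specific method in the review.

```python
import numpy as np

rng = np.random.default_rng(0)
N, M = 5000, 3

# Simulate three SNPs; SNP 1 is in LD (correlation ~0.8) with SNP 0.
z = rng.standard_normal((N, M))
X = z.copy()
X[:, 1] = 0.8 * z[:, 0] + 0.6 * z[:, 1]
X = (X - X.mean(axis=0)) / X.std(axis=0)  # standardize each column

beta_true = np.array([0.3, 0.0, 0.1])  # assumed effects; SNP 1 is not causal
y = X @ beta_true + rng.standard_normal(N)
y = y - y.mean()

# Marginal estimates: each SNP regressed on the phenotype separately.
beta_marginal = X.T @ y / N

# Joint estimates: undo the LD-induced mixing by solving R * beta = b.
R = X.T @ X / N
beta_joint = np.linalg.solve(R, beta_marginal)

print("marginal:", np.round(beta_marginal, 3))
print("joint:   ", np.round(beta_joint, 3))
```

The non-causal SNP 1 picks up a sizeable marginal effect through its correlation with SNP 0, while its joint estimate is close to zero. Summary-statistics methods typically replace the in-sample \(\mathbf{R}\) with an external reference panel and add shrinkage or a prior, which this sketch omits.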

Classification of PGS Methods

The review categorizes the 46 analyzed methods based on their underlying statistical approach:

  1. Classical Methods: These include P-value Thresholding and Clumping (P+T), which reduces SNPs in LD to an approximately independent subset (clumping) and retains only those whose GWAS p-values fall below a chosen threshold. This remains a simple and surprisingly effective baseline method.
  2. Bayesian and Regularization Methods: These methods explicitly model the genetic architecture by incorporating prior distributions on the SNP effect sizes (\(\mathbf{\beta}\)) and/or regularizing the estimates to account for noise and LD. Examples include:
    • LDpred/LDpred2: Assumes a fraction of SNPs are causal and uses an LD reference panel to shrink effect sizes.
    • BayesC/BayesR: Assumes effect sizes come from a mixture of normal distributions, allowing for a few large effects and many small ones.
    • Lasso/Ridge Regression: Uses penalization to optimize effect size selection and estimation.
  3. Summary Statistics-Based Methods: Methods that operate entirely on GWAS summary statistics and an external LD reference panel (e.g., LDpred, SBayesR, PRS-CS). They are computationally efficient and widely used, largely because summary statistics are shared far more readily than individual-level data.
  4. Meta-Dimensional Methods: These include methods that incorporate information beyond SNP effects, such as functional annotations or gene expression data (e.g., MetaXcan).
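
The P+T baseline from the first category can be sketched as a greedy procedure. The function `clump_and_threshold`, its default thresholds, and the toy inputs below are hypothetical choices for illustration; real clumping tools also restrict LD comparisons to a physical distance window, which is omitted here.

```python
import numpy as np

def clump_and_threshold(beta_hat, pvalues, ld_r2, p_threshold=5e-8, r2_threshold=0.1):
    """Greedy P+T sketch: keep the most significant SNPs, drop their LD partners.

    ld_r2 is an (M, M) matrix of squared correlations between SNPs.
    Returns the PGS weights (zero for dropped SNPs) and a boolean mask of kept SNPs.
    """
    beta_hat = np.asarray(beta_hat, dtype=float)
    pvalues = np.asarray(pvalues, dtype=float)
    ld_r2 = np.asarray(ld_r2, dtype=float)

    available = pvalues <= p_threshold       # thresholding step
    kept = np.zeros(beta_hat.shape[0], dtype=bool)
    for j in np.argsort(pvalues):            # most significant SNP first
        if not available[j]:
            continue
        kept[j] = True                       # j becomes an index SNP
        # Clumping step: remove every SNP in LD with the index SNP.
        available &= ld_r2[j] <= r2_threshold
    return np.where(kept, beta_hat, 0.0), kept

# Toy inputs (assumed): SNPs 0 and 1 are in strong LD, SNP 2 is not significant.
beta_hat = np.array([0.20, 0.18, 0.01, -0.10])
pvalues = np.array([1e-10, 1e-9, 0.5, 1e-8])
ld_r2 = np.eye(4)
ld_r2[0, 1] = ld_r2[1, 0] = 0.9

weights, kept = clump_and_threshold(beta_hat, pvalues, ld_r2, p_threshold=1e-6)
print(kept)
print(weights)
```

Here SNP 1 is dropped because it is in LD with the more significant SNP 0, and SNP 2 is dropped by the p-value threshold. Both thresholds are usually tuned in a validation sample.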

Practical Considerations and Performance

Factors Affecting PGS Performance

The prediction accuracy of a PGS (typically measured by the coefficient of determination, \(R^2\)) is highly sensitive to several factors:

  1. Genetic Architecture: Methods that accurately model the true genetic architecture (e.g., high polygenicity vs. oligogenicity) perform best. Bayesian methods often excel because they flexibly model the prior distribution of effect sizes.
  2. Training Sample Size: Performance is highly dependent on the size of the GWAS used to estimate \(\hat{\beta}\). Larger sample sizes generally lead to more accurate effect size estimates and thus better PGS performance.
  3. Ancestral Divergence: Prediction accuracy significantly drops when the target population ancestry differs substantially from the training population (e.g., GWAS conducted primarily in European populations but applied to African populations). This highlights the critical issue of transferability and health equity.
  4. Linkage Disequilibrium (LD) Modeling: Explicitly accounting for LD, such as in LDpred and PRS-CS, is crucial for improving prediction accuracy compared to simple methods like P+T.
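
As a rough illustration of the \(R^2\) metric and of the training sample size effect, the simulation below evaluates a PGS in a held-out target sample for two training sizes. Every setting (200 independent SNPs, heritability 0.5, the sample sizes, marginal estimates as weights) is an assumption made for this sketch and is not taken from the review.

```python
import numpy as np

def pgs_r2(pgs, phenotype):
    """Squared Pearson correlation between a PGS and a quantitative phenotype."""
    r = np.corrcoef(np.asarray(pgs, dtype=float),
                    np.asarray(phenotype, dtype=float))[0, 1]
    return r ** 2

def simulate(n, beta, rng, h2=0.5):
    """Simulate n individuals with independent standardized SNPs (no LD)."""
    X = rng.standard_normal((n, beta.shape[0]))
    y = X @ beta + rng.standard_normal(n) * np.sqrt(1 - h2)
    return X, y

rng = np.random.default_rng(1)
M, h2 = 200, 0.5
beta = rng.standard_normal(M) * np.sqrt(h2 / M)  # genetic variance roughly h2

# Held-out target sample, never used to estimate the weights.
X_target, y_target = simulate(2000, beta, rng)

r2_by_n = {}
for n_train in (200, 10000):
    X_train, y_train = simulate(n_train, beta, rng)
    # Marginal (GWAS-style) effect estimates serve as PGS weights.
    beta_hat = X_train.T @ (y_train - y_train.mean()) / n_train
    r2_by_n[n_train] = pgs_r2(X_target @ beta_hat, y_target)

for n_train, r2 in r2_by_n.items():
    print(f"N_train={n_train}: R^2={r2:.3f}")
```

With noisier weights from the small training sample, the target-sample \(R^2\) falls well below the simulated heritability; the larger training sample closes much of that gap. Binary traits are usually evaluated with other metrics, which this sketch does not cover.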

Decision Tree for PGS Analysis

The review provides a practical decision tree to help researchers select the appropriate PGS method based on the available data and research question:

Key branches include:

  • Input Data: Is individual-level data or only GWAS summary data available?
  • Trait Complexity: Analyzing a single trait or multiple correlated traits?
  • Modeling Approach: Choosing between model-based (e.g., linear mixed models) and algorithm-based (e.g., machine learning) methods.

Challenges and Future Directions

The authors identify several major challenges that need to be addressed to realize the full clinical potential of PGS:

  1. Addressing Heterogeneity (Ancestry): Developing methods that maintain high predictive accuracy across diverse ancestral populations and that can combine multi-ancestry data (e.g., through multi-ancestry meta-analysis) to produce PGS that perform well in every population.
  2. Non-Additive and GxE Effects: Incorporating non-additive genetic effects (dominance, epistasis) and Gene-by-Environment (GxE) interactions into the PGS model. Current models are largely additive and may therefore miss the variance attributable to these effects.
  3. Causal Variants and Fine-Mapping: Moving beyond marker SNPs to accurately identify and weight the true causal variants to improve biological relevance and prediction.
  4. Clinical Implementation: Establishing standardized reporting guidelines, validating PGS in independent clinical cohorts, and addressing the ethical concerns related to using PGS in personalized medicine.