Using GWAS summary data to impute traits for genotyped individuals

GWAS
imputation
summary statistics
nonlinear associations
SNP interactions
machine learning
  • Novel nonparametric LS-imputation method recovers genetic components of traits from GWAS summary statistics and individual genotypes, enabling nonlinear association analyses impossible with summary data alone
  • Perfectly recovers trait values when test genotypes match training genotypes (correlation >0.999), capturing nonlinear SNP-trait information despite using only linear marginal associations
  • Outperforms PRS-CS for association analyses in UK Biobank HDL data: successfully detects non-additive genetic effects, SNP-SNP interactions, and trains nonlinear prediction models (random forests) while PRS-CS shows severe false positive inflation
Published

23 January 2026

PubMed: 37181332
DOI: 10.1016/j.xhgg.2023.100197
Overview generated by: Claude Sonnet 4.5, 26/11/2025

Key Findings

This study introduces LS-imputation, a nonparametric method that uses GWAS summary statistics combined with individual-level genotypes to impute trait values, enabling nonlinear SNP-trait association analyses and machine learning applications that are impossible with summary statistics alone.

Main Discoveries

  1. Novel imputation approach: First method to recover genetic components of traits from GWAS summary data for nonlinear association analysis

  2. Perfect recovery property: When test genotypes match training genotypes (X = X*), the method perfectly recovers (centered) trait values, capturing nonlinear SNP-trait information despite using only linear marginal associations

  3. Superior performance for association analysis: LS-imputation outperforms state-of-the-art PRS method (PRS-CS) for subsequent association analyses under non-additive models and SNP-SNP interaction detection

  4. Enables new analyses: Makes possible three applications currently impossible with GWAS summary data: non-additive genetic models, SNP-SNP interaction detection, and nonlinear prediction models

Study Design

Core Problem

GWAS summary statistics measure only linear marginal SNP-trait associations, limiting their use to linear analyses. The method addresses the question: how can summary data be used for nonlinear SNP-trait analyses?

Genetic Model

Assumes unspecified functional form: \[y = E(y|x) + \varepsilon = g(x) + \varepsilon\]

where:
  • \(g(x)\) is the unknown genetic component (possibly nonlinear)
  • \(\varepsilon\) captures environmental effects and noise
  • no parametric assumptions are placed on \(g(\cdot)\)

Method Overview

Input:
  • GWAS summary data: \(\{(\hat{\beta}_j^*, s_j^*): j=1,\ldots,p\}\) from training data \((X^*, Y^*)\)
  • Test genotype matrix: \(X\) (\(n_2 \times p\))

Output: Imputed trait values \(\hat{Y}\) for test individuals

Key insight: With large samples, \(\hat{\beta}^* \approx \hat{\beta}\) (both estimate the same true \(\beta\)), which can be used to formulate a least-squares problem.

The LS-Imputation Method

Formulation

If \(Y\) were available, marginal association estimates would be: \[\hat{\beta} = \frac{1}{n_2-1}X'Y\]

Since \(\hat{\beta}^*\) (from training) \(\approx \hat{\beta}\) (from test), solve:

\[\hat{Y} = \arg\min_Y \|\hat{\beta}^* - \frac{1}{n_2-1}X'Y\|^2\]

Solution: \[\hat{Y} = (n_2-1)(XX')^+X\hat{\beta}^*\]

where \((XX')^+\) is the Moore-Penrose generalized inverse, needed because centering the SNPs makes \(XX'\) rank-deficient.
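For completeness, the closed form follows from the normal equations of this least-squares problem:

\[\frac{\partial}{\partial Y}\left\|\hat{\beta}^* - \tfrac{1}{n_2-1}X'Y\right\|^2 = -\tfrac{2}{n_2-1}X\left(\hat{\beta}^* - \tfrac{1}{n_2-1}X'Y\right) = 0 \quad\Rightarrow\quad \tfrac{1}{n_2-1}XX'Y = X\hat{\beta}^*,\]

and applying \((XX')^+\) gives the stated solution; the pseudoinverse selects the minimum-norm solution of the rank-deficient system.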

Regularized Implementation

For computational stability, use ridge regularization: \[\hat{Y}(\lambda) = (n_2-1)(XX' + \lambda I)^{-1}X\hat{\beta}^*\]

Default: \(\lambda = 10^{-6}\) (computationally fast and stable)
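The estimator above takes only a few lines of NumPy. The following is a minimal sketch, not the authors' released implementation; function and variable names are illustrative, and \(\lambda = 10^{-6}\) follows the stated default.

```python
import numpy as np

def ls_impute(X, beta_star, lam=1e-6):
    """LS-imputation sketch: impute trait values from test genotypes
    X (n2 x p) and GWAS marginal effect estimates beta_star (length p).

    Implements Y_hat = (n2 - 1) (X X' + lam I)^{-1} X beta_star.
    """
    n2 = X.shape[0]
    Xc = X - X.mean(axis=0)            # center each SNP, as the method assumes
    K = Xc @ Xc.T + lam * np.eye(n2)   # regularized n2 x n2 kernel
    return (n2 - 1) * np.linalg.solve(K, Xc @ beta_star)
```

Using `np.linalg.solve` on the regularized system avoids forming the inverse explicitly; the text's `linalg.inv` approach is equivalent.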

Batch Processing

For large \(n_2\):
  • Divide the test data into batches of size \(m\)
  • Apply the method to each batch separately
  • Requires \(p > m\) (preferably both \(n_1\) and \(p\) large)
  • Choose \(m\) so that marginal association results are similar to those from the training data
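These batching rules can be sketched as follows; this is a self-contained illustration with hypothetical names, inlining the regularized formula from the text, not the paper's code.

```python
import numpy as np

def ls_impute_batched(X, beta_star, m=20_000, lam=1e-6):
    """Apply LS-imputation batch by batch; each batch (size <= m)
    must satisfy p > m so the system stays over-determined."""
    n2, p = X.shape
    assert p > m, "need more SNPs than individuals per batch"
    y_hat = np.empty(n2)
    for start in range(0, n2, m):
        Xb = X[start:start + m]
        Xb = Xb - Xb.mean(axis=0)              # center within the batch
        nb = Xb.shape[0]
        K = Xb @ Xb.T + lam * np.eye(nb)
        y_hat[start:start + m] = (nb - 1) * np.linalg.solve(K, Xb @ beta_star)
    return y_hat
```

Each batch is centered separately, so imputed values are comparable within but not across batches, which is the between-batch information loss the text describes.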

UK Biobank Application

Dataset

  • Trait: HDL cholesterol
  • Total individuals: 356,351 (White British ancestry)
  • SNPs: 715,783 (after QC: MAF>0.05, missing<10%, HWE p>0.001, LD pruning r²<0.8)
  • Split:
    • Training: \(n_1 = 178,175\)
    • Test: \(n_2 = 178,176\)
  • Implementation: 50,000 SNPs (p<0.05 in training), 9 batches (8×20K + 1×18K individuals)

Perfect Recovery Test

When \(X = X^*\) (same genotypes as training):
  • LS-imputation: correlation with true values > 0.999
  • PRS-CS: correlation < 0.5 (imperfect recovery)

This demonstrates a property unique to LS-imputation: it can perfectly recover trait values for training genotypes, capturing nonlinear information.
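The perfect-recovery property is easy to reproduce in simulation: generate a deliberately nonlinear trait, compute only linear marginal effects on the training genotypes, then impute with the same genotypes. A self-contained sketch with arbitrary simulation settings (not the paper's data):

```python
import numpy as np

rng = np.random.default_rng(0)
n1, p = 100, 5000                       # p >> n1, as the method requires
Xstar = rng.normal(size=(n1, p))
# nonlinear genetic component: an interaction plus a quadratic term
y = Xstar[:, 0] * Xstar[:, 1] + Xstar[:, 2] ** 2 + 0.1 * rng.normal(size=n1)

Xc = Xstar - Xstar.mean(axis=0)         # center SNPs
yc = y - y.mean()                       # center trait
beta_star = Xc.T @ yc / (n1 - 1)        # linear marginal associations only

# impute with test genotypes identical to training genotypes (X = X*)
lam = 1e-6
K = Xc @ Xc.T + lam * np.eye(n1)
y_hat = (n1 - 1) * np.linalg.solve(K, Xc @ beta_star)

r = np.corrcoef(y_hat, yc)[0, 1]        # essentially 1, despite nonlinearity
```

The centered trait values come back almost exactly even though \(\hat{\beta}^*\) encodes only linear marginal associations.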

Test Data Imputation Performance

Correlation between observed and imputed HDL:

| Method | Correlation (unadjusted) | Correlation (adjusted*) |
|---|---|---|
| LS-imputation | 0.177 | 0.204 |
| PRS-CS | 0.279 | 0.313 |

*Adjusted for sex and age

Interpretation: PRS-CS shows higher correlation (expected, as linear effects dominate heritability), but LS-imputation better preserves information for association analyses.

Application I: Non-Additive Genetic Models

Additive Model Results

Comparison of significant SNPs at genome-wide threshold (\(5\times10^{-8}\)):

| Analysis | Approach | Performance |
|---|---|---|
| Training (observed) | Standard GWAS | Baseline |
| Test (observed) | Standard GWAS | Similar to training |
| Test (LS-imputed) | LS-imputation | Similar to observed, slightly conservative |
| Test (PRS-CS-imputed) | PRS-CS | Far too many significant SNPs |

Manhattan plot patterns: LS-imputation closely matched the observed data distribution, while PRS-CS identified excessive associations (any SNP in the PRS model, or in LD with one, becomes significant).

Recessive Model Results

Testing SNPs under recessive genetic model:

LS-imputation:
  • Distribution of significant SNPs similar to observed
  • Slightly more conservative (fewer false positives)
  • Effect size estimates highly correlated with true estimates

PRS-CS:
  • Severe inflation of significant associations
  • Not suitable for non-additive model testing

Dominant Model Results

A similar pattern was observed (Supplementary results):
  • LS-imputation: good agreement with observed
  • PRS-CS: excessive false positives

Quantitative Comparison

Effect size correlations (50,000 SNPs):

| Model | LS vs. Observed | PRS-CS vs. Observed |
|---|---|---|
| Additive | 0.90+ | 0.40-0.60 |
| Recessive | 0.85+ | 0.30-0.50 |

Conclusion: LS-imputation preserves information needed for non-additive model testing; PRS-CS does not.

Application II: SNP-SNP Interaction Detection

Analysis Strategy

  1. Identified 1,758 marginally significant SNPs (p<10⁻⁶) in the training data
  2. Removed high-LD SNPs (r²>0.99), leaving 1,652 SNPs
  3. Tested all pairwise interactions: \(\binom{1652}{2} = 1,363,726\) tests

Model for each pair: \[Y_i = \alpha_0 + \alpha_1\,\text{SNP}_{1i} + \alpha_2\,\text{SNP}_{2i} + \alpha_{12}\,\text{SNP}_{1i}\,\text{SNP}_{2i} + \varepsilon_i\]

Test: \(H_0: \alpha_{12} = 0\) (Wald test)

Significance threshold: \(2.5 \times 10^{-8}\) (Bonferroni correction)
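For a single pair, the Wald test reduces to ordinary least squares with four coefficients. A minimal NumPy sketch with illustrative names (the paper's implementation is not shown; the p-value uses the large-sample normal approximation):

```python
import math

import numpy as np

def interaction_wald(snp1, snp2, y):
    """Wald test for the interaction term in
    y ~ 1 + snp1 + snp2 + snp1*snp2 (ordinary least squares)."""
    n = len(y)
    D = np.column_stack([np.ones(n), snp1, snp2, snp1 * snp2])
    coef, _, _, _ = np.linalg.lstsq(D, y, rcond=None)
    resid = y - D @ coef
    sigma2 = resid @ resid / (n - 4)          # residual variance
    cov = sigma2 * np.linalg.inv(D.T @ D)     # coefficient covariance matrix
    z = coef[3] / math.sqrt(cov[3, 3])        # Wald statistic for alpha_12
    p = math.erfc(abs(z) / math.sqrt(2.0))    # two-sided normal p-value
    return z, p
```

Looping such a test over all qualifying pairs and applying the Bonferroni threshold reproduces the analysis strategy above.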

Results

Interaction effect estimates:

| Comparison | Correlation |
|---|---|
| Training (observed) vs. Test (observed) | Baseline |
| Test (observed) vs. Test (LS-imputed) | 0.95+ |
| Test (observed) vs. Test (PRS-CS-imputed) | 0.60-0.70 |

P-value correlations: Similar pattern, with LS-imputation showing strong agreement with observed data.

Significant Interactions Identified

SNP-SNP pairs (Bonferroni p<\(2.5\times10^{-8}\)):

| Dataset | Significant pairs | Agreement with observed |
|---|---|---|
| Training (observed) | Baseline | - |
| Test (observed) | Similar | Reference |
| Test (LS-imputed) | High overlap | 85-90% |
| Test (PRS-CS-imputed) | Moderate overlap | 60-70% |

Locus-locus interactions (defined using 1,703 independent LD blocks):
  • LS-imputation: high concordance with observed
  • Differences between LS-imputed and observed ≤ differences between training and test (both observed)

Conclusion: LS-imputation successfully detects SNP-SNP interactions; PRS-CS is less suitable.

Application III: Nonlinear Trait Prediction

Random Forest Setup

  • Training: 70% of the test data (random subset)
  • Validation: remaining 30%
  • Features: 1,652 marginally significant SNPs

Goal: Compare RF predictions using observed vs. imputed traits for training
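A minimal sketch of this pipeline, assuming scikit-learn is available; sizes and genotypes are simulated toys, not the UK Biobank data, and in practice the training labels would be the imputed trait:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n, p = 1000, 20                      # toy sizes; the paper used 1,652 SNPs
X = rng.binomial(2, 0.3, size=(n, p)).astype(float)
# simulated trait with a nonlinear (interaction) component
y = X[:, 0] * X[:, 1] + X[:, 2] + 0.5 * rng.normal(size=n)

idx = rng.permutation(n)             # 70/30 split, as in the paper
train, valid = idx[:700], idx[700:]

rf = RandomForestRegressor(n_estimators=200, random_state=0)
rf.fit(X[train], y[train])           # here y stands in for the imputed trait
r = np.corrcoef(rf.predict(X[valid]), y[valid])[0, 1]
```

Because the trees split on genotype combinations, the forest can exploit the interaction term that a linear PRS cannot.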

Results

Correlation of RF predictions on validation data:

| Training data | Correlation with true trait |
|---|---|
| Observed traits | Baseline |
| LS-imputed traits | 0.722 |
| PRS-CS-imputed traits | 0.658 |

Interpretation: LS-imputed traits retain more information about SNP-trait associations (possibly nonlinear) than PRS-CS-imputed traits.

Why LS-imputation performs better:
  • Captures nonlinear relationships in the training data
  • Makes no parametric model assumptions
  • Borrows information across genotypically similar individuals

Statistical Properties

Information Content

For test genotypes \(X\), the imputed trait is approximately: \[\hat{Y} \approx \frac{n_2-1}{n_1-1}(XX^{*\prime}/p)C_{n_1}Y^*\]

where:
  • \(XX^{*\prime}/p\) measures genotypic similarity between test and training individuals
  • \(C_{n_1} = I - \mathbf{1}\mathbf{1}'/n_1\) is the centering matrix
  • \(Y^*\) are the training trait values

Implication: each imputed trait value is a weighted average of the training trait values, with weights determined by genotypic similarity.

Special Case: Perfect Recovery

When \(X = X^*\): \[\hat{Y} \xrightarrow{P} C_{n_1}Y^*\]

As \(p \rightarrow \infty\), imputed values converge to centered training trait values, which contain nonlinear SNP-trait information.

Variance Properties

For imputed trait: \[\text{Var}(\hat{Y}) = (n_2-1)^2(XX')^+X\text{Var}(\hat{\beta}^*)X'(XX')^+\]

Key points:
  • Elements of \(\hat{Y}\) are correlated (not iid)
  • Variances are unequal across individuals
  • Practical solution: treat them as independent in subsequent analyses (a simplification)
  • Choose an appropriate batch size \(m\) to balance the bias-variance trade-off

Asymptotic Behavior

With iid normal X (simplified case):

Small \(n_2\) (fixed): \[\text{Var}(\hat{Y}_j) \approx n_2\tau^2/p\]

Large \(n_2\) (with \(n_2/p \rightarrow c \in (0,1)\)): \[\text{Var}(\hat{Y}_j) = n_2 O(1)\left(\frac{1}{n_1} + \frac{1}{n_2}\right)\tau^2\]

Recommendation: Use smaller \(n_2\) (or batch size \(m\)) for smaller variances.

Comparison: LS-Imputation vs. PRS-CS

Prediction Performance

Trait value correlation (higher is better for prediction):
  • PRS-CS > LS-imputation
  • Expected, as linear effects dominate heritability
  • PRS-CS is optimized for prediction

Association Analysis Performance

Effect size estimation (for subsequent GWAS):
  • LS-imputation >> PRS-CS
  • Critical for non-additive models
  • Essential for interaction detection

Why PRS Methods Fail for Association Analysis

PRS-CS assumptions: \[Y = X\beta + \varepsilon\] \[\beta \sim \text{ContinuousShrinkage}(\text{prior})\]

Problems:
  1. Assumes a linear model with specific SNPs
  2. Imputed traits reflect estimated linear effects only
  3. Any SNP in the model (or in LD with one) will be “significant” by definition
  4. Not suitable for testing associations

Fundamental Difference

| Feature | LS-imputation | PRS-CS |
|---|---|---|
| Model | Nonparametric | Parametric linear |
| Captures nonlinearity | Yes | No |
| Perfect recovery | Yes | No |
| Prediction | Good | Better |
| Association analysis | Best | Poor |
| Interaction detection | Best | Poor |

Implementation Details

Computational Considerations

Requirements:
  • \(p > n_2\) (or \(p > m\) if batches are used)
  • More constraints than unknowns
  • Unique solution (up to centering)

Matrix inversion:
  • Uses linalg.inv from Python numpy
  • Default \(\lambda = 10^{-6}\) for regularization
  • Fast and stable computation

Memory management:
  • Batch processing for large \(n_2\)
  • Typical batch: \(m = 20,000\) individuals
  • Trade-off: smaller \(m\) gives smaller variances but loses information between batches

Parameter Selection

SNP number (\(p\)):
  • Larger is better (more constraints)
  • Example: 50,000 SNPs were used (p<0.05 in training)
  • Can use all available SNPs (memory permitting)

Training sample (\(n_1\)):
  • Larger is better (more accurate \(\hat{\beta}^*\))
  • Example: 178,175 individuals

Batch size (\(m\)):
  • Choose to give marginal results similar to training
  • Not too large (information loss between batches)
  • Not too small (computational inefficiency)
  • Example: 20,000 individuals per batch

Quality Control Strategy

Recommended: choose \(m\) such that the imputed trait gives:
  1. Marginal effect estimates ≈ the training estimates
  2. Standard errors ≈ the training SEs (after rescaling)

SE rescaling: compare \(\sqrt{n_2/n_1} \times SE_{test}\) with the training SEs
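This check can be scripted: compute marginal effect estimates and standard errors from the imputed trait and compare them, after rescaling, with the training summary statistics. A minimal sketch with illustrative names:

```python
import numpy as np

def marginal_stats(X, y):
    """Per-SNP marginal regression: slope estimate and standard error."""
    n = len(y)
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    sxx = (Xc ** 2).sum(axis=0)               # per-SNP sum of squares
    beta = Xc.T @ yc / sxx                    # marginal slopes
    rss = (yc ** 2).sum() - beta ** 2 * sxx   # per-SNP residual sum of squares
    se = np.sqrt(rss / (n - 2) / sxx)
    return beta, se
```

Comparing `np.sqrt(n2 / n1) * se_test` against the training SEs implements the rescaling rule above.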

Extensions and Variations

Binary Traits

The method extends to binary outcomes (Supplementary):
  • Similar formulation
  • Logistic regression framework
  • Applied to UK Biobank hypertension data
  • Promising preliminary results

Weighted Least Squares

Alternative to ordinary least squares: \[\hat{Y}_{WLS} = \arg\min_Y (\hat{\beta}^* - \frac{1}{n_2-1}X'Y)'W(\hat{\beta}^* - \frac{1}{n_2-1}X'Y)\]

where \(W\) is a diagonal matrix with weights \(\propto 1/\text{Var}(\hat{\beta}_j^*)\)

Result: similar to OLS (Supplementary); this variant was not pursued further.
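The weighted variant also has a closed form, \(\hat{Y}_{WLS} = (n_2-1)(XWX')^+XW\hat{\beta}^*\), which follows from the weighted normal equations. A hypothetical sketch (names illustrative, ridge term added for stability as in the unweighted case):

```python
import numpy as np

def ls_impute_wls(X, beta_star, se_star, lam=1e-6):
    """Weighted LS-imputation sketch: each marginal estimate is weighted
    by its precision 1/Var(beta_j*).

    Solves Y_hat = (n2 - 1) (X W X' + lam I)^{-1} X W beta_star.
    """
    n2 = X.shape[0]
    Xc = X - X.mean(axis=0)
    w = 1.0 / se_star ** 2              # diagonal of W
    XW = Xc * w                         # scales column j of X by w_j
    K = XW @ Xc.T + lam * np.eye(n2)
    return (n2 - 1) * np.linalg.solve(K, XW @ beta_star)
```

With equal standard errors the weights are constant and the estimator reduces to the ordinary least-squares version.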

Intercept Known Case

If the intercept \(\alpha_0\) is available for each SNP:
  • No centering is needed
  • \(X\) is full rank
  • \((XX')^+ = (XX')^{-1}\)
  • Simpler interpretation

Sample Size Sensitivity

  • Training sample \(n_1\): larger is always better
  • SNP number \(p\): larger is always better
  • Test sample \(n_2\): results stable for \(n_2 \geq 25,000\)
  • Batch size \(m\): complex trade-off; choose empirically

Practical Applications

Use Case 1: Augmenting Incomplete Data

Scenario: a biobank with genotypes but a missing trait
  • Late-onset disease (e.g., Alzheimer’s) not yet manifested
  • Expensive or difficult-to-measure phenotype
  • Large external GWAS summary data available

Solution: Impute trait values to augment analyses

Use Case 2: Cross-Study Integration

Scenario: multiple related studies
  • Different traits measured in each
  • Want to analyze a trait not measured in the focal study
  • GWAS summary data available from other studies

Solution: Use summary data to impute unmeasured traits

Use Case 3: Privacy-Preserving Collaboration

Scenario: private breeding programs or clinical cohorts
  • Cannot share individual-level data
  • Can share summary statistics
  • Want to conduct joint analyses

Solution: Each site imputes traits using others’ summaries

Use Case 4: Nonlinear Model Development

Scenario: developing complex prediction models
  • Neural networks, deep learning, gradient boosting
  • These require individual-level data
  • Only summary data are available

Solution: Impute traits to enable nonlinear model training

Limitations and Considerations

Current Limitations

  1. Variance structure: Elements of \(\hat{Y}\) are correlated and have unequal variances
    • Currently ignored in subsequent analyses
    • May cause slight over/under-estimation of SEs
    • Accounting for the correlations is computationally prohibitive
  2. Centering effects: Each batch centered at mean 0
    • Information loss between batches (relative levels)
    • Mitigated by using larger batches
    • Trade-off with variance considerations
  3. Constraint requirement: Needs \(p > m\)
    • Must have more SNPs than individuals per batch
    • May limit applicability in some scenarios
  4. Rare variant use: Current implementation uses common variants only
    • Rare variants have lower genotyping quality
    • Expected to contain less heritability information
    • Could be explored with sequencing data

Statistical Assumptions

  1. Same population: Training and test from same population
  2. Unrelated individuals: No close relatives in test data
  3. White noise errors: \(\varepsilon\) independent of genotypes
  4. Large samples: Asymptotic properties require large \(n_1\), \(p\)

Interpretation Caveats

Imputed values:
  • Represent genetic components only
  • Do not capture environmental variation
  • Are centered (mean 0 within each batch)
  • Should not be treated as observed phenotypes in all contexts

Subsequent analyses:
  • Slightly conservative p-values (good for Type I error control)
  • Effect size estimates are unbiased
  • Power may be slightly reduced vs. observed data

Comparison to Alternative Approaches

vs. PRS Methods

All existing PRS methods for summary data:
  • Assume linear models
  • Optimize for prediction
  • Are not designed for association analysis
  • Cannot detect nonlinear effects

LS-imputation:
  • Nonparametric
  • Optimized for association analysis
  • Can detect nonlinear effects
  • Less optimal for pure prediction

vs. Multi-Trait Imputation

Previous methods (Dahl et al., Hormozdiari et al.):
  • Impute the focal trait using other measured traits
  • Problem: any variants associated with the imputation traits will appear associated with the focal trait
  • Loss of specificity

LS-imputation:
  • Uses only genotypes and summary data for the focal trait
  • Maintains specificity to the focal trait
  • Suitable for association analysis

vs. Direct GWAS Summary Analysis

Standard approach with summary data:
  • Can only test linear marginal associations
  • Cannot detect non-additive effects
  • Cannot detect interactions
  • Cannot use nonlinear prediction models

LS-imputation approach:
  • Enables all of the above
  • More flexible for exploratory analysis
  • Can be combined with machine learning methods

Future Directions

Methodological Extensions

  1. Efficient algorithms: Handle larger datasets without batching
  2. Generalized least squares: Account for correlated marginal estimates
  3. Better variance estimation: Properly handle correlated imputed values
  4. Rare variant integration: Extend to sequencing-based data

Additional Applications

  1. Multi-trait analysis: Jointly impute multiple correlated traits
  2. Transcriptome-wide studies: Impute gene expression for TWAS
  3. Polygenic score development: Use imputed traits to train complex nonlinear PRS models
  4. Pathway analysis: Enable pathway-level nonlinear analyses

Practical Improvements

  1. Automated parameter selection: Data-driven choice of \(m\), \(p\)
  2. Distributed computing: Parallel batch processing
  3. Memory optimization: More efficient matrix operations
  4. Quality metrics: Better diagnostics for imputation quality