Using GWAS summary data to impute traits for genotyped individuals
- Novel nonparametric LS-imputation method recovers genetic components of traits from GWAS summary statistics and individual genotypes, enabling nonlinear association analyses impossible with summary data alone
- Perfectly recovers trait values when test genotypes match training genotypes (correlation >0.999), capturing nonlinear SNP-trait information despite using only linear marginal associations
- Outperforms PRS-CS for association analyses in UK Biobank HDL data: successfully detects non-additive genetic effects, SNP-SNP interactions, and trains nonlinear prediction models (random forests) while PRS-CS shows severe false positive inflation
PubMed: 37181332
DOI: 10.1016/j.xhgg.2023.100197
Overview generated by: Claude Sonnet 4.5, 26/11/2025
Key Findings
This study introduces LS-imputation, a nonparametric method that uses GWAS summary statistics combined with individual-level genotypes to impute trait values, enabling nonlinear SNP-trait association analyses and machine learning applications that are impossible with summary statistics alone.
Main Discoveries
Novel imputation approach: First method to recover genetic components of traits from GWAS summary data for nonlinear association analysis
Perfect recovery property: When test genotypes match training genotypes (X = X*), the method perfectly recovers (centered) trait values, capturing nonlinear SNP-trait information despite using only linear marginal associations
Superior performance for association analysis: LS-imputation outperforms state-of-the-art PRS method (PRS-CS) for subsequent association analyses under non-additive models and SNP-SNP interaction detection
Enables new analyses: Makes possible three applications currently impossible with GWAS summary data: non-additive genetic models, SNP-SNP interaction detection, and nonlinear prediction models
Study Design
Core Problem
GWAS summary statistics capture only linear marginal SNP-trait associations, limiting their use to linear analyses. The method addresses the question: how can summary data be used for nonlinear SNP-trait analyses?
Genetic Model
Assumes unspecified functional form: \[y = E(y|x) + \varepsilon = g(x) + \varepsilon\]
where:
- \(g(x)\) is the unknown genetic component (possibly nonlinear)
- \(\varepsilon\) captures environmental effects and noise
- No parametric assumptions are placed on \(g(\cdot)\)
Method Overview
Input:
- GWAS summary data: \(\{(\hat{\beta}_j^*, s_j^*): j=1,\ldots,p\}\) from training data \((X^*, Y^*)\)
- Test genotype matrix: \(X\) (\(n_2 \times p\))
Output: Imputed trait values \(\hat{Y}\) for test individuals
Key insight: With large samples, \(\hat{\beta}^* \approx \hat{\beta}\) (both estimate the same true \(\beta\)), which can be exploited to formulate a least-squares problem for the unobserved \(Y\).
The LS-Imputation Method
Formulation
If \(Y\) were available, marginal association estimates would be: \[\hat{\beta} = \frac{1}{n_2-1}X'Y\]
Since \(\hat{\beta}^*\) (from training) \(\approx \hat{\beta}\) (from test), solve:
\[\hat{Y} = \arg\min_Y \|\hat{\beta}^* - \frac{1}{n_2-1}X'Y\|^2\]
Solution: \[\hat{Y} = (n_2-1)(XX')^+X\hat{\beta}^*\]
where \((XX')^+\) is the Moore-Penrose generalized inverse (needed because centering the SNPs makes \(XX'\) rank-deficient).
Regularized Implementation
For computational stability, use ridge regularization: \[\hat{Y}(\lambda) = (n_2-1)(XX' + \lambda I)^{-1}X\hat{\beta}^*\]
Default: \(\lambda = 10^{-6}\) (computationally fast and stable)
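A minimal numpy sketch of the regularized solution (function name and interface are illustrative, not the authors' code):

```python
import numpy as np

def ls_impute(X, beta_hat_star, lam=1e-6):
    """Impute centered trait values from test genotypes X (n2 x p) and
    training marginal estimates beta_hat_star (length p), via
    Y_hat = (n2 - 1) * (X X' + lam I)^{-1} X beta_hat_star."""
    n2 = X.shape[0]
    Xc = X - X.mean(axis=0)           # center each SNP, as the method assumes
    K = Xc @ Xc.T + lam * np.eye(n2)  # ridge-regularized n2 x n2 Gram matrix
    return (n2 - 1) * np.linalg.solve(K, Xc @ beta_hat_star)
```

Solving the linear system with `np.linalg.solve` is numerically preferable to forming the explicit inverse, though the result is the same.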
Batch Processing
For large \(n_2\):
- Divide test data into batches of size \(m\)
- Apply the method to each batch separately
- Requires \(p > m\) (preferably both \(n_1\) and \(p\) large)
- Choose \(m\) so that marginal association results resemble those from the training data
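The batching scheme can be sketched as follows (the function name is illustrative, and details such as batch assignment are simplified):

```python
import numpy as np

def ls_impute_batched(X, beta_hat_star, m=20_000, lam=1e-6):
    """Apply LS-imputation separately to consecutive batches of m
    individuals; each batch is centered on its own, so relative trait
    levels across batches are not preserved."""
    parts = []
    for start in range(0, X.shape[0], m):
        Xb = X[start:start + m]
        nb = Xb.shape[0]
        Xc = Xb - Xb.mean(axis=0)                # center within the batch
        K = Xc @ Xc.T + lam * np.eye(nb)
        parts.append((nb - 1) * np.linalg.solve(K, Xc @ beta_hat_star))
    return np.concatenate(parts)
```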
UK Biobank Application
Dataset
- Trait: HDL cholesterol
- Total individuals: 356,351 (White British ancestry)
- SNPs: 715,783 (after QC: MAF>0.05, missing<10%, HWE p>0.001, LD pruning r²<0.8)
- Split:
- Training: \(n_1 = 178,175\)
- Test: \(n_2 = 178,176\)
- Implementation: 50,000 SNPs (p<0.05 in training), 9 batches (8×20K + 1×18K individuals)
Perfect Recovery Test
When \(X = X^*\) (same genotypes as training):
- LS-imputation: correlation with true values > 0.999
- PRS-CS: correlation < 0.5 (imperfect recovery)
Demonstrates unique property: LS-imputation can perfectly recover trait values for training genotypes, capturing nonlinear information.
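The perfect-recovery property is easy to check on simulated data (a toy sketch, not the UK Biobank analysis): when the test genotypes equal the training genotypes and \(p \gg n\), the imputed values essentially reproduce the centered trait, including its nonlinear component.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 100, 2000
# simulated genotypes (0/1/2) and a trait with a nonlinear (recessive) term
G = rng.binomial(2, 0.3, size=(n, p)).astype(float)
y = G[:, 0] + 0.5 * (G[:, 1] == 2) + 0.1 * rng.normal(size=n)

X = G - G.mean(axis=0)                    # centered SNPs
beta_hat = X.T @ y / (n - 1)              # marginal summary statistics
K = X @ X.T + 1e-6 * np.eye(n)
y_hat = (n - 1) * np.linalg.solve(K, X @ beta_hat)

r = np.corrcoef(y_hat, y - y.mean())[0, 1]  # near 1 when X matches training
```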
Test Data Imputation Performance
Correlation between observed and imputed HDL:
| Method | Correlation (unadjusted) | Correlation (adjusted*) |
|---|---|---|
| LS-imputation | 0.177 | 0.204 |
| PRS-CS | 0.279 | 0.313 |
*Adjusted for sex and age
Interpretation: PRS-CS shows higher correlation (expected, as linear effects dominate heritability), but LS-imputation better preserves information for association analyses.
Application I: Non-Additive Genetic Models
Additive Model Results
Comparison of significant SNPs at genome-wide threshold (\(5\times10^{-8}\)):
| Analysis | Approach | Performance |
|---|---|---|
| Training (observed) | Standard GWAS | Baseline |
| Test (observed) | Standard GWAS | Similar to training |
| Test (LS-imputed) | Our method | Similar to observed, slightly conservative |
| Test (PRS-CS-imputed) | PRS method | Severely inflated number of significant SNPs |
Manhattan plot patterns: LS-imputation closely matched observed data distribution, while PRS-CS identified excessive associations (any SNP in PRS model or LD with them becomes significant).
Recessive Model Results
Testing SNPs under recessive genetic model:
LS-imputation:
- Distribution of significant SNPs similar to observed
- Slightly more conservative (fewer false positives)
- Effect size estimates highly correlated with estimates from observed data
PRS-CS:
- Severe inflation of significant associations
- Not suitable for non-additive model testing
Dominant Model Results
Similar pattern observed (Supplementary results):
- LS-imputation: good agreement with observed data
- PRS-CS: excessive false positives
Quantitative Comparison
Effect size correlations (50,000 SNPs):
| Model | LS vs. Observed | PRS-CS vs. Observed |
|---|---|---|
| Additive | 0.90+ | 0.40-0.60 |
| Recessive | 0.85+ | 0.30-0.50 |
Conclusion: LS-imputation preserves information needed for non-additive model testing; PRS-CS does not.
Application II: SNP-SNP Interaction Detection
Analysis Strategy
- Identified 1,758 marginally significant SNPs (p<10⁻⁶) in training data
- Removed high-LD SNPs (r²>0.99) → 1,652 SNPs
- Tested all pairwise interactions: \(\binom{1652}{2} = 1,363,726\) tests
Model for each pair: \[Y_i = \alpha_0 + \text{SNP}_{1i} \times \alpha_1 + \text{SNP}_{2i} \times \alpha_2 + \text{SNP}_{1i} \times \text{SNP}_{2i} \times \alpha_{12} + \varepsilon_i\]
Test: \(H_0: \alpha_{12} = 0\) (Wald test)
Significance threshold: \(2.5 \times 10^{-8}\) (Bonferroni correction)
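A single pair's test can be sketched with ordinary least squares and a normal-approximation Wald p-value (illustrative code; the paper's implementation details may differ):

```python
import math
import numpy as np

def interaction_wald(y, snp1, snp2):
    """Fit y ~ 1 + snp1 + snp2 + snp1*snp2 by OLS and return the Wald
    z-statistic and two-sided p-value for the interaction term alpha_12."""
    n = len(y)
    Z = np.column_stack([np.ones(n), snp1, snp2, snp1 * snp2])
    ZtZ_inv = np.linalg.inv(Z.T @ Z)
    coef = ZtZ_inv @ Z.T @ y
    resid = y - Z @ coef
    sigma2 = resid @ resid / (n - Z.shape[1])    # residual variance
    se = math.sqrt(sigma2 * ZtZ_inv[3, 3])       # SE of alpha_12
    z = coef[3] / se
    p = math.erfc(abs(z) / math.sqrt(2))         # two-sided normal p-value
    return z, p
```

In the full analysis this test is repeated over all pairs and compared against the Bonferroni threshold.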
Results
Interaction effect estimates:
| Comparison | Correlation |
|---|---|
| Training (observed) vs. Test (observed) | Baseline |
| Test (observed) vs. Test (LS-imputed) | 0.95+ |
| Test (observed) vs. Test (PRS-CS-imputed) | 0.60-0.70 |
P-value correlations: Similar pattern, with LS-imputation showing strong agreement with observed data.
Significant Interactions Identified
SNP-SNP pairs (Bonferroni p<\(2.5\times10^{-8}\)):
| Dataset | Significant pairs | Agreement with observed |
|---|---|---|
| Training (observed) | Baseline | - |
| Test (observed) | Similar | Reference |
| Test (LS-imputed) | High overlap | 85-90% |
| Test (PRS-CS-imputed) | Moderate overlap | 60-70% |
Locus-locus interactions (defined using 1,703 independent LD blocks):
- LS-imputation: high concordance with observed data
- Differences between LS-imputed and observed ≤ differences between training and test (both observed)
Conclusion: LS-imputation successfully detects SNP-SNP interactions; PRS-CS less suitable.
Application III: Nonlinear Trait Prediction
Random Forest Setup
Training: 70% of the test data (random subset)
Validation: remaining 30%
Features: 1,652 marginally significant SNPs
Goal: Compare RF predictions using observed vs. imputed traits for training
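A minimal version of this comparison can be sketched with scikit-learn's `RandomForestRegressor` (simulated data and hyperparameters are illustrative; in the actual analysis the training labels are the imputed HDL values):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(2)
n, p = 1000, 20
G = rng.binomial(2, 0.3, size=(n, p)).astype(float)
# trait with an interaction term that a linear score cannot represent
y = G[:, 0] + G[:, 1] * G[:, 2] + rng.normal(size=n)

split = int(0.7 * n)  # 70% training / 30% validation split, as above
rf = RandomForestRegressor(n_estimators=200, random_state=0)
rf.fit(G[:split], y[:split])  # here y would be the imputed trait
pred = rf.predict(G[split:])
r = np.corrcoef(pred, y[split:])[0, 1]  # validation correlation
```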
Results
Correlation of RF predictions on validation data:
| Training data | Correlation with true trait |
|---|---|
| Observed traits | Baseline |
| LS-imputed traits | 0.722 |
| PRS-CS-imputed traits | 0.658 |
Interpretation: LS-imputed traits retain more information about SNP-trait associations (possibly nonlinear) than PRS-CS-imputed traits.
Why LS-imputation performs better:
- Captures nonlinear relationships in the training data
- No parametric model assumptions
- Borrows information across genotypically similar individuals
Statistical Properties
Information Content
For test genotypes \(X\), the imputed trait is approximately: \[\hat{Y} \approx \frac{n_2-1}{n_1-1}(XX^{*\prime}/p)C_{n_1}Y^*\]
where:
- \(XX^{*\prime}/p\) measures genotypic similarities between test and training individuals
- \(C_{n_1} = I - 11'/n_1\) is the centering matrix
- \(Y^*\) are the training trait values
Implication: Imputed trait is weighted average of training traits, weights determined by genotypic similarity.
Special Case: Perfect Recovery
When \(X = X^*\): \[\hat{Y} \xrightarrow{P} C_{n_1}Y^*\]
As \(p \rightarrow \infty\), imputed values converge to centered training trait values, which contain nonlinear SNP-trait information.
Variance Properties
For imputed trait: \[\text{Var}(\hat{Y}) = (n_2-1)^2(XX')^+X\text{Var}(\hat{\beta}^*)X'(XX')^+\]
Key points:
- Elements of \(\hat{Y}\) are correlated (not iid)
- Variances are unequal across individuals
- Practical simplification: treat elements as independent in subsequent analyses
- Choose batch size \(m\) to balance the bias-variance trade-off
Asymptotic Behavior
With iid normal X (simplified case):
Small \(n_2\) (fixed): \[\text{Var}(\hat{Y}_j) \approx n_2\tau^2/p\]
Large \(n_2\) (with \(n_2/p \rightarrow c \in (0,1)\)): \[\text{Var}(\hat{Y}_j) = n_2 O(1)\left(\frac{1}{n_1} + \frac{1}{n_2}\right)\tau^2\]
Recommendation: Use smaller \(n_2\) (or batch size \(m\)) for smaller variances.
Comparison: LS-Imputation vs. PRS-CS
Prediction Performance
Trait value correlation (higher is better for prediction):
- PRS-CS > LS-imputation
- Expected: linear effects dominate heritability
- PRS-CS is optimized for prediction
Association Analysis Performance
Effect size estimation (for subsequent GWAS):
- LS-imputation >> PRS-CS
- Critical for non-additive models
- Essential for interaction detection
Why PRS Methods Fail for Association Analysis
PRS-CS assumptions: \[Y = X\beta + \varepsilon\] \[\beta \sim \text{ContinuousShrinkage}(\text{prior})\]
Problems:
1. Assumes a linear model with specific SNPs
2. Imputed traits reflect estimated linear effects only
3. Any SNP in the model (or in LD with one) is “significant” by construction
4. Not suitable for testing associations
Fundamental Difference
| Feature | LS-imputation | PRS-CS |
|---|---|---|
| Model | Nonparametric | Parametric linear |
| Captures nonlinearity | Yes | No |
| Perfect recovery | Yes | No |
| Prediction | Good | Better |
| Association analysis | Best | Poor |
| Interaction detection | Best | Poor |
Implementation Details
Computational Considerations
Requirements:
- \(p > n_2\) (or \(p > m\) if batches are used)
- More constraints than unknowns
- Unique solution (up to centering)
Matrix inversion:
- Used numpy.linalg.inv in Python
- Default \(\lambda = 10^{-6}\) for regularization
- Fast and stable computation
Memory management:
- Batch processing for large \(n_2\)
- Typical batch: \(m = 20,000\) individuals
- Trade-off: smaller \(m\) gives smaller variance but loses information between batches
Parameter Selection
SNP number (\(p\)):
- Larger is better (more constraints)
- Example: used 50,000 SNPs (p < 0.05 in training)
- Can use all available SNPs (memory permitting)

Training sample (\(n_1\)):
- Larger is better (more accurate \(\hat{\beta}^*\))
- Example: 178,175 individuals

Batch size (\(m\)):
- Choose to give marginal results similar to training
- Not too large (larger variance of imputed values)
- Not too small (information loss between batches)
- Example: 20,000 individuals per batch
Quality Control Strategy
Recommended: choose \(m\) such that the imputed trait gives:
1. Marginal effect estimates ≈ training estimates
2. Standard errors ≈ training SEs (after rescaling)
SE rescaling: \(\sqrt{n_2/n_1} \times SE_{test}\) for comparison
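This check can be sketched by recomputing marginal estimates from the imputed trait and comparing them with the training summary statistics (a hypothetical helper, not from the paper):

```python
import numpy as np

def imputation_diagnostics(X, y_hat, beta_hat_star):
    """Recompute marginal estimates from the imputed trait and return
    their correlation with the training estimates; a value near 1
    suggests the chosen batch size m preserved marginal associations."""
    n2 = X.shape[0]
    Xc = X - X.mean(axis=0)
    beta_imp = Xc.T @ y_hat / (n2 - 1)
    return np.corrcoef(beta_imp, beta_hat_star)[0, 1]
```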
Extensions and Variations
Binary Traits
Method extended to binary outcomes (Supplementary):
- Similar formulation
- Logistic regression framework
- Applied to UK Biobank hypertension data
- Promising preliminary results
Weighted Least Squares
Alternative to ordinary least squares: \[\hat{Y}_{WLS} = \arg\min_Y (\hat{\beta}^* - \frac{1}{n_2-1}X'Y)'W(\hat{\beta}^* - \frac{1}{n_2-1}X'Y)\]
where \(W\) = diagonal matrix with weights \(\propto 1/\text{Var}(\hat{\beta}_j^*)\)
Result: Similar to OLS (Supplementary), not pursued further.
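For completeness, the weighted variant can be sketched by rescaling the columns of \(X\) (solution form inferred from the objective above; interface illustrative):

```python
import numpy as np

def ls_impute_wls(X, beta_hat_star, weights, lam=1e-6):
    """Weighted LS-imputation: minimizing the W-weighted objective gives
    Y_hat = (n2 - 1) * (X W X' + lam I)^{-1} X W beta_hat_star,
    with W = diag(weights). weights = 1 recovers ordinary LS-imputation."""
    n2 = X.shape[0]
    Xc = X - X.mean(axis=0)
    XW = Xc * weights                  # scales SNP j by its weight w_j
    K = XW @ Xc.T + lam * np.eye(n2)   # X W X' + lam I
    return (n2 - 1) * np.linalg.solve(K, XW @ beta_hat_star)
```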
Intercept Known Case
If the intercept \(\alpha_0\) is available for each marginal SNP model:
- No centering needed
- \(X\) is full rank
- \((XX')^+ = (XX')^{-1}\)
- Simpler interpretation
Sample Size Sensitivity
Training sample \(n_1\): larger is always better
SNP number \(p\): larger is always better
Test sample \(n_2\): results stable for \(n_2 \geq 25,000\)
Batch size \(m\): complex trade-off; choose empirically
Practical Applications
Use Case 1: Augmenting Incomplete Data
Scenario: biobank with genotypes but missing trait
- Late-onset disease (e.g., Alzheimer’s) not yet manifested
- Expensive or difficult-to-measure phenotype
- Large external GWAS summary data available
Solution: Impute trait values to augment analyses
Use Case 2: Cross-Study Integration
Scenario: multiple related studies
- Different traits measured in each
- Want to analyze a trait not measured in the focal study
- GWAS summary data available from other studies
Solution: Use summary data to impute unmeasured traits
Use Case 3: Privacy-Preserving Collaboration
Scenario: private breeding programs or clinical cohorts
- Cannot share individual-level data
- Can share summary statistics
- Want to conduct joint analyses
Solution: Each site imputes traits using others’ summaries
Use Case 4: Nonlinear Model Development
Scenario: developing complex prediction models
- Neural networks, deep learning, gradient boosting
- Require individual-level data
- Only summary data available
Solution: Impute traits to enable nonlinear model training
Limitations and Considerations
Current Limitations
- Variance structure: elements of \(\hat{Y}\) are correlated and have unequal variances
  - Currently ignored in subsequent analyses
  - May cause slight over/under-estimation of SEs
  - Accounting for correlations is computationally prohibitive
- Centering effects: each batch centered at mean 0
  - Information loss between batches (relative levels)
  - Mitigated by using larger batches
  - Trade-off with variance considerations
- Constraint requirement: needs \(p > m\)
  - Must have more SNPs than individuals per batch
  - May limit applicability in some scenarios
- Rare variant use: current implementation uses common variants only
  - Rare variants have lower genotyping quality
  - Expected to contain less heritability information
  - Could be explored with sequencing data
Statistical Assumptions
- Same population: Training and test from same population
- Unrelated individuals: No close relatives in test data
- White noise errors: \(\varepsilon\) independent of genotypes
- Large samples: Asymptotic properties require large \(n_1\), \(p\)
Interpretation Caveats
Imputed values:
- Represent genetic components only
- Do not capture environmental variation
- Centered (mean 0 within each batch)
- Should not be treated as observed phenotypes in all contexts
Subsequent analyses:
- Slightly conservative p-values (good for Type I error control)
- Effect size estimates unbiased
- Power may be slightly reduced vs. observed data
Comparison to Alternative Approaches
vs. PRS Methods
All existing PRS methods for summary data:
- Assume linear models
- Optimize for prediction
- Not designed for association analysis
- Cannot detect nonlinear effects

LS-imputation:
- Nonparametric
- Optimized for association analysis
- Can detect nonlinear effects
- Less optimal for pure prediction
vs. Multi-Trait Imputation
Previous methods (Dahl et al., Hormozdiari et al.):
- Impute the focal trait using other measured traits
- Problem: any variants associated with the imputation traits will appear associated with the focal trait
- Loss of specificity

LS-imputation:
- Uses only genotypes and summary data for the focal trait
- Maintains specificity to the focal trait
- Suitable for association analysis
vs. Direct GWAS Summary Analysis
Standard approach with summary data:
- Can only test linear marginal associations
- Cannot detect non-additive effects
- Cannot detect interactions
- Cannot train nonlinear prediction models

LS-imputation approach:
- Enables all of the above
- More flexible for exploratory analysis
- Can be combined with machine learning methods
Future Directions
Methodological Extensions
- Efficient algorithms: Handle larger datasets without batching
- Generalized least squares: Account for correlated marginal estimates
- Better variance estimation: Properly handle correlated imputed values
- Rare variant integration: Extend to sequencing-based data
Additional Applications
- Multi-trait analysis: Jointly impute multiple correlated traits
- Transcriptome-wide studies: Impute gene expression for TWAS
- Polygenic score development: Use imputed traits to train complex nonlinear PRS models
- Pathway analysis: Enable pathway-level nonlinear analyses
Practical Improvements
- Automated parameter selection: Data-driven choice of \(m\), \(p\)
- Distributed computing: Parallel batch processing
- Memory optimization: More efficient matrix operations
- Quality metrics: Better diagnostics for imputation quality