Genome-wide association scans for secondary traits using case-control samples
- This statistical methodology paper examines the bias introduced when a case-control GWAS (designed for a primary disease \(D\)) is used to analyze a secondary quantitative trait (\(T\)).
- It demonstrates that naïve analysis (ignoring case-control ascertainment) yields invalid tests of the marker-secondary trait association (\(G-T\)) when both the marker \(G\) and the trait \(T\) are independently associated with the primary disease \(D\), and biased effect estimates whenever a true \(G-T\) association exists and \(T\) is associated with \(D\).
- The authors propose Inverse-Probability-of-Sampling-Weighted (IPW) regression as a robust alternative. It provides unbiased estimates in all scenarios considered, at the cost of reduced statistical power. For markers not associated with the primary disease, the naïve analysis remains appropriate.
PubMed: 19365863
DOI: 10.1002/gepi.20424
Overview generated by: Gemini 2.5 Flash, 26/11/2025
Key Findings
This paper addresses a methodological issue in genetic epidemiology: researchers often try to maximize the return on an expensive case-control Genome-Wide Association Study (GWAS) by analyzing additional, or secondary, quantitative traits (e.g., body mass index, mammographic density) collected on the same subjects. The core finding is that a naïve analysis (ignoring the case-control ascertainment) can give biased estimates of the association between a genetic marker and the secondary trait, particularly when both the marker and the secondary trait are independently associated with primary disease risk. The authors demonstrate that Inverse-Probability-of-Sampling-Weighted (IPW) regression provides unbiased estimates of the marker-secondary trait association in all scenarios considered, although it has lower statistical power than the naïve methods in the settings where those methods are themselves unbiased.
Statistical Problem: Ascertainment Bias
Case-control studies are designed to test the association between a genetic marker and a primary binary disease (e.g., breast cancer). When using these same samples to study a quantitative trait (the secondary trait, e.g., mammographic density), the selection process (ascertainment) based on the disease status introduces a bias.
Naïve Analysis Scenarios
The study mathematically and via simulation tested the performance of two “naïve” approaches for testing the association between a marker (\(G\)) and a secondary trait (\(T\)) using case-control data, where \(D\) is the primary disease status:
- Ignoring \(D\): Regressing \(T\) on \(G\) in the combined sample, ignoring case-control status.
- Stratifying on \(D\): Regressing \(T\) on \(G\) separately within cases and controls, and then combining the results (e.g., via meta-analysis).
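The two naïve analyses above can be sketched as follows. This is a minimal illustration, not the authors' code: the function names (`ols_slope`, `naive_ignore_d`, `naive_stratified`), the fixed-effect inverse-variance pooling, and the synthetic data are all assumptions made for the example. In the synthetic data \(D\) is generated independently of \(G\) and \(T\), so neither analysis is expected to be biased here.

```python
import numpy as np

def ols_slope(g, t):
    """Slope and its standard error from a simple linear regression of t on g."""
    g = np.asarray(g, dtype=float)
    t = np.asarray(t, dtype=float)
    X = np.column_stack([np.ones_like(g), g])
    beta, *_ = np.linalg.lstsq(X, t, rcond=None)
    resid = t - X @ beta
    sigma2 = resid @ resid / (len(t) - 2)        # residual variance
    cov = sigma2 * np.linalg.inv(X.T @ X)        # model-based covariance
    return beta[1], np.sqrt(cov[1, 1])

def naive_ignore_d(g, t):
    """Naive analysis 1: regress T on G in the combined sample, ignoring D."""
    return ols_slope(g, t)

def naive_stratified(g, t, d):
    """Naive analysis 2: regress T on G within controls and within cases,
    then pool the two slopes by inverse-variance weighting (an assumed
    choice of meta-analysis method)."""
    g, t, d = (np.asarray(a) for a in (g, t, d))
    fits = [ols_slope(g[d == k], t[d == k]) for k in (0, 1)]
    slopes = np.array([b for b, _ in fits])
    inv_var = np.array([1.0 / se**2 for _, se in fits])
    pooled = np.sum(inv_var * slopes) / np.sum(inv_var)
    return pooled, np.sqrt(1.0 / np.sum(inv_var))

# Synthetic data (illustrative values): true G-T slope is 0.5,
# and D is unrelated to both G and T.
rng = np.random.default_rng(1)
n = 20_000
g = rng.binomial(2, 0.3, size=n)                 # additive genotype 0/1/2
t = 0.5 * g + rng.normal(size=n)                 # quantitative trait
d = rng.binomial(1, 0.5, size=n)                 # disease status

slope_ignore, se_ignore = naive_ignore_d(g, t)
slope_strat, se_strat = naive_stratified(g, t, d)
print(f"ignoring D:      {slope_ignore:.3f} (SE {se_ignore:.3f})")
print(f"stratified on D: {slope_strat:.3f} (SE {se_strat:.3f})")
```

Both functions return a slope and a standard error, so a Wald test of the \(G-T\) association can be formed from either.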
For both naïve methods, the paper shows:
- Proper Type I Error Rates (Unbiased Test): When testing the null hypothesis of no \(G-T\) association, the methods maintain the correct Type I error rate unless both \(G\) and \(T\) are independently associated with the primary disease \(D\).
- Unbiased Estimates (Under Alternative): Under the alternative hypothesis (i.e., a true \(G-T\) association exists), the estimated effect size is unbiased only if the secondary trait \(T\) is not associated with the primary disease \(D\).
Source of Bias
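The bias in the naïve methods arises from selection on a common effect (a collider) rather than from confounding: when both \(G\) and \(T\) influence \(D\) (\(G \rightarrow D \leftarrow T\)), \(D\) is a shared consequence of the two. Because the case-control design samples non-randomly on \(D\), this selection distorts the observed association between \(G\) and \(T\), and can induce one even when none exists in the source population.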
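A small simulation can make this selection effect concrete. The sketch below is an illustration under assumed parameters, not a reproduction of the paper's simulations: the allele frequency, effect sizes, logistic disease model, and sample sizes are all hypothetical. \(G\) and \(T\) are generated independently (no true \(G-T\) association), both raise disease risk, and equal numbers of cases and controls are then sampled.

```python
import numpy as np

def simulate_population(n, maf, beta_gt, beta_gd, beta_td, intercept, rng):
    """Simulate genotype G, trait T and disease D for a source population.
    D follows a logistic model in G and T (an assumed model for this sketch)."""
    g = rng.binomial(2, maf, size=n).astype(float)
    t = beta_gt * g + rng.normal(size=n)
    risk = 1.0 / (1.0 + np.exp(-(intercept + beta_gd * g + beta_td * t)))
    d = rng.binomial(1, risk)
    return g, t, d

def case_control_sample(d, n_cases, n_controls, rng):
    """Return indices of randomly chosen cases and controls."""
    cases = rng.choice(np.flatnonzero(d == 1), size=n_cases, replace=False)
    controls = rng.choice(np.flatnonzero(d == 0), size=n_controls, replace=False)
    return np.concatenate([cases, controls])

def slope(g, t):
    """Least-squares slope of t on g (np.polyfit returns highest degree first)."""
    return np.polyfit(g, t, 1)[0]

rng = np.random.default_rng(2009)
# Null scenario: beta_gt = 0, but both G and T increase disease risk.
g, t, d = simulate_population(
    n=200_000, maf=0.3, beta_gt=0.0, beta_gd=0.7, beta_td=1.0,
    intercept=-3.0, rng=rng,
)
idx = case_control_sample(d, n_cases=5000, n_controls=5000, rng=rng)

pop_slope = slope(g, t)             # whole source population
cc_slope = slope(g[idx], t[idx])    # case-control sample, ignoring D
print(f"population slope:   {pop_slope:.3f}")
print(f"case-control slope: {cc_slope:.3f}")
```

Under these assumed parameters the population slope stays close to zero, while the slope in the case-control sample is clearly positive because cases are oversampled and carry both more risk alleles and higher trait values.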
Solution: Inverse-Probability-of-Sampling Weighting (IPW)
The authors propose using IPW regression to correct for the ascertainment bias. IPW uses weights in the regression calculation that are inversely proportional to the probability of an individual being sampled into the study.
- The weights are calculated based on the sampling fractions of cases and controls.
- Performance: IPW regression yielded unbiased estimates of the \(G-T\) association and maintained the proper Type I error rate in all scenarios considered, regardless of the association between \(G\), \(T\), and \(D\).
- Trade-off: IPW regression consistently demonstrated lower statistical power than the naïve analyses in situations where the naïve analyses were also unbiased.
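The weighting idea can be sketched as a weighted least-squares fit. This is a minimal illustration under stated assumptions rather than the authors' implementation: it assumes the disease prevalence in the source population is known, sets each weight to the population fraction of the subject's disease group divided by its sample fraction, and uses a sandwich (robust) variance estimator, which is a common choice for weighted analyses but is not specified in this overview. The function names and simulation parameters are hypothetical.

```python
import numpy as np

def sampling_weights(d, prevalence):
    """Weights proportional to 1 / P(sampled | D), given a known prevalence."""
    d = np.asarray(d)
    case_frac = d.mean()                          # fraction of cases in the sample
    return np.where(d == 1,
                    prevalence / case_frac,
                    (1.0 - prevalence) / (1.0 - case_frac))

def ipw_slope(g, t, w):
    """Weighted least-squares slope of t on g with a sandwich standard error."""
    g, t, w = (np.asarray(a, dtype=float) for a in (g, t, w))
    X = np.column_stack([np.ones(len(g)), g])
    XtWX = X.T @ (w[:, None] * X)
    beta = np.linalg.solve(XtWX, X.T @ (w * t))
    resid = t - X @ beta
    meat = X.T @ ((w * resid)[:, None] ** 2 * X)  # sum of w_i^2 e_i^2 x_i x_i'
    bread = np.linalg.inv(XtWX)
    cov = bread @ meat @ bread
    return beta[1], np.sqrt(cov[1, 1])

# Synthetic null scenario (illustrative values): no G-T effect,
# but both G and T raise disease risk.
rng = np.random.default_rng(7)
n = 200_000
g = rng.binomial(2, 0.3, size=n).astype(float)
t = rng.normal(size=n)
d = rng.binomial(1, 1.0 / (1.0 + np.exp(-(-3.0 + 0.7 * g + 1.0 * t))))
prevalence = d.mean()                             # treated as known

# Sample 5000 cases and 5000 controls.
idx = np.concatenate([
    rng.choice(np.flatnonzero(d == k), 5000, replace=False) for k in (1, 0)
])
gs, ts, ds = g[idx], t[idx], d[idx]

naive = np.polyfit(gs, ts, 1)[0]                  # ignores ascertainment
w = sampling_weights(ds, prevalence)
ipw, ipw_se = ipw_slope(gs, ts, w)
print(f"naive slope: {naive:.3f}")
print(f"IPW slope:   {ipw:.3f} (SE {ipw_se:.3f})")
```

With equal numbers of cases and controls and a disease that is uncommon in the population, controls receive much larger weights than cases. This shrinks the effective sample size, which is one way to see why IPW regression gives up power relative to the naïve analyses.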
Practical Recommendations
The study concludes with practical recommendations for GWAS analysis of secondary traits:
- General Markers: For the vast majority of markers tested in a GWAS (which are not associated with the primary disease \(D\)), the naïve analyses are valid tests of association and provide nearly unbiased estimates of the \(G-T\) association.
- Disease-Associated Markers: Care must be taken when both the marker (\(G\)) and the secondary trait (\(T\)) are associated with the primary disease (\(D\)). In this scenario, the naïve estimates will be biased, and IPW regression is the statistically valid method to obtain an unbiased estimate of the \(G-T\) effect.
- Illustration: The authors illustrate the potential for bias using an analysis of the relationship between a marker in the FGFR2 gene (a known breast cancer risk locus) and mammographic density in a breast cancer case-control sample.