Extensive co-regulation of neighboring genes complicates the use of eQTLs in target gene prioritization
- Research Problem: Systematic benchmarking showed that using eQTL colocalization methods to prioritize causal genes for GWAS hits is often complicated by extensive co-regulation of neighboring genes and is less effective than simpler heuristics.
- Key Result: The simple strategy of assigning fine-mapped pQTLs to the closest protein coding gene significantly outperformed all tested Bayesian colocalization methods, achieving 76.9% recall and 71.9% precision.
- Conclusion: Linking GWAS variants to target genes remains challenging using eQTL evidence alone, and robust gene prioritization requires the triangulation of evidence from multiple functional sources to improve confidence.
PubMed: 39210598 DOI: 10.1016/j.xhgg.2024.100348 Overview generated by: Gemini 2.5 Flash, 28/11/2025
Key Findings: Limitations of eQTL Colocalization in Causal Gene Prioritization
The study addresses the fundamental problem in human genetics of linking non-coding regulatory variants from genome-wide association studies (GWASs) to their causal target genes. It establishes that the practice of using gene expression quantitative trait loci (eQTLs) for colocalization analysis to prioritize targets is complicated by the extensive co-regulation of neighboring genes. The authors found that existing colocalization methods are generally outperformed by a much simpler heuristic: assigning the fine-mapped variant to the closest protein-coding gene.
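The closest-gene heuristic mentioned above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation. The `Gene` record, the `protein_coding` biotype label, and measuring distance to the gene body (rather than, say, the transcription start site) are all assumptions made for the example, and all genes are assumed to lie on the same chromosome as the variant.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Gene:
    """Hypothetical gene record; coordinates are on one chromosome."""
    name: str
    start: int
    end: int
    biotype: str


def distance_to_gene(pos: int, gene: Gene) -> int:
    """Distance from a variant to the gene body (0 if the variant is inside it)."""
    if gene.start <= pos <= gene.end:
        return 0
    return min(abs(pos - gene.start), abs(pos - gene.end))


def closest_protein_coding_gene(pos: int, genes: Sequence[Gene]) -> Optional[str]:
    """Return the name of the nearest protein-coding gene, or None if there is none."""
    candidates = [g for g in genes if g.biotype == "protein_coding"]
    if not candidates:
        return None
    return min(candidates, key=lambda g: distance_to_gene(pos, g)).name
```

Non-coding genes are filtered out before the distance comparison, so a variant sitting inside a lncRNA is still assigned to the nearest protein-coding neighbor.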
Study Design and Benchmarking Methods
Ground Truth Dataset
The researchers created a large ground truth dataset by re-analyzing fine-mapped plasma protein QTL (pQTL) data from 3,301 individuals in the INTERVAL cohort. Focusing on variants located within or close to the gene encoding the affected protein (cis-pQTLs), they assumed that the gene coding for each protein was the most likely causal gene. This assumption provided a ground truth set covering 793 proteins for systematic benchmarking.
Methods Evaluated
The study systematically compared three Bayesian colocalization methods and five Mendelian Randomization (MR) approaches:
- Colocalization Methods:
  - coloc.susie (supports multiple causal variants)
  - coloc.abf (assumes a single causal variant)
  - CLPP (colocalization posterior probability defined at the variant level)
- Mendelian Randomization (MR) Methods: Five varieties were tested, including standard inverse-variance weighted MR (IVW-MR), MR-RAPS, and MRLocus.
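To make the variant-level idea behind CLPP concrete, here is a minimal sketch. It follows one common formulation, in which the posterior inclusion probabilities (PIPs) of the two traits are multiplied per variant and summed over the variants present in both fine-mapping results. Whether the study computes CLPP in exactly this way is an assumption, and the function name and dictionary inputs are hypothetical.

```python
from typing import Mapping


def clpp(pip_trait1: Mapping[str, float], pip_trait2: Mapping[str, float]) -> float:
    """Sum of per-variant PIP products over variants fine-mapped in both traits.

    Assumption: each mapping holds variant ID -> posterior inclusion probability
    for one fine-mapped signal (for example, one credible set).
    """
    shared = pip_trait1.keys() & pip_trait2.keys()
    return sum(pip_trait1[v] * pip_trait2[v] for v in shared)


# Toy example: only rs1 is shared, so the result is 0.8 * 0.5.
pqtl_pips = {"rs1": 0.8, "rs2": 0.2}
eqtl_pips = {"rs1": 0.5, "rs3": 0.5}
score = clpp(pqtl_pips, eqtl_pips)
```

Because the score depends on the same variant carrying high probability in both traits, it tends to be conservative when linkage disequilibrium spreads the probability across several variants.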
Results
Performance of Prioritization Strategies
- Closest Gene Heuristic: Assigning pQTLs to their closest protein coding gene achieved the highest performance overall: 76.9% recall and 71.9% precision.
- Colocalization Methods: When comparing performance using only the strongest signal, coloc.susie had the highest recall (46.3%) but lowest precision (45.1%). Conversely, CLPP was the most precise (68.5%) but yielded the correct gene for less than a fifth of the proteins (17.5% recall).
- Colocalization + MR: Combining colocalization evidence with MR, restricting analysis to cases with two or more independent colocalizing signals, increased precision substantially to 81%. However, this led to a massive reduction in recall to just 7.1%, primarily due to the limited power of eQTL datasets to detect secondary eQTL signals.
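The precision and recall figures above can be read through a small sketch. It assumes that recall is the fraction of benchmark proteins for which the true gene was among those prioritized, and that precision is the fraction of all prioritized genes that were correct. These definitions and the input layout are assumptions for illustration and may differ in detail from the study's.

```python
from typing import Mapping, Set, Tuple


def precision_recall(
    predictions: Mapping[str, Set[str]],
    truth: Mapping[str, str],
) -> Tuple[float, float]:
    """Score gene prioritization against a one-true-gene-per-protein benchmark.

    predictions: protein -> set of prioritized genes (absent or empty = no call).
    truth: protein -> the gene assumed to be causal.
    Assumption: every protein in `predictions` also appears in `truth`.
    """
    n_predicted = sum(len(genes) for genes in predictions.values())
    n_correct = sum(
        1 for protein, gene in truth.items()
        if gene in predictions.get(protein, set())
    )
    precision = n_correct / n_predicted if n_predicted else float("nan")
    recall = n_correct / len(truth) if truth else float("nan")
    return precision, recall
```

Under these definitions, a method that makes few but accurate calls scores high precision and low recall, which is the pattern reported for CLPP and for the combined colocalization and MR strategy.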
MR Assumption Violations
Standard inverse-variance-weighted MR (IVW-MR) produced many false positives. The authors found that cis-eQTLs frequently violated MR assumptions, so inference methods that are robust to these violations are essential to avoid inaccurate results.
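For reference, the textbook fixed-effect IVW estimator can be sketched as follows. This is the standard formula with first-order weights, not code from the study; it assumes the instruments are independent and treats the exposure effects as measured without error. The function and argument names are illustrative.

```python
import math
from typing import Sequence, Tuple


def ivw_mr(
    beta_exposure: Sequence[float],
    beta_outcome: Sequence[float],
    se_outcome: Sequence[float],
) -> Tuple[float, float]:
    """Fixed-effect inverse-variance-weighted MR estimate and its standard error.

    Assumptions: independent instruments, first-order weights 1 / se_outcome**2.
    """
    weights = [1.0 / se ** 2 for se in se_outcome]
    numerator = sum(w * bx * by for w, bx, by in zip(weights, beta_exposure, beta_outcome))
    denominator = sum(w * bx * bx for w, bx in zip(weights, beta_exposure))
    estimate = numerator / denominator
    standard_error = math.sqrt(1.0 / denominator)
    return estimate, standard_error


# Toy example: every instrument implies the same ratio beta_outcome / beta_exposure = 0.5.
est, se = ivw_mr([0.1, 0.2, 0.3], [0.05, 0.10, 0.15], [0.01, 0.01, 0.01])
```

The estimator pools all instruments into a single slope, so any variant that affects the outcome through a path other than the exposure biases the estimate. This is one way assumption violations of the kind described above can produce false positives.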
Conclusions and Recommendations
The study demonstrates that colocalization methods using eQTL data alone are insufficient for accurately prioritizing causal genes underlying GWAS variants, and less effective than expected. Because neighboring genes are extensively co-regulated, eQTL evidence alone often cannot single out one target gene at a locus. Ultimately, prioritizing novel targets in human genetics requires triangulation of evidence from multiple sources (e.g., integrating pQTLs, eQTLs, and other functional data) rather than reliance on a single data modality.