Large-scale composite hypothesis testing procedure for omics data analyses
- Novel method (qch_copula) for testing composite hypotheses across multiple traits/omics levels using mixture models with copula functions
- Controls Type I error effectively while achieving superior detection power compared to 8 state-of-the-art methods across diverse correlation scenarios
- Memory-efficient implementation enables analysis of 10^5 to 10^6 markers with up to 20 traits, validated through psychiatric disorder pleiotropy and plant virus resistance studies
PubMed: 40918066
DOI: 10.1093/nargab/lqaf118
Overview generated by: Claude Sonnet 4.5, 25/11/2025
Key Findings
This study introduces qch_copula, a novel composite hypothesis testing (CHT) method for analyzing multiple traits or omics levels simultaneously, addressing key limitations in existing approaches for large-scale genomic studies.
Main Discoveries
Novel P-value derivation: First method to provide rigorously defined P-values directly from mixture model approaches for composite hypothesis testing
Superior performance: qch_copula effectively controls Type I error rates while maintaining higher detection power compared to eight state-of-the-art methods (DACT, HDMT, PLACO, adaFilter, IMIX, c-csmGmm, Primo, qch)
Scalability breakthrough: Memory-efficient EM algorithm reduces storage from O(n × 2^Q) to O(n + 2^Q), enabling analysis of up to 20 traits and 10^5 to 10^6 markers
Dependency modeling: Explicitly accounts for correlations between traits/omics levels through copula functions, improving false positive control
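To make the storage claim concrete, the following back-of-the-envelope calculation compares the two scalings at the upper end of the supported range. It assumes 8-byte floating-point entries, which is an assumption of this sketch rather than a detail stated in the source, and the function names are illustrative only.

```python
def full_posterior_bytes(n, Q, bytes_per_value=8):
    """Storage for an n x 2^Q posterior matrix (one entry per marker and configuration)."""
    return n * (2 ** Q) * bytes_per_value


def reduced_bytes(n, Q, bytes_per_value=8):
    """Storage that grows as n + 2^Q instead of n * 2^Q."""
    return (n + 2 ** Q) * bytes_per_value


# Upper end of the range quoted above: 10^6 markers, 20 traits.
n, Q = 10**6, 20
full = full_posterior_bytes(n, Q)
reduced = reduced_bytes(n, Q)
print(f"full matrix: {full / 1e12:.1f} TB")   # 8.4 TB
print(f"reduced:     {reduced / 1e6:.1f} MB")  # 16.4 MB
```

Under the same 8-byte assumption, n = 10^5 and Q = 15 give about 26 GB for the full matrix and about 1 MB for the reduced form.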
Study Design
Methodological Framework
The method addresses testing composite null hypotheses of the form H₀: “a marker/gene has effects in at most q̃ - 1 conditions” versus H₁: “effects in at least q̃ conditions.”
Key components:
- Mixture model with 2^Q components (one per configuration)
- Gaussian copula to capture dependencies between conditions
- Nonparametric estimation of alternative distributions
- Two-step inference: marginal distributions, then proportions and copula parameters
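The link between the composite hypotheses and the 2^Q mixture components can be illustrated by enumerating the configurations directly. This is a minimal sketch: the function name `split_configurations` is hypothetical and the enumeration is only practical for small Q.

```python
from itertools import product


def split_configurations(Q, q_tilde):
    """Split all 2^Q binary configurations into H0 and H1 sets.

    A configuration c has c[q] = 1 when the marker has an effect in condition q.
    H0: effects in at most q_tilde - 1 conditions; H1: effects in at least q_tilde.
    """
    h0, h1 = [], []
    for c in product((0, 1), repeat=Q):
        (h1 if sum(c) >= q_tilde else h0).append(c)
    return h0, h1


h0, h1 = split_configurations(Q=8, q_tilde=2)
print(len(h0), len(h1))  # 9 247
```

For Q = 8 and q̃ = 2, the null hypothesis covers the all-zero configuration plus the eight single-effect configurations, and the remaining 247 configurations fall under H₁.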
Model Specification
For Q conditions, z-scores (negative probit transforms of P-values) follow:
\[Z_i \sim \sum_{c \in \mathcal{C}} w_c \psi_c\]
where each component ψ_c combines:
- Univariate marginal distributions F^q_0 (null) and F^q_1 (alternative)
- Copula function C_θ describing dependencies
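A small sketch may help show how these pieces fit together for Q = 2: the negative probit transform, the bivariate Gaussian copula density, and a component density built as copula density times marginal densities. The Gaussian N(2, 1) alternative is a placeholder assumption for illustration (the method itself estimates alternatives nonparametrically), and all function names are hypothetical.

```python
import math
from statistics import NormalDist

STD = NormalDist()  # standard normal, used as the null marginal


def p_to_z(p):
    """Negative probit transform: small P-values map to large positive z-scores."""
    return -STD.inv_cdf(p)


def gaussian_copula_density(u, v, rho):
    """Density of the bivariate Gaussian copula at (u, v) with correlation rho."""
    x, y = STD.inv_cdf(u), STD.inv_cdf(v)
    det = 1.0 - rho ** 2
    return math.exp(-(rho ** 2 * (x * x + y * y) - 2 * rho * x * y) / (2 * det)) / math.sqrt(det)


def component_density(z, config, rho, alt=NormalDist(mu=2.0, sigma=1.0)):
    """psi_c(z) for Q = 2: copula density times the product of marginal densities.

    ASSUMPTION: the alternative marginal is a Gaussian placeholder, not the
    nonparametric estimate described in the text.
    """
    margins = [alt if c else STD for c in config]
    u, v = (m.cdf(x) for m, x in zip(margins, z))
    marginal_product = margins[0].pdf(z[0]) * margins[1].pdf(z[1])
    return gaussian_copula_density(u, v, rho) * marginal_product


z = (p_to_z(0.01), p_to_z(0.20))
print(component_density(z, config=(1, 0), rho=0.3))
```

With rho = 0 the copula density equals 1 and the component reduces to the product of its marginals. The sketch does not guard against marginal CDF values of exactly 0 or 1, where the inverse CDF is undefined.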
Posterior-based P-value:
\[\text{pval}(z_i) = \frac{1}{n\hat{W}_0} \sum_{j=1}^n \mathbb{1}_{\{\hat{\tau}_j > \hat{\tau}_i\}} (1 - \hat{\tau}_j)\]
where τ̂_i is the estimated posterior probability that item i belongs to an alternative configuration, and Ŵ_0 is the estimated proportion of items under H₀.
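The formula can be evaluated for all items in one pass after sorting the posteriors. The sketch below is one way to do this, not the package's implementation. When no null proportion is supplied it estimates it as the mean of 1 - τ̂, which is an assumption of this example, and the function name is hypothetical.

```python
def posterior_pvalues(tau, w0=None):
    """P-value of each item from posterior H1 probabilities tau.

    pval_i = (1 / (n * w0)) * sum over j with tau_j > tau_i of (1 - tau_j).
    ASSUMPTION: if w0 is not given, it is estimated as mean(1 - tau).
    """
    n = len(tau)
    if w0 is None:
        w0 = sum(1.0 - t for t in tau) / n
    order = sorted(range(n), key=lambda j: tau[j], reverse=True)
    pvals = [0.0] * n
    cum = 0.0  # sum of (1 - tau_j) over items with strictly larger tau
    k = 0
    while k < n:
        m = k
        while m < n and tau[order[m]] == tau[order[k]]:
            m += 1  # items tied at the same tau share one P-value
        for idx in order[k:m]:
            pvals[idx] = cum / (n * w0)
        cum += sum(1.0 - tau[order[j]] for j in range(k, m))
        k = m
    return pvals


pvals = posterior_pvalues([0.9, 0.5, 0.1])
print(pvals)
```

Items with larger posteriors receive smaller P-values. Because the inequality is strict, the item with the largest posterior gets a P-value of 0 under the formula as written.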
Major Results
Simulation Study Design
Evaluated across 24 settings varying:
- Number of conditions: Q = 2, 8, 16
- Correlation levels: ρ = 0, 0.3, 0.5, 0.7
- Scenarios: sparse (≥90% null) vs. dense (50% alternative)
- Sample size: n = 10^5 markers
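The count of 24 is consistent with fully crossing the three varying factors (3 × 4 × 2). The snippet below enumerates that grid. The full crossing is an assumption of this sketch, and the variable names are illustrative.

```python
from itertools import product

conditions = (2, 8, 16)               # Q
correlations = (0.0, 0.3, 0.5, 0.7)   # rho
scenarios = ("sparse", "dense")

# ASSUMPTION: every combination of the three factors is one simulation setting.
settings = list(product(conditions, correlations, scenarios))
print(len(settings))  # 24
```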
Type I Error Control (Q = 2)
Independent traits (ρ = 0):
- Most methods controlled FDR at the nominal 5% level
- DACT_JC showed substantial inflation (~0.25)
- Minor deviations for PLACO and c-csmGmm (~0.08)
Correlated traits (ρ = 0.3):
- Only DACT_Efron and qch_copula maintained proper FDR control
- qch showed severe inflation (FDR = 0.256 sparse, 0.145 dense)
- HDMT, PLACO, IMIX, c-csmGmm all exceeded nominal level
Higher correlations (ρ = 0.5, 0.7):
- qch_copula consistently controlled FDR near 0.05
- Other methods (except DACT_Efron) showed increased inflation
Type I Error Control (Q = 8, 16)
Q = 8 with ρ = 0.3:
- All three scalable methods (Primo, adaFilter, qch_copula) controlled FDR
- Primo and adaFilter were conservative (FDR < 0.02 in most cases)
- qch_copula maintained FDR close to nominal level
Q = 16 with ρ = 0.3:
- qch_copula: slight inflation in sparse scenario (FDR ≤ 0.08)
- adaFilter: comparable performance
- Primo: computational failure (>24h runtime or memory exhaustion)
Spatial dependence (ξ = 0.3):
- Improved FDR control for qch_copula and Primo
- No impact on adaFilter
- qch_copula reduced FDR from ~0.11 to <0.065 in challenging scenarios
Detection Power
Q = 2:
- DACT_Efron: zero power (too conservative)
- qch_copula: 0.03-0.124 depending on scenario
Q = 8, ρ = 0.3:
| Testing hypothesis | Primo Power | adaFilter Power | qch_copula Power |
|---|---|---|---|
| ≥2 traits (dense) | 0.143 | 0.317 | 0.609 |
| ≥4 traits (dense) | 0.126 | 0.148 | 0.534 |
| ≥8 traits (dense) | 0.125 | 0.033 | 0.186 |
Q = 16, ρ = 0.3:
| Testing hypothesis | adaFilter Power | qch_copula Power |
|---|---|---|
| ≥2 traits (dense) | 0.278 | 0.646 |
| ≥4 traits (dense) | 0.166 | 0.678 |
| ≥8 traits (sparse) | 0.097 | 0.568 |
At Q = 16, qch_copula reached roughly 6× the power of adaFilter (0.568 vs. 0.097) for detecting associations with ≥8 traits in the sparse scenario.
Computational Efficiency
Memory-Efficient EM Algorithm
Classical approach: stores the full posterior matrix T = (τ_ic), requiring 26 GB for n = 10^5, Q = 15
qch_copula approach:
- Computes posteriors on-the-fly during M-step
- Stores only summary statistics S^(t)_i
- Reduces memory from 26 GB to 1 MB (same example)
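The general idea of computing posteriors on the fly can be sketched with a toy EM update for the mixture proportions: each item's posterior row is computed, added to a running sum of size 2^Q, and then discarded, so the n × 2^Q matrix is never held in memory. This is a simplified illustration and not the package's algorithm. It assumes fixed Gaussian margins with no copula, and it does not reproduce the per-item summary statistics S^(t)_i mentioned above. All names are hypothetical.

```python
import math
import random
from itertools import product


def norm_pdf(x, mu=0.0):
    """Density of N(mu, 1) at x."""
    return math.exp(-0.5 * (x - mu) ** 2) / math.sqrt(2.0 * math.pi)


def em_weights_streaming(z, n_iter=30, alt_mean=2.0):
    """Estimate mixture proportions w_c without storing the n x 2^Q posterior matrix.

    ASSUMPTIONS (illustration only): component densities are fixed products of
    N(0, 1) (null) and N(alt_mean, 1) (alternative) margins, with no copula.
    """
    Q = len(z[0])
    configs = list(product((0, 1), repeat=Q))
    w = [1.0 / len(configs)] * len(configs)
    for _ in range(n_iter):
        acc = [0.0] * len(configs)  # running sum of posteriors, size 2^Q
        for zi in z:
            f0 = [norm_pdf(x) for x in zi]
            f1 = [norm_pdf(x, alt_mean) for x in zi]
            joint = [wc * math.prod(f1[q] if c[q] else f0[q] for q in range(Q))
                     for wc, c in zip(w, configs)]
            total = sum(joint)
            for k, j in enumerate(joint):  # posterior row is used, then discarded
                acc[k] += j / total
        w = [a / len(z) for a in acc]
    return dict(zip(configs, w))


# Toy data: 900 items with no effect, 100 items with an effect in both conditions.
random.seed(0)
no_effect = [(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(900)]
both = [(random.gauss(2, 1), random.gauss(2, 1)) for _ in range(100)]
weights = em_weights_streaming(no_effect + both)
print(weights)
```

The trade-off is recomputation: posteriors are evaluated again at every iteration instead of being cached, in exchange for memory that no longer scales with n × 2^Q.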
Runtime (Q = 16, n = 10^5):
- Model fitting: 78 minutes
- Per hypothesis test: ~1 minute
- Platform: Single thread, 3.2 GB RAM
Application I: Psychiatric Disorders
Dataset
- 14 psychiatric disorders from Psychiatric Genomics Consortium
- 5,172,884 common SNPs
- Objective: Identify pleiotropic regions associated with ≥8 disorders
Comparison with PLACO
Original analysis (PLACO):
- Aggregated SNPs to 26,024 genes using MAGMA
- Performed 91 pairwise analyses (all combinations of 2 disorders)
- Identified 38 candidate genes
qch_copula analysis:
- Direct SNP-level analysis (no aggregation)
- Single joint test across all 14 disorders
- Identified 1,608 SNPs in 28 distinct regions
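The figure of 91 pairwise analyses follows from counting unordered pairs among 14 disorders. A one-line check (the function name is illustrative):

```python
from math import comb


def n_pairwise_tests(Q):
    """Number of distinct unordered pairs among Q traits."""
    return comb(Q, 2)


print(n_pairwise_tests(14))  # 91
```

The pair count grows quadratically with the number of traits (190 pairs for 20 traits), whereas the joint approach runs a single test per composite hypothesis.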
Novel Findings
35 of the 38 PLACO genes were confirmed, and 8 new regions were identified. Three of the new regions are listed here:
| Region | Chr | Position (Mb) | # SNPs | Top SNP P-value |
|---|---|---|---|---|
| Novel 1 | 5 | 103.6-104.0 | 338 | 5.16×10^-12 |
| Novel 2 | 1 | 73.8-73.9 | 156 | 1.01×10^-7 |
| Novel 3 | 3 | 52.6-53.1 | 90 | 7.27×10^-8 |
Chromosome 5 region (top finding):
- 338 SNPs detected
- 25 SNPs associated with 11 disorders
- Overlaps with RP11-6N13.1 gene
- Previously reported for ADHD, ASD, BIP, MDD, SCZ, TS
Three PLACO-only genes (NEGR1, TMX2, C11orf31):
- Had only 3-7 P-values <0.01 out of 14 (insufficient for ≥8 disorders)
- Likely false positives from pairwise approach
Application II: Cucumber Virus Resistance
Dataset
- 289 cucumber lines (elite, landraces, hybrids)
- 6 viruses: CGMMV, CMV, CVYV, PRSV, WMV, ZYMV
- 339,804 common SNPs (after QC)
- Objective: Detect QTLs for multi-virus resistance
Results by Pleiotropy Level
| Number of viruses | # SNPs detected | # Regions |
|---|---|---|
| ≥2 | 1,845 | 5 |
| ≥3 | 164 | 1 |
| ≥4 | 15 | 1 |
Identified Hotspot Regions
Five regions associated with ≥2 viruses:
| Region | Viruses | Original study | External validation |
|---|---|---|---|
| Chr 5: 6.3-8.8 Mb | WMV, CGMMV, CVYV, CMV | Reported | - |
| Chr 6: 6.8-14.7 Mb | PRSV, ZYMV | Reported | - |
| Chr 1: 9.1-10.1 Mb | - | Novel | - |
| Chr 2: 1.3 Mb | PRSV, ZYMV | Novel | CMV, CABYV QTLs |
| Chr 6: 22.8-26.4 Mb | PRSV, ZYMV | Novel | WMV, CABYV QTLs |
Three novel regions not reported in original study:
- Two validated by independent studies showing shared resistance mechanisms
- Demonstrates enhanced power of joint analysis over individual GWAS
Methodological Insights
Advantages over Existing Methods
Mixture model approaches (IMIX, c-csmGmm, Primo):
- Fully parametric: constrain alternative distributions to Gaussian
- qch_copula: nonparametric estimation with copula dependencies
Pairwise methods (PLACO, HDMT):
- Limited to Q = 2
- Multiple pairwise tests lack statistical guarantees for joint inference
- Can produce inconsistent results
Filtering methods (adaFilter):
- More conservative
- Lower power for stringent hypotheses
P-values vs. Posteriors
qch_copula establishes theoretical equivalence between:
- Adaptive Benjamini-Hochberg FDR control on derived P-values
- Local FDR control on posteriors
Advantages of P-values:
- Compatible with any multiple testing procedure
- Enable diagnostic tools (QQ-plots, histograms)
- Allow Volcano/Manhattan plots
- Facilitate method comparison
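As an illustration of the first side of that equivalence, here is a minimal sketch of the Benjamini-Hochberg step-up procedure with an optional null-proportion factor, which gives the adaptive variant. The argument `pi0` stands for an externally supplied estimate of the null proportion. How that estimate is obtained is outside this sketch, and the function name is hypothetical.

```python
def bh_reject(pvals, alpha=0.05, pi0=1.0):
    """Benjamini-Hochberg step-up procedure; pi0 < 1 gives the adaptive variant.

    Returns the sorted indices of the rejected hypotheses.
    """
    n = len(pvals)
    order = sorted(range(n), key=lambda i: pvals[i])
    k_max = 0
    for rank, i in enumerate(order, start=1):
        # Largest rank whose ordered P-value falls under its threshold.
        if pvals[i] <= alpha * rank / (n * pi0):
            k_max = rank
    return sorted(order[:k_max])


pvals = [0.001, 0.008, 0.039, 0.041, 0.09, 0.2]
print(bh_reject(pvals))            # [0, 1]
print(bh_reject(pvals, pi0=0.5))   # [0, 1, 2, 3]
```

Supplying a null proportion below 1 relaxes every threshold by the same factor, which is why the adaptive variant rejects more hypotheses in this toy example.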
Copula Modeling Strategy
Single correlation matrix across all components:
- Balances model flexibility and computational efficiency
- Avoids poor estimation from under-represented components
- Captures essential dependency structure
Alternative approaches:
- Component-specific matrices (IMIX): computationally prohibitive
- No dependencies (qch): severe Type I error inflation
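A rough parameter count suggests why a shared matrix is attractive. The sketch below compares one unstructured Q × Q correlation matrix with a hypothetical scheme in which each of the 2^Q components carries its own unstructured matrix. That second scheme is an assumption made for illustration and not a description of any specific package's parameterization.

```python
def n_correlations(Q):
    """Free off-diagonal entries of a Q x Q correlation matrix."""
    return Q * (Q - 1) // 2


def copula_parameter_counts(Q):
    """(shared, per-component) correlation parameter counts.

    ASSUMPTION: in the per-component case every one of the 2^Q components
    has its own unstructured correlation matrix.
    """
    shared = n_correlations(Q)
    per_component = (2 ** Q) * n_correlations(Q)
    return shared, per_component


print(copula_parameter_counts(16))  # (120, 7864320)
```

With n = 10^5 markers, several million correlation parameters would exceed the number of observations, and many components would contain too few items to estimate their own matrix.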
Robustness to Dependencies
Within-series correlation (ξ = 0.3):
- Minimal impact on most methods
- Actually improved FDR control for qch_copula
- Particularly beneficial in challenging scenarios (Q = 16)
Further options:
- Combine with dependency-aware multiple testing procedures
- Apply local score techniques for spatial clustering
Practical Recommendations
Choosing q̃ (Minimum Effect Threshold)
Multiple strategies:
1. Hypothesis-driven: Based on research question (e.g., pleiotropy → q̃ = 2)
2. Prior evidence: From previous analyses (e.g., q̃ = 8 from literature)
3. Cost-benefit: Economic considerations (breeding value of multi-trait resistance)
4. Exploratory: Test multiple q̃ values for ranking
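For the exploratory strategy, once per-configuration posteriors are available for a marker, the posterior of H₁ for any q̃ is the sum over configurations with at least q̃ effects, so several thresholds can be examined from a single fit. The sketch below uses made-up posterior values for one marker with Q = 3. The function name and the data are hypothetical.

```python
def h1_posterior(config_posterior, q_tilde):
    """Posterior probability that at least q_tilde conditions carry an effect.

    config_posterior maps each binary configuration (tuple of 0/1) to its
    posterior probability for one marker.
    """
    return sum(p for c, p in config_posterior.items() if sum(c) >= q_tilde)


# Hypothetical posteriors for a single marker with Q = 3 (values are made up).
post = {
    (0, 0, 0): 0.05, (1, 0, 0): 0.10, (0, 1, 0): 0.05, (0, 0, 1): 0.00,
    (1, 1, 0): 0.30, (1, 0, 1): 0.10, (0, 1, 1): 0.00, (1, 1, 1): 0.40,
}
taus = {q: h1_posterior(post, q) for q in (1, 2, 3)}
print(taus)
```

The posterior can only decrease as q̃ grows, since each stricter hypothesis covers a subset of the configurations of the previous one.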
Method Selection
Use qch_copula when:
- Q > 2 traits/conditions
- Correlations between traits exist
- Need rigorously defined P-values
- Large-scale data (10^5 to 10^6 markers, Q ≤ 20)
- Testing various composite hypotheses
Consider alternatives when:
- Q = 2 and no dependencies: simpler methods may suffice
- Extremely high correlations (ρ > 0.7): expect minor FDR inflation
- Need for component-specific direction effects: extensions required
Implementation Details
Available in R package qch on CRAN
Key functions:
- Model fitting with copula dependencies
- P-value computation for any composite hypothesis
- Multiple hypothesis testing without re-estimation
Limitations and Extensions
Current Limitations
- Correlation levels: Slight FDR inflation at ρ ≥ 0.5 for large Q (16+)
- Independence assumption: Items assumed independent within series
- Direction agnostic: Does not account for effect signs
- Post-hoc inference: Testing multiple q̃ values raises multiple comparisons issues
Ongoing/Future Work
Effect direction:
- Extension accounting for effect signs available in qch package (independent case)
- Copula + direction effects: under development
Within-series dependencies:
- Compatible with dependency-aware multiple testing
- Integration with local score techniques
Model selection:
- Post-hoc inference procedures for q̃ selection