Large-scale composite hypothesis testing procedure for omics data analyses

GWAS
composite hypothesis testing
copula
mixture models
multiple testing
pleiotropy
  • Novel method (qch_copula) for testing composite hypotheses across multiple traits/omics levels using mixture models with copula functions
  • Controls Type I error effectively while achieving superior detection power compared to 8 state-of-the-art methods across diverse correlation scenarios
  • Memory-efficient implementation enables analysis of 10^5-10^6 markers with up to 20 traits, validated through psychiatric disorder pleiotropy and plant virus resistance studies
Published: 23 January 2026

PubMed: 40918066
DOI: 10.1093/nargab/lqaf118
Overview generated by: Claude Sonnet 4.5, 25/11/2025

Key Findings

This study introduces qch_copula, a novel composite hypothesis testing (CHT) method for analyzing multiple traits or omics levels simultaneously, addressing key limitations in existing approaches for large-scale genomic studies.

Main Discoveries

  1. Novel P-value derivation: First method to provide rigorously defined P-values directly from mixture model approaches for composite hypothesis testing

  2. Superior performance: qch_copula effectively controls Type I error rates while maintaining higher detection power compared to eight state-of-the-art methods (DACT, HDMT, PLACO, adaFilter, IMIX, c-csmGmm, Primo, qch)

  3. Scalability breakthrough: Memory-efficient EM algorithm reduces storage from O(n × 2^Q) to O(n + 2^Q), enabling analysis of up to 20 traits and 10^5-10^6 markers

  4. Dependency modeling: Explicitly accounts for correlations between traits/omics levels through copula functions, improving false positive control

Study Design

Methodological Framework

The method addresses testing composite null hypotheses of the form H₀: “a marker/gene has effects in at most q̃ - 1 conditions” versus H₁: “effects in at least q̃ conditions.”
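To make the composite hypotheses concrete, the sketch below enumerates all 2^Q effect configurations and splits them into the H₀ and H₁ sets for a given threshold. The function name `split_configurations` and the argument `q_min` (standing in for q̃) are illustrative assumptions, not names from the paper or the qch package.

```python
from itertools import product


def split_configurations(Q, q_min):
    """Split the 2^Q binary configurations into H0 and H1 sets.

    A configuration is a tuple of 0/1 flags (1 = effect in that condition).
    H1 holds configurations with at least q_min effects; H0 holds the rest.
    """
    null_configs, alt_configs = [], []
    for config in product((0, 1), repeat=Q):
        if sum(config) >= q_min:
            alt_configs.append(config)
        else:
            null_configs.append(config)
    return null_configs, alt_configs


# Example: Q = 3 conditions, testing "effects in at least 2 conditions"
null_configs, alt_configs = split_configurations(Q=3, q_min=2)
print(len(null_configs), len(alt_configs))
```

Explicit enumeration is only practical for small Q; it is shown here to illustrate how H₀ and H₁ are defined, not how a large-scale fit would be organized.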

Key components:

  • Mixture model with 2^Q components (one per configuration)
  • Gaussian copula to capture dependencies between conditions
  • Nonparametric estimation of alternative distributions
  • Two-step inference: marginal distributions first, then proportions and copula parameters

Model Specification

For Q conditions, z-scores (negative probit transforms of P-values) follow:
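As a small illustration of the negative probit transform, the sketch below maps a P-value p to z = -Φ⁻¹(p) using only the Python standard library. The helper name `pvalue_to_zscore` is an assumption made for this example.

```python
from statistics import NormalDist


def pvalue_to_zscore(p):
    """Negative probit transform: small P-values map to large positive z-scores.

    Requires 0 < p < 1; NormalDist.inv_cdf raises StatisticsError otherwise.
    """
    return -NormalDist().inv_cdf(p)


z_half = pvalue_to_zscore(0.5)     # a P-value of 0.5 maps to z = 0
z_small = pvalue_to_zscore(0.025)  # approximately 1.96
```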

\[Z_i \sim \sum_{c \in C} w_c \psi_c\]

where each component ψ_c combines:

  • Univariate marginal distributions F^q_0 (null) and F^q_1 (alternative)
  • A copula function C_θ describing dependencies
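One way to read this combination is through Sklar's theorem. Assuming the marginals admit densities, and writing c = (c_1, …, c_Q) with c_q ∈ {0, 1} for a configuration, f^q_{c_q} for the density of F^q_{c_q}, and γ_θ for the density of the copula C_θ (the symbols f and γ are notation introduced here, not taken from the paper), the component density would factor as:

\[\psi_c(z_1, \dots, z_Q) = \gamma_\theta\big(F^1_{c_1}(z_1), \dots, F^Q_{c_Q}(z_Q)\big) \prod_{q=1}^{Q} f^q_{c_q}(z_q)\]

With an independence copula, γ_θ ≡ 1 and ψ_c reduces to the product of the marginal densities.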

Posterior-based P-value:

\[\text{pval}(z_i) = \frac{1}{n\hat{W}_0} \sum_{j=1}^n \mathbb{1}_{\{\hat{\tau}_j > \hat{\tau}_i\}} (1 - \hat{\tau}_j)\]

where τ̂_i is the estimated posterior probability that item i belongs to an alternative (H₁) configuration, so that 1 - τ̂_j is the corresponding null posterior of item j.
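The formula can be evaluated for all items at once by sorting the posteriors. The sketch below is a hedged illustration, not the qch implementation: it assumes Ŵ_0 is the average null posterior, mean(1 - τ̂_j), which the summary does not state explicitly, and the function name `posterior_pvalues` is hypothetical.

```python
import numpy as np


def posterior_pvalues(tau):
    """Empirical P-values from posteriors tau_i = P(H1 | z_i).

    Computes pval_i = (1 / (n * W0)) * sum_j 1{tau_j > tau_i} * (1 - tau_j).
    ASSUMPTION: W0 is estimated as the mean of (1 - tau_j).
    """
    tau = np.asarray(tau, dtype=float)
    n = tau.size
    null_post = 1.0 - tau
    w0_hat = null_post.mean()  # assumed estimator of the null proportion

    order = np.argsort(-tau, kind="stable")  # items by decreasing tau
    sorted_tau = tau[order]
    cum_null = np.cumsum(null_post[order])   # running sum of null posteriors

    # Number of items whose tau is strictly larger than tau_i
    n_greater = np.searchsorted(-sorted_tau, -tau, side="left")
    sums = np.where(n_greater > 0, cum_null[n_greater - 1], 0.0)
    return sums / (n * w0_hat)


pvals = posterior_pvalues([0.9, 0.5, 0.1])
```

Because the indicator uses a strict inequality, the item with the largest posterior receives a P-value of 0 under this reading of the formula.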

Major Results

Simulation Study Design

Evaluated across 24 settings (3 × 4 × 2) varying:

  • Number of conditions: Q = 2, 8, 16
  • Correlation levels: ρ = 0, 0.3, 0.5, 0.7
  • Scenarios: sparse (≥90% null) vs. dense (50% alternative)

Sample size: n = 10^5 markers.

Type I Error Control (Q = 2)

Independent traits (ρ = 0):

  • Most methods controlled FDR at the nominal 5% level
  • DACT-JC showed substantial inflation (~0.25)
  • PLACO and c-csmGmm showed minor deviations (~0.08)

Correlated traits (ρ = 0.3):

  • Only DACT_Efron and qch_copula maintained proper FDR control
  • qch showed severe inflation (FDR = 0.256 sparse, 0.145 dense)
  • HDMT, PLACO, IMIX, and c-csmGmm all exceeded the nominal level

Higher correlations (ρ = 0.5, 0.7):

  • qch_copula consistently controlled FDR near 0.05
  • Other methods (except DACT_Efron) showed increased inflation

Type I Error Control (Q = 8, 16)

Q = 8 with ρ = 0.3:

  • All three scalable methods (Primo, adaFilter, qch_copula) controlled FDR
  • Primo and adaFilter were conservative (FDR < 0.02 in most cases)
  • qch_copula maintained FDR close to the nominal level

Q = 16 with ρ = 0.3:

  • qch_copula: slight inflation in the sparse scenario (FDR ≤ 0.08)
  • adaFilter: comparable performance
  • Primo: computational failure (>24h runtime or memory exhaustion)

Spatial dependence (ξ = 0.3):

  • Improved FDR control for qch_copula and Primo
  • No impact on adaFilter
  • qch_copula FDR dropped from ~0.11 to <0.065 in challenging scenarios

Detection Power

Q = 2:

  • DACT_Efron: zero power (too conservative)
  • qch_copula: 0.03-0.124 depending on scenario

Q = 8, ρ = 0.3:

| Testing hypothesis | Primo power | adaFilter power | qch_copula power |
|---|---|---|---|
| ≥2 traits (dense) | 0.143 | 0.317 | 0.609 |
| ≥4 traits (dense) | 0.126 | 0.148 | 0.534 |
| ≥8 traits (dense) | 0.125 | 0.033 | 0.186 |

Q = 16, ρ = 0.3:

| Testing hypothesis | adaFilter power | qch_copula power |
|---|---|---|
| ≥2 traits (dense) | 0.278 | 0.646 |
| ≥4 traits (dense) | 0.166 | 0.678 |
| ≥8 traits (sparse) | 0.097 | 0.568 |

For detecting associations with ≥8 traits in the sparse scenario at Q = 16, qch_copula reached roughly 6× the power of adaFilter (0.568 vs. 0.097).

Computational Efficiency

Memory-Efficient EM Algorithm

Classical approach: Stores full posterior matrix T = (τ_ic) requiring 26 GB for n=10^5, Q=15
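The 26 GB figure is consistent with simple arithmetic on the O(n × 2^Q) and O(n + 2^Q) storage costs. The check below assumes 8-byte (64-bit) floating-point entries, which the summary does not state.

```python
n, Q = 10**5, 15
BYTES_PER_FLOAT = 8  # assumption: 64-bit floats

# Full posterior matrix: one entry per (item, configuration) pair
full_matrix_bytes = n * 2**Q * BYTES_PER_FLOAT

# Reduced storage: one value per item plus one per configuration
summary_bytes = (n + 2**Q) * BYTES_PER_FLOAT

print(f"{full_matrix_bytes / 1e9:.1f} GB")  # 26.2 GB
print(f"{summary_bytes / 1e6:.2f} MB")      # 1.06 MB
```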

qch_copula approach:

  • Computes posteriors on the fly during the M-step
  • Stores only summary statistics S^(t)_i
  • Reduces memory from 26 GB to 1 MB (same example)
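The on-the-fly idea can be sketched as follows. This is a simplified illustration, not the algorithm in the qch package: it assumes the marginal densities are already evaluated, treats conditions as independent within a configuration (the copula term is omitted), and updates only the mixture weights. The function name `em_weights_streaming` and its arguments are hypothetical.

```python
from itertools import product

import numpy as np


def em_weights_streaming(f0, f1, n_iter=50, chunk_size=1000):
    """EM for mixture weights w_c without storing the n x 2^Q posterior matrix.

    f0, f1: arrays of shape (n, Q) with strictly positive null and alternative
    marginal densities evaluated at each z-score (assumed known here).
    ASSUMPTION: conditions are independent within a configuration.
    """
    n, Q = f0.shape
    configs = np.array(list(product((0, 1), repeat=Q)))  # shape (2^Q, Q)
    log_f0, log_f1 = np.log(f0), np.log(f1)
    weights = np.full(len(configs), 1.0 / len(configs))

    for _ in range(n_iter):
        weight_sums = np.zeros_like(weights)  # O(2^Q) accumulator
        log_w = np.log(np.maximum(weights, 1e-300))
        for start in range(0, n, chunk_size):
            stop = start + chunk_size
            # Log component densities for this chunk only: (chunk, 2^Q)
            log_psi = (log_f0[start:stop] @ (1 - configs).T
                       + log_f1[start:stop] @ configs.T)
            log_post = log_psi + log_w
            log_post -= log_post.max(axis=1, keepdims=True)  # stabilise exp
            post = np.exp(log_post)
            post /= post.sum(axis=1, keepdims=True)
            # Keep only the column sums; the chunk posteriors are discarded
            weight_sums += post.sum(axis=0)
        weights = weight_sums / n
    return configs, weights
```

Peak memory here is driven by `chunk_size × 2^Q` rather than `n × 2^Q`. A per-item summary, such as each item's total posterior over the H₁ configurations, could be accumulated in the same loop at O(n) extra cost.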

Runtime (Q=16, n=10^5):

  • Model fitting: 78 minutes
  • Per hypothesis test: ~1 minute
  • Platform: single thread, 3.2 GB RAM

Application I: Psychiatric Disorders

Dataset

  • 14 psychiatric disorders from Psychiatric Genomics Consortium
  • 5,172,884 common SNPs
  • Objective: Identify pleiotropic regions associated with ≥8 disorders

Comparison with PLACO

Original analysis (PLACO):

  • Aggregated SNPs to 26,024 genes using MAGMA
  • Performed 91 pairwise analyses (all combinations of 2 disorders)
  • Identified 38 candidate genes

qch_copula analysis:

  • Direct SNP-level analysis (no aggregation)
  • Single joint test across all 14 disorders
  • Identified 1,608 SNPs in 28 distinct regions
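The count of 91 pairwise analyses follows directly from choosing 2 disorders out of 14, as this short check shows (variable names are illustrative):

```python
from itertools import combinations
from math import comb

n_disorders = 14
n_pairwise = comb(n_disorders, 2)  # number of unordered disorder pairs
pairs = list(combinations(range(n_disorders), 2))

print(n_pairwise, len(pairs))  # 91 91
```

The pairwise workload grows quadratically with the number of disorders, whereas the joint approach described above runs one test for a given composite hypothesis.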

Novel Findings

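qch_copula confirmed 35 of the 38 PLACO genes and identified 8 new regions. Three of the new regions:

| Region | Chr | Position (Mb) | # SNPs | Top SNP P-value |
|---|---|---|---|---|
| Novel 1 | 5 | 103.6-104.0 | 338 | 5.16×10^-12 |
| Novel 2 | 1 | 73.8-73.9 | 156 | 1.01×10^-7 |
| Novel 3 | 3 | 52.6-53.1 | 90 | 7.27×10^-8 |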

Chromosome 5 region (top finding):

  • 338 SNPs detected
  • 25 SNPs associated with 11 disorders
  • Overlaps with the RP11-6N13.1 gene
  • Previously reported for ADHD, ASD, BIP, MDD, SCZ, TS

Three PLACO-only genes (NEGR1, TMX2, C11orf31):

  • Had only 3-7 P-values <0.01 out of 14 (insufficient for ≥8 disorders)
  • Likely false positives from the pairwise approach

Application II: Cucumber Virus Resistance

Dataset

  • 289 cucumber lines (elite, landraces, hybrids)
  • 6 viruses: CGMMV, CMV, CVYV, PRSV, WMV, ZYMV
  • 339,804 common SNPs (after QC)
  • Objective: Detect QTLs for multi-virus resistance

Results by Pleiotropy Level

| Number of viruses | # SNPs detected | # Regions |
|---|---|---|
| ≥2 | 1,845 | 5 |
| ≥3 | 164 | 1 |
| ≥4 | 15 | 1 |

Identified Hotspot Regions

Five regions associated with ≥2 viruses:

| Region | Viruses | Original study | External validation |
|---|---|---|---|
| Chr 5: 6.3-8.8 Mb | WMV, CGMMV, CVYV, CMV | Reported | - |
| Chr 6: 6.8-14.7 Mb | PRSV, ZYMV | Reported | - |
| Chr 1: 9.1-10.1 Mb | - | Novel | - |
| Chr 2: 1.3 Mb | PRSV, ZYMV | Novel | CMV, CABYV QTLs |
| Chr 6: 22.8-26.4 Mb | PRSV, ZYMV | Novel | WMV, CABYV QTLs |

Three novel regions not reported in the original study:

  • Two are validated by independent studies showing shared resistance mechanisms
  • Together they demonstrate the enhanced power of joint analysis over individual GWAS

Methodological Insights

Advantages over Existing Methods

Mixture model approaches (IMIX, c-csmGmm, Primo):

  • Fully parametric: constrain alternative distributions to Gaussian
  • qch_copula: nonparametric estimation with copula dependencies

Pairwise methods (PLACO, HDMT):

  • Limited to Q=2
  • Multiple pairwise tests lack statistical guarantees for joint inference
  • Can produce inconsistent results

Filtering methods (adaFilter):

  • More conservative
  • Lower power for stringent hypotheses

P-values vs. Posteriors

qch_copula establishes theoretical equivalence between:

  • Adaptive Benjamini-Hochberg FDR control on derived P-values
  • Local FDR control on posteriors
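For reference, an adaptive Benjamini-Hochberg procedure can be sketched as the usual step-up rule run at level α divided by an estimate of the null proportion. The code below is a generic illustration under that assumption; it is not taken from the qch package, and the caller supplies `null_proportion`.

```python
import numpy as np


def adaptive_bh(pvalues, null_proportion, alpha=0.05):
    """Benjamini-Hochberg step-up run at level alpha / null_proportion.

    Returns a boolean array marking the rejected hypotheses.
    With null_proportion = 1 this is the standard BH procedure.
    """
    p = np.asarray(pvalues, dtype=float)
    n = p.size
    order = np.argsort(p)
    thresholds = alpha * np.arange(1, n + 1) / (n * null_proportion)
    below = p[order] <= thresholds
    rejected = np.zeros(n, dtype=bool)
    if below.any():
        k = np.nonzero(below)[0].max()  # largest rank meeting its threshold
        rejected[order[: k + 1]] = True
    return rejected


rejected = adaptive_bh([0.001, 0.02, 0.03, 0.5], null_proportion=0.8)
```

A smaller `null_proportion` loosens every threshold by the same factor, which is where the adaptive variant gains power over plain BH.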

Advantages of P-values:

  • Compatible with any multiple testing procedure
  • Enable diagnostic tools (QQ-plots, histograms)
  • Allow Volcano/Manhattan plots
  • Facilitate method comparison

Copula Modeling Strategy

Single correlation matrix across all components:

  • Balances model flexibility and computational efficiency
  • Avoids poor estimation from under-represented components
  • Captures the essential dependency structure

Alternative approaches:

  • Component-specific matrices (IMIX): computationally prohibitive
  • No dependencies (qch): severe Type I error inflation

Robustness to Dependencies

Within-series correlation (ξ = 0.3):

  • Minimal impact on most methods
  • Actually improved FDR control for qch_copula
  • Particularly beneficial in challenging scenarios (Q=16)

Further options:

  • Combine with dependency-aware multiple testing procedures
  • Apply local score techniques for spatial clustering

Practical Recommendations

Choosing q̃ (Minimum Effect Threshold)

Multiple strategies:

  1. Hypothesis-driven: based on the research question (e.g., pleiotropy → q̃=2)
  2. Prior evidence: from previous analyses (e.g., q̃=8 from literature)
  3. Cost-benefit: economic considerations (breeding value of multi-trait resistance)
  4. Exploratory: test multiple q̃ values for ranking

Method Selection

Use qch_copula when:

  • Q > 2 traits/conditions
  • Correlations between traits exist
  • Rigorously defined P-values are needed
  • Data are large-scale (10^5-10^6 markers, Q ≤ 20)
  • Various composite hypotheses are to be tested

Consider alternatives when:

  • Q = 2 and no dependencies: simpler methods may suffice
  • Extremely high correlations (ρ > 0.7): expect minor FDR inflation
  • Component-specific effect directions are needed: extensions required

Implementation Details

Available in R package qch on CRAN

Key functions:

  • Model fitting with copula dependencies
  • P-value computation for any composite hypothesis
  • Multiple hypothesis testing without re-estimation

Limitations and Extensions

Current Limitations

  1. Correlation levels: Slight FDR inflation at ρ ≥ 0.5 for large Q (16+)
  2. Independence assumption: Items assumed independent within series
  3. Direction agnostic: Does not account for effect signs
  4. Post-hoc inference: Testing multiple q̃ values raises multiple comparisons issues

Ongoing/Future Work

Effect direction:

  • Extension accounting for effect signs available in the qch package (independent case)
  • Copula + direction effects: under development

Within-series dependencies:

  • Compatible with dependency-aware multiple testing
  • Integration with local score techniques

Model selection:

  • Post-hoc inference procedures for q̃ selection