Conference Program

Enhancing the Validity of Whole-Genome Association Studies: A Privacy-Preserving HWE Quality Control Method for Large-Scale Biobank Data

Laboratory Genetics and Genomics
  • Primary Categories:
    • Population Genetics
  • Secondary Categories:
    • Population Genetics
Introduction:
Large-scale association studies using multi-ethnic whole-genome sequencing (WGS) data are fundamental to identifying genetic variants linked to complex diseases, enabling personalized medicine and risk prediction in clinical applications. The validity and clinical utility of these findings hinge on rigorous quality control (QC) procedures. Hardy-Weinberg equilibrium (HWE) serves as a critical QC metric for identifying potential genotyping errors, inbreeding, and other issues. Deviations from HWE can lead to spurious associations that directly affect clinical interpretation and decision-making. Accurate HWE testing becomes increasingly difficult in large-scale whole-genome studies that contain genetically related samples and population structure, such as the Center for Common Disease Genomics (CCDG) and the Trans-Omics for Precision Medicine Program (TOPMed). Currently, no HWE test accounts for both genetic relatedness and population structure across multiple biobanks. In addition, data often cannot be pooled across biobanks because of data-sharing and privacy constraints. To address these challenges, we propose a novel federated learning method to assess HWE across multiple biobanks while preserving data privacy.
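For readers unfamiliar with the metric, the following is a minimal sketch of the textbook 1-degree-of-freedom chi-square HWE test for a biallelic variant, assuming unrelated individuals from a single homogeneous population. It is not the method proposed in this abstract; the function name `hwe_chisq` and the genotype counts are hypothetical, chosen only to illustrate why population structure matters: two subpopulations that are each exactly in HWE can appear to deviate strongly once pooled.

```python
import math


def hwe_chisq(n_aa, n_ab, n_bb):
    """Textbook chi-square HWE test (1 df) from genotype counts.

    Assumes unrelated samples and no population structure.
    Returns (statistic, p_value).
    """
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    q = 1.0 - p
    if p <= 0.0 or q <= 0.0:
        # Monomorphic variant: the test is undefined, report no deviation.
        return 0.0, 1.0
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Survival function of a chi-square with 1 df: erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value


# Hypothetical counts: each subpopulation is exactly in HWE on its own
# (allele A frequency 0.1 and 0.9, respectively).
pop1 = (10, 180, 810)
pop2 = (810, 180, 10)
pooled = tuple(a + b for a, b in zip(pop1, pop2))

stat1, pval1 = hwe_chisq(*pop1)
stat2, pval2 = hwe_chisq(*pop2)
stat_pooled, pval_pooled = hwe_chisq(*pooled)
```

In this toy example the pooled sample shows a large heterozygote deficit even though neither subpopulation deviates from HWE, which is one reason an unadjusted test can be misleading in multi-ancestry cohorts.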

Methods:
We used data from two major biobanks: CCDG, which contains WGS data from diverse populations aimed at identifying genetic variants linked to common diseases, and TOPMed, which integrates WGS with clinical data for studying heart, lung, blood, and sleep disorders. Our HWE test uses a generalized estimating equation framework that accounts for population structure by including principal components (computed from variants with minor allele frequency > 0.01) as covariates and for genetic relatedness through a family-specific genotype correlation matrix. We extended this framework to the federated learning setting by leveraging the surrogate efficient score function, which preserves the privacy of individual-level data by requiring only summary-level information from each biobank. In addition, we accounted for biobank-specific heterogeneity using the density ratio tilting method.
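To make the idea of sharing only summary-level information concrete, here is a minimal sketch of a simple summary-level combination across sites. It is an assumption-laden illustration and not the surrogate efficient score approach described above: it uses a sample-size-weighted Stouffer combination of per-site z-scores built from the estimated inbreeding coefficient, assumes unrelated individuals within each site, and ignores population structure and between-site heterogeneity. The function names `site_summary` and `combine_sites` and all counts are hypothetical.

```python
import math


def site_summary(n_aa, n_ab, n_bb):
    """Run locally at one site; only (n, z) would leave the site.

    z = sqrt(n) * f_hat, where f_hat = 1 - observed_het / expected_het.
    For a biallelic variant, z**2 equals the 1-df chi-square HWE statistic.
    """
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    q = 1.0 - p
    if p <= 0.0 or q <= 0.0:
        return n, 0.0  # monomorphic at this site: contributes no evidence
    f_hat = 1.0 - n_ab / (2 * n * p * q)
    return n, math.sqrt(n) * f_hat


def combine_sites(summaries):
    """Stouffer combination with weights sqrt(n_k); returns (z, p_value)."""
    total_n = sum(n for n, _ in summaries)
    z = sum(math.sqrt(n) * z_k for n, z_k in summaries) / math.sqrt(total_n)
    # Two-sided p-value under a standard normal reference distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_value


# Hypothetical genotype counts held privately by two sites.
site_a = site_summary(30, 140, 830)   # heterozygote deficit
site_b = site_summary(810, 180, 10)   # exactly in HWE
z_combined, p_combined = combine_sites([site_a, site_b])
```

Only the pair `(n, z)` crosses the site boundary in this sketch, so no individual-level genotypes are exchanged. A combination like this does nothing to adjust for relatedness, ancestry, or site-specific heterogeneity.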

Results:
Our analysis included approximately 140k samples from CCDG and 190k samples from TOPMed. Both datasets were highly diverse, with multi-ancestral/ethnic samples from African, European, Asian, and Hispanic populations, and both contained genetically related samples; for example, CCDG has 19,555 samples with first-degree, 578 with second-degree, and 1,098 with third-degree relatedness. In comprehensive simulation studies that generated genotypes with population structure and genetic relatedness modeled on CCDG and TOPMed, our method controlled the type I error rate appropriately, so that clinically relevant variants were retained. We also verified that our method has sufficient statistical power to detect variants that deviate from HWE. Compared with a traditional meta-analysis, our approach consistently performed better at identifying variants that deviate from HWE across all p-value thresholds.

Conclusion:
Our HWE testing method appropriately adjusts for genetic relatedness and population structure within each biobank while accounting for data heterogeneity across biobanks in a federated setting. As large-scale biobanks are quickly becoming a cornerstone of genomic analysis, our privacy-preserving approach has the potential to improve the validity of whole-genome association analyses for diverse patient populations by enhancing the accuracy and robustness of HWE testing.