Machine Learning-Guided GWAS in the Taiwan Biobank Reveals Novel Height-Associated Genes with Possible Epistatic Interactions
Laboratory Genetics and Genomics
Primary Categories:
- Population Genetics
Secondary Categories:
- Population Genetics
Introduction:
While significant progress has been made in understanding genetic determinants of height in large cohorts, challenges persist for minor ethnic populations due to small sample sizes and underrepresentation. Traditional genome-wide association studies (GWAS) often lack statistical power in these groups, limiting the detection of genetic signals and interactions. To address these limitations, we developed a comprehensive workflow leveraging machine learning (ML) techniques tailored for complex trait analysis. This approach enhances the discovery of height-associated genes in minor populations and facilitates the exploration of gene-gene interactions.
Methods:
Using a cohort of 120,068 individuals from the Taiwan Biobank and genotyping data filtered to 510,276 SNPs, we established separate baseline models for males and females through Lasso regression to control for demographic confounders. An Add-One SNP approach was used to evaluate the unique contribution of each SNP by assessing reductions in mean squared error (MSE). Top SNPs were prioritized based on z-score transformation. Gene enrichment and pathway analyses were then conducted to map significant SNPs to relevant genes and annotate them, and potential epistatic effects were examined.
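The Add-One SNP ranking described above can be illustrated with a minimal sketch. It is not the study's implementation: the abstract does not specify how each SNP is added to the baseline, so the sketch assumes the baseline predictions are held fixed and each SNP is fitted by simple least squares on the baseline residuals (a stand-in for refitting the Lasso model). The function names `add_one_mse_reduction`, `zscores`, and `rank_snps`, and the toy data, are hypothetical.

```python
from statistics import mean, pstdev


def add_one_mse_reduction(residuals, genotype):
    """MSE reduction from adding one SNP to a fixed baseline.

    Assumption: the SNP is fitted by one-variable least squares on the
    baseline residuals, with the intercept re-estimated in both models.
    """
    n = len(residuals)
    r_mean = mean(residuals)
    g_mean = mean(genotype)
    var_g = sum((g - g_mean) ** 2 for g in genotype)
    if var_g == 0:
        # A monomorphic SNP cannot explain any variation.
        return 0.0
    base_mse = sum((r - r_mean) ** 2 for r in residuals) / n
    slope = sum(
        (g - g_mean) * (r - r_mean) for g, r in zip(genotype, residuals)
    ) / var_g
    new_mse = sum(
        (r - r_mean - slope * (g - g_mean)) ** 2
        for g, r in zip(genotype, residuals)
    ) / n
    return base_mse - new_mse


def zscores(values):
    """Standardize values to mean 0 and (population) standard deviation 1."""
    mu = mean(values)
    sd = pstdev(values)
    if sd == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sd for v in values]


def rank_snps(height, baseline_pred, genotypes):
    """Return (snp_id, z_score) pairs sorted from largest to smallest z-score."""
    residuals = [h - p for h, p in zip(height, baseline_pred)]
    snp_ids = list(genotypes)
    reductions = [add_one_mse_reduction(residuals, genotypes[s]) for s in snp_ids]
    scored = zip(snp_ids, zscores(reductions))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    # Toy data, invented for illustration only (heights in cm, dosages 0/1/2).
    height = [170.0, 172.0, 174.0, 168.0, 176.0, 171.0]
    baseline_pred = [172.0] * 6  # stand-in for the Lasso baseline predictions
    genotypes = {
        "snp_a": [0, 1, 2, 0, 2, 1],
        "snp_b": [1, 1, 1, 1, 1, 1],
        "snp_c": [0, 1, 0, 1, 0, 1],
    }
    for snp_id, z in rank_snps(height, baseline_pred, genotypes):
        print(snp_id, round(z, 3))
```

Fitting on residuals keeps each SNP evaluation to a single pass over the data, which matters with hundreds of thousands of SNPs. A full refit of the baseline with the extra SNP could give somewhat different reductions when the SNP correlates with the baseline covariates.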
Results:
The baseline models achieved MSE values of 32.3582 for males and 26.0198 for females. Our ML-based model successfully replicated previous height-related SNP findings within the Taiwan Biobank, validating the efficacy of our approach. Among the top 3000 SNPs, we mapped 2296 SNPs in males and 2398 in females to genes, with 71.4% (1640 SNPs) in males and 63.7% (1528 SNPs) in females linked to height traits reported in the GWAS catalog. After applying CADD and RegulomeDB filtering, most identified significant genes had known height associations. For males, a total of four key height-contributing genes were identified across various pathways. For females, 21 key genes were identified, with the highest number of genes involved in signal transduction. Notably, ERBB3 (p = 5.14E-21) and SUOX (p = 3.03E-9) emerged as novel, height-linked genes specific to females, with ERBB3 possibly serving as a central hub for pathway crosstalk. Additionally, two key SNPs in the EFEMP1 gene (rs1346786 and rs17278665), which have been extensively reported in the GWAS catalog without sex-specific differences, exhibited a discordant pattern between sexes. Importantly, the ML model demonstrated its capacity to detect gene-gene interactions, revealing SNPs with interaction effects that traditional GWAS methods would likely overlook. Although no significant differences were found in the overall genetic influence on height between sexes, SNP rankings differed by sex, suggesting distinct genetic prioritizations.
Conclusion:
This study presents a novel ML-driven GWAS framework that offers a powerful alternative for analyzing genetic traits in minor populations, overcoming the typical sample size barriers and amplifying detection of epistatic effects. By replicating known findings and uncovering new loci associated with height in a smaller dataset, this approach highlights ML’s potential to address longstanding challenges in genetic research across diverse populations, enhancing both the precision and depth of GWAS in underrepresented groups.