AI-supported Identification of Disease-Gene Relationship Publications
Laboratory Genetics and Genomics
Primary Categories:
- Laboratory Genetics
Secondary Categories:
- Laboratory Genetics
Introduction:
Since the initial implementation of genome-wide diagnostic sequencing - including whole exome sequencing (WES) and whole genome sequencing (WGS) - the pace of novel gene-disease relationship (GDR) discovery has been rapid and consistent. Every month, peer-reviewed journal articles describing novel GDRs are published, but there is no systematic mechanism for centralizing these findings. Currently available GDR databases - including OMIM, ClinGen, GenCC, PanelApp (UK and Australia), and DECIPHER - rely heavily on human curation of the literature. These efforts are predominantly volunteer-based and thus cannot be depended upon to be timely or comprehensive. Because laboratories and scientists evaluating genetic test results depend on accurate GDR information, the gap between published articles and GDR databases creates a risk of false-negative results and increases the manual workload on the genomics workforce.
Methods:
Here we present a workflow that combines PubMed searches with large language model (LLM) generative artificial intelligence (AI) assessment of article titles and abstracts to detect articles likely describing a GDR for a given gene. Search terms and LLM prompt engineering were informed by a systematic review of publications associated with existing OMIM phenotype pages. Sets of 50 positive control genes and 50 negative control genes were selected to evaluate the performance of this tool.
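The two-stage workflow described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the search terms are placeholders rather than the 38 terms used in the study, the gene symbol is hypothetical, and the LLM step is represented by a caller-supplied `classify` function because the model and prompt are not specified here. The URL targets the public NCBI E-utilities `esearch` endpoint; no network request is made.

```python
from typing import Callable, Dict, Iterable, List
from urllib.parse import urlencode

# Public NCBI E-utilities search endpoint.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

# Placeholder terms for illustration only; NOT the 38 terms from the study.
EXAMPLE_TERMS = ["variants", "de novo", "syndrome"]


def build_pubmed_query(gene: str, terms: Iterable[str]) -> str:
    """Combine a gene symbol with search terms, restricted to title/abstract."""
    term_clause = " OR ".join(f'"{t}"[Title/Abstract]' for t in terms)
    return f'"{gene}"[Title/Abstract] AND ({term_clause})'


def build_esearch_url(query: str, retmax: int = 100) -> str:
    """Return the esearch URL for a query (the request itself is not sent)."""
    params = {"db": "pubmed", "term": query, "retmode": "json", "retmax": retmax}
    return f"{ESEARCH_URL}?{urlencode(params)}"


def count_gdr_articles(
    articles: List[Dict[str, str]],
    classify: Callable[[str, str], bool],
) -> int:
    """Count articles that the classifier flags as likely describing a GDR.

    `classify(title, abstract)` stands in for the LLM assessment step.
    """
    return sum(1 for a in articles if classify(a["title"], a["abstract"]))


if __name__ == "__main__":
    query = build_pubmed_query("GENE1", EXAMPLE_TERMS)  # hypothetical gene symbol
    print(build_esearch_url(query))
```

Passing the classifier in as a function keeps the retrieval step independent of any particular LLM, so the model or prompt can be changed without touching the search code.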
Results:
A total of 38 unique search terms were identified, either because they occurred at high frequency in publications associated with OMIM phenotype pages or through expert feedback. At least one article was identified as likely describing a GDR for 49/50 (98%) positive control genes and 22/50 (44%) negative control genes. The average number of positive articles differed significantly between the sets: 6.34 for positive control genes and 0.64 for negative control genes (t-test p-value 6 × 10^-11). A total of 32 genes had ≥4 positive articles, all of which came from the positive control set (binomial p-value 2.3 × 10^-8).
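The counts above imply per-gene sensitivity and specificity at two decision thresholds. The short sketch below derives them by simple arithmetic from the reported numbers; the function and variable names are illustrative, and treating "number of positive articles per gene" as the decision threshold is an assumption made for this example.

```python
def sensitivity(true_positives: int, n_positive: int) -> float:
    """Fraction of positive control genes that were flagged."""
    return true_positives / n_positive


def specificity(false_positives: int, n_negative: int) -> float:
    """Fraction of negative control genes that were not flagged."""
    return (n_negative - false_positives) / n_negative


N_POS = N_NEG = 50  # control set sizes reported in the Methods

# Threshold: at least 1 positive article per gene (49 positives, 22 negatives flagged).
sens_1 = sensitivity(49, N_POS)
spec_1 = specificity(22, N_NEG)

# Threshold: at least 4 positive articles per gene (32 positives, 0 negatives flagged).
sens_4 = sensitivity(32, N_POS)
spec_4 = specificity(0, N_NEG)

print(f">=1 article:  sensitivity {sens_1:.0%}, specificity {spec_1:.0%}")
print(f">=4 articles: sensitivity {sens_4:.0%}, specificity {spec_4:.0%}")
```

Raising the threshold from one article to four trades sensitivity (98% to 64%) for specificity (56% to 100%) on these control sets.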
Conclusion:
Our current method detects peer-reviewed articles describing GDRs with high sensitivity and moderate specificity. This system can be used to screen for articles describing potentially novel GDRs, increasing the diagnostic yield of WES/WGS testing without significantly increasing the workload on the genomics workforce. As this is only a first attempt at such a tool, it is reasonable to expect improved performance, particularly in specificity, with modifications. Areas for improvement include PubMed search parameters, LLM prompt engineering, and the choice of LLM.