Facilitating Machine Learning and Artificial Intelligence in Genetic Databases: An Open-Source Tool for Data Integration and Summarization

Laboratory Genetics and Genomics

Primary Categories:
- Genomic Medicine
Secondary Categories:
- Genomic Medicine

Introduction:
Genetic databases, such as ClinVar, ClinGen and the Gene Curation Coalition (GenCC), are crucial to the understanding and interpretation of disease-causing variation. These repositories contain valuable information about the pathogenicity of genetic variants, the confidence of gene-disease relationships, and the actionability of genomic findings. However, they can be time-consuming to navigate for genetics professionals looking for information about specific genes or variants. In addition, they are not currently formatted for use in machine learning (ML) and artificial intelligence (AI) applications, because shared data are difficult to access across databases and because the databases rely on categorical text fields instead of numerical values. This project created publicly available tools that integrate these databases and make them more accessible to clinicians and to researchers using ML/AI applications, including by easing the burden of summarizing data from multiple repositories.

Methods:
An open-source software tool was developed to integrate data from ClinVar, ClinGen and GenCC by connecting relevant data fields between the repositories. To make these data more accessible for ML/AI purposes, our team further curated the source material by normalizing the data, extracting relevant information about variants and developing prompts to summarize the data through AI methods. To normalize the data for ML applications, fields from ClinVar, ClinGen and GenCC were reviewed, and methods were developed to convert commonly used genetic terms that represent a continuum (for example, Benign to Pathogenic variant classification) into numerical values. We then developed template language to translate the structured fields into text suitable for use by AI programs. These templates were used as input for multiple large language models (LLMs), and the outputs were reviewed for accuracy, readability, conciseness, clinical relevance and bias. An iterative approach using input from genetics professionals was used to engineer LLM prompts and determine the most effective method for using AI to generate variant summaries.
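As a rough illustration of the normalization step described above, the sketch below maps variant classification labels onto an ordinal scale. The dictionary name, the function name and the numeric values are all assumptions made for this example; they are not the values used in the tool, whose actual scales are defined in its repository.

```python
from typing import Optional

# Hypothetical ordinal scale for the Benign-to-Pathogenic continuum.
# These numbers are illustrative only, not the tool's actual assignments.
CLASSIFICATION_SCORES = {
    "benign": -2,
    "likely benign": -1,
    "uncertain significance": 0,
    "likely pathogenic": 1,
    "pathogenic": 2,
}


def encode_classification(label: str) -> Optional[int]:
    """Return the ordinal score for a classification label.

    Labels are matched case-insensitively with whitespace collapsed.
    Labels outside the continuum return None rather than a guessed value.
    """
    key = " ".join(label.lower().split())
    return CLASSIFICATION_SCORES.get(key)


if __name__ == "__main__":
    for text in ["Pathogenic", "Likely benign", "drug response"]:
        print(text, "->", encode_classification(text))
```

Returning None for unmapped labels keeps categories that do not sit on the continuum from being silently forced onto it; how such categories should be handled is a design decision this sketch leaves open.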

Results:
A novel framework for the numerical interpretation of a variant’s clinical significance was developed, allowing variant classifications to be compared not only by confidence and pathogenicity (e.g., Benign to Pathogenic) but also by their clinical impact or penetrance (e.g., risk allele vs. Mendelian disease risk). In addition, numerical values were assigned for gene-disease validity, haploinsufficiency and gene-disease-intervention actionability categories. These numerical assignments were integrated into the developed software tool, which is now available as a resource for genetics researchers at the following repository: https://github.com/mgbpm/clingen-ai-tools. We subsequently devised template text that converts the structured text fields from these databases into a readable format for input into LLMs. Ten diverse example variant summaries were developed as LLM input for one-shot and three-shot prompts, which use one or three examples, respectively, to guide the outputs. We further engineered LLM prompts to provide accurate and succinct variant summaries that improve the readability of these templates for genetics providers. Subject-expert review of the AI outputs noted that one-shot and three-shot prompts typically yielded more accurate and readable variant summaries than zero-shot prompts.
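The difference between zero-shot, one-shot and three-shot prompts can be sketched as follows. This is a minimal illustration, not the prompt sequence from the tool: the record field names, the instruction wording, the function names and the placeholder examples are all hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical instruction text; the tool's engineered prompts may differ.
INSTRUCTION = "Summarize the variant information below for a genetics provider."


def render_template(record: Dict[str, str]) -> str:
    """Turn structured fields into template text (field names are assumed)."""
    return (
        f"Variant {record['variant']} in gene {record['gene']} is classified "
        f"as {record['classification']} for {record['condition']}."
    )


def build_prompt(target: str, examples: List[Tuple[str, str]], n_shots: int) -> str:
    """Assemble a prompt with n_shots worked examples before the target.

    n_shots = 0 gives a zero-shot prompt, 1 a one-shot prompt, 3 a three-shot prompt.
    """
    if n_shots < 0 or n_shots > len(examples):
        raise ValueError("n_shots must be between 0 and the number of examples")
    parts = [INSTRUCTION]
    for template_text, summary in examples[:n_shots]:
        parts.append(f"Input: {template_text}\nSummary: {summary}")
    # The target ends with an open "Summary:" for the model to complete.
    parts.append(f"Input: {target}\nSummary:")
    return "\n\n".join(parts)


# Placeholder examples only; these are not real variants or curated summaries.
EXAMPLES = [
    (f"Template text for example {i}.", f"Reviewed summary for example {i}.")
    for i in range(1, 4)
]

if __name__ == "__main__":
    record = {
        "variant": "VARIANT_1",
        "gene": "GENE_A",
        "classification": "Likely pathogenic",
        "condition": "CONDITION_X",
    }
    print(build_prompt(render_template(record), EXAMPLES, n_shots=1))
```

Keeping the examples in a separate list makes it straightforward to compare zero-, one- and three-shot outputs for the same target by changing only n_shots.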

Conclusion:
This publicly available tool and the numerical frameworks developed here allow ML/AI researchers to process the data within these repositories more accurately for their own research and applications. Clinicians and clinical researchers who want to learn more about variants of interest can use the engineered prompt sequence from the tool to quickly gather summarized information from ClinVar, ClinGen and GenCC.
