Automating Pharmacogenomic Annotation: Leveraging Schema-Constrained GPT-4 for Variant-Drug Data Extraction
Clinical Genetics and Therapeutics
Primary Categories:
- Non-Clinical
Secondary Categories:
- Non-Clinical
Introduction:
Pharmacogenomics aims to optimize drug therapy by tailoring treatments to individual genetic profiles, enhancing drug efficacy, minimizing adverse reactions, and improving patient outcomes. Despite its potential, pharmacogenomics has not been fully integrated into clinical practice, primarily because of the large and growing number of genetic variants that must be covered. These variants are dispersed across a vast body of biomedical literature, making manual annotation labor-intensive and time-consuming. Language models such as GPT-4 offer a possible path to identifying and annotating these variants, which could enable broader adoption of pharmacogenomics in clinical settings.
Methods:
We selected 5,000 variant-drug annotations from PharmGKB (Pharmacogenomics Knowledge Base) for our study. To extract relevant information from the biomedical literature, we developed an automated workflow that retrieves full-text articles from PubMed Central (PMC) using the PubMed Identifiers (PMIDs) provided by PharmGKB. Entrez Programming Utilities (E-utilities) were used to link PMIDs to PubMed Central Identifiers (PMCIDs) and to fetch full-text XML (Extensible Markup Language) content.
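The retrieval step can be illustrated with a minimal sketch. It assumes the public E-utilities endpoints `elink.fcgi` (to map a PMID to a PMC record) and `efetch.fcgi` (to download the XML). The function names and the choice to take the first linked PMC record are illustrative assumptions, not the workflow used in the study.

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET
from typing import List, Optional

EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def elink_url(pmid: str) -> str:
    """Build the ELink request that maps a PubMed ID to PMC record IDs."""
    query = urllib.parse.urlencode(
        {"dbfrom": "pubmed", "db": "pmc", "linkname": "pubmed_pmc", "id": pmid}
    )
    return f"{EUTILS}/elink.fcgi?{query}"


def efetch_url(pmcid: str) -> str:
    """Build the EFetch request for the full-text XML of one PMC record."""
    query = urllib.parse.urlencode({"db": "pmc", "id": pmcid, "retmode": "xml"})
    return f"{EUTILS}/efetch.fcgi?{query}"


def parse_pmcids(elink_xml: str) -> List[str]:
    """Collect the linked PMC IDs from an ELink XML response."""
    root = ET.fromstring(elink_xml)
    return [el.text for el in root.findall(".//LinkSetDb/Link/Id") if el.text]


def fetch_fulltext_xml(pmid: str) -> Optional[str]:
    """Return full-text XML for a PMID, or None if no PMC record is linked.

    Hypothetical helper: performs live network requests and takes the first
    linked PMC record, which is an assumption made for this sketch.
    """
    with urllib.request.urlopen(elink_url(pmid)) as resp:
        pmcids = parse_pmcids(resp.read().decode("utf-8"))
    if not pmcids:
        return None
    with urllib.request.urlopen(efetch_url(pmcids[0])) as resp:
        return resp.read().decode("utf-8")
```

Articles without a PMC link return `None`, which is one plausible reason why only a subset of the selected annotations could be paired with full text.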
This process yielded 1,850 distinct variant annotations from 657 unique articles. We used GPT-4 (Generative Pre-trained Transformer 4) to extract relevant information from the XML content of these articles into a structured JSON (JavaScript Object Notation) schema. The model was prompted to identify key details related to genetic variants, such as variant identifiers (e.g., reference single nucleotide polymorphism identifiers [rsIDs], gene names, protein changes, star alleles), associated genes, clinical significance, related diseases and drugs, and supporting evidence. The output consisted of JSON objects containing a unique variant identifier, relevant metadata, and contextual evidence.
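To make the schema constraint concrete, the following sketch checks model output against a small set of required fields. The field names (`variant_id`, `gene`, `drugs`, `phenotype_category`, `evidence`) are hypothetical placeholders chosen for illustration, since the abstract does not list the actual schema.

```python
import json
from typing import Any, Dict, List

# Hypothetical field names; the study's real schema is not given in the abstract.
REQUIRED_FIELDS = ("variant_id", "gene", "drugs", "phenotype_category", "evidence")


def parse_annotations(raw_output: str) -> List[Dict[str, Any]]:
    """Parse model output and keep only records that satisfy the schema.

    Accepts either a single JSON object or a list of objects. A record is
    kept when every required field is present and `drugs` is a list.
    """
    data = json.loads(raw_output)
    records = data if isinstance(data, list) else [data]
    valid = []
    for record in records:
        if not isinstance(record, dict):
            continue
        has_fields = all(field in record for field in REQUIRED_FIELDS)
        if has_fields and isinstance(record["drugs"], list):
            valid.append(record)
    return valid
```

Records that fail the check are dropped rather than repaired, which keeps downstream comparison simple at the cost of discarding partially correct output.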
To evaluate the quality of the extracted data, we compared them with a ground truth dataset using a variant matching framework, which processed complex variant notations and assessed whether the predicted and true variants matched exactly, partially, or not at all.
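The three-way outcome (exact, partial, none) can be sketched as follows. This is a simplified stand-in for the matching framework described above: it assumes that a variant notation can be split into tokens on commas, semicolons, slashes, plus signs, and whitespace, and that case differences are not meaningful. The function names are hypothetical.

```python
import re
from typing import Set


def variant_tokens(notation: str) -> Set[str]:
    """Split a variant notation into lower-cased tokens.

    Assumption for this sketch: components are separated by commas,
    semicolons, slashes, plus signs, or whitespace.
    """
    parts = re.split(r"[,;/+\s]+", notation.strip().lower())
    return {part for part in parts if part}


def match_variants(predicted: str, truth: str) -> str:
    """Classify a predicted variant against the true one.

    Returns "exact" when the token sets are identical, "partial" when they
    share at least one token, and "none" otherwise.
    """
    pred, true = variant_tokens(predicted), variant_tokens(truth)
    if not pred or not true:
        return "none"
    if pred == true:
        return "exact"
    if pred & true:
        return "partial"
    return "none"
```

For example, `match_variants("rs1045642, rs2032582", "rs1045642")` is classified as a partial match because only one of the two predicted identifiers appears in the ground truth.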
We aligned the data extracted by GPT-4 with the ground truth based on shared PMIDs, treating each ground truth entry as potentially mapping to multiple predictions. Key fields, such as gene names, drugs, phenotype categories, and variant annotations, were compared to determine match rates and assess extraction quality. Given the complexity of many-to-one associations, we conducted an additional set comparison analysis for each PMID. For each PMID, the intersection of the predicted and ground truth sets was calculated to identify overlapping elements, and the match rate was computed as the ratio of the size of the intersection to the size of the ground truth set.
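The per-PMID set comparison amounts to dividing the size of the intersection by the size of the ground truth set. A minimal sketch is shown below. Lower-casing and whitespace stripping are normalization assumptions added for illustration, and the function names are hypothetical.

```python
from typing import Dict, Iterable, Set


def normalize(values: Iterable[str]) -> Set[str]:
    """Lower-case and strip values so trivially different spellings compare equal."""
    return {value.strip().lower() for value in values if value.strip()}


def match_rate(predicted: Iterable[str], truth: Iterable[str]) -> float:
    """Return |predicted ∩ truth| / |truth|, or 0.0 for an empty truth set."""
    truth_set = normalize(truth)
    if not truth_set:
        return 0.0
    return len(normalize(predicted) & truth_set) / len(truth_set)


def per_pmid_rates(
    predicted_by_pmid: Dict[str, Iterable[str]],
    truth_by_pmid: Dict[str, Iterable[str]],
) -> Dict[str, float]:
    """Compute the match rate for every PMID present in the ground truth."""
    return {
        pmid: match_rate(predicted_by_pmid.get(pmid, ()), truth)
        for pmid, truth in truth_by_pmid.items()
    }
```

Because the denominator is the ground truth set, this measure behaves like recall: extra predictions that are absent from the ground truth do not lower the rate.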
Results:
When aligned by gene and PMID as single samples, the model performed poorly. The drug match rate was 57.4%, and the phenotype category match rate was 66.7%. The exact variant match rate was 0.0%, while the partial variant match rate was 16.7%. The overall exact match rate across all fields was 16.7%, and the partial match rate was 94.5%. When grouped by identical PMIDs, the model showed improvement, with a gene match rate of 74.8%, a variant/haplotype match rate of 38.0%, and a drug match rate of 72.6% per PMID.
Conclusion:
Our study shows that GPT-4, used as a schema-constrained annotation tool, can identify key components in whole documents. While the system succeeds in capturing aggregate information, alignment and precision remain significant issues. In addition, limitations in both verification and alignment may cause the reported performance to be underestimated. Overall, automation offers a promising tool to enhance pharmacogenomics, but substantial modifications are needed to guarantee clinical impact.