Smart Categories: LLM-Based Automatic Tagging of Categorical Information in Genomic Articles Outperforms Manual Curators, Yielding Improved Curation Efficiency
Clinical Genetics and Therapeutics
Primary Categories:
- Clinical Genetics
Secondary Categories:
- Clinical Genetics
Introduction:
The expedient and accurate assessment of clinical and functional evidence is critical for high-throughput variant curation pipelines. Finding relevant evidence to support variant classification is the most time-intensive step in variant curation workflows. The goal of this project is to increase curation efficiency by enabling curators to focus on the most important articles first and by providing cutoffs below which articles are deemed not relevant and therefore do not require review. As an initial proof of concept, Genomenon has adopted an AI model that scans an abstract and categorizes the article as having Clinical, Functional, or Review content. These “Smart Categories” are a new way to characterize the content of articles, based on analysis deeper than keyword text matching or standard NLP.
Methods:
To improve the assignment of PMIDs into curation-relevant categories, we are utilizing a modern AI approach. This requires a large pool of articles classified into their respective categories by humans. For this project, curators read the title and abstract of each article in a pool and classified it into one or more pre-selected categories. These training data were then used to fine-tune a pre-trained large language model (LLM) to evaluate text from newly published genomics articles that are automatically ingested into our pipeline. Expert curators read and labeled 1,300 PMIDs into our current 10 categories: Clinical, Functional, Review, Germline, Somatic, Human, Non-Human, Case Report, Association Study, and Non-English. These categories were chosen as the primary, most relevant categories needed to speed curation workflows. This information was then presented to curators as part of a variant curation pipeline.
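As a rough illustration of such a fine-tuning setup, the sketch below uses a Hugging Face-style encoder with a multi-label classification head. The base model ("bert-base-uncased"), dataset wiring, and training arguments are illustrative assumptions, not our production configuration.

```python
# Minimal sketch of the fine-tuning setup: a pre-trained encoder with a
# multi-label head over the 10 Smart Categories. The base model, dataset
# wiring, and training arguments are illustrative, not production settings.
import torch
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

CATEGORIES = ["Clinical", "Functional", "Review", "Germline", "Somatic",
              "Human", "Non-Human", "Case Report", "Association Study",
              "Non-English"]

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased",
    num_labels=len(CATEGORIES),
    problem_type="multi_label_classification",  # sigmoid outputs + BCE loss
)

class AbstractDataset(torch.utils.data.Dataset):
    """Curator-labeled examples: each text is a title plus abstract, each
    label is a 10-element 0/1 vector (one slot per category)."""
    def __init__(self, texts, labels):
        self.encodings = tokenizer(texts, truncation=True, padding=True,
                                   max_length=512)
        self.labels = labels

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        item = {k: torch.tensor(v[idx]) for k, v in self.encodings.items()}
        item["labels"] = torch.tensor(self.labels[idx], dtype=torch.float)
        return item

# train_texts / train_labels would come from the 1,300 curator-labeled PMIDs:
# trainer = Trainer(
#     model=model,
#     args=TrainingArguments(output_dir="smart-categories", num_train_epochs=3),
#     train_dataset=AbstractDataset(train_texts, train_labels),
# )
# trainer.train()
```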
For score calculation, the title and abstract of each article are passed through our fine-tuned LLM. The model outputs a probabilistic value (also referred to as a score) from 0 to 0.999 for each category. An article can receive multiple category assignments (e.g., Clinical and Functional, or Functional, Somatic, and Non-Human). Newly published genomics articles are ingested automatically into the pipeline, and their titles and abstracts are run through the trained LLM to generate categories and probabilistic scores. An article is considered positive for a specific category if that category's score is 0.500 or above.
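A minimal sketch of this scoring step, assuming the model, tokenizer, and CATEGORIES list from the fine-tuning sketch above: a sigmoid converts each logit into an independent per-category probability, and categories at or above the cutoff are assigned.

```python
# Minimal sketch of the scoring step, reusing model, tokenizer, and
# CATEGORIES from the fine-tuning sketch above.
import torch

THRESHOLD = 0.500  # deployed cutoff for assigning a category

def categorize(title: str, abstract: str) -> dict:
    """Return a {category: score} map for one article; scores are
    independent sigmoid probabilities, rounded to three decimal places."""
    inputs = tokenizer(title + " " + abstract, truncation=True,
                       max_length=512, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    probs = torch.sigmoid(logits).squeeze(0)
    return {cat: round(p.item(), 3) for cat, p in zip(CATEGORIES, probs)}

# Hypothetical article text, purely for illustration:
scores = categorize("A functional assay of a somatic variant in a mouse model",
                    "We characterize the variant's effect on protein function.")
assigned = [cat for cat, s in scores.items() if s >= THRESHOLD]
# e.g. assigned might be ["Functional", "Somatic", "Non-Human"]
```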
Results:
First, we empirically determined the score cutoff for category application by adjusting the threshold and evaluating performance metrics. This yielded a cutoff score of 0.500 for all categories.
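A minimal sketch of such a threshold sweep, assuming a held-out labeled set; the arrays val_probs (N x 10 sigmoid scores) and val_labels (N x 10 binary curator labels) are stand-ins here, not our evaluation data.

```python
# Minimal sketch: sweep candidate cutoffs over a held-out set and pick the
# one with the best micro-averaged F1.
import numpy as np
from sklearn.metrics import f1_score, precision_score, recall_score

# Dummy stand-ins so the sketch runs; replace with the real held-out set.
rng = np.random.default_rng(0)
val_probs = rng.random((200, 10))                      # model sigmoid scores
val_labels = (rng.random((200, 10)) > 0.7).astype(int)  # curator labels

best_cutoff, best_f1 = None, -1.0
for cutoff in np.arange(0.05, 1.0, 0.05):
    preds = (val_probs >= cutoff).astype(int)
    p = precision_score(val_labels, preds, average="micro", zero_division=0)
    r = recall_score(val_labels, preds, average="micro", zero_division=0)
    f1 = f1_score(val_labels, preds, average="micro", zero_division=0)
    print(f"cutoff={cutoff:.2f}  P={p:.3f}  R={r:.3f}  F1={f1:.3f}")
    if f1 > best_f1:
        best_cutoff, best_f1 = cutoff, f1
print(f"selected cutoff: {best_cutoff:.2f}")
```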
With this value, the deployed model metrics across all 10 categories are:
- Accuracy: 88.4%
- Precision: 84.5%
- Recall: 84.4%
- F1 score: 84.4%
Key individual category metrics are:
- Clinical: 84.9% precision, 81.7% accuracy
- Functional: 87.7% precision, 89.0% accuracy
- Review: 73.8% precision, 94.6% accuracy
These scores are integrated into our custom genomic intelligence curation infrastructure and used to prioritize articles that contain functional and clinical evidence while deprioritizing review articles. This has yielded a curator efficiency increase of ~15%. Additionally, in a preliminary test, the new model achieved a 22.8% increase in correct category assignments compared to a human curator given only the title and abstract of an article.
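One plausible form for this prioritization, shown purely as an illustration: the ranking heuristic below is an assumption, not our production scoring logic.

```python
# Illustrative ranking heuristic (an assumption, not production logic):
# boost articles scored Clinical or Functional, demote Review articles.
def priority(scores: dict) -> float:
    return (scores.get("Clinical", 0.0)
            + scores.get("Functional", 0.0)
            - scores.get("Review", 0.0))

# articles: (article_id, scores) pairs from the scoring step; dummy values here.
articles = [
    ("ARTICLE_A", {"Clinical": 0.91, "Functional": 0.12, "Review": 0.03}),
    ("ARTICLE_B", {"Clinical": 0.08, "Functional": 0.05, "Review": 0.97}),
]
ranked = sorted(articles, key=lambda a: priority(a[1]), reverse=True)
# Clinical/Functional articles surface first; the review sinks to the bottom.
```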
Conclusion:
We trained and deployed an LLM that outperformed both human curation and our previous convolutional neural network (CNN)-based model. We found that a smaller transformer model fine-tuned on genomics articles was more cost-effective than a commercial generative pre-trained transformer model.