Conference Program

Natural Language Processing for Deep Phenotyping of Patients Receiving Genomic Testing Enables Effective Gene Prioritization in a Clinical Diagnostics Pipeline 

Laboratory Genetics and Genomics
  • Primary Categories:
    • Laboratory Genetics
  • Secondary Categories:
    • Laboratory Genetics
Introduction:
Phenotype-driven genomic analysis for Mendelian disorders focuses on the genes most likely to harbor disease-causing variants given the patient’s presenting features. Most diagnostic laboratories rely on human curation of clinical notes to obtain the Human Phenotype Ontology (HPO) terms required for gene prioritization workflows. To improve efficiency and reduce human subjectivity in phenotyping, our clinical diagnostic laboratory sought to validate a natural language processing (NLP)-based tool that comprehensively extracts clinical features and converts them into HPO terms for patients receiving genomic testing. On the assumption that most Mendelian diseases have a single genetic etiology, we also created a cross-sectional database tool that identifies a list of genes associated with a predetermined number of HPO terms, set on a sliding scale according to the user’s desired specificity.
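The gene-list idea described above can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the function name `genes_by_overlap`, the toy gene-to-HPO annotations, and the "at least `min_shared` terms" reading of the sliding scale are hypothetical and do not describe the authors' tool.

```python
# Hypothetical toy annotations: gene -> set of associated HPO term IDs.
# The IDs are used only as illustrative strings.
GENE_TO_HPO = {
    "GENE_A": {"HP:0001250", "HP:0001263", "HP:0000252"},
    "GENE_B": {"HP:0001250", "HP:0004322"},
    "GENE_C": {"HP:0000252"},
}


def genes_by_overlap(patient_terms, gene_to_hpo, min_shared):
    """Return (gene, shared_count) pairs for genes that share at least
    `min_shared` HPO terms with the patient, highest overlap first."""
    patient = set(patient_terms)
    hits = []
    for gene, terms in gene_to_hpo.items():
        shared = len(patient & terms)
        if shared >= min_shared:
            hits.append((gene, shared))
    # Sort by descending overlap, then by gene name for a stable order.
    return sorted(hits, key=lambda pair: (-pair[1], pair[0]))


patient = ["HP:0001250", "HP:0000252", "HP:0004322"]
loose = genes_by_overlap(patient, GENE_TO_HPO, min_shared=1)   # broader list
strict = genes_by_overlap(patient, GENE_TO_HPO, min_shared=2)  # narrower list
```

Raising `min_shared` trades sensitivity for specificity: the list shrinks, but a causative gene annotated with few of the patient's terms may drop out.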

Methods:
We identified two groups of 50 patients who were evaluated by our clinical Genetics providers and received exome sequencing (ES) that yielded a diagnostic finding associated with a single gene. For each patient, we collated two clinic notes written before the diagnosis was known: one from a Genetics provider and one from a non-Genetics specialty provider. Two laboratory genetic counselors (Human Reviewers [HRs]) extracted HPO terms corresponding to the phenotypic descriptions in the clinical notes. The same notes were processed for HPO terms by the Linguamatics I2E NLP algorithm, followed by refinement to optimize the accuracy and specificity of the extraction. We developed methods to measure the concordance and dissimilarity of the extracted HPO terms using Jaccard similarity coefficients and distance matrices. Using our tool, HPOverlap, we examined the likelihood that the HPO terms generated by each extractor (HR or NLP) isolate a gene list containing the reported causative gene. Using the publicly available Phen2Gene tool, we also examined the ranking of the causative gene.
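As a rough illustration of the concordance measure named above, the Jaccard similarity of two term sets is the size of their intersection divided by the size of their union, and one minus that value can serve as a distance. The sketch below uses hypothetical extractor names and term sets, and compares terms by exact match only; any ontology-aware distance the authors may have used is not modeled here.

```python
def jaccard(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| of two collections of terms."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 1.0  # convention chosen here: two empty sets count as identical
    return len(a & b) / len(union)


def distance_matrix(term_sets):
    """Pairwise Jaccard distance (1 - similarity) keyed by (name, name)."""
    names = list(term_sets)
    return {
        (x, y): 1.0 - jaccard(term_sets[x], term_sets[y])
        for x in names
        for y in names
    }


# Hypothetical extractions for one case (placeholder term labels).
extracted = {
    "HR1": {"A", "B", "C"},
    "HR2": {"A", "B", "D"},
    "NLP": {"A", "C", "E"},
}
distances = distance_matrix(extracted)
```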

Results:
On average, our two HRs extracted 27.8 and 24.6 HPO terms per case, respectively, compared with 26.7 terms for our NLP algorithm, indicating no statistically significant difference in the number of terms generated. The Jaccard similarity coefficient and distance matrix profiles for the extracted HPO terms indicated that our two HRs were more similar and ontologically closer to each other than either was to NLP, but the differences between each individual HR profile and the NLP profile were not significant. Using our HPOverlap tool, the NLP-extracted terms had a 72% likelihood of retaining the causative gene in a gene list of fewer than 600 genes, compared with 66-72% for our two HRs. These results are consistent with the trend in gene rankings generated by Phen2Gene when the same HPO terms extracted by the HRs and by NLP were supplied. Our customized parameters for the NLP algorithm addressed recurrent obstacles such as ambiguous language in clinical notes, especially negated terms that should exclude a phenotype. We also provided for the exclusion of descriptions of family members’ phenotypes and of discussions of differential diagnoses and their associated features, as these confound the isolation of phenotypes specific to the patient.
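One plausible reading of the retention likelihood reported above is the fraction of cases in which the causative gene appears in a gene list smaller than the size cutoff. The sketch below assumes that definition; the function name `retention_rate`, the case structure, and the toy data are hypothetical and are not the study's data or code.

```python
def retention_rate(cases, max_list_size=600):
    """Fraction of cases whose gene list is smaller than `max_list_size`
    and still contains the causative gene."""
    if not cases:
        raise ValueError("at least one case is required")
    retained = sum(
        1
        for case in cases
        if len(case["gene_list"]) < max_list_size
        and case["causative_gene"] in case["gene_list"]
    )
    return retained / len(cases)


# Hypothetical cases: two retain the causative gene, one does not.
toy_cases = [
    {"causative_gene": "GENE_A", "gene_list": ["GENE_A", "GENE_B"]},
    {"causative_gene": "GENE_C", "gene_list": ["GENE_B", "GENE_D"]},
    {"causative_gene": "GENE_E", "gene_list": ["GENE_E"]},
]
rate = retention_rate(toy_cases)
```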

Conclusion:
We have developed an NLP pipeline in a clinical diagnostic setting that performs comparably to HRs and that we have validated for clinical use when paired with our HPOverlap tool. This enables consistent phenotypic extraction and refinement of HPO terms for patients undergoing exome or genome sequencing (ES/GS), regardless of a laboratory’s access to skilled genetics-trained personnel. Although phenotype-driven analysis is key to effective genomic analysis, genotype-driven analysis is expected to be complementary in cases where the causative gene is novel. By computationally automating these crucial early steps in genomic analysis, we allow laboratory personnel to reinvest the time saved in downstream variant interpretation and in the analysis of complex cases.