Enhancing Genomic Data Accessibility with Hypothes.is Parsing Tool for Variant Annotations
Clinical Genetics and Therapeutics
Primary Categories:
- Clinical Genetics
Secondary Categories:
- Clinical Genetics
Introduction:
Efficient analysis of genomic variant data is critical for advancing our understanding of genetic diseases and identifying genotype-phenotype correlations. Hypothes.is, a collaborative web-based annotation tool, allows users to store annotations of publicly sourced data, including peer-reviewed literature on genetic and genomic data. The University of North Carolina Biocuration and Coordination Core (UNC BCC), a part of the Clinical Genome Resource (ClinGen), leverages hypothes.is for community curation to create repositories of structured annotations for gene and variant adjudication, referred to as “Baseline Annotation.” However, raw data exported from hypothes.is are not directly usable in Excel-based analyses, which limits their utility for large-scale studies. The CSV files downloaded from hypothes.is consolidate all relevant data for an annotation into a single cell, making it difficult to extract the structured and systematic annotations used in ClinGen Baseline Annotation.
Methods:
We developed a parsing tool that uses list comprehension and delimiter identification. Annotations were extracted from hypothes.is and used in iterative testing to convert raw data into a format optimized for downstream analysis and visualization. The tool was initially applied to generate discrete data points from 100 annotations for RAG1 and RAG2, both implicated in Severe Combined Immunodeficiency (SCID). These annotations were collected from multiple annotators in ClinGen’s SCID Variant Curation Expert Panel (VCEP), and therefore contained slight format discrepancies arising from human-generated inconsistencies in spacing, letter case, category titles, etc. Recognizing that annotation protocols may vary slightly across genes and annotators, we are iteratively improving the tool to ensure compatibility with any ClinGen annotation protocol. This process includes making the code less sensitive to case variations and more tolerant of flexible titles for annotation categories. Requiring annotations to adhere to a designated format enhances compatibility with the tool, which in turn facilitates efficient analysis of biomedical literature. This creates a self-reinforcing cycle that promotes more uniform annotation practices and enhances the effectiveness of the analysis.
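The delimiter-based approach described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the tool's actual code: it assumes each annotation is a block of text in which categories are separated by line breaks and each category title is separated from its value by a colon. The names `CANONICAL_TITLES`, `normalize_title`, and `parse_annotation`, the category titles, and the sample values are all hypothetical.

```python
import re

# Hypothetical synonym table: maps title variants to one canonical column name.
# A real ClinGen annotation protocol would define its own category titles.
CANONICAL_TITLES = {
    "variant": "variant",
    "variants": "variant",
    "phenotype": "phenotype",
    "phenotypes": "phenotype",
    "zygosity": "zygosity",
}


def normalize_title(title):
    """Collapse whitespace, lowercase, and map known title variants."""
    key = re.sub(r"\s+", " ", title).strip().lower()
    # Unknown titles pass through unchanged (lowercased) instead of being dropped.
    return CANONICAL_TITLES.get(key, key)


def parse_annotation(text, field_sep="\n", key_sep=":"):
    """Split one annotation's text into a {category: value} dictionary."""
    # List comprehension: keep only segments that contain the title/value
    # delimiter, and split each on its FIRST occurrence only.
    pairs = [
        segment.split(key_sep, 1)
        for segment in text.split(field_sep)
        if key_sep in segment
    ]
    return {normalize_title(title): value.strip() for title, value in pairs}


# Invented example with inconsistent spacing, case, and title wording.
raw = "Variant:  c.100A>G \nPHENOTYPES : T-B- SCID\nzygosity:homozygous"
row = parse_annotation(raw)
print(row)
```

Splitting on only the first delimiter keeps values that themselves contain a colon intact, and lowercasing plus a small synonym table is one simple way to absorb the spacing, case, and title differences that arise between annotators. Each resulting dictionary corresponds to one spreadsheet row with one column per category.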
Results:
The parsing tool is currently being developed into an interface that minimizes user workload and is under evaluation by UNC BCC biocurators. By making hypothes.is exports compatible with spreadsheet-based analysis, the tool promotes data sharing and collaboration between researchers across institutions. Specifically, it provides a repository of pre-annotated data on genes, variants, phenotypes, and more, which ClinGen expert panels can use in large-scale analyses to aid efforts in defining gene-disease relationships and specifying criteria for variant classification (e.g., phenotyping and PP4). The tool supports rapid categorization of genetic variants and comprehensive analysis of related clinical, demographic, and laboratory data.
The main limitation of this study lies in the tool's current dependence on pre-established ClinGen annotation protocols. While the tool is being refined to accommodate variable annotations and formatting, it performs best on datasets that follow standardized annotation protocols. Additional development is required for compatibility with annotations that deviate from these protocols or that lack a well-defined structure. We plan to examine the efficacy of the tool in a broader range of genes and in more complex variant categories, such as structural variants or variants in non-coding regions, for which it has yet to be validated.
Conclusion:
With continued refinement, this tool has the potential to significantly increase the accessibility and usability of annotated data by automating a previously manual step essential for analysis of any hypothes.is data and enabling researchers to integrate these annotations into their workflows more easily. This automation not only supports data accuracy and standardization, but also makes the Baseline Annotation infrastructure more approachable and user-friendly. By making annotations accessible to a wider range of researchers, we hope the tool fosters cross-institutional collaboration to advance genomic and precision medicine.