MedGen – an integrated and comprehensive resource for medical genetics concepts and concept-associated information
Clinical Genetics and Therapeutics
Primary Categories:
- Health services and Implementation
Secondary Categories:
- Health services and Implementation
Introduction:
Medical genetics phenotype terms are available in various vocabularies and ontologies, each created to serve a specific medical specialty or community need. Because each data source covers its own subset of concepts, the sources overlap only partially. Some sources also provide concept relationships, including disease hierarchies, disease-to-clinical-feature links, and associated genes. A comprehensive resource for medical genetics requires that data from all these sources be integrated, that equivalent concepts be mapped under a common record with a unique identifier, and that any gaps be filled. In addition, connections between related concepts need to be processed to generate a more complete picture that facilitates biomedical research and understanding. Integrating such diverse and expanding volumes of data from multiple sources is a challenge for many researchers and clinicians. MedGen fulfills this need by integrating concepts and their interrelationships from a variety of sources and adding new concepts where needed. It also aids discoverability by linking concepts to relevant resources within and outside NCBI.
Methods:
Most data in MedGen is processed through automated term updates and alignment across data sources, with occasional manual curation. When new versions of source data become available, they are downloaded and processed to update existing terms and add new terms to MedGen, while matching identical terms from different sources. Depending on the data source, mapping is done between identifiers, concept names, or both. The primary data sources for MedGen include ontologies like HPO and Mondo, vocabularies including Orphanet, medical genetics concepts from the UMLS Metathesaurus, OMIM and PharmGKB, the GARD information center, and NCBI resources such as GeneReviews. These sources provide both concept information and interrelationships between concepts and genes. Where manual curation is needed, it follows a standard decision tree to evaluate the source term against potential matches in MedGen. We also collaborate with our data sources to resolve concept discrepancies or to add new terms.
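The matching step described above (by identifier, by concept name, or both) can be illustrated with a minimal sketch. This is not MedGen's actual pipeline: the `ConceptStore` class, the `X0001`-style identifiers, the `SRC_A:1`-style source identifiers, and the name normalization rule are all hypothetical, chosen only to show how an incoming source term might be attached to an existing record or given a new one.

```python
from dataclasses import dataclass, field


@dataclass
class Concept:
    concept_id: str                              # hypothetical internal identifier
    name: str                                    # primary concept name
    synonyms: set = field(default_factory=set)
    xrefs: set = field(default_factory=set)      # equivalent source identifiers


def normalize(name):
    """Assumed normalization: case-insensitive, whitespace-collapsed."""
    return " ".join(name.lower().split())


class ConceptStore:
    def __init__(self):
        self.concepts = {}   # concept_id -> Concept
        self.by_xref = {}    # source identifier -> Concept
        self.by_name = {}    # normalized name -> Concept
        self._next = 1

    def add_source_term(self, source_id, name, mapped_ids=()):
        """Attach a source term to a matching concept, or create a new one."""
        ids = (source_id, *mapped_ids)
        # 1) Prefer an identifier match (the source's own ID or a declared mapping).
        concept = next((self.by_xref[i] for i in ids if i in self.by_xref), None)
        # 2) Fall back to a match on the normalized concept name.
        if concept is None:
            concept = self.by_name.get(normalize(name))
        # 3) No match: fill the gap with a new concept record.
        if concept is None:
            concept = Concept(f"X{self._next:04d}", name)
            self._next += 1
            self.concepts[concept.concept_id] = concept
        # Record the new name and identifiers on the chosen concept.
        if normalize(name) != normalize(concept.name):
            concept.synonyms.add(name)
        for i in ids:
            concept.xrefs.add(i)
            self.by_xref.setdefault(i, concept)
        self.by_name.setdefault(normalize(name), concept)
        return concept


store = ConceptStore()
a = store.add_source_term("SRC_A:1", "Marfan syndrome")
b = store.add_source_term("SRC_B:77", "marfan  syndrome")            # matched by name
c = store.add_source_term("SRC_C:9", "MFS", mapped_ids=["SRC_A:1"])  # matched by identifier
print(a.concept_id, sorted(a.xrefs), sorted(a.synonyms))
```

In this sketch all three source terms end up under one record; cases where identifier and name evidence disagree are the kind that would go to manual curation rather than being resolved automatically.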
Results:
Data is available through a web interface and is also downloadable. The web interface supports search by clinical phenotypes, genes, source identifiers, and other attributes. The information available for each concept includes a unique identifier from either UMLS or MedGen, a primary concept name with exact synonyms, mode of inheritance, genes associated with the condition, equivalent source identifiers, disease description, term hierarchy, and clinical features. As of October 2024, MedGen has a total of 228,698 concepts, with concept interrelationships drawn from seven primary sources. These concepts have stable identifiers that other resources can use to link to MedGen. In addition, MedGen provides linkouts to NCBI resources and external resources relevant to basic and applied research, clinical observations, molecular and biochemical testing, and adaptive best practices for clinical care. For users who would like the entire dataset or larger subsets of MedGen, FTP reports and APIs are available.
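As an illustration of programmatic access, the sketch below builds a search URL without sending any request. The abstract does not name the API; this assumes NCBI's general Entrez E-utilities `esearch` endpoint with `medgen` as the database name, and the helper `build_medgen_search_url` is a hypothetical name introduced here.

```python
from urllib.parse import urlencode

# Assumption: MedGen is queried through the general NCBI E-utilities esearch endpoint.
EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_medgen_search_url(term, retmax=20):
    """Return an esearch URL for a free-text MedGen query (no network call is made)."""
    params = {
        "db": "medgen",       # assumed Entrez database name for MedGen
        "term": term,         # e.g. a clinical phenotype or gene symbol
        "retmode": "json",    # ask for a JSON response
        "retmax": retmax,     # maximum number of record identifiers to return
    }
    return f"{EUTILS_ESEARCH}?{urlencode(params)}"


print(build_medgen_search_url("Marfan syndrome"))
```

The resulting URL can be fetched with any HTTP client; NCBI's usage guidelines on request rates and API keys apply to such calls.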
Conclusion:
To understand a disease, a biomedical researcher or clinician might need to know all the clinical features, disease descriptions, drug responses, genes, and diseases related by common genes or other attributes, as well as links to other relevant information. When all this information is available in a single resource, it helps expose the relationships between seemingly discrete entities and supports a broader and deeper understanding from which to generate hypotheses for further investigation or treatment. MedGen facilitates this process by integrating source vocabularies and ontologies, together with gene and disease concept interrelationships, in one place and enabling easy navigation between these entities. This work was supported by the National Center for Biotechnology Information of the National Library of Medicine (NLM), National Institutes of Health.