Applying data science methodologies with artificial intelligence variant reinterpretation to map and estimate genetic disorder prevalence utilizing clinical data
Clinical Genetics and Therapeutics
Primary Categories:
- Public Health Genetics
Secondary Categories:
- Public Health Genetics
Introduction:
Clinical genetic data often reside in the Electronic Health Record (EHR) in an unstructured format and thus are not readily available for analysis. We addressed this problem by obtaining the data for each variant and linking them to the patient's genetic and demographic data, while sorting genes by the inheritance pattern of their associated disorders. This allowed us to estimate the prevalence of genetic disorders in our catchment area and to map those disorders to our local geography using ZIP codes.
Methods:
Germline variant data were derived from clinical genetic testing sent from Valley Children’s Hospital (VCH) to reference laboratories. Discrete variant data were extracted from the VCH EHR and validated, using Microsoft SQL, against variant data provided by the reference laboratories as spreadsheets to produce the final variant database. Automated reinterpretation of all variants was attempted using the Franklin© AI. Variant data were matched to EHR-derived demographic data by name, age, date of birth (DOB), ZIP code and medical record number (MRN) using Microsoft Structured Query Language (SQL). ZIP codes were converted to ZIP code tabulation areas (ZCTAs) to correspond with the geographic distribution of populations per the 2010 United States Census. Relevant demographic and genomic data were merged to create a database, which was then used for exploratory data analysis (EDA) in Microsoft Power BI©. A choropleth map of genomic variation within our data set was generated by mapping variants to the ZCTAs of our geographic catchment area in California. The most common genetic disorders in our catchment area and in individual ZCTAs were determined from the genetic and demographic data.
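The linkage and mapping steps described above can be illustrated with a minimal Python sketch. It is not the SQL or Power BI workflow used in this study: the field names, the records, the ZIP-to-ZCTA crosswalk, and the choice to match on MRN and DOB only are all hypothetical assumptions made for illustration.

```python
from collections import Counter

# Hypothetical records; every field name and value here is illustrative only.
variants = [
    {"mrn": "A001", "dob": "2015-03-02", "gene": "GENE1", "classification": "P"},
    {"mrn": "A002", "dob": "2018-11-20", "gene": "GENE2", "classification": "VUS"},
    {"mrn": "A003", "dob": "2012-07-09", "gene": "GENE1", "classification": "LP"},
    {"mrn": "A004", "dob": "2020-01-15", "gene": "GENE3", "classification": "P"},
]
demographics = [
    {"mrn": "A001", "dob": "2015-03-02", "zip": "00001"},
    {"mrn": "A002", "dob": "2018-11-20", "zip": "00002"},
    {"mrn": "A003", "dob": "2012-07-09", "zip": "00003"},
]
# Hypothetical crosswalk: several ZIP codes may fall within one ZCTA.
zip_to_zcta = {"00001": "00001", "00002": "00002", "00003": "00001"}


def link_variants(variants, demographics, zip_to_zcta):
    """Attach a ZCTA to each variant whose (MRN, DOB) pair has a demographic match."""
    by_patient = {(d["mrn"], d["dob"]): d for d in demographics}
    linked, unlinked = [], []
    for v in variants:
        demo = by_patient.get((v["mrn"], v["dob"]))
        zcta = zip_to_zcta.get(demo["zip"]) if demo else None
        if zcta is None:
            unlinked.append(v)  # no demographic match or no crosswalk entry
        else:
            linked.append({**v, "zcta": zcta})
    return linked, unlinked


def count_by_zcta(linked, classes=("P", "LP")):
    """Count variants per ZCTA, restricted to the given classifications."""
    return Counter(v["zcta"] for v in linked if v["classification"] in classes)


linked, unlinked = link_variants(variants, demographics, zip_to_zcta)
plp_counts = count_by_zcta(linked)
print(dict(plp_counts), len(unlinked))
```

Per-ZCTA counts of this kind are the values a choropleth map would shade. Records left in `unlinked` correspond to variants that cannot be placed geographically.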
Results:
A total of 3044 individuals with variants were identified and, of these, 98% were linked to geographic data from the EHR. There was a total of 3065 variants of all pathogenicity interpretations: 2474 (80.7%) were monogenic variants, 584 (19.1%) were structural variants (SVs), four (0.1%) were repeat expansion variants and three (0.1%) were methylation variants. Franklin© was able to parse 2688 (88%) of all monogenic variants and SVs, and its interpretation differed from the original report for 658 (24%) of the parsed variants. Of these reinterpreted variants, 243 (37%) were upgraded (e.g., benign > pathogenic) and 415 (63%) were downgraded (e.g., pathogenic > benign). Clinically significant Franklin© interpretation changes were defined as a change from pathogenic or likely pathogenic (P/LP) to any other interpretation, or from any other interpretation to P/LP. There were 156 clinically significant interpretation changes in single-gene variants and none in SVs. A table of P/LP variants filterable by inheritance pattern, pathogenicity, sex, and zygosity was generated in Microsoft Power BI© and showed 739 genetic disorders in our database. The prevalence of genetic disorders per inheritance pattern, sex, and zygosity was determined. The most salient finding was that the most common autosomal recessive (AR) disorder was PEX6-associated Zellweger Spectrum Disorder (ZSD). Seven homozygotes for a P/LP variant in PEX6 were identified, of whom all but one were homozygous for the same pathogenic variant (NM_000287.4:c.1409G>C). Geographic mapping of this PEX6 variant placed these six affected individuals in the cities of Bakersfield, Paso Robles and Santa Maria.
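The upgrade, downgrade, and clinically significant categories defined above can be expressed as a short Python sketch. It assumes the usual five-tier ordering (benign < likely benign < VUS < likely pathogenic < pathogenic); the tier abbreviations and function names are hypothetical and do not come from Franklin© or from the study's own code.

```python
# Assumed ordering of the five interpretation tiers, lowest to highest.
RANK = {"B": 0, "LB": 1, "VUS": 2, "LP": 3, "P": 4}
PLP = {"P", "LP"}


def change_direction(original, reinterpreted):
    """Return 'upgraded', 'downgraded', or 'unchanged' for one variant."""
    delta = RANK[reinterpreted] - RANK[original]
    if delta > 0:
        return "upgraded"
    if delta < 0:
        return "downgraded"
    return "unchanged"


def is_clinically_significant(original, reinterpreted):
    """True when the change moves a variant into or out of the P/LP group."""
    return (original in PLP) != (reinterpreted in PLP)


# Illustrative pairs of (original report, reinterpretation); not study data.
examples = [("B", "P"), ("P", "B"), ("VUS", "LB"), ("LP", "P"), ("VUS", "LP")]
for original, new in examples:
    print(original, new, change_direction(original, new),
          is_clinically_significant(original, new))
```

Under this definition a change within the P/LP group (for example, LP to P) counts as an upgrade but not as clinically significant, whereas VUS to LP counts as both.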
Conclusion:
We combined data science and Mendelian genetic principles to facilitate EDA of clinical genetic data, using AI and choropleth maps to gain novel insights into our patient population. These methodologies allowed us to estimate the prevalence of genetic disorders within our catchment area, localize geographies enriched for variants, identify patients who were eligible for emerging therapies and systematically reinterpret variants using AI.