Implementing Structural Variant Analysis in Codicem, a Comprehensive Clinical Variant Analysis Platform
Laboratory Genetics and Genomics
Primary Categories:
- Genomic Medicine
Secondary Categories:
- Genomic Medicine
Introduction:
Genome sequencing (GS) is an effective tool in a variety of clinical contexts, especially for the diagnosis of rare disease. To provide clinicians and patients with concise, actionable, and specific genomic information, patient DNA undergoes a complex, phased analysis that includes algorithmic, informatic, and human components. Analysis can be considered in three stages: primary (sequencing), secondary (variant identification), and tertiary (curation and interpretation). Tertiary analysis is itself a multi-stage process that benefits from an iterative, software-aided approach in which experts annotate variants with biological and clinical data; filter variants to a human-understandable scope; curate variants to a case-specific hypothesis; interpret variants based on clinical evidence; and report case results to end users (medical providers, scientists, and patients).
Variant annotation and filtering are particularly challenging in GS, where secondary analysis yields ~4.5 million single nucleotide variants (SNVs), ~1.0 million insertions/deletions (indels), and ~25,000 structural variants (SVs). Time and staffing constraints typically limit expert curation capacity to tens to hundreds of variants. Annotations for effective filtering and interpretation typically include gene overlaps, protein effects, population frequencies, previous clinical interpretations, and functional predictions.
Here we describe Codicem, a comprehensive clinical genome variant analysis and interpretation web application that has been used for 8 years to support research and clinical GS at the HudsonAlpha Institute for Biotechnology. We also describe ongoing and planned efforts to open-source Codicem, improve its ease of use and maintenance, and adapt it for use with “long-read” GS (lrGS) technologies, especially for SV analyses.
Methods:
Codicem is a mature software product for SNV and indel analysis from short-read genomes. It ingests variant data in Variant Call Format (VCF), annotates variants, provides a filtration engine, allows analysts to curate variants, and produces a customizable report. Codicem supports team curation with multiple analysts per case and directors with clinical report sign-out authority.
To extend Codicem’s capabilities for SV analyses powered by lrGS, we defined software requirements for SV ingestion, annotation, and filtration. We determined a set of overlap summary annotations based on genomic location, such as the percentage of an SV covered by repetitive sequence or a boolean intersection with a user-defined gene list. We account for inherent variability in SV biology and variant calling by matching patient SVs to variant-specific annotations with a flexible breakpoint query based on reciprocally overlapping genomic coordinates and SV type. To visualize SVs in genomic context, we deployed a custom installation of the UCSC Genome Browser and integrated it into Codicem. An embedded Integrative Genomics Viewer (IGV) also displays read alignments to allow QC assessment of SV calls.
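The reciprocal-overlap matching and the repeat-content summary described above can be illustrated with a minimal Python sketch. This is not Codicem's implementation: the `SV` class, the function names, the half-open coordinate convention, and the 0.8 reciprocal-overlap threshold are all assumptions made for illustration. Zero-length events such as insertions would need separate handling (for example, a breakpoint-distance window) and are simply treated as non-matching here.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass(frozen=True)
class SV:
    """Hypothetical SV record; coordinates are 0-based, half-open."""
    chrom: str
    start: int
    end: int
    sv_type: str  # e.g. "DEL", "DUP", "INV"


def overlap_bp(a_start: int, a_end: int, b_start: int, b_end: int) -> int:
    """Number of bases shared by two half-open intervals."""
    return max(0, min(a_end, b_end) - max(a_start, b_start))


def reciprocal_overlap(a: SV, b: SV) -> float:
    """Smaller of the two overlap fractions (shared / own length)."""
    len_a, len_b = a.end - a.start, b.end - b.start
    if a.chrom != b.chrom or len_a <= 0 or len_b <= 0:
        return 0.0
    shared = overlap_bp(a.start, a.end, b.start, b.end)
    return min(shared / len_a, shared / len_b)


def matches(patient: SV, annotation: SV, min_reciprocal: float = 0.8) -> bool:
    """Match on SV type plus an assumed reciprocal-overlap threshold."""
    return (
        patient.sv_type == annotation.sv_type
        and reciprocal_overlap(patient, annotation) >= min_reciprocal
    )


def percent_covered(sv: SV, intervals: Iterable[Tuple[int, int]]) -> float:
    """Percent of the SV covered by intervals on the same chromosome.

    Intervals (e.g. repeat annotations) may overlap one another; they are
    clipped to the SV and merged so that no base is counted twice.
    """
    length = sv.end - sv.start
    if length <= 0:
        return 0.0
    clipped = sorted(
        (max(s, sv.start), min(e, sv.end))
        for s, e in intervals
        if overlap_bp(sv.start, sv.end, s, e) > 0
    )
    covered = 0
    cur_start = cur_end = None
    for s, e in clipped:
        if cur_end is None or s > cur_end:
            if cur_end is not None:
                covered += cur_end - cur_start
            cur_start, cur_end = s, e
        else:
            cur_end = max(cur_end, e)
    if cur_end is not None:
        covered += cur_end - cur_start
    return 100.0 * covered / length


if __name__ == "__main__":
    patient = SV("chr1", 1000, 2000, "DEL")
    known = SV("chr1", 1100, 2100, "DEL")
    print(matches(patient, known))
    print(percent_covered(patient, [(900, 1200), (1100, 1300), (1900, 2500)]))
```

Taking the minimum of the two overlap fractions means that a small annotated SV nested inside a much larger patient SV (or the reverse) does not count as a match, while calls whose breakpoints differ by a modest amount still do.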
Finally, we developed an open-source software roadmap, including upgrades to the back-end installation, deployment, and data ingestion infrastructures.
Results:
To date, Codicem has been used to analyze ~6000 genomes across ~15 research projects since migration to hg38. A separate Codicem instance is maintained as part of a CAP/CLIA-validated clinical genome test and has been used to issue ~2000 interpreted genome reports. SV-enabled Codicem implementation is ongoing via agile software development. Improvements to the Codicem installation and deployment systems will support an open-source release to facilitate community use of the tool.
Conclusion:
Tertiary analysis of clinical GS data, which includes annotation, filtering, curation, and reporting of variants, is a complex, software-driven process. Small variant analysis in Codicem is an integral part of our research and clinical genomics programs, but opportunities exist for improvement, especially with the advent of lrGS and increased focus on structural variation. While some commercial tools provide functionality similar to Codicem's, there are few open-source, freely available applications; therefore, an open-source and SV-enabled release of Codicem may facilitate expanded research and clinical benefits of GS.