Finding Malignant Variants with Oncological Genetic Analysis
Next-generation sequencing (NGS) generates several million to several billion short reads from the DNA and RNA isolated from a sample. In contrast to traditional Sanger sequencing, which produces reads of 500-900 base pairs (bp), NGS short reads range from 75 to 300 bp depending on the application and sequencing chemistry.
The bioinformatics pipeline for a typical DNA sequencing strategy begins by aligning the raw sequence reads from a FASTQ or unaligned BAM (uBAM) file against the human reference genome. The FASTQ format stores each short read as plain text together with its metadata, namely a read identifier and per-base quality scores.
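To make the FASTQ layout concrete, the following is a minimal sketch of a reader, assuming uncompressed four-line records and Phred+33 quality encoding. The function name `read_fastq` and the example read are hypothetical and are not taken from any particular pipeline.

```python
from io import StringIO


def read_fastq(handle):
    """Yield (read_id, sequence, phred_scores) for each four-line FASTQ record.

    Assumes one record per four lines and Phred+33 quality encoding.
    """
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            break  # end of file
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line, ignored here
        quality = handle.readline().rstrip("\n")
        # Phred+33: quality score = ASCII code of the character minus 33
        scores = [ord(ch) - 33 for ch in quality]
        yield header[1:], sequence, scores  # drop the leading '@'


# Hypothetical single-read example
example = StringIO("@read_1\nACGT\n+\nII5!\n")
records = list(read_fastq(example))
print(records)
```

Production pipelines typically read compressed files and validate record structure, which this sketch omits.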
Sequence alignment assigns each short read a positional context in the reference genome and generates several metadata fields, including alignment characteristics (matches, mismatches, and gaps) encoded in Compact Idiosyncratic Gapped Alignment Report (CIGAR) format. The aligned sequences and the related metadata are stored in Sequence Alignment/Map (SAM) format or its binary equivalent, BAM. Downstream algorithms consume the BAM file to identify a range of genetic alterations, including single nucleotide variants and insertions and deletions (indels), and to derive measures such as tumor mutation burden.
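As an illustration of how the alignment string encodes matches and gaps, here is a small sketch that parses it into operations and computes how many reference and read bases an alignment covers. The helper names and the example string are hypothetical; the operation letters follow the SAM convention, in which `M` covers both matches and mismatches.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
REF_CONSUMING = set("MDN=X")    # operations that advance along the reference
QUERY_CONSUMING = set("MIS=X")  # operations that advance along the read


def parse_cigar(cigar):
    """Split an alignment string such as '70M2I20M' into (length, op) pairs."""
    ops = [(int(n), op) for n, op in CIGAR_OP.findall(cigar)]
    # Reject strings containing anything other than valid length/op pairs
    if "".join(f"{n}{op}" for n, op in ops) != cigar:
        raise ValueError(f"malformed alignment string: {cigar!r}")
    return ops


def reference_span(cigar):
    """Number of reference bases covered by the alignment."""
    return sum(n for n, op in parse_cigar(cigar) if op in REF_CONSUMING)


def query_length(cigar):
    """Number of read bases described (soft clips included, hard clips not)."""
    return sum(n for n, op in parse_cigar(cigar) if op in QUERY_CONSUMING)


# Hypothetical 100 bp read: 5 soft-clipped bases, a 2 bp insertion, a 3 bp deletion
example_cigar = "5S70M2I20M3D3M"
print(parse_cigar(example_cigar))
print(reference_span(example_cigar), query_length(example_cigar))
```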
Laboratories commonly estimate copy number alterations (CNA) from aligned sequencing reads by using the depth-of-coverage approach. More extensive and specific DNA sequencing strategies also enable identification of large structural variants (SV), including gene fusions, as well as microsatellite instability. In addition, a split-read alignment strategy identifies gene fusions from genomic DNA sequencing.
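The depth-of-coverage idea can be sketched as a log2 ratio of normalized read depth between a tumor sample and a reference. The version below is a deliberately simplified illustration: it assumes a matched normal sample, per-region mean depths that have already been computed, and median normalization. The function name, region names, and depths are hypothetical, and corrections that real callers apply (for example GC content and tumor purity) are omitted.

```python
import math
from statistics import median


def log2_copy_ratios(tumor_depth, normal_depth):
    """Return per-region log2(tumor / normal) after median normalization.

    Both arguments map region name -> mean read depth. Regions with zero
    depth in either sample are skipped because the ratio is undefined.
    """
    tumor_median = median(tumor_depth.values())
    normal_median = median(normal_depth.values())
    ratios = {}
    for region, t_depth in tumor_depth.items():
        n_depth = normal_depth.get(region, 0)
        if t_depth <= 0 or n_depth <= 0:
            continue
        # Normalize each sample by its own median to remove library-size effects
        ratio = (t_depth / tumor_median) / (n_depth / normal_median)
        ratios[region] = math.log2(ratio)
    return ratios


# Hypothetical depths: region_B has twice the expected coverage in the tumor
tumor = {"region_A": 100, "region_B": 200, "region_C": 100}
normal = {"region_A": 100, "region_B": 100, "region_C": 100}
copy_ratios = log2_copy_ratios(tumor, normal)
print(copy_ratios)
```

A log2 ratio near 0 suggests no change relative to the reference, while positive and negative values suggest gain and loss respectively.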
The results of variant identification are stored in a variant file format such as variant call format (VCF), genome VCF (gVCF), or general feature format (GFF), among others. These formats allow the encoding of quantitative information about the variant, such as variant allele fraction, depth of coverage at the variant position, and genotype quality. Given the more complex representation of CNA and large SV, including gene fusions, there is ongoing work on using alternative file formats to represent such data types appropriately.
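As a sketch of how these quantitative fields can be read back, the example below parses a single-sample VCF data line and derives the variant allele fraction from the allelic depths. It assumes the caller wrote the `AD`, `DP`, and `GQ` keys in the FORMAT column and that the site has a single alternate allele; not every variant caller emits these keys. The function name and the record itself are hypothetical.

```python
def summarize_vcf_record(vcf_line):
    """Extract allele fraction, depth, and genotype quality from one VCF line.

    Assumes a single sample column, one alternate allele, and AD/DP/GQ
    present in the FORMAT column.
    """
    columns = vcf_line.rstrip("\n").split("\t")
    chrom, pos, _variant_id, ref, alt = columns[:5]
    # FORMAT (column 9) names the colon-separated values in the sample column
    sample = dict(zip(columns[8].split(":"), columns[9].split(":")))
    ref_reads, alt_reads = (int(x) for x in sample["AD"].split(","))
    informative = ref_reads + alt_reads
    return {
        "variant": f"{chrom}:{pos}{ref}>{alt}",
        "vaf": alt_reads / informative if informative else 0.0,
        "depth": int(sample["DP"]),
        "genotype_quality": int(sample["GQ"]),
    }


# Hypothetical record: 50 of 200 informative reads support the alternate allele
line = "chr1\t12345\t.\tT\tG\t.\tPASS\t.\tGT:AD:DP:GQ\t0/1:150,50:200:99"
summary = summarize_vcf_record(line)
print(summary)
```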
Finally, the downstream bioinformatics analysis for DNA sequence variants involves queries across multiple genomic databases to extract meaningful information about gene and variant nomenclature, variant prevalence, functional impact, and assertion of clinical significance. A user interface renders and visualizes annotated DNA sequence variants, CNA, SV, and other genetic alterations. Such a user interface enables trained molecular pathologists and practitioners to interpret the clinical significance of the genetic alterations and release a comprehensive molecular report.
Additional important applications of bioinformatics in molecular laboratory operations include quality control monitoring of sequencing data across runs, identification of background sequencing noise to reduce false-positive results, validation of upgrades to the bioinformatics pipeline, and the development and validation of novel algorithms for sequence data processing and variant interpretation.