Paper - Review
DOI: 10.1038/s41588-018-0316-4
Abstract
The human reference genome
→ serves as the foundation for genomics
← by providing a scaffold ← for alignment of sequencing reads
← but currently only reflects a single consensus haplotype
∴ Impairing analysis accuracy
A graph reference genome implementation
→ enables read alignment ← across 2.8 K diploid genomes
← encompassing 1⃣ 12.6 M SNPs 2⃣ 4.0 M insertions & deletions
Using a graph genome reference
→ improves → read mapping sensitivity
→ produces a 0.5% increase in variant calling recall
Iterative augmentation of graph genome
→ yields incremental gains ← in variant calling accuracy
Introduction
Human reference genome
→ a standardized coordinate system
→ for 1⃣ annotating genomic elements 2⃣ comparing individual human genomes
∴ Underpinning the quality
← of 1⃣ all ensuing analyses 2⃣ ultimately the ability
→ to draw conclusions of (clinical significance) ← from DNA sequencing
The current (human reference genome)
→ a linear haploid DNA sequence
→ poses practical limitations
∵ the prevalence of (genetic diversity) ← in human populations
Genetic divergence
→ may cause sequencing reads → 1⃣ to map incorrectly 2⃣ to fail to map altogether
Recent large-scale re-sequencing efforts → have
→ 1⃣ comprehensively catalogued common genetic variants 2⃣ prompted suggestions → to make use of (this information) ← through multi-genome references
Current multi-genome graph reference implementations
→ are orders of magnitude slower ← than (conventional linear reference genome-based methods)
Present (a graph genome pipeline)
→ for 1⃣ building 2⃣ augmenting 3⃣ storing 4⃣ querying 5⃣ variant calling
← from (graph genome) ← composed of a population of (genome sequences)
Results
A computationally efficient graph genome implementation
A graph genome data structure
← represents genomic sequences ← on the edges of the graph
A graph genome
→ is constructed ← from a population of genome sequences
A process → to build a graph genome
← using VCF files indicating genetic variants
← with respect to a standard linear reference genome
This method
→ ensures backward compatibility of graph coordinates
→ to linear reference genome coordinates
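A minimal sketch of the construction described above, not the paper's implementation. It assumes that vertices are simply 0-based offsets into the linear reference, so every graph vertex maps back to a reference coordinate by construction. The names `Edge`, `build_graph` and `spell_paths` are hypothetical, and variants are given as plain `(pos, ref, alt)` tuples rather than parsed from a real VCF file.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Edge:
    src: int      # vertex = 0-based reference offset where the edge starts
    dst: int      # vertex = 0-based reference offset where the edge ends
    seq: str      # sequence carried on the edge
    is_ref: bool  # True if the edge spells reference sequence


def build_graph(reference, variants):
    """Build edges from a reference string and (pos, ref, alt) records.

    Records follow the VCF convention of a non-empty REF allele,
    but pos is 0-based here for brevity (an assumption of this sketch).
    """
    cuts = {0, len(reference)}
    for pos, ref, alt in variants:
        if not ref or reference[pos:pos + len(ref)] != ref:
            raise ValueError(f"REF allele mismatch at {pos}")
        cuts.update((pos, pos + len(ref)))
    ordered = sorted(cuts)
    # Reference edges between consecutive breakpoints.
    edges = [Edge(a, b, reference[a:b], True) for a, b in zip(ordered, ordered[1:])]
    # One alternate edge per variant, spanning the replaced reference interval.
    edges += [Edge(p, p + len(r), alt, False) for p, r, alt in variants]
    return edges


def spell_paths(edges, start, end):
    """Return every sequence spelled by a path from start to end."""
    out_edges = {}
    for e in edges:
        out_edges.setdefault(e.src, []).append(e)

    def walk(vertex):
        if vertex == end:
            yield ""
            return
        for e in out_edges.get(vertex, []):
            for rest in walk(e.dst):
                yield e.seq + rest

    return sorted(walk(start))


# One SNP (G>T at offset 2) and one deletion (AC>A at offset 4).
edges = build_graph("ACGTACGT", [(2, "G", "T"), (4, "AC", "A")])
haplotypes = spell_paths(edges, 0, 8)
```

Because every edge ends at a larger reference offset than it starts from, the toy graph is acyclic, which matches the no-cycles design choice noted in these notes.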
Genomic features
← such as 1⃣ tandem repeat expansions 2⃣ inversions
→ are represented → as 1⃣ insertions 2⃣ sequence replacements ← in the graph
Do NOT ❌ support → bi-directionality & cycles
∵ these structures impose ← unnecessary computational complexity
Use → a hash table
← which associates (short sequences of length k, i.e. k-mers)
← along all valid paths in the graph (← with their graph coordinates)
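A rough sketch of such a k-mer hash table, under stated assumptions: the toy graph `EDGES`, the function name `build_kmer_index`, and the use of `(edge_id, offset)` as the graph coordinate are all hypothetical choices for illustration. The paper's actual index layout is not described in these notes.

```python
from collections import defaultdict

# Toy graph: edge_id -> (source vertex, destination vertex, sequence).
EDGES = {
    "e0": (0, 1, "ACG"),
    "e1": (1, 2, "T"),    # reference allele
    "e2": (1, 2, "G"),    # alternate allele (SNP)
    "e3": (2, 3, "CCA"),
}


def build_kmer_index(edges, k):
    """Map each k-mer on any valid path to the set of coordinates where it starts."""
    outgoing = defaultdict(list)
    for eid, (src, _, _) in edges.items():
        outgoing[src].append(eid)
    index = defaultdict(set)

    def extend(eid, offset, prefix, start):
        # Take as many bases as still needed from this edge.
        _, dst, seq = edges[eid]
        prefix += seq[offset:offset + k - len(prefix)]
        if len(prefix) == k:
            index[prefix].add(start)
            return
        # The edge ended before k bases were collected: branch into successors.
        for nxt in outgoing[dst]:
            extend(nxt, 0, prefix, start)

    for eid, (_, _, seq) in edges.items():
        for off in range(len(seq)):
            extend(eid, off, "", (eid, off))
    return dict(index)


index = build_kmer_index(EDGES, 3)
# "CGT" and "CGG" both start at ("e0", 1): one follows the reference
# allele, the other the alternate allele.
```

A seed lookup is then a single dictionary access per read k-mer. In this sketch, k-mers that would run past the last vertex are not indexed.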
Improved read mapping accuracy using graph genomes
Developed (a graph aligner) → for short reads
← which uses the k-mer index for seeding
→ to support (genomic analysis) ← on graph genome implementation
Read alignments
← against a path in the graph
→ are projected to the standard reference genome
∴ The output format → maintains (full compatibility)
← with existing genomics data processing tools
The reads are placed
→ at the closest reference position
∴ Downstream analysis tools → can access these reads conveniently
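A small sketch of how a graph coordinate might be projected onto the linear reference. The notes only say reads are placed at the closest reference position. The rule used here is an assumption of this sketch, not the paper's stated rule: for positions on non-reference edges, pick whichever flanking reference coordinate is fewer bases away along the edge. `GraphEdge` and `project_to_reference` are hypothetical names.

```python
from typing import NamedTuple


class GraphEdge(NamedTuple):
    ref_start: int  # reference coordinate of the edge's source vertex
    ref_end: int    # reference coordinate of the edge's destination vertex
    seq: str        # sequence carried on the edge
    is_ref: bool    # True if the edge spells reference sequence


def project_to_reference(edge, offset):
    """Project (edge, offset) onto a linear reference coordinate."""
    if not 0 <= offset < len(edge.seq):
        raise ValueError("offset outside edge sequence")
    if edge.is_ref:
        # Reference edges map one-to-one.
        return edge.ref_start + offset
    # Non-reference edge: snap to the nearer flanking reference coordinate
    # (assumed tie-break: prefer the start).
    bases_to_start = offset
    bases_to_end = len(edge.seq) - offset
    return edge.ref_start if bases_to_start <= bases_to_end else edge.ref_end


ref_edge = GraphEdge(100, 110, "ACGTACGTAC", True)
insertion = GraphEdge(110, 110, "TTTTTT", False)      # pure insertion after 110
replacement = GraphEdge(200, 203, "GGGGGGGG", False)  # 3 bp replaced by 8 bp
```

With coordinates projected this way, an alignment record can be written against reference positions, which is what keeps the output usable by existing tools.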
Measured → the graph aligner runtimes
← on 10 randomly selected high-coverage whole-genome sequencing datasets
This trend
→ is reversed ← when only (8 and 12 threads) are used
→ but the runtimes remain comparable
Simulated → sequencing reads
← from individual samples drawn from VCF
→ to test → the read mapping accuracy (← of the graph aligner)
Graph Aligner
→ maintains 1⃣ a high mapping rate 2⃣ accuracy
← even in reads containing ← long indels
Graph Genome Pipeline improves recall in variant detection
Graph Genome Pipeline
→ calls variants
← using (a re-assembly variant caller) & (variant call filters)
Devised 4 (independent & complementary) experiments
→ to compare the variant calling accuracy
Benchmarked GATK HaplotypeCaller results
← derived from Graph Aligner BAMs
→ to separate the impact of the graph aligner ← from that of the variant caller
Developed → a machine learning-based approach
→ as an alternative to the standard approach of filtering → false positive variants
A consistent pattern emerges
← from the benchmarking experiments
Graph Genome Pipeline
→ has equally (good precision) ← with better recall
← in both SNP & indel calling
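For reference, precision and recall as used in this kind of benchmark can be computed from a call set and a truth set as below. This is a generic sketch. The variant tuples are made-up toy data, not results from the paper, and `precision_recall` is a hypothetical helper name.

```python
def precision_recall(calls, truth):
    """Return (precision, recall) of a call set against a truth set."""
    calls, truth = set(calls), set(truth)
    true_pos = len(calls & truth)
    false_pos = len(calls - truth)
    false_neg = len(truth - calls)
    precision = true_pos / (true_pos + false_pos) if calls else 0.0
    recall = true_pos / (true_pos + false_neg) if truth else 0.0
    return precision, recall


# Toy variants keyed as (chrom, pos, ref, alt); purely illustrative.
truth = {("chr1", 100, "A", "G"), ("chr1", 250, "C", "T"),
         ("chr1", 400, "G", "GA"), ("chr1", 900, "T", "C")}
linear_calls = {("chr1", 100, "A", "G"), ("chr1", 250, "C", "T"),
                ("chr1", 900, "T", "C")}
graph_calls = set(truth)

linear_pr = precision_recall(linear_calls, truth)
graph_pr = precision_recall(graph_calls, truth)
```

In this toy case both call sets have the same precision and differ only in recall, which is the pattern the notes describe.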
The gain (← in SNP calling accuracy)
→ is driven ← by graph alignment
The GiaB variant call sets
→ provide an estimate for the practical upper limit
← in achievable accuracy ← using the standard linear reference genome
∵ They are carefully curated ← from an extensive amount of high-quality data
← 1⃣ generated from a combination of several different sequencing platforms 2⃣ meta-analyzed
← across a suite of state-of-the-art bioinformatics tools
Graph Aligner
→ is by design able to map reads
← across known variations ← without reference bias
∴ These variants
→ could be real
→ but missed ← by all other linear reference genome-based pipelines ← used by GiaB
∵ Reference bias
A unified framework for SV calling using Graph Genome Pipeline
Sequence information
← of known SVs
→ can be incorporated → into a graph genome
← allowing reads → to be mapped across them
Manually curated a dataset of 230 (high-quality, breakpoint-resolved deletion-type) SVs
→ to demonstrate that reads spanning SV breakpoints → can be used to directly genotype SVs
SV set does NOT ❌ include → any events composed purely (← of inserted sequence)
Many of them → involve (novel sequence insertions) ← at their breakpoints
The fraction of reads spanning SV breakpoints
→ segregates cleanly into 3 clusters
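If the three clusters correspond to homozygous reference, heterozygous and homozygous alternate genotypes, a direct genotyper could be as simple as the sketch below. The cut-offs `low=0.2` and `high=0.8` are hypothetical values chosen for illustration, and `genotype_sv` is a hypothetical helper. The notes do not record thresholds from the paper.

```python
def genotype_sv(alt_reads, ref_reads, low=0.2, high=0.8):
    """Genotype an SV from reads spanning its breakpoints.

    alt_reads: reads supporting the SV branch of the graph.
    ref_reads: reads supporting the reference branch.
    low/high:  assumed cut-offs separating the three clusters.
    """
    total = alt_reads + ref_reads
    if total == 0:
        return "./."          # no spanning reads, genotype unknown
    fraction = alt_reads / total
    if fraction < low:
        return "0/0"          # homozygous reference
    if fraction > high:
        return "1/1"          # homozygous alternate
    return "0/1"              # heterozygous


examples = [genotype_sv(0, 30), genotype_sv(14, 16), genotype_sv(29, 1)]
```

The point is that no separate multi-step SV caller is needed once reads align across the breakpoints in the graph. The counts alone carry the genotype signal.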
SV set presented here
→ lacks more complex variants
← such as mobile elements & inversions
Focused on the GiaB sample HG002
← for which both 1⃣ Illumina read 2⃣ PacBio long read data → are publicly available
→ to compare the SV genotyping performance (← of the graph aligner)
Graph genomes prevent erroneous variant calls around SVs
Structural variations
← mediated by certain DNA repair mechanisms
→ can exhibit micro-homology
Compared → the rate of 1000G SNPs
← around the SV breakpoints with the background rate of SNPs ← in 1000G
→ to quantify this effect in 1000G
The aggregate 1000G SNP rate
→ is increased 3-fold to 5.5/bp
→ 67% of 1000G SNPs called ← within 10 bp of an SV breakpoint are false
Incremental improvement in variant calling recall through iterative graph augmentation
Newly discovered variants
→ can be incrementally added → to existing graph genomes
→ to increase the comprehensiveness of the graph
Augmented → the global graph with variants detected in 10 samples
← from 3 super-populations of Coriell cohort
→ to test whether incremental graph augmentation → would improve (variant calling)
The augmented graphs result
← in almost twice the number of novel SNPs and indels ← being called
Assessed → the quality of the detected variants indirectly
← using Ti/Tv and het/hom alternate allele ratios
Metrics fall outside
← the expected range for novel (SNPs and indels)
(Ti/Tv & het/hom) ratios
→ remain unaffected
→ although graph augmentation results ← in more (known & novel) variants being called
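The two indirect quality metrics can be computed as in the sketch below. Ti/Tv is the ratio of transitions (A<->G, C<->T) to transversions among SNPs. Het/hom is the ratio of heterozygous to homozygous alternate genotype calls. The function names and toy inputs are hypothetical.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def is_transition(ref, alt):
    """True for purine<->purine or pyrimidine<->pyrimidine substitutions."""
    if ref == alt:
        raise ValueError("ref and alt must differ")
    pair = {ref, alt}
    return pair <= PURINES or pair <= PYRIMIDINES


def ti_tv_ratio(snps):
    """snps: iterable of (ref, alt) single-base substitutions."""
    snps = list(snps)
    transitions = sum(1 for ref, alt in snps if is_transition(ref, alt))
    transversions = len(snps) - transitions
    return transitions / transversions if transversions else float("inf")


def het_hom_ratio(genotypes):
    """genotypes: iterable of diploid allele index pairs, e.g. (0, 1)."""
    genotypes = list(genotypes)
    het = sum(1 for a, b in genotypes if a != b)
    hom_alt = sum(1 for a, b in genotypes if a == b and a != 0)
    return het / hom_alt if hom_alt else float("inf")


# Toy inputs, not data from the paper.
titv = ti_tv_ratio([("A", "G"), ("C", "T"), ("A", "C"), ("G", "A")])
hethom = het_hom_ratio([(0, 1), (0, 1), (1, 1), (0, 0)])
```

Both ratios need no truth set, which is why they serve as indirect checks on novel variants for which no benchmark exists.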
Discussion
Benchmarking experiments → demonstrate that
→ using (a graph genome reference) → improves read mapping & variant calling recall
← including that of SVs
← without ❌ a concomitant loss in precision
Graph aligner
→ is able to readily (align reads)
← across breakpoint-resolved SVs ← included in the graph
Existing methods
← for identifying and genotyping SVs
→ require specifically designed multi-step algorithms
Further improvements → to variant calling
→ could be achieved
← by representing haplotypes ← of small variants and SVs
Provide → a computationally efficient alternative
→ to joint variant calling ← in leveraging previously accumulated population genetics information