Paper - Review

10.1038/s41588-018-0316-4

Abstract

The human reference genome
→ serves as the foundation for genomics
← by providing a scaffold ← for alignment of sequencing reads
← but currently only reflects a single consensus haplotype
∴ Impairing analysis accuracy

A graph reference genome implementation
→ enables read alignment ← across 2.8 K diploid genomes
← encompassing 1⃣ 12.6 M SNPs 2⃣ 4.0 M insertions & deletions

Using a graph genome reference
→ improves → read mapping sensitivity
→ produces a 0.5% increase in variant calling recall

Iterative augmentation of graph genome
→ yields incremental gains ← in variant calling accuracy

Introduction

Human reference genome
→ a standardized coordinate system
→ for 1⃣ annotating genomic elements 2⃣ comparing individual human genomes

∴ Underpinning the quality
← of 1⃣ all ensuing analyses 2⃣ ultimately the ability
→ to draw conclusions of (clinical significance) ← from DNA sequencing

The current (human reference genome)
→ a linear haploid DNA sequence
→ poses practical limitations
∵ the prevalence of (genetic diversity) ← in human populations

Genetic divergence
→ may cause sequencing reads → 1⃣ to map incorrectly 2⃣ to fail to map altogether

Recent large-scale re-sequencing efforts → have
→ 1⃣ comprehensively catalogued common genetic variants 2⃣ prompted suggestions → to make use of (this information) ← through multi-genome references

Existing multi-genome graph reference implementations
→ are orders of magnitude slower ← than (conventional linear reference genome-based methods)

Present (a graph genome pipeline)
→ for 1⃣ building 2⃣ augmenting 3⃣ storing 4⃣ querying 5⃣ variant calling
← from (graph genome) ← composed of a population of (genome sequences)

Results

A computationally efficient graph genome implementation

A graph genome data structure
→ represents genomic sequences ← on the edges of the graph

A graph genome
→ is constructed ← from a population of genome sequences

A process → to build a graph genome
← using VCF files indicating genetic variants
← with respect to a standard linear reference genome

This method
→ ensures backward compatibility of graph coordinates
→ to linear reference genome coordinates
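
A minimal sketch of the construction idea in the notes above: sequences sit on edges, and every vertex is a position on the linear reference, so graph coordinates map straight back to reference coordinates. The names `Edge` and `build_graph` and the `(pos, ref, alt)` tuple layout are assumptions made for illustration, not the paper's data structure or a real VCF parser.

```python
from typing import List, NamedTuple, Tuple


class Edge(NamedTuple):
    src: int      # reference coordinate where the edge starts (0-based)
    dst: int      # reference coordinate where the edge ends (exclusive)
    seq: str      # sequence carried on the edge
    is_ref: bool  # True if the edge belongs to the linear reference


def build_graph(reference: str, variants: List[Tuple[int, str, str]]) -> List[Edge]:
    """Build an edge-labelled graph from a reference and VCF-like variants.

    Each variant is (pos, ref_allele, alt_allele) with a 0-based position
    (hypothetical layout; real VCF positions are 1-based).
    """
    # Every variant start and end becomes a vertex, as do the sequence ends.
    cuts = {0, len(reference)}
    for pos, ref, _alt in variants:
        if reference[pos:pos + len(ref)] != ref:
            raise ValueError(f"REF allele mismatch at position {pos}")
        cuts.update((pos, pos + len(ref)))

    points = sorted(cuts)
    # Reference edges connect consecutive vertices along the linear genome.
    edges = [Edge(a, b, reference[a:b], True) for a, b in zip(points, points[1:])]
    # Each variant adds an alternative edge spanning the same two vertices.
    edges += [Edge(pos, pos + len(ref), alt, False) for pos, ref, alt in variants]
    return sorted(edges)


if __name__ == "__main__":
    for edge in build_graph("ACGTACGT", [(2, "G", "T"), (4, "AC", "A")]):
        print(edge)
```

Because vertices are reference positions, concatenating the `is_ref` edges in order reproduces the original linear sequence.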

Genomic features
← such as 1⃣ tandem repeat expansions 2⃣ inversions
→ are represented → as 1⃣ insertions 2⃣ sequence replacements ← in the graph

Do NOT ❌ support → bi-directionality & cycles
∵ These structures impose ← unnecessary computational complexity

Use → a hash table
← which associates short sequences of length k (k-mers)
← along all valid paths in the graph (← with their graph coordinates)
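
A rough sketch of such an index, assuming an acyclic graph whose edges carry sequence. The function name `build_kmer_index`, the `(src, dst, seq)` edge tuples, and the `(edge_id, offset)` coordinate are assumptions for illustration. The notes do not specify the paper's actual layout.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

Edge = Tuple[int, int, str]   # (source vertex, target vertex, sequence on the edge)
Coord = Tuple[int, int]       # (edge id, offset within that edge)


def build_kmer_index(edges: List[Edge], k: int) -> Dict[str, Set[Coord]]:
    """Map every k-mer found along any valid path to its start coordinates.

    Assumes k >= 1 and an acyclic graph (the recursion would not terminate
    on a cycle of empty edges).
    """
    out_edges = defaultdict(list)
    for edge_id, (src, _dst, _seq) in enumerate(edges):
        out_edges[src].append(edge_id)

    index: Dict[str, Set[Coord]] = defaultdict(set)

    def extend(edge_id: int, offset: int, prefix: str, start: Coord) -> None:
        _src, dst, seq = edges[edge_id]
        # Take only as many bases as are still missing from the k-mer.
        prefix += seq[offset:offset + k - len(prefix)]
        if len(prefix) == k:
            index[prefix].add(start)
            return
        # The edge ended before the k-mer was complete: follow every successor.
        for next_id in out_edges[dst]:
            extend(next_id, 0, prefix, start)

    for edge_id, (_src, _dst, seq) in enumerate(edges):
        for offset in range(len(seq)):
            extend(edge_id, offset, "", (edge_id, offset))
    return dict(index)


if __name__ == "__main__":
    # A SNP bubble: AC -> (G | T) -> TA
    demo = [(0, 1, "AC"), (1, 2, "G"), (1, 2, "T"), (2, 3, "TA")]
    for kmer, coords in sorted(build_kmer_index(demo, 3).items()):
        print(kmer, sorted(coords))
```

A seed lookup is then a plain dictionary access on each k-mer of a read. k-mers that cross a variant are indexed once per path, which is what lets reads carrying known alleles seed directly.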

Improved read mapping accuracy using graph genomes

Developed (a graph aligner) → for short reads
← which uses the k-mer index for seeding
→ to support (genomic analysis) ← on graph genome implementation

Read alignments
← against a path in the graph
→ are projected to the standard reference genome

∴ The output format → maintains (full compatibility)
← with existing genomics data processing tools

The reads are placed
→ at the closest reference position
∴ Downstream analysis tools → can access these reads conveniently

Measured → the graph aligner runtimes
← on 10 randomly selected high-coverage whole-genome sequencing datasets

This trend
→ is reversed ← when only (8 and 12 threads) are used
→ but the runtimes remain comparable

Simulated → sequencing reads
← from individual samples drawn from VCF
→ to test → the read mapping accuracy (← of the graph aligner)

Graph Aligner
→ maintains 1⃣ a high mapping rate 2⃣ accuracy
← even in reads containing ← long indels

Graph Genome Pipeline improves variant detection

Graph Genome Pipeline
→ calls variants
← using (a re-assembly variant caller) & (variant call filters)

Devised 4 (independent & complementary) experiments
→ to compare the variant calling accuracy

Benchmarked GATK HaplotypeCaller results
← derived from Graph Aligner BAMs
→ to separate the impact of the graph aligner ← from that of the variant caller

Developed → a machine learning-based approach
→ as an alternative to the standard approach of filtering → the false positive variants

A consistent pattern emerges
← from the benchmarking experiments

Graph Genome Pipeline
→ has equally (good precision) ← with better recall
← in both SNP & indel calling
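
For reference, precision and recall of a call set against a truth set can be computed as below. This is a simplified sketch that matches variants by exact `(chrom, pos, ref, alt)` tuples. Real benchmarking tools also reconcile different representations of the same variant, which this ignores. The function name and tuple layout are assumptions for illustration.

```python
from typing import Set, Tuple

Variant = Tuple[str, int, str, str]  # (chrom, pos, ref, alt)


def precision_recall(called: Set[Variant], truth: Set[Variant]) -> Tuple[float, float]:
    """Return (precision, recall) of a call set against a truth set."""
    true_pos = len(called & truth)    # called and present in the truth set
    false_pos = len(called - truth)   # called but absent from the truth set
    false_neg = len(truth - called)   # in the truth set but not called
    precision = true_pos / (true_pos + false_pos) if called else 0.0
    recall = true_pos / (true_pos + false_neg) if truth else 0.0
    return precision, recall


if __name__ == "__main__":
    truth = {("chr1", 100, "A", "G"), ("chr1", 200, "C", "T"),
             ("chr1", 300, "G", "GA"), ("chr1", 400, "TC", "T")}
    linear_calls = {("chr1", 100, "A", "G"), ("chr1", 200, "C", "T")}
    graph_calls = linear_calls | {("chr1", 300, "G", "GA")}
    print(precision_recall(linear_calls, truth))
    print(precision_recall(graph_calls, truth))
```

In this toy example both call sets have the same precision, while the second recovers one more true variant and so has higher recall, which is the pattern the notes describe.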

The gain (← in SNP calling accuracy)
→ is driven ← by graph alignment

The GiaB variant call sets
→ provide an estimate for the practical upper limit
← in achievable accuracy ← using the standard linear reference genome
∵ They are carefully curated ← from an extensive amount of high-quality data
← which was 1⃣ generated from a combination of several different sequencing platforms 2⃣ meta-analyzed
← across a suite of state-of-the-art bioinformatics tools

Graph Aligner
→ is by design able to map reads
← across known variations ← without reference bias

∴ These variants
→ could be real
→ but missed ← by all other linear reference genome-based pipelines ← used by GiaB
∵ Reference bias

A unified framework for SV calling using Graph Genome Pipeline

Sequence information
← of known SVs
→ can be incorporated → into a graph genome
← allowing reads → to be mapped across them

Manually curated a dataset of 230 (high-quality, breakpoint-resolved deletion-type) SVs
→ to demonstrate → reads spanning SV breakpoints → can be used to directly genotype SVs

SV set does NOT ❌ include → any events composed purely (← of inserted sequence)
Many of them → involve (novel sequence insertions) ← at their breakpoints

The fraction of reads spanning SV breakpoints
→ segregates cleanly into 3 clusters
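
One plausible reading is that the three clusters correspond to the three diploid genotypes. Under that assumption, a genotype can be read off the fraction of breakpoint-spanning reads that support the SV allele. The cut-offs `low=0.2` and `high=0.8` and the function name are hypothetical, chosen only to illustrate the idea, and are not taken from the paper.

```python
def genotype_sv(alt_reads: int, ref_reads: int,
                low: float = 0.2, high: float = 0.8) -> str:
    """Assign a diploid genotype from breakpoint-spanning read counts.

    alt_reads: reads aligned across the SV (alternate) path in the graph
    ref_reads: reads aligned across the reference path at the same breakpoint
    low, high: hypothetical cut-offs separating the three clusters
    """
    total = alt_reads + ref_reads
    if total == 0:
        return "./."                # no informative reads: leave uncalled
    alt_fraction = alt_reads / total
    if alt_fraction < low:
        return "0/0"                # cluster near 0: homozygous reference
    if alt_fraction > high:
        return "1/1"                # cluster near 1: homozygous SV
    return "0/1"                    # cluster near 0.5: heterozygous


if __name__ == "__main__":
    for alt, ref in [(0, 30), (14, 16), (29, 1), (0, 0)]:
        print(alt, ref, genotype_sv(alt, ref))
```

Fixed thresholds are the simplest option. If the clusters were less cleanly separated, a likelihood-based or clustering approach would be the more careful choice.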

SV set presented here
→ lacks more complex variants
← such as mobile elements & inversions

Focused on the GiaB sample HG002
← for which both 1⃣ Illumina read 2⃣ PacBio long read data → are publicly available
→ to compare the SV genotyping performance (← of the graph aligner)

Graph genomes prevent erroneous variant calls around SVs

Structural variations
← which are mediated by certain DNA repair mechanisms
→ can exhibit micro-homology

Compared → the rate of 1000G SNPs
← around the SV breakpoints with the background rate of SNPs ← in 1000G
→ to quantify this effect in 1000G

The aggregate 1000G SNP rate
→ is increased 3-fold to 5.5/bp
→ 67% of 1000G SNPs called ← within 10 bp of an SV breakpoint are false

Incremental improvement in variant calling recall through iterative graph augmentation

Newly discovered variants
→ can be incrementally added → to existing graph genomes
→ to increase the comprehensiveness of the graph

Augmented → the global graph with variants detected in 10 samples
← from 3 super-populations of the Coriell cohort
→ to test whether incremental graph augmentation → would improve (variant calling)

The augmented graphs result
← in almost twice the number of novel SNPs and indels ← being called

Assessed → the quality of the detected variants indirectly
← using Ti/Tv and heterozygous-to-homozygous alternate allele (het/hom) ratios
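
Both quality metrics are simple counts. The sketch below shows how they could be computed from single-base SNPs given as `(ref, alt)` pairs and VCF-style genotype strings. The function names and input layout are assumptions for illustration.

```python
import math
from typing import Iterable, Tuple

# Transitions swap purine with purine (A<->G) or pyrimidine with pyrimidine (C<->T).
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}


def ti_tv_ratio(snps: Iterable[Tuple[str, str]]) -> float:
    """Transition/transversion ratio over single-base (ref, alt) pairs."""
    ti = tv = 0
    for ref, alt in snps:
        if (ref.upper(), alt.upper()) in TRANSITIONS:
            ti += 1
        else:
            tv += 1                 # every other single-base change is a transversion
    return ti / tv if tv else math.nan


def het_hom_ratio(genotypes: Iterable[str]) -> float:
    """Ratio of heterozygous to homozygous-alternate genotype calls."""
    het = hom_alt = 0
    for gt in genotypes:
        alleles = gt.replace("|", "/").split("/")   # accept phased and unphased
        if "." in alleles or all(a == "0" for a in alleles):
            continue                # skip missing and homozygous-reference calls
        if len(set(alleles)) > 1:
            het += 1
        else:
            hom_alt += 1
    return het / hom_alt if hom_alt else math.nan


if __name__ == "__main__":
    snps = [("A", "G"), ("C", "T"), ("G", "A"), ("T", "C"), ("A", "C"), ("G", "T")]
    gts = ["0/1", "0|1", "1/1", "0/0", "1/0", "1|1", "./."]
    print(ti_tv_ratio(snps), het_hom_ratio(gts))
```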

Metrics fall outside
← the expected range for novel (SNPs and indels)

(Ti/Tv & het/hom) ratios
→ remain unaffected
→ although (graph augmentation results) ← in more (known & novel) variants being called

Discussion

Benchmarking experiments → demonstrate that
→ using (a graph genome reference) → improves read mapping & variant calling recall
← including that of SVs
← without ❌ a concomitant loss in precision

Graph aligner
→ is able to readily (align reads)
← across breakpoint-resolved SVs ← included in the graph

Existing methods
← for identifying and genotyping SVs
→ require specifically designed multi-step algorithms

Further improvements → to variant calling
→ could be achieved
← by representing haplotypes ← of small variants and SVs

Provide → a computationally efficient alternative
→ to joint variant calling ← in leveraging previously accumulated population genetics information