Paper - Review

DOI: 10.1186/s13059-019-1819-8

Abstract

⭐ Era of genomic medicine
→ Potential (to detect sequences originating from micro-organisms)
→ Approaches for (bacterial & viral detection) ← within host-dominated sequence data

⭐ Benchmarking (← over 70 distinct combinations of tools)
→ on 100 simulated cancer datasets
→ Highest-performing individual tools: 1⃣ mOTUs2 2⃣ Kraken

⭐ SEPATH
→ amenable to high-throughput sequencing studies (← across a range of high-performance computing clusters)

Background

HPV
← Human papillomavirus
→ the role of HPV (← in tumorigenesis)

WGS
← Whole genome sequencing
→ is rapidly increasing → with recent large-scale projects (← e.g. TCGA & ICGC)
→ making it possible → to (detect & quantify) pathogens

Analyzing (← cancer metagenomics)
→ The potential benefits are broad
→ could benefit (multiple prominent research topics) (← including 1⃣ cancer development 2⃣ treatment resistance 3⃣ biomarkers of progression)
∴ To consider (the performance of pathogen sequence classification methods) → on host-dominated tissue sequence data

Taxonomic profiling
← by amplicon analysis of the 16S ribosomal RNA gene, or by whole-genome (shotgun) metagenomics
→ whole-genome metagenomics can interrogate (all regions ← of every constituent genome)
→ to obtain (accurate taxonomic classifications) for metagenomic sequence data

⚠ Rely on (reference & assembled) genomes
← to (match & classify) each sequencing read
1⃣ There exists an (uneven dispersion of interest) (← in the tree of life)
2⃣ (Sequence similarity between organisms) & (contamination in reference genomes) → inhibit the perfect classification of every input sequence
∵ this species-level instability → reasonable to carry out (metagenomic investigations) at genus level

Computational tools (← for metagenomic classification)
→ can be generalized into (taxonomic binners) & (taxonomic profilers)
1⃣ Taxonomic binners
→ e.g. Kraken, CLARK, StrainSeeker
→ make a classification on every input sequence
2⃣ Taxonomic profilers
→ e.g. MetaPhlAn2, mOTUs2
→ use a curated database of marker genes → to obtain (a comparable profile) for each sample

❗Challenge: CAMI (Critical Assessment of Metagenome Interpretation)
→ to independently benchmark the ever-growing (selection of tools)
→ provides (a useful starting point) → for understanding classification tools
→ unlikely ❌ to provide an accurate comparison (← for niche areas such as host-dominated sequence data)

Classifying organisms
← within host tissue sequence data
→ provides (an additional set of challenges)

SEPATH
→ template computational pipelines designed specifically (← for obtaining classifications)
→ by analyzing the performance of tools

Results

The process (← of obtaining pathogenic classifications from host tissue reads)
1⃣ sequence quality control
2⃣ host sequence depletion
3⃣ taxonomic classification
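
The three steps above can be sketched as a chain of functions. This is only a toy illustration: the function names, the length and "N" checks, and the exact-match lookups are all assumptions made for the example, not what SEPATH or its underlying tools actually do.

```python
from collections import Counter


def quality_control(reads, min_length=5):
    """Toy QC (assumption): drop reads that are too short or contain an N."""
    return [r for r in reads if len(r) >= min_length and "N" not in r]


def deplete_host(reads, host_reads):
    """Toy host depletion (assumption): drop reads identical to a host read."""
    host = set(host_reads)
    return [r for r in reads if r not in host]


def classify(reads, reference):
    """Toy classification (assumption): exact read -> genus lookup."""
    return Counter(reference[r] for r in reads if r in reference)


def run_pipeline(reads, host_reads, reference):
    """Apply the three steps in order: QC, host depletion, classification."""
    cleaned = quality_control(reads)
    non_host = deplete_host(cleaned, host_reads)
    return classify(non_host, reference)


# Made-up reads for illustration only.
reads = ["ACGTACGT", "ACG", "TTTTGGGG", "ACGTNCGT", "GGGGCCCC"]
host_reads = ["TTTTGGGG"]
reference = {"ACGTACGT": "Escherichia", "GGGGCCCC": "Fusobacterium"}

genus_counts = run_pipeline(reads, host_reads, reference)
print(dict(genus_counts))
```

The order matters in this sketch: each step only sees what the previous step kept, so host reads never reach the classifier.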

Human sequence depletion

❗Essential → to remove as many host reads as possible
1⃣ To limit the opportunity for mis-classification
2⃣ To significantly reduce (the size of data)

3 methods (← of host depletion) were investigated on 11 simulated datasets
→ All methods retained (← the majority of bacterial reads)
→ Human reads remaining in each dataset varied

To capture k-mers specific to cancer sequences
→ a BBDuK database was generated containing human reference genome 38
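
The idea of k-mer based host depletion can be sketched as follows. This is a simplified illustration, not BBDuk itself: it assumes a read is discarded as soon as it shares a single exact k-mer with the host reference, and it ignores reverse complements and mismatches. All names and sequences are made up.

```python
def kmers(seq, k):
    """Return the set of all k-length substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def build_host_index(host_sequences, k):
    """Collect every k-mer seen in the host reference sequences."""
    index = set()
    for seq in host_sequences:
        index |= kmers(seq, k)
    return index


def deplete(reads, host_index, k):
    """Keep only reads that share no k-mer with the host index."""
    return [r for r in reads if kmers(r, k).isdisjoint(host_index)]


# Made-up host reference and reads for illustration only.
host = ["ACGTACGTAC"]
reads = ["GTACGT", "TTTTTT", "CCGTAC"]

host_index = build_host_index(host, k=4)
kept = deplete(reads, host_index, k=4)
print(kept)
```

Adding more host-like sequence to the index (as the notes describe for cancer sequences) only makes the filter stricter, since more reads will then share a k-mer with it.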

Taxonomic classification: bacterial datasets

Comparing (the performance of 6 different taxonomic classification tools)
← by applying them after (filtering & host depletion) on 100 simulated datasets

Bacterial proportion estimation

Analyzing (population proportions)
→ provide understanding (← of micro-organism community structure)
∴ To assess (the performance of tools) in predicting proportions
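
A minimal sketch of how predicted proportions could be compared with known ones. The error measure used here (sum of absolute differences) is an assumption chosen for simplicity; the notes do not say which measure the paper used. Counts and genus names are made up.

```python
def proportions(counts):
    """Convert raw read counts per taxon into fractions that sum to 1."""
    total = sum(counts.values())
    return {taxon: n / total for taxon, n in counts.items()}


def total_abs_error(predicted, truth):
    """Sum of absolute differences over every taxon seen in either profile."""
    taxa = set(predicted) | set(truth)
    return sum(abs(predicted.get(t, 0.0) - truth.get(t, 0.0)) for t in taxa)


# Made-up truth and predicted read counts for illustration only.
truth = {"Escherichia": 0.5, "Fusobacterium": 0.5}
predicted = proportions({"Escherichia": 60, "Fusobacterium": 20, "Bacteroides": 20})

error = total_abs_error(predicted, truth)
print(round(error, 3))
```

A taxon that is predicted but absent from the truth (here "Bacteroides") contributes its full predicted proportion to the error.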

Bacterial classification following metagenomic assembly

(mOTUs2 & Kraken) have comparable performances
→ Kraken can classify non-bacterial sequences; mOTUs2 cannot

Post-classification filtering involves applying criteria
→ to remove low-quality classifications from taxonomic results
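
Post-classification filtering can be sketched as a simple threshold. The criterion used here (a minimum number of supporting reads per taxon) and the threshold value are assumptions for illustration; the notes do not list the criteria actually applied.

```python
def filter_classifications(counts, min_reads):
    """Keep only taxa supported by at least min_reads reads."""
    return {taxon: n for taxon, n in counts.items() if n >= min_reads}


# Made-up classification counts for illustration only.
raw_counts = {"Escherichia": 1200, "Fusobacterium": 85, "Salmonella": 3}

filtered = filter_classifications(raw_counts, min_reads=10)
print(filtered)
```

Raising the threshold trades sensitivity for fewer false positives: low-abundance true taxa are removed along with spurious ones.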

Filtering these datasets (← by number of contigs) is non-ideal

Taxonomic classification: viral datasets

The performance (← of viral classification) in the presence of bacterial noise
→ mOTUs2 does NOT ❌ make viral classifications
→ Kraken was run on either (quality-trimmed reads) or (contigs following metaSPAdes assembly)

Effect (← of filtering) on viral species classification was NOT ❌ reflected (← in the classification of bacterial genera)

Bacterial consensus classification

Using distinct methods of classification & combining the results
→ to improve metagenomic classification performance
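
One simple way to combine results from several classifiers is a vote. The rule below (report a genus only if at least `min_tools` tools call it) is an assumption for illustration; the notes do not describe how the paper built its consensus. The tool outputs are made up.

```python
from collections import Counter


def consensus(tool_calls, min_tools=2):
    """Return genera reported by at least min_tools of the tools, sorted by name."""
    votes = Counter(
        genus for calls in tool_calls.values() for genus in set(calls)
    )
    return sorted(genus for genus, n in votes.items() if n >= min_tools)


# Made-up genus calls per tool for illustration only.
tool_calls = {
    "Kraken": ["Escherichia", "Fusobacterium", "Salmonella"],
    "mOTUs2": ["Escherichia", "Fusobacterium"],
    "PathSeq": ["Escherichia", "Bacteroides"],
}

agreed = consensus(tool_calls, min_tools=2)
print(agreed)
```

Using `set(calls)` means a tool that lists the same genus twice still contributes only one vote.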

A smaller selection of datasets was used
∵ local resource limitations (← in storage & computational time) of aligning (simulated datasets to the human genome)
→ to produce the required input for PathSeq

Real cancer whole genome sequence data

SEPATH pipelines (← using Kraken & mOTUs2) were run on
← 1⃣ quality-trimmed 2⃣ human-depleted sequencing files

HPV present
← in 9/10 cervical squamous cell carcinoma

Care should be taken
→ to ensure the true-positive nature (← of reported classifications)
← despite human read depletion

Expansion of these pipelines
← to larger datasets
→ to characterize (the role of many other reported genera)

Discussion

Pipelines
→ for detecting (bacterial genera) & (viral species)
→ in simulated & real WGS from cancer samples
→ perform well in terms of (sensitivity & positive predictive value, PPV)
→ utilize computational resources effectively
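
For reference, sensitivity and PPV can be computed from a predicted and a true set of genera as sketched below. F1 (the harmonic mean of the two) is included as a common single-number summary. The genus sets are made up for illustration.

```python
def sensitivity_ppv_f1(predicted, truth):
    """Return (sensitivity, PPV, F1) for a predicted vs. true set of taxa."""
    predicted, truth = set(predicted), set(truth)
    tp = len(predicted & truth)   # true positives: correctly reported taxa
    fp = len(predicted - truth)   # false positives: reported but not present
    fn = len(truth - predicted)   # false negatives: present but missed
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    ppv = tp / (tp + fp) if tp + fp else 0.0
    denom = sensitivity + ppv
    f1 = 2 * sensitivity * ppv / denom if denom else 0.0
    return sensitivity, ppv, f1


# Made-up genus sets for illustration only.
truth = {"Escherichia", "Fusobacterium", "Bacteroides", "Prevotella"}
predicted = {"Escherichia", "Fusobacterium", "Bacteroides", "Salmonella"}

sens, ppv, f1 = sensitivity_ppv_f1(predicted, truth)
print(sens, ppv, f1)
```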

Kraken
→ builds a database by (minimizing & compressing) → every unique k-mer for each reference genome
→ begins the analysis (← by breaking down each input into its constituent k-mers)
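
A much-simplified sketch of k-mer based read classification. It assumes each k-mer votes for a genus only when it is unique to that genus, and the read is assigned to the genus with the most votes. Kraken itself works on a taxonomic tree rather than a flat vote, so this is an illustration of the k-mer idea only. All names and sequences are made up.

```python
from collections import Counter


def kmer_list(seq, k):
    """Break a sequence into its constituent k-mers, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def build_db(references, k):
    """Map each k-mer to the set of genera whose reference contains it."""
    db = {}
    for genus, seq in references.items():
        for kmer in set(kmer_list(seq, k)):
            db.setdefault(kmer, set()).add(genus)
    return db


def classify_read(read, db, k):
    """Assign the read to the genus with the most unambiguous k-mer hits."""
    votes = Counter()
    for kmer in kmer_list(read, k):
        genera = db.get(kmer)
        if genera and len(genera) == 1:  # skip k-mers shared between genera
            votes[next(iter(genera))] += 1
    return votes.most_common(1)[0][0] if votes else None


# Made-up reference sequences for illustration only.
references = {"GenusA": "ACGTACGGTT", "GenusB": "TTGCAAGCTA"}
db = build_db(references, k=4)

print(classify_read("GTACGG", db, k=4))
print(classify_read("CCCCCC", db, k=4))
```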

mOTUs2
→ uses a highly targeted approach ← by analyzing 40 universal phylogenetic bacterial marker genes
→ (mOTUs2 vs. Bracken) → mOTUs2 provides more accurate predictions

∴ Kraken pipelines → for accurate representations of (presence & absence)
∴ (Abundance weighted β-diversity metrics) should be interpreted with caution

Use (← of mOTUs2) → for quantitative bacterial measurement
→ with the high classification performance on simulated data
∴ Both (binary & non-binary) β-diversity measures → would be representative of the true values
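
The difference between binary and abundance-weighted β-diversity can be shown with two common measures. Jaccard distance (presence/absence) and Bray-Curtis dissimilarity (abundance-weighted) are used here as examples; the notes do not name specific metrics, so this choice is an assumption. The sample counts are made up.

```python
def jaccard_distance(a, b):
    """Binary measure: 1 - (shared taxa / all taxa), using presence only."""
    present_a = {t for t, n in a.items() if n > 0}
    present_b = {t for t, n in b.items() if n > 0}
    union = present_a | present_b
    if not union:
        return 0.0
    return 1 - len(present_a & present_b) / len(union)


def bray_curtis(a, b):
    """Abundance-weighted measure: sum |a - b| / sum (a + b) over all taxa."""
    taxa = set(a) | set(b)
    numerator = sum(abs(a.get(t, 0) - b.get(t, 0)) for t in taxa)
    denominator = sum(a.get(t, 0) + b.get(t, 0) for t in taxa)
    return numerator / denominator if denominator else 0.0


# Two made-up samples with the same genera but very different abundances.
sample_1 = {"Escherichia": 90, "Fusobacterium": 10}
sample_2 = {"Escherichia": 10, "Fusobacterium": 90}

print(jaccard_distance(sample_1, sample_2))
print(bray_curtis(sample_1, sample_2))
```

The two samples are identical under the binary measure but far apart under the abundance-weighted one, which is why inaccurate abundance estimates affect only the second kind of metric.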

mOTUs2
→ differs from the current methods (← which rely purely on bacterial reference sequences)

Bacterial classification
→ a higher performance (← at taxonomic levels above genus level)
→ performance appears to drop ↓ at species level
∵ the instability of species-level classification

SEPATH pipelines (← on real cancer sequence data)
→ suggest (overall agreement) (← between Kraken & mOTUs2)
→ Kraken is more sensitive than mOTUs2 (← in real data)
∵ the differing parameters used

Using (a sequencing protocol) (← which is optimized for microbial detection)
→ would result in a (higher & more even) microbial genome coverage