Paper - Review
10.1186/s13059-019-1819-8
Abstract
⭐ Era of genomic medicine
→ Potential (to detect sequences originating from micro-organisms)
→ Approaches for (bacterial & viral detection) ← within host-dominated sequence data
⭐ Benchmarking (← over 70 distinct combinations of tools)
→ on 100 simulated cancer datasets
1⃣ mOTUs2 2⃣ Kraken
⭐ SEPATH
→ amenable to high throughput sequencing studies (← across a range of high-performance computing clusters)
Background
HPV
← Human papillomavirus
→ the role of HPV (← in tumorigenesis)
WGS
→ is rapidly increasing → with recent large-scale projects (← e.g. TCGA & ICGC)
→ making it possible → to (detect & quantify) pathogens
Analyzing (← cancer metagenomics)
→ The potential benefits are broad
→ could benefit (multiple prominent research topics) (← including 1⃣ cancer development 2⃣ treatment resistance 3⃣ biomarkers of progression)
∴ To consider (the performance of pathogen sequence classification methods) ← on host-dominated tissue sequence data
Taxonomic profiling
← traditionally by amplicon analysis of the 16S ribosomal RNA gene
→ whole genome metagenomics can interrogate (all regions ← of every constituent genome)
→ to obtain (accurate taxonomic classifications) for metagenomic sequence data
⚠ Rely on (reference & assembled) genomes
← to (match & classify) each sequencing read
1⃣ There exists an (uneven dispersion of interest) (← in the tree of life)
2⃣ (Sequence similarity) (← between organisms & contamination) → inhibit the perfect classification of every input sequence
∵ this species-level instability → (metagenomic investigations) are carried out at the genus level
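The genus-level choice above can be illustrated with a minimal sketch that collapses species-level read counts into genus-level counts. It assumes binomial species names whose first word is the genus; the function name `collapse_to_genus` and the example counts are hypothetical, and real tools would normally work from taxonomy IDs rather than name strings.

```python
from collections import Counter


def collapse_to_genus(species_counts):
    """Sum read counts per genus.

    Assumption: each key is a binomial name such as "Escherichia coli",
    so the first whitespace-separated word is taken as the genus.
    """
    genus_counts = Counter()
    for species, count in species_counts.items():
        genus = species.split()[0]
        genus_counts[genus] += count
    return dict(genus_counts)


# Hypothetical species-level counts, for illustration only.
species_level = {
    "Escherichia coli": 120,
    "Escherichia fergusonii": 15,
    "Bacteroides fragilis": 40,
}
genus_level = collapse_to_genus(species_level)
print(genus_level)
```

Reads that are ambiguous between two species of the same genus no longer count as errors once the counts are merged this way.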
Computational tools (← for metagenomic classification)
→ can be generalized into (taxonomic binners) & (taxonomic profilers)
1⃣ Taxonomic binners
→ e.g. Kraken, CLARK, StrainSeeker
→ make a classification on every input sequence
2⃣ Taxonomic profilers
→ e.g. MetaPhlAn2, mOTUs2
→ use a curated database of marker genes → to obtain (a comparable profile) for each sample
❗Challenge: CAMI (Critical Assessment of Metagenome Interpretation)
→ to independently benchmark the ever-growing (selection of tools)
→ provides (a useful starting point) → for understanding classification tools
→ unlikely ❌ to provide an accurate comparison
Classifying organisms
← within host tissue sequence data
→ provides (an additional set of challenges)
SEPATH
→ template computational pipelines designed specifically (← for obtaining classifications)
→ by analyzing the performance of tools
Results
The process (← of obtaining pathogenic classifications from host tissue reads)
1⃣ sequence quality control
2⃣ host sequence depletion
3⃣ taxonomic classification
Human sequence depletion
❗Essential → to remove as many host reads as possible
1⃣ To limit the opportunity for mis-classification
2⃣ To significantly reduce (the size of data)
3 methods (← of host depletion) were investigated on 11 simulated datasets
→ All methods retained (← the majority of bacterial reads)
→ Human reads remaining in each dataset varied
To capture k-mers specific to cancer sequences
→ BBDuk database was generated containing human reference genome 38
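The idea behind k-mer-based host depletion can be sketched as follows: a read is discarded if it shares any k-mer with the host reference. This is a toy illustration, not BBDuk's implementation; the names `kmers` and `deplete_host`, the value `k=5`, and the sequences are assumptions made for the example, and reverse complements are ignored for brevity.

```python
def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def deplete_host(reads, host_reference, k=5):
    """Keep only reads that share no k-mer with the host reference.

    Simplification: only the forward strand is checked, and a single
    shared k-mer is enough to discard a read.
    """
    host_kmers = kmers(host_reference, k)
    kept = []
    for read in reads:
        if kmers(read, k).isdisjoint(host_kmers):
            kept.append(read)
    return kept


# Toy sequences, for illustration only.
host = "ACGTACGTTAGCCGATAGCT"
reads = ["CGTACGTTAG", "GGGTTTCCCAAA", "TTTTGGGGCCCC"]
non_host = deplete_host(reads, host)
print(non_host)
```

Adding further sequences to the host reference, as the notes describe for cancer-specific k-mers, only enlarges `host_kmers`; the filtering step itself is unchanged.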
Taxonomic classification: bacterial datasets
Comparing (the performance of 6 different taxonomic classification tools)
← by applying them after (filtering & host depletion) on 100 simulated datasets
Bacterial proportion estimation
Analyzing (population proportions)
→ provide understanding (← of micro-organism community structure)
∴ To assess (the performance of tools) in predicting proportions
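One way to score proportion estimates is to compare predicted and true proportions taxon by taxon. The sketch below uses the summed absolute difference as the error measure; that metric is an assumption for illustration (the notes do not record which measure the paper used), and all names and numbers are hypothetical.

```python
def to_proportions(counts):
    """Convert raw read counts into proportions that sum to 1."""
    total = sum(counts.values())
    if total == 0:
        return {taxon: 0.0 for taxon in counts}
    return {taxon: n / total for taxon, n in counts.items()}


def total_abs_error(true_props, predicted_props):
    """Sum of absolute differences over every taxon seen in either profile."""
    taxa = set(true_props) | set(predicted_props)
    return sum(
        abs(true_props.get(t, 0.0) - predicted_props.get(t, 0.0)) for t in taxa
    )


# Hypothetical truth and tool output, for illustration only.
true_props = {"Escherichia": 0.5, "Bacteroides": 0.25, "Prevotella": 0.25}
predicted = to_proportions({"Escherichia": 60, "Bacteroides": 20, "Fusobacterium": 20})
error = total_abs_error(true_props, predicted)
print(round(error, 3))
```

A missed genus and a false-positive genus both contribute to the error, so the score penalizes wrong community structure as well as wrong abundances.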
Bacterial classification following metagenomic assembly
(mOTUs2 & Kraken) have comparable performances
→ Kraken can classify non-bacterial sequences; mOTUs2 cannot
Post-classification filtering involves applying criteria
→ to remove low-quality classification from taxonomic results
Filtering these datasets (← by number of contigs) is non-ideal
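Post-classification filtering can be sketched as a simple threshold on the evidence supporting each taxon. The example below filters by a minimum number of supporting reads (or contigs); the function name `filter_classifications`, the threshold, and the counts are hypothetical and are not the parameters used in the paper.

```python
def filter_classifications(support_counts, min_support):
    """Drop taxa supported by fewer than min_support reads or contigs."""
    return {
        taxon: n for taxon, n in support_counts.items() if n >= min_support
    }


# Hypothetical classification output, for illustration only.
raw_calls = {"Escherichia": 540, "Bacteroides": 75, "Ralstonia": 3, "Delftia": 1}
filtered = filter_classifications(raw_calls, min_support=10)
print(sorted(filtered))
```

Raising `min_support` removes more low-evidence calls, which tends to trade sensitivity for positive predictive value. After assembly each taxon is supported by far fewer units, so a count threshold has much less room to separate true from false calls.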
Taxonomic classification: viral datasets
The performance (← of viral classification) in the presence of bacterial noise
→ mOTUs2 does NOT ❌ make viral classification
→ Kraken was run on either (quality-trimmed reads) or (contigs following metaSPAdes)
Effect (← of filtering) on viral species classification was NOT ❌ reflected (← in the classification of bacterial genera)
Bacterial consensus classification
Using distinct methods of classification & combining (their results)
→ to improve metagenomic classification performance
A smaller selection of datasets was used
∵ local resource limitations (← in storage & computational time) of aligning (reads to the human genome)
→ to produce the required input for PathSeq
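A consensus classification can be sketched as a vote across tools: a genus is reported only if enough tools call it. The voting rule, the `min_tools` parameter, and the example calls are assumptions for illustration; the notes do not record how the paper combined results.

```python
from collections import Counter


def consensus_genera(calls_by_tool, min_tools=2):
    """Return genera reported by at least min_tools distinct tools."""
    votes = Counter()
    for genera in calls_by_tool.values():
        # set() ensures each tool contributes at most one vote per genus.
        votes.update(set(genera))
    return {genus for genus, n in votes.items() if n >= min_tools}


# Hypothetical per-tool genus calls, for illustration only.
calls = {
    "Kraken": ["Escherichia", "Bacteroides", "Ralstonia"],
    "mOTUs2": ["Escherichia", "Bacteroides"],
    "PathSeq": ["Escherichia", "Fusobacterium"],
}
consensus = consensus_genera(calls, min_tools=2)
print(sorted(consensus))
```

Setting `min_tools` to the number of tools gives a strict intersection, while `min_tools=1` gives the union of all calls.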
Real cancer whole genome sequence data
SEPATH pipelines (← using Kraken & mOTUs2) were run on
← 1⃣ quality-trimmed 2⃣ human-depleted sequencing files
HPV present
← in 9/10 cervical squamous cell carcinoma samples
Care should be taken
→ to ensure the true-positive nature
← despite human read depletion
Expansion of these pipelines
← to larger datasets
→ to characterize (the role of many other reported genera)
Discussion
Pipelines
→ for detecting (bacterial genera) & (viral species)
→ in simulated & real WGS from cancer samples
→ perform well in terms of (sensitivity & PPV)
→ utilize computational resources effectively
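Sensitivity and PPV, the two performance measures named above, can be computed from the sets of true and predicted taxa. This is a minimal sketch using the standard definitions; the function name and the example genera are hypothetical.

```python
def classification_metrics(true_taxa, predicted_taxa):
    """Compute sensitivity, PPV and F1 from sets of taxon names."""
    true_taxa, predicted_taxa = set(true_taxa), set(predicted_taxa)
    tp = len(true_taxa & predicted_taxa)   # correctly reported taxa
    fp = len(predicted_taxa - true_taxa)   # reported but not present
    fn = len(true_taxa - predicted_taxa)   # present but not reported
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    ppv = tp / (tp + fp) if tp + fp else 0.0
    denom = sensitivity + ppv
    f1 = 2 * sensitivity * ppv / denom if denom else 0.0
    return {"sensitivity": sensitivity, "ppv": ppv, "f1": f1}


# Hypothetical truth set and tool output, for illustration only.
truth = {"Escherichia", "Bacteroides", "Prevotella", "Fusobacterium"}
predicted = {"Escherichia", "Bacteroides", "Prevotella", "Ralstonia"}
metrics = classification_metrics(truth, predicted)
print(metrics)
```

F1 is the harmonic mean of sensitivity and PPV, so it is only high when both are high.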
Kraken
→ builds a database by (minimizing & compressing) → every unique k-mer for each reference genome
→ begins the analysis (← by breaking down each input into its constituent k-mers)
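The k-mer lookup idea can be sketched with a toy classifier: index every k-mer of each reference, then assign a read to the taxon that most of its k-mers hit. This is a simplification and not Kraken's algorithm; in particular, Kraken maps k-mers shared between genomes to their lowest common ancestor, whereas this sketch simply discards them. All names, the value `k=4`, and the sequences are hypothetical.

```python
from collections import Counter


def build_kmer_index(references, k):
    """Map each k-mer to the single taxon it occurs in.

    Simplification: k-mers found in more than one taxon are dropped
    rather than assigned to a common ancestor.
    """
    index = {}
    ambiguous = set()
    for taxon, genome in references.items():
        for i in range(len(genome) - k + 1):
            kmer = genome[i:i + k]
            if kmer in ambiguous:
                continue
            if kmer in index and index[kmer] != taxon:
                del index[kmer]
                ambiguous.add(kmer)
            else:
                index[kmer] = taxon
    return index


def classify_read(read, index, k):
    """Return the taxon hit by the most k-mers of the read, or None."""
    hits = Counter()
    for i in range(len(read) - k + 1):
        taxon = index.get(read[i:i + k])
        if taxon is not None:
            hits[taxon] += 1
    if not hits:
        return None
    return hits.most_common(1)[0][0]


# Toy reference genomes, for illustration only.
references = {"GenusA": "ACGTACGGTCA", "GenusB": "TTGACCATGGC"}
index = build_kmer_index(references, k=4)
calls = [classify_read(r, index, k=4) for r in ("GTACGG", "GACCAT", "AAAAAA")]
print(calls)
```

Because every read is looked up independently, this style of tool makes a call on each input sequence, which is what distinguishes binners from marker-gene profilers.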
mOTUs2
→ uses a highly targeted approach ← by analyzing 40 universal phylogenetic bacterial marker genes
→ (mOTUs2 vs. Bracken) → mOTUs2 provides more accurate predictions
∴ Kraken pipelines → for accurate representations of (presence & absence)
∴ (Abundance weighted β-diversity metrics) should be interpreted with caution
Use (← of mOTUs2) → for quantitative bacterial measurement
→ with the high classification performance on simulated data
∴ Both (binary & non-binary) β-diversity measures → would be representative of the true values
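The difference between binary and abundance-weighted β-diversity can be shown with two common measures: Jaccard distance (presence/absence only) and Bray-Curtis dissimilarity (abundance-weighted). Choosing these two is an assumption for illustration; the notes do not name specific metrics, and the sample counts are hypothetical.

```python
def jaccard_distance(a, b):
    """Binary beta-diversity: 1 - |shared taxa| / |all taxa present|."""
    present_a = {t for t, n in a.items() if n > 0}
    present_b = {t for t, n in b.items() if n > 0}
    union = present_a | present_b
    if not union:
        return 0.0
    return 1 - len(present_a & present_b) / len(union)


def bray_curtis(a, b):
    """Abundance-weighted beta-diversity: sum|a-b| / sum(a+b)."""
    taxa = set(a) | set(b)
    numerator = sum(abs(a.get(t, 0) - b.get(t, 0)) for t in taxa)
    denominator = sum(a.get(t, 0) + b.get(t, 0) for t in taxa)
    return numerator / denominator if denominator else 0.0


# Two hypothetical samples with the same genera but different abundances.
sample_1 = {"Escherichia": 10, "Bacteroides": 10}
sample_2 = {"Escherichia": 19, "Bacteroides": 1}
print(jaccard_distance(sample_1, sample_2))
print(bray_curtis(sample_1, sample_2))
```

The two samples are identical under the binary measure but clearly different under the weighted one, which is why inaccurate abundance estimates distort weighted metrics while leaving presence/absence metrics intact.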
mOTUs2
→ differs from the current methods (← which rely purely on bacterial reference sequences)
Bacterial classification
→ a higher performance (← at taxonomic levels above genus level)
→ performance appears to drop ↓ at species level
∵ the instability of species-level classification
SEPATH pipelines (← on real cancer sequence data)
→ suggest (overall agreement) (← between Kraken & mOTUs2)
→ Kraken is more sensitive than mOTUs2 (← in real data)
∵ the differing parameters used
Using (a sequencing protocol) (← which is optimized for microbial detection)
→ results in a (higher & more even) microbial genome coverage