Paper - Review

DOI: 10.1038/s41586-020-2095-1

Abstract

Systematic characterization
← of the cancer microbiome
→ provides the opportunity to exploit (non-human & micro-organism-derived molecules) (← in the diagnosis of a major human disease)

⭐ Some types of cancer → show substantial microbial contributions
∴ Found (unique microbial signatures) (← in tissue & blood)

Introduction

Cancer
→ classically considered a disease of the human genome
❗Microbiome → makes substantial contributions to some types of cancer
❓Microbial contributions (← to different types of cancer) → remain unknown

Re-examined microbial reads
← from 18116 samples across 10481 patients and 33 types of cancer
∵ TCGA sequencing data → remained unexplored for micro-organisms

2 (orthogonal microbial-detection pipelines)
← systematically measuring & mitigating technical (variation & contamination)
→ Machine Learning → to identify (microbial signatures)

TCGA cancer microbiome and its normalization

Filtered (← with quality-controlled metadata & normalized)
∴ WGS → provided (significantly more microbial reads) (← than RNA-seq experiments) for
1⃣ primary tumor
2⃣ solid-tissue normal
3⃣ metastatic
4⃣ recurrent tumor samples

Performed direct genome alignments (← slower, but potentially more specific)
→ for four TCGA types of cancer
1⃣ Cervical squamous cell carcinoma
2⃣ Stomach adenocarcinoma
3⃣ Lung adenocarcinoma
4⃣ Ovarian serous cystadenocarcinoma
∵ Fast k-mer-matching approaches → are prone to false-positive results

(TCGA expression & human genomic data) → are known to show (substantial batch effects)
∴ Converted discrete taxonomical counts → into log-counts per million
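
The count → log-CPM step above can be sketched as follows. This is a minimal sketch, not the paper's pipeline: the offsets (0.5 on each count, 1 on the library size) follow the convention of limma's voom transform and are an assumption here, the taxon names and counts are made up, and any batch-effect correction that follows is not shown.

```python
import math

def log_cpm(counts, prior=0.5):
    """Convert one sample's discrete taxon counts to log2 counts per million.

    `prior` keeps zero counts finite; the +1 on the library size mirrors
    the voom convention (assumed here, not stated in the notes).
    """
    library_size = sum(counts.values())
    return {
        taxon: math.log2((c + prior) / (library_size + 1.0) * 1e6)
        for taxon, c in counts.items()
    }

# Hypothetical genus-level counts for a single sample.
sample = {"Fusobacterium": 120, "Helicobacter": 0, "Faecalibacterium": 880}
normalized = log_cpm(sample)
```

Zero counts stay finite after the transform, so every sample yields a value for every taxon.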

Predicting among and within types of cancer

Trained (stochastic gradient-boosting ML models)
→ to discriminate (between & within) (types & stages) of cancer
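
To make "stochastic gradient boosting" concrete, here is a toy sketch under stated assumptions: decision stumps as weak learners, logistic loss, and a random row subsample per round (the "stochastic" part). It is not the authors' implementation; all function names, hyperparameters, and the two-feature data set are hypothetical.

```python
import math
import random

def fit_stump(X, residuals, rows):
    """Best single-feature split (least squared error) on the sampled rows."""
    best = None
    for j in range(len(X[0])):
        values = sorted({X[i][j] for i in rows})
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2  # midpoint between neighbouring values
            left = [residuals[i] for i in rows if X[i][j] <= t]
            right = [residuals[i] for i in rows if X[i][j] > t]
            lm, rm = sum(left) / len(left), sum(right) / len(right)
            sse = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
            if best is None or sse < best[0]:
                best = (sse, j, t, lm, rm)
    return None if best is None else best[1:]

def score(stumps, x):
    """Additive model output (log-odds scale) for one sample."""
    return sum(lv if x[j] <= t else rv for j, t, lv, rv in stumps)

def fit_boosting(X, y, n_rounds=30, learning_rate=0.3, subsample=0.7, seed=0):
    rng = random.Random(seed)
    n = len(X)
    stumps = []
    for _ in range(n_rounds):
        # Negative gradient of the logistic loss: label minus predicted probability.
        residuals = [y[i] - 1 / (1 + math.exp(-score(stumps, X[i]))) for i in range(n)]
        # Each round sees only a random subset of the rows.
        rows = rng.sample(range(n), max(2, int(subsample * n)))
        stump = fit_stump(X, residuals, rows)
        if stump is not None:
            j, t, lv, rv = stump
            stumps.append((j, t, learning_rate * lv, learning_rate * rv))
    return stumps

# Hypothetical normalized abundances of two taxa; label 1 = cancer type of interest.
X = [[0.1, 5], [0.2, 3], [0.3, 4], [0.4, 6], [2.1, 5], [2.2, 4], [2.5, 3], [2.7, 6]]
y = [0, 0, 0, 0, 1, 1, 1, 1]
stumps = fit_boosting(X, y)
predictions = [1 if score(stumps, x) > 0 else 0 for x in X]
```

Only the first feature separates the two classes in this toy data, so the stumps split on it and the second feature is effectively ignored.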

Measuring performance
1⃣ One cancer type vs. all others
2⃣ Tumor vs. normal
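
The two comparisons above both reduce to binary labels. A minimal sketch of how such labels could be built; the function names, the sample-type strings, and the choice to drop other sample types from the tumor vs. normal task are assumptions for illustration.

```python
def one_vs_all_labels(cancer_types, target):
    """1 for samples of the target cancer type, 0 for every other type."""
    return [1 if t == target else 0 for t in cancer_types]

def tumor_vs_normal_labels(sample_types):
    """1 for primary tumor, 0 for solid-tissue normal, None for anything else."""
    mapping = {"primary tumor": 1, "solid-tissue normal": 0}
    return [mapping.get(s) for s in sample_types]

# Hypothetical metadata for five samples.
cancer_types = ["COAD", "CESC", "COAD", "STAD", "LUAD"]
sample_types = ["primary tumor", "primary tumor", "solid-tissue normal",
                "metastatic", "solid-tissue normal"]

coad_vs_rest = one_vs_all_labels(cancer_types, "COAD")
tumor_vs_normal = tumor_vs_normal_labels(sample_types)
```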

Differences in (sensitivity & specificity) (← between types of cancer)
∵ differences in class sizes & AUROC & AUPR
❗Cancer microbial heterogeneity → may also contribute to this ↑ differential performance
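
For reference, the four metrics named above can be computed as in the sketch below. Sensitivity and specificity depend on one chosen cutoff, while AUROC and AUPR summarize the ranking over all cutoffs. Average precision is used here as one common estimate of AUPR; the labels, scores, and the 0.5 cutoff are made up, and scores are assumed to be untied for the AUPR calculation.

```python
def sensitivity_specificity(labels, scores, threshold=0.5):
    """Sensitivity and specificity at a single score cutoff."""
    tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= threshold)
    tn = sum(1 for y, s in zip(labels, scores) if y == 0 and s < threshold)
    return tp / labels.count(1), tn / labels.count(0)

def auroc(labels, scores):
    """Probability that a random positive outranks a random negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def average_precision(labels, scores):
    """Mean of the precision values at the rank of each positive sample."""
    ranked = sorted(zip(scores, labels), reverse=True)
    hits, total = 0, 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank
    return total / hits

# Hypothetical one-vs-all labels and model scores.
labels = [1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.3, 0.2]
sens, spec = sensitivity_specificity(labels, scores)
roc = auroc(labels, scores)
ap = average_precision(labels, scores)
```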

Randomly sorted (raw TCGA microbial counts) → into two batches
← to evaluate (the generalizability) ← across data sets
∴ Found highly similar performance
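
The random two-batch split above could look like this. A minimal sketch only: the sample IDs, the seed, and the equal-halves split are assumptions, and the per-batch normalization and model training are not shown.

```python
import random

def split_into_two_batches(sample_ids, seed=0):
    """Shuffle sample IDs reproducibly and cut them into two halves."""
    rng = random.Random(seed)
    shuffled = list(sample_ids)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]

# Hypothetical sample identifiers.
sample_ids = [f"sample_{i:03d}" for i in range(10)]
batch_a, batch_b = split_into_two_batches(sample_ids)
```

Each half can then serve in turn as the training set while the other is held out, which gives two independent checks of generalizability.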

SHOGUN
← for further validation
→ an alignment-based microbial taxonomic pipeline
→ SHOGUN-derived data → replicated the batch effects

∴ NO ❌ (major differences) (← in discriminatory performance) (← between the data sets)

Biological relevance of micro-organism profiles

Evidence → for their biological relevance
← given (the strong discrimination) (← of microbial signatures)
← using (ecologically expected) & (clinically tested outcomes)

Bayesian microbial-source tracking algorithm
← to assess whether (cancer-associated micro-organisms are expected)
← across 8 body sites in the HMP2 (Human Microbiome Project 2)
← from 70 solid-tissue normal samples

Fusobacterium spp.
→ are important in the (development & progression) of gastrointestinal tumors
→ are more abundant in primary tumors (← compared to solid-tissue normal samples & blood-derived normal samples)

Helicobacter pylori
→ has NO ❌ differences (← between primary tumor & adjacent solid-tissue normal samples)

Comparing (our microorganism-detection pipeline)
← using 2 different bio-informatics pipelines
1⃣ de novo meta-genome assembly methods
2⃣ read-based methods → PathSeq algorithm

❗(Differential abundance) (← of the alpha-papillomavirus genus)
← between primary tumors

❗Patients (← with liver hepato-cellular carcinoma)
→ had (selective over-abundance of the HBV genus)
← in both (primary tumors) & (adjacent solid-tissue normal samples)

Microbial (drivers & commensals)
→ provide (initial evidence) (← that the models were ecologically relevant)
→ e.g. for CESC tumor → alpha-papillomavirus genus
→ for COAD tumor → Faecalibacterium genus

❗Provide (raw & normalized) microbial abundance data sets

Measuring and mitigating contamination

Importance of (measuring & mitigating) ← the (potential effects) (← of contamination)
→ spiked (5 types ← of pseudo-contaminants) → into the raw data sets
→ to track through (decontamination & supervised normalization & ML)

∴ Grouped samples → into individual (sequencing plates) & Removed all (putative contaminants) (← which were identified)
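
A sketch of the bookkeeping behind this step, under stated assumptions: how contaminants are flagged on each plate is not implemented here (the `flagged_by_plate` sets are a hypothetical stand-in for that upstream step), and taking the union of flagged taxa across plates is one reading of "removed all putative contaminants". Field names and taxa are made up.

```python
from collections import defaultdict

def group_by_plate(samples):
    """Map each sequencing-plate ID to the samples processed on it."""
    plates = defaultdict(list)
    for sample in samples:
        plates[sample["plate"]].append(sample)
    return dict(plates)

def remove_contaminants(samples, flagged_by_plate):
    """Drop every taxon flagged on any plate from all samples."""
    contaminants = set().union(*flagged_by_plate.values())
    return [
        {**s, "counts": {t: c for t, c in s["counts"].items() if t not in contaminants}}
        for s in samples
    ]

# Hypothetical samples and per-plate contaminant calls.
samples = [
    {"id": "s1", "plate": "P1", "counts": {"Fusobacterium": 40, "TaxonX": 7}},
    {"id": "s2", "plate": "P1", "counts": {"Faecalibacterium": 12, "TaxonX": 9}},
    {"id": "s3", "plate": "P2", "counts": {"Fusobacterium": 3, "TaxonY": 5}},
]
plates = group_by_plate(samples)
flagged_by_plate = {"P1": {"TaxonX"}, "P2": {"TaxonY"}}
cleaned = remove_contaminants(samples, flagged_by_plate)
```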

These (in silico decontamination methods)
→ are NOT ❌ substitutes → for implementing (gold-standard microbiology practices) → on cancer samples

(In silico tools) (← which are described here)
→ reflect the state-of-the-art
→ are NOT ❌ designed → to detect (abundant spikes ← of contaminants)

Stringent decontamination
→ risks removing (real signals) (← which reflect commensals)

Predictions using microbial DNA in blood

❗Mounting evidence
← blood-based mbDNA (← microbial DNA) → can be (clinically informative) in cancer
→ is unclear ❓ ← how broadly this applies
∴ ML strategies → found that blood-borne mbDNA could discriminate (← between numerous types of cancer)

Benchmark (← our ML models)
← against existing ctDNA assays
→ focusing on circumstances (← under which ctDNA assays fail)

ML can discriminate well
← between types of cancer using blood mbDNA
← after removing (all blood-derived normal samples)

❗ Notable facts
1⃣ ctDNA assays → use plasma (← rather than whole blood)
2⃣ Distribution (← of mbDNA among blood) is unknown ❓
3⃣ Impossible ❌ → to tell whether (mbDNA came from live or dead micro-organisms)

∴ Their filtering was limited by
1⃣ NOT ❌ having the primary specimens
2⃣ genus-level taxonomic resolution
3⃣ NOT ❌ knowing (← which non-TCGA samples) were concurrently processed

Validating microbial signatures in blood

Evaluated (the use of plasma-derived, cell-free mbDNA signatures)
← to demonstrate (the real-world utility) (← of these results) while benchmarking plasma-based ctDNA assays
→ to discriminate among (healthy individuals & multiple types of cancer)

Plasma
→ is a distinct subset (← of whole blood)
→ carries (major advantages) ← in archival stability & bio-repository availability & biological interpretation

Cell-free DNA
← was extracted from these plasma samples (← with extensive controls)