Paper - Review
DOI: 10.1186/s12859-020-03695-z
Abstract
Background
Analysis of (somatic mutations)
← from tumor whole exomes
→ has fueled → discovery of (novel cancer driver genes)
98% of the genome
→ is non-coding
→ includes (regulatory elements)
← whose (normal cellular functions) → can be disrupted by mutation
WGS
→ allows → for identification of (non-coding somatic variation)
→ expanded → estimation of (background mutation rates)
Results
MutEnricher
→ a flexible toolset → for investigating (somatic mutation enrichment)
← in both 1⃣ coding 2⃣ non-coding genomic regions
MutEnricher
→ contains → two distinct modules → for these purposes
→ for calculating 1⃣ sample-specific 2⃣ feature-specific background mutation rates
Conclusions
MutEnricher
→ is a Python package → for investigating (somatic mutation enrichment)
Background
Analysis of (somatic mutations)
← throughout the protein-coding genome
→ has fueled → the discovery of many (cancer driver genes)
The vast majority of the genome
→ is non-coding
→ contains (regulatory elements)
← e.g. 1⃣ enhancers 2⃣ promoters
→ that influence 1⃣ cell-type 2⃣ tissue-type specific processes
WGS
→ allows for (genome-wide discovery) of (somatic variation)
→ may identify → novel non-coding driver mutations
Analyses of non-coding somatic variation
→ have identified recurrent mutations
e.g. 1⃣ TERT 2⃣ FOXA1
❗: Devised → a variety of (analytical strategies)
→ for interrogating non-coding somatic mutations
❓: Software packages → are NOT readily available
MutEnricher
→ a flexible toolset → that performs (somatic mutation enrichment analysis)
← of both 1⃣ protein-coding 2⃣ non-coding genome loci
MutEnricher
→ computes → 1⃣ overall mutation burden 2⃣ hotspot enrichments
MutEnricher
→ is composed of (two distinct analysis modules)
→ 1⃣ coding → identifies genes → harboring recurrent non-silent somatic mutations 2⃣ noncoding → identifies enrichments of somatic variation ← in user-defined non-coding genomic regions
Implementation
Overview
MutEnricher
→ performs → somatic mutation enrichment analyses
Coding module
→ assesses (enrichment of non-silent somatic mutations)
← within coding gene sequence
Noncoding module
→ determines (somatic enrichment)
← within user-defined genomic regions
Both modules
→ compute → 1⃣ overall feature burden 2⃣ hotspot enrichment significances
← e.g. 1⃣ gene 2⃣ non-coding region
Both MutEnricher modules
→ report 1⃣ independent burden 2⃣ hotspot p-values
← with combined significance estimates → for interrogation
Required inputs and file formats
Somatic mutations
← provided to MutEnricher
→ tabix-indexed somatic VCF files
Coding gene impact annotations
→ are required → for the coding module
→ to distinguish (non-silent 🆚 silent mutations)
MutEnricher
→ interrogates → somatic mutation densities
← in user-defined features of interest
Background mutation rate calculations
MutEnricher
→ implements several methods
← which users can select → for computing (background mutation rates)
← which are necessary → for 1⃣ gene 2⃣ region enrichment calculations
❗: 1⃣ global 2⃣ local 3⃣ covariate clustered
❗: with the global method
Gene/Region backgrounds are computed
→ as (the sum of sample somatic mutation counts)
← within all features, divided by the total length of those features
∴ All features (← within a sample) → have the same background rate
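A minimal sketch of the global rate for one sample, assuming mutation counts and feature lengths are already available as plain dictionaries. The function and variable names are hypothetical and are not part of MutEnricher's API.

```python
def global_background_rate(feature_counts, feature_lengths):
    """Pooled mutations per base across all features for one sample."""
    total_length = sum(feature_lengths.values())
    if total_length <= 0:
        raise ValueError("total feature length must be positive")
    # Features with no recorded mutations contribute a count of zero.
    total_mutations = sum(feature_counts.get(name, 0) for name in feature_lengths)
    return total_mutations / total_length


# Hypothetical sample: two genes, four mutations, 4000 bases in total.
lengths = {"geneA": 1000, "geneB": 3000}
counts = {"geneA": 3, "geneB": 1}
rate = global_background_rate(counts, lengths)  # same rate applies to every feature
```

Because the rate is pooled, geneA and geneB share one background value within the sample, which matches the ∴ line above.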
❗: for the local method
A local background mutation rate
→ is calculated 1⃣ per-gene 2⃣ per-region
← for each sample
Local windows → are scanned
→ around each feature ← in each sample
The background mutation rate
← for the sample's feature
→ is set → to the maximal observed rate ← from this procedure
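A rough sketch of the "scan windows, keep the maximum" idea for one feature in one sample. The notes do not say how windows are sized or whether the feature itself is counted. Here the windows are symmetric flanks that include the feature, and the flank sizes are hypothetical values chosen for illustration.

```python
from bisect import bisect_left, bisect_right


def local_background_rate(positions, start, end, flank_sizes):
    """Maximum mutation density over windows extending `flank` bases
    on either side of the feature [start, end] (1-based, inclusive)."""
    positions = sorted(positions)
    best = 0.0
    for flank in flank_sizes:
        lo = max(0, start - flank)
        hi = end + flank
        # Count mutations with lo <= position <= hi.
        n = bisect_right(positions, hi) - bisect_left(positions, lo)
        best = max(best, n / (hi - lo + 1))
    return best


# Hypothetical sample mutations on one chromosome.
sample_positions = [100, 950, 1500, 2100, 9000]
local_rate = local_background_rate(sample_positions, 1000, 2000, [100, 1000, 10000])
```

Taking the maximum is conservative: a higher background rate makes enrichment harder to call for that sample's feature.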
❗: for the covariate method
→ clusters features ← by similarity of (user-supplied genomic covariates)
← using affinity propagation
→ calculates 1⃣ per-sample 2⃣ per-feature rates
← from the mutation densities of (cluster members)
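A sketch of the second step only: turning cluster membership into per-feature rates for one sample. The clustering itself is assumed to be done beforehand (for instance with scikit-learn's `AffinityPropagation`), and pooling counts over pooled length is an assumption about how "mutation densities of cluster members" are combined. All names are hypothetical.

```python
from collections import defaultdict


def cluster_background_rates(cluster_of, counts, lengths):
    """Per-feature background rate for one sample, pooled over the
    members of each feature's covariate cluster."""
    pooled_counts = defaultdict(int)
    pooled_lengths = defaultdict(int)
    for feature, cluster in cluster_of.items():
        pooled_counts[cluster] += counts.get(feature, 0)
        pooled_lengths[cluster] += lengths[feature]
    return {
        feature: pooled_counts[cluster] / pooled_lengths[cluster]
        for feature, cluster in cluster_of.items()
    }


# Hypothetical cluster labels: geneA and geneB have similar covariates.
cluster_of = {"geneA": 0, "geneB": 0, "geneC": 1}
counts = {"geneA": 2, "geneB": 0, "geneC": 1}
lengths = {"geneA": 1000, "geneB": 3000, "geneC": 500}
cluster_rates = cluster_background_rates(cluster_of, counts, lengths)
```

Unlike the global method, features in different clusters get different rates within the same sample.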
An additional method
→ combines → the behaviors of 1⃣ the local 2⃣ covariate clustering methods
Features
→ are again grouped ← by genomic covariates
The final background mutation rate
→ is calculated
→ as (the geometric mean) ← of sample-wise rates ← for all samples
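The geometric mean step can be sketched as below. How zero rates are handled is not stated in these notes, so this version simply rejects them. The function name is hypothetical.

```python
import math


def geometric_mean_rate(rates):
    """Geometric mean of strictly positive sample-wise rates."""
    if not rates or any(r <= 0 for r in rates):
        raise ValueError("rates must be a non-empty list of positive numbers")
    # Average in log space to avoid underflow when multiplying tiny rates.
    return math.exp(sum(math.log(r) for r in rates) / len(rates))


# Hypothetical sample-wise rates (mutations per base).
combined_rate = geometric_mean_rate([1e-6, 4e-6])
```

Compared with an arithmetic mean, the geometric mean is less dominated by a single hypermutated sample.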
Burden and "hotspot" statistical testing
MutEnricher
→ implements two statistical strategies
→ for determining (somatic mutation enrichments)
→ 1⃣ binomial testing strategy 2⃣ negative binomial testing strategy
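A sketch of a one-sided binomial burden test using only the standard library (binomial case only, not the negative binomial). Treating every base in every sample as one trial is an assumption of this sketch and may differ from how the tool defines its trials. Function names are hypothetical.

```python
import math


def binom_log_pmf(k, n, p):
    """Log of the Binomial(n, p) probability mass at k, for 0 < p < 1."""
    return (
        math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
        + k * math.log(p) + (n - k) * math.log1p(-p)
    )


def burden_p_value(k, n, p):
    """Upper-tail probability P(X >= k) for X ~ Binomial(n, p)."""
    if k <= 0:
        return 1.0
    total = 0.0
    for i in range(k, n + 1):
        term = math.exp(binom_log_pmf(i, n, p))
        total += term
        # Past the mean the terms shrink; stop once they no longer matter.
        if i > n * p and (term == 0.0 or term < total * 1e-17):
            break
    return min(total, 1.0)


# Hypothetical feature: 2000 bases, 100 samples, 6 observed mutations,
# background rate of 1e-6 mutations per base.
p_value = burden_p_value(6, 2000 * 100, 1e-6)
```

Summing the upper tail directly, rather than computing one minus the lower tail, keeps very small p-values from rounding to zero.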
MutEnricher
→ finds → significant "hotspot" enrichments
← by progressively grouping (somatic mutations)
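One simple way to picture grouping mutations into candidate hotspots is a gap-based merge of sorted positions. This is an illustrative stand-in, not the tool's documented procedure: the `max_gap` value is hypothetical, and the minimum of five mutations follows the run settings noted further down.

```python
def candidate_hotspots(positions, max_gap=50, min_mutations=5):
    """Group sorted positions whose neighbours lie within `max_gap` bases;
    return (start, end, count) for groups with enough mutations."""
    groups = []
    current = []
    for pos in sorted(positions):
        if current and pos - current[-1] > max_gap:
            groups.append(current)
            current = []
        current.append(pos)
    if current:
        groups.append(current)
    return [(g[0], g[-1], len(g)) for g in groups if len(g) >= min_mutations]


# Hypothetical positions pooled across samples for one feature.
pooled_positions = [100, 105, 110, 112, 120, 5000, 5010, 9000]
hotspots = candidate_hotspots(pooled_positions)
```

Each surviving group could then be tested for enrichment against the background rate in the same way as the whole feature.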
MutEnricher
→ reports → both 1⃣ independent burden 2⃣ hotspot p-values
← along with combined significance estimates ← using Fisher's method
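Fisher's method for exactly two p-values has a closed form, since the statistic -2 ln(p1 p2) follows a chi-square distribution with 4 degrees of freedom under the null. A minimal sketch with a hypothetical function name:

```python
import math


def fisher_combine_two(p_burden, p_hotspot):
    """Combine two independent p-values with Fisher's method.
    With x = p1 * p2, the chi-square (4 df) upper tail is x * (1 - ln x)."""
    for p in (p_burden, p_hotspot):
        if not 0.0 <= p <= 1.0:
            raise ValueError("p-values must lie in [0, 1]")
    x = p_burden * p_hotspot
    if x == 0.0:
        return 0.0
    return x * (1.0 - math.log(x))


# Hypothetical burden and hotspot p-values for one gene.
combined_p = fisher_combine_two(0.05, 0.05)
```

The method assumes the two p-values are independent, so the combined value is best read as an estimate when burden and hotspot tests share the same mutations.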
Datasets, run characteristics, and comparisons to existing tools
Obtained → several somatic MAF files
← from TCGA cohorts
Ran → MutEnricher's coding module
← on these in an exome-specific mode
Required → candidate hotspots
→ to have at least five (somatic mutations)
Compared → MutEnricher results
← with those from 1⃣ MutSigCV 2⃣ MutSig2CV 3⃣ fishHook 4⃣ OncodriveFML
Results and discussion
Ran → MutEnricher's coding module
← on seven WES-derived mutation datasets ← from TCGA
Observed → strong overlap
← between genes called statistically significant by MutEnricher ← and those also called by MutSigCV
Genes
← which were not identified as significant
→ but were significant when hotspots were considered
→ include KRAS ← in BRCA
Compared
→ MutEnricher's combined (burden & hotspot results)
→ to MutSig2CV significance calls
MutEnricher burden results
→ were also consistent
← with fishHook results
Results
← from these tools
← on TCGA lung dataset
→ were highly variable
MutEnricher's consistency ← with all tools
→ was higher
← when these cancer types were not considered
Tested → MutEnricher's 1⃣ coding 2⃣ non-coding modules
← on 1⃣ breast 2⃣ liver 3⃣ medulloblastoma
Compared → 1⃣ coding 2⃣ non-coding analysis results
→ to 1⃣ fishHook 2⃣ OncodriveFML
Tested → non-coding results
← against MOAT's annotation-based algorithm
Significantly mutated genes
← called by 1⃣ MutEnricher 2⃣ fishHook
→ were highly consistent
Focused → on liver somatic mutations
← as hepatocellular carcinomas
→ are known → to possess recurrent hotspot mutations
← in the TERT promoter