Paper - Review
DOI: 10.1186/s12859-019-2968-1
Abstract
Background
Reliable detection ← of differentially expressed genes (DEGs)
← in RNA-seq data
→ is not a trivial task
← despite the availability of (many ready-made testing software)
❓: Data analysis
→ has intricacies ← that require careful human attention
Use → modern data analysis techniques
← that incorporate visual feedback → to verify the appropriateness of their models
❗: Some RNA-seq packages
→ provide → static visualization tools
❓: Their meaningfulness
→ should be explicitly demonstrated → to users
Results
1⃣ Introduce → new interactive RNA-seq visualization tools
2⃣ Compile → a collection of examples
← showing why visualization should be (an integral component) ← of differential expression analysis
Use → public RNA-seq datasets
→ to show that → our new visualization tools can detect
→ 1⃣ normalization issues 2⃣ differential expression designation problems 3⃣ common analysis errors
❗: our new visualization tools
→ can identify → genes of interest
← in ways undetectable with models
∴ "bigPint" (R package) → includes these visualization tools
← many of which are unique additions → to what is currently available
Conclusions
Interactive graphics
→ should be an indispensable component of (modern RNA-seq analysis)
← which is currently not the case
1⃣ This paper 2⃣ its corresponding software
→ aim to persuade
→ 1⃣ users to slightly modify (their differential expression analyses) ← by incorporating statistical graphics → into their usual analysis pipeline 2⃣ developers → to create (additional complex & interactive plotting methods)
Background
RNA-seq
→ uses next-generation sequencing (NGS)
→ to estimate the quantity of RNA ← in biological samples
1⃣ decreasing cost 2⃣ increasing throughput
→ has rendered RNA-seq
← an attractive form of transcriptome profiling
Gene expression studies
→ were performed ← with microarray techniques
← which required (prior knowledge) ← of reference sequences
RNA-seq
→ has enabled → a new range of applications
← such as 1⃣ de novo transcriptome assembly 2⃣ detection of alternative splicing processes
RNA-seq
→ is revolutionizing our understanding
← of the intricacies of (eukaryotic transcriptomes)
One common format of RNA-seq data
→ is a matrix ← containing mapped read counts
← for 1⃣ n rows of genes 2⃣ p columns of samples
Researchers
→ conduct RNA-seq studies
→ to identify DEGs
← between treatment groups
This objective
→ is approached ← with models
← such as 1⃣ the negative binomial model 2⃣ linear regression models
❗: RNA-seq
→ was initially believed to produce → un-biased data
← that did not require sophisticated normalization
❓: RNA-seq data
→ is replete ← with biases
→ such that accurate detection of DEGs → is not a negligible task
∴ Factors that complicate RNA-seq data analysis
→ include 1⃣ nucleotide & read-position biases 2⃣ biases ← related to gene lengths & sequencing depths 3⃣ biases ← introduced during (library preparation) 4⃣ confounding combinations ← of (technical & biological variability)
Researchers
→ should analyze RNA-seq data
← like they would any other biased multi-variate data
Solely applying models
→ to such data
→ is problematic
∵ models hold assumptions
← that must be verified → to ensure statistical soundness
Data visualization
→ enables researchers
→ to see (patterns & problems)
← which they may not otherwise detect ← with traditional modeling
The most effective approach
→ to data analysis
→ is to iterate ← between (models & visuals)
→ is to enhance ← the appropriateness of (applied models)
← based on feedback from visuals
Primarily want
→ to compare the variability
← 1⃣ between replicates 2⃣ between treatment groups
∴ This is visually best achieved
← by drawing the mapped read count distribution
← across 1⃣ all genes 2⃣ all samples
Strive
→ to remedy this problem
← by highlighting the utility of (new & effective) differential expression plotting tools
Use real RNA-seq data
→ to show that their tools can detect
→ 1⃣ normalization problems 2⃣ DEG designation problems 3⃣ common errors
∴ Their tools
→ can identify genes of interest
← which cannot otherwise be obtained ← by models
∴ interactive graphics
→ should be → an indispensable component
← of modern RNA-seq analysis
Users
→ simply modify their approach → to differential expression analysis
← by (assessing the sensibility of their models) ← with multi-variate graphical tools
← e.g. 1⃣ parallel coordinate plots 2⃣ scatter plot matrices 3⃣ liter plots
Results
Parallel coordinate plots
Parallel coordinate plots
→ are essential
→ to inform → relationships
← between variables ← in multi-variate data
A parallel coordinate plot
→ draws each row (gene) ← as a line
Two samples
← with similar read counts
→ will have a flat connection
Two samples
← with dissimilar read counts
→ will have a sloped connection
Researchers
→ can quickly confirm this
← with a parallel coordinate plot
∴ There should be
→ flat connections ← between replicates
→ crossed connections ← between treatments
Two of the most common graphic techniques
→ are 1⃣ side-by-side boxplots 2⃣ multi-dimensional scaling (MDS) plots
❓: these plots
→ can hide problems
← that still exist ← in the data even after normalization
← that could be better detected ← with parallel coordinates plots
Figure
→ exemplifies → this problem
→ for two simulated datasets
Each dataset
→ contains → two treatment groups ← with three replicates
The side-by-side boxplots
→ both show → fairly consistent medians
← across (the six samples) ← in the left & right datasets
∴ The left MDS plot
→ separates → the treatment groups distinctively
∴ The right MDS plot
→ suggests a similar separation
← but in a much subtler manner
1⃣ The box plots 2⃣ MDS plots
→ provide useful information
The parallel coordinate plots
→ show an additional meaningful difference
← between the left and right datasets
❗: Some of the genes
→ have consistently low values → for treatment group A
→ have consistently high values → for treatment group B
❗: some genes have → the opposite phenomenon
∴ The majority of the plotted genes
→ may be DEG candidates
∴ Could NOT see
→ this important distinction ← as clearly
← using 1⃣ the side-by-side box plots 2⃣ the MDS plots
∵ they only provide data summarization
← at the sample resolution
∴ The parallel coordinates plots
→ show the sample connections
→ for each gene ← in the data
The example above
→ was simulated → for didactic purposes
Examine → the application of parallel coordinate plots
→ to real data ← from an RNA-seq study
← that compared soybean leaves
Filtered genes
← with 1⃣ low means 2⃣ low variance
Performed → a hierarchical clustering analysis
← with a cluster size of four
Retained ← only significant genes
Visualized → the results
← using parallel coordinate lines
Standardized → each gene
→ to have 1⃣ a mean of zero 2⃣ standard deviation of unity
Performed → hierarchical clustering
← on the standardized DEGs
← using Ward's linkage
∴ This process
→ can divide (large DEG lists) → into (smaller clusters) of (similar patterns)
→ allows us → to more effectively detect → the various types of patterns
∴ 1⃣ the number 2⃣ quality of clusters
→ can vary ← depending on the data
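The standardization step can be sketched as follows. This is only an illustration, not the paper's code: the gene names and counts are made up, and the sample standard deviation is an assumed choice. The clustering itself would normally be delegated to a library routine with Ward's linkage and is not shown here.

```python
from statistics import mean, stdev

def standardize_gene(counts):
    """Rescale one gene's counts to mean 0 and (sample) standard deviation 1."""
    mu = mean(counts)
    sd = stdev(counts)
    if sd == 0:
        # A gene with constant counts carries no pattern; map it to zeros.
        return [0.0] * len(counts)
    return [(c - mu) / sd for c in counts]

# Hypothetical genes: three replicates of group A followed by three of group B.
genes = {
    "gene_up": [10, 12, 11, 40, 42, 41],
    "gene_down": [400, 420, 410, 100, 120, 110],
}
standardized = {name: standardize_gene(values) for name, values in genes.items()}
```

After this step both genes live on the same scale, so their parallel coordinate lines can be compared by shape rather than by magnitude.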
The majority of significant genes
→ were ← in Cluster 1 & 2
← which for the most part captured → the expected patterns of (differential expression)
∴ These genes
→ mostly showed → clean differential expression profiles
Scatter plot matrices
A scatter plot matrix
→ is another effective multi-variate visualization tool
← that plots (read count distribution) ← across 1⃣ all genes 2⃣ all samples
A scatter plot matrix
→ represents each row (gene)
← as a point in each scatter plot
∴ Users
→ can quickly discover → unexpected patterns
→ recognize → geometric shapes
→ assess → the structure & association ← between (multiple variables) in a manner
← that is different ← from most common practices
Clean data
→ would be expected
→ to have larger variability
← between (treatment groups) ← than between replicates
Researchers
→ can quickly confirm this
← with a scatter plot matrix
∴ Most genes
→ should fall ← along the x=y line
A small proportion of them
→ is expected to show differential expression
← between samples
A fraction of the genes
→ should have (lower variability)
← between replicates than between treatments
The spread of the scatter plot points
→ is expected to fall more closely ← along the x=y relationship
← between replicates than between treatments
Created → a scatter plot matrix
→ for a public RNA-seq dataset
← that contains three replicates → for two developmental stages of (soybean cotyledon)
Users
→ can use the scatter plot matrix
→ to focus ← on subsets of genes
1⃣ outlier genes
← that deviate ← from the x=y line
← in replicate scatter plots
→ might be problematic
2⃣ outlier genes
← that deviate ← from the x=y line
← in treatment scatter plots
→ might be DEGs
∴ Users
→ view their patterns
← from multiple perspectives
← while obtaining their identifiers
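One way to make "deviates from the x=y line" concrete is the perpendicular distance of each gene's point from that line. The sketch below is an assumption for illustration, not the paper's method: the threshold and the log-scale values are invented.

```python
import math

def distance_from_identity(x, y):
    """Perpendicular distance of the point (x, y) from the line x = y."""
    return abs(x - y) / math.sqrt(2)

def flag_outliers(sample_x, sample_y, threshold):
    """Indices of genes whose point lies farther than `threshold` from x = y."""
    return [
        i
        for i, (x, y) in enumerate(zip(sample_x, sample_y))
        if distance_from_identity(x, y) > threshold
    ]

# Hypothetical log-scale counts for five genes in three samples.
rep1 = [2.0, 5.1, 7.9, 3.0, 6.0]   # group A, replicate 1
rep2 = [2.1, 5.0, 8.0, 6.5, 6.1]   # group A, replicate 2
trt = [2.0, 9.0, 8.1, 3.1, 1.0]    # group B, replicate 1

replicate_outliers = flag_outliers(rep1, rep2, 1.0)  # possibly problematic genes
treatment_outliers = flag_outliers(rep1, trt, 1.0)   # DEG candidates
```

Here the gene at index 3 is flagged in the replicate panel, and the genes at indices 1 and 4 are flagged in the treatment panel.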
Each gene
← in our data
→ is plotted once ← in each of the 15 scatter plots
More than one million points
→ must be plotted
∴ Rendering all points
→ would slow down → the interactive capability of the plot
Can tailor → the geometric object of (the scatter plots)
→ to be hexagon bins ← rather than points
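The point count follows from simple arithmetic: every unordered pair of samples gets one scatter plot, and every gene appears once per plot. A small sketch, where the gene count of 70,000 is a hypothetical figure chosen only to show the order of magnitude:

```python
from math import comb

def scatter_panels(n_samples):
    """Number of pairwise scatter plots for n_samples samples."""
    return comb(n_samples, 2)

def total_points(n_genes, n_samples):
    """Points drawn if every gene is rendered once in every panel."""
    return n_genes * scatter_panels(n_samples)

panels = scatter_panels(6)        # six samples, as in these notes
points = total_points(70_000, 6)  # 70,000 genes is an assumed figure
```

Hexagon binning replaces those individual points with one shape per occupied bin, which is why it keeps the plot responsive.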
The genes
→ are also linked
→ to a second plot ← that super-imposes them ← as parallel coordinate lines
← on a side-by-side box plot ← of all counts in the dataset
Assessing normalization with scatter plot matrices
The scatter plot matrix
→ can be used → to (1⃣ understand 2⃣ assess) various algorithms
Use
→ a publicly-available RNA-seq dataset
← on yeast grown in YP-Glucose (YPD)
→ to exemplify this point
The data
→ contained (four cultures) ← from independent libraries
← that were sequenced
← using 1⃣ two library preparation protocols 2⃣ one & two lanes ← in a total of three flow-cells
∴ Researchers
→ examine 1⃣ various levels 2⃣ combinations
← of technical effects
The authors
→ could establish → a false positive rate
← in relation to the number of DEGs
Within-lane regression alone
→ was insufficient
← in effectively removing biases
Expect → most genes
→ to show similar expression between samples
← except for the handful ← that are differentially expressed
It is clear
→ that the data still was NOT sufficiently normalized
∵ the distribution of genes → is NOT centered ← around the x=y line
The scatter plot matrix
→ follows → the expected structure
← with most genes falling ← along the x=y line
The read counts
→ fall closer → to the x=y line
← between the Y4 replicates ← than between the Y1 replicates
∵ The Y1 replicates
→ had additional technical variability
← as they used two different flow cells
The scatter plot matrix
→ can also be used
→ to quickly inspect patterns of 1⃣ biological 2⃣ technical variability
Checking for common errors with scatter plot matrices
Irreproducibility
→ is prevalent
← in high-throughput biological studies
∵ A study ← in Nature Genetics
→ surveyed 18 published micro-array expression analyses
→ reported → that only two were exactly reproducible
The extent of the problem
→ has spawned a field → "forensic bioinformatics"
Researchers
→ attempt → to reverse-engineer
→ reported results back → into the raw datasets
← simply to derive the methodologies ← used in published studies
Irreproducibility
→ is merely cumbersome
← when it masks methods
Forensic bioinformaticians
← who have actively investigated common errors
← in high-throughput biological studies
→ have concluded ← that
→ the largeness of the data itself → may hinder our ability → to detect errors
∴ Simple errors
→ can be difficult
→ to detect ← using common practice ← in high-throughput studies
Scatter plot matrices
→ are a convenient tool
→ to check for common errors ← like sample mis-labeling
A subset of these (thick & thin) scatter plots
→ appear outside of their expected locations
← given the expected variability ← between (treatment 🆚 replicates)
Rearranging
← the columns of the two samples
→ would indeed lead back → to the clean-looking scatterplot matrix
The scatter plot matrix
→ provides us → convincing evidence ← of a mis-labeling problem
← which cannot be confirmed ← with such detail
← using traditional plots ← e.g. 1⃣ box plots 2⃣ MDS plots
This method
→ can inform suspicious patterns
← in more detail ← than other means
The user
→ still needs → to substantiate
← this suspicion ← with decisive evidence
→ should only use → the visualization as a guide
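A numerical companion to the visual check is to ask whether a sample correlates better with the other group than with its own label. This is a heuristic I am assuming for illustration and is not a method from the paper: the sample names, group labels, and counts are all invented. Like the plot, it is only a guide.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length numeric lists (non-constant)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def suspicious_samples(samples, groups):
    """Samples whose mean correlation with the other group exceeds that with their own."""
    flagged = []
    for name, values in samples.items():
        own, other = [], []
        for peer, peer_values in samples.items():
            if peer == name:
                continue
            r = pearson(values, peer_values)
            (own if groups[peer] == groups[name] else other).append(r)
        if sum(own) / len(own) < sum(other) / len(other):
            flagged.append(name)
    return flagged

# Hypothetical data in which the columns for A3 and B1 were swapped by mistake.
samples = {
    "A1": [1, 2, 3, 10, 12],
    "A2": [2, 2, 4, 11, 12],
    "A3": [10, 13, 3, 1, 3],   # looks like group B
    "B1": [1, 3, 3, 10, 13],   # looks like group A
    "B2": [10, 12, 3, 1, 2],
    "B3": [11, 12, 4, 2, 2],
}
groups = {"A1": "A", "A2": "A", "A3": "A", "B1": "B", "B2": "B", "B3": "B"}
flagged = suspicious_samples(samples, groups)
```

In this toy case the two swapped samples are the ones flagged; with real data the result would still need independent confirmation.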
Finding unexpected patterns in scatter plot matrices
Most popular RNA-seq plotting tools
→ display summaries ← about the read counts
← e.g. 1⃣ fold change summaries 2⃣ principal components summaries 3⃣ five number summaries 4⃣ dispersion summaries
Scatter plot matrices
→ display → the non-summarized read counts
→ for all genes
∴ This trait
→ allows for 1⃣ geometric shapes 2⃣ patterns
→ which are relevant to the read count distribution
→ to be readily visible ← in the scatter plot matrix
❗: how geometric shapes
← in the scatter plot matrix
→ can provide applicable information
← which uses the iron-metabolism soy-bean dataset
∴ The expected pattern of a scatter plot matrix
← with more variation ← around the x=y line
← between treatments ← than between replicates
Identified → the five transcripts
← that deviated the most from the expected pattern
∴ Searched → for their putative functions
These transcripts
→ are reportedly involved ← in 1⃣ biotic 2⃣ abiotic stress responses
← the production of super-oxides → to combat microbial infections
A lab biologist
→ documented → a clean data collection process
The same researcher
→ collected the samples ← in succession
→ to reduce variability ← caused by (plant handling) ← by different researchers
❗: a vast change
← in gene expression response
← between these three time points
The streak of genes shown
← may be
← due to (the timing differences) ← between replicate handling
Scientists
→ cannot observe → such interesting structures
← from any models
These structures
→ could lead → to interesting post hoc analyses
The authors
→ had noted → 1⃣ an inadvertent experimental 2⃣ biological discrepancy
← between those replicates
A post hoc hypothesis
← that these genes ← might respond → to that discrepant condition
→ could be generated
Assessing DEG calls in scatter plot matrices
The scatter plot matrix
→ can also be used
→ to quickly examine the DEGs ← returned from a given model
DEGs
→ are expected to fall along → the x=y line → for scatter plots ← between replicates
→ to deviate ← from the x=y line → for scatter plots ← between treatment groups
Link
→ these DEGs ← as parallel coordinate lines
← on a side-by-side box plot
→ to confirm the expected pattern of (differential expression)
← from a second view point
Liter plots
❗: how to overlay DEGs
→ onto the Cartesian coordinates of (the scatter plot matrix)
❓: this figure
→ becomes limited ← when we investigate treatment groups
← that contain a large number of replicates
∵ we have (too many small scatter plots)
→ for it to remain → an effective visualization tool
Researchers
→ could benefit ← from additional plotting tools
→ to quickly verify → individual DEGs returned from a model
The "replicate line plot"
→ was developed
← by researchers who demonstrated ← it could detect model scaling problems ← in micro-array data
This plot
→ is only applicable
← on datasets where treatment groups ← contain exactly two replicates
❗: an extension of the "replicate line plot"
← which can be applied → to datasets
← with (two & more) replicates
Each gene
→ is plotted once
→ for each possible combination of replicates
← between treatment groups
∴ Each gene ← in this dataset
→ is plotted
← as nine points in the liter plot
This would reduce
→ the speed of interactive functionality
← and cause over-plotting problems
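The "nine points per gene" count comes from pairing every replicate of one group with every replicate of the other. A minimal sketch of that pairing step only, with invented counts; the actual plotting coordinates used by the software are not reproduced here:

```python
from itertools import product

def litre_points(group_a, group_b):
    """One (x, y) pair per between-group replicate combination for a single gene."""
    return list(product(group_a, group_b))

# Hypothetical counts for one gene: three replicates per treatment group.
gene_group_a = [10, 12, 11]
gene_group_b = [40, 42, 41]
points = litre_points(gene_group_a, gene_group_b)
```

With three replicates per group this yields 3 x 3 = 9 points, and the count grows as the product of the two group sizes, which is why drawing every gene as raw points quickly becomes slow.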
Once the background of hexagons
→ has been drawn
→ to give us → a sense of the distribution of all
← between-treatment sample pair combinations
we
→ can (examine & compare) → liter plots
← using the clusters
A significant gene
← from cluster 1 plotted
← as nine green points
The nine overlaid points
→ are super-imposed ← in a manner we would expect from a DEG
→ they are located far ← from the x=y line
The genes
→ are now over-expressed
← in the other treatment
There seems
→ to be a pattern ← in which one replicate ← from the P group
→ is larger ← than the other two replicates
The nine overlaid points
→ are NOT clearly super-imposed
← in the distinct pattern ← we expect of significant genes
The difference
← between the treatment groups
→ is so small ← that the overlaid points cluster ← around the x=y line
The gene
→ shows → 1⃣ inconsistent replications 2⃣ consistent treatment groups
The spread-out overlaid
→ points center ← on the x=y line
❓: the liter plots
→ call into question
→ whether the genes ← from this cluster
→ show an expected profile of (differential expression)
∴ This is similar
→ to the messy-looking parallel coordinate plots
Liter plots
→ can detect (odd & questionable) patterns
← in individual "significant genes"
← that cannot be detected numerically through models
Users
→ are provided → several input fields
← that tailor the plot functionality
Readers
→ can verify
→ that 1⃣ the parallel coordinate plots 2⃣ liter plots 3⃣ scatter plot matrices
→ tell a similar story ← about the DEG patterns in these four clusters
Closing case study
Calculate DEG calls
→ for this data
← using the normalization methods of (library size scaling)
The number of total reads
← in each sample
→ is normalized → to a common value
← across all samples
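Library size scaling can be sketched in a few lines. The common value is assumed here to be the mean library size, the sample names and counts are invented, and every sample is assumed to have a nonzero total:

```python
def library_size_scale(counts, target=None):
    """Scale each sample so its total read count equals `target`.

    counts: dict mapping sample name -> list of per-gene read counts.
    target: common library size; defaults to the mean total across samples.
    """
    totals = {sample: sum(values) for sample, values in counts.items()}
    if target is None:
        target = sum(totals.values()) / len(totals)
    return {
        sample: [c * target / totals[sample] for c in values]
        for sample, values in counts.items()
    }

# Hypothetical samples that differ only in sequencing depth.
raw = {"s1": [10, 30, 60], "s2": [20, 60, 120]}
scaled = library_size_scale(raw)
```

The two samples become identical after scaling because they differed only by a factor of two in depth. The method makes no allowance for a few genes taking up a disproportionate share of the reads.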
❓: Could finish → our analysis ← at this point
Could draw conclusions ← based on this list of DEGs
← that came from the model
❗: it would be wise
→ to also visualize (this dataset)
A scatter plot matrix
→ confirms the expected pattern
← with treatment scatter plots showing larger variation
← than technical replicate scatter plots
It → uncovers → a hidden pattern
← in the treatment plots
∴ there is a pronounced (streak of genes)
← with higher expressions ← in the liver group
❗: view the DEGs
→ from the model
← using parallel coordinate plots
May need
→ to re-consider → our normalization technique
→ taking both of (these observations) → into account
❓: library size scaling method
→ is not adequate ← in all cases
← especially when (the underlying distribution of reads) ← between samples is inconsistent
The observed streak of (outlier genes)
← that are highly expressed ← in the liver samples
→ reduces the sequencing quota available → to the remaining genes in these samples
Trimmed mean of M values (TMM) normalization
→ was designed → for such cases
This technique
→ generates sample scaling factors
← that consider sample distributions
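The idea behind the scaling factor can be shown with a deliberately simplified version. This is an assumption-laden sketch, not the published TMM algorithm: it trims only on the log-ratios (M values) and takes an unweighted mean, whereas the full method also trims on absolute expression and weights each gene by its precision. The counts are invented.

```python
import math

def simple_tmm_factor(sample, reference, trim=0.3):
    """Unweighted trimmed mean of log2 expression ratios, returned as a scale factor."""
    n_s, n_r = sum(sample), sum(reference)
    m_values = sorted(
        math.log2((y / n_s) / (r / n_r))
        for y, r in zip(sample, reference)
        if y > 0 and r > 0  # log-ratios are undefined for zero counts
    )
    if not m_values:
        raise ValueError("no genes with nonzero counts in both samples")
    k = int(len(m_values) * trim)
    kept = m_values[k:len(m_values) - k] if k > 0 else m_values
    return 2 ** (sum(kept) / len(kept))

# Hypothetical counts: nine genes are unchanged, one is hugely expressed in "liver".
kidney = [100] * 10
liver = [100] * 9 + [1900]
factor = simple_tmm_factor(liver, kidney)
effective_liver_size = sum(liver) * factor
```

Plain library size scaling would make the nine unchanged genes look under-expressed in the liver sample, because the single extreme gene inflates its total. Trimming discards that gene, and the effective library size returns to about 1000, which matches the kidney total.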
1⃣ re-start the analysis 2⃣ apply TMM normalization
→ to this data
❓: The scatter plot matrix
→ did NOT appear differently
❗: should visualize → the new DEG calls
Plotting
→ these DEGs ← as parallel coordinate lines
→ paints a much cleaner picture
← from what we saw earlier
❗: TMM normalization
→ kept → the original 1968 liver-specific DEGs
← from library size scaling
❓: TMM normalization
→ added 1578 more → for a total of 3546 liver-specific DEGs
The liver-specific DEGs
→ may be slightly less clean-looking
← with TMM normalization
The 3974 kidney-specific DEGs
← from TMM normalization
→ are a proper subset of the 7050 kidney-specific DEGs
← from library scale normalization
The 1968 liver-specific DEGs
← from library scale normalization
→ are a proper subset of the 3546 liver-specific DEGs
← from TMM normalization
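The subset statements and the arithmetic (1968 + 1578 = 3546) can be checked with set operations once the two DEG lists are in hand. The gene identifiers below are placeholders generated only so that the list sizes match the liver-specific counts in these notes:

```python
def compare_deg_lists(first, second):
    """Summarize the overlap between two lists of DEG identifiers."""
    first, second = set(first), set(second)
    return {
        "shared": len(first & second),
        "only_first": len(first - second),
        "only_second": len(second - first),
        "first_is_proper_subset": first < second,
    }

# Placeholder identifiers, not real gene names.
library_scaling_liver = [f"gene_{i}" for i in range(1968)]
tmm_liver = [f"gene_{i}" for i in range(1968 + 1578)]
summary = compare_deg_lists(library_scaling_liver, tmm_liver)
```

The same call with the kidney-specific lists swapped would test the reverse subset relation.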
∴ Perform → a deeper investigation of the effects of normalization
← on this data
∴ Explore four subsets of genes
← in the form of 1⃣ parallel coordinate plots 2⃣ scatter plots matrices 3⃣ liter plots
Demonstrate → the use of (data standardization)
→ for 1⃣ scatter plots matrices 2⃣ liter plots
Begin ← by plotting the four gene subsets
← in the form of parallel coordinate plots
← after application of hierarchical clustering analysis
Each subset
→ is grouped into eight clusters
→ to reduce any over-plotting ← that would occur should they all be viewed together ← as one large cluster
Continue our visualization study
← by overlaying genes ← from the largest cluster of the four gene subsets
← in the form of standardized scatter plot matrices
Standardization
→ causes → the whole dataset
→ to appear ← as oval-shapes that are almost identical ← across all scatter plots
Lose
→ geometric structures ← that can elicit meaningful information
∵ standardizing → scatter plot matrices
Standardization
→ amplifies → meaningful patterns
← in the overlaid DEGs
The overlaid DEG patterns
→ are more spread out
← in the standardized version
DEGs
← in both forms of normalization
→ have the expected differential expression profiles
← in the standardized scatter plots matrices
Overlaying example genes
→ from the largest cluster of the four gene subsets
Standardization
→ causes the dataset
→ to appear ← as an oval-shape
→ to remove the original geometric structure ← in the hexagonal binning
∴ The overlaid DEG patterns
→ are more spread out
← in the standardized version ← in the current case study below
∴ Better interpretation
The example genes
← that were called DEGs ← in both forms of normalization
→ have the expected profiles ← in the liter plots
The overlaid points
→ deviate more ← from the x=y line ← in a tight cluster
This dataset
→ requires more than just library size scaling
→ for reliable analysis
∵ the in-depth analyses ← in this case study
Underscore the overarching theme
→ iteration ← between 1⃣ models 2⃣ visualization
→ is crucial → to achieve (the most convincing results & conclusions)
← in RNA-seq studies
Plot scalability
All visualization plots
→ have limitations
← based on the number of samples ← in the data
Plots
← that appear messy
← regardless of sample numbers
∴ may indicate the presence of data quality problems
1⃣ MDS plots 2⃣ box plots 3⃣ parallel coordinate plots
→ can remain effective
← with fairly large sample numbers
❗: Parallel coordinate plots
→ should be sorted
→ to help place similar variables ← near each other
Scatter plot matrices
→ lose their efficiency
← at smaller sample numbers
∵ restricted space
One remedy
→ is → to subset their data
→ to plot several smaller scatter plot matrices
One scatter plot matrix
→ would have required → a prohibitive 576 scatter plots
Liter plots
→ are another remedy → for large data sets
→ can often accommodate more samples
Applied → liter plots
→ to our full data ← that contained 2 groups of 12 replicates
❗: The liter plot
→ is more suitable
→ for large datasets ← than the replicate line plot
Discussion
Effective visualization
→ should be → a crucial part of two-group differential expression analysis
1⃣ scatter plot matrices 2⃣ parallel coordinate plots 3⃣ liter plots
→ check → for normalization problems
→ catch (common errors) ← in analysis pipeline
→ confirm that (the variation) ← between replicates 🆚 treatments → is as expected
These graphical tools
→ allow users → to quickly explore DEG lists ← that come out of models
→ ensure ← which ones make sense