Paper - Review

DOI: 10.1186/s12859-019-2968-1

Abstract

Background

Reliable detection ← of differentially expressed genes (DEGs)
← in RNA-seq data
→ is not a trivial task
← despite the availability of (many ready-made testing software)

❓: Data analysis
→ has intricacies ← that require careful human attention

Use → modern data analysis techniques
← that incorporate visual feedback → to verify the appropriateness of their models

❗: Some RNA-seq packages
→ provide → static visualization tools
❓: Their meaningfulness
→ should be explicitly demonstrated → to users

Results

1⃣ Introduce → new interactive RNA-seq visualization tools
2⃣ Compile → a collection of examples
← showing why visualization should be (an integral component) ← of differential expression analysis

Use → public RNA-seq datasets
→ to show that → our new visualization tools can detect
→ 1⃣ normalization issues 2⃣ differential expression designation problems 3⃣ common analysis errors

❗: our new visualization tools
→ can identify → genes of interest
← in ways undetectable with models

∴ The "bigPint" software package
→ provides these tools
← many of which are unique additions → to what is currently available

Conclusions

Interactive graphics
→ should be an indispensable component of (modern RNA-seq analysis)
← which is currently not the case

1⃣ This paper 2⃣ its corresponding software
→ aim to persuade
→ 1⃣ users to slightly modify (their differential expression analyses) ← by incorporating statistical graphics → into their usual analysis pipeline 2⃣ developers → to create (additional complex & interactive plotting methods)

Background

RNA-seq
→ uses next-generation sequencing (NGS)
→ to estimate the quantity of RNA ← in biological samples

1⃣ decreasing cost 2⃣ increasing throughput
→ have rendered RNA-seq
← an attractive form of transcriptome profiling

Gene expression studies
→ were performed ← with microarray techniques
← which required (prior knowledge) ← of reference sequences

RNA-seq
→ has enabled → a new range of applications
← such as 1⃣ de novo transcriptome assembly 2⃣ detection of alternative splicing processes

RNA-seq
→ is revolutionizing our understanding
← of the intricacies of (eukaryotic transcriptomes)

One common format of RNA-seq data
→ is a matrix ← containing mapped read counts
← for 1⃣ n rows of genes 2⃣ p columns of samples

Researchers
→ conduct RNA-seq studies
→ to identify DEGs
← between treatment groups

This objective
→ is approached ← with models
← such as 1⃣ the negative binomial model 2⃣ linear regression models

❗: RNA-seq
→ was initially expected → to produce un-biased data
← that did not require sophisticated normalization
❓: RNA-seq data
→ is replete ← with biases
→ so that accurate detection of DEGs → is not a negligible task

∴ Factors ← that complicate RNA-seq data analysis
→ include 1⃣ nucleotide & read-position biases 2⃣ biases ← related to gene lengths & sequencing depths 3⃣ biases ← introduced during (library preparation) 4⃣ confounding combinations ← of (technical & biological variability)

Researchers
→ should analyze RNA-seq data
← like they would any other biased multi-variate data

Solely applying models
→ to such data
→ is problematic
∵ models hold assumptions
← that must be verified → to ensure statistical soundness

Data visualization
→ enables researchers
→ to see (patterns & problems)
← which they may not otherwise detect ← with traditional modeling

The most effective approach
→ to data analysis
→ is to iterate ← between (models & visuals)
→ and to enhance ← the appropriateness of (applied models)
← based on feedback from visuals

Primarily want
→ to compare the variability
← 1⃣ between replicates 2⃣ between treatment groups

∴ This is visually best achieved
← by drawing the mapped read count distribution
← across 1⃣ all genes 2⃣ all samples

Strive
→ to remedy this problem
← by highlighting the utility of (new & effective) differential expression plotting tools

Use real RNA-seq data
→ to show that their tools can detect
→ 1⃣ normalization problems 2⃣ DEG designation problems 3⃣ common errors

∴ Their tools
→ can identify genes of interest
← which cannot otherwise be obtained ← by models

∴ interactive graphics
→ should be → an indispensable component
← of modern RNA-seq analysis

Users
→ simply modify their approach → to differential expression analysis
← by (assessing the sensibility of their models) ← with multi-variate graphical tools
← e.g. 1⃣ parallel coordinate plots 2⃣ scatter plot matrices 3⃣ liter plots

Results

Parallel coordinate plots

Parallel coordinate plots
→ are essential
→ to inform → relationships
← between variables ← in multi-variate data

A parallel coordinate plot
→ draws each row (gene) ← as a line

Two samples
← with similar read counts
→ will have a flat connection

Two samples
← with dissimilar read counts
→ will have a sloped connection

Researchers
→ can quickly confirm this
← with a parallel coordinate plot
∴ There should be
→ flat connections ← between replicates
→ crossed connections ← between treatments
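
The flat-versus-sloped expectation can be checked numerically for a single gene. The sketch below is only an illustration: the sample names and the log-scale values are hypothetical, and it measures the step between adjacent axes rather than drawing anything.

```python
def segment_changes(values):
    """Change between each pair of adjacent axes (one value per line segment)."""
    return [b - a for a, b in zip(values, values[1:])]

# Hypothetical log-scale read counts for one gene, in plotting order.
samples = ["A.1", "A.2", "A.3", "B.1", "B.2", "B.3"]
gene = [5.1, 5.0, 5.2, 9.0, 9.1, 8.9]

changes = segment_changes(gene)
# Segments 0 and 1 join the A replicates, segment 2 crosses from A to B,
# and segments 3 and 4 join the B replicates.
within_replicates = [changes[0], changes[1], changes[3], changes[4]]
between_treatments = changes[2]

flattest = max(abs(c) for c in within_replicates)
print(f"largest replicate step: {flattest:.2f}")
print(f"treatment step: {abs(between_treatments):.2f}")
```

A DEG candidate should show small replicate steps and one large treatment step; lines cross in the plot when some genes step up and others step down.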

Two of the most common graphic techniques
→ are 1⃣ side-by-side boxplots 2⃣ multi-dimensional scaling (MDS) plots

❓: these plots
→ can hide problems
← that still exist ← in the data even after normalization
← that could be better detected ← with parallel coordinates plots

Figure
→ exemplifies → this problem
→ for two simulated datasets

Each dataset
→ contains → two treatment groups ← with three replicates

The side-by-side boxplots
→ both show → fairly consistent medians
← across (the six samples) ← in the left & right datasets

∴ The left MDS plot
→ separates → the treatment groups distinctively
∴ The right MDS plot
→ suggests a similar separation
← but in a much subtler manner

1⃣ The box plots 2⃣ MDS plots
→ provide useful information
The parallel coordinate plots
→ show an additional meaningful difference
← between the left and right datasets

❗: Some of the genes
→ have consistently low values → for treatment group A
→ have consistently high values → for treatment group B
❗: some genes have → the opposite phenomenon

∴ The majority of the plotted genes
→ may be DEG candidates

∴ Could NOT see
→ this important distinction ← as clearly
← using 1⃣ the side-by-side box plots 2⃣ the MDS plots
∵ they only provide data summarization
← at the sample resolution

∴ The parallel coordinates plots
→ show the sample connections
→ for each gene ← in the data

The example above
→ was simulated → for didactic purposes

Examine → the application of parallel coordinate plots
→ to real data ← from an RNA-seq study
← that compared soybean leaves

Filtered genes
← with 1⃣ low means 2⃣ low variance
Performed → a hierarchical clustering analysis
← with a cluster size of four
Retained ← only significant genes
Visualized → the results
← using parallel coordinate lines

Standardized → each gene
→ to have 1⃣ a mean of zero 2⃣ standard deviation of unity
Performed → hierarchical clustering
← on the standardized DEGs
← using Ward's linkage
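
A minimal sketch of these two steps, assuming a small hypothetical gene-by-sample table. It standardizes each gene and then runs a naive agglomerative loop that always merges the pair of clusters with the smallest Ward cost (the increase in within-cluster sum of squares). It is not the paper's code; in practice a library routine such as SciPy's `scipy.cluster.hierarchy.linkage(..., method="ward")` would be the usual choice.

```python
import statistics

def standardize(row):
    """Rescale one gene to mean 0 and (sample) standard deviation 1."""
    m, s = statistics.mean(row), statistics.stdev(row)
    return [(v - m) / s for v in row]

def ward_cost(a, b):
    """Increase in within-cluster sum of squares if clusters a and b merge."""
    na, nb = len(a), len(b)
    ca = [sum(col) / na for col in zip(*a)]
    cb = [sum(col) / nb for col in zip(*b)]
    d2 = sum((x - y) ** 2 for x, y in zip(ca, cb))
    return na * nb / (na + nb) * d2

def ward_clusters(rows, k):
    """Naive O(n^3) agglomeration down to k clusters; returns lists of row indices."""
    clusters = [[i] for i in range(len(rows))]
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                cost = ward_cost([rows[g] for g in clusters[i]],
                                 [rows[g] for g in clusters[j]])
                if best is None or cost < best[0]:
                    best = (cost, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

# Hypothetical counts: columns are A.1, A.2, A.3, B.1, B.2, B.3.
genes = [
    [1, 2, 1, 8, 9, 8],        # low in A, high in B
    [10, 12, 11, 40, 42, 41],  # same pattern, different scale
    [8, 9, 8, 1, 2, 1],        # high in A, low in B
    [50, 52, 51, 20, 21, 22],  # same pattern, different scale
]
standardized = [standardize(g) for g in genes]
clusters = ward_clusters(standardized, k=2)
print(clusters)
```

Standardizing first is what lets genes 0 and 1 (and genes 2 and 3) land in the same cluster despite very different count magnitudes.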

∴ This process
→ can divide (large DEG lists) → into (smaller clusters) of (similar patterns)
→ allows us → to more effectively detect → the various types of patterns

∴ 1⃣ the number 2⃣ quality of clusters
→ can vary ← depending on the data

The majority of significant genes
→ were ← in Clusters 1 & 2
← which for the most part captured → the expected patterns of (differential expression)

∴ These genes
→ mostly showed → clean differential expression profiles

Scatter plot matrices

A scatter plot matrix
→ is another effective multi-variate visualization tool
← that plots (read count distribution) ← across 1⃣ all genes 2⃣ all samples

A scatter plot matrix
→ represents each row (gene)
← as a point in each scatter plot

∴ Users
→ can quickly discover → unexpected patterns
→ recognize → geometric shapes
→ assess → the structure & association ← between (multiple variables) in a manner
← that is different ← from most common practices

Clean data
→ would be expected
→ to have larger variability
← between (treatment groups) ← than between replicates

Researchers
→ can quickly confirm this
← with a scatter plot matrix

∴ Most genes
→ should fall ← along the x=y line
A small proportion of them
→ are expected → to show differential expression
← between samples

A fraction of the genes
→ should have (lower variability)
← between replicates than between treatments

The spread of the scatter plot points
→ should fall more closely ← along the x=y relationship
← between replicates than between treatments
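
One way to put a number on "closer to the x=y line" is the mean perpendicular distance of the points from that line, which for a point (x, y) is |x - y| / sqrt(2). The sketch below assumes three hypothetical sample columns on a log scale; the function name is illustrative, not taken from the paper's software.

```python
import math

def mean_distance_from_identity(x, y):
    """Mean perpendicular distance of the points (x_i, y_i) from the line x = y."""
    return sum(abs(a - b) for a, b in zip(x, y)) / (len(x) * math.sqrt(2))

# Hypothetical log-scale counts for five genes in three samples.
a1 = [2.0, 5.1, 7.9, 3.0, 6.0]   # treatment A, replicate 1
a2 = [2.1, 5.0, 8.0, 3.1, 5.9]   # treatment A, replicate 2
b1 = [2.0, 5.2, 3.0, 8.1, 6.1]   # treatment B, replicate 1

replicate_spread = mean_distance_from_identity(a1, a2)
treatment_spread = mean_distance_from_identity(a1, b1)
print(f"replicate pair: {replicate_spread:.3f}, treatment pair: {treatment_spread:.3f}")
```

For clean data the replicate value should be the smaller of the two; the reverse would be a reason to look at the corresponding scatter plots.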

Created → a scatter plot matrix
→ for a public RNA-seq dataset
← that contains three replicates → for two developmental stages of (soybean cotyledon)

Users
→ can use the scatter plot matrix
→ to focus ← on subsets of genes

1⃣ outlier genes
← that deviate ← from the x=y line
← in replicate scatter plots
→ might be problematic
2⃣ outlier genes
← that deviate ← from the x=y line
← in treatment scatter plots
→ might be DEGs

∴ Users
→ view their patterns
← from multiple perspectives
← while obtaining their identifiers

Each gene
← in our data
→ is plotted once ← in each of the 15 scatter plots

More than one million points
→ must be plotted
∴ Rendering all points
→ would slow down → the interactive capability of the plot
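
The arithmetic behind these numbers can be sketched as follows. Six samples give 15 unique sample pairs, and "more than one million points" then implies a lower bound on the number of genes. The value of 70,000 genes used below is a hypothetical round number for illustration, not the dataset's reported gene count.

```python
import math

n_samples = 6                                # three replicates for each of two stages
n_plots = math.comb(n_samples, 2)            # unique sample pairs
min_genes = math.ceil(1_000_000 / n_plots)   # gene count implied by one million points

hypothetical_genes = 70_000                  # assumption, for illustration only
total_points = hypothetical_genes * n_plots

print(n_plots, min_genes, total_points)
```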

Can tailor → the geometric object of (the scatter plots)
→ to be hexagon bins ← rather than points
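
Hexagon binning replaces individual points with counts per hexagonal cell, so the number of drawn objects depends on the number of cells rather than the number of genes. The sketch below is a generic nearest-centre hexagonal binning under stated assumptions (regular hexagons whose centres are `width` apart); it is not claimed to be the binning routine used by the paper's software.

```python
import math
from collections import Counter

def hexbin_counts(points, width=1.0):
    """Count points per hexagonal cell; keys are the cell-centre coordinates.

    Hexagon centres form two offset rectangular lattices. Each point is
    assigned to whichever of its two nearest candidate centres is closer.
    """
    sx = width
    sy = width * math.sqrt(3.0)
    counts = Counter()
    for x, y in points:
        u, v = x / sx, y / sy
        # Candidate 1: nearest centre on the integer lattice.
        i1, j1 = math.floor(u + 0.5), math.floor(v + 0.5)
        # Candidate 2: nearest centre on the half-integer lattice.
        i2, j2 = math.floor(u), math.floor(v)
        # Squared distances in units of sx**2 (the factor 3 is (sy / sx)**2).
        d1 = (u - i1) ** 2 + 3.0 * (v - j1) ** 2
        d2 = (u - i2 - 0.5) ** 2 + 3.0 * (v - j2 - 0.5) ** 2
        if d1 <= d2:
            centre = (i1 * sx, j1 * sy)
        else:
            centre = ((i2 + 0.5) * sx, (j2 + 0.5) * sy)
        counts[centre] += 1
    return counts

# Hypothetical (sample 1, sample 2) log counts for four genes.
points = [(0.1, 0.1), (-0.1, 0.05), (0.0, 0.0), (0.5, 0.87)]
bins = hexbin_counts(points)
print(len(points), "points ->", len(bins), "hexagons")
```

Plotting libraries offer the same idea ready-made, for example Matplotlib's `Axes.hexbin`.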

The genes
→ are also linked
→ to a second plot ← that super-imposes them ← as parallel coordinate lines
← on a side-by-side box plot ← of all counts in the dataset

Assessing normalization with scatter plot matrices

The scatter plot matrix
→ can be used → to (1⃣ understand 2⃣ assess) various normalization algorithms

Use
→ a publicly-available RNA-seq dataset
← on yeast grown in YP-Glucose (YPD)
→ to exemplify this point

The data
→ contained (four cultures) ← from independent libraries
← that were sequenced
← using 1⃣ two library preparation protocols 2⃣ one & two lanes ← in a total of three flow-cells

∴ Researchers
→ examine 1⃣ various levels 2⃣ combinations
← of technical effects

The authors
→ could establish → a false positive rate
← in relation to the number of DEGs

Within-lane regression alone
→ was insufficient
← in effectively removing biases

Expect → most genes
→ to show similar expression between samples
← except for the handful ← that are differentially expressed

It is clear
→ that the data still was NOT sufficiently normalized
∵ the distribution of genes → is NOT centered ← around the x=y line

The scatter plot matrix
→ follows → the expected structure
← with most genes falling ← along the x=y line

The read counts
→ fall closer → to the x=y line
← between the Y4 replicates ← than between the Y1 replicates

∵ The Y1 replicates
→ had additional technical variability
← as they used two different flow cells

The scatter plot matrix
→ can also be used
→ to quickly inspect patterns of 1⃣ biological 2⃣ technical variability

Checking for common errors with scatter plot matrices

Irreproducibility
→ is prevalent
← in high-throughput biological studies

∵ A study ← in Nature Genetics
→ surveyed 18 published micro-array expression analyses
→ reported → that only two were exactly reproducible

The extent of the problem
→ has spawned a field → "forensic bioinformatics"
Researchers
→ attempt → to reverse-engineer
→ reported results back → into the raw datasets
← simply to derive the methodologies ← used in published studies

Irreproducibility
→ is merely cumbersome
← when it masks methods

Forensic bioinformaticians
← who have actively investigated common errors
← in high-throughput biological studies
→ have concluded ← that
→ the largeness of the data itself → may hinder our ability → to detect errors

∴ Simple errors
→ can be difficult
→ to detect ← using common practice ← in high-throughput studies

Scatter plot matrices
→ are a convenient tool
→ to check for common errors ← like sample mis-labeling

A subset of these (thick & thin) scatter plots
→ appear outside of their expected locations
← given the expected variability ← between (treatment 🆚 replicates)

Rearranging
← the columns of the two samples
→ would indeed lead back → to the clean-looking scatterplot matrix
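
The visual check has a simple numerical counterpart: a sample that correlates better with the other treatment group than with its own is a mis-labeling suspect. The sketch below uses hypothetical sample names, labels, and values, and a hand-written Pearson correlation. As the notes stress for the plot itself, a flag is a prompt to investigate, not proof.

```python
import math

def pearson(x, y):
    """Pearson correlation between two equally long lists."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def suspicious_samples(data, labels):
    """Samples whose mean correlation is higher with the other group than their own."""
    flagged = []
    for s in data:
        own = [pearson(data[s], data[t]) for t in data
               if t != s and labels[t] == labels[s]]
        other = [pearson(data[s], data[t]) for t in data
                 if labels[t] != labels[s]]
        if sum(other) / len(other) > sum(own) / len(own):
            flagged.append(s)
    return flagged

# Hypothetical log counts for five genes; S3 and S4 carry each other's label.
data = {
    "S1": [1.0, 2.1, 3.0, 10.2, 12.1],
    "S2": [1.2, 1.9, 3.1, 9.8, 11.9],
    "S3": [10.1, 12.2, 2.9, 1.1, 2.0],
    "S4": [0.9, 2.0, 3.2, 10.0, 12.3],
    "S5": [9.9, 11.8, 3.0, 1.0, 2.1],
    "S6": [10.2, 12.1, 3.1, 0.9, 1.9],
}
labels = {"S1": "A", "S2": "A", "S3": "A", "S4": "B", "S5": "B", "S6": "B"}

flagged = suspicious_samples(data, labels)

# Swapping the two suspect labels should leave nothing flagged.
relabeled = dict(labels)
relabeled["S3"], relabeled["S4"] = labels["S4"], labels["S3"]
flagged_after_swap = suspicious_samples(data, relabeled)
print(flagged, flagged_after_swap)
```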

The scatter plot matrix
→ provides us → convincing evidence ← of a mis-labeling problem
← which cannot be confirmed ← with such detail
← using traditional plots ← e.g. 1⃣ box plots 2⃣ MDS plots

This method
→ can inform suspicious patterns
← in more detail ← than other means

The user
→ still needs → to substantiate
← this suspicion ← with decisive evidence
→ should only use → the visualization as a guide

Finding unexpected patterns in scatter plot matrices

Most popular RNA-seq plotting tools
→ display summaries ← about the read counts
← e.g. 1⃣ fold change summaries 2⃣ principal components summaries 3⃣ five number summaries 4⃣ dispersion summaries

Scatter plot matrices
→ display → the non-summarized read counts
→ for all genes

∴ This trait
→ allows for 1⃣ geometric shapes 2⃣ patterns
← which are relevant to the read count distribution
→ to be readily visible ← in the scatter plot matrix

❗: how geometric shapes
← in the scatter plot matrix
→ can provide applicable information
← which uses the iron-metabolism soy-bean dataset

∴ See → the expected pattern of a scatter plot matrix
← with more variation ← around the x=y line
← between treatments ← than between replicates

Identified → the five transcripts
← that deviated the most from the expected pattern
∴ Searched → for their putative functions

These transcripts
→ are reportedly involved ← in 1⃣ biotic 2⃣ abiotic stress responses
← and in the production of super-oxides → to combat microbial infections

A lab biologist
→ documented → a clean data collection process

The same researcher
→ collected the samples ← in succession
→ to reduce variability ← caused by (plant handling) ← by different researchers

❗: a vast change
← in gene expression response
← between these three time points

The streak of genes shown
← may be
← due to (the timing differences) ← between replicate handling

Scientists
→ cannot observe → such interesting structures
← from any models

These structures
→ could lead → to interesting post hoc analyses

The authors
→ had noted → 1⃣ an inadvertent experimental 2⃣ biological discrepancy
← between those replicates
A post hoc hypothesis
← that these genes ← might respond → to that discrepant condition
→ could be generated

Assessing DEG calls in scatter plot matrices

The scatter plot matrix
→ can also be used
→ to quickly examine the DEGs ← returned from a given model

DEGs
→ to fall along → the x=y line → for scatter plots ← between replicates
→ to deviate ← from the x=y line → for scatter plots ← between treatment groups

Link
→ these DEGs ← as parallel coordinate lines
← on a side-by-side box plot
→ to confirm the expected pattern of (differential expression)
← from a second view point

Liter plots

❗: how to view DEGs
→ onto the Cartesian coordinates of (the scatter plot matrix)
❓: this figure
→ becomes limited ← when we investigate treatment groups
← that contain a large number of replicates
∵ we have (too many small scatter plots)
→ for it to remain → an effective visualization tool

Researchers
→ could benefit ← from additional plotting tools
→ to quickly verify → individual DEGs returned from a model

The "replicate line plot"
→ was developed
← by researchers ← who demonstrated that it could detect model scaling problems ← in micro-array data

This plot
→ is only applicable
← on datasets where treatment groups ← contain exactly two replicates

❗: an extension of the "replicate line plot"
← which can be applied → to datasets
← with (two & more) replicates

Each gene
→ is plotted once
→ for each possible combination of replicates
← between treatment groups

∴ Each gene ← in this dataset
→ is plotted
← as nine points in the liter plot
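
The nine points follow from pairing every replicate of one treatment with every replicate of the other (3 x 3). A minimal sketch, assuming hypothetical sample names and counts for one gene:

```python
from itertools import product

def liter_points(gene_counts, group_a, group_b):
    """One (x, y) point per between-group replicate pair for a single gene."""
    return [(gene_counts[a], gene_counts[b]) for a, b in product(group_a, group_b)]

# Hypothetical log counts for one gene in two groups of three replicates.
gene_counts = {"A.1": 2.0, "A.2": 2.2, "A.3": 1.9, "B.1": 7.8, "B.2": 8.1, "B.3": 8.0}
group_a = ["A.1", "A.2", "A.3"]
group_b = ["B.1", "B.2", "B.3"]

points = liter_points(gene_counts, group_a, group_b)
print(len(points), "points for this gene")
```

With r replicates per group the count is r x r per gene, which is why the background of all genes is better drawn as binned hexagons than as individual points.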

This would reduce
→ the speed of interactive functionality
← as well as cause over-plotting problems

Once the background of hexagons
→ has been drawn
→ to give us → a sense of the distribution of all
← between-treatment sample pair combinations

we
→ can (examine & compare) → liter plots
← using the clusters

A significant gene
← from cluster 1 plotted
← as nine green points

The nine overlaid points
→ are super-imposed ← in a manner we would expect from a DEG
→ they are located far ← from the x=y line

The genes
→ are now over-expressed
← in the other treatment

There seems
→ to be a pattern ← in which one replicate ← from the P group
→ is larger ← than the other two replicates

The nine overlaid points
→ are NOT clearly super-imposed
← in the distinct pattern ← we expect of significant genes

The difference
← between the treatment groups
→ is so small ← that the overlaid points cluster ← around the x=y line

The gene
→ shows → 1⃣ inconsistent replications 2⃣ consistent treatment groups
The spread-out overlaid points
→ center ← on the x=y line

❓: the liter plots
→ call into question
→ whether the genes ← from this cluster
→ show an expected profile of (differential expression)

∴ This is similar
→ to the messy-looking parallel coordinate plots

Liter plots
→ can detect (odd & questionable) patterns
← in individual "significant genes"
← that cannot be detected numerically through models

Users
→ are provided → several input fields
← that tailor the plot functionality

Readers
→ can verify
→ 1⃣ the parallel coordinate plots 2⃣ liter plots 3⃣ scatter plot matrices
→ tell a similar story ← about the DEG patterns in these four clusters

Closing case study

Calculate DEG calls
→ for this data
← using the normalization method of (library size scaling)

The number of total reads
← in each sample
→ is normalized → to a common value
← across all samples
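
Library size scaling can be sketched in a few lines: every count is multiplied by a per-sample factor so that all column totals match a common target. The sample names and counts below are hypothetical, and using the mean library size as the target is an assumption of this sketch.

```python
def library_size_scale(counts, target=None):
    """Rescale each sample (column) so that its total equals a common target."""
    totals = {s: sum(col) for s, col in counts.items()}
    if target is None:
        target = sum(totals.values()) / len(totals)   # mean library size
    return {s: [v * target / totals[s] for v in col] for s, col in counts.items()}

# Hypothetical counts for three genes in two samples with different depths.
counts = {"kidney.1": [10, 20, 70], "liver.1": [30, 60, 210]}
scaled = library_size_scale(counts)
print(scaled)
```

Here the two samples differ only in sequencing depth, so scaling makes them identical. The method has no way to tell depth apart from a change in composition, which is the weakness the case study goes on to expose.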

❓: Could finish → our analysis ← at this point
Could draw conclusions ← based on this list of DEGs
← that came from the model
❗: it would be wise
→ to also visualize (this dataset)

A scatter plot matrix
→ confirms the expected pattern
← with treatment scatter plots showing larger variation
← than technical replicate scatter plots

It → uncovers → a hidden pattern
← in the treatment plots
∴ there is a pronounced (streak of genes)
← with higher expressions ← in the liver group

❗: view the DEGs
→ from the model
← using parallel coordinate plots

May need
→ to re-consider → our normalization technique
→ taking both of (these observations) → into account

❓: library size scaling method
→ is not adequate ← in all cases
← especially when (the underlying distribution of reads) ← between samples is inconsistent

The observed streak of (outlier genes)
← that are highly expressed ← in the liver samples
→ reduces the sequencing quota available → to the remaining genes in these samples

Trimmed mean of M values (TMM) normalization
→ was proposed → for such cases
This technique
→ generates sample scaling factors
← that consider sample distributions
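
The idea behind a trimmed-mean scaling factor can be illustrated with a deliberately simplified sketch. It computes per-gene log-ratios (M values) of library-size-scaled counts against a reference sample, trims both tails, and averages what is left. This is an assumption-laden simplification: the published TMM method also trims on average expression and uses precision weights, so this function is not a substitute for an established implementation. All data are hypothetical.

```python
import math

def simple_tmm_factor(sample, reference, trim=0.3):
    """Unweighted trimmed mean of M values, returned on the original (non-log) scale."""
    n_s, n_r = sum(sample), sum(reference)
    m_values = sorted(
        math.log2((s / n_s) / (r / n_r))
        for s, r in zip(sample, reference)
        if s > 0 and r > 0
    )
    k = int(len(m_values) * trim)            # genes trimmed from each tail
    kept = m_values[k:len(m_values) - k]
    return 2 ** (sum(kept) / len(kept))

# Hypothetical data: eight genes are unchanged, two form a highly expressed streak.
reference = [100] * 10
sample = [100] * 8 + [1100, 1100]

factor = simple_tmm_factor(sample, reference)
effective_size = sum(sample) * factor        # library size after applying the factor
print(f"factor = {factor:.4f}, effective library size = {effective_size:.1f}")
```

Plain library size scaling would divide this sample by its raw total of 3000 and make the eight unchanged genes look three-fold lower than in the reference. The trimmed factor ignores the streak, so the effective library size matches the reference total of 1000 and the unchanged genes stay unchanged.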

1⃣ re-start the analysis 2⃣ apply TMM normalization
→ to this data

❓: The scatter plot matrix
→ did NOT appear differently
❗: should visualize → the new DEG calls

Plotting
→ these DEGs ← as parallel coordinate lines
→ paints a much cleaner picture
← than what we saw earlier

❗: TMM normalization
→ kept → the original 1968 liver-specific DEGs
← from library size scaling
❓: TMM normalization
→ added 1578 more → for a total of 3546 liver-specific DEGs

The liver-specific DEGs
→ may be slightly less clean-looking
← with TMM normalization

The 3974 kidney-specific DEGs
← from TMM normalization
→ are a proper subset of the 7050 kidney-specific DEGs
← from library scale normalization

The 1968 liver-specific DEGs
← from library scale normalization
→ are a proper subset of the 3546 liver-specific DEGs
← from TMM normalization
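
The subset relationships and counts above can be checked with set arithmetic. The sketch below only reproduces the sizes stated in the notes, using placeholder integer IDs in place of real gene identifiers.

```python
# Placeholder gene IDs; only the set sizes come from the notes.
lib_liver = set(range(1968))                       # liver DEGs, library size scaling
tmm_liver = set(range(3546))                       # liver DEGs, TMM
tmm_kidney = set(range(10_000, 10_000 + 3974))     # kidney DEGs, TMM
lib_kidney = set(range(10_000, 10_000 + 7050))     # kidney DEGs, library size scaling

added_liver = tmm_liver - lib_liver        # liver DEGs gained under TMM
dropped_kidney = lib_kidney - tmm_kidney   # kidney DEGs lost under TMM

print(lib_liver < tmm_liver, len(added_liver))
print(tmm_kidney < lib_kidney, len(dropped_kidney))
```

The `<` operator on Python sets tests for a proper subset. The two differences (1578 liver DEGs gained, 3076 kidney DEGs lost) plus the two shared sets presumably correspond to the four gene subsets explored next.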

∴ Perform → a deeper investigation of the effects of normalization
← on this data
∴ Explore four subsets of genes
← in the form of 1⃣ parallel coordinate plots 2⃣ scatter plot matrices 3⃣ liter plots

Demonstrate → the use of (data standardization)
→ for 1⃣ scatter plot matrices 2⃣ liter plots

Begin ← by plotting the four gene subsets
← in the form of parallel coordinate plots
← after application of hierarchical clustering analysis

Each subset
→ is grouped into eight clusters
→ to reduce any over-plotting ← that would occur should they all be viewed together ← as one large cluster

Continue our visualization study
← by overlaying genes ← from the largest cluster of the four gene subsets
← in the form of standardized scatter plot matrices

Standardization
→ causes → the whole dataset
→ to appear ← as oval-shapes that are almost identical ← across all scatter plots

Lose
→ geometric structures ← that can elicit meaningful information
∵ standardize → scatter plot matrices

Standardization
→ amplifies → meaningful patterns
← in the overlaid DEGs

The overlaid DEG patterns
→ are more spread out
← in the standardized version

DEGs
← in both forms of normalization
→ have the expected differential expression profiles
← in the standardized scatter plot matrices

Overlaying example genes
→ from the largest cluster of the four gene subsets

Standardization
→ causes the dataset
→ to appear ← as an oval-shape
→ to remove the original geometric structure ← in the hexagonal binning

∴ The overlaid DEG patterns
→ are more spread out
← in the standardized version ← in the current case study below
∴ Better interpretation

The example genes
← that were called DEGs ← in both forms of normalization
→ have the expected profiles ← in the liter plots

The overlaid points
→ deviate more ← from the x=y line ← in a tight cluster

This dataset
→ requires more than just library size scaling
→ for reliable analysis
∵ the in-depth analyses ← in this case study

Underscore the overarching theme
→ iteration ← between 1⃣ models 2⃣ visualization
→ is crucial → to achieve (the most convincing results & conclusions)
← in RNA-seq studies

Plot scalability

All visualization plots
→ have limitations
← based on the number of samples ← in the data

Plots
← that appear messy
← regardless of sample numbers
→ may indicate → the presence of data quality problems

1⃣ MDS plots 2⃣ box plots 3⃣ parallel coordinate plots
→ can remain effective
← with fairly large sample numbers

❗: Parallel coordinate plots
→ should be sorted
→ to help place similar variables ← near each other

Scatter plot matrices
→ lose their efficiency
← at smaller sample numbers
∵ restricted space

One remedy
→ is → to subset their data
→ to plot several smaller scatter plot matrices

One scatter plot matrix
→ would have required ← a prohibitive 576 scatter plots
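
The figure of 576 is consistent with a full 24 x 24 grid of panels, assuming the 24 samples are the two groups of 12 replicates that these notes mention for the full data. The sketch below shows that count next to the number of unique sample pairs and the number of liter plot points per gene; which panels the authors actually counted is an assumption here.

```python
import math

replicates_per_group = 12
n_samples = 2 * replicates_per_group          # assumed: two groups of 12

full_grid_panels = n_samples ** 2             # every cell of the matrix
unique_pairs = math.comb(n_samples, 2)        # off-diagonal panels, counted once
liter_points_per_gene = replicates_per_group ** 2   # one liter plot, all between-group pairs

print(full_grid_panels, unique_pairs, liter_points_per_gene)
```

Either way, the number of panels grows quadratically with the number of samples, whereas a liter plot stays a single panel and only the number of points per gene grows.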

Liter plots
→ are another remedy → for large data sets
→ can often accommodate more samples

Applied → liter plots
→ to our full data ← that contained 2 groups of 12 replicates

❗: The liter plot
→ is more suitable
→ for large datasets ← than the replicate line plot

Discussion

Effective visualization
→ should be → a crucial part of two-group differential expression analysis

1⃣ scatter plot matrices 2⃣ parallel coordinate plots 3⃣ liter plots
→ check → for normalization problems
→ catch (common errors) ← in analysis pipelines
→ confirm that (the variation) ← between replicates 🆚 treatments → is as expected

These graphical tools
→ allow users → to quickly explore DEG lists ← that come out of models
→ ensure ← which ones make sense