Paper - Review
DOI: 10.1038/nrg2641
Abstract
ChIP-seq
→ Chromatin Immunoprecipitation followed by sequencing
→ genome-wide profiling of 1⃣ DNA-binding proteins 2⃣ Histone modifications 3⃣ nucleosomes
→ Advantages: 1⃣ Higher resolution 2⃣ Less noise 3⃣ Greater coverage
∴ For studying 1⃣ Gene regulation 2⃣ Epigenetic mechanisms
Introduction
Genome-wide mapping
← of 1⃣ Protein-DNA interactions 2⃣ Epigenetic marks
→ is essential for (transcriptional regulation)
Precise map
← of binding sites for transcription factors
→ is vital → for deciphering (the gene regulatory networks)
Nucleosome positioning
→ has a key role in gene regulation
→ guides (development & differentiation)
Chromatin states
→ can influence transcription directly
∴ (Enhance & Impede) recruitment of effector protein complexes
ChIP
← Chromatin immunoprecipitation
→ for assaying protein-DNA binding ← in vivo
→ Antibodies are used → to select specific (proteins & nucleosomes)
∴ Enabling a genome-scale view (← of DNA-protein interactions)
NGS
← Next-Generation Sequencing
→ Sequences (millions of DNA fragments) in a single run
ChIP-Seq
← Chromatin immunoprecipitation followed by sequencing
→ one of the early applications (← of NGS)
→ has (higher resolution) & (fewer artifacts) & (greater coverage) & (a larger dynamic range) ← than ChIP-chip
∴ Provides substantially improved data
ChIP-seq basics
DNA fragments
← which are associated with a specific protein
→ are enriched
1⃣ Chromatin is sheared by sonication into small fragments (← 200-600 bp range)
2⃣ Antibody is used to immunoprecipitate (the DNA-protein complex)
3⃣ Determine the sequence bound by the protein
MNase digestion
← Micrococcal nuclease
→ most often used to fragment the chromatin
∵ To map (nucleosome positions) & (histone modifications)
MNase treatment is generally preferred
∵ MNase removes linker DNA ← more efficiently than sonication
∴ MNase allows more precise mapping (← of each nucleosome)
MNase digestion
→ has a more pronounced sequence bias than sonication
Advantages and disadvantages of ChIP-seq
Advantage of ChIP-seq
1⃣ its base-pair resolution → is the greatest improvement
2⃣ ChIP-seq does NOT ❌ suffer from the noise (← which is generated by the hybridization step in ChIP-chip)
3⃣ ChIP-seq has a larger dynamic range ∵ the intensity signal measured on arrays might not be linear over its entire range
4⃣ In ChIP-seq, the genome coverage is NOT ❌ limited ← by the repertoire of (probe sequences) (← which are fixed on the array)
❗ Disadvantages
1⃣ All profiling technologies produce unwanted artifacts
→ This problem can be ameliorated (← by improvements in alignment algorithms)
2⃣ Bias → towards GC-rich content in fragment selection
3⃣ The loss of (sensitivity & specificity) in detection (← of enriched regions)
∵ Insufficient number of reads
4⃣ Its current cost & availability ❗
Overall cost (← of ChIP-seq) → will have to be lowered
→ to be comparable with the cost of ChIP-chip in every case
Issues in experimental design
Antibody quality
ChIP-seq
← depends on (the quality of the antibody used)
A (sensitive and specific) antibody
→ will give a high level (← of enrichment) → to detect (binding events)
Rigorous validation → is a laborious process
Cross-reactivity (← with similar histone modifications)
→ should be checked → by using two independent antibodies
Sample quantity
One advantage of ChIP-seq (← over ChIP-chip)
→ the smaller amount (← of sample material needed)
Several ChIP protocols have been developed
1⃣ 1e4 - 1e5 cells → for genome-wide profiling
2⃣ 1e2 - 1e3 cells → for PCR quantification
❗ they require (abundant transcription factors) & (histone modifications)
Fewer rounds (← of amplification) are required → for ChIP-seq
∴ Potential for artifacts (← due to PCR) is lower
Precise amount (← of ChIP DNA)
← depends on (the abundance of chromatin-associated protein targets)
Control experiment
ChIP
← involves (several potential sources of artifacts)
Shearing ← of DNA
→ does NOT ❌ result → in (uniform fragmentation) (← of the genome)
(Repetitive sequences) → might seem to be enriched
∵ Inaccuracies (← in the number) (← of copies of the repeats)
∴ ChIP-seq profile → should be compared (← with the same region in a matched control)
3 commonly used types (← of control sample)
1⃣ Input DNA
2⃣ Mock IP DNA
3⃣ DNA from non-specific IP
❗ There is NO ❌ consensus on which is the most appropriate
Comparison with a control → corrects for bias (← from shearing of DNA & amplification)
Large amount (← of sequencing)
→ is necessary for ChIP-seq control experiment
Sufficient numbers of tags
→ are needed at each point
→ to obtain (accurate estimates) → throughout genome
Depth of sequencing
ChIP-chip vs. ChIP-seq
→ Number of (tiling arrays) vs. number of (sequencing runs)
1⃣ ChIP-chip: number of arrays is fixed ← regardless of (protein & modification)
2⃣ ChIP-seq: number of sequencing runs is determined ← by investigator
Many early data sets
← contained reads from a single lane
← regardless of (what the specific experiment was)
Reasonable criterion
← for determining sufficient sequencing depth
→ would be that the results (← of a given analysis) → do NOT ❌ change → when more reads are obtained
∴ saturation point
Binding sites were determined
→ for each sample with a threshold probability
More sites continued
→ to be found at a steady pace (← with additional sequencing)
Saturation point might NOT ❌
← be used to determine (the number of tags)
→ to be sequenced
Saturation point does exist ∃
← if a fixed threshold is imposed ← on (the fold enrichment)
❗ With only a threshold probability, the number of (significant peaks)
→ may continue to rise
← with more sequencing
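The saturation criterion above can be sketched in code. This is a minimal illustration, not the method used in the paper: the bin size, the tag-count cutoff, the 5% tolerance, and all function names are assumptions made for the example. Reads are represented only by their mapped start positions on one chromosome.

```python
import random
from collections import Counter

def count_enriched_bins(positions, bin_size, min_tags):
    """Count fixed-size bins that contain at least min_tags reads."""
    bins = Counter(p // bin_size for p in positions)
    return sum(1 for c in bins.values() if c >= min_tags)

def saturation_curve(positions, fractions, bin_size=200, min_tags=5, seed=0):
    """Subsample the reads at each fraction and count enriched bins.

    Returns a list of (number_of_reads, number_of_enriched_bins).
    """
    rng = random.Random(seed)
    shuffled = list(positions)
    rng.shuffle(shuffled)  # random order, so each prefix is a random subsample
    curve = []
    for f in fractions:
        n = int(len(shuffled) * f)
        curve.append((n, count_enriched_bins(shuffled[:n], bin_size, min_tags)))
    return curve

def has_saturated(curve, tolerance=0.05):
    """True if the last increase in depth changed the result by <= tolerance."""
    (_, prev), (_, last) = curve[-2], curve[-1]
    if last == 0:
        return False  # nothing was called, so saturation cannot be assessed
    return (last - prev) / last <= tolerance
```

If `has_saturated` stays false as the fractions approach 1.0, the analysis is still changing with depth, which is the situation the notes describe for peaks called with a threshold probability.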
Ability → to sequence (multiple samples) ← at the same time
→ becomes important for (cost effectiveness)
A few bases are sufficient
→ to serve as (unique identifiers) → for many samples
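A short arithmetic sketch of why a few bases are enough: with 4 possible bases per position, a barcode of length k can label up to 4^k samples. The function names below are illustrative, and the calculation ignores any extra spacing between barcodes that a real design would want for error tolerance.

```python
def count_barcodes(length):
    """Number of distinct DNA barcodes of the given length (A, C, G, T)."""
    return 4 ** length

def min_barcode_length(n_samples):
    """Smallest barcode length whose 4**length covers n_samples."""
    if n_samples < 1:
        raise ValueError("n_samples must be at least 1")
    length = 0
    while count_barcodes(length) < n_samples:
        length += 1
    return length
```

For example, `min_barcode_length(96)` is 4, because 4^3 = 64 is too few and 4^4 = 256 is enough.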
Additional considerations
ChIP fragments can be sequenced ← at both ends
← as done for detection of (structural variations) in the genome
Paired-end sequencing → can be used in conjunction (← with ChIP)
1⃣ to provide (additional specificity)
2⃣ to map long-range (chromatin interactions)
ChIP experiments → should be replicated
→ to ensure reproducibility of the data
Challenges in data analysis
Data management
Data can be stored → at 3 levels
1⃣ Image data
2⃣ Sequence data
3⃣ Alignment data
Sequence tags can be used → to re-map the data
1⃣ when an improved aligner is available
2⃣ when a reference genome is updated
NO ❌ consensus
← with regard to which data types should be stored
→ keep only the sequence-level data
∵ Too expensive to store (all data)
Meta-data → should describe (the details of experiment)
∵ To ensure → the archive is useful to the community
Genome alignment
Image processing & Base calling
→ are platform specific
→ are mostly done ← using the software provided
👍 All (subsequent results)
→ are based on the (aligned reads)
❗ Conventional alignment algorithms → can take many processor hours
∴ A new generation (← of aligners) → has been developed
ChIP-seq
→ should allow → for (a small number of mismatches)
∵ Sequencing errors & SNPs & Indels
❓ Many current analysis pipelines
→ discard non-unique tags
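The two ideas above (tolerating a small number of mismatches, and discarding non-unique tags) can be illustrated with a deliberately naive sketch. Real aligners use indexed data structures and also handle indels; this example only scans for substitutions, and the function names and the default of 2 mismatches are assumptions made for illustration.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))

def align_read(read, reference, max_mismatches=2):
    """Return every start position where the read matches the reference
    with at most max_mismatches substitutions (no indels)."""
    hits = []
    for pos in range(len(reference) - len(read) + 1):
        window = reference[pos:pos + len(read)]
        if hamming(read, window) <= max_mismatches:
            hits.append(pos)
    return hits

def is_unique(hits):
    """A tag is kept by a unique-only pipeline if it maps to exactly one place."""
    return len(hits) == 1
```

A read that returns more than one hit corresponds to a non-unique tag, which many pipelines would drop.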
Identification of enriched regions
To identify regions
← which are enriched in the ChIP sample ← relative to the control
Peak callers
← several are currently available
→ scan along the genome
→ to identify (the enriched regions)
A simple fold ratio (← of the signal for ChIP sample ← relative to the control) ← around the peak
→ provides (important information)
→ but is NOT ❌ adequate
Effective approach
→ A Poisson model (← for the tag distribution)
→ accounts for 1⃣ The fold ratio 2⃣ The absolute tag numbers
Poisson model
→ can be modified → to account for regional bias (← in tag density)
∵ Chromatin structure & Copy number variation & Amplification bias
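A minimal sketch of why a fold ratio alone is not adequate and what a Poisson model adds. The tag counts used in the usage note are illustrative, not data from the paper, and using the scaled control count as the Poisson mean is one simple assumption; peak callers that model regional bias would estimate the mean locally.

```python
import math

def fold_ratio(chip_tags, control_tags, scale=1.0):
    """ChIP tag count divided by the (depth-scaled) control tag count."""
    return chip_tags / (control_tags * scale)

def log_poisson_pmf(i, lam):
    """Natural log of P(X = i) for X ~ Poisson(lam), with lam > 0."""
    return -lam + i * math.log(lam) - math.lgamma(i + 1)

def poisson_upper_tail(k, lam):
    """P(X >= k) for X ~ Poisson(lam): chance of seeing k or more tags
    when lam tags are expected from the control."""
    if k <= 0:
        return 1.0
    if k <= lam:
        # At or below the mean: take the complement of the lower tail.
        lower = sum(math.exp(log_poisson_pmf(i, lam)) for i in range(k))
        return max(0.0, 1.0 - lower)
    total = 0.0
    i = k
    while True:
        term = math.exp(log_poisson_pmf(i, lam))
        total += term
        # Above the mean the terms shrink monotonically; stop once negligible.
        if term <= total * 1e-17:
            break
        i += 1
    return min(total, 1.0)
```

With these helpers, 50 ChIP tags against 10 control tags and 500 against 100 both give a fold ratio of 5, but `poisson_upper_tail(500, 100.0)` is far smaller than `poisson_upper_tail(50, 10.0)`, so the absolute tag numbers change the significance.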
Peaks can be scored
← before a combined profile is generated
1⃣ how well (the tag distributions) on the two strands ← resemble each other
2⃣ whether the distance ← between (the peaks on the two strands) is close to the expected value
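The two scoring criteria can be sketched as follows. This is an illustration under stated assumptions, not a published peak caller: the strand profiles are per-position tag counts over the same window, similarity is measured with a Pearson correlation after shifting the reverse strand, and the expected distance and tolerance are hypothetical parameters supplied by the user.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is flat)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)

def best_strand_shift(fwd, rev, max_shift):
    """Find the shift (in positions) at which the reverse-strand profile
    best resembles the forward-strand profile.

    Assumes len(fwd) == len(rev) and max_shift < len(fwd) - 1.
    Returns (shift, correlation).
    """
    best_shift, best_r = 0, -1.0
    for d in range(max_shift + 1):
        r = pearson(fwd[:len(fwd) - d], rev[d:])
        if r > best_r:
            best_shift, best_r = d, r
    return best_shift, best_r

def passes_strand_check(fwd, rev, expected_shift, tolerance, min_r, max_shift):
    """Criterion 1: the strand profiles resemble each other (r >= min_r).
    Criterion 2: their separation is close to the expected distance."""
    shift, r = best_strand_shift(fwd, rev, max_shift)
    return r >= min_r and abs(shift - expected_shift) <= tolerance
```

A candidate peak whose two strand profiles have similar shapes but sit at an implausible distance from each other would fail the second criterion.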
Another important (local correction)
← regardless of (the peak detection method)
→ is to adjust → for sequence align-ability
Major difficulty
← in identifying (enriched regions)
→ there are 3 types of enriched regions
1⃣ Sharp
2⃣ Broad
3⃣ Mixed
Sharp peaks
→ are generally found → for (protein-DNA binding) & (histone modifications) ← at regulatory regions
Many (peak detection techniques)
← which are used in ChIP-chip & DNA copy number analysis
→ will be modified for ChIP-seq
(A careful comparison) (← of the algorithms)
→ is still being carried out
It is clear → that (the best methods) should at least
1⃣ Take advantage (← of the strand-specific pattern) ← expected at (a binding location)
2⃣ Adjust for (local variation) ← as measured by input DNA
3⃣ Correct → for sequence align-ability (← to a lesser extent)
Downstream analysis
Approaches
→ to analyse (the biological implications) (← of ChIP-seq data)