Paper - Review

DOI: 10.1038/nrg2641

Abstract

ChIP-seq
→ Chromatin Immunoprecipitation followed by sequencing
→ genome-wide profiling of 1⃣ DNA-binding proteins 2⃣ Histone modifications 3⃣ nucleosomes
→ Advantages: 1⃣ Higher resolution 2⃣ Less noise 3⃣ Greater coverage
∴ For studying 1⃣ Gene regulation 2⃣ Epigenetic mechanisms

Introduction

Genome-wide mapping
← of 1⃣ Protein-DNA interactions 2⃣ Epigenetic marks
→ is essential for understanding (transcriptional regulation)

Precise map
← of binding sites for transcription factors
→ is vital → for deciphering (the gene regulatory networks)

Nucleosome positioning
→ has a key role in gene regulation
→ guides (development & differentiation)

Chromatin states
→ can influence transcription directly
∴ (Enhance & Impede) recruitment of effector protein complexes

ChIP
← Chromatin immunoprecipitation
→ for assaying protein-DNA binding ← in vivo
→ Antibodies are used → to select specific (proteins & nucleosomes)
∴ Enabling a genome-scale view (← of DNA-protein interactions)

NGS
← Next-Generation Sequencing
→ Sequences (millions of short DNA fragments) in a single run

ChIP-Seq
← Chromatin immunoprecipitation followed by sequencing
→ one of the early applications (← of NGS)
→ has (higher resolution) & (fewer artifacts) & (greater coverage) & (a larger dynamic range) ← than ChIP-chip
∴ Provides substantially improved data

ChIP-seq basics

DNA fragments
← which are associated with a specific protein
→ are enriched

1⃣ Chromatin is sheared by sonication into small fragments (← 200-600 bp range)
2⃣ Antibody is used to immunoprecipitate (the DNA-protein complex)
3⃣ Determine the sequence bound by the protein

MNase digestion
← Micrococcal nuclease
→ most often used to fragment the chromatin
← when mapping (nucleosome positions) & (histone modifications)

MNase treatment is generally preferred
∵ MNase removes linker DNA ← more efficiently than sonication
∴ MNase allows more precise mapping (← of each nucleosome)

MNase digestion
→ has a more pronounced sequence bias than sonication

Advantages and disadvantages of ChIP-seq

Advantage of ChIP-seq
1⃣ its base-pair resolution → is the greatest improvement
2⃣ ChIP-seq does NOT ❌ suffer from the noise (← which is generated by the hybridization step in ChIP-chip)
3⃣ ChIP-seq has a larger dynamic range ∵ the intensity signal measured on arrays might not be linear over its entire range
4⃣ In ChIP-seq, the genome coverage is NOT ❌ limited ← by the repertoire of (probe sequences) (← which is fixed on the array)

❗ Disadvantages
1⃣ All profiling technologies produce unwanted artifacts (← e.g. sequencing errors)
→ This problem can be ameliorated (← by improvements in alignment algorithms)
2⃣ Bias → towards GC-rich content in fragment selection
3⃣ The loss of (sensitivity & specificity) in detection (← of enriched regions)
∵ Insufficient number of reads
4⃣ Its current cost & availability ❗

Overall cost (← of ChIP-seq) → will have to be lowered
→ to be comparable with the cost of ChIP-chip in every case

Issues in experimental design

Antibody quality

ChIP-seq
← depends on (the quality of the antibody used)

A (sensitive and specific) antibody
→ will give a high level (← of enrichment) → to detect (binding events)

Rigorous validation → is a laborious process

Cross-reactivity (← with similar histone modifications)
→ should be checked → by using two independent antibodies

Sample quantity

One advantage of ChIP-seq (← over ChIP-chip)
→ the smaller amount (← of sample material needed)

Several ChIP protocols have been developed
1⃣ 1e4 - 1e5 cells → for genome-wide profiling
2⃣ 1e2 - 1e3 cells → for PCR quantification
❗ they require (abundant transcription factors) & (histone modifications)

Fewer rounds (← of amplification) are required → for ChIP-seq
∴ Potential for artifacts (← due to PCR) is lower

Precise amount (← of ChIP DNA)
← depends on (the abundance of chromatin-associated protein targets)

Control experiment

ChIP
← involves (several potential sources of artifacts)

Shearing ← of DNA
→ does NOT ❌ result → in (uniform fragmentation) (← of the genome)

(Repetitive sequences) → might seem to be enriched
∵ Inaccuracies (← in the number) (← of copies of the repeats)
∴ ChIP-seq profile → should be compared (← with same region in a matched control)
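
The comparison against a matched control can be sketched as a per-bin fold enrichment. This is only an illustration: the function name, the 200 bp bin size, the library-size scaling and the pseudocount are all assumptions made here, not choices taken from the paper.

```python
from collections import Counter

def fold_enrichment(chip_tags, control_tags, bin_size=200, pseudocount=1.0):
    """Per-bin ChIP/control ratio (hypothetical helper, not from the paper).

    Tags are given as genomic positions on one chromosome. The control is
    scaled to the ChIP library size, and a pseudocount (an assumption)
    avoids division by zero in bins without control tags.
    """
    chip = Counter(pos // bin_size for pos in chip_tags)
    ctrl = Counter(pos // bin_size for pos in control_tags)
    scale = len(chip_tags) / len(control_tags)
    return {
        b: (n + pseudocount) / (ctrl.get(b, 0) * scale + pseudocount)
        for b, n in chip.items()
    }

# Toy data: bin 0 is enriched in the ChIP sample, bin 1 is not.
chip_tags = [10] * 30 + [250] * 10
control_tags = [10] * 5 + [250] * 5 + [450] * 10
enrichment = fold_enrichment(chip_tags, control_tags)
```

A region that looks enriched in the ChIP sample alone (for example a repeat) should drop towards a ratio of 1 once the same region in the control is taken into account.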

3 commonly used types (← of control sample)
1⃣ Input DNA
2⃣ Mock IP DNA
3⃣ DNA from non-specific IP
❗ There is NO ❌ consensus on which is the most appropriate

Comparison (← shearing DNA & amplification)

Large amount (← of sequencing)
→ is necessary for ChIP-seq control experiment
Sufficient numbers of tags
→ are needed at each point
→ to obtain (accurate estimates) → throughout genome

Depth of sequencing

ChIP-chip vs. ChIP-seq
→ Amount of data per experiment
1⃣ ChIP-chip: number of (tiling arrays) is fixed ← regardless of (protein & modification)
2⃣ ChIP-seq: number of reads (sequencing depth) is determined ← by investigator

Many early data sets
← contained reads from a single lane
← regardless of (what the specific experiment was)

Reasonable criterion
← for determining sufficient sequencing depth
→ would be that the results (← of a given analysis) → do NOT ❌ change → when more reads are obtained
∴ saturation point

Binding sites were determined
→ for each sample with a threshold probability

More sites continued
→ to be found at a steady pace (← with additional sequencing)

Saturation point might NOT ❌ exist
∴ it cannot be used to determine (the number of tags)
→ to be sequenced

Saturation point does exist ∃
← if a fixed threshold is imposed ← on (the fold enrichment)

❗ Without such a threshold, the number of (significant peaks)
→ may continue to rise
← with more sequencing
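
The saturation idea above can be sketched by subsampling the tags and re-running the same analysis at each depth. Everything named here is hypothetical: `saturation_curve`, `is_saturated`, the 5% tolerance and the toy caller are illustrative assumptions, and a real peak caller would replace `toy_caller`.

```python
import random
from collections import Counter

def saturation_curve(tags, call_peaks, fractions, seed=0):
    """Count peaks called on random subsets of the tags (hypothetical helper)."""
    rng = random.Random(seed)
    shuffled = list(tags)
    rng.shuffle(shuffled)
    curve = []
    for fraction in fractions:
        subset = shuffled[: int(len(shuffled) * fraction)]
        curve.append((fraction, len(call_peaks(subset))))
    return curve

def is_saturated(curve, tolerance=0.05):
    """True if the last increase in depth changed the peak count by <= tolerance."""
    (_, previous), (_, last) = curve[-2], curve[-1]
    return last > 0 and abs(last - previous) / last <= tolerance

def toy_caller(tags, bin_size=100, min_share=0.1):
    """Stand-in peak caller (an assumption): bins holding >= min_share of all tags.

    The share is independent of depth, which loosely mimics a fixed
    threshold on fold enrichment.
    """
    counts = Counter(pos // bin_size for pos in tags)
    return [b for b, n in counts.items() if n / len(tags) >= min_share]

tags = [50] * 500 + [150] * 500
curve = saturation_curve(tags, toy_caller, fractions=[0.5, 1.0])
```

With a depth-independent criterion such as the one in `toy_caller`, the curve flattens. A criterion based purely on statistical significance would tend to keep adding peaks as more tags are included.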

Ability → to sequence (multiple samples) ← at the same time
→ becomes important for (cost effectiveness)
A few bases are sufficient
→ to serve as (unique identifiers) → for many samples
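
A quick sketch of why a few bases are enough: k bases give 4^k possible identifiers. The helper names and the exact-match demultiplexing below are assumptions for illustration. Real protocols usually leave some distance between barcodes to tolerate sequencing errors.

```python
def barcode_capacity(k: int) -> int:
    """Number of distinct barcodes of length k over the alphabet A, C, G, T."""
    return 4 ** k

def demultiplex(reads, barcodes):
    """Assign reads to samples by exact match on the leading barcode.

    `barcodes` maps barcode sequence -> sample name (all barcodes are
    assumed to have the same length). Reads with an unknown barcode are dropped.
    """
    k = len(next(iter(barcodes)))
    bins = {sample: [] for sample in barcodes.values()}
    for read in reads:
        sample = barcodes.get(read[:k])
        if sample is not None:
            bins[sample].append(read[k:])  # strip the barcode from the tag
    return bins

barcodes = {"ACGT": "sample1", "TTGA": "sample2"}
bins = demultiplex(["ACGTTTTT", "TTGAGGGG", "NNNNAAAA"], barcodes)
```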

Additional considerations

ChIP fragments can be sequenced ← at both ends
← as done for detection of (structural variations) in the genome

Paired-end sequencing → can be used in conjunction (← with ChIP)
1⃣ to provide (additional specificity)
2⃣ to map long-range (chromatin interactions)

ChIP experiments → should be replicated
→ to ensure reproducibility of the data

Challenges in data analysis

Data management

Data can be stored → at 3 levels
1⃣ Image data
2⃣ Sequence data
3⃣ Alignment data

Sequence tags can be used → to re-map the data
1⃣ when an improved aligner is available
2⃣ when a reference genome is updated

NO ❌ consensus
← with regard to which data types should be stored
→ keep only the sequence-level data
∵ Too expensive to store (all data)

Meta-data → should describe (the details of experiment)
∵ To ensure → the archive is useful to the community

Genome alignment

Image processing & Base calling
→ are platform specific
→ are mostly done ← using the software provided

👍 All (subsequent results)
→ are based on the (aligned reads)
❗ Conventional alignment algorithms → can take many processor hours
∴ A new generation (← of aligners) → has been developed

Alignment (← for ChIP-seq)
→ should allow → for (a small number of mismatches)
∵ Sequencing errors & SNPs & Indels

❓ Many current analysis pipelines
→ discard non-unique tags
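
The two notes above (tolerating a few mismatches, discarding non-unique tags) can be sketched with a naive scan. This is not how the new generation of aligners works: it is a brute-force illustration with hypothetical function names, it only counts substitutions (indels are not handled), and the default of 2 mismatches is an assumption.

```python
def hamming_hits(read, genome, max_mismatches):
    """Start positions where `read` matches `genome` with <= max_mismatches substitutions."""
    hits = []
    n = len(read)
    for start in range(len(genome) - n + 1):
        mismatches = 0
        for a, b in zip(read, genome[start:start + n]):
            if a != b:
                mismatches += 1
                if mismatches > max_mismatches:
                    break  # too many mismatches, give up on this position
        else:
            hits.append(start)  # inner loop finished without breaking
    return hits

def unique_alignment(read, genome, max_mismatches=2):
    """Return the single hit position, or None if the tag is unmapped or non-unique."""
    hits = hamming_hits(read, genome, max_mismatches)
    return hits[0] if len(hits) == 1 else None

genome = "ACGTACGTTTGACCA"
exact = hamming_hits("TTGACC", genome, 0)          # exact match
one_off = hamming_hits("TTGATC", genome, 1)        # one substitution tolerated
repeat = unique_alignment("ACGT", genome, 0)       # maps to two places
```

The last call shows the cost of discarding non-unique tags: a read from a repeated sequence is dropped even though it aligns perfectly.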

Identification of enriched regions

To identify regions
← which are enriched in the ChIP sample ← relative to the control

Peak callers
← several are currently available
→ scan along the genome
→ to identify (the enriched regions)

A simple fold ratio (← of the signal for ChIP sample) ← around the peak
→ provides (important information)
→ but is NOT ❌ adequate

Effective approach
→ A Poisson model (← for the tag distribution)
← accounts for 1⃣ the fold ratio 2⃣ the absolute tag numbers

Poisson model
→ can be modified → to account for regional bias (← in tag density)
∵ Chromatin structure & Copy number variation & Amplification bias
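
A worked example of why the fold ratio alone is not adequate: 50 tags against an expected 10 and 500 tags against an expected 100 both have a fold ratio of 5, but very different significance. The sketch assumes the simplest possible model, where the control count (already scaled to the ChIP library size) is used directly as the Poisson mean. The function name is hypothetical.

```python
import math

def poisson_log_sf(k: int, lam: float) -> float:
    """Natural log of P(X >= k) for X ~ Poisson(lam), lam > 0, summed in log space."""
    if k <= 0:
        return 0.0  # P(X >= 0) = 1
    # log pmf(k) = -lam + k*log(lam) - log(k!)
    log_term = -lam + k * math.log(lam) - math.lgamma(k + 1)
    log_terms = [log_term]
    largest = log_term
    i = k
    while True:
        i += 1
        log_term += math.log(lam) - math.log(i)  # pmf(i) = pmf(i-1) * lam / i
        if i > lam and log_term < largest - 50:
            break  # remaining terms are negligible
        log_terms.append(log_term)
        largest = max(largest, log_term)
    return largest + math.log(sum(math.exp(t - largest) for t in log_terms))

# Same fold ratio, different absolute tag numbers.
for chip_count, control_count in [(50, 10), (500, 100)]:
    print(chip_count / control_count, poisson_log_sf(chip_count, control_count))
```

Both lines print a fold ratio of 5.0, but the log tail probability is in the low -40s for 50 vs 10 and below -400 for 500 vs 100. A regional bias correction would replace the single `lam` with a locally estimated mean.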

Strand-specific peaks can be scored
← before a combined profile is generated
1⃣ how well (the tag distributions) on the two strands ← resemble each other
2⃣ whether the distance ← between (the peaks on the two strands) → is close to the expected distance (≈ fragment size)
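
The strand-specific scoring above depends on knowing how far apart the forward-strand and reverse-strand tag clusters are. A minimal sketch, assuming tags are given as 5' positions per strand and using a simple overlap count rather than a proper correlation, is to try each shift and keep the one where the two strands line up best. `best_strand_shift` is a hypothetical name.

```python
from collections import Counter

def best_strand_shift(forward_tags, reverse_tags, max_shift):
    """Shift d (0..max_shift) that best aligns forward tags onto reverse tags.

    The score for a shift is the number of (forward, reverse) tag pairs
    that are exactly d bases apart.
    """
    fwd = Counter(forward_tags)
    rev = Counter(reverse_tags)

    def score(d):
        return sum(n * rev.get(pos + d, 0) for pos, n in fwd.items())

    return max(range(max_shift + 1), key=score)

# Toy data: two binding sites, reverse-strand tags 150 bp downstream.
forward_tags = [100, 101, 102, 500, 501]
reverse_tags = [250, 251, 252, 650, 651]
shift = best_strand_shift(forward_tags, reverse_tags, max_shift=300)
```

A candidate peak whose own strand offset is far from this genome-wide estimate would score poorly on criterion 2⃣.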

Another important (local correction)
← regardless of (the peak detection method)
→ is to adjust → for sequence alignability

Major difficulty
← in identifying (enriched regions)
→ 3 types of peaks occur in ChIP-seq data
1⃣ Sharp
2⃣ Broad
3⃣ Mixed

Sharp peaks
→ are generally found → for (protein-DNA binding) & (histone modifications) ← at regulatory regions
Many (peak detection techniques)
← which are used in ChIP-chip & DNA copy number analysis
→ will be modified for ChIP-seq

(A careful comparison) (← of the algorithms)
→ is still being carried out
It is clear → that (the best methods) should at least
1⃣ Take advantage (← of the strand-specific pattern) (← expected at a binding location)
2⃣ Adjust for (local variation) as measured by input DNA
3⃣ To a lesser extent, correct → for sequence alignability

Downstream analysis

Approaches
→ to analyse (the biological implications) (← of ChIP-seq data)