PARCLIP data analysis using CTK

From Zhang Laboratory

Revision as of 16:49, 24 May 2018 by Czhang (Talk | contribs)

(diff) ← Older revision | Latest revision (diff) | Newer revision → (diff)
Jump to: navigation, search
CTK home | Standard/BrdU-CLIP | iCLIP | eCLIP | PAR-CLIP | CTK usage | FAQ


Introduction

This tutorial outlines how to analyze CLIP data derived from the Photoactivatable-Ribonucleoside-Enhanced Crosslinking and Immunoprecipitation (PAR-CLIP) protocol. The method relies on the incorporation of photoreactive ribonucleoside analogs, like 4-thiouridine (4-SU) and 6-thioguanosine (6-SG), into nascent RNA transcripts. Irradiation of cells by UV light crosslinks the photoreactive nucleoside-labeled cellular RNAs to interacting RNA binding proteins.

One characteristic feature of cDNA libraries prepared by PAR-CLIP is that crosslinking can be identified by mutations in the sequenced cDNA. 4-SU induces a thymidine to cytidine (T->C) transition, whereas 6-SG induces guanosine to adenosine (G->A) mutations.

Hafner, M., Landthaler, M., Burger, L., Khorshid, M., Hausser, J., Berninger, P., ... & Ulrich, A. (2010). PAR-CliP-a method to identify transcriptome-wide the binding sites of RNA binding proteins. JoVE (Journal of Visualized Experiments), (41), e2034-e2034.


PAR-CLIP reads typically have the following structure:

[CLIP tag sequence]TCGTATGCCGTCTTCTGCTTG

Sample dataset

We will use PUM2 PAR-CLIP data from 293T cells generated by Hafner et al. (2010) for this tutorial.

Raw FASTQ files

We will use ~/test as our working directory, and download fastq files under ~/test/fastq

mkdir ~/test; cd ~/test
mkdir fastq; cd fastq

The raw sequence files from Illumina sequencing can be downloaded from the ENCODE web site.

wget ftp://ftp-trace.ncbi.nih.gov/sra/sra-instant/reads/ByRun/sra/SRR/SRR048/SRR048967/SRR048967.sra
wget ftp://ftp-trace.ncbi.nih.gov/sra/sra-instant/reads/ByRun/sra/SRR/SRR048/SRR048968/SRR048968.sra

#convert to fastq files
fastq-dump -F  --gzip SRR048967.sra
fastq-dump -F  --gzip SRR048968.sra

ln -s SRR048967.fastq.gz PUM2_high.fastq.gz
ln -s SRR048968.fastq.gz PUM2_low.fastq.gz


Note that two bands of high and low molecular weight were cut out and used to prepare independent PAR-CLIP libraries. These were treated biological replicates in the original study and in this tutorial.

Output files of major steps generated in this tutorial


The number of reads in each file, after each step of processing, is summarized in Table 1 below.

Table 1: Summary of read numbers analyzed by CTK.
Protein Sample # of raw reads # of trimmed & filtered reads # of collapsed reads # of mapped reads # of unique tags
PUM2 high 5,104,559 4,477,899 1,053,740 81,310 30,539
PUM2 low 5,351,797 3,495,032 1,058,679 198,223 129,781
Total 10,456,356 7,972,931 2,112,419 279,533 160,320

Read preprocessing

Read quality filtering

cd ~/test
mkdir filtering
cd filtering

The fastq_filter.pl script will extract reads passing quality filters for each library (of standard CLIP-seq data) using a simple loop.

#!/bin/bash

for f in PUM2_high PUM2_low; do
	perl /usr/local/CTK/fastq_filter.pl -v -if sanger -f mean:0-24:20 -of fastq ../fastq/$f.fastq.gz - | gzip -c > $f.filter.fastq.gz
done


Trimming of 3' linker sequences

cd ~/test
mkdir filtering
cd filtering


One can then trim the 3' adapter using the command below.

#!/bin/bash

for $f in PUM2_high PUM2_low; do
cutadapt -f fastq --times 1 -e 0.1 -O 1 --quality-cutoff 5 -m 15 -a TCGTATGCCGTCTTCTGCTTG -o $f.filter.trim.fastq.gz $f.filter.fastq.gz > $f.cutadpt.log
done

The parameters are as follows:

  • -f fastq : input is a FASTQ file
  • --times 1 : removes up to 1 adapters from each read; here we specify 2 as occasionally some tags have 2 copies of adapters
  • -e 0.1 : maximum allowed error rate is 0.1
  • -O 1 : the read is not modified if the overlap between the read and the adapter is shorter than 1nt
  • --quality-cutoff : refers to quality
  • -m 15 : minimum length of the read being 15nt tags
  • -a : the 3' adapter

For more cutadapt usage information:

cutadapt --help


This is the same thing, but using a Sun Grid Engine to execute this UNIX job on a remote machine.

#!/bin/bash
#$ -t 1-2 -m a -cwd -N CLIP

files=(PUM2_high PUM2_low);
f=${files[$SGE_TASK_ID-1]}

cutadapt -f fastq --times 1 -e 0.1 -O 1 --quality-cutoff 5 -m 15 -a TCGTATGCCGTCTTCTGCTTG -o $f.filter.trim.fastq.gz $f.filter.fastq.gz > $f.cutadpt.log


Note 1: It is good to check the number of reads by running the following command:

#e.g., for raw reads
for f in `ls ../fastq/*.fastq.gz`; do
    c=`zcat $f | wc -l`
    c=$((c/4))
    echo $f $c
done

#e.g., for trimmed reads
for f in `ls *.filter.trim.fastq.gz`; do
    c=`zcat $f | wc -l`
    c=$((c/4))
    echo $f $c
done

Collapse Exact PCR Duplicates

Collapse exact PCR duplicates such that if multiple reads have exactly the same sequence, only one is kept.

#!/bin/bash -l
for f in PUM2_high PUM2_low; do
    perl /usr/local/CTK/fastq2collapse.pl $f.filter.trim.fastq.gz - | gzip -c > $f.filter.trim.c.fastq.gz
done

Note 1: It is good to check the number of reads by running the command:

for f in `ls *.filter.trim.c.fastq.gz`; do
    c=`zcat $f | wc -l`
    c=$((c/4))
    echo $f $c
done


Note 2: Get the distribution of tag lengths for diagnostic purposes using the following command.

for f in PUM2_high PUM2_low; do
    zcat $f.filter.trim.c.fastq.gz | awk '{if(NR%4==2) {print length($0)}}' | sort -n | uniq -c | awk '{print $2"\t"$1}' > $f.filter.trim.c.seqlen.stat.txt
done

Download fully preprocessed FASTQ files here (60 MB).

Read mapping & parsing

We assume the reference index has been prepared and is available under /genomes/hg19/bwa/.

cd ~/test
mkdir mapping
cd mapping

Run bwa to align the reads to the reference genome.

for f in PUM2_high PUM2_low; do
   bwa aln -t 4 -n 0.06 -q 20 /genomes/hg19/bwa/hg19.fa ../filtering/$f.filter.trim.c.fastq.gz > $f.sai
   bwa samse /genomes/hg19/bwa/hg19.fa $f.sai ../filtering/$f.filter.trim.c.fastq.gz | gzip -c > $f.sam.gz
done


The option -n 0.06 specifies slightly more stringent criteria than the default. The number of allowed mismatches (substitutions or indels) depending on read length is as follows:


[bwa_aln] 17bp reads: max_diff = 1
[bwa_aln] 20bp reads: max_diff = 2
[bwa_aln] 45bp reads: max_diff = 3
[bwa_aln] 73bp reads: max_diff = 4
[bwa_aln] 104bp reads: max_diff = 5
[bwa_aln] 137bp reads: max_diff = 6
[bwa_aln] 172bp reads: max_diff = 7
[bwa_aln] 208bp reads: max_diff = 8
[bwa_aln] 244bp reads: max_diff = 9

The -q option is used to trim low quality reads (an average of 20 or below). However, this does not seem to do too much after the trimming steps above.


Parsing SAM file

for f in PUM2_high PUM2_low; do
   perl /usr/local/CTK/parseAlignment.pl -v --map-qual 1 --min-len 18 --mutation-file $f.mutation.txt $f.sam.gz - | gzip -c > $f.tag.bed.gz
done

This will keep only unique mappings (with MAPQ >=1) and a minimal mapping size of 18 nt.

Note 1: In the tag bed file, the 5th column records the number of mismatches (substitutions) in each read

Note 2: Other aligners might not use a positive MAPQ as an indication of unique mapping.

Another useful option is --indel-to-end, which specifies the number of nucleotides towards the end from which indels should not be called (default=5 nt).

Note 3: The parsing script relies on MD tags, which is an optional field without strict definition in SAM file format specification. Some aligners might have slightly different format how they report mismatches. If other aligners than bwa is used, one should run the following command:

samtools view -bS $f.sam | samtools sort - $f.sorted
samtools fillmd  $f.sorted.bam /genomes/hg19/bwa/hg19.fa | gzip -c > $f.sorted.md.sam.gz

This will ensure the sam file gets parsed properly.

Note 4: Keep track what proportion of reads can be mapped uniquely.

wc -l *.tag.bed

Collapse PCR duplicates

It is critical to collapse PCR duplicates, not only for the exact duplicates collapsed above, but also for those with slight differences due to sequencing errors.

for f in PUM2_high PUM2_low; do
   perl /usr/local/CTK/tag2collapse.pl -v -big -weight --weight-in-name --keep-max-score --keep-tag-name $f.tag.bed $f.tag.uniq.bed
done

In the command above, tags mapped to the same genomic positions (i.e., the same start coordinates for the 5' end of RNA tag) will be collapsed together. Note that this PAR-CLIP dataset does not have random barcode attached CLIP tags that can be used as unique molecular identifiers (UMIs).

Note 1: Note that the read ID in the input BED file (in the 4th column) must take the form READ#x, where x is the number of exact duplicates. Read IDs that are not in this format will generate an error.

Note 2: As a diagnostic step, get the length distribution of unique tags, which should be a more faithful representation of the library:

for f in PUM2_high PUM2_low; do
   awk '{print $3-$2}' $f.tag.uniq.bed | sort -n | uniq -c | awk '{print $2"\t"$1}' > $f.tag.uniq.len.dist.txt
done


Get the mutations in unique tags.

for f in PUM2_high PUM2_low; do
   python /usr/local/CTK/joinWrapper.py $f.mutation.txt $f.tag.uniq.bed 4 4 N $f.tag.uniq.mutation.txt
done

This script is from galaxy, and included for your convenience. The parameters 4 4 indicate the columns in the two input file used to join the two, and N indicates that only paired rows should be printed.


Table 2 summarizes the number of unique mutations of different types in each sample.

Table 2: Summary of mutations in unique tags.
Protein Sample # of unique tags Total # of mutations in unique tags Deletions Insertions Substitutions T2C
PUM2 high 30,539 30,467 2,187 1,086 27,194 17,557
PUM2 low 129,781 147,772 5,102 15,019 127,651 46,302
Total 160,320 178,239 7,289 16,105 154,845 63,859


Merging biological replicates

After getting unique tags, one might want to concatenate these runs, which can be distinguished by different colors. As an example:

perl /usr/local/CTK/bed2rgb.pl -v -col "188,0,0" PUM2_high.tag.uniq.bed PUM2_high.tag.uniq.rgb.bed
perl /usr/local/CTK/bed2rgb.pl -v -col "128,0,128" PUM2_low.tag.uniq.bed PUM2_low.tag.uniq.rgb.bed

Then these can be concatenated.

cat PUM2_high.tag.uniq.rgb.bed PUM2_low.tag.uniq.rgb.bed> PUM2.pool.tag.uniq.rgb.bed
cat PUM2_high.tag.uniq.mutation.txt PUM2_low.tag.uniq.mutation.txt  > PUM2.pool.tag.uniq.mutation.txt


Download unique tag and mutation files here (12 MB).

Annotating and visualizing CLIP tags

Get the genomic distribution of CLIP tags

perl /usr/local/CTK/bed2annotation.pl -dbkey hg19 -ss -big -region -v -summary PUM2.pool.tag.uniq.annot.summary.txt PUM2.pool.tag.uniq.rgb.bed PUM2.pool.tag.uniq.annot.txt

Make sure the current genome (hg19) is specified. Check the summary file (PUM2.pool.tag.uniq.annot.summary.txt) for the percentage of tags mapped to CDS, 3'UTR, introns, etc.

Generate bedgraph for visualization in the genome browser

perl /usr/local/CTK/tag2profile.pl -v -ss -exact -of bedgraph -n ″Unique Tag Profile″ PUM2.pool.tag.noRNA.uniq.rgb.bed PUM2.pool.tag.uniq.bedgraph

The output bedgraph file can be loaded into a genome browser for visualization of tag number at each genomic position.

Peak calling

cd ~/test
mkdir cluster
cd cluster
ln -s ../mapping/PUM2.pool.tag.uniq.rgb.bed

Mode 1: Peak calling with no statistical significance

perl /usr/local/CTK/tag2peak.pl -big -ss -v --valley-seeking --valley-depth 0.9 PUM2.pool.tag.uniq.rgb.bed PUM2.pool.tag.uniq.peak.bed --out-boundary PUM2.pool.tag.uniq.peak.boundary.bed --out-half-PH PUM2.pool.tag.uniq.peak.halfPH.bed

Note 1: To annotate peaks with overlapping genes and repeat masked sequences and get genomic breakdown:

perl /usr/local/CTK/bed2annotation.pl -dbkey hg19 -ss -big -region -v -summary PUM2.pool.tag.uniq.peak.annot.summary.txt PUM2.pool.tag.uniq.peak.bed PUM2.pool.tag.uniq.peak.annot.txt

The output file PUM2.pool.tag.uniq.peak.annot.txt has the detailed annotation for each peak, and PUM2.pool.tag.uniq.peak.annot.summary.txt has summary statistics.

Note 2: Another useful option --valley-depth specifies the depth of the valley relative to the peak (0.5-1, 0.9 by default). One also has the option to merge peaks close to each other by specifying the distance between the peaks (e.g., -gap 20).

Note 3: Besides the peak boundaries (PUM2.pool.tag.uniq.peak.bed), one can also output cluster boundaries (--out-boundary) and half peak boundaries (--out-half-PH ) associated with each peak.


In this mode, the script will not assess the statistical significance of the peak height. If one needs this, an alternative mode is as follows:

Mode 2: Peak calling with statistical significance

perl /usr/local/CTK/tag2peak.pl -big -ss -v --valley-seeking -p 0.05 --valley-depth 0.9 --multi-test --dbkey hg19 PUM2.pool.tag.uniq.rgb.bed PUM2.pool.tag.uniq.peak.sig.bed --out-boundary PUM2.pool.tag.uniq.peak.sig.boundary.bed --out-half-PH PUM2.pool.tag.uniq.peak.sig.halfPH.bed

Note 1: To get the number of significant peaks:

wc -l PUM2.pool.tag.noRNA.uniq.peak.sig.bed
2,684 PUM2.pool.tag.uniq.peak.sig.bed

Note 2: To search for a known binding motif, one first defines the center of each peak (based on width at peak).

#extract from -500 to +500
awk '{print $1"\t"int(($2+$3)/2)-500"\t"int(($2+$3)/2)+500"\t"$4"\t"$5"\t"$6}' PUM2.pool.tag.uniq.peak.sig.bed > PUM2.pool.tag.uniq.peak.sig.center.ext1k.bed

This bed file can be used to extract sequences around peak (pay attention to the strand), and search for enrichment of specific motif (e.g., UGUANAUA) relative to the peak.

Note 3: One might want to count the number of tags overlapping with each cluster/peak for each sample (e.g., to evaluate correlation between replicates).

for f in HepG2.RBFOX2.rep1.R2 HepG2.RBFOX2.rep2.R2; do
    perl /usr/local/CTK/tag2profile.pl -ss -region PUM2.pool.tag.uniq.peak.sig.boundary.bed -of bed -v $f.tag.uniq.bed $f.tag.uniq.peak.sig.boundary.count.bed
done


Download output files for peak calling here (691 KB).

CIMS

cd ~/test
mkdir CIMS
cd CIMS

ln -s ../mapping/PUM2.pool.tag.uniq.rgb.bed 
ln -s ../mapping/PUM2.pool.tag.uniq.mutation.txt

Get specific types of mutations

Get specific types of mutations, such as deletions, substitutions, and insertions in unique CLIP tags

perl ~/czlab_src/CTK/getMutationType.pl -t del PUM2.pool.tag.uniq.mutation.txt PUM2.pool.tag.uniq.del.bed
#deletions
perl ~/czlab_src/CTK/getMutationType.pl -t ins PUM2.pool.tag.uniq.mutation.txt PUM2.pool.tag.uniq.ins.bed
#insertions
perl ~/czlab_src/CTK/getMutationType.pl -t sub PUM2.pool.tag.uniq.mutation.txt PUM2.pool.tag.uniq.sub.bed
#substitutions
perl ~/czlab_src/CTK/getMutationType.pl -t t2c --summary PUM2.pool.tag.uniq.mutation.stat.txt PUM2.pool.tag.uniq.mutation.txt PUM2.pool.tag.uniq.t2c.bed
#t2c substitutions, as well as summary statistics of different types of mutations

It is always a good practice to look at the number of each type of mutation to see, e.g., compare the relative abundance of deletions to insertions. This information is provided in the summary statistics file when one specify the option --summary.

Get CIMS

Here, we use deletions as an example.

perl /usr/local/CTK/CIMS.pl -big -n 10 -p -outp PUM2.pool.tag.uniq.t2c.pos.stat.txt -v PUM2.pool.tag.uniq.rgb.bed PUM2.pool.tag.uniq.t2c.bed PUM2.pool.tag.uniq.t2c.CIMS.txt

By default, CIMS.pl will output all sites including those that are not statistically significant. This is recommended because one can play with stringency later.

Extract sites that are statistically significant.

awk '{if($9<=0.05) {print $0}}' PUM2.pool.tag.uniq.t2c.CIMS.txt | sort -k 9,9n -k 8,8nr -k 7,7n > PUM2.pool.tag.uniq.t2c.CIMS.s13.txt
cut -f 1-6 PUM2.pool.tag.uniq.t2c.CIMS.s13.txt > PUM2.pool.tag.uniq.t2c.CIMS.s13.bed
wc -l PUM2.pool.tag.uniq.t2c.CIMS.s13.bed
2095 PUM2.pool.tag.uniq.del.CIMS.s13.bed

This will keep only those with FDR<0.05. Significant sites will also be sorted by FDR, then by the number of tags with mutations, and then by the total number of overlapping tags. One might try different thresholds to get a balance of sensitivity/specificity (e.g. as judged from motif enrichment).

Note 1: Another parameter that might be useful to improve signal to noise is m/k (i.g., $8/$7 in awk)

Note 2: The number of significant CIMS might be slightly different when unique tags are provided in different orders (this could occur easily when multiple replicates are concatenated together). This is because the FDR estimation is based on permutation, which can change slightly depending on the order of tags.

Note 3: To examine enrichment of motif around CIMS:

awk '{print $1"\t"$2-10"\t"$3+10"\t"$4"\t"$5"\t"$6}' PUM2.pool.tag.uniq.t2c.CIMS.s13.bed > PUM2.pool.tag.uniq.sub.CIMS.s13.21nt.bed
#+/-10 around CIMS

This BED file could be used to extract sequences around CIMS (pay attention to the strand), and search for enrichment of specific motif (e.g., UGUANAUA for the case of PUM2) relative to the CIMS.


Download CIMS output files here (4 KB).


One can repeat these steps for the other types of mutations (i.e. deletions, insertions and general substitutions). This is particularly relevant when one works with a new RBP, and it is unknown which type of mutations will be caused by cross linking.


Motif enrichment

The enrichment of motif sites (when the motif is known) around CIMS or CITS provides a quantitative measure of the signal to noise ratio. The plot below shows the proportion of sites with the PUM2 binding UGUANAUA motif starting at each position relative to CIMS (T-C substitutions). This analysis shows U7 as the major crosslink sites in this dataset, which is consistent with the analysis of the original study (Hafner et al. 2010).