1 Introduction
2 Sample dataset
- 2.1 Raw FASTQ files
- 2.2 Output files of major steps generated in this tutorial
3 Read preprocessing
4 Read mapping & parsing
5 Collapse PCR duplicates
- 5.1 Merging biological replicates
6 Annotating and visualizing CLIP tags
- 6.1 Get the genomic distribution of CLIP tags
- 6.2 Generate bedgraph for visualization in the genome browser
7 Peak calling
- 7.1 Mode 1: Peak calling with no statistical significance
- 7.2 Mode 2: Peak calling with statistical significance
8 CIMS
- 8.1 Get specific types of mutations
- 8.2 Get CIMS
9 CITS
10 Motif enrichment

Introduction

This tutorial outlines how to analyze individual-nucleotide resolution CLIP (iCLIP) data. For more information on the biochemical iCLIP protocol, refer to the following references:

Konig, J., K. Zarnack, G. Rot, T. Curk, M. Kayikci, B. Zupan, D. J. Turner, N. M. Luscombe and J. Ule (2010). "iCLIP reveals the function of hnRNP particles in splicing at individual nucleotide resolution." Nat Struct Mol Biol 17(7): 909-915.

Huppertz, I., J. Attig, A. D'Ambrogio, L. E. Easton, C. R. Sibley, Y. Sugimoto, M. Tajnik, J. Konig and J. Ule (2014). "iCLIP: protein-RNA interactions at nucleotide resolution." Methods 65(3): 274-287.

iCLIP reads typically has the following structure:

NNNSSSSNN[CLIP tag sequence]AGATCGGAAGAGCGGTTCAGCAGGAATGCCGAGACCGATCTCGTATGCCGTCTTCTGCTTG

where N3 and N2 are bipartite random barcode (UMI) and S4 is the sample index.

Sample dataset

Sample Rbfox2 iCLIP data from this study:

Van Nostrand, E. L., G. A. Pratt, A. A. Shishkin, C. Gelboin-Burkhart, M. Y. Fang, B. Sundararaman, S. M. Blue, T. B. Nguyen, C. Surka, K. Elkins, R. Stanton, F. Rigo, M. Guttman and G. W. Yeo (2016). "Robust transcriptome-wide discovery of RNA-binding protein binding sites with enhanced CLIP (eCLIP)." Nat Methods 13(6): 508-514.

Raw FASTQ files

The raw sequence files from Illumina sequencing can be downloaded from SRA as outlined below. The four FASTQ files refer to four different runs.

wget ftp://ftp-trace.ncbi.nih.gov/sra/sra-instant/reads/ByRun/sra/SRR/SRR314/SRR3147674/SRR3147674.sra
wget ftp://ftp-trace.ncbi.nih.gov/sra/sra-instant/reads/ByRun/sra/SRR/SRR314/SRR3147675/SRR3147675.sra
wget ftp://ftp-trace.ncbi.nih.gov/sra/sra-instant/reads/ByRun/sra/SRR/SRR314/SRR3147676/SRR3147676.sra
wget ftp://ftp-trace.ncbi.nih.gov/sra/sra-instant/reads/ByRun/sra/SRR/SRR314/SRR3147677/SRR3147677.sra

fastq-dump -F  --gzip SRR3147674.sra
fastq-dump -F --gzip SRR3147675.sra
fastq-dump -F --gzip SRR3147676.sra
fastq-dump -F --gzip SRR3147677.sra

These sequencing files can be given more meaningful names.

ln -s SRR3147674.fastq.gz RBFOX2.rep1.fastq.gz
ln -s SRR3147675.fastq.gz RBFOX2.rep2.fastq.gz
ln -s SRR3147676.fastq.gz RBFOX2.rep3.fastq.gz
ln -s SRR3147677.fastq.gz RBFOX2.rep4.fastq.gz

These FASTQ files are placed in the directory called fastq under the test folder in home directory (i.e., ~/test is our working directory)

The four runs have sample index GGTC, CCAC,GGTC, CCAC, respectively. In this tutorial, we assume these samples are from independent PCR, so PCR duplicates were collapse independently for each run.

Output files of major steps generated in this tutorial

Fully preprocessed FASTQ files (1.1 GB) : output files generated for use right before mapping.

Unique CLIP tag and mutation files (375 MB) : output files generated after mapping and collapsing PCR duplicates.

Peak calling files (36 MB) : output files after peak calling.

CIMS output files (2.8 MB) : output files after CIMS analysis.

CITS output files (595 KB) : output files after CITS analysis.

The number of reads in each file, after each step of processing, is summarized in Table 1 below.

Table 1: Summary of read numbers analyzed by CTK.
Protein	Sample	# of raw reads	# of trimmed & filtered reads	# of collapsed reads	# of mapped reads	# of reads after removal of repetitive RNAs	# of unique tags
RBFOX2	rep1	21,874,724	20,688,477	2,596,206	1,714,320	1,678,897	646,705
RBFOX2	rep2	112,120,785	103,866,169	14,631,407	9,462,508	9,295,618	3,482,864
RBFOX2	rep3	22,648,473	21,185,357	2,231,051	1,105,998	1,083,783	553,372
RBFOX2	rep4	124,320,492	114,054,608	13,621,263	6,533,253	6,417,281	3,161,497
	Total	280,964,474	259,794,611	33,079,927	18,816,079	18,475,579	7,844,438

Read preprocessing

Demultiplexing samples

The files downloaded from SRA have already be demultiplexed. Therefore this step has already been performed.

For the user who would like to start from a FASTQ file consisting of multiple libraries, check here.

Read quality filtering

For this particular dataset, we skipped the read quality filtering step, as this does not appear to make any major impact on the results.

For the user who would like to include this step, check here.

Trimming of 3' linker sequences

cd ~/test
mkdir filtering
cd filtering

One can then trim the 3' adapter using the command below.

#!/bin/bash
for f in RBFOX2.rep1 RBFOX2.rep2 RBFOX2.rep3 RBFOX2.rep4; do
    cutadapt --times 1 -e 0 -O 1 --quality-cutoff 5 -m 24 -a AGATCGGAAGAGCGGTTCAGCAGGAATGCCGAGACCGATCTCGTATGCCGTCTTCTGCTTG -o $f.trim.fastq.gz ../fastq/$f.fastq.gz > $f.cutadpt.log
done

The parameters are as follows:

--times 1 : removes up to 1 adapters from each read; here we specify 2 as occasionally some tags have 2 copies of adapters
-e 0 : maximum allowed error rate is 0
-O 1 : the read is not modified if the overlap between the read and the adapter is shorter than 1nt
--quality-cutoff : refers to quality
-m 24 : minimum length of the read being 24nt (9 nt barcode + 15 nt tags)
-a : the 3' adapter

For more cutadapt usage information:

cutadapt --help

This is the same thing, but using a Sun Grid Engine to execute this UNIX job on a remote machine.

#!/bin/bash
#$ -t 1-4 -m a -cwd -N CLIP

files=(RBFOX2.rep1 RBFOX2.rep2 RBFOX2.rep3 RBFOX2.rep4);
f=${files[$SGE_TASK_ID-1]}

cutadapt --times 1 -e 0.1 -O 1 --quality-cutoff 5 -m 24 -a AGATCGGAAGAGCGGTTCAGCAGGAATGCCGAGACCGATCTCGTATGCCGTCTTCTGCTTG -o $f.trim.fastq.gz ../fastq/$f.fastq.gz > $f.cutadpt.log
#results from this demo might have been generated using -e 0, but -e0.1 is probably more reasonable.

Note 1: It is good to check the number of reads by running the following command:

#e.g., for raw reads
for f in `ls ../fastq/*.fastq.gz`; do
    c=`zcat $f | wc -l`
    c=$((c/4))
    echo $f $c
done

#e.g., for trimmed reads
for f in `ls *.trim.fastq.gz`; do
    c=`zcat $f | wc -l`
    c=$((c/4))
    echo $f $c
done

Collapse exact duplicates

If multiple reads have exactly the same sequence, only one is kept.

#!/bin/bash
for f in RBFOX2.rep1 RBFOX2.rep2 RBFOX2.rep3 RBFOX2.rep4; do
    perl /usr/local/CTK/fastq2collapse.pl $f.trim.fastq.gz - | gzip -c > $f.trim.c.fastq.gz
done

Note 1: It is good to check the number of reads by running the command:

for f in `ls *.trim.c.fastq.gz`; do
    c=`zcat $f | wc -l`
    c=$((c/4))
    echo $f $c
done

Strip random barcode (UMI)

The following script removes the 5' degenerate barcode. The barcode is in the format: NNNCCACNN where CCAC (for an example) is fixed for a given library. Therefore, the barcode is 9 nucleotides long.

for f in RBFOX2.rep1 RBFOX2.rep2 RBFOX2.rep3 RBFOX2.rep4; do
    perl /usr/local/CTK/stripBarcode.pl -format fastq -len 9 $f.trim.c.fastq.gz - | gzip -c > $f.trim.c.tag.fastq.gz
done

Note 1: We include sample index as part of the random barcode in this example.

Note 2: Get the distribution of tag lengths for diagnostic purposes using the following command.

for f in RBFOX2.rep1 RBFOX2.rep2 RBFOX2.rep3 RBFOX2.rep4; do
    zcat $f.trim.c.tag.fastq.gz | awk '{if(NR%4==2) {print length($0)}}' | sort -n | uniq -c | awk '{print $2"\t"$1}' > $f.trim.c.tag.seqlen.stat.txt
done

Download fully preprocessed FASTQ files here (1.1 GB)

Read mapping & parsing

Read mapping

We assume the reference index has been prepared and is available under /genomes/hg19/bwa/.

cd ~/test
mkdir mapping
cd mapping

Run bwa to align the reads to the reference genome.

#!/bin/bash
for f in RBFOX2.rep1 RBFOX2.rep2 RBFOX2.rep3 RBFOX2.rep4; do
   bwa aln -t 4 -n 0.06 -q 20 /genomes/hg19/bwa/hg19.fa ../filtering/$f.trim.c.tag.fastq.gz > $f.sai
   bwa samse /genomes/hg19/bwa/hg19.fa $f.sai ../filtering/$f.trim.c.tag.fastq.gz | gzip -c > $f.sam.gz
done

The option -n 0.06 specifies slightly more stringent criteria than the default. The number of allowed mismatches (substitutions or indels) depending on read length is as follows:


[bwa_aln] 17bp reads: max_diff = 1
[bwa_aln] 20bp reads: max_diff = 2
[bwa_aln] 45bp reads: max_diff = 3
[bwa_aln] 73bp reads: max_diff = 4
[bwa_aln] 104bp reads: max_diff = 5
[bwa_aln] 137bp reads: max_diff = 6
[bwa_aln] 172bp reads: max_diff = 7
[bwa_aln] 208bp reads: max_diff = 8
[bwa_aln] 244bp reads: max_diff = 9

The -q option is used to trim low quality reads (an average of 20 or below). However, this does not seem to do too much after the trimming steps above.

Parsing SAM file

for f in RBFOX2.rep1 RBFOX2.rep2 RBFOX2.rep3 RBFOX2.rep4; do
perl /usr/local/CTK/parseAlignment.pl -v --map-qual 1 --min-len 18 --mutation-file $f.mutation.txt $f.sam.gz $f.tag.bed
done

This will keep only unique mappings (with MAPQ >=1) and a minimal mapping size of 18 nt.

Note 1: In the tag bed file, the 5th column records the number of mismatches (substitutions) in each read

Note 2: Other aligners might not use a positive MAPQ as an indication of unique mapping.

Another useful option is --indel-to-end, which specifies the number of nucleotides towards the end from which indels should not be called (default=5 nt).

Note 3: The parsing script relies on MD tags, which is an optional field without strict definition in SAM file format specification. Some aligners might have slightly different format how they report mismatches. If other aligners than bwa is used, one should run the following command:

samtools view -bS $f.sam | samtools sort - $f.sorted
samtools fillmd  $f.sorted.bam /genomes/hg19/bwa/hg19.fa | gzip -c > $f.sorted.md.sam.gz

This will ensure the sam file gets parsed properly.

Note 4: Keep track what proportion of reads can be mapped uniquely.

wc -l *.tag.bed

Remove tags from rRNA and other repetitive RNA (optional)

This is an optional step, and performed here following the original analysis of this dataset (see Van Nostrand et al. 2016 Nat Meth 13:508-514 for more detail). In some cases, CLIP data can have contaminants due to abundant repetitive RNAs (such as rRNAs).

for f in RBFOX2.rep1 RBFOX2.rep2 RBFOX2.rep3 RBFOX2.rep4
do
perl /usr/local/CTK/tagoverlap.pl -big -region /usr/local/CTK/annotation/genomes/hg19/annotation/rmsk.RNA.bed -ss --complete-overlap -r --keep-tag-name --keep-score -v $f.tag.bed $f.tag.norRNA.bed
done

As of now, the path to the BED file with a database of repetitive RNAs has to be specified directly, which is somewhat inconvenient. This will be improved in the future.

Collapse PCR duplicates

It is critical to collapse PCR duplicates, not only for the exact duplicates collapsed above, but also for those with slight differences due to sequencing errors. These sequencing errors can also occur in the random barcode, so a naive method to collapse the same barcode is frequently insufficient.

for f in RBFOX2.rep1 RBFOX2.rep2 RBFOX2.rep3 RBFOX2.rep4
do
perl /usr/local/CTK/tag2collapse.pl -big -v --random-barcode -EM 30 --seq-error-model alignment -weight --weight-in-name --keep-max-score --keep-tag-name $f.tag.norRNA.bed $f.tag.uniq.bed
done

In the command above, tags mapped to the same genomic positions (i.e., the same start coordinates for the 5' end of RNA tag) will be collapsed together if they share the same or "similar" barcodes. A model-based algorithm is used to identify "sufficiently distinct" barcodes. This algorithm was described in detail in the following paper:

Darnell JC, et al. FMRP Stalls Ribosomal Translocation on mRNAs Linked to Synaptic Function and Autism. Cell. 2011; 146:247–261.

Compared to the original algorithm, the current implementation estimates sequencing error from aligned reads, which is then fixed during the iterative EM procedure.

Note 1: Sequencing errors in the degenerate barcodes are estimated from results of read alignment. The number of substitutions in each read must be provided in the 5th column.

Note 2: Note that the read ID in the input BED file (in the 4th column) must take the form READ#x#NNNNN, where x is the number of exact duplicates and NNNNN is the bar-code nucleotide sequence appended to read IDs in previous steps. Read IDs that are not in this format will generate an error.

Note 3: As a diagnostic step, get the length distribution of unique tags, which should be a more faithful representation of the library:

for f in RBFOX2.rep1 RBFOX2.rep2 RBFOX2.rep3 RBFOX2.rep4; do
awk '{print $3-$2}' $f.tag.uniq.bed | sort -n | uniq -c | awk '{print $2"\t"$1}' > $f.tag.uniq.len.dist.txt
done

Get the mutations in unique tags.

for f in RBFOX2.rep1 RBFOX2.rep2 RBFOX2.rep3 RBFOX2.rep4
do
python /usr/local/CTK/joinWrapper.py $f.mutation.txt $f.tag.uniq.bed 4 4 N $f.tag.uniq.mutation.txt
done

This script is from galaxy, and included for your convenience. The parameters 4 4 indicate the columns in the two input file used to join the two, and N indicates that only paired rows should be printed.

Table 2 summarizes the number of unique mutations of different types in each sample.

Table 2: Summary of mutations in unique tags.
Protein	Sample	# of unique tags	Total # of mutations in unique tags	Deletions	Insertions	Substitutions
RBFOX2	rep1	646,705	94,090	14,712	2,942	76,436
RBFOX2	rep2	3,482,864	801,894	224,166	21,412	556,316
RBFOX2	rep3	553,372	67,486	6,901	2,962	57,623
RBFOX2	rep4	3,161,497	663,031	126,728	31,015	505,288
	Total	7,844,438	1,626,501	372,507	58,331	1,195,663

Merging biological replicates

After getting unique tags, one might want to concatenate these runs, which can be distinguished by different colors. As an example:

perl /usr/local/CTK/bed2rgb.pl -v -col "188,0,0" RBFOX2.rep1.tag.uniq.bed RBFOX2.rep1.tag.uniq.rgb.bed
perl /usr/local/CTK/bed2rgb.pl -v -col "128,0,128" RBFOX2.rep2.tag.uniq.bed RBFOX2.rep2.tag.uniq.rgb.bed
perl /usr/local/CTK/bed2rgb.pl -v -col "0,128,0" RBFOX2.rep3.tag.uniq.bed RBFOX2.rep3.tag.uniq.rgb.bed
perl /usr/local/CTK/bed2rgb.pl -v -col "51,102,255" RBFOX2.rep4.tag.uniq.bed RBFOX2.rep4.tag.uniq.rgb.bed

Then these can be concatenated.

cat RBFOX2.rep1.tag.uniq.rgb.bed RBFOX2.rep2.tag.uniq.rgb.bed RBFOX2.rep3.tag.uniq.rgb.bed RBFOX2.rep4.tag.uniq.rgb.bed > RBFOX2.pool.tag.uniq.rgb.bed
cat RBFOX2.rep1.tag.uniq.mutation.txt RBFOX2.rep2.tag.uniq.mutation.txt RBFOX2.rep3.tag.uniq.mutation.txt RBFOX2.rep4.tag.uniq.mutation.txt > RBFOX2.pool.tag.uniq.mutation.txt

Download unique tag and mutation files here (375 MB).

Annotating and visualizing CLIP tags

Get the genomic distribution of CLIP tags

perl /usr/local/CTK/bed2annotation.pl -dbkey hg19 -ss -big -region -v -summary RBFOX2.pool.tag.uniq.annot.summary.txt RBFOX2.pool.tag.uniq.rgb.bed RBFOX2.pool.tag.uniq.annot.txt

Make sure the current genome (hg19) is specified. Check the summary file (RBFOX2.pool.tag.uniq.annot.summary.txt) for the percentage of tags mapped to CDS, 3'UTR, introns, etc.

Generate bedgraph for visualization in the genome browser

perl /usr/local/CTK/tag2profile.pl -v -ss -exact -of bedgraph -n ″Unique Tag Profile″ RBFOX2.pool.tag.noRNA.uniq.rgb.bed RBFOX2.pool.tag.uniq.bedgraph

The output bedgraph file can be loaded into a genome browser for visualization of tag number at each genomic position.

Peak calling

cd ~/test
mkdir cluster
cd cluster
ln -s ../mapping/RBFOX2.pool.tag.uniq.rgb.bed

Mode 1: Peak calling with no statistical significance

perl /usr/local/CTK/tag2peak.pl -big -ss -v --valley-seeking --valley-depth 0.9 RBFOX2.pool.tag.uniq.rgb.bed RBFOX2.pool.tag.uniq.peak.bed --out-boundary RBFOX2.pool.tag.uniq.peak.boundary.bed --out-half-PH RBFOX2.pool.tag.uniq.peak.halfPH.bed

Note 1: To annotate peaks with overlapping genes and repeat masked sequences and get genomic breakdown:

perl /usr/local/CTK/bed2annotation.pl -dbkey hg19 -ss -big -region -v -summary RBFOX2.pool.tag.uniq.peak.annot.summary.txt RBFOX2.pool.tag.uniq.peak.bed RBFOX2.pool.tag.uniq.peak.annot.txt

The output file RBFOX2.pool.tag.uniq.peak.annot.txt has the detailed annotation for each peak, and RBFOX2.pool.tag.uniq.peak.annot.summary.txt has summary statistics.

Note 2: Another useful option --valley-depth specifies the depth of the valley relative to the peak (0.5-1, 0.9 by default). One also has the option to merge peaks close to each other by specifying the distance between the peaks (e.g., -gap 20).

Note 3: Besides the peak boundaries (RBFOX2.pool.tag.uniq.peak.bed), one can also output cluster boundaries (--out-boundary) and half peak boundaries (--out-half-PH ) associated with each peak.

In this mode, the script will not assess the statistical significance of the peak height. If one needs this, an alternative mode is as follows:

Mode 2: Peak calling with statistical significance

perl /usr/local/CTK/tag2peak.pl -big -ss -v --valley-seeking -p 0.05 --valley-depth 0.9 --multi-test --dbkey hg19 RBFOX2.pool.tag.uniq.rgb.bed RBFOX2.pool.tag.uniq.peak.sig.bed --out-boundary RBFOX2.pool.tag.uniq.peak.sig.boundary.bed --out-half-PH RBFOX2.pool.tag.uniq.peak.sig.halfPH.bed

Note 1: To get the number of significant peaks:

wc -l RBFOX2.pool.tag.noRNA.uniq.peak.sig.bed
35602 RBFOX2.pool.tag.uniq.peak.sig.bed

Note 2: To search for a known binding motif, one first defines the center of each peak (based on width at peak).

#extract from -500 to +500
awk '{print $1"\t"int(($2+$3)/2)-500"\t"int(($2+$3)/2)+500"\t"$4"\t"$5"\t"$6}' RBFOX2.pool.tag.uniq.peak.sig.bed > RBFOX2.pool.tag.uniq.peak.sig.center.ext1k.bed

This bed file can be used to extract sequences around peak (pay attention to the strand), and search for enrichment of specific motif (e.g., UGCAUG) relative to the peak.

Note 3: One might want to count the number of tags overlapping with each cluster/peak for each sample (e.g., to evaluate correlation between replicates).

for f in RBFOX2.rep1 RBFOX2.rep2 RBFOX2.rep3 RBFOX2.rep4; do
    perl /usr/local/CTK/tag2profile.pl -ss -region RBFOX2.pool.tag.uniq.peak.sig.boundary.bed -of bed -v $f.tag.uniq.bed $f.tag.uniq.peak.sig.boundary.count.bed
done

Download output files for peak calling here (36 MB).

CIMS

cd ~/test
mkdir CIMS
cd CIMS

ln -s ../mapping/RBFOX2.pool.tag.uniq.rgb.bed 
ln -s ../mapping/RBFOX2.pool.tag.uniq.mutation.txt

Get specific types of mutations

Get specific types of mutations, such as deletions, substitutions, and insertions in unique CLIP tags

perl ~/czlab_src/CTK/getMutationType.pl -t del RBFOX2.pool.tag.uniq.mutation.txt RBFOX2.pool.tag.uniq.del.bed
#deletions
perl ~/czlab_src/CTK/getMutationType.pl -t ins RBFOX2.pool.tag.uniq.mutation.txt RBFOX2.pool.tag.uniq.ins.bed
#insertions
perl ~/czlab_src/CTK/getMutationType.pl -t sub --summary RBFOX2.pool.tag.uniq.mutation.stat.txt RBFOX2.pool.tag.uniq.mutation.txt RBFOX2.pool.tag.uniq.sub.bed
#substitutions, as well as summary statistics of different types of mutations

It is always a good practice to look at the number of each type of mutation to see, e.g., compare the relative abundance of deletions to insertions. This information is provided in the summary statistics file when one specify the option --summary.

Get CIMS

Here, we use deletions as an example.

perl /usr/local/CTK/CIMS.pl -big -n 10 -p -outp RBFOX2.pool.tag.uniq.del.pos.stat.txt -v RBFOX2.pool.tag.uniq.rgb.bed RBFOX2.pool.tag.uniq.del.bed RBFOX2.pool.tag.uniq.del.CIMS.txt

By default, CIMS.pl will output all sites including those that are not statistically significant. This is recommended because one can play with stringency later.

Extract sites that are statistically significant.

awk '{if($9<=0.001) {print $0}}' RBFOX2.pool.tag.uniq.del.CIMS.txt | sort -k 9,9n -k 8,8nr -k 7,7n > RBFOX2.pool.tag.uniq.del.CIMS.s30.txt
cut -f 1-6 RBFOX2.pool.tag.uniq.del.CIMS.s30.txt > RBFOX2.pool.tag.uniq.del.CIMS.s30.bed

wc -l RBFOX2.pool.tag.uniq.del.CIMS.s30.bed
2151 RBFOX2.pool.tag.uniq.del.CIMS.s30.bed

This will keep only those with FDR<0.001. Significant sites will also be sorted by FDR, then by the number of tags with mutations, and then by the total number of overlapping tags. One might try different thresholds to get a balance of sensitivity/specificity (e.g. as judged from motif enrichment).

Note 1: Another parameter that might be useful to improve signal to noise is m/k (i.g., $8/$7 in awk)

Note 2: By default, the command line above will analyze mutations of size 1. Sometimes deletion of multiple nucleotides occurs and those will be ignored here. One can analyze mutations of size 2 by specifying -w 2. Substitutions are always reported as a single nucleotide (even when consecutive nucleotides are substituted), and insertions occur technically in one position and are thus treated as size 1.

Note 3: The number of significant CIMS might be slightly different when unique tags are provided in different orders (this could occur easily when multiple replicates are concatenated together). This is because the FDR estimation is based on permutation, which can change slightly depending on the order of tags.

Note 4: To examine enrichment of motif around CIMS:

awk '{print $1"\t"$2-10"\t"$3+10"\t"$4"\t"$5"\t"$6}' RBFOX2.pool.tag.uniq.del.CIMS.s30.bed > RBFOX2.pool.tag.uniq.del.CIMS.s30.21nt.bed
#+/-10 around CIMS

This BED file could be used to extract sequences around CIMS (pay attention to the strand), and search for enrichment of specific motif (e.g., UGCAUG for the case of RBFOX2) relative to the CIMS.

Download CIMS output files here (2.8 MB).

One can repeat these steps for the other types of mutations (i.e. substitutions and insertions). This is particularly relevant when one works with a new RBP, and it is unknown which type of mutations will be caused by cross linking.

CITS

cd ~/test
mkdir CITS
cd CITS
ln -s ../mapping/RBFOX2.pool.tag.uniq.rgb.bed
ln -s ../CIMS/RBFOX2.pool.tag.uniq.del.bed

perl /usr/local/CTK/CITS.pl -big -p 0.001 --gap 25 -v RBFOX2.pool.tag.uniq.rgb.bed RBFOX2.pool.tag.uniq.del.bed RBFOX2.pool.tag.uniq.clean.CITS.s30.bed

In this script, we will look for reproducible CLIP tag start positions with more supporting tags than one would expect, which likely indicates crosslink induced truncation sites. For RBFOX2, since the vast majority of deletions are introduced because of cross linking, tags with deletions are treated as read-through tags and removed for CITS analysis.

Note 1: One can now perform motif enrichment analysis as described above in the CIMS section.

Note 2: In this particular example, we decided not to do Bonforroni correction, as we found this is too conservative. If necessary, this can be done by using option --multi-test.

Note 3: In the command line above, we opt to merge sites very close to each other and keep the most significant ones because CITS analysis tends to have more background when read through is frequent.

A few sites that span multiple nucleotides can be removed:

awk '{if($3-$2==1) {print $0}}' RBFOX2.pool.tag.uniq.clean.CITS.s30.bed > RBFOX2.pool.tag.uniq.clean.CITS.s30.singleton.bed

wc -l RBFOX2.pool.tag.uniq.clean.CITS.s30.bed RBFOX2.pool.tag.uniq.clean.CITS.s30.singleton.bed

14696 RBFOX2.pool.tag.uniq.clean.CITS.s30.bed
14656 RBFOX2.pool.tag.uniq.clean.CITS.s30.singleton.bed

Note 4: To examine enrichment of motif around CIMS:

awk '{print $1"\t"$2-10"\t"$3+10"\t"$4"\t"$5"\t"$6}' RBFOX2.pool.tag.uniq.clean.CITS.s30.singleton.bed > RBFOX2.pool.tag.uniq.clean.CITS.s30.singleton.21nt.bed
#+/-10 around CITS

This BED file could be used to extract sequences around CITS (pay attention to the strand), and search for enrichment of specific motif (e.g., UGCAUG for the case of Rbfox1-3) relative to the CITS.

Download CITS output files here (595 KB).

Motif enrichment

The enrichment of motif sites (when the motif is known) around CIMS or CITS provides a quantitative measure of the signal to noise ratio. The plot below shows the proportion of sites with the RBFOX2 binding UGCAUG motif starting at each position relative to CIMS/CITS (using tools not included in CTK). This analysis identifies G2 and G5 as the major crosslink sites between RBFOX2 and RNA.

Navigation

ICLIP data analysis using CTK

From Zhang Laboratory

Contents

Introduction

Sample dataset

Raw FASTQ files

Output files of major steps generated in this tutorial

Read preprocessing

Demultiplexing samples

Read quality filtering

Trimming of 3' linker sequences

Collapse exact duplicates

Strip random barcode (UMI)

Read mapping & parsing

Read mapping

Parsing SAM file

Remove tags from rRNA and other repetitive RNA (optional)

Collapse PCR duplicates

Merging biological replicates

Annotating and visualizing CLIP tags

Get the genomic distribution of CLIP tags

Generate bedgraph for visualization in the genome browser

Peak calling

Mode 1: Peak calling with no statistical significance

Mode 2: Peak calling with statistical significance

CIMS

Get specific types of mutations

Get CIMS

CITS

Motif enrichment