CTK Documentation

From Zhang Laboratory

Revision as of 10:49, 28 February 2016 by Czlab (Talk | contribs) (Collapsing PCR duplicates)

Jump to: navigation, search

Introduction

Crosslinking and immunoprecipitation (CLIP) has now been widely used to map protein-RNA interactions on a genome-wide scale. This software package named CLIP Tool Kit (CTK) is to provide a set of tools for analysis of CLIP data starting from the raw data generated by the sequencer. It includes pipelines to filter and map reads, collapse PCR duplicates to obtain unique CLIP tags, define CLIP tag clusters and call peaks, mapping the exact protein-RNA crosslink site by CIMS and CITS analysis. This package will replace our original CIMS package, which has more limited features.

Crosslinking induced mutation site or CIMS analysis is a computational method for HITS-CLIP data analysis to determine the exact protein-RNA crosslink sites and thereby map protein-RNA interactions at single-nucleotide resolution. This method is based on the observation that UV cross linked amino-acid-RNA adducts introduce reverse transcription errors in cDNAs at certain frequencies, which are captured by sequencing and comparison of CLIP tags with the reference genome. More details can be found in the following references:

Zhang, C. †, Darnell, R.B. † 2011. Mapping in vivo protein-RNA interactions at single-nucleotide resolution from HITS-CLIP data.  Nat. Biotech. 29:607-614. 

Moore, J.*, Zhang, C.*, Grantman E.C., Mele, A., Darnell, J.C., Darnell, R.B. 2014. Mapping Argonaute and conventional RNA-binding protein interactions with RNA at single-nucleotide resolution using HITS-CLIP and CIMS analysis. Nat Protocols. 9(2):263-93.  doi:10.1038/nprot.2014.012.


For cross link induced truncation site analysis (CITS) described below, please refer to:

Weyn-Vanhentenryck,S.,M.*, Mele,A.*, Yan,Q.*, Sun,S., Farny,N., Zhang,Z., Xue,C., Herre,M., Silver,P.A., Zhang,M.Q., Krainer,A.R., Darnell,R.B. †, Zhang,C. † 2014. HITS-CLIP and integrative modeling define the Rbfox splicing-regulatory network linked to brain development and autism. Cell Rep. 10.1016/j.celrep.2014.02.005. 

The following individuals contributed to this document: Chaolin Zhang, Kernyu Park, and Ankeeta Shah.

Versions

  • v1.0.0 ( 10-12-2015 ), current
    • The initial beta release

Download

Source code:

Installation

Prerequisites

This software is implemented with perl . It also relies on several standard linux/unix tools such as grep, cat, sort, etc. We have tested the software on RedHat Linux, although it is expected to work on most unix-like systems, including Mac OS X. In addition, several software packages are required by the pipeline for read processing and alignment.

Steps to install the software

  • Download the perl library files czplib, if not already.

Decompress it and move it to a place you like

$tar zxvf czplib.v1.0.x.tgz
$mv czplib /usr/local/lib

Add the library path to the environment variable, so perl can find it.

PERL5LIB=/usr/local/lib/czplib
  • Download CTK codes, if not already.

Decompress it and move it to a place you like

$tar zxvf CTK.v1.0.x.tgz
$cd CTK
$chmod 755 *.pl
$mv CTK /usr/local/CTK

Add the dir to your $PATH environment variable.

Read preprocessing

Compared to previous versions, this new pipeline, in all preprocessing steps prior to read alignment, operates on FASTQ files to take advantage of sequence quality scores.

In our current experimental design, several CLIP libraries with different indexes are typically pooled together to be sequenced in one lane. In this documentation, we assume that the raw FASTQ file from Illumina sequencing (101 nt reads) has the name raw.fastq.

Demultiplexing samples and read quality filtering

Assuming that we have six samples with indexes GTCA, GCATG, ACTG, AGCT, GCATC, TCGA, the following script will extract reads for each library using a simple loop.

mkdir filtering
cd filtering
#!/usr/bin/sh
index=(GTCA GCATG ACTG AGCT GCATC TCGA);
samples=(sample1 sample2 sample3 sample4 sample5 sample6);

for ((i=0;i<=5;i=i+1)); do
	idx=${index[$i]}
	name=${samples[$i]}
	perl ~/czlab_src/CTK/fastq_filter.pl -v -if sanger -index 0:$idx -f mean:0-38:20 -of fastq raw.fastq.gz $name.fastq
done

Alternatively, we can parallelize the process by submitting an array of jobs to SGE.


#!/usr/bin/sh
#$ -t 1-6 -m a -cwd -N CLIP
#-l "mem=2G,time=192::"

index=(GTCA GCATG ACTG AGCT GCATC TCGA);
samples=(sample1 sample2 sample3 sample4 sample5 sample6);


idx=${index[$SGE_TASK_ID-1]}
name=${samples[$SGE_TASK_ID-1]}
perl ~/czlab_src/CTK/fastq_filter.pl -v -if sanger -index 0:$idx -f mean:0-38:20 -of fastq raw.fastq.gz $name.fastq

This script filters raw reads based on quality scores. -f mean:0-38-20 specifies a mean score of 20 or above in the first 39 bases (for BrdU CLIP), which includes 14 positions with sample indexes and the random barcode, followed by 25 positions with the actual CLIP tag. This script can be altered if utilizing standard CLIP data by changing the line below so that we will specify a mean score of 20 or above in the first 30 bases.


perl ~/czlab_src/CTK/fastq_filter.pl -v -if sanger -index 0:$idx -f mean:0-29:20 -of fastq raw.fastq.gz $name.fastq

The reason we do this is because low quality reads can introduce mapping errors and background. They will inflate the number of unique tags after removal of PCR duplicates.

Note 1: As good practice, always double check the number of reads in each library and also confirm if the extracted reads account for a vast majority of all reads in the raw fastq file. If you run the fastq_filter.pl without specifying -index, you will get all reads that passed quality filtering.


perl ~/czlab_src/CTK/fastq_filter.pl -if sanger -f mean:0-38:20 -of fasta raw.fastq.gz - | grep -v \> | cut -c 1-5 | sort | uniq -c > raw.stat.txt

The command above will give the number of reads starting with particular 5mers.

Note 2: The reason we used the first four positions of the index for demultiplexing is due to historical confusion about the identity of the index sequences. It really should be 5nt, but it does not make a substantial difference.

Trimming of 3' linker sequences

Note 1: due to long 101nt reads, collapsing before trimming is not very helpful. Therefore, we trim the 3' linker first.

For each sample, run the following command using tools provided by the fastx toolkit.


cat sample1.fastq | fastx_clipper -a TCGTATGCCGTCTTCTGCTTG -l 29 -n | fastq_quality_trimmer -t 5 -l 29 -o sample1.trim.fastq

This command will remove the 3' linker and/or also remove extremely low quality bases (score < 5).

Note 2: It is good to check the number of reads by running the command:


wc -l sample1.trim.fastq

Collapse exact duplicates

If multiple reads have exactly the same sequence, only one is kept.


perl ~/czlab_src/CTK/fastq2collapse.pl sample1.trim.fastq sample1.trim.c.fastq

Strip barcode

The following script removes the 5' linker degenerate barcode.


perl ~/czlab_src/CTK/stripBarcode.pl -format fastq -len 14 sample1.trim.c.fastq sample1.trim.c.tag.fastq

After this step, one can get the distribution of tag length for diagnostic purposes.


awk '{if(NR%4==2) {print $0}}' sample1.trim.c.tag.fastq | awk -F "" '{print NF}' | sort -n | uniq -c | awk '{print $2"\t"$1}' > sample1.trim.c.tag.seqlen.stat.txt

Read mapping & parsing

Read mapping

We are now using bwa for alignment instead of novoalign (which was used in the old documentation) for two reasons: 1. novoalign is very slow, in part because the academic version does not allow multi threading. 2. bwa allows one to specify mismatch rate instead of the the absolute number, which is more appropriate for tags of different sizes (i.e. a smaller number of mismatches allowed for shorter tags after trimming).


mkdir mapping
cd mapping
bwa aln -t 4 -n 0.06 -q 20 ~/data/genomes/mm10/bwa/mm10.fa ../filtering/sample1.trim.c.tag.fastq > sample1.sai

bwa samse ~/data/genomes/mm10/bwa/mm10.fa sample1.sai ../filtering/sample1.trim.c.tag.fastq > sample1.sam

The option -n 0.06 specifies slightly more stringent criteria than the default. The number of allowed mismatches (substitutions or indels) is as follows:


[bwa_aln] 17bp reads: max_diff = 1
[bwa_aln] 20bp reads: max_diff = 2
[bwa_aln] 45bp reads: max_diff = 3
[bwa_aln] 73bp reads: max_diff = 4
[bwa_aln] 104bp reads: max_diff = 5
[bwa_aln] 137bp reads: max_diff = 6
[bwa_aln] 172bp reads: max_diff = 7
[bwa_aln] 208bp reads: max_diff = 8
[bwa_aln] 244bp reads: max_diff = 9

The -q option is used to trim low quality reads (an average of 20 or below). However, this does not seem to do too much after the trimming steps above.

Parsing SAM file


perl ~/czlab_src/CTK/parseAlignment.pl -v --map-qual 1 --min-len 18 --mutation-file sample1.mutation.txt sample1.sam sample1.tag.bed

This will keep only unique mappings (with MAPQ >=1) and a minimal mapping size of 18 nt.

Note 1: other aligners might not use a positive MAPQ as an indication of unique mapping.

Another useful option is --indel-to-end, which specifies the number of nucleotides towards the end from which indels should not be called.

Note 2: the parsing script relies on MD tags, which is an optional field without strict definition. Some aligners might have slightly different format how they report mismatches. If other aligners than bwa is used, one should run the following command:


samtools view -bS sample1.sam | samtools sort - sample1.sorted
samtools fillmd  sample1.sorted.bam ~/data/genomes/mm10/mm10.fa > sample1.sorted.md.sam

This will ensure the sam file gets parsed properly.

Note 3: keep track what proportion of reads can be mapped uniquely.


wc -l sample1.tag.bed

Note 4: the format of the mutation file is essentially the same as before.

Collapsing PCR duplicates

It is critical to collapse PCR duplicates, not only for the exact duplicates collapsed above, but also for those with slight differences due to sequencing errors.


perl ~/czlab_src/CTK/tag2collapse.pl -v --random-barcode -EM 30 --seq-error-model alignment -weight --weight-in-name --keep-max-score --keep-tag-name sample1.tag.bed sample1.tag.uniq.bed

Note 1: As a diagnostic step, get the length distribution of unique tags, which should be a more faithful representation of the library:


awk '{print $3-$2}' sample1.tag.uniq.bed | sort -n | uniq -c | awk '{print $2"\t"$1}' > sample1.uniq.len.dist.txt

Get the mutations in unique tags

python ~/src/CTK/joinWrapper.py sample1.mutation.txt sample1.tag.uniq.bed 4 4 N sample1.tag.uniq.mutation.txt


Note 2: After getting the unique tags of each library, one might concatenate biological replicates, which are distinguished by different colors.


perl ~/czlab_src/CTK/bed2rgb.pl -v -col "128,0,0" sample1.tag.uniq.bed sample1.tag.uniq.rgb.bed
...

cat sample1.tag.uniq.rgb.bed sample2.tag.uniq.rgb.bed sample3.tag.uniq.rgb.bed > sample.pool.tag.uniq.rgb.bed
cat sample1.tag.uniq.mutation.txt sample2.tag.uniq.mutation.txt sample3.tag.uniq.mutation.txt > sample.pool.tag.uniq.mutation.txt

Note 3: get genomic distribution of CLIP tags. [Do not include ~ in the path]

sed "s#/path/to/src#<insert path to CTK here>#"  ~/czlab_src/CTK/CTK_annotation.loc ~/czlab_src/CTK/CTK_annotation.custom.loc
perl ~/czlab_src/CTK/bed2annotation.pl -conf ~/czlab_src/CTK/CTK_annotation.custom.loc -dbkey mm10 -ss -big -gene -region -v -summary sample1.tag.uniq.annot.summary.txt sample1.tag.uniq.rgb.bed sample1.tag.uniq.annot.txt

Check the summary file for the percentage of tags mapped to CDS, 3'UTR, introns, etc.

Peak calling

Mode 1: Peak calling without statistical assessment


perl ~/czlab_src/CTK/tag2peak.pl -big -ss -v --valley-seeking sample.pool.tag.uniq.rgb.bed sample.pool.tag.uniq.peak.bed --out-boundary sample.pool.tag.uniq.peak.boundary.bed --out-half-PH sample.pool.tag.uniq.peak.halfPH.bed

This script searches for all potential peaks using a "valley seeking" algorithm. A peak is called if valleys of certain depths are found on both sides so that it is separated from other peaks. The results are similar to those generated by the original clustering algorithm. However, this valley seeking algorithm is advantageous because it is able to separate neighboring peaks without zero-tag gaps.

Another useful option --valley-depth specifies the depth of the valley relative to the peak (0-1, 0.5 by default).

One also has the option to merge peaks close to each other by specifying the distance between the peaks (e.g., -gap 20).

In this mode, the script will not assess the statistical significance of the peak height. If one needs this, an alternative mode is as follows:

Mode 2: Peak calling with statistical assessment


perl ~/scripts/tag2peak.pl -big -ss -v --valley-seeking -p 0.05 --multi-test -gene ~/data/genomes/mm10/annotation/refGene_knownGene.meta.bed sample.pool.tag.uniq.rgb.bed sample.pool.tag.uniq.peak.sig.bed --out-boundary sample.pool.tag.uniq.peak.sig.boundary.bed --out-half-PH sample.pool.tag.uniq.peak.sig.halfPH.bed

Here, the script searches for significant peaks by considering one specific genic region as a unit for permutations to estimate expected peak height (scan statistics is used here, so the permutation is only conceptual).

It is important to note that in this mode, only clusters overlapping with annotated genes are kept.

In both cases, the three output files provided the positions of peak (many times it is 1nt, but not always), the peak width at half peak height, and the boundaries of each peak. In practice, it appears that the center of peak width at half PH is the most robust/precise indicator of the binding sites (although a shift might be present because of CITS, which can be adjusted after one examines the enrichment of the motif).

Note 1: To search for motif, one first defines the center of each peak (based on width at half PH).


awk '{print $1"\t"int(($2+$3)/2)-500"\t"int(($2+$3)/2)+500"\t"$4"\t"$5"\t"$6}' sample.pool.tag.uniq.peak.sig.halfPH.bed > sample.pool.tag.uniq.peak.sig.halfPH.center.ext1k.bed
perl ~/scripts/bed2fasta.pl -v -org mm10 sample.pool.tag.uniq.peak.sig.halfPH.center.ext1k.bed sample.pool.tag.uniq.peak.sig.halfPH.center.ext1k.fa
PatternMatch -c TGCATG sample.pool.tag.uniq.peak.sig.halfPH.center.ext1k.fa | cut -f 2 | sort -n | uniq -c | awk '{print $2"\t"$1}' > sample.pool.tag.uniq.peak.sig.halfPH.center.ext1k.TGCATG.profile.txt

#using Rbfox as an example

Note 2:One might want to count the number of tags in each cluster/peak for each sample.


perl ~/czlab_src/CTK/tag2profile.pl -ss -region sample.pool.tag.uniq.peak.sig.boundary.bed -of bed -v sample1.tag.uniq.rgb.bed sample1.tag.uniq.peak.sig.boundary.count.bed

CIMS analysis

mkdir CIMS
cd CIMS
ln -s ../mapping/sample.pool.tag.uniq.mutation.txt ./
ln -s ../mapping/sample.pool.tag.uniq.rgb.bed ./

Get specific types of mutations

Get specific types of mutations, such as deletions, substitutions, and insertions around the cross-linked mutation site.


awk '{if($9==">") {print $0}}' sample.pool.tag.uniq.mutation.txt | cut -f 1-6 > sample.pool.tag.uniq.sub.bed
awk '{if($9=="-") {print $0}}' sample.pool.tag.uniq.mutation.txt | cut -f 1-6 > sample.pool.tag.uniq.del.bed
awk '{if($9=="+") {print $0}}' sample.pool.tag.uniq.mutation.txt | cut -f 1-6 > sample.pool.tag.uniq.ins.bed

It is always a good practice to look at the number of each type of mutation to see, e.g., compare the relative abundance of deletions to insertions.

For example, to get the number of deletions of different sizes:


awk '{print $3-$2}' sample.pool.tag.uniq.del.bed | sort -n | uniq -c

Get CIMS

Here we use deletions as an example.


perl ~/czlab_src/CTK/CIMS.pl -n 10 -p -v --keep-cache -c cache_del sample.pool.tag.uniq.rgb.bed sample.pool.tag.uniq.del.bed sample.pool.tag.uniq.del.CIMS.txt

By default, it will output all sites including those that are not statistically significant. This is recommended because one can play with stringency later.


awk '{if($9<=0.05) {print $0}}' sample.pool.tag.uniq.del.CIMS.txt | sort -k 9,9n -k 8,8nr -k 7,7n > sample.pool.tag.uniq.del.CIMS.s13.txt
cut -f 1-6 sample.pool.tag.uniq.del.CIMS.s13.txt > sample.pool.tag.uniq.del.CIMS.s13.bed

This will keep only those with FDR<0.05. One might try different thresholds to get a balance of sensitivity/specificity (e.g. as judged from motif enrichment).

Note 1: Another parameter that might be useful to improve signal to noise is m/k (i.g., $8/$7 in awk)

Note 2: By default, the command line above will analyze mutations of size 1. Sometimes deletion of multiple nucleotides occurs and those will be ignored here. One can analyze mutations of size 2 by specifying -w 2. Substitutions are always reported as a single nucleotide (even when consecutive nucleotides are substituted), and insertions occur technically in one position and are thus treated as size 1.

Note 3: To examine enrichment of motif around CIMS;

perl ~/scripts/bed2fasta.pl -v -org mm10 -l "-500" -r 500 sample.pool.tag.uniq.del.CIMS.s13.bed sample.pool.tag.uniq.del.CIMS.s13.1k.fa
PatternMatch -c TGCATG sample.pool.tag.uniq.del.CIMS.s13.1k.fa | cut -f 2 | sort -n | uniq -c | awk '{print $2"\t"$1}' > sample.pool.tag.uniq.del.CIMS.s13.1k.TGCATG.profile.txt

One can repeat these steps for the other types of mutations (i.e. substitutions and insertions).

CITS analysis


mkdir CITS
cd CITS
ln -s ../CIMS/sample.pool.tag.uniq.del.bed ./
ln -s ../mapping/sample.pool.tag.uniq.rgb.bed ./

Since the vast majority of deletions are introduced because of cross linking, tags with deletions are treated as read-through tags and removed.


cut -f 4 sample.pool.tag.uniq.del.bed | sort | uniq > sample.pool.tag.uniq.del.id
python ~/czlab_src/CTK/joinWrapper.py sample.pool.tag.uniq.rgb.bed sample.pool.tag.uniq.del.id 4 1 V sample.pool.tag.uniq.rgb.clean.bed

Get the position before the start site as a potential cross link site that causes truncation.


perl ~/czlab_src/CTK/bedExt.pl -n up -l "-1" -r "-1" -v sample.pool.tag.uniq.rgb.clean.bed sample.pool.tag.uniq.clean.trunc.bed


Cluster overlapping tags.


perl ~/czlab_src/CTK/tag2cluster.pl -big -s -maxgap "-1" -of bed -v sample.pool.tag.uniq.rgb.bed sample.pool.tag.uniq.cluster.0.bed
awk '{if($5>2) {print $0}}' sample.pool.tag.uniq.cluster.0.bed > sample.pool.tag.uniq.cluster.bed

Now we are ready for the CITS analysis:


perl ~/scripts/tag2peak.pl -big -ss -v --prefix "CITS" -gap 25 -p 0.001 -gene sample.pool.tag.uniq.cluster.bed sample.pool.tag.uniq.clean.trunc.bed sample.pool.tag.uniq.clean.CITS.s30.bed

Note 1: one can now perform motif enrichment analysis as described above.

Note 2: In the command line above, we opt to merge sites very close to each other and keep the most significant ones because CITS analysis tends to have more background when read through is frequent.

Note 3: In theory, one should be able to use the clusters/peaks defined above for permutations in CITS analysis.


perl ~/czlab_src/CTK/bedExt.pl -l "-1" -v ../cluster/sample.pool.tag.uniq.peak.boundary.bed sample.pool.tag.uniq.peak.boundary.ext1.bed
#we extend one 1nt upstream to make sure CITS is covered by clusters
perl ~/czlab_src/CTK/tag2peak.pl -big -ss -v --prefix "CITS" -gap 25 -p 0.001 -gene sample.pool.tag.uniq.peak.boundary.ext1.bed sample.pool.tag.uniq.clean.trunc.bed sample.pool.tag.uniq.clean.CITS.s30.bed

In one case, we observed that clusters defined by grouping overlapping tags appear to give slightly better results, although this might just be one case. Additional comparisons are required to make a conclusion.