CTK Documentation
From Zhang Laboratory
Introduction
Crosslinking and immunoprecipitation followed by high-throughput sequencing (HITS-CLIP or CLIP-Seq) is now widely used to map protein-RNA interactions on a genome-wide scale. The CLIP Tool Kit (CTK) is a software package that provides a set of tools for analysis of CLIP data starting from the raw reads generated by the sequencer. It includes pipelines to filter and map reads, collapse PCR duplicates to obtain unique CLIP tags, define CLIP tag clusters and call peaks, and define the exact protein-RNA crosslink sites by CIMS and CITS analysis. This software package is an expanded version of our previous CIMS package.
Crosslinking induced mutation site (CIMS) and crosslinking induced truncation site (CITS) analyses are computational methods for CLIP data analysis that determine the exact protein-RNA crosslink sites and thereby map protein-RNA interactions at single-nucleotide resolution. These methods are based on the observation that UV-crosslinked amino-acid-RNA adducts can introduce reverse transcription errors in cDNAs at a certain frequency, including mutations and premature truncation. These errors are captured by sequencing and subsequent comparison of CLIP tags with a reference genome.
If you use the software, please cite:
Shah,A., Qian,Y., Weyn-Vanhentenryck,S.M., Zhang,C. 2017. CLIP Tool Kit (CTK): a flexible and robust pipeline to analyze CLIP sequencing data. Bioinformatics. 33:566-567.
More details of the biochemical and computational aspects of CLIP can be found in the following references:
Zhang, C. †, Darnell, R.B. † 2011. Mapping in vivo protein-RNA interactions at single-nucleotide resolution from HITS-CLIP data. Nat. Biotech. 29:607-614.
Moore, J.*, Zhang, C.*, Grantman E.C., Mele, A., Darnell, J.C., Darnell, R.B. 2014. Mapping Argonaute and conventional RNA-binding protein interactions with RNA at single-nucleotide resolution using HITS-CLIP and CIMS analysis. Nat Protocols. 9(2):263-93. doi:10.1038/nprot.2014.012.
For the crosslinking induced truncation site (CITS) analysis described below, please refer to:
Weyn-Vanhentenryck,S.M.*, Mele,A.*, Yan,Q.*, Sun,S., Farny,N., Zhang,Z., Xue,C., Herre,M., Silver,P.A., Zhang,M.Q., Krainer,A.R., Darnell,R.B. †, Zhang,C. † 2014. HITS-CLIP and integrative modeling define the Rbfox splicing-regulatory network linked to brain development and autism. Cell Rep. 6:1139-1152.
User group
For questions/answers, please visit our user group: https://groups.google.com/forum/#!forum/ctk-user-group
Versions
- v1.1.3 (12/2018) current
  - minor bug fix
- v1.1.2 (08/02/2018)
  - update in hg38 annotation files
- v1.1.1 (07/20/2018)
  - bug fix related to the use of dm6 annotation files
- v1.1.0 (07/14/2018)
  - included support for dm6
- v1.0.9 (06/12/2018)
  - minor bug fix
- v1.0.8 (05/24/2018)
  - improved selection of unique CLIP tags
  - improved support for CIMS analysis of particular types of substitutions (e.g., T-C for PAR-CLIP)
  - a wrapper CITS.pl is included to simplify CITS analysis
  - included additional annotation files
  - included support for hg38
  - minor bug fixes
  - improved/expanded documentation and tutorials
  - various improvements and bug fixes for efficiency and robustness
- v1.0.7 (01-16-2017)
  - fix mac-specific crash
- v1.0.6 (01-04-2017)
  - minor bug fix
- v1.0.5 (11-10-2016)
  - fixed path to annotation files
  - have default path to gene bed file for tag2peak.pl
- v1.0.4 (10-05-2016)
  - minor fixes
- v1.0.3 (08-08-2016)
  - improvement in software packaging and usage
- v1.0.0 (10-12-2015)
  - the initial beta release
Download
- czplib (perl): a perl library with various functions for genomic/bioinformatic analysis. Download from github
- CTK (perl): the core algorithm. Download from github
Prerequisites
This software is implemented in Perl. It also relies on several standard Linux/Unix tools such as grep, cat, and sort. We have tested the software on CentOS, and it is expected to work on most Unix-like systems, including Mac OS X. In addition, the pipeline requires several software packages for sequence preprocessing and alignment (the version used in our tests is indicated for each).
- FASTX Tool-Kit Version 0.0.13: http://hannonlab.cshl.edu/fastx_toolkit/download.html
- cutadapt Version 1.14: https://pypi.python.org/pypi/cutadapt/ (an alternative to FASTX Tool-Kit)
- Burrows Wheeler Aligner (BWA) Version 0.7.12: http://bio-bwa.sourceforge.net/
- Samtools Version 1.3.1: http://samtools.sourceforge.net
- Perl Version 5.14.3 was used for testing, but we expect that newer versions of Perl will also be compatible: https://www.perl.org/get.html
- Perl library Math::CDF Version 0.1: http://search.cpan.org/~callahan/Math-CDF-0.1/CDF.pm
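Before installing CTK, it can help to confirm that the prerequisites listed above are reachable from your shell. The following is a minimal sketch and not part of CTK: the helper names check_tools and check_perl_module are hypothetical, and fastx_clipper stands in as one representative program from the FASTX Tool-Kit.

```shell
# Report which of the given commands are missing from PATH.
# Prints one line per tool; returns 0 only if every tool was found.
check_tools() {
    ct_missing=0
    for ct_tool in "$@"; do
        if command -v "$ct_tool" >/dev/null 2>&1; then
            echo "found:   $ct_tool"
        else
            echo "MISSING: $ct_tool"
            ct_missing=1
        fi
    done
    return "$ct_missing"
}

# Report whether Perl can load the named module (e.g., Math::CDF).
check_perl_module() {
    if perl -M"$1" -e 1 >/dev/null 2>&1; then
        echo "found:   perl module $1"
    else
        echo "MISSING: perl module $1"
        return 1
    fi
}

# The tool list is an assumption; drop cutadapt or fastx_clipper
# depending on which of the two alternatives you installed.
check_tools perl bwa samtools cutadapt fastx_clipper grep cat sort \
    || echo "Install the missing tools before running CTK."
check_perl_module Math::CDF \
    || echo "Install Math::CDF from CPAN before running CTK."
```

This only confirms that each program can be found; it does not compare installed versions against the ones listed above.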
Installation
- Download and install software packages described in prerequisites.
- Download the czplib perl library files (refer back to Download section above)
- Decompress and move to whatever directory you like (as an example, we use /usr/local/lib/)
- Replace "x" below with the version number of the package you downloaded
$ unzip czplib-1.0.x.zip
$ mv czplib-1.0.x /usr/local/lib/czplib
Add the library path to the environment variable, so perl can find it.
export PERL5LIB=/usr/local/lib/czplib
- Download CTK code and likewise decompress and move to whatever directory you like (as an example, we use /usr/local/)
$ unzip ctk-1.0.x.zip
$ mv ctk-1.0.x /usr/local/CTK
Add the dir to your $PATH environment variable if you would like.
Finally, some of the scripts use a cache directory, which is under the working directory by default. You can specify another cache folder with the CACHEHOME environment variable (recommended).
# e.g., add the following lines to .bash_profile
CACHEHOME=$HOME/cache
export CACHEHOME
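The environment settings from this section can be collected in one place, for example in .bash_profile. This is a sketch under the example paths used above (/usr/local/lib/czplib and /usr/local/CTK); the variable names CZPLIB_DIR and CTK_DIR are hypothetical conveniences, not something CTK reads.

```shell
# Example install locations from this section; change them to match your system.
CZPLIB_DIR=/usr/local/lib/czplib
CTK_DIR=/usr/local/CTK

# Append czplib to PERL5LIB instead of overwriting it, so that any
# Perl libraries already listed there remain visible.
export PERL5LIB="${PERL5LIB:+$PERL5LIB:}$CZPLIB_DIR"

# Make the CTK scripts callable by name (optional).
export PATH="$PATH:$CTK_DIR"

# Keep the cache in one fixed place rather than in each working directory.
export CACHEHOME="$HOME/cache"
```

Appending to PERL5LIB is a design choice for machines that already use other Perl libraries; on a fresh account it has the same effect as setting the variable directly.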
Indexing reference genome
We are now using BWA (version 0.7.12) for alignment instead of novoalign for two reasons:
- novoalign is slower than some of the other algorithms that have become available, in part because the academic version of novoalign does not allow multithreading.
- BWA allows one to specify a mismatch rate instead of an absolute number of mismatches, which is more appropriate for tags of different sizes (i.e., fewer mismatches are allowed for shorter tags after trimming).
This step needs to be done only once.
After you have installed BWA, prepare a reference genome:
For example, build a reference mm10 genome. Download the reference genome here: http://ccb.jhu.edu/software/tophat/igenomes.shtml. In this case, make sure you are downloading the "Mus musculus UCSC MM10" reference.
wget ftp://igenome:G3nom3s4u@ussd-ftp.illumina.com/Mus_musculus/UCSC/mm10/Mus_musculus_UCSC_mm10.tar.gz
tar -xvf Mus_musculus_UCSC_mm10.tar.gz
cd /Mus_musculus_UCSC_mm10/Mus_musculus/UCSC/mm10/Sequence/Chromosomes
Change the chromosome headers if needed and combine the chromosomes into a full genome. Note that we exclude random chromosomes and the mitochondrial chromosome in our analysis.
cat chr1.fa chr2.fa chr3.fa chr4.fa chr5.fa chr6.fa chr7.fa chr8.fa chr9.fa chr10.fa chr11.fa chr12.fa chr13.fa chr14.fa chr15.fa chr16.fa chr17.fa chr18.fa chr19.fa chrX.fa chrY.fa > mm10.fa
Note 1: Make sure each chromosome is named '>chrN' rather than '>N'.
Note 2: In our analysis, we do not include random chromosomes or chrM, which might not be the best for certain projects.
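The concatenation and the header check from Note 1 can be scripted so that a missing file or a misnamed chromosome is caught early. This is a sketch, not part of CTK: the function names build_genome and check_headers are hypothetical, the chromosome list assumes the mouse genome (chr1 to chr19, chrX, chrY), and the input directory is assumed to hold one chrN.fa file per chromosome.

```shell
# Concatenate per-chromosome FASTA files into one genome file.
# usage: build_genome <chromosome_dir> <output_fasta>
build_genome() {
    bg_dir=$1
    bg_out=$2
    : > "$bg_out"    # start from an empty output file
    for bg_chr in 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 X Y; do
        bg_fa="$bg_dir/chr$bg_chr.fa"
        if [ ! -f "$bg_fa" ]; then
            echo "missing $bg_fa" >&2
            return 1
        fi
        cat "$bg_fa" >> "$bg_out"
    done
}

# Print every FASTA header that does not start with '>chr'.
# Returns 1 if any such header is found, 0 otherwise.
check_headers() {
    if grep '^>' "$1" | grep -v '^>chr'; then
        return 1
    fi
    return 0
}

# Example (paths are placeholders):
#   build_genome Sequence/Chromosomes mm10.fa && check_headers mm10.fa
```

For another species, edit the chromosome list in the loop; to keep chrM or random chromosomes for a particular project, add them there as well.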
Finally, create a BWA index and move it to a directory you like. In this example, the index is in the /genomes/mm10/bwa/ directory.
cd /genomes/mm10/bwa/
bwa index -a bwtsw mm10.fa
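Indexing a full genome takes a while, so it is worth confirming that it finished before mapping reads. bwa index writes five files next to the FASTA, with the extensions .amb, .ann, .bwt, .pac, and .sa. The helper below is a hypothetical sketch, not part of CTK or BWA, that checks that all five exist.

```shell
# Return 0 if all five files written by `bwa index` exist for the given FASTA.
# usage: check_bwa_index <path/to/genome.fa>
check_bwa_index() {
    cbi_status=0
    for cbi_ext in amb ann bwt pac sa; do
        if [ ! -f "$1.$cbi_ext" ]; then
            echo "missing index file: $1.$cbi_ext" >&2
            cbi_status=1
        fi
    done
    return "$cbi_status"
}

# Example, using the directory from this section:
#   check_bwa_index /genomes/mm10/bwa/mm10.fa && echo "index complete"
```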
Genome assemblies with annotations included in CTK
While CTK is not limited to specific species/genome assemblies in general, several steps require gene annotations. Currently the annotation files of the following assemblies have been included as part of CTK:
- hg38
- hg19
- mm10
- dm6
If you are interested in certain genome assemblies not currently supported by CTK, feel free to let us know by posting in our Google CTK user group.