Revision as of 17:49, 24 May 2018

Introduction

Crosslinking and immunoprecipitation followed by high-throughput sequencing (HITS-CLIP or CLIP-Seq) is now widely used to map protein-RNA interactions on a genome-wide scale. The CLIP Tool Kit (CTK) is a software package that provides a set of tools for analysis of CLIP data starting from the raw reads generated by the sequencer. It includes pipelines to filter and map reads, collapse PCR duplicates to obtain unique CLIP tags, define CLIP tag clusters and call peaks, and define the exact protein-RNA crosslink sites by CIMS and CITS analysis. This software package is an expanded version of our previous CIMS package.

Crosslinking induced mutation site (CIMS) and crosslinking induced truncation site (CITS) analyses are computational methods for CLIP data analysis that determine the exact protein-RNA crosslink sites and thereby map protein-RNA interactions at single-nucleotide resolution. These methods are based on the observation that UV-crosslinked amino-acid-RNA adducts can introduce reverse transcription errors, including mutations and premature truncations in cDNAs, at a certain frequency. These errors are captured by sequencing and subsequent comparison of CLIP tags with a reference genome.

If you use the software, please cite:

Shah,A., Qian,Y., Weyn-Vanhentenryck,S.M., Zhang,C. 2017. CLIP Tool Kit (CTK): a flexible and robust pipeline to analyze CLIP sequencing data. Bioinformatics. 33:566-567.

More details of the biochemical and computational aspects of CLIP can be found in the following references:

Zhang, C. †, Darnell, R.B. † 2011. Mapping in vivo protein-RNA interactions at single-nucleotide resolution from HITS-CLIP data. Nat. Biotech. 29:607-614.

Moore, J.*, Zhang, C.*, Grantman, E.C., Mele, A., Darnell, J.C., Darnell, R.B. 2014. Mapping Argonaute and conventional RNA-binding protein interactions with RNA at single-nucleotide resolution using HITS-CLIP and CIMS analysis. Nat. Protocols. 9(2):263-93. doi:10.1038/nprot.2014.012.

For the crosslinking induced truncation site (CITS) analysis described below, please refer to:

Weyn-Vanhentenryck,S.M.*, Mele,A.*, Yan,Q.*, Sun,S., Farny,N., Zhang,Z., Xue,C., Herre,M., Silver,P.A., Zhang,M.Q., Krainer,A.R., Darnell,R.B. †, Zhang,C. † 2014. HITS-CLIP and integrative modeling define the Rbfox splicing-regulatory network linked to brain development and autism. Cell Rep. 6:1139-1152.

Versions

v1.0.8 ( 05-24-2018 ) current
- improved selection of unique CLIP tags.
- improved support for CIMS analysis of particular types of substitutions (e.g., T-C for PAR-CLIP).
- a wrapper CITS.pl is included to simplify CITS analysis.
- included additional annotation files.
- included support for hg38.
- minor bug fixes
- improved/expanded documentation and tutorials
- various improvements and bug fixes to improve efficiency and robustness
v1.0.7 ( 01-16-2017 )
- fix mac-specific crash
v1.0.6 ( 01-04-2017 )
- minor bug fix
v1.0.5 ( 11-10-2016 )
- fixed path to annotation files
- have default path to gene bed file for tag2peak.pl
v1.0.4 ( 10-05-2016 )
- minor fixes
v1.0.3 ( 08-08-2016 )
- improvement in software packaging and usage
v1.0.0 ( 10-12-2015 )
- The initial beta release

Download

Source code

czplib (perl): a Perl library with various functions for genomic/bioinformatic analysis. Download from GitHub
CTK (perl): the core algorithm. Download from GitHub

Prerequisites

This software is implemented in Perl. It also relies on several standard Linux/Unix tools such as grep, cat, sort, etc. We have tested the software on CentOS, although it is expected to work on most Unix-like systems, including Mac OS X. In addition, several software packages are required by the pipeline for sequence preprocessing and alignment (the version number used in our tests is also indicated).

FASTX Tool-Kit Version 0.0.13: http://hannonlab.cshl.edu/fastx_toolkit/download.html
cutadapt Version 1.14: https://pypi.python.org/pypi/cutadapt/ (an alternative to FASTX Tool-Kit)
Burrows Wheeler Aligner (BWA) Version 0.7.12: http://bio-bwa.sourceforge.net/
Samtools Version 1.3.1: http://samtools.sourceforge.net
Perl Version 5.14.3 was used for testing, but we expect that newer versions of Perl will also be compatible: https://www.perl.org/get.html
Perl library Math::CDF Version 0.1: http://search.cpan.org/~callahan/Math-CDF-0.1/CDF.pm
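
As a quick sanity check before installation, the following sketch reports which prerequisites are visible on the current PATH. The executable names are assumptions about a typical installation (fastx_clipper is used here as a representative FASTX program), and check_tools is a hypothetical helper, not part of CTK.

```shell
# Report which of the named commands are missing from PATH.
# Returns 0 if all are found, 1 otherwise.
check_tools() {
    missing=0
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "found:   $tool"
        else
            echo "MISSING: $tool"
            missing=1
        fi
    done
    return "$missing"
}

# Executable names are assumed from the packages listed above.
check_tools perl bwa samtools cutadapt fastx_clipper \
    || echo "Install the missing tools before continuing."

# Check that the Math::CDF Perl module can be loaded.
perl -MMath::CDF -e 1 2>/dev/null || echo "MISSING: Perl module Math::CDF"
```

Since cutadapt is listed as an alternative to the FASTX Tool-Kit, a missing entry for one of the two may be acceptable.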

Installation

Download and install software packages described in prerequisites.
Download the czplib perl library files (refer back to Download section above)
Decompress and move to whatever directory you like (as an example, we use /usr/local/lib/)
Replace "x" below with the version of the package you downloaded

$ unzip czplib-1.0.x.zip
$ mv czplib-1.0.x /usr/local/lib/czplib

Add the library path to the environment variable, so perl can find it.

export PERL5LIB=/usr/local/lib/czplib
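
If PERL5LIB is already set for other Perl libraries, a plain assignment replaces its previous value. A minimal sketch, assuming the example install location used above, that keeps existing entries and then checks Perl's search path:

```shell
# Assumed install location from the example above; adjust to your own path.
CZPLIB_DIR=/usr/local/lib/czplib

# Prepend czplib while keeping any directories already in PERL5LIB.
export PERL5LIB="$CZPLIB_DIR${PERL5LIB:+:$PERL5LIB}"

# Confirm the directory now appears in Perl's module search path (@INC).
perl -e 'print join("\n", @INC), "\n"' 2>/dev/null | grep -F "$CZPLIB_DIR" \
    || echo "czplib directory not found in Perl's search path"
```

To make the setting persistent, the export line can be added to a shell startup file such as .bash_profile.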

Download the CTK code, then likewise decompress it and move it to whatever directory you like (as an example, we use /usr/local/)

$ unzip ctk-1.0.x.zip
$ mv ctk-1.0.x /usr/local/CTK

Add the dir to your $PATH environment variable if you would like.
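
For instance, assuming CTK was moved to /usr/local/CTK as in the example above, the PATH update might look like the sketch below. The check relies on tag2peak.pl sitting at the top level of the CTK directory, which is an assumption about the package layout.

```shell
# Assumed install location from the example above; adjust to your own path.
CTK_DIR=/usr/local/CTK

# Put the CTK scripts ahead of other entries on PATH.
export PATH="$CTK_DIR:$PATH"

# A CTK script should now resolve from any working directory.
command -v tag2peak.pl \
    || echo "tag2peak.pl not found; check CTK_DIR and that the scripts are executable"
```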

Finally, some of the scripts use a cache directory, which is placed under the working directory by default. One can specify another folder for the cache using the CACHEHOME environment variable (recommended).


# e.g., add the following lines to .bash_profile
CACHEHOME=$HOME/cache
export CACHEHOME

Indexing reference genome

We are now using BWA (version 0.7.12) for alignment instead of novoalign for two reasons:

novoalign is slower than some of the other algorithms that have become available, in part because the academic version of novoalign does not allow multithreading.
BWA allows one to specify a mismatch rate instead of an absolute number of mismatches, which is more appropriate for tags of different sizes (i.e., a smaller number of mismatches is allowed for shorter tags after trimming).

This step needs to be done only once.

After you have installed BWA, prepare a reference genome:

For example, build a reference mm10 genome. Download the reference genome here: http://ccb.jhu.edu/software/tophat/igenomes.shtml. In this case, make sure you are downloading the "Mus musculus UCSC MM10" reference.

wget ftp://igenome:G3nom3s4u@ussd-ftp.illumina.com/Mus_musculus/UCSC/mm10/Mus_musculus_UCSC_mm10.tar.gz
tar -xvf Mus_musculus_UCSC_mm10.tar.gz
cd /Mus_musculus_UCSC_mm10/Mus_musculus/UCSC/mm10/Sequence/Chromosomes

Change the chromosome headers if needed and combine the chromosomes into a full genome. Note that we exclude random chromosomes and the mitochondrial chromosome in our analysis.

cat chr1.fa chr2.fa chr3.fa chr4.fa chr5.fa chr6.fa chr7.fa chr8.fa chr9.fa chr10.fa chr11.fa chr12.fa chr13.fa chr14.fa chr15.fa chr16.fa chr17.fa chr18.fa chr19.fa chrX.fa chrY.fa > mm10.fa

Note 1: Make sure each chromosome is named by '>chrN' instead of '>N'

Note 2: In our analysis, we do not include random chromosomes or chrM, which might not be appropriate for certain projects.
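
The header convention in Note 1 can be checked and, if necessary, fixed with standard tools. The sketch below is one possible approach; add_chr_prefix is a hypothetical helper, not part of CTK, and the two-record FASTA is made up for the demonstration.

```shell
# Rewrite FASTA headers such as ">1" or ">X" to ">chr1" or ">chrX".
# Headers that already start with ">chr" and all sequence lines are unchanged.
# Reads the named file(s), or standard input when no file is given.
add_chr_prefix() {
    sed -e '/^>chr/!s/^>/>chr/' "$@"
}

# Small demonstration on a made-up two-record file.
demo=$(mktemp)
printf '>1\nACGT\n>chrX\nTTGA\n' > "$demo"
add_chr_prefix "$demo" | grep '^>'   # lists the headers after renaming
rm -f "$demo"
```

On a real genome file, running grep '^>' mm10.fa lists all headers, so one can confirm that only the intended chromosomes are present.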

Finally, move the combined genome to a directory you like and create a BWA index there. In this example, the index is in the /genomes/mm10/bwa/ directory.

mkdir -p /genomes/mm10/bwa/
mv mm10.fa /genomes/mm10/bwa/
cd /genomes/mm10/bwa/
bwa index -a bwtsw mm10.fa
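
Indexing a full mammalian genome can take a while, so it may be worth confirming that it finished. bwa index writes five files next to the FASTA file (with suffixes .amb, .ann, .bwt, .pac and .sa). The sketch below checks for them; bwa_index_complete is a hypothetical helper, not part of BWA or CTK, and the path is the example location used above.

```shell
# Return 0 if all five files written by `bwa index` exist and are non-empty
# for the given FASTA prefix; otherwise name the first missing file.
bwa_index_complete() {
    for ext in amb ann bwt pac sa; do
        if [ ! -s "$1.$ext" ]; then
            echo "missing: $1.$ext"
            return 1
        fi
    done
    return 0
}

# Example path taken from the text above.
bwa_index_complete /genomes/mm10/bwa/mm10.fa \
    || echo "index incomplete; rerun bwa index"
```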

From Zhang Laboratory