Difference between revisions of "CTK Documentation"

From Zhang Laboratory

Jump to: navigation, search
(Collapsing PCR duplicates)
(Installation)
 
(46 intermediate revisions by 3 users not shown)
Line 1: Line 1:
=Introduction=
+
<center>[[CTK|CTK home]] | [[Standard/BrdU-CLIP data analysis using CTK|Standard/BrdU-CLIP]] | [[iCLIP data analysis using CTK|iCLIP]] | [[eCLIP data analysis using CTK|eCLIP]] | [[PARCLIP data analysis using CTK|PAR-CLIP]] | [[CTK_usage|CTK usage]] | [[CTK_FAQ|FAQ]]</center>
  
Crosslinking and immunoprecipitation (CLIP) has now been widely used to map protein-RNA interactions on a genome-wide scale.  This software package named CLIP Tool Kit (CTK) is to provide a set of tools for analysis of CLIP data starting from the raw data generated by the sequencer. It includes pipelines to filter and map reads, collapse PCR duplicates to obtain unique CLIP tags, define CLIP tag clusters and call peaks, mapping the exact protein-RNA crosslink site by CIMS and CITS analysis.  This package will replace our original CIMS package, which has more limited features. 
+
__TOC__
  
Crosslinking induced mutation site or CIMS analysis is a computational method for HITS-CLIP data analysis to determine the exact protein-RNA crosslink sites and thereby map protein-RNA interactions at single-nucleotide resolution.  This method is based on the observation that UV cross linked amino-acid-RNA adducts introduce reverse transcription errors in cDNAs at certain frequencies, which are captured by sequencing and comparison of CLIP tags with the reference genome.  More details can be found in the following references:
+
=Introduction=
  
<pre>
+
[[File:CTK_Pipeline_Overview.png|500px|center]]
Zhang, C. †, Darnell, R.B. † 2011. Mapping in vivo protein-RNA interactions at single-nucleotide resolution from HITS-CLIP data.  Nat. Biotech. 29:607-614.  
+
  
Moore, J.*, Zhang, C.*, Grantman E.C., Mele, A., Darnell, J.C., Darnell, R.B. 2014. Mapping Argonaute and conventional RNA-binding protein interactions with RNA at single-nucleotide resolution using HITS-CLIP and CIMS analysis. Nat Protocols. 9(2):263-93.  doi:10.1038/nprot.2014.012.
 
</pre>
 
  
 +
Crosslinking and immunoprecipitation followed by highthroughput sequencing (HITS-CLIP or CLIP-Seq) has now been widely used to map protein-RNA interactions on a genome-wide scale.  The CLIP Tool Kit (CTK) is a software package that provides a set of tools for analysis of CLIP data starting from the raw reads generated by the sequencer. It includes pipelines to filter and map reads, collapse PCR duplicates to obtain unique CLIP tags, define CLIP tag clusters and call peaks, and define the exact protein-RNA crosslink sites by CIMS and CITS analysis.  This software package is an expanded version of our previous CIMS package. 
  
For cross link induced truncation site analysis (CITS) described below, please refer to:
+
Crosslinking induced mutation site (CIMS) and cross linking induced truncation site (CITS) analyses are computational methods for CLIP data analysis to determine the exact protein-RNA crosslink sites and thereby map protein-RNA interactions at single-nucleotide resolution.  These methods are based on the observation that UV crosslinked amino-acid-RNA adducts can introduce reverse transcription errors, including mutations and premature in cDNAs at a certain frequency, which are captured by sequencing and subsequent comparison of CLIP tags with a reference genome. 
  
<pre>
 
Weyn-Vanhentenryck,S.,M.*, Mele,A.*, Yan,Q.*, Sun,S., Farny,N., Zhang,Z., Xue,C., Herre,M., Silver,P.A., Zhang,M.Q., Krainer,A.R., Darnell,R.B. †, Zhang,C. † 2014. HITS-CLIP and integrative modeling define the Rbfox splicing-regulatory network linked to brain development and autism. Cell Rep. 10.1016/j.celrep.2014.02.005.
 
</pre>
 
 
The following individuals contributed to this document: Chaolin Zhang, Kernyu Park, and Ankeeta Shah.
 
 
=Versions=
 
*v1.0.0 ( 10-12-2015 ), current
 
**The initial beta release
 
 
=Download=
 
 
'''Source code:'''
 
 
*czplib (perl): a perl library with various functions for genomic/bioinformatic analysis. ([http://sourceforge.net/p/czplib/ download from SourceForge.net])
 
*CTK (perl): the core algorithm. ([http://sourceforge.net/p/ngs-ctk/ download from SourceForge.net])
 
 
=Installation=
 
 
==Prerequisites==
 
 
This software is implemented with perl .  It also relies on several standard linux/unix tools such as grep, cat, sort, etc.  We have tested the software on RedHat Linux, although it is expected to work on most unix-like systems, including Mac OS X.  In addition, several software packages are required by the pipeline for read processing and alignment.
 
 
*FASTX Tool-Kit: http://hannonlab.cshl.edu/fastx_toolkit/commandline.html
 
*bwa: http://bio-bwa.sourceforge.net
 
*samtools: http://samtools.sourceforge.net
 
 
==Steps to install the software==
 
 
* Download the perl library files czplib, if not already.
 
 
Decompress it and move it to a place you like
 
  
 +
If you use the software, please cite:
 
<pre>
 
<pre>
$tar zxvf czplib.v1.0.x.tgz
+
Shah,A., Qian,Y., Weyn-Vanhentenryck,S.M., Zhang,C. 2017. CLIP Tool Kit (CTK): a flexible and robust pipeline to analyze CLIP sequencing data. Bioinformatics. 33:566-567.
$mv czplib /usr/local/lib
+
 
</pre>
 
</pre>
  
Add the library path to the environment variable, so perl can find it. 
 
<pre>
 
PERL5LIB=/usr/local/lib/czplib
 
</pre>
 
  
* Download CTK codes, if not already.
+
More details of the biochemical and computational aspects of CLIP can be found in the following references:
Decompress it and move it to a place you like
+
  
 
<pre>
 
<pre>
$tar zxvf CTK.v1.0.x.tgz
+
Zhang, C. †, Darnell, R.B. † 2011. Mapping in vivo protein-RNA interactions at single-nucleotide resolution from HITS-CLIP data.  Nat. Biotech. 29:607-614.  
$cd CTK
+
$chmod 755 *.pl
+
$mv CTK /usr/local/CTK
+
</pre>
+
  
Add the dir to your $PATH environment variable.
+
Moore, J.*, Zhang, C.*, Grantman E.C., Mele, A., Darnell, J.C., Darnell, R.B. 2014. Mapping Argonaute and conventional RNA-binding protein interactions with RNA at single-nucleotide resolution using HITS-CLIP and CIMS analysis. Nat Protocols. 9(2):263-93.  doi:10.1038/nprot.2014.012.
 
+
=Read preprocessing=
+
 
+
Compared to previous versions, this new pipeline, in all preprocessing steps prior to read alignment, operates on FASTQ files to take advantage of sequence quality scores.  
+
 
+
In our current experimental design, several CLIP libraries with different indexes are typically pooled together to be sequenced in one lane. In this documentation, we assume that the raw FASTQ file from Illumina sequencing (101 nt reads) has the name raw.fastq.
+
 
+
==Demultiplexing samples and read quality filtering==
+
 
+
Assuming that we have six samples with indexes GTCA, GCATG, ACTG, AGCT, GCATC, TCGA, the following script will extract reads for each library using a simple loop.
+
<pre>
+
mkdir filtering
+
cd filtering
+
 
</pre>
 
</pre>
  
<pre>
+
For crosslinking induced trunction analysis (CITS) described below, please refer to:
#!/usr/bin/sh
+
index=(GTCA GCATG ACTG AGCT GCATC TCGA);
+
samples=(sample1 sample2 sample3 sample4 sample5 sample6);
+
 
+
for ((i=0;i<=5;i=i+1)); do
+
idx=${index[$i]}
+
name=${samples[$i]}
+
perl ~/czlab_src/CTK/fastq_filter.pl -v -if sanger -index 0:$idx -f mean:0-38:20 -of fastq raw.fastq.gz $name.fastq
+
done
+
</pre>
+
 
+
Alternatively, we can parallelize the process by submitting an array of jobs to SGE.
+
  
 
<pre>
 
<pre>
 
+
Weyn-Vanhentenryck,S.,M.*, Mele,A.*, Yan,Q.*, Sun,S., Farny,N., Zhang,Z., Xue,C., Herre,M., Silver,P.A., Zhang,M.Q., Krainer,A.R., Darnell,R.B. †, Zhang,C. † 2014. HITS-CLIP and integrative modeling define the Rbfox splicing-regulatory network linked to brain development and autism. Cell Rep. 6:1139-1152.  
#!/usr/bin/sh
+
#$ -t 1-6 -m a -cwd -N CLIP
+
#-l "mem=2G,time=192::"
+
 
+
index=(GTCA GCATG ACTG AGCT GCATC TCGA);
+
samples=(sample1 sample2 sample3 sample4 sample5 sample6);
+
 
+
 
+
idx=${index[$SGE_TASK_ID-1]}
+
name=${samples[$SGE_TASK_ID-1]}
+
perl ~/czlab_src/CTK/fastq_filter.pl -v -if sanger -index 0:$idx -f mean:0-38:20 -of fastq raw.fastq.gz $name.fastq
+
 
+
 
</pre>
 
</pre>
  
This script filters raw reads based on quality scores. -f mean:0-38-20 specifies a mean score of 20 or above in the first 39 bases (for BrdU CLIP), which includes 14 positions with sample indexes and the random barcode, followed by 25 positions with the actual CLIP tag. This script can be altered if utilizing standard CLIP data by changing the line below so that we will specify a mean score of 20 or above in the first 30 bases.
+
=User group=
  
<pre>
+
For questions/answers, please visit our user group: https://groups.google.com/forum/#!forum/ctk-user-group
  
perl ~/czlab_src/CTK/fastq_filter.pl -v -if sanger -index 0:$idx -f mean:0-29:20 -of fastq raw.fastq.gz $name.fastq
+
=Versions=
 +
*v1.1.3 (12/2018) current
 +
**minor bug fix
 +
*v1.1.2 (08/02/2018)
 +
**update in hg38 annotation files
 +
*v1.1.1 (07/20/2018)
 +
** bug fix related to the use of dm6 annotation files
 +
*v1.1.0 ( 07/14/2018 )
 +
** included support for dm6.
 +
*v1.0.9 ( 06/12/2018 )
 +
** minor bug fix
 +
*v1.0.8 ( 05/24/2018 )
 +
**improved selection of unique CLIP tags.
 +
**improved support for CIMS anlaysis of particular types of substitutions (e.g., T-C for PAR-CLIP).
 +
**a wrapper CITS.pl is included to simplify CITS anlaysis.
 +
**included additional annotations files.
 +
**included support for hg38.
 +
**minor bug fixes
 +
**improved/expanded documentation and tutorials
 +
**Various improvement and bug fixes to improve efficiency and robustness
 +
*v1.0.7 ( 01-16-2017 )
 +
**fix mac-specific crash
 +
*v1.0.6 ( 01-04-2017 )
 +
**minor bug fix
 +
*v1.0.5 ( 11-10-2016 )
 +
**fixed path to annotation files
 +
**have default path to gene bed file for tag2peak.pl
 +
*v1.0.4 ( 10-05-2016 )
 +
**minor fixes
 +
*v1.0.3 ( 08-08-2016 )
 +
**improvement in software packaging and usage
 +
*v1.0.0 ( 10-12-2015 )
 +
**The initial beta release
  
</pre>
+
=Download=
  
The reason we do this is because low quality reads can introduce mapping errors and background. They will inflate the number of unique tags after removal of PCR duplicates.  
+
*czplib (perl): a perl library with various functions for genomic/bioinformatic analysis. [https://github.com/chaolinzhanglab/czplib Download from github]
 +
*CTK (perl): the core algorithm. [https://github.com/chaolinzhanglab/ctk Download from github]
  
'''Note 1:''' As good practice, always double check the number of reads in each library and also confirm if the extracted reads account for a vast majority of all reads in the raw fastq file.  If you run the fastq_filter.pl without specifying -index, you will get all reads that passed quality filtering.
+
=Prerequisites=
  
<pre>
+
This software is implemented in perl.  It also relies on several standard linux/unix tools such as grep, cat, sort, etc.  We have tested the software on Cent OS, although it is expected to work on most unix-like systems, including Mac OS X.  In addition, several software packages are required by the pipeline for sequence preprocessing and alignment (the version number in our test is also indicated).
  
perl ~/czlab_src/CTK/fastq_filter.pl -if sanger -f mean:0-38:20 -of fasta raw.fastq.gz - | grep -v \> | cut -c 1-5 | sort | uniq -c > raw.stat.txt
+
*FASTX Tool-Kit Version 0.0.13: http://hannonlab.cshl.edu/fastx_toolkit/download.html
 +
*cutadapt Version 1.14: https://pypi.python.org/pypi/cutadapt/ (an alternative to FASTX Tool-Kit)
 +
*Burrows Wheeler Aligner (BWA) Version 0.7.12: http://bio-bwa.sourceforge.net/
 +
*Samtools Version 1.3.1: http://samtools.sourceforge.net
 +
*Perl Version 5.14.3 was used for testing, but we expect that newer versions of Perl will also be compatible: https://www.perl.org/get.html
 +
*Perl library Math::CDF Version 0.1: http://search.cpan.org/~callahan/Math-CDF-0.1/CDF.pm
  
</pre>
+
=Installation=
  
The command above will give the number of reads starting with particular 5mers.
+
==Through anaconda==
  
'''Note 2:''' The reason we used the first four positions of the index for demultiplexing is due to historical confusion about the identity of the index sequences.  It really should be 5nt, but it does not make a substantial difference.
+
Below are the installation instructions for the perl packages CTK and CZPLIB through Anaconda.
  
==Trimming of 3' linker sequences==
+
* Setup the working environment 'ctk' and install all the packages by running the commands below:
 
+
'''Note 1:''' due to long 101nt reads, collapsing before trimming is not very helpful. Therefore, we trim the 3' linker first.
+
 
+
For each sample, run the following command using tools provided by the fastx toolkit.
+
  
 
<pre>
 
<pre>
 +
myenv='ctk'
 +
conda create --yes --name $myenv
  
cat sample1.fastq | fastx_clipper -a TCGTATGCCGTCTTCTGCTTG -l 29 -n | fastq_quality_trimmer -t 5 -l 29 -o sample1.trim.fastq
+
conda activate $myenv
  
</pre>
+
conda config --env --append channels conda-forge
 
+
conda config --env --append channels bioconda
This command will remove the 3' linker and/or also remove extremely low quality bases (score < 5).
+
 
+
'''Note 2:''' It is good to check the number of reads by running the command:
+
 
+
<pre>
+
 
+
wc -l sample1.trim.fastq
+
  
 +
conda install --yes -c chaolinzhanglab ctk
 
</pre>
 
</pre>
  
==Collapse exact duplicates==
+
If you would like to install a specific platform version e.g. 'noarch', then use the below command:
 
+
If multiple reads have exactly the same sequence, only one is kept.  
+
 
+
 
<pre>
 
<pre>
 
+
conda install --yes -c chaolinzhanglab/noarch ctk
perl ~/czlab_src/CTK/fastq2collapse.pl sample1.trim.fastq sample1.trim.c.fastq
+
 
+
 
</pre>
 
</pre>
  
==Strip barcode==
+
For ease of installation, we recommend the above commands to be copied to a script setup_ctk.sh. This will also make it easy to include the required steps (See '''Note''' below) for restoring the PATH variable if we want to.
 
+
The following script removes the 5' linker degenerate barcode.
+
  
 +
Assuming the conda base environment is already setup and activated, run the setup_ctk.sh script from the terminal.
 
<pre>
 
<pre>
 
+
(base)...$source setup_ctk.sh
perl ~/czlab_src/CTK/stripBarcode.pl -format fastq -len 14 sample1.trim.c.fastq sample1.trim.c.tag.fastq
+
 
+
 
</pre>
 
</pre>
  
After this step, one can get the distribution of tag length for diagnostic purposes.
 
  
<pre>
+
Now we can proceed to our working directory for performing the CTK analysis.
  
awk '{if(NR%4==2) {print $0}}' sample1.trim.c.tag.fastq | awk -F "" '{print NF}' | sort -n | uniq -c | awk '{print $2"\t"$1}' > sample1.trim.c.tag.seqlen.stat.txt
 
  
</pre>
+
'''Acknowledgment Note:'''
 +
There is a R wrapper for CTK, called CLIPflexR which can call some other external libraries within R.
  
=Read mapping & parsing=
+
For details please visit the CLIPflexR webpage:
  
==Read mapping==
+
https://kathrynrozengagnon.github.io/CLIPflexR/index.html
  
We are now using bwa for alignment instead of novoalign (which was used in the old documentation) for two reasons:
 
1. novoalign is very slow, in part because the academic version does not allow multi threading.
 
2. bwa allows one to specify mismatch rate instead of the the absolute number, which is more appropriate for tags of different sizes (i.e. a smaller number of mismatches allowed for shorter tags after trimming).
 
  
<pre>
 
  
mkdir mapping
+
'''Note:'''
cd mapping
+
If we need to reset the PATH variable when we do 'conda deactivate' to go to the base environment from the working environment 'ctk', we suggest to include the following steps in the setup_ctk.sh.
bwa aln -t 4 -n 0.06 -q 20 ~/data/genomes/mm10/bwa/mm10.fa ../filtering/sample1.trim.c.tag.fastq > sample1.sai
+
This is a crude way but serves the purpose.
  
bwa samse ~/data/genomes/mm10/bwa/mm10.fa sample1.sai ../filtering/sample1.trim.c.tag.fastq > sample1.sam
+
1. At the beginning of the script setup_ctk.sh, include the following:
 
+
</pre>
+
 
+
The option -n 0.06 specifies slightly more stringent criteria than the default. The number of allowed mismatches (substitutions or indels) is as follows:
+
 
<pre>
 
<pre>
 
+
export CONDA_PATH_RESET=/usr/local/bin/:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:${CONDA_PREFIX}/bin:${CONDA_PREFIX}/condabin
[bwa_aln] 17bp reads: max_diff = 1
+
[bwa_aln] 20bp reads: max_diff = 2
+
[bwa_aln] 45bp reads: max_diff = 3
+
[bwa_aln] 73bp reads: max_diff = 4
+
[bwa_aln] 104bp reads: max_diff = 5
+
[bwa_aln] 137bp reads: max_diff = 6
+
[bwa_aln] 172bp reads: max_diff = 7
+
[bwa_aln] 208bp reads: max_diff = 8
+
[bwa_aln] 244bp reads: max_diff = 9
+
 
+
 
</pre>
 
</pre>
  
The -q option is used to trim low quality reads (an average of 20 or below). However, this does not seem to do too much after the trimming steps above.
+
2. Just before the line "conda activate $myenv", include the following:
 
+
==Parsing SAM file==
+
 
+
 
<pre>
 
<pre>
 +
echo "CONDA_PATH_RESET=$CONDA_PATH_RESET" > "${CONDA_PREFIX}/envs/${myenv}/etc/conda/activate.d/${myenv}_env_activate.sh"
 +
echo "export CONDA_PATH_RESET" >> "${CONDA_PREFIX}/envs/${myenv}/etc/conda/activate.d/${myenv}_env_activate.sh"
  
perl ~/czlab_src/CTK/parseAlignment.pl -v --map-qual 1 --min-len 18 --mutation-file sample1.mutation.txt sample1.sam sample1.tag.bed
+
chmod +x "${CONDA_PREFIX}/envs/${myenv}/etc/conda/activate.d/${myenv}_env_activate.sh"
 
+
 
</pre>
 
</pre>
  
This will keep only unique mappings (with MAPQ >=1) and a minimal mapping size of 18 nt.
+
3. Just after the line "conda activate $myenv", include the following:
 
+
'''Note 1:''' other aligners might not use a positive MAPQ as an indication of unique mapping.
+
 
+
Another useful option is --indel-to-end, which specifies the number of nucleotides towards the end from which indels should not be called.
+
 
+
'''Note 2:''' the parsing script relies on MD tags, which is an optional field without strict definition.  Some aligners might have slightly different format how they report mismatches.  If other aligners than bwa is used, one should run the following command:
+
 
+
 
<pre>
 
<pre>
 +
echo "PATH=\$CONDA_PATH_RESET" > "${CONDA_PREFIX}/etc/conda/deactivate.d/${myenv}_env_deactivate.sh"
 +
echo "export PATH" >> "${CONDA_PREFIX}/etc/conda/deactivate.d/${myenv}_env_deactivate.sh"
  
samtools view -bS sample1.sam | samtools sort - sample1.sorted
+
chmod +x "${CONDA_PREFIX}/etc/conda/deactivate.d/${myenv}_env_deactivate.sh"
samtools fillmd  sample1.sorted.bam ~/data/genomes/mm10/mm10.fa > sample1.sorted.md.sam
+
 
+
 
</pre>
 
</pre>
  
This will ensure the sam file gets parsed properly.
+
These steps will write the required activate.d and deactivate.d scripts to reset the $PATH variable
  
'''Note 3:''' keep track what proportion of reads can be mapped uniquely.
 
<pre>
 
  
wc -l sample1.tag.bed
 
 
</pre>
 
 
'''Note 4:''' the format of the mutation file is essentially the same as before.
 
 
=Collapsing PCR duplicates=
 
 
It is critical to collapse PCR duplicates, not only for the exact duplicates collapsed above, but also for those with slight differences due to sequencing errors.
 
  
 +
==Manual Installation==
 +
* Download and install software packages described in prerequisites.
 +
* Download the czplib perl library files (refer back to '''Download''' section above)
 +
* Decompress and move to whatever directory you like (as an example, we use /usr/local/lib/)
 +
* Replace "x.tgz" below with the version of the package you downloaded
 
<pre>
 
<pre>
 
+
$unzip czplib-1.0.x.zip
perl ~/czlab_src/CTK/tag2collapse.pl -v --random-barcode -EM 30 --seq-error-model alignment -weight --weight-in-name --keep-max-score --keep-tag-name sample1.tag.bed sample1.tag.uniq.bed
+
$mv czplib-1.0.x /usr/local/lib/czplib
 
+
 
</pre>
 
</pre>
  
'''Note 1:''' As a diagnostic step, get the length distribution of unique tags, which should be a more faithful representation of the library:
+
Add the library path to the environment variable, so perl can find it. 
 
+
 
<pre>
 
<pre>
 
+
export PERL5LIB=/usr/local/lib/czplib
awk '{print $3-$2}' sample1.tag.uniq.bed | sort -n | uniq -c | awk '{print $2"\t"$1}' > sample1.uniq.len.dist.txt
+
 
+
 
</pre>
 
</pre>
  
Get the mutations in unique tags
+
* Download CTK code and likewise decompress and move to whatever directory you like (as an example, we use /usr/local/)
<pre>
+
python ~/src/CTK/joinWrapper.py sample1.mutation.txt sample1.tag.uniq.bed 4 4 N sample1.tag.uniq.mutation.txt
+
</pre>
+
 
+
 
+
'''Note 2:''' After getting the unique tags of each library, one might concatenate biological replicates, which are distinguished by different colors.
+
  
 
<pre>
 
<pre>
 
+
$unzip ctk-1.0.x.zip
perl ~/czlab_src/CTK/bed2rgb.pl -v -col "128,0,0" sample1.tag.uniq.bed sample1.tag.uniq.rgb.bed
+
$mv ctk-1.0.x /usr/local/CTK
...
+
 
+
cat sample1.tag.uniq.rgb.bed sample2.tag.uniq.rgb.bed sample3.tag.uniq.rgb.bed > sample.pool.tag.uniq.rgb.bed
+
cat sample1.tag.uniq.mutation.txt sample2.tag.uniq.mutation.txt sample3.tag.uniq.mutation.txt > sample.pool.tag.uniq.mutation.txt
+
 
+
 
</pre>
 
</pre>
  
'''Note 3:''' get genomic distribution of CLIP tags. [Do not include ~ in the path]
+
Add the dir to your $PATH environment variable if you would like.
  
<pre>
+
Finally, some of the scripts will use a cache directory, which is under the working directory by default. One can specify another folder for cache using environment variable (recommended).
sed "s#/path/to/src#<insert path to CTK here>#" ~/czlab_src/CTK/CTK_annotation.loc ~/czlab_src/CTK/CTK_annotation.custom.loc
+
perl ~/czlab_src/CTK/bed2annotation.pl -conf ~/czlab_src/CTK/CTK_annotation.custom.loc -dbkey mm10 -ss -big -gene -region -v -summary sample1.tag.uniq.annot.summary.txt sample1.tag.uniq.rgb.bed sample1.tag.uniq.annot.txt
+
</pre>
+
 
+
Check the summary file for the percentage of tags mapped to CDS, 3'UTR, introns, etc.
+
 
+
=Peak calling=
+
  
==Mode 1: Peak calling without statistical assessment==
 
 
<pre>
 
<pre>
  
perl ~/czlab_src/CTK/tag2peak.pl -big -ss -v --valley-seeking sample.pool.tag.uniq.rgb.bed sample.pool.tag.uniq.peak.bed --out-boundary sample.pool.tag.uniq.peak.boundary.bed --out-half-PH sample.pool.tag.uniq.peak.halfPH.bed
+
#e.g., add the following lines in  .bash_profile
 
+
CACHEHOME=$HOME/cache
 +
export CACHEHOME
 
</pre>
 
</pre>
  
This script searches for all potential peaks using a "valley seeking" algorithm.  A peak is called if valleys of certain depths are found on both sides so that it is separated from other peaks.  The results are similar to those generated by the original clustering algorithm. However, this valley seeking algorithm is advantageous because it is able to separate neighboring peaks without zero-tag gaps.
+
=Indexing reference genome =
  
Another useful option --valley-depth specifies the depth of the valley relative to the peak (0-1, 0.5 by default).
+
We are now using BWA (version 0.7.12) for alignment instead of novoalign for two reasons:
 +
* novoalign is slower than some of the other algorithms that become available, in part because the academic version of novoalign does not allow multi threading.
 +
* BWA allows one to specify mismatch rate instead of the the absolute number, which is more appropriate for tags of different sizes (i.e. a smaller number of mismatches allowed for shorter tags after trimming).
  
One also has the option to merge peaks close to each other by specifying the distance between the peaks (e.g., -gap 20).
+
This step needs to be done only once.
  
In this mode, the script will not assess the statistical significance of the peak height. If one needs this, an alternative mode is as follows:
+
After you have installed BWA, prepare a reference genome:
 
+
==Mode 2: Peak calling with statistical assessment==
+
  
 +
For example, build a reference mm10 genome. Download the reference genome here: http://ccb.jhu.edu/software/tophat/igenomes.shtml. In this case, make sure you are downloading the "''Mus musculus'' UCSC MM10" reference.
 
<pre>
 
<pre>
 
+
wget ftp://igenome:G3nom3s4u@ussd-ftp.illumina.com/Mus_musculus/UCSC/mm10/Mus_musculus_UCSC_mm10.tar.gz
perl ~/scripts/tag2peak.pl -big -ss -v --valley-seeking -p 0.05 --multi-test -gene ~/data/genomes/mm10/annotation/refGene_knownGene.meta.bed sample.pool.tag.uniq.rgb.bed sample.pool.tag.uniq.peak.sig.bed --out-boundary sample.pool.tag.uniq.peak.sig.boundary.bed --out-half-PH sample.pool.tag.uniq.peak.sig.halfPH.bed
+
tar -xvf Mus_musculus_UCSC_mm10.tar.gz
 
+
cd /Mus_musculus_UCSC_mm10/Mus_musculus/UCSC/mm10/Sequence/Chromosomes
 
</pre>
 
</pre>
  
Here, the script searches for significant peaks by considering one specific genic region as a unit for permutations to estimate expected peak height (scan statistics is used here, so the permutation is only conceptual). 
+
Change the chromosome header and combine the chromosomes into a full genome. Note that we exclude random chromosomes and the mitochondria chromosome in our analysis.
 
+
It is important to note that in this mode, only clusters overlapping with annotated genes are kept.
+
 
+
In both cases, the three output files provided the positions of peak (many times it is 1nt, but not always), the peak width at half peak height, and the boundaries of each peak. In practice, it appears that the center of peak width at half PH is the most robust/precise indicator of the binding sites (although a shift might be present because of CITS, which can be adjusted after one examines the enrichment of the motif).
+
 
+
'''Note 1:''' To search for motif, one first defines the center of each peak (based on width at half PH).
+
 
+
 
<pre>
 
<pre>
 +
cat ch1.fa chr2.fa chr3.fa chr4.fa chr5.fa chr6.fa chr7.fa chr8.fa chr9.fa chr10.fa chr11.fa chr12.fa chr13.fa chr14.fa chr15.fa chr16.fa chr17.fa chr18.fa chr19.fa chr20.fa chr21.fa chr22.fa chrX.fa chrY.fa > mm10.fa
  
awk '{print $1"\t"int(($2+$3)/2)-500"\t"int(($2+$3)/2)+500"\t"$4"\t"$5"\t"$6}' sample.pool.tag.uniq.peak.sig.halfPH.bed > sample.pool.tag.uniq.peak.sig.halfPH.center.ext1k.bed
 
perl ~/scripts/bed2fasta.pl -v -org mm10 sample.pool.tag.uniq.peak.sig.halfPH.center.ext1k.bed sample.pool.tag.uniq.peak.sig.halfPH.center.ext1k.fa
 
PatternMatch -c TGCATG sample.pool.tag.uniq.peak.sig.halfPH.center.ext1k.fa | cut -f 2 | sort -n | uniq -c | awk '{print $2"\t"$1}' > sample.pool.tag.uniq.peak.sig.halfPH.center.ext1k.TGCATG.profile.txt
 
 
#using Rbfox as an example
 
  
 
</pre>
 
</pre>
  
'''Note 2:'''One might want to count the number of tags in each cluster/peak for each sample.
+
'''Note 1''' : Make sure each chromosome is named by '>chrN' instead of '>N'
  
<pre>
+
'''Note 2''': In our analysis, we do not include random chromosomes or chrM, which might not be the best for certain projects.
  
perl ~/czlab_src/CTK/tag2profile.pl -ss -region sample.pool.tag.uniq.peak.sig.boundary.bed -of bed -v sample1.tag.uniq.rgb.bed sample1.tag.uniq.peak.sig.boundary.count.bed
 
 
</pre>
 
 
=CIMS analysis=
 
  
 +
Finally, create a BWA index and move it to a directory you like. In this example, the index is in the /genomes/mm10/bwa/ directory.
 
<pre>
 
<pre>
mkdir CIMS
+
cd /genomes/mm10/bwa/
cd CIMS
+
bwa index -a bwtsw mm10.fa
ln -s ../mapping/sample.pool.tag.uniq.mutation.txt ./
+
ln -s ../mapping/sample.pool.tag.uniq.rgb.bed ./
+
 
</pre>
 
</pre>
  
==Get specific types of mutations==
 
Get specific types of mutations, such as deletions, substitutions, and insertions around the cross-linked mutation site.
 
  
<pre>
+
=Genome assemblies with annotations included in CTK=
 +
While CTK is not limited to specific species/genome assemblies in general, several steps require gene annotations.  Currently the annotation files of the following assemblies have been included as part of CTK:
 +
*hg38
 +
*hg19
 +
*mm10
 +
*dm6
  
awk '{if($9==">") {print $0}}' sample.pool.tag.uniq.mutation.txt | cut -f 1-6 > sample.pool.tag.uniq.sub.bed
+
If you are interested in certain genome assemblies not currently supported by CTK, feel free to let us know by posting in our [https://groups.google.com/forum/#!forum/ctk-user-group Google CTK user group].
awk '{if($9=="-") {print $0}}' sample.pool.tag.uniq.mutation.txt | cut -f 1-6 > sample.pool.tag.uniq.del.bed
+
awk '{if($9=="+") {print $0}}' sample.pool.tag.uniq.mutation.txt | cut -f 1-6 > sample.pool.tag.uniq.ins.bed
+
  
</pre>
 
  
It is always a good practice to look at the number of each type of mutation to see, e.g., compare the relative abundance of deletions to insertions.
 
  
For example, to get the number of deletions of different sizes:
 
<pre>
 
 
awk '{print $3-$2}' sample.pool.tag.uniq.del.bed | sort -n | uniq -c
 
 
</pre>
 
 
==Get CIMS==
 
 
Here we use deletions as an example.
 
 
<pre>
 
 
perl ~/czlab_src/CTK/CIMS.pl -n 10 -p -v --keep-cache -c cache_del sample.pool.tag.uniq.rgb.bed sample.pool.tag.uniq.del.bed sample.pool.tag.uniq.del.CIMS.txt
 
 
</pre>
 
 
By default, it will output all sites including those that are not statistically significant.  This is recommended because one can play with stringency later.
 
 
<pre>
 
 
awk '{if($9<=0.05) {print $0}}' sample.pool.tag.uniq.del.CIMS.txt | sort -k 9,9n -k 8,8nr -k 7,7n > sample.pool.tag.uniq.del.CIMS.s13.txt
 
cut -f 1-6 sample.pool.tag.uniq.del.CIMS.s13.txt > sample.pool.tag.uniq.del.CIMS.s13.bed
 
 
</pre>
 
 
This will keep only those with FDR<0.05.  One might try different thresholds to get a balance of sensitivity/specificity (e.g. as judged from motif enrichment).
 
 
'''Note 1:''' Another parameter that might be useful to improve signal to noise is m/k (i.g., $8/$7 in awk)
 
 
'''Note 2:''' By default, the command line above will analyze mutations of size 1.  Sometimes deletion of multiple nucleotides occurs and those will be ignored here.  One can analyze mutations of size 2 by specifying -w 2.  Substitutions are always reported as a single nucleotide (even when consecutive nucleotides are substituted), and insertions occur technically in one position and are thus treated as size 1.
 
 
'''Note 3:''' To examine enrichment of motif around CIMS;
 
 
<pre>
 
perl ~/scripts/bed2fasta.pl -v -org mm10 -l "-500" -r 500 sample.pool.tag.uniq.del.CIMS.s13.bed sample.pool.tag.uniq.del.CIMS.s13.1k.fa
 
PatternMatch -c TGCATG sample.pool.tag.uniq.del.CIMS.s13.1k.fa | cut -f 2 | sort -n | uniq -c | awk '{print $2"\t"$1}' > sample.pool.tag.uniq.del.CIMS.s13.1k.TGCATG.profile.txt
 
</pre>
 
 
One can repeat these steps for the other types of mutations (i.e. substitutions and insertions).
 
 
=CITS analysis=
 
 
<pre>
 
 
mkdir CITS
 
cd CITS
 
ln -s ../CIMS/sample.pool.tag.uniq.del.bed ./
 
ln -s ../mapping/sample.pool.tag.uniq.rgb.bed ./
 
 
</pre>
 
 
Since the vast majority of deletions are introduced because of cross linking, tags with deletions are treated as read-through tags and removed.
 
 
<pre>
 
 
cut -f 4 sample.pool.tag.uniq.del.bed | sort | uniq > sample.pool.tag.uniq.del.id
 
python ~/czlab_src/CTK/joinWrapper.py sample.pool.tag.uniq.rgb.bed sample.pool.tag.uniq.del.id 4 1 V sample.pool.tag.uniq.rgb.clean.bed
 
 
</pre>
 
 
Get the position before the start site as a potential cross link site that causes truncation.
 
 
<pre>
 
 
perl ~/czlab_src/CTK/bedExt.pl -n up -l "-1" -r "-1" -v sample.pool.tag.uniq.rgb.clean.bed sample.pool.tag.uniq.clean.trunc.bed
 
 
</pre>
 
 
 
Cluster overlapping tags.
 
 
<pre>
 
 
perl ~/czlab_src/CTK/tag2cluster.pl -big -s -maxgap "-1" -of bed -v sample.pool.tag.uniq.rgb.bed sample.pool.tag.uniq.cluster.0.bed
 
awk '{if($5>2) {print $0}}' sample.pool.tag.uniq.cluster.0.bed > sample.pool.tag.uniq.cluster.bed
 
 
</pre>
 
 
Now we are ready for the CITS analysis:
 
 
<pre>
 
 
perl ~/scripts/tag2peak.pl -big -ss -v --prefix "CITS" -gap 25 -p 0.001 -gene sample.pool.tag.uniq.cluster.bed sample.pool.tag.uniq.clean.trunc.bed sample.pool.tag.uniq.clean.CITS.s30.bed
 
 
</pre>
 
 
'''Note 1:''' one can now perform motif enrichment analysis as described above.
 
 
'''Note 2:''' In the command line above, we opt to merge sites very close to each other and keep the most significant ones because CITS analysis tends to have more background when read through is frequent.
 
 
'''Note 3:''' In theory, one should be able to use the clusters/peaks defined above for permutations in CITS analysis.
 
 
<pre>
 
 
perl ~/czlab_src/CTK/bedExt.pl -l "-1" -v ../cluster/sample.pool.tag.uniq.peak.boundary.bed sample.pool.tag.uniq.peak.boundary.ext1.bed
 
#we extend one 1nt upstream to make sure CITS is covered by clusters
 
perl ~/czlab_src/CTK/tag2peak.pl -big -ss -v --prefix "CITS" -gap 25 -p 0.001 -gene sample.pool.tag.uniq.peak.boundary.ext1.bed sample.pool.tag.uniq.clean.trunc.bed sample.pool.tag.uniq.clean.CITS.s30.bed
 
 
</pre>
 
  
In one case, we observed that clusters defined by grouping overlapping tags appear to give slightly better results, although this might just be one case. Additional comparisons are required to make a conclusion.
+
<center>[[CTK|CTK home]] | [[Standard/BrdU-CLIP data analysis using CTK|Standard/BrdU-CLIP]] | [[iCLIP data analysis using CTK|iCLIP]] | [[eCLIP data analysis using CTK|eCLIP]] | [[PARCLIP data analysis using CTK|PAR-CLIP]] | [[CTK_usage|CTK usage]] | [[CTK_FAQ|FAQ]]</center>

Latest revision as of 14:04, 24 August 2022

CTK home | Standard/BrdU-CLIP | iCLIP | eCLIP | PAR-CLIP | CTK usage | FAQ

Introduction

CTK Pipeline Overview.png


Crosslinking and immunoprecipitation followed by highthroughput sequencing (HITS-CLIP or CLIP-Seq) has now been widely used to map protein-RNA interactions on a genome-wide scale. The CLIP Tool Kit (CTK) is a software package that provides a set of tools for analysis of CLIP data starting from the raw reads generated by the sequencer. It includes pipelines to filter and map reads, collapse PCR duplicates to obtain unique CLIP tags, define CLIP tag clusters and call peaks, and define the exact protein-RNA crosslink sites by CIMS and CITS analysis. This software package is an expanded version of our previous CIMS package.

Crosslinking induced mutation site (CIMS) and cross linking induced truncation site (CITS) analyses are computational methods for CLIP data analysis to determine the exact protein-RNA crosslink sites and thereby map protein-RNA interactions at single-nucleotide resolution. These methods are based on the observation that UV crosslinked amino-acid-RNA adducts can introduce reverse transcription errors, including mutations and premature in cDNAs at a certain frequency, which are captured by sequencing and subsequent comparison of CLIP tags with a reference genome.


If you use the software, please cite:

Shah,A., Qian,Y., Weyn-Vanhentenryck,S.M., Zhang,C. 2017. CLIP Tool Kit (CTK): a flexible and robust pipeline to analyze CLIP sequencing data. Bioinformatics. 33:566-567.


More details of the biochemical and computational aspects of CLIP can be found in the following references:

Zhang, C. †, Darnell, R.B. † 2011. Mapping in vivo protein-RNA interactions at single-nucleotide resolution from HITS-CLIP data.  Nat. Biotech. 29:607-614. 

Moore, J.*, Zhang, C.*, Grantman E.C., Mele, A., Darnell, J.C., Darnell, R.B. 2014. Mapping Argonaute and conventional RNA-binding protein interactions with RNA at single-nucleotide resolution using HITS-CLIP and CIMS analysis. Nat Protocols. 9(2):263-93.  doi:10.1038/nprot.2014.012.

For crosslinking induced trunction analysis (CITS) described below, please refer to:

Weyn-Vanhentenryck,S.,M.*, Mele,A.*, Yan,Q.*, Sun,S., Farny,N., Zhang,Z., Xue,C., Herre,M., Silver,P.A., Zhang,M.Q., Krainer,A.R., Darnell,R.B. †, Zhang,C. † 2014. HITS-CLIP and integrative modeling define the Rbfox splicing-regulatory network linked to brain development and autism. Cell Rep. 6:1139-1152. 

User group

For questions/answers, please visit our user group: https://groups.google.com/forum/#!forum/ctk-user-group

Versions

  • v1.1.3 (12/2018) current
    • minor bug fix
  • v1.1.2 (08/02/2018)
    • update in hg38 annotation files
  • v1.1.1 (07/20/2018)
    • bug fix related to the use of dm6 annotation files
  • v1.1.0 ( 07/14/2018 )
    • included support for dm6.
  • v1.0.9 ( 06/12/2018 )
    • minor bug fix
  • v1.0.8 ( 05/24/2018 )
    • improved selection of unique CLIP tags.
    • improved support for CIMS anlaysis of particular types of substitutions (e.g., T-C for PAR-CLIP).
    • a wrapper CITS.pl is included to simplify CITS anlaysis.
    • included additional annotations files.
    • included support for hg38.
    • minor bug fixes
    • improved/expanded documentation and tutorials
    • Various improvement and bug fixes to improve efficiency and robustness
  • v1.0.7 ( 01-16-2017 )
    • fix mac-specific crash
  • v1.0.6 ( 01-04-2017 )
    • minor bug fix
  • v1.0.5 ( 11-10-2016 )
    • fixed path to annotation files
    • have default path to gene bed file for tag2peak.pl
  • v1.0.4 ( 10-05-2016 )
    • minor fixes
  • v1.0.3 ( 08-08-2016 )
    • improvement in software packaging and usage
  • v1.0.0 ( 10-12-2015 )
    • The initial beta release

Download

Prerequisites

This software is implemented in perl. It also relies on several standard linux/unix tools such as grep, cat, sort, etc. We have tested the software on Cent OS, although it is expected to work on most unix-like systems, including Mac OS X. In addition, several software packages are required by the pipeline for sequence preprocessing and alignment (the version number in our test is also indicated).

Installation

Through anaconda

Below are the installation instructions for the perl packages CTK and CZPLIB through Anaconda.

  • Setup the working environment 'ctk' and install all the packages by running the commands below:
myenv='ctk'
conda create --yes --name $myenv 

conda activate $myenv

conda config --env --append channels conda-forge
conda config --env --append channels bioconda

conda install --yes -c chaolinzhanglab ctk

If you would like to install a specific platform version e.g. 'noarch', then use the below command:

conda install --yes -c chaolinzhanglab/noarch ctk

For ease of installation, we recommend the above commands to be copied to a script setup_ctk.sh. This will also make it easy to include the required steps (See Note below) for restoring the PATH variable if we want to.

Assuming the conda base environment is already setup and activated, run the setup_ctk.sh script from the terminal.

(base)...$source setup_ctk.sh


Now we can proceed to our working directory for performing the CTK analysis.


Acknowledgment Note: There is a R wrapper for CTK, called CLIPflexR which can call some other external libraries within R.

For details please visit the CLIPflexR webpage:

https://kathrynrozengagnon.github.io/CLIPflexR/index.html


Note: If we need to reset the PATH variable when we do 'conda deactivate' to go to the base environment from the working environment 'ctk', we suggest to include the following steps in the setup_ctk.sh. This is a crude way but serves the purpose.

1. At the beginning of the script setup_ctk.sh, include the following:

export CONDA_PATH_RESET=/usr/local/bin/:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:${CONDA_PREFIX}/bin:${CONDA_PREFIX}/condabin

2. Just before the line "conda activate $myenv", include the following:

echo "CONDA_PATH_RESET=$CONDA_PATH_RESET" > "${CONDA_PREFIX}/envs/${myenv}/etc/conda/activate.d/${myenv}_env_activate.sh"
echo "export CONDA_PATH_RESET" >> "${CONDA_PREFIX}/envs/${myenv}/etc/conda/activate.d/${myenv}_env_activate.sh"

chmod +x "${CONDA_PREFIX}/envs/${myenv}/etc/conda/activate.d/${myenv}_env_activate.sh"

3. Just after the line "conda activate $myenv", include the following:

echo "PATH=\$CONDA_PATH_RESET" > "${CONDA_PREFIX}/etc/conda/deactivate.d/${myenv}_env_deactivate.sh"
echo "export PATH" >> "${CONDA_PREFIX}/etc/conda/deactivate.d/${myenv}_env_deactivate.sh"

chmod +x "${CONDA_PREFIX}/etc/conda/deactivate.d/${myenv}_env_deactivate.sh"

These steps will write the required activate.d and deactivate.d scripts to reset the $PATH variable


Manual Installation

  • Download and install software packages described in prerequisites.
  • Download the czplib perl library files (refer back to Download section above)
  • Decompress and move to whatever directory you like (as an example, we use /usr/local/lib/)
  • Replace "x.tgz" below with the version of the package you downloaded
$unzip czplib-1.0.x.zip
$mv czplib-1.0.x /usr/local/lib/czplib

Add the library path to the environment variable, so perl can find it.

export PERL5LIB=/usr/local/lib/czplib
  • Download CTK code and likewise decompress and move to whatever directory you like (as an example, we use /usr/local/)
$unzip ctk-1.0.x.zip
$mv ctk-1.0.x /usr/local/CTK

Add the dir to your $PATH environment variable if you would like.

Finally, some of the scripts will use a cache directory, which is under the working directory by default. One can specify another folder for cache using environment variable (recommended).


#e.g., add the following lines in  .bash_profile
CACHEHOME=$HOME/cache
export CACHEHOME

Indexing reference genome

We are now using BWA (version 0.7.12) for alignment instead of novoalign for two reasons:

  • novoalign is slower than some of the other algorithms that become available, in part because the academic version of novoalign does not allow multi threading.
  • BWA allows one to specify mismatch rate instead of the the absolute number, which is more appropriate for tags of different sizes (i.e. a smaller number of mismatches allowed for shorter tags after trimming).

This step needs to be done only once.

After you have installed BWA, prepare a reference genome:

For example, build a reference mm10 genome. Download the reference genome here: http://ccb.jhu.edu/software/tophat/igenomes.shtml. In this case, make sure you are downloading the "Mus musculus UCSC MM10" reference.

wget ftp://igenome:G3nom3s4u@ussd-ftp.illumina.com/Mus_musculus/UCSC/mm10/Mus_musculus_UCSC_mm10.tar.gz
tar -xvf Mus_musculus_UCSC_mm10.tar.gz
cd /Mus_musculus_UCSC_mm10/Mus_musculus/UCSC/mm10/Sequence/Chromosomes

Change the chromosome header and combine the chromosomes into a full genome. Note that we exclude random chromosomes and the mitochondria chromosome in our analysis.

cat ch1.fa chr2.fa chr3.fa chr4.fa chr5.fa chr6.fa chr7.fa chr8.fa chr9.fa chr10.fa chr11.fa chr12.fa chr13.fa chr14.fa chr15.fa chr16.fa chr17.fa chr18.fa chr19.fa chr20.fa chr21.fa chr22.fa chrX.fa chrY.fa > mm10.fa 


Note 1 : Make sure each chromosome is named by '>chrN' instead of '>N'

Note 2: In our analysis, we do not include random chromosomes or chrM, which might not be the best for certain projects.


Finally, create a BWA index and move it to a directory you like. In this example, the index is in the /genomes/mm10/bwa/ directory.

cd /genomes/mm10/bwa/
bwa index -a bwtsw mm10.fa


Genome assemblies with annotations included in CTK

While CTK is not limited to specific species/genome assemblies in general, several steps require gene annotations. Currently the annotation files of the following assemblies have been included as part of CTK:

  • hg38
  • hg19
  • mm10
  • dm6

If you are interested in certain genome assemblies not currently supported by CTK, feel free to let us know by posting in our Google CTK user group.



CTK home | Standard/BrdU-CLIP | iCLIP | eCLIP | PAR-CLIP | CTK usage | FAQ