Difference between revisions of "CIMS Documentation"

From Zhang Laboratory

=Introduction=

The CIMS software package has been replaced by [CTK CLIP Tool Kit (CTK)].

Crosslinking induced mutation site (CIMS) analysis is a computational method for HITS-CLIP data analysis that determines the exact protein-RNA crosslink sites and thereby maps protein-RNA interactions at single-nucleotide resolution.  The method is based on the observation that UV-crosslinked amino-acid-RNA adducts introduce reverse transcription errors in cDNAs at certain frequencies; these errors are captured by sequencing and by comparing CLIP tags with the reference genome.  More details can be found in the following references:
<pre>
Zhang, C. †, Darnell, R.B. † 2011. Mapping in vivo protein-RNA interactions at single-nucleotide resolution from HITS-CLIP data. Nat. Biotech. 29:607-614.

Moore, M.J.*, Zhang, C.*, Gantman, E.C., Mele, A., Darnell, J.C., Darnell, R.B. 2014. Mapping Argonaute and conventional RNA-binding protein interactions with RNA at single-nucleotide resolution using HITS-CLIP and CIMS analysis. Nat. Protocols 9(2):263-293. doi:10.1038/nprot.2014.012.
</pre>

This brief document provides only the most critical information about how to run the program.  It complements the more detailed, step-by-step guide in the second reference above.

For the crosslinking induced truncation site (CITS) analysis described below, please refer to:

<pre>
Weyn-Vanhentenryck, S.M.*, Mele, A.*, Yan, Q.*, Sun, S., Farny, N., Zhang, Z., Xue, C., Herre, M., Silver, P.A., Zhang, M.Q., Krainer, A.R., Darnell, R.B. †, Zhang, C. † 2014. HITS-CLIP and integrative modeling define the Rbfox splicing-regulatory network linked to brain development and autism. Cell Rep. doi:10.1016/j.celrep.2014.02.005.
</pre>

=Versions=
*v1.0.5 (03-10-2015), current
**Dramatic improvement in speed and memory usage when collapsing tags using random barcodes for high-complexity libraries
*v1.0.4 (12-04-2014)
**Improved computational efficiency and memory usage
**Requires czplib v1.0.5 or later
*v1.0.3 (05-05-2014)
**Included scripts and documentation for crosslinking induced truncation site (CITS) analysis
*v1.0.2 (08-15-2013)
*v1.0.1 (05-22-2013)
**Minor internal extension
**Included joinWrapper.py, which was missing in the previous version
*v1.0.0 (12-14-2012)
**The initial public release

=Download=

'''Source code:'''

*czplib (perl): a perl library with various functions for genomic/bioinformatic analysis. ([http://sourceforge.net/p/czplib/ download from SourceForge.net])
*CIMS (perl): the core algorithm. ([http://sourceforge.net/p/ngs-cims/ download from SourceForge.net])

=Installation=

==Prerequisites==

This software is implemented in perl.  It also relies on several standard Linux/Unix tools such as grep, cat, and sort.  We have tested the software on RedHat Linux, although it is expected to work on most Unix-like systems, including Mac OS X.

==Steps to install the software==

* Download the perl library czplib, if you have not already done so.

Decompress it and move it to a location of your choice:

<pre>
$ tar zxvf czplib.v1.0.x.tgz
$ mv czplib /usr/local/lib
</pre>

Add the library path to the PERL5LIB environment variable so that perl can find it:

<pre>
export PERL5LIB=/usr/local/lib/czplib
</pre>

* Download the CIMS code, if you have not already done so.

Decompress it and move it to a location of your choice:

<pre>
$ tar zxvf CIMS.v1.0.x.tgz
$ chmod 755 CIMS/*.pl
$ mv CIMS /usr/local/CIMS
</pre>

Add this directory to your $PATH environment variable.
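
Both environment variables can be set in a shell startup file.  The following is a minimal sketch, assuming the install locations /usr/local/lib/czplib and /usr/local/CIMS used in the commands above; prepend_path is a hypothetical helper written for this illustration and is not part of CIMS or czplib.

```shell
# Assumed install locations, taken from the installation commands above.
CZPLIB_DIR=/usr/local/lib/czplib
CIMS_DIR=/usr/local/CIMS

# Hypothetical helper: print the colon-separated list $2 with directory $1
# prepended, unless $1 is already one of its entries.
prepend_path() {
    case ":$2:" in
        *":$1:"*) printf '%s\n' "$2" ;;
        *)
            if [ -n "$2" ]; then
                printf '%s:%s\n' "$1" "$2"
            else
                printf '%s\n' "$1"
            fi
            ;;
    esac
}

# Let perl find czplib, and let the shell find the CIMS scripts.
PERL5LIB=$(prepend_path "$CZPLIB_DIR" "${PERL5LIB:-}")
PATH=$(prepend_path "$CIMS_DIR" "$PATH")
export PERL5LIB PATH
```

Because the helper skips directories that are already listed, sourcing the file repeatedly does not keep lengthening either variable.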

=CIMS analysis=
==Input files==

The key script to run is CIMS.pl, which takes two BED files as input: a list of unique CLIP tags (properly mapped to the reference genome), and the coordinates of mutations (deletions, insertions, or substitutions), given both in the reference genome and relative to the CLIP tags.  These files can be generated by following our '''Nature Protocols''' paper.  If you generate these files with your own programs, it is '''critical''' to make sure that:

* only one type of mutation is analyzed at a time.
* the 4th column of the mutation BED file matches the name of the CLIP tag in the first BED file.
* the coordinate of each mutation relative to the CLIP tag (from the 5' end of the Watson strand, 0-based) is correctly specified in the 5th column of the second BED file.
* only mutations in unique CLIP tags are included.
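
Two of these requirements can be checked mechanically before running CIMS.pl.  The following is a minimal sketch, not part of the CIMS package: check_mutations is a hypothetical helper, the sample files are made up, and the offset check assumes unspliced tags whose length equals the span between BED columns 2 and 3.

```shell
# Hypothetical sample data; real files come from the mapping pipeline.
checkdir=$(mktemp -d)
printf 'chr1\t100\t140\ttagA\t1\t+\nchr1\t200\t240\ttagB\t1\t-\n' > "$checkdir/tag.bed"
printf 'chr1\t110\t111\ttagA\t10\t+\nchr1\t250\t251\ttagC\t50\t-\n' > "$checkdir/mutation.bed"

# Usage: check_mutations <tag.bed> <mutation.bed>
# Prints every mutation line whose tag name (column 4) is absent from the
# tag file, or whose relative offset (column 5) lies outside the tag.
check_mutations() {
    awk -F '\t' '
        # First file: remember the genomic span of each tag by name.
        FILENAME == ARGV[1] { len[$4] = $3 - $2; next }
        # Second file: report problems, one line per offending mutation.
        !($4 in len)            { print "unknown_tag\t" $0; next }
        $5 < 0 || $5 >= len[$4] { print "bad_offset\t" $0 }
    ' "$1" "$2"
}

check_mutations "$checkdir/tag.bed" "$checkdir/mutation.bed"
```

An empty result means that no inconsistency of these two kinds was found; it does not verify that the offsets themselves were computed correctly.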

Now you can run something like:

<pre>
perl /usr/local/CIMS/CIMS.pl -v -n 5 -p -FDR 0.001 -c ./cache_del test.uniq.bed test.uniq.del.bed test.uniq.del.CIMS.txt
</pre>

The output is a list of CIMS at FDR<0.001, one per line.

The first six columns of this file follow the BED format, including the coordinates and strand of each CIMS.  Columns 7-10 are k (the total number of tags at that position), m (the number of those tags with mutations), the FDR, and the number of sites with m or more tags with mutations given k tags at that position in total (the denominator used to calculate the FDR, which gives an idea of the precision of the FDR value).

This file can be reordered by FDR (ascending), then m (descending), then k (ascending) with the following command:

<pre>
sort -k 9,9n -k 8,8nr -k 7,7n test.uniq.del.CIMS.txt > test.uniq.del.CIMS.sort.txt
</pre>
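
The output columns also make it easy to filter sites after the run.  The following is a minimal sketch, assuming a tab-delimited output file with the ten columns described above; filter_cims is a hypothetical helper and the sample rows are made up.  CIMS.pl itself applies thresholds through its -FDR and -mkr options, so this is only useful when a run was done with permissive thresholds.

```shell
# Hypothetical CIMS output rows: BED columns 1-6, then k, m, FDR, count.
cims_out=$(mktemp)
printf 'chr1\t100\t101\tsite1\t5\t+\t20\t5\t0.0005\t12\n'  > "$cims_out"
printf 'chr1\t300\t301\tsite2\t2\t-\t50\t2\t0.0008\t40\n' >> "$cims_out"
printf 'chr2\t500\t501\tsite3\t3\t+\t10\t3\t0.02\t7\n'    >> "$cims_out"

# Usage: filter_cims <cims.txt> <max_fdr> <min_mkr>
# Keeps sites with FDR (column 9) below max_fdr and m/k (column 8 divided
# by column 7) at or above min_mkr, and appends m/k as an extra column.
filter_cims() {
    awk -F '\t' -v OFS='\t' -v max_fdr="$2" -v min_mkr="$3" '
        $7 > 0 && $9 < max_fdr && $8 / $7 >= min_mkr { print $0, $8 / $7 }
    ' "$1"
}

filter_cims "$cims_out" 0.001 0.1
```

With the sample rows, only site1 passes: site2 has an m/k ratio of 0.04 and site3 has an FDR of 0.02.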

==Usage==
<pre>
CIMS.pl [options] <tag.bed> <mutation.bed> <out.txt>
</pre>

Arguments:
{|class="wikitable" width="100%" style="border:1px solid"
|-
!scope="column" width=150|'''Argument'''
!'''Description'''
|-
|<tag.bed>
|BED file of unique CLIP tags
|-
|<mutation.bed>
|BED file of mutations in unique CLIP tags; see the requirements listed under "Input files" above
|-
|<out.txt>
|output file with the list of CIMS
|}

Options:
{|class="wikitable" width="100%" style="border:1px solid"
|-
!scope="column" width=150|'''Option'''
!'''Description'''
|-
| -big
| input files are big (e.g. over 6 million lines)
|-
| -n [int]
| number of iterations for permutation (default: 5)
|-
| -p
| track mutation position relative to read start
|-
| --no-sparse-correct
| no sparsity correction *
|-
| -FDR [float]
| threshold of FDR (default: 1)
|-
| -mkr [float]
| threshold of m-over-k ratio (default: 0)
|}

<nowiki>*</nowiki>This option should not be used in general; it is included only to reproduce our earlier analysis.  We introduced the sparsity correction to eliminate an additional filtering step based on mutation frequency (i.e., the "m" value).

=CITS analysis=
Some variations of the CLIP protocol, including BrdU-CLIP and iCLIP, capture tags that are truncated at the crosslink sites (as well as tags that read through them).  This section describes our method to detect robust crosslinking induced truncation sites (CITS).

CITS analysis requires unique CLIP tags as a BED file, which is used to obtain the coordinates of potential truncation sites, defined as the position immediately upstream of the first nucleotide of each CLIP tag.

For the program to work, it is '''critical''' to note that we collapse all CLIP tags with the same start in a library to eliminate potential PCR duplicates.  Therefore, one needs multiple independent CLIP libraries, or CLIP libraries with degenerate barcodes, so that several tags with the same start can be retained and identified as potential truncation sites.

<pre>
perl ~/scripts/bedExt.pl -n up -l "-1" -r "-1" -v tag.uniq.bed tag.uniq.trunc.bed
</pre>
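
To make the coordinate definition concrete, the following is a minimal sketch of how the position immediately upstream of each tag could be derived from a six-column BED file.  It is an illustration of the definition above, not the implementation of bedExt.pl; truncation_sites is a hypothetical helper, the sample tags are made up, and BED coordinates are assumed to be 0-based and half-open.

```shell
# Hypothetical unique tags: one on each strand.
tags=$(mktemp)
printf 'chr1\t100\t140\ttagA\t1\t+\nchr1\t200\t240\ttagB\t1\t-\n' > "$tags"

# Usage: truncation_sites <tag.bed>
# Prints a 1-nt interval immediately upstream of the first nucleotide of
# each tag, taking the strand into account.
truncation_sites() {
    awk -F '\t' -v OFS='\t' '
        # Plus strand: the tag starts at column 2, so upstream is start-1.
        $6 == "+" && $2 > 0 { print $1, $2 - 1, $2, $4, $5, $6 }
        # Minus strand: the first nucleotide is end-1, so upstream is end.
        $6 == "-"           { print $1, $3, $3 + 1, $4, $5, $6 }
    ' "$1"
}

truncation_sites "$tags"
```

For the two sample tags this gives the intervals 99-100 on the plus strand and 240-241 on the minus strand.  Plus-strand tags that start at position 0 are skipped because they have no upstream position.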

Since CLIP tags with crosslinking induced mutations (i.e., deletions) most likely represent tags that read through the crosslink site, it is recommended that these tags be removed.

<pre>
cut -f 4 tag.uniq.del.bed | sort | uniq > tag.uniq.del.id
python ~/src/CIMS/joinWrapper.py tag.uniq.trunc.bed tag.uniq.del.id 4 1 V tag.uniq.trunc.clean.bed
</pre>

Here tag.uniq.del.bed is the list of deletions in unique tags, which you should have obtained during CIMS analysis.
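
The removal step amounts to keeping the truncation sites whose tag name does not appear in the ID list.  The following is a minimal sketch of that idea in awk, assuming that this is what the joinWrapper.py call above does with column 4 of the first file and column 1 of the second; remove_tags is a hypothetical helper and the sample files are made up.

```shell
# Hypothetical truncation sites and a list of tag IDs that carry deletions.
cleandir=$(mktemp -d)
printf 'chr1\t99\t100\ttagA\t1\t+\nchr1\t240\t241\ttagB\t1\t-\nchr2\t49\t50\ttagC\t1\t+\n' > "$cleandir/trunc.bed"
printf 'tagB\n' > "$cleandir/del.id"

# Usage: remove_tags <trunc.bed> <id.list>
# Prints the lines of the BED file whose name (column 4) is not in the list.
remove_tags() {
    awk -F '\t' '
        # ID list is read first: remember each ID.
        FILENAME == ARGV[1] { drop[$1]; next }
        # BED file: keep lines whose tag name was not listed.
        !($4 in drop)
    ' "$2" "$1"
}

remove_tags "$cleandir/trunc.bed" "$cleandir/del.id"
```

With the sample data, the tagB site is dropped and the tagA and tagC sites are kept in their original order.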

We now cluster all unique CLIP tags, if this has not already been done.  The awk command keeps only the clusters whose value in column 5 is greater than 2.
<pre>
perl ~/scripts/tag2cluster.pl -s -maxgap "-1" -of bed -v tag.uniq.bed tag.uniq.cluster.0.bed
awk '{if($5>2) {print $0}}' tag.uniq.cluster.0.bed > tag.uniq.cluster.bed
</pre>

In the next step, these clusters are used to shuffle truncation events in order to evaluate the statistical significance of their reproducibility.  This is more stringent than gene-based permutation.

<pre>
perl ~/scripts/tag2peak.pl -c ./cache -ss -v -gap 25 -p 0.001 tag.uniq.cluster.bed tag.uniq.trunc.clean.bed tag.uniq.CITS.s30.bed
</pre>

This script defines CITS with p < 0.001 using scan statistics (instead of actually performing random permutation).  Sites within 25 nt of each other are clustered together.  To apply Bonferroni multiple-testing correction as well, add the option --multi-test.
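
As a reminder of what Bonferroni correction does, the following is a generic sketch: with n tests, each p-value is multiplied by n (capped at 1) before it is compared with the threshold.  It is not necessarily how --multi-test is implemented in tag2peak.pl; the two-column input (site name, p-value), the sample values, and the bonferroni helper are all hypothetical.

```shell
# Hypothetical per-site p-values: name, p-value.
pvals=$(mktemp)
printf 'siteA\t0.00001\nsiteB\t0.0004\nsiteC\t0.02\nsiteD\t0.0002\n' > "$pvals"

# Usage: bonferroni <pvalues.txt> <alpha>
# Prints the sites whose Bonferroni-adjusted p-value is below alpha,
# together with the adjusted value.
bonferroni() {
    # The number of tests is taken to be the number of lines in the file.
    n=$(awk 'END { print NR }' "$1")
    awk -F '\t' -v OFS='\t' -v n="$n" -v alpha="$2" '
        { adj = $2 * n; if (adj > 1) adj = 1 }
        adj < alpha { print $1, adj }
    ' "$1"
}

bonferroni "$pvals" 0.001
```

With four tests, siteB (adjusted p = 0.0016) no longer passes the 0.001 threshold, whereas siteA and siteD still do.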

The output file tag.uniq.CITS.s30.bed is the list of robust CITS.  One can perform further analysis, such as motif enrichment analysis or de novo motif discovery, to evaluate the signal-to-noise ratio and fine-tune the parameters.

Revision as of 23:22, 11 October 2016
