CIMS Documentation obsolete

From Zhang Laboratory

Revision as of 23:17, 11 October 2016 by Czhang

Introduction

Crosslinking induced mutation site (CIMS) analysis is a computational method for HITS-CLIP data analysis that determines the exact protein-RNA crosslink sites and thereby maps protein-RNA interactions at single-nucleotide resolution. The method is based on the observation that UV-crosslinked amino-acid-RNA adducts introduce reverse transcription errors in cDNAs at certain frequencies. These errors are captured by sequencing and detected by comparing CLIP tags with the reference genome. More details can be found in the following references:

Zhang, C.†, Darnell, R.B.† 2011. Mapping in vivo protein-RNA interactions at single-nucleotide resolution from HITS-CLIP data. Nat. Biotech. 29:607-614.

Moore, M.J.*, Zhang, C.*, Gantman, E.C., Mele, A., Darnell, J.C., Darnell, R.B. 2014. Mapping Argonaute and conventional RNA-binding protein interactions with RNA at single-nucleotide resolution using HITS-CLIP and CIMS analysis. Nat. Protocols. 9(2):263-293. doi:10.1038/nprot.2014.012.

This brief document covers only the most critical information about how to run the program. It complements the more detailed, step-by-step guide in the second reference above.

For the crosslinking induced truncation site (CITS) analysis described below, please refer to:

Weyn-Vanhentenryck, S.M.*, Mele, A.*, Yan, Q.*, Sun, S., Farny, N., Zhang, Z., Xue, C., Herre, M., Silver, P.A., Zhang, M.Q., Krainer, A.R., Darnell, R.B.†, Zhang, C.† 2014. HITS-CLIP and integrative modeling define the Rbfox splicing-regulatory network linked to brain development and autism. Cell Rep. doi:10.1016/j.celrep.2014.02.005.

Versions

  • v1.0.5 ( 03-10-2015 ), current
    • dramatic improvement in speed and memory usage in tag collapsing using random barcodes for high-complexity libraries.
  • v1.0.4 ( 12-04-2014 )
    • improved computational efficiency and memory usage
    • requires czplib v1.0.5 or later
  • v1.0.3 ( 05-05-2014 )
    • Included scripts and documentation for crosslinking induced truncation site (CITS) analysis
  • v1.0.2 ( 08-15-2013 )
  • v1.0.1 ( 05-22-2013 )
    • Minor internal extension
    • Included joinWrapper.py which was missing in the previous version
  • v1.0.0 ( 12-14-2012 )
    • The initial public release

Download

Source code:

Installation

Prerequisites

This software is implemented in Perl. It also relies on several standard Linux/Unix tools such as grep, cat, and sort. We have tested the software on RedHat Linux, although it is expected to work on most Unix-like systems, including Mac OS X.

Steps to install the software

  • Download the Perl library czplib, if you have not already.

Decompress it and move it to a location of your choice:

$ tar zxvf czplib.v1.0.x.tgz
$ mv czplib /usr/local/lib

Add the library path to the PERL5LIB environment variable, so that Perl can find it:

$ export PERL5LIB=/usr/local/lib/czplib

  • Download the CIMS code, if you have not already.

Decompress it, make the scripts executable, and move it to a location of your choice:

$ tar zxvf CIMS.v1.0.x.tgz
$ chmod 755 CIMS/*.pl
$ mv CIMS /usr/local/CIMS

Add the directory (here /usr/local/CIMS) to your $PATH environment variable.

CIMS analysis

Input files

The key script is CIMS.pl, which takes two BED files as input: a list of unique CLIP tags (properly mapped to the reference genome), and the coordinates of mutations (deletions, insertions, or substitutions) both in the reference genome and relative to the CLIP tags. These files can be generated by following our Nature Protocols paper. If you generate these files with your own programs, it is critical to make sure that:

  • only one type of mutation is analyzed at a time.
  • the 4th column of the mutation BED file matches the name of the CLIP tag in the tag BED file.
  • the coordinate of each mutation relative to the CLIP tag (from the 5' end of the Watson strand, 0-based) is correctly specified in the 5th column of the mutation BED file.
  • only mutations in unique CLIP tags are included.
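For files produced by custom scripts, a quick consistency check can catch violations of the rules above before CIMS.pl is run. The Python sketch below is not part of the CIMS package. The function name check_mutation_bed, the tab-delimited layout, and the requirement that a mutation lie within the genomic span of its tag are assumptions made for illustration.

```python
def check_mutation_bed(tag_lines, mutation_lines):
    """Return a list of problems found in mutation_lines (empty if none).

    Both inputs are iterables of tab-delimited BED lines (assumed layout:
    chrom, start, end, name, score, strand).
    """
    # Index the unique tags by name (column 4 of the tag BED file).
    tags = {}
    for line in tag_lines:
        chrom, start, end, name = line.rstrip("\n").split("\t")[:4]
        tags[name] = (chrom, int(start), int(end))

    problems = []
    for n, line in enumerate(mutation_lines, start=1):
        chrom, start, end, name, offset = line.rstrip("\n").split("\t")[:5]
        # Column 4 must name a tag present in the tag BED file.
        if name not in tags:
            problems.append(f"line {n}: tag '{name}' not in tag file")
            continue
        t_chrom, t_start, t_end = tags[name]
        # Assumption: a mutation lies inside the genomic span of its tag.
        if chrom != t_chrom or int(start) < t_start or int(end) > t_end:
            problems.append(f"line {n}: mutation lies outside tag '{name}'")
        # Column 5 holds the 0-based position relative to the tag.
        if not offset.isdigit():
            problems.append(f"line {n}: column 5 is not a non-negative integer")
    return problems


# Hypothetical example records, for illustration only.
tags = ["chr1\t100\t150\ttagA\t1\t+"]
muts = [
    "chr1\t120\t121\ttagA\t20\t+",  # consistent with tagA
    "chr1\t120\t121\ttagB\t20\t+",  # unknown tag name
    "chr1\t160\t161\ttagA\t-3\t+",  # outside tagA, invalid column 5
]
for problem in check_mutation_bed(tags, muts):
    print(problem)
```

The check does not verify that only one mutation type is present, since the BED columns described here do not record the type.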


Now you can run a command such as:

perl /usr/local/CIMS/CIMS.pl -v -n 5 -p -FDR 0.001 -c ./cache_del test.uniq.bed test.uniq.del.bed test.uniq.del.CIMS.txt


The output is a list of CIMS at FDR<0.001, one per line.

The first six columns of this file follow the definition of a BED file, including the coordinates and strand of each CIMS. Columns 7-10 are k (the total number of tags at that position), m (the number of those tags with mutations), the FDR, and the number of sites with m or more tags with mutations given k tags in total. The last value is the denominator used to calculate the FDR, and so gives an idea of the precision of the FDR value.
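To work with the output in a script, the ten columns can be given names. The following Python sketch assumes the file is tab-delimited; the field names (for example n_sites for column 10) and the example line are hypothetical and not taken from the CIMS code.

```python
from collections import namedtuple

# Field names are illustrative; columns 1-6 follow BED, columns 7-10 are
# k, m, FDR and the FDR denominator as described in the text.
CimsSite = namedtuple(
    "CimsSite", "chrom start end name score strand k m fdr n_sites"
)


def parse_cims_line(line):
    """Parse one tab-delimited line of CIMS output into a CimsSite."""
    f = line.rstrip("\n").split("\t")
    return CimsSite(
        chrom=f[0], start=int(f[1]), end=int(f[2]),
        name=f[3], score=f[4], strand=f[5],
        k=int(f[6]), m=int(f[7]),
        fdr=float(f[8]), n_sites=float(f[9]),
    )


# Made-up example line, for illustration only.
example = "chr1\t1000\t1001\tsite1\t12\t+\t40\t12\t0.0002\t35"
site = parse_cims_line(example)
print(site.k, site.m, site.fdr, site.m / site.k)
```

The ratio m / k computed at the end is the m-over-k ratio that the -mkr option thresholds.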

This file can be sorted by FDR (ascending), then by m (descending), then by k (ascending) with the following command:

sort -k 9,9n -k 8,8nr -k 7,7n test.uniq.del.CIMS.txt > test.uniq.del.CIMS.sort.txt

Usage

CIMS.pl [options] <tag.bed> <mutation.bed> <out.txt>

Arguments:

Argument        Description
<tag.bed>       BED file of unique CLIP tags
<mutation.bed>  BED file of mutations in unique CLIP tags (see the notes on input files above)
<out.txt>       Output file with the list of CIMS


Options:

Option               Description
-big                 Input files are big (e.g. over 6 million lines)
-n [int]             Number of iterations for permutation (default: 5)
-p                   Track mutation position relative to read start
--no-sparse-correct  No sparsity correction *
-FDR [float]         Threshold of FDR (default: 1)
-mkr [float]         Threshold of m-over-k ratio (default: 0)

*This option should not be used in general, but is included to reproduce our earlier analysis. We introduced sparsity correction to eliminate an additional filtering step based on mutation frequency (i.e., the "m" value).

CITS analysis

Some variations of the CLIP protocol, including BrdU-CLIP and iCLIP, capture tags that are truncated at the crosslink site (as well as tags that read through it). This section describes our method to detect robust crosslinking induced truncation sites (CITS).

CITS analysis requires a BED file of unique CLIP tags, which is used to obtain the coordinates of potential truncation sites, defined as the position immediately upstream of the first nucleotide of each CLIP tag.

For the program to work, it is critical to note that we collapse all CLIP tags with the same start in a library to eliminate potential PCR duplicates. Therefore, one needs either multiple independent CLIP libraries or CLIP libraries with degenerate barcodes, so that tags with the same start can be identified as potential truncation sites. The potential truncation sites are obtained with the following command:

perl ~/scripts/bedExt.pl -n up -l "-1" -r "-1" -v tag.uniq.bed tag.uniq.trunc.bed
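The definition of a potential truncation site (the position immediately upstream of the first nucleotide of a tag) can be written out explicitly. The Python sketch below is a reading of that definition in BED coordinates (0-based start, exclusive end), not a transcription of bedExt.pl; the function name truncation_site is hypothetical.

```python
def truncation_site(start, end, strand):
    """Return (start, end) of the 1-nt site just upstream of a tag's 5' end.

    Coordinates are BED-style: 0-based start, exclusive end.
    """
    if strand == "+":
        # The first nucleotide is at `start`; upstream is one position lower.
        return start - 1, start
    if strand == "-":
        # The first nucleotide is at `end - 1`; upstream on the minus strand
        # is one position higher in genomic coordinates.
        return end, end + 1
    raise ValueError(f"unknown strand: {strand!r}")


print(truncation_site(100, 150, "+"))
print(truncation_site(100, 150, "-"))
```

Note that the site depends on the strand: for a minus-strand tag it is derived from the tag's end coordinate rather than its start.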


Since CLIP tags with crosslinking induced mutations (i.e., deletions) most likely represent tags that read through the crosslink site, we recommend removing these tags.

cut -f 4 tag.uniq.del.bed | sort | uniq > tag.uniq.del.id
python ~/src/CIMS/joinWrapper.py tag.uniq.trunc.bed tag.uniq.del.id 4 1 V tag.uniq.trunc.clean.bed

Here tag.uniq.del.bed is the list of deletions in unique tags. You should have obtained this file by CIMS analysis.
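Conceptually, this filtering step keeps only the truncation sites whose tag name does not appear in the list of tags with deletions. The Python sketch below illustrates that idea under the assumption that the tag name is in column 4 of a tab-delimited file; it is not the implementation of joinWrapper.py, and the function name remove_read_through is hypothetical.

```python
def remove_read_through(trunc_lines, deletion_ids):
    """Keep truncation-site lines whose tag name (column 4) has no deletion."""
    ids = set(deletion_ids)
    kept = []
    for line in trunc_lines:
        name = line.rstrip("\n").split("\t")[3]
        if name not in ids:
            kept.append(line)
    return kept


# Made-up example records, for illustration only.
trunc = [
    "chr1\t99\t100\ttagA\t1\t+",
    "chr1\t199\t200\ttagB\t1\t+",
]
clean = remove_read_through(trunc, ["tagB"])
print(clean)
```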

Next, cluster all unique CLIP tags, if this has not been done already.

perl ~/scripts/tag2cluster.pl -s -maxgap "-1" -of bed -v tag.uniq.bed tag.uniq.cluster.0.bed
awk '{if($5>2) {print $0}}' tag.uniq.cluster.0.bed > tag.uniq.cluster.bed

In the next step, these clusters are used to shuffle truncation events in order to evaluate the statistical significance of reproducibility. This is more stringent than gene-based permutation.

perl ~/scripts/tag2peak.pl -c ./cache -ss -v -gap 25 -p 0.001 tag.uniq.cluster.bed tag.uniq.trunc.clean.bed tag.uniq.CITS.s30.bed

This script defines CITS with p < 0.001 using scan statistics (instead of actually performing random permutations). Sites within 25 nt of each other are clustered together. To also apply Bonferroni multiple-testing correction, add the option --multi-test.
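As a reminder of what a Bonferroni correction does in general, each p-value is multiplied by the number of tests and capped at 1. The Python sketch below shows this textbook form only; how tag2peak.pl counts the number of tests for --multi-test is not described here, so the choice of n as the number of p-values is an assumption.

```python
def bonferroni(p_values):
    """Return Bonferroni-adjusted p-values: min(1, p * number_of_tests)."""
    n = len(p_values)  # assumption: one test per p-value supplied
    return [min(1.0, p * n) for p in p_values]


# Made-up p-values, for illustration only.
adjusted = bonferroni([0.0001, 0.02, 0.5])
print(adjusted)
```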

The output file tag.uniq.CITS.s30.bed is the list of robust CITS. One can perform further analysis, such as motif enrichment analysis or de novo motif discovery, to evaluate the signal-to-noise ratio and fine-tune the parameters.