Test

From Zhang Laboratory

Revision as of 08:22, 5 May 2014 by Czhang (Talk | contribs) (CITS analysis)


Introduction

Crosslinking-induced mutation site (CIMS) analysis is a computational method for HITS-CLIP data analysis that determines the exact protein-RNA crosslink sites and thereby maps protein-RNA interactions at single-nucleotide resolution. The method is based on the observation that UV-crosslinked amino acid-RNA adducts introduce reverse transcription errors in cDNAs at certain frequencies; these errors are captured by sequencing and detected by comparing CLIP tags with the reference genome. More details can be found in the following references:

Zhang, C.†, Darnell, R.B.† 2011. Mapping in vivo protein-RNA interactions at single-nucleotide resolution from HITS-CLIP data. Nat. Biotech. 29:607-614.

Moore, J.*, Zhang, C.*, Gantman, E.C., Mele, A., Darnell, J.C., Darnell, R.B. 2014. Mapping Argonaute and conventional RNA-binding protein interactions with RNA at single-nucleotide resolution using HITS-CLIP and CIMS analysis. Nat. Protocols. 9(2):263-293. doi:10.1038/nprot.2014.012.

This brief document provides only the most critical information about how to run the program, which complements a more detailed, step-by-step guide described in the second reference above.

Versions

  • v1.0.1 ( 5-22-2013 ), current
    • Minor internal extension
    • Included joinWrapper.py which was missing in the previous version
  • v1.0.0 ( 12-14-2012 )
    • The initial public release

Download

Source code:

Installation

Prerequisites

This software is implemented in Perl. It also relies on several standard Linux/Unix tools such as grep, cat, and sort. We have tested the software on RedHat Linux, and it is expected to work on most Unix-like systems, including Mac OS X.

Steps to install the software

  • Download the Perl library files (czplib), if you have not already done so.

Decompress the archive and move it to a location of your choice:

$tar zxvf czplib.v1.0.x.tgz
$mv czplib /usr/local/lib

Add the library path to the PERL5LIB environment variable so that perl can find it:

export PERL5LIB=/usr/local/lib/czplib

  • Download the CIMS code, if you have not already done so.

Decompress the archive and move it to a location of your choice:

$tar zxvf CIMS.v1.0.x.tgz
$chmod 755 CIMS/*.pl
$mv CIMS /usr/local/CIMS

Add this directory to your $PATH environment variable.
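
For a bash-compatible shell, the two environment settings described in these steps might look like the sketch below. It assumes the install locations used above (/usr/local/lib/czplib and /usr/local/CIMS); to make the settings persistent you would typically add the lines to a shell startup file such as ~/.bashrc, which is an assumption about your setup.

```shell
# Assumes a bash-compatible shell and the install locations used above.

# Append the CIMS directory to PATH so CIMS.pl can be called by name.
export PATH="$PATH:/usr/local/CIMS"

# Export PERL5LIB so perl can locate the czplib modules;
# any existing PERL5LIB value is kept after the new entry.
export PERL5LIB="/usr/local/lib/czplib${PERL5LIB:+:$PERL5LIB}"
```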

CIMS analysis

Input files

The key script to run is CIMS.pl, which takes two BED files as input: a list of unique CLIP tags (properly mapped to the reference genome), and the coordinates of mutations (deletions, insertions, or substitutions) both in the reference genome and relative to the CLIP tags. It is critical to make sure that:

  • only one type of mutation is analyzed at a time.
  • the 4th column of the mutation BED file matches the name of the CLIP tag in the first BED file.
  • the coordinates of mutations relative to the CLIP tag (from the 5' end of the Watson strand, 0-based) are correctly specified in the 5th column of the second BED file.
  • only mutations in unique CLIP tags are included.
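
Some of the checks listed above can be automated before running CIMS.pl. The following is a minimal sketch, not part of the CIMS package: the function name check_mutations and the toy records are made up for illustration. It only verifies that each mutation refers to a known tag name and that the 5th-column offset lies within the tag, assuming tab-delimited BED files.

```shell
# Toy input, made up for illustration: two tags and three mutation records.
tmp=$(mktemp -d)
printf 'chr1\t100\t150\ttagA\t1\t+\nchr1\t300\t340\ttagB\t1\t-\n' > "$tmp/tag.bed"
printf 'chr1\t110\t111\ttagA\t10\t+\nchr1\t320\t321\ttagB\t45\t-\nchr1\t500\t501\ttagC\t3\t+\n' > "$tmp/mut.bed"

# Hypothetical helper: check_mutations <tag.bed> <mutation.bed>
# First file: remember each tag's length by name (column 4).
# Second file: report mutations whose tag is unknown or whose
# offset (column 5) falls outside the tag.
check_mutations() {
    awk -F'\t' '
        FILENAME == ARGV[1] { len[$4] = $3 - $2; next }
        !($4 in len) { print "unknown_tag\t" $4; next }
        $5 < 0 || $5 >= len[$4] { print "bad_offset\t" $4 "\t" $5 }
    ' "$1" "$2"
}

report=$(check_mutations "$tmp/tag.bed" "$tmp/mut.bed")
printf '%s\n' "$report"
rm -rf "$tmp"
```

With the toy input, the record for tagB is reported because its offset (45) exceeds the tag length (40), and the record for tagC is reported because no tag with that name exists. An empty report means these two checks passed; it does not verify that the file contains only one type of mutation.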


You can then run a command such as:

perl /usr/local/CIMS/CIMS.pl -v -n 5 -p -FDR 0.001 -c ./cache_del  test.uniq.bed test.uniq.del.bed test.uniq.del.CIMS.txt


The output is a list of CIMS at FDR<0.001, one per line.

The first six columns of this file follow the definition of a BED file, including the coordinates and strand of each CIMS. Columns 7-10 are k (the total number of tags at the position), m (the number of those tags with mutations), the FDR, and the number of sites with m or more mutated tags given k tags in total at that position. The last value is the denominator used to calculate the FDR and gives an idea of the precision of the FDR value.

This file can be reordered with the following command:

sort test.uniq.del.CIMS.txt -k 9,9n -k 8,8nr -k 7,7n > test.uniq.del.CIMS.sort.txt
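
Because k and m are stored in columns 7 and 8, the m/k ratio of each site can also be computed after the fact. The sketch below is an illustration under the column layout described above; the function name filter_by_ratio, the 0.1 cutoff, and the toy rows are assumptions, not real results or part of the CIMS package.

```shell
# Toy CIMS output rows, made up for illustration (not real results).
# Columns: chrom, start, end, name, score, strand, k, m, FDR, site count.
tmp=$(mktemp -d)
printf 'chr1\t110\t111\tsite1\t5\t+\t20\t5\t0.0004\t12\n' > "$tmp/cims.txt"
printf 'chr2\t900\t901\tsite2\t2\t-\t50\t2\t0.0009\t30\n' >> "$tmp/cims.txt"

# Hypothetical helper: filter_by_ratio <cims.txt> <min_ratio>
# Keeps sites with m/k >= min_ratio and appends the ratio as an extra column.
filter_by_ratio() {
    awk -F'\t' -v OFS='\t' -v min="$2" '
        $7 > 0 && $8 / $7 >= min { print $0, $8 / $7 }
    ' "$1"
}

kept=$(filter_by_ratio "$tmp/cims.txt" 0.1)
printf '%s\n' "$kept"
rm -rf "$tmp"
```

In the toy data only site1 is kept (m/k = 5/20 = 0.25); site2 has m/k = 2/50 = 0.04 and is dropped.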

Usage

CIMS.pl [options] <tag.bed> <mutation.bed> <out.txt>

Arguments:

Argument         Description
<tag.bed>        BED file of unique CLIP tags
<mutation.bed>   BED file of mutations in unique CLIP tags (see the notes on input files above)
<out.txt>        output file with the list of CIMS


Options:

Option               Description
-big                 input files are big (e.g. over 6 million lines)
-n [int]             number of iterations for permutation (default: 5)
-p                   track mutation position relative to read start
--no-sparse-correct  no sparsity correction *
-FDR [float]         threshold of FDR (default: 1)
-mkr [float]         threshold of m-over-k ratio (default: 0)

*This option should not be used in general, but is included to reproduce our earlier analysis. We introduced the sparsity correction to eliminate an additional filtering step based on mutation frequency (i.e., the "m" value).

CITS analysis

Some variations of the CLIP protocol, including BrdU-CLIP and iCLIP, allow capture of tags that are truncated at the crosslink sites (as well as tags that read through). This section describes our method to detect robust crosslinking-induced truncation sites (CITS).

CITS analysis requires unique CLIP tags as a BED file, which is used to obtain the coordinates of each potential truncation site, defined as the position immediately upstream of the first nucleotide of the CLIP tag.

perl ~/scripts/bedExt.pl -n up -l "-1" -r "-1" -v tag.uniq.bed tag.uniq.trunc.bed
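
To make the definition of the truncation site concrete, the sketch below computes the position immediately upstream of each tag directly with awk. It reflects one reading of the definition in 0-based, half-open BED coordinates and is not a description of how bedExt.pl works internally; the function name truncation_sites and the toy tags are made up for illustration.

```shell
# Toy tags, made up for illustration.
tmp=$(mktemp -d)
printf 'chr1\t100\t150\ttagA\t1\t+\nchr1\t300\t340\ttagB\t1\t-\n' > "$tmp/tag.uniq.bed"

# Hypothetical helper: truncation_sites <tag.bed>
# For a + strand tag [start, end) the upstream position is [start-1, start);
# for a - strand tag it is [end, end+1). A + strand tag starting at 0 has
# no upstream position and is skipped.
truncation_sites() {
    awk -F'\t' -v OFS='\t' '
        $6 == "+" && $2 > 0 { print $1, $2 - 1, $2, $4, $5, $6 }
        $6 == "-"           { print $1, $3, $3 + 1, $4, $5, $6 }
    ' "$1"
}

sites=$(truncation_sites "$tmp/tag.uniq.bed")
printf '%s\n' "$sites"
rm -rf "$tmp"
```

For the toy tags this yields chr1:99-100 for tagA (+ strand) and chr1:340-341 for tagB (- strand).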


Since CLIP tags with crosslinking-induced mutations (i.e. deletions) most likely represent tags that read through the crosslink site, it is recommended that these tags be removed.

cut -f 4 tag.uniq.del.bed | sort | uniq > tag.uniq.del.id
perl ~/scripts/removeRow.pl -q 3 -v tag.uniq.trunc.bed tag.uniq.del.id > tag.uniq.trunc.clean.bed
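
The same filtering idea can be sketched with awk alone, which may help when checking the result. This is an illustration only: it assumes the tag name is in the 4th column of the BED file and that the ID file has one name per line; the function name remove_tags and the toy data are made up, and the sketch is not a replacement for removeRow.pl.

```shell
# Toy truncation sites and a toy list of tag IDs with deletions,
# both made up for illustration.
tmp=$(mktemp -d)
printf 'chr1\t99\t100\ttagA\t1\t+\nchr1\t340\t341\ttagB\t1\t-\n' > "$tmp/trunc.bed"
printf 'tagB\n' > "$tmp/del.id"

# Hypothetical helper: remove_tags <sites.bed> <ids.txt>
# Reads the ID list first, then prints only BED rows whose name
# (column 4) is not in the list.
remove_tags() {
    awk -F'\t' '
        FILENAME == ARGV[1] { drop[$1]; next }
        !($4 in drop)
    ' "$2" "$1"
}

clean=$(remove_tags "$tmp/trunc.bed" "$tmp/del.id")
printf '%s\n' "$clean"
rm -rf "$tmp"
```

Here tagB is listed as carrying a deletion, so only the tagA row remains.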

We now cluster all potential truncation events

perl ~/scripts/bedExt.pl -n up -l "-1" -r "-1" -v tag.uniq.trunc.clean.bed tag.uniq.trunc.clean.cluster.bed

We also cluster the CLIP tags and keep only clusters whose 5th column (score) is greater than 2:

perl ~/scripts/tag2cluster.pl -s -maxgap "-1" -of bed -v tag.uniq.bed tag.uniq.cluster.0.bed
awk '{if($5>2) {print $0}}' tag.uniq.cluster.0.bed > tag.uniq.cluster.bed