Difference between revisions of "MCarts Documentation"

From Zhang Laboratory


Revision as of 16:24, 19 September 2012


Prediction of clustered RNA-binding protein motif sites in the mammalian genome Chaolin Zhang,1,* Kuang-Yung Lee2,3, Maurice S. Swanson2, Robert B. Darnell1,*

1 Laboratory of Molecular Neuro-Oncology, Howard Hughes Medical Institute, The Rockefeller University, 1230 York Avenue, New York, NY 10021, USA 2 Department of Molecular Genetics and Microbiology and the Center for NeuroGenetics, University of Florida, College of Medicine, Gainesville, FL 32610, USA 3 Department of Neurology, Chang Gung Memorial Hospital, Keelung, Taiwan

* Corresponding authors


Introduction

mCarts is a hidden Markov model (HMM)-based method to predict clusters of RNA motif sites.

Many RBPs recognize very short and degenerate sequences, with targeting specificity achieved by mechanisms such as synergistic binding to multiple clustered sites and modulation of site accessibility through different RNA secondary structures. mCarts integrates the number and spacing of individual motif sites with their accessibility and conservation, which substantially improves the signal-to-noise ratio. The algorithm learns and quantifies rules for these features, taking advantage of a large number of in vivo RBP binding sites obtained from high-throughput sequencing of RNAs isolated by cross-linking and immunoprecipitation (HITS-CLIP). We applied this algorithm to two representative RBPs, Nova and Mbnl. Despite the very low information content of individual motif elements, the algorithm made very specific predictions that were successfully validated experimentally.

Download

Source code:

  • czplib (Perl): a Perl library with various functions for genomic/bioinformatic analysis
  • mCarts (Perl): the core algorithm
  • PatternMatch (C/C++): a handy tool to search individual motif sites based on a consensus sequence. It supports degeneracy and mismatches.
  • RegExpMatch (C/C++): a handy tool to search individual motif sites based on a regular expression

Library data:

  • mm9 (15 GB compressed / 109 GB uncompressed)
  • hg18 (15 GB compressed / 212 GB uncompressed)

Installation

Prerequisite

This software is implemented in Perl and C/C++. It also relies on several standard Linux/Unix tools such as grep, cat, and sort. We have tested the software on RedHat Linux, although it is expected to work on most Unix-like systems, including Mac OS X.

Steps to install the software

  • Download the Perl library czplib, if you have not already done so.

Decompress it and move it to a place you like.

tar zxvf czplib.1.0.0.tgz
mv czplib /usr/local/lib

Add the library path to the PERL5LIB environment variable so that perl can find it.

export PERL5LIB=/usr/local/lib/czplib
  • Download the mCarts code, if you have not already done so.

Decompress it and move it to a place you like.

tar zxvf mCarts.1.0.0.tgz
cd mCarts
chmod 755 *.pl
cd ..
mv mCarts /usr/local/mCarts

Add the dir to your $PATH environment variable.

  • Download and compile PatternMatch and RegExpMatch
tar zxvf PatternMatch.1.0.0.tgz
cd PatternMatch
make
chmod 755 PatternMatch
mv PatternMatch /usr/local/bin
cd ..

tar zxvf RegExpMatch.1.0.0.tgz
cd RegExpMatch
make
chmod 755 RegExpMatch
mv RegExpMatch /usr/local/bin

Make sure /usr/local/bin is already in your $PATH.

  • Download and decompress library files
tar zxvf mCart_lib_data_mm9.tgz
mv mCart_lib_data_mm9 /home/czhang/data/mCart_lib_data_mm9
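
As a quick check after installation, the environment settings and tools described above can be verified with a short shell sketch. It assumes the install locations used in the steps above (/usr/local/lib/czplib, /usr/local/mCarts, /usr/local/bin) and that the main program is invoked as mCarts, as in the example commands on this page. check_tool is a hypothetical helper, not part of the package.

```shell
# Assumed install locations from the steps above; adjust to your system.
export PERL5LIB="/usr/local/lib/czplib${PERL5LIB:+:$PERL5LIB}"
export PATH="/usr/local/mCarts:/usr/local/bin:$PATH"

# check_tool is a hypothetical helper: report whether a command is on $PATH.
check_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "found: $1"
    else
        echo "missing: $1"
        return 1
    fi
}

# Tools named in the installation steps.
for tool in perl PatternMatch RegExpMatch mCarts; do
    check_tool "$tool" || true
done
```

Adding the two export lines to a shell startup file such as ~/.bashrc should make the settings persist across sessions.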

Get started

mCarts requires genomic sequences of genic regions, multiple alignments, exon/intron/UTR annotations, RNA accessibility information, etc., all of which are provided in the library files (that's why they are so huge!).

In addition, mCarts takes two sets of genomic regions, provided in two BED files, to obtain motif sites in the positive and negative training datasets of the HMM. The positive training regions are typically (several thousand) regions of robust CLIP tag clusters. The negative training regions are typically genic regions without any CLIP tags. Only motif sites in genic regions (as defined by library files) are actually used.
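
For orientation, a BED file is a tab-separated text file whose first three columns give the chromosome, the 0-based start, and the end of each region; optional name, score, and strand columns may follow. The sketch below writes a tiny illustrative file (made-up coordinates, not real CLIP tag clusters) and runs a basic sanity check on it. validate_bed is a hypothetical helper, and which optional columns mCarts actually reads is not stated here, so the six-column layout is an assumption.

```shell
# Illustrative BED file with made-up coordinates (not real CLIP clusters).
bed="${TMPDIR:-/tmp}/example.pos.bed"
printf 'chr1\t3000000\t3000050\tcluster1\t20\t+\n' >  "$bed"
printf 'chr2\t5000100\t5000180\tcluster2\t17\t-\n' >> "$bed"

# validate_bed is a hypothetical helper: every record needs at least three
# tab-separated columns, numeric coordinates, and start < end.
validate_bed() {
    awk -F'\t' '
        /^(track|browser|#)/ { next }   # skip header and comment lines
        NF < 3 || $2 !~ /^[0-9]+$/ || $3 !~ /^[0-9]+$/ || $2 + 0 >= $3 + 0 {
            bad++
            print "line " NR ": malformed record"
        }
        END { exit (bad > 0) }
    ' "$1"
}

validate_bed "$bed" && echo "$bed looks well-formed"
```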


A real example to predict Nova binding YCAY clusters.

To get training data, click here (XX Mb).

There are two files in the compressed package. mm9.Nova.train.pos.bed specifies 6,231 non-repetitive, genic Nova CLIP tag clusters with peak height (PH) ≥ 15 that are located in exons or in the 1 kb of flanking intronic sequence on each side (exon+ext1k sequences). mm9.Nova.train.neg.bed specifies 110,998 exon+ext1k sequences in which no CLIP tags were present.
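
After unpacking, one way to confirm that the training files are intact is to count their records. This is a minimal sketch, assuming one region per line and the file names given above; count_regions is a hypothetical helper.

```shell
# count_regions is a hypothetical helper: count BED records, skipping
# blank lines and track/browser/comment lines.
count_regions() {
    awk '!/^(track|browser|#)/ && NF > 0 { n++ } END { print n + 0 }' "$1"
}

# File names as given in the text above; files that are not present are skipped.
for f in mm9.Nova.train.pos.bed mm9.Nova.train.neg.bed; do
    if [ -f "$f" ]; then
        echo "$f: $(count_regions "$f") regions"
    fi
done
```

If the files match the description above, the counts should be 6,231 and 110,998 respectively.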


It is recommended to divide the process into two steps:

  • training
mCarts -v -ref mm9 -w YCAY -f ./Nova.train.pos.bed -b ./Nova.train.neg.bed -lib /home/zhangc/data/mm9_mammal_input_data --train-only ./mm9_Nova_out

This command specifies verbose mode (-v), the reference genome (-ref mm9), the consensus motif to search (-w YCAY), the foreground or positive training regions (-f ./Nova.train.pos.bed), the background or negative training regions (-b ./Nova.train.neg.bed), the directory with library files (-lib /home/zhangc/data/mm9_mammal_input_data), model training only (--train-only), and the output directory (./mm9_Nova_out).

see a complete list of options.

This command will do the following things:

  1. Search for individual motif sites in the genic regions (with 10 kb extension on each side) in the reference genome (mm9) and many additional genomes (e.g., 19 other mammalian genomes aligned to mm9) and evaluate their conservation using branch length scores (BLS).
  2. Retrieve the RNA accessibility information as measured by probability of unpairedness (PU) or single strandedness. These scores for all tetramers in genic regions were pre-calculated, because calculating PU scores is quite slow.
  3. Intersect motif sites with positive and negative training regions to get training motif sites for the HMM.
  4. Estimate parameters of the HMM.
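
To make step 1 concrete: in the IUPAC nucleotide code, Y stands for a pyrimidine (C or T in genomic DNA), so the consensus YCAY corresponds to the regular expression [CT]CA[CT]. The sketch below is only an illustration using sed and grep. It is not how PatternMatch is implemented, iupac_to_regex is a hypothetical helper that covers just three codes, the test sequence is made up, and grep -o reports non-overlapping matches only.

```shell
# iupac_to_regex is a hypothetical helper: expand a few IUPAC codes
# (Y = C/T, R = A/G, N = any base) into regex character classes.
iupac_to_regex() {
    printf '%s\n' "$1" | sed -e 's/Y/[CT]/g' -e 's/R/[AG]/g' -e 's/N/[ACGT]/g'
}

regex=$(iupac_to_regex YCAY)
dna=TCATACACCCAT   # made-up sequence for illustration

# Print each non-overlapping occurrence of the motif on its own line.
printf '%s\n' "$dna" | grep -o "$regex"
```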

The following output files are of particular interest:

model.txt:


  • prediction
mCarts -v --exist-model ./mm9_Nova_out
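
The two steps can be kept together in a small wrapper script. This is a sketch under stated assumptions: it uses only the options shown in the commands above, the paths are the ones from this example and will differ on your system, and run, train, predict, and the DRY_RUN variable are hypothetical names introduced here.

```shell
# Paths taken from the example commands above; adjust to your system.
LIB_DIR=/home/zhangc/data/mm9_mammal_input_data
OUT_DIR=./mm9_Nova_out
POS_BED=./Nova.train.pos.bed
NEG_BED=./Nova.train.neg.bed

# run is a hypothetical helper: print the command when DRY_RUN=1,
# otherwise execute it.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Step 1: train the HMM only.
train() {
    run mCarts -v -ref mm9 -w YCAY -f "$POS_BED" -b "$NEG_BED" \
        -lib "$LIB_DIR" --train-only "$OUT_DIR"
}

# Step 2: predict with the model already stored in the output directory.
predict() {
    run mCarts -v --exist-model "$OUT_DIR"
}
```

Calling train and then predict (for example, train && predict) runs both steps in order; setting DRY_RUN=1 beforehand prints the commands instead of executing them.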