Difference between revisions of "Test"

From Zhang Laboratory

=Introduction=
+
{{#meta: keywords | computational biology; systems biology; RNA splicing and regulatory networks; gene expression }}{{#meta: description | The Chaolin Zhang Laboratory Home Page at Columbia University}}{{#meta: Content-Type | text/html; charset=utf-8 }}
  
mCarts is a hidden Markov model (HMM)-based method to predict clusters of RNA motif sites.
+
<html>
 +
    <script type="text/javascript" src="/data/Jssor/js/Jssor.Core.js"></script>
 +
    <script type="text/javascript" src="/data/Jssor/js/Jssor.Debug.js"></script>
 +
    <script type="text/javascript" src="/data/Jssor/js/Jssor.EventManager.js"></script>
 +
    <script type="text/javascript" src="/data/Jssor/js/Jssor.Point.js"></script>
 +
    <script type="text/javascript" src="/data/Jssor/js/Jssor.Utils.js"></script>
 +
    <script type="text/javascript" src="/data/Jssor/js/Jssor.Easing.js"></script>
 +
    <script type="text/javascript" src="/data/Jssor/js/Jssor.Navigator.js"></script>
 +
    <script type="text/javascript" src="/data/Jssor/js/Jssor.CaptionSlider.js"></script>
 +
    <script type="text/javascript" src="/data/Jssor/js/Jssor.Slider.js"></script>
 +
    <script type="text/javascript" src="/data/Jssor/js/Jssor.ThumbnailNavigator.js"></script>
  
Many RBPs recognize very short and degenerate sequences, with targeting specificity achieved by mechanisms such as synergistic binding to multiple clustered sites and modulation of site accessibility through different RNA secondary structures.  mCarts integrates the number and spacing of individual motif sites with their accessibility and conservation, which substantially improves the signal-to-noise ratio. The algorithm learns and quantifies rules for these features, taking advantage of a large number of in vivo RBP binding sites obtained from high-throughput sequencing of RNAs isolated by cross-linking and immunoprecipitation (HITS-CLIP). We applied this algorithm to study two representative RBPs, Nova and Mbnl. Despite the very low information content of individual motif elements, our algorithm made very specific predictions for successful experimental validation.
+
    <script>
 +
        var _SlideshowTransitions = [
 +
        //Clip & Chessout
 +
        {$Duration: 2000, $Cols: 8, $Rows: 4, $Clip: 15, $During: { $Top: [0, 0.5], $Clip: [0.5, 0.5] }, $FlyDirection: 8, $SlideOut: true, $Formation: $JssorSlideshowFormations$.$FormationStraight, $ChessMode: { $Column: 12 }, $ScaleClip: 0.5 },
  
'''Citation''':
+
        //DodgeinTeam
 +
        {$Duration: 1500, $Delay: 20, $Cols: 8, $Rows: 4, $Clip: 15, $FlyDirection: 9, $Formation: $JssorSlideshowFormations$.$FormationStraightStairs, $Easing: { $Left: $JssorEasing$.$EaseInJump, $Top: $JssorEasing$.$EaseInJump }, $ScaleHorizontal: 0.3, $ScaleVertical: 0.3, $Round: { $Left: 0.8, $Top: 0.8} },
  
Zhang, C. †, Lee, K.-Y., Swanson, M.S., Darnell, R.B. † 2013. Prediction of clustered RNA-binding protein motif sites in the mammalian genome. <i>Nucleic Acids Res</i>, in press.
+
        //DodgeoutTeam
 +
        {$Duration: 1500, $Delay: 20, $Cols: 8, $Rows: 4, $Clip: 15, $SlideOut: true, $FlyDirection: 9, $Formation: $JssorSlideshowFormations$.$FormationStraightStairs, $Easing: { $Left: $JssorEasing$.$EaseInJump, $Top: $JssorEasing$.$EaseInJump }, $ScaleHorizontal: 0.3, $ScaleVertical: 0.3, $Round: { $Left: 0.8, $Top: 0.8} },
  
=Download=
+
        //Flutterin
 +
        {$Duration: 1800, $Delay: 30, $Cols: 10, $Rows: 5, $Clip: 15, $FlyDirection: 1, $Formation: $JssorSlideshowFormations$.$FormationStraightStairs, $Assembly: 2050, $Easing: $JssorEasing$.$EaseInOutQuad, $ScaleHorizontal: 1, $Outside: true, $Round: { $Top: 0.8} },
  
'''Source code:'''
+
        //CollapseStairs
 +
        {$Duration: 1200, $Delay: 30, $Cols: 8, $Rows: 4, $Clip: 15, $SlideOut: true, $Formation: $JssorSlideshowFormations$.$FormationStraightStairs, $Assembly: 2049, $Easing: $JssorEasing$.$EaseOutQuad },
  
*czplib (perl): a perl library with various functions for genomic/bioinformatic analysis. ([http://sourceforge.net/p/czplib/ download from SourceForge.net])
+
        //CollapseRandom
*mCarts (perl): the core algorithm. ([http://sourceforge.net/p/mcarts/ download from SourceForge.net])
+
        {$Duration: 1000, $Delay: 30, $Cols: 8, $Rows: 4, $Clip: 15, $SlideOut: true, $Easing: $JssorEasing$.$EaseOutQuad }
*PatternMatch (c/c++): a handy tool to search individual motif sites based on consensus. It supports degeneracy and mismatches. ([http://sourceforge.net/projects/bio-patmatch/ download from SourceForge.net])
+
        ];
*RegExpMatch (c/c++): a handy tool to search individual motif sites based on regular expression ([http://sourceforge.net/projects/regexpmatch/ download from SourceForge.net])
+
    </script>
 +
    <script>
 +
        jssor_slider1_starter = function (containerId) {
 +
            var jssor_slider1 = new $JssorSlider$(containerId, {
 +
                $ShowLoading: true,                                //[Optional] Show loading screen or not default value is false
 +
                $AutoPlay: true,                                    //[Optional] Whether to auto play, default value is false
  
'''Library data: '''
+
                $SlideshowOptions: {                                //Options which specifies enable slideshow or not
 +
                    $Class: $JssorSlideshowRunner$,                //[Required] Class to create instance of slideshow
 +
                    $Transitions: _SlideshowTransitions,            //[Required] Transitions to play slide, see jssor slideshow transition builder
 +
                    $TransitionsOrder: 1,                          //[Required] The way to choose transition to play slide, 1 Sequence, 0 Random
 +
                    $ShowLink: 2                                    //[Required] 0 After Slideshow, 2 Always
 +
                }
 +
            });
 +
        }
 +
    </script>
 +
<center>
 +
    <div id="slider1_container" class="slider1" style="position: relative; width: 400px;
 +
        height: 200px;">
 +
       
 +
        <!-- Loading Screen -->
 +
        <div u="loading" style="position: absolute; top: 0px; left: 0px;">
 +
            <div style="filter: alpha(opacity=70); opacity:0.7; position: absolute; display: block;
 +
                background-color: #000000; top: 0px; left: 0px;width: 100%;height:100%;">
 +
            </div>
 +
            <div style="position: absolute; display: block; background: url(/data/Jssor/img/loading.gif) no-repeat center center;
 +
                top: 0px; left: 0px;width: 100%;height:100%;">
 +
            </div>
 +
        </div>
  
*mm9 (9.8 Gb compressed /61 Gb uncompressed): [http://zhanglab.c2b2.columbia.edu/data/mCarts/data/mCarts_lib_data_mm9.tgz download]
+
        <!-- Slides Container -->
*hg18 (24 Gb compressed /119 Gb uncompressed): [http://zhanglab.c2b2.columbia.edu/data/mCarts/data/mCarts_lib_data_hg18.tgz download]
+
 +
        <div u="slides" style="position: absolute; left: 0px; top: 0px; width: 400px; height: 200px; overflow: hidden;">
  
 +
      <div>
 +
                <a u=image href="https://sfari.org" rel="nofollow"><img u="image" src="/data/images/slideshow/SFARI.png" width="400" height="200" /></a>
 +
            </div>
  
'''Nova training data (for example):'''
+
            <div>
1.8 Mb compressed: [http://zhanglab.c2b2.columbia.edu/data/mCarts/data/Nova_train_data.tgz download]
+
                <a u=image href="http://zhanglab.c2b2.columbia.edu/index.php/MCarts_Documentation" rel="nofollow"><img u="image" src="/data/images/slideshow/mCarts.png" width="400" height="200" /></a>
 +
            </div>
 +
            <div>
 +
                <a u=image href="http://zhanglab.c2b2.columbia.edu/index.php/OLego" rel="nofollow"><img u="image" src="/data/images/slideshow/olego.png" width="400" height="200" /></a>
 +
            </div>
  
=Installation=
+
<!--
 +
            <div>
 +
                <a u=image href="http://zhanglab.c2b2.columbia.edu/index.php/Openings" rel="nofollow"><img u="image" src="/data/images/slideshow/hiring.png" width="400" height="200" /></a>
 +
            </div>
 +
//-->
 +
        </div>
 +
        <a style="display:none" href="http://slideshow.jssor.com">Javascript Slideshow</a>
 +
        <!-- Trigger -->
 +
        <script>
 +
            jssor_slider1_starter('slider1_container');
 +
        </script>
 +
    </div>
  
==Prerequisites==
+
</center>
 +
</html>
  
This software is implemented in Perl and C/C++.  It also relies on several standard Linux/Unix tools such as grep, cat, and sort.  We have tested the software on RedHat Linux, and it is expected to work on most Unix-like systems, including Mac OS X.
 
  
==Steps to install the software==
+
'''Introduction of the Zhang Laboratory'''
  
* Download the Perl library czplib, if you have not already done so.
+
We are part of the [http://cpmcnet.columbia.edu/dept/gsas/biochem/ Department of Biochemistry and Molecular Biophysics], [http://sbi.c2b2.columbia.edu/ Columbia Initiative in Systems Biology], [http://www.columbiamnc.org/ Motor Neuron Center], [http://stemcell.columbia.edu/ Columbia Stem Cell Initiative], and [http://hiccc.columbia.edu Herbert Irving Comprehensive Cancer Center] at [http://www.cumc.columbia.edu Columbia University Medical Center].
  
Decompress it and move it to a location of your choice:
+
We are fascinated by the complexity of the mammalian brain and the underlying molecular mechanisms.  While mammals have a similar number of genes compared to phenotypically simpler organisms (such as worms), one apparent feature of mammalian genes is their more complicated gene structures, which provide opportunities for sophisticated regulation at the RNA level.
  
<pre>
+
The vision of my lab is to infer RNA regulatory networks in the nervous system, as a way to understand the mammalian complexity manifested in evolutionary-developmental (evo-devo) processes and in several neuronal disorders. Specifically, we are interested in obtaining a fundamental understanding of how neuronal cell types are specified during normal development, how this process can be reversed in certain pathologic contexts (such as brain tumors), and why neurons die abnormally in neurodegenerative diseases. My lab will have a mixed dry and wet lab setup (a.k.a. "humid" lab). We use different model systems and a combination of high-throughput data-driven and hypothesis-driven approaches.
$tar zxvf czplib.v1.0.x.tgz
+
$mv czplib /usr/local/lib
+
</pre>
+
  
Add the library path to the environment variable, so perl can find it. 
 
<pre>
 
export PERL5LIB=/usr/local/lib/czplib
 
</pre>
 
  
* Download the mCarts code, if you have not already done so.
 
Decompress it and move it to a location of your choice:
 
  
<pre>
 
$tar zxvf mCarts.v1.0.x.tgz
 
$cd mCarts
 
$chmod 755 *.pl
 
$cd ..
 
$mv mCarts /usr/local/mCarts
 
</pre>
 
  
Add the dir to your $PATH environment variable.
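A minimal sketch of this step, assuming a Bourne-style shell (bash, zsh, sh) and the install location /usr/local/mCarts used above. The helper name <tt>path_add</tt> is illustrative and not part of mCarts.

```shell
# Hypothetical helper: prepend a directory to PATH only if it is not already listed.
path_add() {
    case ":$PATH:" in
        *":$1:"*) ;;                 # already present, nothing to do
        *) PATH="$1:$PATH" ;;        # otherwise put it in front
    esac
    export PATH
}

# Install location assumed from the step above.
path_add /usr/local/mCarts
```

To make the change persistent, the same lines can be placed in a shell startup file such as ~/.bashrc.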
 
  
* Download and compile PatternMatch and RegExpMatch
 
<pre>
 
$tar zxvf PatternMatch.v1.0.x.tgz
 
$cd PatternMatch
 
$make
 
$chmod 755 PatternMatch
 
$mv PatternMatch /usr/local/bin
 
  
$tar zxvf RegExpMatch.v1.0.x.tgz
 
$cd RegExpMatch
 
$make
 
$chmod 755 RegExpMatch
 
$mv RegExpMatch /usr/local/bin
 
</pre>
 
  
Make sure /usr/local/bin is already in your $PATH
 
  
* Download and decompress library files
 
  
<pre>
+
<html>
$tar zxvf mCarts_lib_data_mm9.tgz
+
<div align="center">
$mv mCart_lib_data_mm9 /home/czhang/data/mCart_lib_data_mm9
+
</pre>
+
  
=Get started=
+
<a href="http://sbi.c2b2.columbia.edu/"><img src="/data/images/sbi_logo.png" width="223" height="56" /></a>
 +
<a href="http://www.columbiamnc.org/"><img src="/data/images/mnc_logo.gif" width="122" height="35" /></a>
 +
<a href="http://www.c2b2.columbia.edu"><img src="/data/images/C2B2_logo.png" width="215" height="60" /></a> <p>
  
mCarts requires genomic sequences of genic regions, multiple alignments, exon/intron/UTR annotations, RNA accessibility information, etc., all of which are provided in the library files (which is why they are so large). The library files are described below (using mouse as an example):
+
</div>
 
+
</html>
#mm9.exon.uniq.bed: a collection of unique exons
+
#mm9.genic.bed: a collection of genic regions
+
#mm9.genic.ext10k.bed: a collection of genic regions, with 10 kb extension on both sides
+
#mm9_genic.ext10k_maf_split: multiple sequence alignments of extended genic regions
+
#mm9_genic.ext10k_pu4: pre-calculated single strandedness of all tetramers in extended genic regions
+
#refGene_knownGene.3utr.bed: 3' UTR regions based on RefSeq and UCSC known genes
+
#refGene_knownGene.5utr.bed: 5' UTR regions based on RefSeq and UCSC known genes
+
#species: list of 20 mammalian species (change the symbolic link to the 30 vertebrate species if necessary)
+
#tree.nh: phylogenetic tree of the 20 mammalian species (change the symbolic link to the 30 vertebrate species if necessary)
+
 
+
In addition, mCarts takes two sets of genomic regions, provided in two BED files, to obtain motif sites in the positive and negative training datasets of the HMM.  The positive training regions are typically (several thousand) regions of robust CLIP tag clusters.  The negative training regions are typically genic regions without any CLIP tags. Only motif sites in genic regions (as defined by library files) are actually used.
+
 
+
 
+
'''A real example to predict Nova binding YCAY clusters.'''
+
 
+
To get training data, [http://zhanglab.c2b2.columbia.edu/data/mCarts/data/Nova_train_data.tgz download]. 
+
 
+
There are two files in the compressed package: <tt>mm9.Nova.train.pos.bed</tt> specifies 6,231 non-repetitive, genic Nova CLIP tag clusters with peak height (PH)≥15, and located in exons or 1 kb flanking intronic sequences on each side (exon+ext1k sequences). <tt>mm9.Nova.train.neg.bed</tt> specifies 110,998 exon+ext1k sequences, in which no CLIP tags were present.
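Before training, it can be useful to confirm that the two BED files are well formed. The following is a minimal sketch, not part of mCarts: the helper name <tt>check_bed</tt> is hypothetical, and it assumes plain BED rows (no track or browser header lines) with at least three columns and start < end.

```shell
# Hypothetical helper: print the number of rows in a BED file and
# return a non-zero status if any row is malformed.
check_bed() {
    awk '
        NF < 3 || $2 !~ /^[0-9]+$/ || $3 !~ /^[0-9]+$/ || $2 + 0 >= $3 + 0 {
            # report the offending row on stderr and remember the failure
            printf "bad row %d: %s\n", NR, $0 > "/dev/stderr"
            bad = 1
        }
        END { print NR + 0; exit bad }
    ' "$1"
}
```

For example, <tt>check_bed mm9.Nova.train.pos.bed</tt> prints the number of regions in the positive training set, which can be compared with the counts quoted above.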
+
 
+
 
+
It is recommended to divide the process into two steps, although they can also be combined into a single step:
+
 
+
==Model training==
+
 
+
<pre>
+
$mCarts -v -ref mm9 -w YCAY -f ./Nova.train.pos.bed -b ./Nova.train.neg.bed -lib /home/zhangc/data/mm9_mammal_input_data --train-only ./mm9_Nova_out
+
</pre>
+
 
+
This command specifies the verbose mode (-v), reference genome (-ref mm9), the consensus motif to search (-w YCAY), foreground or positive training regions (-f ./Nova.train.pos.bed), background or negative training regions (-b ./Nova.train.neg.bed), directory with library files (-lib /home/zhangc/data/mm9_mammal_input_data), model training only (--train-only), and the output dir (./mm9_Nova_out).
+
 
+
See the complete list of options under "Additional options" below.
+
 
+
This command will do the following things:
+
 
+
# Search for individual motif sites in the genic regions (with 10 kb extension on each side) in the reference genome (mm9) and many additional genomes (e.g., 19 other mammalian genomes aligned to mm9) and evaluate their conservation using branch length scores (BLS).
+
# Retrieve the RNA accessibility information as measured by probability of unpairedness (PU) or single strandedness.  These scores for all tetramers in genic regions were pre-calculated, because calculating PU scores is quite slow.
+
# Intersect motif sites with positive and negative training regions to get training motif sites for the HMM.
+
# Estimate parameters of the HMM.
+
 
+
The following output files are of particular interest:
+
 
+
*'''model.txt:'''
+
 
+
This file saves the parameters of the HMM, including emission probabilities and the data used to calculate transition probabilities.  The emission probabilities are represented by histograms (nonparametric), and show how each feature contrasts RBP-bound motif sites with background motif sites.  This information is specified in the following lines:
+
 
+
<pre>
+
 
+
distance_positive 3.59E-20 3.59E-20 3.59E-20 0.078482739 ...
+
distance_negative 2.90E-22 2.90E-22 2.90E-22 0.044107727 ...
+
 
+
# 0 0.05 0.1 0.15 ...
+
conservation_positive_0 0.179478553 0.182674516 1.68E-19 0.023044575 ...
+
conservation_positive_1 0.02999663 0.047691271 1.69E-19 0.005729693 ...
+
conservation_positive_2 0.101576182 0.126094571 8.76E-19 0.014886165 ...
+
conservation_positive_3 0.12325902 0.163378809 4.75E-20 0.019869753 ...
+
conservation_negative_0 0.341826256 0.282508127 3.69E-22 0.033694368 ...
+
conservation_negative_1 0.11093915 0.126552235 2.04E-21 0.01781323 ...
+
conservation_negative_2 0.237788887 0.22985303 6.83E-21 0.022994864 ...
+
conservation_negative_3 0.291355883 0.259434326 4.82E-21 0.029970126 ...
+
accessibility_positive 0.014387222 0.057225909 0.073668448 0.082271419 ...
+
accessibility_negative 0.059879224 0.132628229 0.118651224 0.106571071 ...
+
 
+
</pre>
+
 
+
The distance probabilities are listed for spacings of 0, 1, 2, ... nt.
+
 
+
Conservation scores and accessibility scores are in the range between 0 and 1. For conservation scores, the suffix represents different regions (0-intron, 1-CDS, 2-5'UTR, 3-3'UTR). 
+
 
+
This information can be used to produce a figure as shown below:
+
 
+
 
+
[[File:MCarts emission.png|border|HMM emission probability trained on Nova data]]
+
 
+
 
+
Note that the distance distribution for the positive training sites is censored at 30 nt (--max-dist 30, default).  For Nova, all features contribute to the discrimination between Nova-bound YCAY sites and background sites.
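The histogram rows of model.txt described above can be pulled out for plotting. The following is a minimal sketch, assuming each row is a label followed by whitespace-separated probabilities as in the excerpt; the helper name <tt>model_row</tt> is hypothetical and not part of mCarts.

```shell
# Hypothetical helper: print one histogram row of model.txt as "bin<TAB>probability".
# $1 = row label (e.g. distance_positive), $2 = path to model.txt
model_row() {
    awk -v key="$1" '
        $1 == key {
            # field 2 is bin 0, field 3 is bin 1, and so on
            for (i = 2; i <= NF; i++) printf "%d\t%s\n", i - 2, $i
        }
    ' "$2"
}
```

For the distance rows, the bin index is the spacing itself. For the conservation and accessibility rows, the bin index presumably refers to the score bins listed on the line beginning with <tt>#</tt>; that reading is an assumption based on the excerpt.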
+
 
+
*'''BLS/mm9.genic.ext10k.motif.bls.chrom.bed'''
+
This is the BED file of individual motif sites, which can be quite large if the motif is short and degenerate (like the Nova motif).  You can load this file into a genome browser for visualization.  The 5th column of each row is the BLS conservation score of the site, which ranges between 0 and 1 and should be rescaled to 0-1000 for the best visualization.
+
 
+
<pre>
+
$awk '{print $1"\t"$2"\t"$3"\t"$4"\t"$5*1000"\t"$6}' BLS/mm9.genic.ext10k.motif.bls.chrom.bed > mm9.genic.ext10k.motif.bls.chrom.2.bed
+
</pre>
+
 
+
==Prediction==
+
 
+
Once the model has been generated and checked for plausibility (you can also adjust it by editing model.txt), run the prediction with the command line below.
+
 
+
<pre>
+
$mCarts -v --exist-model ./mm9_Nova_out
+
</pre>
+
 
+
 
+
This will produce the results in the bed file cluster.bed.  The 5th column is the motif cluster score, which can be used to rank the clusters.  High scoring clusters are in general more reliable than low scoring clusters.
+
 
+
This file can be converted into bedGraph, which in combination with the individual motif sites, will give the best visualization.
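The ranking and the bedGraph conversion mentioned above can be sketched as follows. This is not part of mCarts: the helper names are hypothetical, and cluster.bed is assumed to be a 6-column BED file with the cluster score in column 5, as described above.

```shell
# Hypothetical helpers for post-processing cluster.bed.

# List the n highest-scoring clusters (default 10).
# General numeric sort (-g) also handles scores written like 1.2e-3.
top_clusters() {
    LC_ALL=C sort -k5,5gr "$1" | head -n "${2:-10}"
}

# Convert BED to bedGraph (chrom, start, end, value), sorted by position.
bed_to_bedgraph() {
    awk 'BEGIN { OFS = "\t" } { print $1, $2, $3, $5 }' "$1" |
        LC_ALL=C sort -k1,1 -k2,2n
}
```

Some bedGraph consumers expect non-overlapping intervals; whether predicted clusters (for example on opposite strands) can overlap is not stated here, so the output may need further filtering before loading.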
+
 
+
=Additional options=
+
 
+
If you run mCarts without any parameters, it prints the usage information:
+
 
+
<pre>
+
$perl ~/src/czsrc/mCarts/mCarts
+
search clustered RNA motif sites
+
Usage: mCarts [options] <out dir>
+
Example1: mCarts -v -ref mm9 -f CLIP.pos.bed -b CLIP.neg.bed -lib mm9_mammal_input_data -w YCAY -m 3 --min-site 3 --max-dist 30 out_dir
+
Example2: mCarts -v --exist-model out_dir_from_prev_run
+
 
+
</pre>
+
 
+
Options:
+
 
+
{|class="wikitable" width="100%" style="border:1px solid"
+
|-
+
!scope="column" width=150|'''Option'''
+
!'''Description'''
+
|-
+
|  -ref      [string]
+
|reference species to search (mm9)
+
The data libraries for reference species of mm9 and hg18 are provided
+
|-
+
| -f        [string]
+
|a BED file specifying positive (foreground) training regions
+
|-
+
| -b        [string]
+
|a BED file specifying negative (background) training regions
+
|-
+
| -lib      [string]
+
|dir with data library files
+
|-
+
| -w        [string]
+
|the consensus motif to search
+
e.g., YCAY
+
|-
+
|  -m        [int]
+
|number of mismatches (0)
+
|-
+
| -r
+
|the motifs provided are regular expressions (will disable -n and -m)
+
The -w option then specifies a regular expression (e.g., TTTT+).  More details about the syntax are available at http://www.boost.org/doc/libs/1_51_0/libs/regex/doc/html/index.html
+
|-
+
| --min-site [int]
+
|minimum sites in clusters (3)
+
|-
+
| --max-dist [int]
+
|max distance allowed in clusters (30)
+
The max spacing between neighboring sites in a cluster
+
|-
+
| --train-only
+
|train the model only, no prediction
+
|-
+
| --exist-model
+
|prediction based on existing model specified in out dir
+
|-
+
| --check-maf
+
|check maf files in the library dir
+
This option is reserved, and should not be used
+
|-
+
| -v
+
|verbose
+
|}
+
 
+
=Running jobs in parallel=
+
 
+
Support for running jobs in parallel using a queue system (tested with SGE) is included.  The program checks whether SGE is available.  If so, jobs are dispatched to the default queue, and the results are collected and combined when all jobs are done.  Otherwise, the job runs locally.
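How mCarts detects SGE is not described here. As a rough way to predict which mode a run will use, one can check whether the SGE submission command <tt>qsub</tt> is on the PATH. This is a sketch under that assumption, and the helper name <tt>have_cmd</tt> is hypothetical.

```shell
# Hypothetical check, not mCarts's own logic: is a command available on PATH?
have_cmd() {
    command -v "$1" >/dev/null 2>&1
}

if have_cmd qsub; then
    echo "qsub found: a queue system such as SGE may be available"
else
    echo "qsub not found: expect jobs to run locally"
fi
```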
+

Revision as of 23:48, 25 September 2013



