Difference between revisions of "Test"

From Zhang Laboratory

(Versions)
Line 1: Line 1:
=Introduction=
+
We are working on RNA at the interface of Systems Biology, Computer Science and Molecular Neuroscience, currently sponsored by NIH, Simons Foundation, and Columbia University Medical Center startup funds.
  
Crosslinking induced mutation site or CIMS analysis is a computational method for HITS-CLIP data analysis to determine the exact protein-RNA crosslink sites and thereby map protein-RNA interactions at single-nucleotide resolution.  This method is based on the observation that UV cross linked amino-acid-RNA adducts introduce reverse transcription errors in cDNAs at certain frequencies, which are captured by sequencing and comparison of CLIP tags with the reference genome.  More details can be found in the following references:
+
=Elucidating neuronal RNA-regulatory networks=
<pre>
+
Zhang, C. †, Darnell, R.B. † 2011. Mapping in vivo protein-RNA interactions at single-nucleotide resolution from HITS-CLIP data.  Nat. Biotech. 29:607-614.
+
  
Moore, M.*, Zhang, C.*, Gantman, E.C., Mele, A., Darnell, J.C., Darnell, R.B. 2014. Mapping Argonaute and conventional RNA-binding protein interactions with RNA at single-nucleotide resolution using HITS-CLIP and CIMS analysis. Nat Protocols. 9(2):263-293. doi:10.1038/nprot.2014.012.
+
It is increasingly recognized that post-transcriptional regulation at the RNA level plays critical roles in orchestrating gene expression in mammalian systems. Such regulation is conferred through the interaction of at least several hundred RNA-binding proteins (RBPs) with their target transcripts, which together form RNA-regulatory networks.  The challenge in inferring RNA-regulatory networks lies in the fact that most RBPs recognize very short and degenerate sequence motifs with limited information content. We take advantage of recent advances in biochemical and high-throughput assays, such as HITS-CLIP (or CLIP-Seq) and RNA-Seq, that profile transcriptomes and protein-RNA interactomes. We apply these assays, mostly using mouse brain as a model system, and combine them with statistical and machine learning approaches to develop methods that predict specific protein-RNA interactions. These predictions complement and enhance the information we can obtain from experimental data. We also develop integrative modeling approaches that combine multiple types of data to infer direct and functional targets of specific RBPs.
</pre>
+
  
This brief document provides only the most critical information about how to run the program, which complements a more detailed, step-by-step guide described in the second reference above.
+
*Zhang, C.†, Frias, M.A., Mele, A., Ruggiu, M., Eom, T., Marney, C.B., Wang, H., Licatalosi, D.D., Fak, J.J., Darnell, R.B.† 2010. Integrative modeling defines the Nova splicing-regulatory network and its combinatorial controls. ''Science'', 329: 439-443.
 +
*Zhang, C.*, Zhang, Z.*, Castle, J., Sun, S., Johnson, J., Krainer, A.R. and Zhang, M.Q. 2008. Defining the regulatory network of the tissue-specific splicing factors Fox-1 and Fox-2. ''Genes Dev'', 22:2550-2563.
  
For the crosslinking induced truncation site (CITS) analysis described below, please refer to:
 
  
<pre>
 
Weyn-Vanhentenryck,S.M.*, Mele,A.*, Yan,Q.*, Sun,S., Farny,N., Zhang,Z., Xue,C., Herre,M., Silver,P.A., Zhang,M.Q., Krainer,A.R., Darnell,R.B.†, Zhang,C.† 2014. HITS-CLIP and integrative modeling define the Rbfox splicing-regulatory network linked to brain development and autism. Cell Rep. doi:10.1016/j.celrep.2014.02.005.
 
</pre>
 
  
=Versions=
+
=RNA-regulatory networks in evo-devo processes and in neuronal disorders=
*v1.0.3 ( 5-05-2014 ), current
+
**Included scripts and documentation for crosslinking induced truncation site (CITS) analysis
+
*v1.0.2 ( 8-15-2013 )
+
*v1.0.1 ( 5-22-2013 )
+
**Minor internal extension
+
**Included joinWrapper.py which was missing in the previous version
+
*v1.0.0 ( 12-14-2012 )
+
**The initial public release
+
  
=Download=
+
One of the ultimate goals of inferring RNA-regulatory networks is to understand their forms and functions in normal physiological and pathological contexts.  In real systems, the networks are almost always under combinatorial and dynamic regulation, which tremendously increases their complexity.  We are working toward a better understanding of how such dynamic regulation drives the developing mammalian brain, differentiation of neurons from stem cells and (at a larger scale) mammalian evolution.  We are also interested in understanding how such networks are perturbed in neurodegenerative diseases (such as ALS and SMA) and neurodevelopmental disorders (such as autism).
  
'''Source code:'''  
+
*Weyn-Vanhentenryck,S.M.*, Mele,A.*, Yan,Q.*, Sun,S., Farny,N., Zhang,Z., Xue,C., Herre,M., Silver,P.A., Zhang,M.Q., Krainer,A.R., Darnell,R.B.†, Zhang,C.† 2014.  HITS-CLIP and integrative modeling define the Rbfox splicing-regulatory network linked to brain development and autism. ''Cell Rep'' In press.
  
*czplib (perl): a perl library with various functions for genomic/bioinformatic analysis. ([http://sourceforge.net/p/czplib/ download from SourceForge.net])
+
=High-throughput transcriptomic data analysis=
*CIMS (perl): the core algorithm. ([http://sourceforge.net/p/ngs-cims/ download from SourceForge.net])
+
  
=Installation=
+
Our work relies heavily on high-throughput technologies, which produce enormous amounts of data, and on algorithms that transform these data into useful information.  We are interested in developing better algorithms to process transcriptomic data, such as mapping RNA-Seq reads, discovering and quantifying RNA processing in specific conditions, and analyzing CLIP data to map protein-RNA interactions at single-nucleotide resolution.
  
==Prerequisites==
+
*Moore, M.*, Zhang, C.*, Gantman, E.C., Mele, A., Darnell, J.C., Darnell, R.B. 2014. Mapping Argonaute and conventional RNA-binding protein interactions with RNA at single-nucleotide resolution using HITS-CLIP and CIMS analysis. Nat Protocols.  9:263-293. ([[CIMS_Documentation|Software]])
  
This software is implemented in Perl. It also relies on several standard Linux/Unix tools such as grep, cat, and sort. We have tested the software on RedHat Linux, although it is expected to work on most Unix-like systems, including Mac OS X.
+
*Wu,J., Anczukow,O., Krainer,A.R., Zhang,M.Q.†, Zhang,C.† 2013. OLego: Fast and sensitive mapping of spliced mRNA-Seq reads using small seeds. ''Nucleic Acids Res.'', In press. ([[OLego|Software]])
  
==Steps to install the software==
+
*Zhang, C.†, Darnell, R.B.† 2011. Mapping in vivo protein-RNA interactions at single-nucleotide resolution from HITS-CLIP data. ''Nat Biotech'', 29:607-614.
 
+
* Download the Perl library czplib, if you have not already done so.
+
 
+
Decompress it and move it to a place you like
+
 
+
<pre>
+
$tar zxvf czplib.v1.0.x.tgz
+
$mv czplib /usr/local/lib
+
</pre>
+
 
+
Add the library path to the environment variable, so perl can find it. 
+
<pre>
+
export PERL5LIB=/usr/local/lib/czplib
+
</pre>
+
 
+
* Download the CIMS code, if you have not already done so.
+
Decompress it and move it to a place you like
+
 
+
<pre>
+
$tar zxvf CIMS.v1.0.x.tgz
+
$cd CIMS
+
$chmod 755 *.pl
+
$cd .. && mv CIMS /usr/local/CIMS
+
</pre>
+
 
+
Add the dir to your $PATH environment variable.
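
Both environment settings can be made persistent by placing them in a shell startup file. The following is a minimal sketch, assuming a bash-like shell and the install locations used in the commands above (/usr/local/lib/czplib and /usr/local/CIMS); the variable names CZPLIB_DIR and CIMS_DIR are only for illustration, and the paths should be adjusted if you installed elsewhere.

```shell
# Assumed install locations (taken from the commands above; adjust as needed)
CZPLIB_DIR=/usr/local/lib/czplib
CIMS_DIR=/usr/local/CIMS

# Prepend czplib to PERL5LIB, keeping any value that is already set
export PERL5LIB="$CZPLIB_DIR${PERL5LIB:+:$PERL5LIB}"

# Prepend the CIMS scripts to PATH so they can be found by name
export PATH="$CIMS_DIR:$PATH"
```

Prepending rather than overwriting keeps any other Perl libraries and tools you already rely on visible.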
+
 
+
=CIMS analysis=
+
==Input files==
+
 
+
The key script one needs to run is CIMS.pl, which takes two BED files as input: a list of unique CLIP tags (properly mapped to the reference genome), and the coordinates of mutations (deletions, insertions, or substitutions) both in the reference genome and relative to the CLIP tags.  It is critical to make sure:
+
 
+
* analyze one type of mutations at a time.
+
* the 4th column of the mutation BED file should match the name of the CLIP tag in the first BED file.
+
* the coordinates of mutations relative to the CLIP tag (from the 5' end of the Watson strand, 0-based) are correctly specified in the 5th column of the second BED file. 
+
* only mutations in unique CLIP tags should be included.
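
The name-matching requirement above can be checked before running CIMS.pl. The sketch below is not part of the CIMS package; it uses awk on two small made-up files (tags.example.bed and mut.example.bed are hypothetical names and data) and lists every mutation whose 4th column does not match any tag name in the tag file.

```shell
# Toy inputs (invented data, for illustration only)
printf 'chr1\t100\t150\ttagA\t1\t+\nchr1\t200\t240\ttagB\t1\t-\n' > tags.example.bed
printf 'chr1\t120\t121\ttagA\t20\t+\nchr1\t210\t211\ttagC\t10\t-\n' > mut.example.bed

# First pass: remember every tag name (column 4) in the tag file.
# Second pass: print mutations whose tag name was never seen.
# Note: this idiom assumes the tag file is not empty.
awk -F'\t' 'NR == FNR { seen[$4] = 1; next } !($4 in seen)' \
    tags.example.bed mut.example.bed > mut.orphan.bed

# Any line reported here points to a mutation without a matching unique tag
cat mut.orphan.bed
```

An empty mut.orphan.bed means every mutation refers to a tag present in the first BED file; in the toy data the mutation assigned to tagC is reported.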
+
 
+
 
+
Now you can run something like
+
 
+
perl /usr/local/CIMS/CIMS.pl -v -n 5 -p -FDR 0.001 -c ./cache_del  test.uniq.bed test.uniq.del.bed test.uniq.del.CIMS.txt
+
 
+
 
+
The output is a list of CIMS at FDR<0.001, one per line.
+
 
+
The first six columns of this file follow the definition of a BED file, including coordinates and strand of each CIMS.  Columns 7-10 are k, m, FDR, and number of sites with m or more tags with mutations given k tags at that position in total (the denominator to calculate FDR, which gives an idea about the precision of the FDR value).
+
 
+
This file can be reordered with the following command:
+
 
+
<pre>
+
sort test.uniq.del.CIMS.txt -k 9,9n -k 8,8nr -k 7,7n > test.uniq.del.CIMS.sort.txt
+
</pre>
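
The output can also be filtered further after the run, using the column layout described above (k in column 7, m in column 8, FDR in column 9). This is a minimal sketch on invented values; the file names and the thresholds (FDR < 0.001, m/k >= 0.2) are arbitrary illustrations, not recommendations.

```shell
# Toy CIMS output: six BED columns, then k, m, FDR, count (values are invented)
printf 'chr1\t10\t11\ts1\t0\t+\t20\t5\t0.0005\t12\n'  > cims.example.txt
printf 'chr1\t50\t51\ts2\t0\t+\t8\t6\t0.0001\t3\n'   >> cims.example.txt
printf 'chr2\t70\t71\ts3\t0\t-\t30\t2\t0.02\t40\n'   >> cims.example.txt

# Keep sites with FDR < 0.001 and m/k >= 0.2,
# then order by FDR (ascending), m (descending), and k (ascending)
awk -F'\t' '$9 < 0.001 && $8 / $7 >= 0.2' cims.example.txt |
    sort -k 9,9n -k 8,8nr -k 7,7n > cims.example.filtered.txt

cat cims.example.filtered.txt
```

In the toy data, s3 is dropped because its FDR is too high, and s2 is listed before s1 because it has the smaller FDR.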
+
 
+
==Usage==
+
<pre>
+
CIMS.pl [options] <tag.bed> <mutation.bed> <out.txt>
+
</pre>
+
 
+
Arguments:
+
{|class="wikitable" width="100%" style="border:1px solid"
+
|-
+
!scope="column" width=150|'''Argument'''
+
|'''Description'''
+
|-
+
|<tag.bed>
+
|BED file of unique CLIP tags
+
|-
+
|<mutation.bed>
+
|BED file of mutations in unique CLIP tags. Make sure it follows the notes above
+
|-
+
|<out.txt>
+
| output file with the list of CIMS
+
|}
+
 
+
 
+
Options:
+
{|class="wikitable" width="100%" style="border:1px solid"
+
|-
+
!scope="column" width=150|'''Option'''
+
|'''Description'''
+
|-
+
| -big
+
| input files are big (e.g. over 6 million lines)
+
|-
+
| -n [int]
+
|number of iterations for permutation (default: 5)
+
|-
+
| -p
+
| track mutation position relative to read start
+
|-
+
| --no-sparse-correct
+
| no sparsity correction *
+
|-
+
| -FDR [float]
+
| threshold of FDR (default: 1)
+
|-
+
| -mkr [float]
+
| threshold of m-over-k-ratio (default: 0)
+
|}
+
 
+
<nowiki>*</nowiki>This option should not be used in general, but is included to reproduce our earlier analysis.  We introduced this feature to eliminate an additional filtering step based on mutation frequency (i.e., the "m" value).
+
 
+
=CITS analysis=
+
Some variations of CLIP protocols, including BrdU-CLIP and iCLIP, allow capture of tags that are truncated at the cross link sites (and those read through as well).  This section describes our method to detect robust cross link induced truncation sites (CITS).
+
 
+
CITS analysis requires unique CLIP tags as a BED file, which is used to obtain the coordinates of potential truncation sites, defined as the position immediately upstream of the first nucleotide of each CLIP tag.
+
 
+
<pre>
+
perl ~/scripts/bedExt.pl -n up -l "-1" -r "-1" -v tag.uniq.bed tag.uniq.trunc.bed
+
</pre>
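
To make the definition of a truncation site concrete, the following awk sketch derives the position immediately upstream of each tag from a toy BED file. It is an illustration of the definition under the usual BED conventions (0-based start, exclusive end), not the implementation of bedExt.pl, and the file names and data are invented.

```shell
# Toy unique tags (invented data): one on each strand
printf 'chr1\t100\t150\ttagA\t1\t+\nchr1\t200\t240\ttagB\t1\t-\n' > tag.example.bed

# "+" strand: the tag starts at column 2, so the upstream base is [start-1, start)
# "-" strand: the tag starts at its right end, so the upstream base is [end, end+1)
# (a "+" strand tag starting at position 0 has no upstream base and is skipped)
awk -F'\t' -v OFS='\t' '
    $6 == "+" && $2 > 0 { print $1, $2 - 1, $2, $4, $5, $6 }
    $6 == "-"           { print $1, $3, $3 + 1, $4, $5, $6 }
' tag.example.bed > tag.example.trunc.bed

cat tag.example.trunc.bed
```

For the toy data this gives positions 99-100 for tagA and 240-241 for tagB, each a single nucleotide on the same strand as the tag.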
+
 
+
 
+
Since CLIP tags with crosslinking induced mutations (i.e. deletions) most likely represent tags that read through the cross link site, it is recommended that these tags are removed.
+
 
+
<pre>
+
cut -f 4 tag.uniq.del.bed | sort | uniq > tag.uniq.del.id
+
python ~/src/CIMS/joinWrapper.py tag.uniq.trunc.bed tag.uniq.del.id 4 1 V tag.uniq.trunc.clean.bed
+
</pre>
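
The filtering step above amounts to removing every truncation site whose tag name appears in the list of tags with deletions. As a hedged illustration of that idea only (assuming this is what the V mode of joinWrapper.py does; the script itself is not reproduced here), the same result can be sketched with awk on invented data:

```shell
# Toy truncation sites and a toy list of tag names that carry deletions
printf 'chr1\t99\t100\ttagA\t1\t+\nchr1\t240\t241\ttagB\t1\t-\n' > trunc.example.bed
printf 'tagB\n' > del.example.id

# First pass: remember the tag names to drop.
# Second pass: keep only sites whose name (column 4) is not in that list.
# Note: this idiom assumes the id file is not empty.
awk -F'\t' 'NR == FNR { drop[$1] = 1; next } !($4 in drop)' \
    del.example.id trunc.example.bed > trunc.example.clean.bed

cat trunc.example.clean.bed
```

Only the site from tagA remains, because tagB is listed as a read-through tag with a deletion.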
+
 
+
Here tag.uniq.del.bed is the list of deletions in unique tags.  You should have obtained this file by CIMS analysis.
+
 
+
We now cluster all unique CLIP tags, if not already. 
+
<pre>
+
perl ~/scripts/tag2cluster.pl -s -maxgap "-1" -of bed -v tag.uniq.bed tag.uniq.cluster.0.bed
+
awk '{if($5>2) {print $0}}' tag.uniq.cluster.0.bed > tag.uniq.cluster.bed
+
</pre>
+
 
+
In the next step below, these clusters will be used to shuffle truncation events to evaluate the statistical significance of reproducibility.  This is more stringent than gene-based permutation.
+
 
+
<pre>
+
perl ~/scripts/tag2peak.pl -ss -v -gap 25 -p 0.001 tag.uniq.cluster.bed tag.uniq.trunc.clean.bed tag.uniq.CITS.s30.bed
+
</pre>
+
 
+
This script will define CITS with p < 0.001 using scan statistics (instead of actually performing random permutations).  Sites within 25 nt of each other will be clustered together.  One might also apply Bonferroni multiple testing correction; in that case, use the additional option --multi-test.
+
 
+
The output file tag.uniq.CITS.s30.bed is the list of robust CITS.  One can perform further analysis, such as motif enrichment analysis or de novo motif discovery to evaluate the signal to noise ratio and fine tune the parameters.
+

Revision as of 11:14, 5 November 2014

We are working on RNA at the interface of Systems Biology, Computer Science and Molecular Neuroscience, currently sponsored by NIH, Simons Foundation, and Columbia University Medical Center startup funds.

Elucidating neuronal RNA-regulatory networks

It is increasingly recognized that post-transcriptional regulation at the RNA level plays critical roles in orchestrating gene expression in mammalian systems. Such regulation is conferred through the interaction of at least several hundred RNA-binding proteins (RBPs) with their target transcripts, which together form RNA-regulatory networks. The challenge in inferring RNA-regulatory networks lies in the fact that most RBPs recognize very short and degenerate sequence motifs with limited information content. We take advantage of recent advances in biochemical and high-throughput assays, such as HITS-CLIP (or CLIP-Seq) and RNA-Seq, that profile transcriptomes and protein-RNA interactomes. We apply these assays, mostly using mouse brain as a model system, and combine them with statistical and machine learning approaches to develop methods that predict specific protein-RNA interactions. These predictions complement and enhance the information we can obtain from experimental data. We also develop integrative modeling approaches that combine multiple types of data to infer direct and functional targets of specific RBPs.

  • Zhang, C.†, Frias, M.A., Mele, A., Ruggiu, M., Eom, T., Marney, C.B., Wang, H., Licatalosi, D.D., Fak, J.J., Darnell, R.B.† 2010. Integrative modeling defines the Nova splicing-regulatory network and its combinatorial controls. Science, 329: 439-443.
  • Zhang, C.*, Zhang, Z.*, Castle, J., Sun, S., Johnson, J., Krainer, A.R. and Zhang, M.Q. 2008. Defining the regulatory network of the tissue-specific splicing factors Fox-1 and Fox-2. Genes Dev, 22:2550-2563.


RNA-regulatory networks in evo-devo processes and in neuronal disorders

One of the ultimate goals of inferring RNA-regulatory networks is to understand their forms and functions in normal physiological and pathological contexts. In real systems, the networks are almost always under combinatorial and dynamic regulation, which tremendously increases their complexity. We are working toward a better understanding of how such dynamic regulation drives the developing mammalian brain, differentiation of neurons from stem cells and (at a larger scale) mammalian evolution. We are also interested in understanding how such networks are perturbed in neurodegenerative diseases (such as ALS and SMA) and neurodevelopmental disorders (such as autism).

  • Weyn-Vanhentenryck,S.M.*, Mele,A.*, Yan,Q.*, Sun,S., Farny,N., Zhang,Z., Xue,C., Herre,M., Silver,P.A., Zhang,M.Q., Krainer,A.R., Darnell,R.B.†, Zhang,C.† 2014. HITS-CLIP and integrative modeling define the Rbfox splicing-regulatory network linked to brain development and autism. Cell Rep In press.

High-throughput transcriptomic data analysis

Our work relies heavily on high-throughput technologies, which produce enormous amounts of data, and on algorithms that transform these data into useful information. We are interested in developing better algorithms to process transcriptomic data, such as mapping RNA-Seq reads, discovering and quantifying RNA processing in specific conditions, and analyzing CLIP data to map protein-RNA interactions at single-nucleotide resolution.

  • Moore, M.*, Zhang, C.*, Gantman, E.C., Mele, A., Darnell, J.C., Darnell, R.B. 2014. Mapping Argonaute and conventional RNA-binding protein interactions with RNA at single-nucleotide resolution using HITS-CLIP and CIMS analysis. Nat Protocols. 9:263-293. (Software)
  • Wu,J., Anczukow,O., Krainer,A.R., Zhang,M.Q.†, Zhang,C.† 2013. OLego: Fast and sensitive mapping of spliced mRNA-Seq reads using small seeds. Nucleic Acids Res., In press. (Software)
  • Zhang, C.†, Darnell, R.B.† 2011. Mapping in vivo protein-RNA interactions at single-nucleotide resolution from HITS-CLIP data. Nat Biotech, 29:607-614.