Difference between revisions of "MCross"

From Zhang Laboratory

(Download)
(Run mCross based on top n mer and CIMS/CITS sequences)
 
(22 intermediate revisions by the same user not shown)
Line 1: Line 1:
 +
=Introduction=
 +
mCross is a computational tool for de novo motif discovery for RNA-binding proteins (RBPs) using CLIP data. mCross jointly models the sequence specificity and the protein-RNA crosslinking position in the RBP binding motif by leveraging crosslink sites mapped at single-nucleotide resolution by crosslinking-induced mutation site (CIMS) and crosslinking-induced truncation site (CITS) analysis.
 +
 +
More details about this work can be found in the following paper:
 +
<pre>
 +
Feng et al. (2019), Modeling the in vivo specificity of RNA-binding proteins by precisely registering protein-RNA crosslink sites. Mol Cell. 74:1189-1204.E6.
 +
</pre>
 +
 +
=Versions=
 +
*v1.0.0 ( 05-07-2021 )
 +
**The initial public release
 +
 +
=Software installation=
 +
==Prerequisites==
 +
This software is implemented in Perl and R. We have tested it on RedHat Linux, and it is expected to work on most Unix-like systems, including Mac OS X.  mCross requires the following to be installed:
 +
* R (version 3.0.0 or higher).
 +
* R packages: '''gplots''', '''motifStack''', '''ggplot2''', '''gridExtra''', '''cowplot''', and '''getopt'''.
 +
* Perl
 +
 +
==Installation through anaconda==
 +
The working environment is set up as described in the [https://zhanglab.c2b2.columbia.edu/index.php/CTK_Documentation#Download CTK] documentation. Packages can also be installed into other conda environments if preferred.
 +
 +
Activate the working environment 'ctk' and install the packages by running the commands below:
 +
<pre>
 +
myenv='ctk'
 +
conda activate $myenv
 +
 +
conda install -y -c chaolinzhanglab mcross
 +
</pre>
 +
 +
If you would like to install a specific platform version (e.g., 'noarch'), use one of the commands below:
 +
<pre>
 +
conda install -y -c chaolinzhanglab/noarch mcross
 +
 +
conda install -y -c bioconda mcross
 +
</pre>
 +
 +
==Manual installation==
 +
 +
Install software packages described in prerequisites:
 +
* Download and install the czplib Perl library (refer to [https://zhanglab.c2b2.columbia.edu/index.php/CTK_Documentation#Download CTK] documentation)
 +
* Download mCross from GitHub to a directory of your choice (as an example, we use ~/czlab_src/github/)
 +
 +
<pre>
 +
cd ~/czlab_src/github/
 +
git clone https://github.com/chaolinzhanglab/mCross.git
 +
</pre>
 +
 +
Optionally, add the mCross directory to your $PATH environment variable.
 +
 +
=Usage=
 +
 +
 +
You can run the following command to show descriptions of arguments, input and output format.
 +
<pre>
 +
mCross.pl [options] <seq_file> <out_file or out_file_stem>
 +
</pre>
 +
 +
Arguments:
 +
 +
{|class="wikitable" width="55%" style="border:1px solid"
 +
 +
!'''Argument'''!!'''Description'''
 +
|-
 +
| -l||sequence extension around crosslink site
 +
 +
|-
 +
|  --seed|| top_nmer_file
 +
 +
|-
 +
| --bg|| if top_nmer not provided, fg and bg file are used to get the list
 +
 +
|-
 +
| -p|| pad the seed motif on both sides
 +
 +
|-
 +
|  -m|| number of mismatches allowed in the core motif
 +
 +
|-
 +
|  -N|| max number of seed words to search
 +
 +
|-
 +
|  --cluster-seeds|| cluster seed word
 +
 +
|-
 +
|  --xl-model|| crosslink model (1=simple(default), 2=nucleotide-specific)
 +
 +
|-
 +
|  --score-method||  ([log])/sqrt
 +
 +
|-
 +
|  --prefix||  prefix of the motif name
 +
 +
|-
 +
|  --single-output-file||  write all motifs to a single file
 +
 +
|-
 +
| -c, cache dir||path to write temporary file
 +
 +
|-
 +
| -v, verbose||verbose mode
 +
 +
|}
 +
 +
 +
 +
mCross takes the sequences around CIMS/CITS sites as input and generates the binding motifs for each input. Please note that mCross accepts either a top n mer file or a background sequence fasta file from which the top n mer list is derived.  The top n mer file is generated by counting the occurrence of each n mer in the sequences around peak regions or CIMS/CITS regions. It can be generated with our CTK toolkit (http://zhanglab.c2b2.columbia.edu/index.php/CTK_Documentation).
 +
 +
 +
==Peak calling and CIMS/CITS analysis==
 +
 +
Please check http://zhanglab.c2b2.columbia.edu/index.php/ECLIP_data_analysis_using_CTK for details.
 +
 +
==Get enriched top n mer from sequences around peak/CIMS/CITS regions==
 +
1. Extract fasta sequences from the BED files representing the peak or CIMS/CITS regions. Typically, we extend 50 bp both upstream and downstream of the peak center, or 10 bp around CIMS/CITS sites.
 +
 +
2. Generate background sequences. In the example below, we take the regions from 550 to 450 nt upstream and from 450 to 550 nt downstream of the peak center as background sequences.
 +
 +
3. Calculate the enrichment score of n mer. We set n equal to 7 as an example in the following command.
 +
 +
<pre>
 +
word_enrich.pl -w 7 -test binom -v Rbfox_R2.tag.uniq.peak.sig.PH10.center.100.normsk.fa  Rbfox_R2.tag.uniq.peak.sig.PH10.center.bg.100.normsk.fa Rbfox_R2.tag.uniq.peak.sig.PH10.center.100.w7.txt
 +
</pre>
 +
 +
4. Generate top n mer file as input of mCross.
 +
<pre>
 +
gen_word_enrich_matrix.pl  peak.conf  Rbfox_R2.tag.uniq.peak.sig.PH10.center.100.w7.zcore.mat.txt
 +
</pre>
 +
This script takes as input a configuration file with two tab-separated columns:
 +
<pre>
 +
Rbfox_R2.tag.uniq.peak.sig.PH10.center.100.w7.txt \tab Rbfox_peak
 +
</pre>
 +
 +
5. Extract the top n mer list.
 +
<pre>
 +
Rscript topword.R Rbfox_R2.tag.uniq.peak.sig.PH10.center.100.w7.zcore.mat.txt Rbfox_peak_top7mer
 +
</pre>
 +
 +
==Run mCross based on top n mer and CIMS/CITS sequences==
 +
With the top n mer file ready, mCross can take either CIMS or CITS sequences as input. Here we use CITS sequences as an example:
 +
<pre>
 +
mCross.pl -l 10 -p 2 -N 10 -m 1 --cluster-seeds --seed Rbfox_peak_top7mer/top.Rbfox_peak.txt --prefix Rbfox --score-method sqrt Rbfox_R2.tag.uniq.rgb.clean.CITS.s30.singleton.21.normsk.fa Rbfox_peakvsCITS
 +
</pre>
 +
mCross will write the discovered motifs in TRANSFAC format to the output folder.
 +
 +
Alternatively, you can skip the steps above of finding top n mers by providing a background sequence file using argument --bg:
 +
 +
<pre>
 +
mCross.pl -l 10 -p 2 -N 10 -m 1 --cluster-seeds --bg Rbfox_R2.tag.uniq.rgb.clean.CITS.s30.singleton.bg.21.normsk.fa --prefix Rbfox --score-method sqrt Rbfox_R2.tag.uniq.rgb.clean.CITS.s30.singleton.21.normsk.fa Rbfox_CITS
 +
</pre>
 +
 +
'''Note''':
 +
 +
1. In this case, top 7-mers will be found by comparing  Rbfox_R2.tag.uniq.rgb.clean.CITS.s30.singleton.21.normsk.fa as foreground and Rbfox_R2.tag.uniq.rgb.clean.CITS.s30.singleton.bg.21.normsk.fa as background.  The identified top 7mers will be used as seeds for de novo motif discovery.
 +
 +
2. We allow the users to specify the list of seed n-mers to provide the flexibility (e.g., the users prefer to determine seeds using sequences around peaks rather than crosslink sites).
 +
 +
==Visualization==
 +
 +
Run mCross2logo.R in the mCross package to generate the plot.
 +
<pre>
 +
Rscript mCross2logo.R -i Rbfox.00.mat -o Rbfox.00.pdf -s rna -v
 +
</pre>
 +
 +
=mCrossDB=
 +
 
==Web interface==
 
==Web interface==
  
Access [http://hfaistos.uio.no:8002 mCrossDb>>>]
+
Access [http://zhanglab.c2b2.columbia.edu/mCrossBase/ mCrossBase>>>].
  
  
 
==Download==
 
==Download==
A list of position frequency matrices for 112 unique RBPs derived from eCLIP data: [download]
+
A list of position frequency matrices for 112 unique RBPs derived from eCLIP data: [http://zhanglab.c2b2.columbia.edu/data/mCross/eCLIP_mCross_PWM.tgz download here (199 kb)].
 
+
==Citation==
+
Feng et al. (2018), Modeling the in vivo specificity of RNA-binding proteins by precisely registering protein-RNA crosslink sites.  in submission.
+

Latest revision as of 21:10, 27 April 2024

Introduction

mCross is a computational tool for de novo motif discovery for RNA-binding proteins (RBPs) using CLIP data. mCross jointly models the sequence specificity and the protein-RNA crosslinking position in the RBP binding motif by leveraging crosslink sites mapped at single-nucleotide resolution by crosslinking-induced mutation site (CIMS) and crosslinking-induced truncation site (CITS) analysis.

More details about this work can be found in the following paper:

Feng et al. (2019), Modeling the in vivo specificity of RNA-binding proteins by precisely registering protein-RNA crosslink sites. Mol Cell. 74:1189-1204.E6.

Versions

  • v1.0.0 ( 05-07-2021 )
    • The initial public release

Software installation

Prerequisites

This software is implemented in Perl and R. We have tested it on RedHat Linux, and it is expected to work on most Unix-like systems, including Mac OS X. mCross requires the following to be installed:

  • R (version 3.0.0 or higher).
  • R packages: gplots, motifStack, ggplot2, gridExtra, cowplot, and getopt.
  • Perl

Installation through anaconda

The working environment is set up as described in the CTK documentation. Packages can also be installed into other conda environments if preferred.

Activate the working environment 'ctk' and install the packages by running the commands below:

myenv='ctk'
conda activate $myenv

conda install -y -c chaolinzhanglab mcross

If you would like to install a specific platform version (e.g., 'noarch'), use one of the commands below:

conda install -y -c chaolinzhanglab/noarch mcross

conda install -y -c bioconda mcross
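
After installation it can be useful to confirm that the required programs are reachable from the active environment. The sketch below is only an illustration: the helper name `check_tools` is hypothetical, and it assumes that the conda package places `mCross.pl` on your PATH, which is implied by the commands on this page but not stated.

```shell
# Report which of the given commands are missing from PATH.
# Returns 0 when all are found, 1 otherwise.
check_tools() {
    missing=0
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "found:   $tool"
        else
            echo "missing: $tool"
            missing=1
        fi
    done
    return "$missing"
}

# Example call after installation (tool names taken from this page):
#   check_tools perl Rscript mCross.pl
```

The function only checks that the commands can be found; it does not verify the R version or the R packages listed under Prerequisites.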

Manual installation

Install software packages described in prerequisites:

  • Download and install the czplib Perl library (refer to CTK documentation)
  • Download mCross from GitHub to a directory of your choice (as an example, we use ~/czlab_src/github/)
cd ~/czlab_src/github/
git clone https://github.com/chaolinzhanglab/mCross.git

Optionally, add the mCross directory to your $PATH environment variable.
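
As a minimal sketch of the PATH step, the snippet below prepends the clone directory only when it is not already present. The location `$HOME/czlab_src/github/mCross` follows the example clone command above and is an assumption, as is the idea that the scripts sit at the top level of the repository; adjust both to your setup.

```shell
# Assumed clone location, following the example above; change it to match your setup.
MCROSS_DIR="$HOME/czlab_src/github/mCross"

# Prepend the directory only if it is not already on PATH.
case ":$PATH:" in
    *":$MCROSS_DIR:"*) ;;
    *) PATH="$MCROSS_DIR:$PATH"; export PATH ;;
esac
```

To make the change permanent, the same lines can be placed in your shell startup file.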

Usage

You can run the following command to show descriptions of arguments, input and output format.

mCross.pl [options] <seq_file> <out_file or out_file_stem>

Arguments:

Argument Description
-l sequence extension around crosslink site
--seed top_nmer_file
--bg if top_nmer not provided, fg and bg file are used to get the list
-p pad the seed motif on both sides
-m number of mismatches allowed in the core motif
-N max number of seed words to search
--cluster-seeds cluster seed word
--xl-model crosslink model (1=simple(default), 2=nucleotide-specific)
--score-method ([log])/sqrt
--prefix prefix of the motif name
--single-output-file write all motifs to a single file
-c, cache dir path to write temporary file
-v, verbose verbose mode


mCross takes the sequences around CIMS/CITS sites as input and generates the binding motifs for each input. Please note that mCross accepts either a top n mer file or a background sequence fasta file from which the top n mer list is derived. The top n mer file is generated by counting the occurrence of each n mer in the sequences around peak regions or CIMS/CITS regions. It can be generated with our CTK toolkit (http://zhanglab.c2b2.columbia.edu/index.php/CTK_Documentation).


Peak calling and CIMS/CITS analysis

Please check http://zhanglab.c2b2.columbia.edu/index.php/ECLIP_data_analysis_using_CTK for details.

Get enriched top n mer from sequences around peak/CIMS/CITS regions

1. Extract fasta sequences from the BED files representing the peak or CIMS/CITS regions. Typically, we extend 50 bp both upstream and downstream of the peak center, or 10 bp around CIMS/CITS sites.

2. Generate background sequences. In the example below, we take the regions from 550 to 450 nt upstream and from 450 to 550 nt downstream of the peak center as background sequences.

3. Calculate the enrichment score of n mer. We set n equal to 7 as an example in the following command.

word_enrich.pl -w 7 -test binom -v Rbfox_R2.tag.uniq.peak.sig.PH10.center.100.normsk.fa  Rbfox_R2.tag.uniq.peak.sig.PH10.center.bg.100.normsk.fa Rbfox_R2.tag.uniq.peak.sig.PH10.center.100.w7.txt

4. Generate top n mer file as input of mCross.

gen_word_enrich_matrix.pl  peak.conf  Rbfox_R2.tag.uniq.peak.sig.PH10.center.100.w7.zcore.mat.txt

This script takes as input a configuration file with two tab-separated columns:

Rbfox_R2.tag.uniq.peak.sig.PH10.center.100.w7.txt \tab Rbfox_peak
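
Because the two columns must be separated by a real tab character, it can be safer to generate the configuration file than to type it. The following is a small sketch that writes `peak.conf` with the two values from the example above and checks the column count; the sanity check is an addition for illustration and is not part of the mCross package.

```shell
# Write the two-column, tab-separated configuration file.
# Column 1: word_enrich.pl output file; column 2: a label for that dataset.
printf '%s\t%s\n' \
    "Rbfox_R2.tag.uniq.peak.sig.PH10.center.100.w7.txt" \
    "Rbfox_peak" > peak.conf

# Sanity check: every line should have exactly two tab-separated fields.
awk -F '\t' 'NF != 2 { bad = 1 } END { exit bad + 0 }' peak.conf && echo "peak.conf OK"
```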

5. Extract the top n mer list.

Rscript topword.R Rbfox_R2.tag.uniq.peak.sig.PH10.center.100.w7.zcore.mat.txt Rbfox_peak_top7mer
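
The window arithmetic in steps 1 and 2 can be sketched with plain awk. This is an illustration only and not the CTK procedure: the function name `make_windows` is hypothetical, and it assumes a six-column BED file in which each line is a 1-nt, 0-based interval at the peak center or crosslink site. Windows that would start before position 0 are skipped, and chromosome ends are not checked. Extracting the fasta sequences from the resulting BED files is left to whichever tool you normally use.

```shell
# Usage: make_windows <center.bed> <fg.bed> <bg.bed> [ext]
# Assumes each input line is a 1-nt, 0-based BED interval
# (chrom, start, end, name, score, strand). ext defaults to 50.
make_windows() {
    awk -v fg="$2" -v bg="$3" -v ext="${4:-50}" 'BEGIN { FS = OFS = "\t" }
    {
        c = $2                                   # center position (0-based)
        # Foreground: ext nt on each side of the center (2*ext + 1 nt in total)
        if (c - ext >= 0)
            print $1, c - ext, c + ext + 1, $4, $5, $6 > fg
        # Background: 550..450 nt upstream and 450..550 nt downstream flanks
        if (c - 550 >= 0)
            print $1, c - 550, c - 450, $4 "_bgL", $5, $6 > bg
        print $1, c + 450, c + 550, $4 "_bgR", $5, $6 > bg
    }' "$1"
}
```

With `ext` set to 10, each foreground window is 21 nt long. Because the flanks are symmetric around the center, the strand does not change which regions are selected.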

Run mCross based on top n mer and CIMS/CITS sequences

With the top n mer file ready, mCross can take either CIMS or CITS sequences as input. Here we use CITS sequences as an example:

mCross.pl -l 10 -p 2 -N 10 -m 1 --cluster-seeds --seed Rbfox_peak_top7mer/top.Rbfox_peak.txt --prefix Rbfox --score-method sqrt Rbfox_R2.tag.uniq.rgb.clean.CITS.s30.singleton.21.normsk.fa Rbfox_peakvsCITS

mCross will write the discovered motifs in TRANSFAC format to the output folder.
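
For a quick look at a motif file without plotting it, the consensus sequence can be read off the count matrix. The sketch below assumes the generic TRANSFAC matrix layout (a header row beginning with `P0` or `PO`, then one row per position with counts for A, C, G and T). Whether mCross output follows exactly this layout, or carries extra columns, is an assumption you should check against your own files. The helper name `transfac_consensus` is hypothetical.

```shell
# Usage: transfac_consensus <motif.mat>
# Assumes a header row starting with "P0" (or "PO") with columns A C G T,
# followed by one row per position whose first field is the position number.
# The fourth base is reported as U because the motifs describe RNA.
transfac_consensus() {
    awk '
    $1 == "P0" || $1 == "PO" { in_matrix = 1; next }
    in_matrix && $1 ~ /^[0-9]+$/ {
        split("A C G U", base, " ")
        best = 1                                  # index of the largest count so far
        for (i = 2; i <= 4; i++)
            if ($(i + 1) > $(best + 1)) best = i  # ties keep the earlier base
        consensus = consensus base[best]
        next
    }
    in_matrix { in_matrix = 0 }                   # first non-numeric row ends the matrix
    END { print consensus }
    ' "$1"
}
```

The function prints one letter per matrix row, so it ignores how strongly each position is conserved; the logo described under Visualization remains the better summary.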

Alternatively, you can skip the steps above of finding top n mers by providing a background sequence file using argument --bg:

mCross.pl -l 10 -p 2 -N 10 -m 1 --cluster-seeds --bg Rbfox_R2.tag.uniq.rgb.clean.CITS.s30.singleton.bg.21.normsk.fa --prefix Rbfox --score-method sqrt Rbfox_R2.tag.uniq.rgb.clean.CITS.s30.singleton.21.normsk.fa Rbfox_CITS

Note:

1. In this case, top 7-mers will be found by comparing Rbfox_R2.tag.uniq.rgb.clean.CITS.s30.singleton.21.normsk.fa as foreground and Rbfox_R2.tag.uniq.rgb.clean.CITS.s30.singleton.bg.21.normsk.fa as background. The identified top 7mers will be used as seeds for de novo motif discovery.

2. We allow the users to specify the list of seed n-mers to provide the flexibility (e.g., the users prefer to determine seeds using sequences around peaks rather than crosslink sites).
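
The choice between the two modes can be wrapped in a small helper. The sketch below only prints the command line so that it can be inspected before running; it uses the same options as the two examples above. The function name `mcross_cmd` and the rule "use --seed when the seed file exists and is non-empty, otherwise fall back to --bg" are illustrative assumptions, not behavior of mCross itself.

```shell
# Usage: mcross_cmd <seed_file> <bg_fasta> <fg_fasta> <out_stem>
# Prints (does not run) an mCross.pl command line. Uses --seed when the seed
# file exists and is non-empty, and falls back to --bg otherwise.
mcross_cmd() {
    seed=$1; bg=$2; fg=$3; out=$4
    if [ -s "$seed" ]; then
        src="--seed $seed"
    else
        src="--bg $bg"
    fi
    echo "mCross.pl -l 10 -p 2 -N 10 -m 1 --cluster-seeds $src --prefix Rbfox --score-method sqrt $fg $out"
}
```

Once the printed command looks right, it can be pasted into the shell or piped to `sh`, provided the file names contain no spaces.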

Visualization

Run mCross2logo.R in the mCross package to generate the plot.

Rscript mCross2logo.R -i Rbfox.00.mat -o Rbfox.00.pdf -s rna -v
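
When a run produces several motifs, the same command can be generated for every matrix file. The sketch below prints one `mCross2logo.R` call per file and assumes that motif files end in `.mat`, as in the example above; the function name `logo_cmds` is hypothetical.

```shell
# Usage: logo_cmds <motif_dir>
# Prints one mCross2logo.R command per *.mat file in the directory.
# Pipe the output to `sh` to actually run the commands.
logo_cmds() {
    for mat in "$1"/*.mat; do
        [ -e "$mat" ] || continue            # skip when no .mat files exist
        echo "Rscript mCross2logo.R -i $mat -o ${mat%.mat}.pdf -s rna"
    done
}
```

Printing the commands first keeps the step reviewable. Note that piping to `sh` will not work for paths that contain spaces.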

mCrossDB

Web interface

Access mCrossBase>>>.


Download

A list of position frequency matrices for 112 unique RBPs derived from eCLIP data: download here (199 kb).