MRIN Documentation

From Zhang Laboratory

Jump to: navigation, search

Introduction

This document describes a computational method to assess a quantitative measure of mRNA integrity, named mRIN (mRNA integrity number) directly RNA-Seq data. This is done by quantitatively modeling of the 3' bias of read coverage profiles along each mRNA transcript. A per-sample summary mRIN is then derived as an indicator of mRNA degradation. This method has been used for systematic analysis of large scale RNA-Seq data of postmortem tissues, in which RNA degradation during tissue collection is particularly an issue.

Citation: Feng,H., Zhang,X., Zhang,C. †, 2015. mRIN for direct assessment of genome-wide and gene-specific mRNA integrity from large-scale RNA sequencing data. Nat. Comm. in press.

Versions

  • v1.2.0 ( 12-23-2015, current)
    • minor bug fixes
  • v1.1.0 ( 02-09-2015)
    • Implementation of additional filtering of genes used to calculate mRIN.
  • v1.0.0 ( 01-21-2015 )
    • The initial public release

Software installation

Prerequisites

The mRIN software package relies on other tools to align RNA-Seq reads to the reference genome and exon junctions. This can be done with a variety of available tools and our recommendation is the OLego and Quantas pipeline we developed. This pipeline has been extensively used in our work and comes with the flexibility to deal with single end or pair-end, stranded or unstranded libraries. Please see http://zhanglab.c2b2.columbia.edu/index.php/Quantas for more detail.

For this documentation, we assume the user are familiar with the command line environment of unix-like operating systems (e.g., linux or Mac OS X).

mRIN installation and preparation

mRIN is set of command line tools implemented in perl and R scripts that have been tested in unix-based environment.

It also depends on a perl library czplib, which implements various functions for genomic/bioinformatic analysis.

To install czplib, download the files from github.

Decompress it and move it to a place you like

$tar zxvf czplib.v1.0.x.tgz
$mv czplib /usr/local/lib

Add the library path to the environment variable, so perl can find it.

PERL5LIB=/usr/local/lib/czplib

One might need to install Math:CDF package, if not already. It is available at http://search.cpan.org/~callahan/Math-CDF-0.1/CDF.pm

We assume the mRIN source code is under ~/src/ for this documentation.

$tar zxvf mRIN.v1.x.x.tgz

Finally, to run the Rscript in the package, one will need to install getopt in R.

Gene models

One needs to have a list of gene models to be used for mRIN analysis. We used representative RefSeq transcripts, one for each gene, in our analysis. For genes with multiple RefSeq transcripts, the longest one is used. The representative RefSeq transcripts for mouse and human have been included as part of the software package (under the folder genes).

The users have the option to derive their own gene models, and they should be in the BED format with exon and intron structures properly specified.

Mapping of RNA-Seq reads

We assume RNA-Seq raw reads have been properly mapped to the reference genome and exon junctions, and the output of the alignment is stored in a BED file. This can be done using OLego or other software tools. See http://zhanglab.c2b2.columbia.edu/index.php/Quantas for more detail.

In the BED file, each read/fragment can have a weight if they are derived from paired-end data (e.g., probability of the fragment derived from different splice isoforms). Otherwise, each read is assumed to have a weight of 1 during read counting.

Generate the KS matrix

The first step of mRIN analysis is to generate the read coverage profile for each sample and save the output in a bedGraph format.

perl ~/src/mRIN/tag2profile.pl -big -exact -of bedgraph -v sample.bed sample.bedGraph

Alternatively, if gapless in the Quantas pipeline is used to process and combine paired end reads, each read/fragment is assigned a weight, which has to be considered during read counting. This can be done by including the "-weight" option.

perl ~/src/mRIN/tag2profile.pl -big -exact -of bedgraph -v -weight sample.bed sample.bedGraph


The read coverage profile will then be converted into a cumulative profile, from which a KS statistic is derived for each gene

mkdir cdf  #we will save them in a separate directory to be organized
mkdir ks #we will save them in a separate directory to be organized
perl ~/src/mRIN/gen_transcript_cdf.pl -v ~/src/mRIN/genes/refGene.rep.uniq.hg19.bed sample.bedGraph cdf/sample.cdf.bedGraph 
perl ~/src/mRIN/ks_test_uniform.pl -v cdf/sample.cum.bedGraph  ks/sample.ks.txt

The command lines above use representative RefSeq transcripts in human (refGene.rep.uniq.hg19.bed) for an example.

The cumulative read profile at the nucleotide resolution (sample.cdf.bedGraph) can be quite big. Alternatively, the two command lines can be combined into one line, so that one does not need to store the cumulative read profile for the sake of space.

perl ~/src/mRIN/gen_transcript_cdf.pl -v ~/src/mRIN/refGene.rep.uniq.bed  sample.bedGraph - | perl ~/src/mRIN/ks_test_uniform.pl -v - ks/sample.ks.txt

After the ks statistics are generated for all files under the ks folder, one needs to create a configuration file, which has two columns separated by tab, with the file name in the first column and the sample name in the second column.

#an example like this
sample1.ks.txt<tab>sample1
sample2.ks.txt<tab>sample2
...

The following script will combine the KS statistics for individual samples into a KS matrix:

perl ~/src/mRIN/gen_ks_matrix.pl -v -base ./ks --min-avg-cov 2 -v all_samples.conf all_samples.KS.mat.txt

Note the option --min-avg-cov specifies the minimal average read coverage required to calculate the KS statistics. For genes/samples below the threshold, a missing value will be assigned.

Estimate sample mRIN and statistical significance

This step is processed by a R script, which depends on R and Rlibrary getopt. For mRIN calculation, you can run

Rscript ~/src/mRIN/cal_mrin.R -k all_samples.KS.mat.txt -m out.mRIN.txt -G out.GIS.txt -v

In the command line above, out.mRIN.txt and out.GIS.txt are the sample mRINs including corresponding z-score and p-value and GIS scores, respectively. If you would like to perform additional filtering based on gene expression values, the gene RPKM values can be provided. An example is shown in the following command.

Rscript ~/src/mRIN/cal_mrin.R -k all_samples.KS.mat.txt -x all_samples.RPKM.mat.txt -r 2 -s 0.05 -e 0.5 -b -m out.mRIN.txt -G out.GIS.txt -v

Note the option -r(gene RPKM cutoff) must be set if option -x(gene expression matrix) is set. With this command, we removed 5% of genes with smallest standard deviation of KS statistic with option -s. Genes with strong 5' bias were removed with option -b. Genes with RPKM > 2 and with mKS esimated in > 50% of the samples were kept to calculate mRIN with option -r and -e.

These parameters can be adjusted for specific datasets although in practice, we found the the results are quite robust.