Difference between revisions of "Quantas Documentation"

From Zhang Laboratory

Jump to: navigation, search
(Created page with "==Important assumptions== # This document is for paired-end (PE) mRNA-seq libraries that has NO strand information, and the two mates are on the opposite strand (the standard ...")
 
(Through anaconda)
 
(87 intermediate revisions by 3 users not shown)
Line 1: Line 1:
==Important assumptions==
+
=Introduction=
# This document is for paired-end (PE) mRNA-seq libraries that has NO strand information, and the two mates are on the opposite strand (the standard Illumina protocol).  
+
This document describes the analysis of RNA-Seq data using [[OLego]] (for alignment), gapless (for inference of transcript structure), and countit (for quantification of gene expression and alternative splicing).  gapless and countit are collectively named Quantas.  The pipeline has been tested quite extensively, but the document is still at its relatively early stage.  Any comments are welcome and can be directed to Chaolin Zhang.
# It is IMPORTANT to note that read1 and read2 in the same pair must have the same read name.
+
 
# The focus of this document is to quantify splicing and gene expression change of annotated transcripts, although the pipeline can be expanded to examine novel transcripts as well.
+
=Versions=
 +
*v1.0.9 (12-01-2015), current
 +
**minor change of output format of expression summary files of individual samples
 +
**minor bug fixes
 +
*v1.0.8 (03-10-2015)
 +
**improved efficiency and memory usage in junction tag counting
 +
**minor bug fixes
 +
*v1.0.7 (08-14-2014)
 +
**enhanced support for different stranded library types
 +
**included additional filtering metric in differential expression analysis
 +
*v1.0.6 (04-22-2014)
 +
**improved memory usage/speed in splicing quantification
 +
**included reports of read counts of individual junctions for tandem cassette (taca)
 +
**fixed a rare bug fix in FDR calculation for diff splicing
 +
**included a glm for differential splicing test
 +
*v1.0.4 (11-10-2013)
 +
**added several scripts into the package
 +
*v1.0.3 (06-09-2013)
 +
**bug fixes
 +
*v1.0.2 (05-31-2013)
 +
**add all scripts required for polyA-seq analysis; will be documented later
 +
*v1.0.1 ( 5-23-2013 )
 +
**expanded methods to test differential splicing
 +
**included script to test differential expression
 +
 
 +
*v1.0.0 ( 04-8-2013 )
 +
**The initial Public release
 +
 
 +
=Important assumptions=
 +
* This document is for paired-end (PE) mRNA-seq libraries that have NO strand information, and the two mates are on the opposite strands (the standard Illumina protocol).  
 +
* It is IMPORTANT to note that read1 and read2 in the same pair must have the same and unique read name.
 +
* The focus of this document is to quantify splicing and gene expression change of annotated transcripts.  We are in the process of extending it to analysis of novel transcripts and alternative splicing events.
  
 
Single-end (SE) mRNA-seq libraries or strand-specific SE or PE RNA-seq libraries can also be analyzed by the same packages, but the parameters are slightly different.
 
Single-end (SE) mRNA-seq libraries or strand-specific SE or PE RNA-seq libraries can also be analyzed by the same packages, but the parameters are slightly different.
  
==Mapping RNA-seq reads to the reference genome and exon junctions==
+
=Software installation=
  
===OLego installation and preparation===
+
==OLego installation and preparation==
The program we use for read mapping is OLego, a program we developed recently and available at http://ngs-olego.sourceforge.net/.  You can download the source codes or precompiled binaries and the documentation.  The core algorithm is relatively stable now, but minor revisions are expected in the near future.  If you prefer to download the source code, it should be fairly easy to compile.
+
The program we use for read mapping is OLego, a program we developed recently and made available at http://zhanglab.c2b2.columbia.edu/index.php/OLegoThere you can download the source codes or precompiled binaries and access the more detailed documentation.   
  
The program takes i) '''read file(s)''' in either FASTA or FASTQ format, together with ii) the '''indexed genome''' and iii) (optionally) a database of known '''exon junctions'''
+
The program takes  
 +
* '''read file(s)''' in either FASTA or FASTQ format  
 +
* the '''indexed genome'''
 +
*(optionally) a database of known '''exon junctions'''
  
For the reference genome, I would recommend that one should remove random chromosomes as well as the mitochondria genome.  The program required to build the index is included in the package.  Assuming one has concatenated all chromosomes in a FASTA file mm9.fa.  The following command will index the genome.
+
For the reference genome, it is recommended that one should remove random chromosomes as well as the mitochondria genome.  The program required to build the index is included in the package.  Assuming one has concatenated all chromosomes in a FASTA file mm10.fa (note that the name of each chromosome is '''chrN''' instead of '''N''' to be consistent with the other library files).  The following command will index the genome.
  
  olegoindex -a bwtsw mm9.fa
+
  olegoindex -a bwtsw mm10.fa
  
This will produce several files with a prefix mm9.fa.  Move all the output files to a directory such as genome_index/olego/mm9
+
This will produce several files with a prefix mm10.fa.  Move all the output files to a directory such as genome_index/olego/mm10
  
A comprehensive database of exon junctions (mm9, I assume this is what you need) can be downloaded from http://geller.rockefeller.edu/Quantas/data/mm9.intron.hmr.bed.gz (5.1 Mb).  Download and decompress this file.
+
A comprehensive database of exon junctions (mm10) can be downloaded from http://zhanglab.c2b2.columbia.edu/data/OLego/mm10.intron.hmr.bed.gz.  Download and decompress this file.
  
===Mapping===
+
See the [http://zhanglab.c2b2.columbia.edu/index.php/OLego OLego website] for more details.
Now the alignment. Assuming one has a paired end sequencing lane of 100 nt x2 reads in sample.r1.fa and sample.r2.fa for the first and second read mates, respectively.
+
  
olego -v -t 16 -r olego_dir/models/mm9.cfg -j mm9.intron.hmr.bed -o sample.r1.sam genome_index/olego/mm9.fa sample.r1.fa
+
==Quantas installation and preparation==
# do the mapping with 14nt seed, allowing a total of 4 mismatches or indels (default), with 16 cpu cores, output to sample.r1.sam
+
+
olego -v -t 16 -r olego_dir/models/mm9.cfg -j mm9.intron.hmr.bed -o sample.r2.sam genome_index/olego/mm9.fa sample.r2.fa
+
# do the same thing for the other file
+
  
No matter how many threads are used, the memory footprint is in general < 4Gb.  In terms of speed, our estimate is that it can align 200 M 100x2 PE reads in ~29 hrs with 8 cores (2.0G Hz).
+
Quantas includes a package to infer transcript structure from paired-end RNA-seq data (gapless), and a package to count RNA-seq reads for each alternative splicing isoform (countit).  It also depends on a perl library czplib, which contains various functions for genomic/bioinformatic analysis.
  
This will map reads to the exons, known junctions as well as novel junctions.  By default OLego searches with perfect matches in seeds of 14 nt, allowing for a maximum of 4 substations or indels along the whole read (which should be good for 100 nt illumina reads).  For known junctions, OLego requires 5 or more nt overlap on either side of the junction.  For novel junctions, OLego requires 8 nt overlap, and considers only GT/AG splice sites, to improve the accuracy.
+
==Through anaconda==
  
===Convert SAM to BED file===
+
Below are the installation instructions for Quantas, CZPLIB (dependency) and OLego (dependency) through Anaconda.
After the alignment step, the SAM files are converted into BED files using the script <tt>sam2bed.pl</tt> included in the package.
+
perl sam2bed.pl --uniq -v sample.r1.sam sample.r1.uniq.bed
+
perl sam2bed.pl --uniq -v sample.r2.sam sample.r2.uniq.bed
+
  
This will keep only reads that are unambiguously mapped to the reference genome (single hits), which are identified by the tag “XT:A:U” in the SAM file.
+
<pre>
 +
conda config --env --append channels conda-forge
 +
conda config --env --append channels bioconda
  
 +
conda install --yes -c chaolinzhanglab quantas
 +
</pre>
  
==Inferring transcript structure between paired end reads==
 
  
Paired-ends reads could be located in different exons and since the reads are relatively short, the transcript structure between the sequenced ends could be ambiguous when alternative splicing occurs. The missing information, leveraging on the size constraints of each cDNA fragment and prior isoform abundance estimated from directly mapped junction reads, is inferred using a simple Bayes model.
+
* Install the perl library files czplib
 +
[https://github.com/chaolinzhanglab/czplib Download from github]
  
===Install gapless===
+
Decompress it and move it to a place you like
This is implemented in the <tt>gapless</tt> package, available at http://geller.rockefeller.edu/Quantas/gapless_20120612.tgz.  To setup this package properly, one needs to add the library files to the perl library search path, such as adding the following line in their login file .bash_profile.
+
  
PERL5LIB=/home/zhangc/gapless/lib
+
<pre>
 +
$tar zxvf czplib.v1.0.x.tgz
 +
$mv czplib /usr/local/lib
 +
</pre>
  
 +
Add the library path to the environment variable, so perl can find it. 
 +
<pre>
 +
PERL5LIB=/usr/local/lib/czplib
 +
</pre>
  
 
One might need to install Math:CDF package, if not already.  It is available at http://search.cpan.org/~callahan/Math-CDF-0.1/CDF.pm
 
One might need to install Math:CDF package, if not already.  It is available at http://search.cpan.org/~callahan/Math-CDF-0.1/CDF.pm
  
One also needs to download a database of splicing isoforms (more accurately, all exon trios, since we care about local AS events here) from http://geller.rockefeller.edu/Quantas/data/mm9.exon.trio.nr.bed.gz (16Mb).
+
*Download gapless & countit [Quantas]
  
===Run gapless===
+
[https://github.com/chaolinzhanglab/quantas Download from github]
Then run
+
  
perl gapless_huge_file.pl --split-size 20000000 -isoform mm9.exon.trio.nr.bed -E 400 -big --print-singleton -o sample_out -v sample.r1.uniq.bed sample.r2.uniq.bed
+
<pre>
 +
$tar zxvf Quantas.v1.x.x.tgz
 +
</pre>
  
Here, again it is IMPORTANT to note that read1 and read2 in the same pair must have the same read name.
 
  
-E 400 means large exons with 400 nt or more are used to estimate the distribution of insert cDNA fragment size in the library (adaptor sequences are not included in calculation here).
+
One also needs to download annotation files used by Quantas at
  
This step will try to combine read1 and read2 in the same pair if possible (i.e., they are possibly sampled from the same isoform in our database), and output two files in the <tt>sample_out</tt> dir, a bed file <tt>pair.gapless.bed</tt> with combined reads or fragments, and a text file <tt>size_dist.txt</tt> with the estimated distribution of insert cDNA fragment size in the library. When alternative splicing is observed between a pair, each possible isoform, or fragment, is assigned a probability score, so that one fragment can potentially have multiple lines.   This score was recorded in the 5th column of the BED file.
+
*mouse (mm10): http://zhanglab.c2b2.columbia.edu/data/Quantas/data/mm10.tgz (29Mb)
 +
*human (hg19): http://zhanglab.c2b2.columbia.edu/data/Quantas/data/hg19.tgz (37M)
  
==Quantify alternative splicing ==
+
(A previous version of the two packages missed some files for gene expression analysis, which is corrected now - 02/06/2014)
  
===Install countit===
+
=Mapping RNA-seq reads to the reference genome and exon junctions=
Quantification of alternative splicing and gene expression is done using the <tt>countit</tt> package, available at http://geller.rockefeller.edu/Quantas/counit_20120613.tgz.  To setup this package properly, one needs to add the library files to the perl library search path, such as the following in their login file .bash_profile, if one has not done this as described above.  These library files are the same as those described above in <tt>gapless</tt>, so there is no need to duplicate.
+
  
PERL5LIB=/home/zhangc/countit/lib
+
==Mapping==
 +
Assuming one has a paired-end sequencing lane of 100 nt x2 reads in sample.r1.fa and sample.r2.fa for the first and second read mates, respectively, derived from mouse. 
  
 +
olego -v -t 16 -r olego_dir/models/mm.cfg -j mm10.intron.hmr.bed -o sample.r1.sam genome_index/olego/mm10.fa sample.r1.fa
 +
# do the mapping with 15nt seed with 1nt overlap, allowing a total of 4 mismatches or indels (default), with 16 cpu cores, output to sample.r1.sam
 +
 +
olego -v -t 16 -r olego_dir/models/mm.cfg -j mm10.intron.hmr.bed -o sample.r2.sam genome_index/olego/mm10.fa sample.r2.fa
 +
# do the same thing for the other file
  
One also needs to download a database of different types of alternative splicing events from http://geller.rockefeller.edu/Quantas/data/mm9.ase.tar.gz (1.6Mb).
+
No matter how many threads (-t 16) are used, the memory footprint is in general < 4Gb.  That said, the exact amount of memory also depends on the seed size, the repetitive nature of reads and the exon junction database.
  
There are five files inside the tarball:
+
This will map reads to the exons, known junctions as well as novel junctions.  By default OLego searches with perfect matches in seeds of 15 nt with one 1-nt overlap.  The maximal mismatches or indels along the whole read allowed in a match is determined by the read length(4 nt for 100 nt illumina reads).  For known junctions, OLego requires 5 or more nt overlap on either side of the junction.  For novel junctions, OLego requires 8 nt overlap, and considers only GT/AG splice sites, to improve the accuracy.
  
# <tt>Mm.seq.all.cass.chrom.can.bed</tt>: cass or cassette exons
+
==Convert SAM to BED file (single-end data ONLY)==
# <tt>Mm.seq.all.taca.chrom.can.bed</tt>: taca or tandem cassette exons
+
# <tt>Mm.seq.all.alt5.chrom.can.bed</tt>: alt5 or alternative 5' ss
+
# <tt>Mm.seq.all.alt3.chrom.can.bed</tt>: alt3 or alternative 3' ss
+
# <tt>Mm.seq.all.mutx.chrom.can.bed</tt>: mutx or mutually exclusive exons
+
# <tt>Mm.seq.all.iret.chrom.can.bed</tt>: iret or retained introns
+
  
===Run countit===
+
'''Important note''': this should be '''ONLY done for single end reads'''. For paired end reads, gapless will take sam files and output BED files.
For each sample, one now needs to separate genomic reads and junction reads:
+
awk '{if(NF<=9 || $10==1) {print $0}}' sample_out/pair.gapless.bed > sample.combined.genome.bed
+
awk '{if(NF==12 && $10>1) {print $0}}' sample_out/pair.gapless.bed > sample.combined.junction.bed
+
  
For each type of AS event, run summarize_splicing.pl as follows:
+
After the alignment step, the SAM files are converted into BED files using the script <tt>sam2bed.pl</tt> included in the OLego package.
 +
perl sam2bed.pl --uniq -v sample.sam sample.uniq.bed
  
perl summarize_splicing.pl -type cass -big -v -weight -c ./cache Mm.seq.all.cass.chrom.can.bed sample.combined.genome.bed sample.combined.junction.bed sample.cass.count.txt
+
This will keep only reads that are unambiguously mapped to the reference genome (single hits), which are identified by the tag “XT:A:U” in the SAM file.
perl summarize_splicing.pl -type taca -big -v -weight -c ./cache Mm.seq.all.taca.chrom.can.bed sample.combined.genome.bed sample.combined.junction.bed sample.taca.count.txt
+
 
  perl summarize_splicing.pl -type alt5 -big -v -weight -c ./cache Mm.seq.all.alt5.chrom.can.bed sample.combined.genome.bed sample.combined.junction.bed sample.alt5.count.txt
+
=Inferring transcript structure between paired end reads (paired-end data ONLY)=
  perl summarize_splicing.pl -type alt3 -big -v -weight -c ./cache Mm.seq.all.alt3.chrom.can.bed sample.combined.genome.bed sample.combined.junction.bed sample.alt3.count.txt
+
 
  perl summarize_splicing.pl -type mutx -big -v -weight -c ./cache Mm.seq.all.mutx.chrom.can.bed sample.combined.genome.bed sample.combined.junction.bed sample.mutx.count.txt
+
Paired-ends reads could be located in different exons and since the reads are relatively short, the transcript structure between the sequenced ends could be ambiguous when alternative splicing occurs. The missing information, leveraging on the size constraints of each cDNA fragment and prior isoform abundance estimated from directly mapped junction reads, is inferred using a simple Bayes model.
perl summarize_splicing.pl -type iret -big -v -weight -c ./cache Mm.seq.all.iret.chrom.can.bed sample.combined.genome.bed sample.combined.junction.bed sample.iret.count.txt
+
 
 +
<pre>
 +
perl gapless/gapless_huge_file.pl -v -sam -uniq --split-size 10000000 -isoform mm10.exon.trio.hmr.nr.bed -E 400 -big --print-singleton -o gapless_out sample.r1.sam sample.r2.sam
 +
</pre>
 +
 
 +
This script takes the sam files of read1 and read2 as input (-sam). Here, again it is IMPORTANT to note that read1 and read2 in the same pair must have the same read name and order.
 +
 
 +
-uniq means only uniquely mapped reads (single hits) are kept. They are identified by the tag “XT:A:U” in the SAM file.
 +
-E 400 means large exons with 400 nt or more are used to estimate the distribution of insert cDNA fragment size in the library (adaptor sequences are not included in calculation here).
 +
 
 +
'''Important: from v1.0.7, an additional option --library-type is introduced to indicate different library types.  This is to replace the original option --ss.'''
 +
<pre>
 +
unstranded:  unstranded libraries assuming the two reads are on the opposite strand
 +
stranded-as: read1 is antisense and read2 is sense
 +
stranded-sa: read1 is sense and read2 is antisense.
 +
</pre>
 +
 
 +
This step will try to combine read1 and read2 in the same pair if possible (i.e., they are possibly sampled from the same isoform in our database), and output two files in the ''gapless_out'' dir, a bed file ''pair.gapless.bed'' with combined reads or fragments, and a text file ''size_dist.txt'' with the estimated distribution of insert cDNA fragment size in the library.  When alternative splicing is observed between a pair, each possible isoform, or fragment, is assigned a probability score, so that one fragment can potentially have multiple lines.   This score is recorded in the 5th column of the BED file.
 +
 
 +
=Quantify alternative splicing =
 +
 
 +
==Run countit==
 +
 
 +
First, one needs to prepare a configuration file cz_annotation.loc that specifies the location of AS database files. These files can be downloaded from http://zhanglab.c2b2.columbia.edu/data/Quantas/data/mm10.tgz, if not already.
 +
 
 +
<pre>
 +
mm10    as annotation/Mm.seq.all.cass.chrom.can.bed   cass
 +
mm10    as  annotation/Mm.seq.all.taca.chrom.can.bed   taca
 +
mm10    as annotation/Mm.seq.all.alt5.chrom.can.bed   alt5
 +
mm10    as  annotation/Mm.seq.all.alt3.chrom.can.bed   alt3
 +
mm10    as annotation/Mm.seq.all.mutx.chrom.can.bed   mutx
 +
mm10    as  annotation/Mm.seq.all.iret.chrom.can.bed   iret
 +
</pre>
 +
 
 +
A template of this file (mm10.conf) is included in the archive above. The third column of this file has to be modified to point to the location of the other AS annotation files in the archive.
 +
 
 +
<pre>
 +
perl countit/summarize_splicing_wrapper.pl -c ./cache -v -big -weight -conf mm10.conf -dbkey mm10 -cass -taca -alt5 -alt3 -mutx -iret gapless_out/pair.gapless.bed countit_out
 +
</pre>
 +
 
 +
Note that the <tt>-c</tt> option specifies temporary output dir, which should have sufficient space. The <tt>-weight</tt> option is specified because each fragment from gapless now has a probability score, and '''this option should NOT be used for single end RNA-Seq data'''.
  
Note that the <tt>-cache</tt> option specifies temporary output dir, and the <tt>-weight</tt> option is specified because each fragment from gapless now has a probability score.  In the output file, each line is an alternative splicing event in comparison of two isoforms.  The first 10 columns in output file for each type of AS are the same:  
+
In the output directory, there is a file for each AS event (e.g., cass.count.txt).  In these files, each line is an alternative splicing event in comparison of two isoforms.  The first 10 columns in output file for each type of AS are the same:  
 
  chrom chromStart chromEnd name score strand type isoformIDs isoform1Tags isoform2Tags
 
  chrom chromStart chromEnd name score strand type isoformIDs isoform1Tags isoform2Tags
  
The score column above is meaningless, and the first 6 columns provide coordinate information so that one can quickly go to the region in UCSC genome browser (see visualization below).
+
The score column above is reserved, and the first 6 columns provide coordinate information so that one can quickly go to the region in UCSC genome browser (see visualization below).
  
 
The last three columns are the two isoforms compared, the read count for isoform1 and the read count for isoform2, which are the most critical for quantification of splicing level.
 
The last three columns are the two isoforms compared, the read count for isoform1 and the read count for isoform2, which are the most critical for quantification of splicing level.
Line 115: Line 190:
 
  iret: retainedIntronTags junctionTags
 
  iret: retainedIntronTags junctionTags
  
In case one would like to run all types of AS events in one command line, another wrapper is provided.
+
In the command line above, you can also choose to run only certain types of AS events (e.g. -cass for cassette exons).
  
First, one needs to prepare a configuration file cz_annotation.loc that specifies the location of AS database files
+
==Summarize alternative splicing results and perform statistical tests==
#cz_annotation.loc, modify the third column accordingly
+
mm9 as annotation/Mm.seq.all.cass.chrom.can.bed cass
+
mm9 as annotation/Mm.seq.all.taca.chrom.can.bed taca
+
mm9 as annotation/Mm.seq.all.alt5.chrom.can.bed alt5
+
mm9 as annotation/Mm.seq.all.alt3.chrom.can.bed alt3
+
mm9 as annotation/Mm.seq.all.mutx.chrom.can.bed mutx
+
mm9 as annotation/Mm.seq.all.iret.chrom.can.bed iret
+
  
Then run
+
After one runs <tt>OLego</tt>, <tt>gapless</tt> and <tt>countit</tt> for each sample, the final step of AS analysis is to test differential splicing between two groups of samples.  For this purpose one needs to prepare a configuration file to specify which sample belongs to each group.
  
perl summarize_splicing_wrapper.pl -big -weight -conf cz_annotation.loc -dbkey mm9 -c ./cache -cass -taca -alt5 -alt3 -mutx -iret -v sample_out/pair.gapless.bed sample.AS.count.txt
+
<pre>
 +
#dataset.group.conf
 +
sample1_countit_out<tab>group1
 +
sample2_countit_out<tab>group1
 +
sample3_countit_out<tab>group1
 +
sample4_countit_out<tab>group2
 +
sample5_countit_out<tab>group2
 +
sample6_countit_out<tab>group2
 +
</pre>
  
The output file is a concatenation of multiple output files described above, except the header lines are removed.   
+
The first column of the file above is the path to the output directory of countitOne can then run the following line for cassette exons (and similarly for other types of AS events):
  
In the command line above, you can also choose not to run all types of AS events.
+
<pre>
 +
perl countit/test_splicing_diff.pl -type cass -v --min-cov 20 --id2gene2symbol annotation/Mm.seq.all.AS.chrom.can.id2gene2symbol dataset.group.conf dataset.diff.txt
 +
</pre>
 +
 +
This script will pool the number of reads for each isoform from each of the individual samples in the same group.  It will then perform Fisher's exact test using the total number of reads supporting each isoform, calculate exon inclusion level and the change between the two groups, etc.
  
===Summarize alternative splicing results and perform statistical tests===
+
Finally, it will filter AS events and consider those events with sufficient read coverage for multiple test correction (using the Benjamini method). 
  
After one runs <tt>OLego</tt>, <tt>gapless</tt> and <tt>countit</tt> for each sample, the output file sample.AS.count.txt is combined together, right now in excel, with statistical analysis performed in R.  For statistical analysis, the read count in each sample of the same group is pooled together. 
+
The option  (--min-cov 20) means the following requirement:
 +
<pre>
 +
junction_in + junction_skip≥20 AND junction_group1+junction_group2≥20
 +
</pre>
  
Eventually, one get an Excel spreadsheet with the following columns (using cassette exons as an example, a template can be downloaded from http://geller.rockefeller.edu/Quantas/data/example.cass.summary.xlsx):
+
The output of the file has the following columns:
  
columns A-G: LineNo chrom chromStart chromEnd name score strand
+
<pre>
columns H-I: gene_id gene_symbol type isoformIDs #Gene ID can be obtained from the first number after "-" in the name column
+
column 1: gene, Entrez gene id and gene symbol
columns L-O: group1Isoform1Tags group1Isoform2Tags group2Isoform1Tags group2soform2Tags
+
columns P-U: group1InclusionJuncction1Tags group1InclusionJunction2Tags group1SkippingJunctionTags group2InclusionJuncction1Tags group2InclusionJunction2Tags group2SkippingJunctionTags
+
columns V-AC: coverage odd_ratio P_fisher Rank Benjamini group1Inc group2Inc dI
+
  
Here we assume that we are comparing two groups of samples.  Column L-U are sums of all samples in the same group.  
+
columns 2-9: chrom  chromStart  chromEnd    name    score  strand  type    isoformIDs: the same as described in the output of countit 
 +
column 10: coverage    I_g1(group1) I_g2(group2) dI_g1_vs_g2 pvalue FDR
 +
</pre>
  
<tt>gene_id</tt> can be extracted from the <tt>name</tt> column, the first number after "-".  <tt>gene_symbol</tt> can gene be looked up using the database available at http://geller.rockefeller.edu/Quantas/data/gene2symbol.mm
+
In general, we require FDR<0.05 (or 0.01), |dI|>0.1 (or 0.2) to call significant splicing changes.
  
<tt>coverage</tt> in column V is 1 if junction_in + junction_skip≥20 AND junction_group1+junction_group2≥20, or 0 otherwise, as specified in the formula in the template.  Only lines with coverage of 1 are used for statistical test, to minimize multiple testing.
 
  
columns L-O are now pasted to a text file named data.txt to be loaded for Fisher's exact test in R, a script available from http://geller.rockefeller.edu/Quantas/fisher.test.batch.RThis will generate <tt>odds_ratio</tt> and <tt>P_fisher</tt> to be pasted to columns W and X.  
+
A caveat of the Fisher exact test described above is that it does not take into account the within-group variation. Particular caution is warranted for samples with a substantial amount of heterogeneity (e.g. see visualization below)We are working on this to address the issue.
  
Then rows are sorted first by coverage (descending), and then by P_fisher (ascending), and Benjamini FDR (column Z) is calculated by (P_fisher) *(#Lines with coverage=1) / (Rank of the line in column Y).
+
=Quantify gene expression=
  
Change of exon inclusion or dI in column AC is calculated using junction reads, as specified in the formula in the template.
+
This is also done with the <tt>countit</tt> package.
  
In general, we require FDR<0.05 (or 0.01), coverage = 1, |dI|>0.1 (or 0.2) to call significant splicing change.
+
perl countit/summarize_expression_wrapper.pl -big --cache ../cache -exon mm10.exon.uniq.core.bed -e2g mm10.exon.uniq.core.id2gene2symbol -weight -v sample_out/pair.gapless.bed sample.expr.txt
  
For other types of AS events, some extra filtering criteria might be needed, and we can discuss when we get to the point.
+
Again, the <tt>-weight</tt> option is specified because each fragment from gapless now has a probability score.  This option should not be used for single end RNA-seq data.
 
+
 
+
Two notes:
+
 
+
# A caveat of the Fisher exact test described above is that it does not take into account the within-group variation.  This is in general a not big issue for lab mice if RNA-seq libraries are prepared properly because the heterogeneity between lab mice is minimal.  So far I have not seen tools that deal with this issue effectively, but this could be due to the fact that I did not track RNA-seq literature very closely. 
+
#Also, a recent paper described a method [http://nar.oxfordjournals.org/content/early/2012/01/19/nar.gkr1291.short?rss=1 MATS] based on MCMC to estimate p-values and claims that the performance is somewhat better than Fisher's exact test in simulation.  I have not tested this, but if this is true, the test can be swathed very easily.
+
 
+
==Quantify gene expression==
+
 
+
This is also done with the <tt>countit</tt> package, but first one needs to download gene definition files from http://geller.rockefeller.edu/Quantas/data/mm9.gene.tar.gz.  There are two files in the tarball that will be used in the next step.
+
 
+
perl summarize_expression_wrapper.pl -big --cache ../cache -exon mm9.exon.uniq.core.bed -e2g mm9.exon.uniq.core.id2gene2symbol -weight -v sample_out/pair.gapless.bed sample.expr.txt
+
 
+
Again, the <tt>-weight</tt> option is specified because each fragment from gapless now has a probability score.
+
  
 
The output is like this
 
The output is like this
Line 181: Line 248:
 
Here exon_len is the total exonic length of the gene, and tag_num is the total number of exonic reads (or more precisely fragments) in the gene.  RPKM is always calculated using total number of exonic tags summed over all genes in the output file as a denominator, to avoid the complication of rRNA reads, etc.
 
Here exon_len is the total exonic length of the gene, and tag_num is the total number of exonic reads (or more precisely fragments) in the gene.  RPKM is always calculated using total number of exonic tags summed over all genes in the output file as a denominator, to avoid the complication of rRNA reads, etc.
  
 +
This should be done for all samples. 
  
This was done for all samples.  Then an R package edgeR was used to do statistical test of differential gene expression.  An example script can be downloaded from: http://geller.rockefeller.edu/Quantas/expr.edgeR.tagwise.R, which will report log2FC, p-value, and benjamini FDR
+
Differential gene expression analysis can be performed with different tools, such as edgeR (http://www.bioconductor.org/packages/2.12/bioc/html/edgeR.html). The package includes a script for this purpose. One needs to prepare a configuration file to specify which sample belongs to each group.
  
RPKM for each group needs to be recalculated after combining all samples in the same group.
+
<pre>
 +
#expr.diff.conf
 +
sample1.expr.txt<tab>group1
 +
sample2.expr.txt<tab>group1
 +
sample3.expr.txt<tab>group1
 +
sample4.expr.txt<tab>group2
 +
sample5.expr.txt<tab>group2
 +
sample6.expr.txt<tab>group2
 +
</pre>
  
 +
<pre>
 +
#this script will invoke R and requires edgeR to be installed
 +
perl countit/test_expr_diff.pl -MAplot -v expr.diff.conf group1_vs_group2.expr.diff.txt
 +
</pre>
 
We typically call significant differential expression by requiring FDR<0.01 (or 0.001), |log2FC|>1 (or 0.6 or 2), and max(group1RPKM, group2RPKM) in the top half.
 
We typically call significant differential expression by requiring FDR<0.01 (or 0.001), |log2FC|>1 (or 0.6 or 2), and max(group1RPKM, group2RPKM) in the top half.
  
==Generate bedGraph files for visualization in UCSC genome browser==
+
=Generate bedGraph files for visualization in UCSC genome browser=
  
 
In most cases, bedGraph format provides excellent visualization, especially for alternative splicing events.  The BED file after gapless analysis can be converted into a bedGraph file with a script in the <tt>countit</tt> package.
 
In most cases, bedGraph format provides excellent visualization, especially for alternative splicing events.  The BED file after gapless analysis can be converted into a bedGraph file with a script in the <tt>countit</tt> package.
  
  perl tag2profile.pl -big -weight -exact -of bedgraph -n "Sample_bed_graph" -v sample_out/pair.gapless.bed sample.bedGraph
+
  perl countit/tag2profile.pl -v -big -weight -c ./cache -exact -of bedgraph -n "Sample_bed_graph" sample_out/pair.gapless.bed sample.bedGraph
  
 
In some cases, one might also want to show the summarized number of junction reads for each exon junction.  This can be done by
 
In some cases, one might also want to show the summarized number of junction reads for each exon junction.  This can be done by
  perl tag2junction.pl -v -big -weight -c ./cache sample.combined.junction.bed sample.junction.count.bed
+
  perl countit/tag2junction.pl -v -big -weight -c ./cache sample.combined.junction.bed sample.junction.count.bed
  
 
The 5th column now contain the read count for each junction, which is also shown as part of the name column.  This file can also be loaded to the UCSC genome browser for visualization.
 
The 5th column now contain the read count for each junction, which is also shown as part of the name column.  This file can also be loaded to the UCSC genome browser for visualization.
 +
 +
For single end RNA-Seq, the -weight option should not be used.

Latest revision as of 17:29, 7 October 2022

Introduction

This document describes the analysis of RNA-Seq data using OLego (for alignment), gapless (for inference of transcript structure), and countit (for quantification of gene expression and alternative splicing). gapless and countit are collectively named Quantas. The pipeline has been tested quite extensively, but the document is still at its relatively early stage. Any comments are welcome and can be directed to Chaolin Zhang.

Versions

  • v1.0.9 (12-01-2015), current
    • minor change of output format of expression summary files of individual samples
    • minor bug fixes
  • v1.0.8 (03-10-2015)
    • improved efficiency and memory usage in junction tag counting
    • minor bug fixes
  • v1.0.7 (08-14-2014)
    • enhanced support for different stranded library types
    • included additional filtering metric in differential expression analysis
  • v1.0.6 (04-22-2014)
    • improved memory usage/speed in splicing quantification
    • included reports of read counts of individual junctions for tandem cassette (taca)
    • fixed a rare bug fix in FDR calculation for diff splicing
    • included a glm for differential splicing test
  • v1.0.4 (11-10-2013)
    • added several scripts into the package
  • v1.0.3 (06-09-2013)
    • bug fixes
  • v1.0.2 (05-31-2013)
    • add all scripts required for polyA-seq analysis; will be documented later
  • v1.0.1 ( 5-23-2013 )
    • expanded methods to test differential splicing
    • included script to test differential expression
  • v1.0.0 ( 04-8-2013 )
    • The initial Public release

Important assumptions

  • This document is for paired-end (PE) mRNA-seq libraries that have NO strand information, and the two mates are on the opposite strands (the standard Illumina protocol).
  • It is IMPORTANT to note that read1 and read2 in the same pair must have the same and unique read name.
  • The focus of this document is to quantify splicing and gene expression change of annotated transcripts. We are in the process of extending it to analysis of novel transcripts and alternative splicing events.

Single-end (SE) mRNA-seq libraries or strand-specific SE or PE RNA-seq libraries can also be analyzed by the same packages, but the parameters are slightly different.

Software installation

OLego installation and preparation

The program we use for read mapping is OLego, a program we developed recently and made available at http://zhanglab.c2b2.columbia.edu/index.php/OLego. There you can download the source codes or precompiled binaries and access the more detailed documentation.

The program takes

  • read file(s) in either FASTA or FASTQ format
  • the indexed genome
  • (optionally) a database of known exon junctions

For the reference genome, it is recommended that one should remove random chromosomes as well as the mitochondria genome. The program required to build the index is included in the package. Assuming one has concatenated all chromosomes in a FASTA file mm10.fa (note that the name of each chromosome is chrN instead of N to be consistent with the other library files). The following command will index the genome.

olegoindex -a bwtsw mm10.fa

This will produce several files with a prefix mm10.fa. Move all the output files to a directory such as genome_index/olego/mm10

A comprehensive database of exon junctions (mm10) can be downloaded from http://zhanglab.c2b2.columbia.edu/data/OLego/mm10.intron.hmr.bed.gz. Download and decompress this file.

See the OLego website for more details.

Quantas installation and preparation

Quantas includes a package to infer transcript structure from paired-end RNA-seq data (gapless), and a package to count RNA-seq reads for each alternative splicing isoform (countit). It also depends on a perl library czplib, which contains various functions for genomic/bioinformatic analysis.

Through anaconda

Below are the installation instructions for Quantas, CZPLIB (dependency) and OLego (dependency) through Anaconda.

conda config --env --append channels conda-forge
conda config --env --append channels bioconda

conda install --yes -c chaolinzhanglab quantas


  • Install the perl library files czplib

Download from github

Decompress it and move it to a place you like

$tar zxvf czplib.v1.0.x.tgz
$mv czplib /usr/local/lib

Add the library path to the environment variable, so perl can find it.

PERL5LIB=/usr/local/lib/czplib

One might need to install Math:CDF package, if not already. It is available at http://search.cpan.org/~callahan/Math-CDF-0.1/CDF.pm

  • Download gapless & countit [Quantas]

Download from github

$tar zxvf Quantas.v1.x.x.tgz


One also needs to download annotation files used by Quantas at

(A previous version of the two packages missed some files for gene expression analysis, which is corrected now - 02/06/2014)

Mapping RNA-seq reads to the reference genome and exon junctions

Mapping

Assuming one has a paired-end sequencing lane of 100 nt x2 reads in sample.r1.fa and sample.r2.fa for the first and second read mates, respectively, derived from mouse.

olego -v -t 16 -r olego_dir/models/mm.cfg -j mm10.intron.hmr.bed -o sample.r1.sam genome_index/olego/mm10.fa sample.r1.fa
# do the mapping with 15nt seed with 1nt overlap, allowing a total of 4 mismatches or indels (default), with 16 cpu cores, output to sample.r1.sam

olego -v -t 16 -r olego_dir/models/mm.cfg -j mm10.intron.hmr.bed -o sample.r2.sam genome_index/olego/mm10.fa sample.r2.fa
# do the same thing for the other file

No matter how many threads (-t 16) are used, the memory footprint is in general < 4Gb. That said, the exact amount of memory also depends on the seed size, the repetitive nature of reads and the exon junction database.

This will map reads to the exons, known junctions as well as novel junctions. By default OLego searches with perfect matches in seeds of 15 nt with one 1-nt overlap. The maximal mismatches or indels along the whole read allowed in a match is determined by the read length(4 nt for 100 nt illumina reads). For known junctions, OLego requires 5 or more nt overlap on either side of the junction. For novel junctions, OLego requires 8 nt overlap, and considers only GT/AG splice sites, to improve the accuracy.

Convert SAM to BED file (single-end data ONLY)

Important note: this should be ONLY done for single end reads. For paired end reads, gapless will take sam files and output BED files.

After the alignment step, the SAM files are converted into BED files using the script sam2bed.pl included in the OLego package.

perl sam2bed.pl --uniq -v sample.sam sample.uniq.bed

This will keep only reads that are unambiguously mapped to the reference genome (single hits), which are identified by the tag “XT:A:U” in the SAM file.

Inferring transcript structure between paired end reads (paired-end data ONLY)

Paired-ends reads could be located in different exons and since the reads are relatively short, the transcript structure between the sequenced ends could be ambiguous when alternative splicing occurs. The missing information, leveraging on the size constraints of each cDNA fragment and prior isoform abundance estimated from directly mapped junction reads, is inferred using a simple Bayes model.

perl gapless/gapless_huge_file.pl -v -sam -uniq --split-size 10000000 -isoform mm10.exon.trio.hmr.nr.bed -E 400 -big --print-singleton -o gapless_out sample.r1.sam sample.r2.sam

This script takes the sam files of read1 and read2 as input (-sam). Here, again it is IMPORTANT to note that read1 and read2 in the same pair must have the same read name and order.

-uniq means only uniquely mapped reads (single hits) are kept. They are identified by the tag “XT:A:U” in the SAM file. -E 400 means large exons with 400 nt or more are used to estimate the distribution of insert cDNA fragment size in the library (adaptor sequences are not included in calculation here).

Important: from v1.0.7, an additional option --library-type is introduced to indicate different library types. This is to replace the original option --ss.

unstranded:  unstranded libraries assuming the two reads are on the opposite strand
stranded-as: read1 is antisense and read2 is sense
stranded-sa: read1 is sense and read2 is antisense.

This step will try to combine read1 and read2 in the same pair if possible (i.e., they are possibly sampled from the same isoform in our database), and output two files in the gapless_out dir, a bed file pair.gapless.bed with combined reads or fragments, and a text file size_dist.txt with the estimated distribution of insert cDNA fragment size in the library. When alternative splicing is observed between a pair, each possible isoform, or fragment, is assigned a probability score, so that one fragment can potentially have multiple lines. This score is recorded in the 5th column of the BED file.

Quantify alternative splicing

Run countit

First, one needs to prepare a configuration file cz_annotation.loc that specifies the location of AS database files. These files can be downloaded from http://zhanglab.c2b2.columbia.edu/data/Quantas/data/mm10.tgz, if not already.

mm10    as  annotation/Mm.seq.all.cass.chrom.can.bed    cass
mm10    as  annotation/Mm.seq.all.taca.chrom.can.bed    taca
mm10    as  annotation/Mm.seq.all.alt5.chrom.can.bed    alt5
mm10    as  annotation/Mm.seq.all.alt3.chrom.can.bed    alt3
mm10    as  annotation/Mm.seq.all.mutx.chrom.can.bed    mutx
mm10    as  annotation/Mm.seq.all.iret.chrom.can.bed    iret

A template of this file (mm10.conf) is included in the archive above. The third column of this file has to be modified to point to the location of the other AS annotation files in the archive.

perl countit/summarize_splicing_wrapper.pl -c ./cache -v -big -weight -conf mm10.conf -dbkey mm10 -cass -taca -alt5 -alt3 -mutx -iret gapless_out/pair.gapless.bed countit_out

Note that the -c option specifies temporary output dir, which should have sufficient space. The -weight option is specified because each fragment from gapless now has a probability score, and this option should NOT be used for single end RNA-Seq data.

In the output directory, there is a file for each AS event (e.g., cass.count.txt). In these files, each line is an alternative splicing event in comparison of two isoforms. The first 10 columns in output file for each type of AS are the same:

chrom chromStart chromEnd name score strand type isoformIDs isoform1Tags isoform2Tags

The score column above is reserved, and the first 6 columns provide coordinate information so that one can quickly go to the region in UCSC genome browser (see visualization below).

The last three columns are the two isoforms compared, the read count for isoform1 and the read count for isoform2, which are the most critical for quantification of splicing level.

Each output file has extra columns, which are different depending on the type of AS events, and also important for filtering in downstream analysis, as summarized below:

cass: exonTags inclusionJuncction1Tags inclusionJunction2Tags skippingJunctionTags
taca: exonTags inclusionJuncctionTags skippingJunctionTags
alt5: altSSDistance exonTags proximalJunctionTags distalJunctionTags
alt3: altSSDistance exonTags proximalJunctionTags distalJunctionTags
mutx: 5'ExonTags 5'ExonJunction1Tags 5'ExonJunction2Tags 3'ExonTags 3'ExonJunction1Tags 3'ExonJunction2Tags
iret: retainedIntronTags junctionTags

In the command line above, you can also choose to run only certain types of AS events (e.g. -cass for cassette exons).

Summarize alternative splicing results and perform statistical tests

After one runs OLego, gapless and countit for each sample, the final step of AS analysis is to test differential splicing between two groups of samples. For this purpose one needs to prepare a configuration file to specify which sample belongs to each group.

#dataset.group.conf
sample1_countit_out<tab>group1
sample2_countit_out<tab>group1
sample3_countit_out<tab>group1
sample4_countit_out<tab>group2
sample5_countit_out<tab>group2
sample6_countit_out<tab>group2

The first column of the file above is the path to the output directory of countit. One can then run the following line for cassette exons (and similarly for other types of AS events):

perl countit/test_splicing_diff.pl -type cass -v --min-cov 20 --id2gene2symbol annotation/Mm.seq.all.AS.chrom.can.id2gene2symbol dataset.group.conf dataset.diff.txt

This script will pool the number of reads for each isoform from each of the individual samples in the same group. It will then perform Fisher's exact test using the total number of reads supporting each isoform, calculate exon inclusion level and the change between the two groups, etc.

Finally, it will filter AS events and consider those events with sufficient read coverage for multiple test correction (using the Benjamini method).

The option (--min-cov 20) means the following requirement:

junction_in + junction_skip≥20 AND junction_group1+junction_group2≥20

The output of the file has the following columns:

column 1: gene, Entrez gene id and gene symbol

columns 2-9: chrom   chromStart  chromEnd    name    score   strand  type    isoformIDs: the same as described in the output of countit  
column 10: coverage    I_g1(group1) I_g2(group2) dI_g1_vs_g2 pvalue  FDR

In general, we require FDR<0.05 (or 0.01), |dI|>0.1 (or 0.2) to call significant splicing changes.


A caveat of the Fisher exact test described above is that it does not take into account the within-group variation. Particular caution is warranted for samples with a substantial amount of heterogeneity (e.g. see visualization below). We are working on this to address the issue.

Quantify gene expression

This is also done with the countit package.

perl countit/summarize_expression_wrapper.pl -big --cache ../cache -exon mm10.exon.uniq.core.bed -e2g mm10.exon.uniq.core.id2gene2symbol -weight -v sample_out/pair.gapless.bed sample.expr.txt

Again, the -weight option is specified because each fragment from gapless now has a probability score. This option should not be used for single end RNA-seq data.

The output is like this

#gene_id    gene_symbol tag_num exon_len    RPKM

Here exon_len is the total exonic length of the gene, and tag_num is the total number of exonic reads (or more precisely fragments) in the gene. RPKM is always calculated using total number of exonic tags summed over all genes in the output file as a denominator, to avoid the complication of rRNA reads, etc.

This should be done for all samples.

Differential gene expression analysis can be performed with different tools, such as edgeR (http://www.bioconductor.org/packages/2.12/bioc/html/edgeR.html). The package includes a script for this purpose. One needs to prepare a configuration file to specify which sample belongs to each group.

#expr.diff.conf
sample1.expr.txt<tab>group1
sample2.expr.txt<tab>group1
sample3.expr.txt<tab>group1
sample4.expr.txt<tab>group2
sample5.expr.txt<tab>group2
sample6.expr.txt<tab>group2
#this script will invoke R and requires edgeR to be installed
perl countit/test_expr_diff.pl -MAplot -v expr.diff.conf group1_vs_group2.expr.diff.txt

We typically call significant differential expression by requiring FDR<0.01 (or 0.001), |log2FC|>1 (or 0.6 or 2), and max(group1RPKM, group2RPKM) in the top half.

Generate bedGraph files for visualization in UCSC genome browser

In most cases, bedGraph format provides excellent visualization, especially for alternative splicing events. The BED file after gapless analysis can be converted into a bedGraph file with a script in the countit package.

perl countit/tag2profile.pl -v -big -weight -c ./cache -exact -of bedgraph -n "Sample_bed_graph" sample_out/pair.gapless.bed sample.bedGraph

In some cases, one might also want to show the summarized number of junction reads for each exon junction. This can be done by

perl countit/tag2junction.pl -v -big -weight -c ./cache sample.combined.junction.bed sample.junction.count.bed

The 5th column now contain the read count for each junction, which is also shown as part of the name column. This file can also be loaded to the UCSC genome browser for visualization.

For single end RNA-Seq, the -weight option should not be used.