Quantas Documentation

From Zhang Laboratory

Revision as of 11:48, 19 September 2012 by Czhang (Talk | contribs) (Created page with "==Important assumptions== # This document is for paired-end (PE) mRNA-seq libraries that has NO strand information, and the two mates are on the opposite strand (the standard ...")

(diff) ← Older revision | Latest revision (diff) | Newer revision → (diff)
Jump to: navigation, search

Important assumptions

  1. This document is for paired-end (PE) mRNA-seq libraries that has NO strand information, and the two mates are on the opposite strand (the standard Illumina protocol).
  2. It is IMPORTANT to note that read1 and read2 in the same pair must have the same read name.
  3. The focus of this document is to quantify splicing and gene expression change of annotated transcripts, although the pipeline can be expanded to examine novel transcripts as well.

Single-end (SE) mRNA-seq libraries or strand-specific SE or PE RNA-seq libraries can also be analyzed by the same packages, but the parameters are slightly different.

Mapping RNA-seq reads to the reference genome and exon junctions

OLego installation and preparation

The program we use for read mapping is OLego, a program we developed recently and available at http://ngs-olego.sourceforge.net/. You can download the source codes or precompiled binaries and the documentation. The core algorithm is relatively stable now, but minor revisions are expected in the near future. If you prefer to download the source code, it should be fairly easy to compile.

The program takes i) read file(s) in either FASTA or FASTQ format, together with ii) the indexed genome and iii) (optionally) a database of known exon junctions.

For the reference genome, I would recommend that one should remove random chromosomes as well as the mitochondria genome. The program required to build the index is included in the package. Assuming one has concatenated all chromosomes in a FASTA file mm9.fa. The following command will index the genome.

olegoindex -a bwtsw mm9.fa

This will produce several files with a prefix mm9.fa. Move all the output files to a directory such as genome_index/olego/mm9

A comprehensive database of exon junctions (mm9, I assume this is what you need) can be downloaded from http://geller.rockefeller.edu/Quantas/data/mm9.intron.hmr.bed.gz (5.1 Mb). Download and decompress this file.

Mapping

Now the alignment. Assuming one has a paired end sequencing lane of 100 nt x2 reads in sample.r1.fa and sample.r2.fa for the first and second read mates, respectively.

olego -v -t 16 -r olego_dir/models/mm9.cfg -j mm9.intron.hmr.bed -o sample.r1.sam genome_index/olego/mm9.fa sample.r1.fa
# do the mapping with 14nt seed, allowing a total of 4 mismatches or indels (default), with 16 cpu cores, output to sample.r1.sam

olego -v -t 16 -r olego_dir/models/mm9.cfg -j mm9.intron.hmr.bed -o sample.r2.sam genome_index/olego/mm9.fa sample.r2.fa
# do the same thing for the other file

No matter how many threads are used, the memory footprint is in general < 4Gb. In terms of speed, our estimate is that it can align 200 M 100x2 PE reads in ~29 hrs with 8 cores (2.0G Hz).

This will map reads to the exons, known junctions as well as novel junctions. By default OLego searches with perfect matches in seeds of 14 nt, allowing for a maximum of 4 substations or indels along the whole read (which should be good for 100 nt illumina reads). For known junctions, OLego requires 5 or more nt overlap on either side of the junction. For novel junctions, OLego requires 8 nt overlap, and considers only GT/AG splice sites, to improve the accuracy.

Convert SAM to BED file

After the alignment step, the SAM files are converted into BED files using the script sam2bed.pl included in the package.

perl sam2bed.pl --uniq -v sample.r1.sam sample.r1.uniq.bed
perl sam2bed.pl --uniq -v sample.r2.sam sample.r2.uniq.bed

This will keep only reads that are unambiguously mapped to the reference genome (single hits), which are identified by the tag “XT:A:U” in the SAM file.


Inferring transcript structure between paired end reads

Paired-ends reads could be located in different exons and since the reads are relatively short, the transcript structure between the sequenced ends could be ambiguous when alternative splicing occurs. The missing information, leveraging on the size constraints of each cDNA fragment and prior isoform abundance estimated from directly mapped junction reads, is inferred using a simple Bayes model.

Install gapless

This is implemented in the gapless package, available at http://geller.rockefeller.edu/Quantas/gapless_20120612.tgz. To setup this package properly, one needs to add the library files to the perl library search path, such as adding the following line in their login file .bash_profile.

PERL5LIB=/home/zhangc/gapless/lib


One might need to install Math:CDF package, if not already. It is available at http://search.cpan.org/~callahan/Math-CDF-0.1/CDF.pm

One also needs to download a database of splicing isoforms (more accurately, all exon trios, since we care about local AS events here) from http://geller.rockefeller.edu/Quantas/data/mm9.exon.trio.nr.bed.gz (16Mb).

Run gapless

Then run

perl gapless_huge_file.pl --split-size 20000000 -isoform mm9.exon.trio.nr.bed -E 400 -big --print-singleton -o sample_out -v sample.r1.uniq.bed sample.r2.uniq.bed

Here, again it is IMPORTANT to note that read1 and read2 in the same pair must have the same read name.

-E 400 means large exons with 400 nt or more are used to estimate the distribution of insert cDNA fragment size in the library (adaptor sequences are not included in calculation here).

This step will try to combine read1 and read2 in the same pair if possible (i.e., they are possibly sampled from the same isoform in our database), and output two files in the sample_out dir, a bed file pair.gapless.bed with combined reads or fragments, and a text file size_dist.txt with the estimated distribution of insert cDNA fragment size in the library. When alternative splicing is observed between a pair, each possible isoform, or fragment, is assigned a probability score, so that one fragment can potentially have multiple lines. This score was recorded in the 5th column of the BED file.

Quantify alternative splicing

Install countit

Quantification of alternative splicing and gene expression is done using the countit package, available at http://geller.rockefeller.edu/Quantas/counit_20120613.tgz. To setup this package properly, one needs to add the library files to the perl library search path, such as the following in their login file .bash_profile, if one has not done this as described above. These library files are the same as those described above in gapless, so there is no need to duplicate.

PERL5LIB=/home/zhangc/countit/lib


One also needs to download a database of different types of alternative splicing events from http://geller.rockefeller.edu/Quantas/data/mm9.ase.tar.gz (1.6Mb).

There are five files inside the tarball:

  1. Mm.seq.all.cass.chrom.can.bed: cass or cassette exons
  2. Mm.seq.all.taca.chrom.can.bed: taca or tandem cassette exons
  3. Mm.seq.all.alt5.chrom.can.bed: alt5 or alternative 5' ss
  4. Mm.seq.all.alt3.chrom.can.bed: alt3 or alternative 3' ss
  5. Mm.seq.all.mutx.chrom.can.bed: mutx or mutually exclusive exons
  6. Mm.seq.all.iret.chrom.can.bed: iret or retained introns

Run countit

For each sample, one now needs to separate genomic reads and junction reads:

awk '{if(NF<=9 || $10==1) {print $0}}' sample_out/pair.gapless.bed > sample.combined.genome.bed
awk '{if(NF==12 && $10>1) {print $0}}' sample_out/pair.gapless.bed > sample.combined.junction.bed

For each type of AS event, run summarize_splicing.pl as follows:

perl summarize_splicing.pl -type cass -big -v -weight -c ./cache Mm.seq.all.cass.chrom.can.bed sample.combined.genome.bed sample.combined.junction.bed sample.cass.count.txt
perl summarize_splicing.pl -type taca -big -v -weight -c ./cache Mm.seq.all.taca.chrom.can.bed sample.combined.genome.bed sample.combined.junction.bed sample.taca.count.txt
perl summarize_splicing.pl -type alt5 -big -v -weight -c ./cache Mm.seq.all.alt5.chrom.can.bed sample.combined.genome.bed sample.combined.junction.bed sample.alt5.count.txt
perl summarize_splicing.pl -type alt3 -big -v -weight -c ./cache Mm.seq.all.alt3.chrom.can.bed sample.combined.genome.bed sample.combined.junction.bed sample.alt3.count.txt
perl summarize_splicing.pl -type mutx -big -v -weight -c ./cache Mm.seq.all.mutx.chrom.can.bed sample.combined.genome.bed sample.combined.junction.bed sample.mutx.count.txt
perl summarize_splicing.pl -type iret -big -v -weight -c ./cache Mm.seq.all.iret.chrom.can.bed sample.combined.genome.bed sample.combined.junction.bed sample.iret.count.txt

Note that the -cache option specifies temporary output dir, and the -weight option is specified because each fragment from gapless now has a probability score. In the output file, each line is an alternative splicing event in comparison of two isoforms. The first 10 columns in output file for each type of AS are the same:

chrom chromStart chromEnd name score strand type isoformIDs isoform1Tags isoform2Tags

The score column above is meaningless, and the first 6 columns provide coordinate information so that one can quickly go to the region in UCSC genome browser (see visualization below).

The last three columns are the two isoforms compared, the read count for isoform1 and the read count for isoform2, which are the most critical for quantification of splicing level.

Each output file has extra columns, which are different depending on the type of AS events, and also important for filtering in downstream analysis, as summarized below:

cass: exonTags inclusionJuncction1Tags inclusionJunction2Tags skippingJunctionTags
taca: exonTags inclusionJuncctionTags skippingJunctionTags
alt5: altSSDistance exonTags proximalJunctionTags distalJunctionTags
alt3: altSSDistance exonTags proximalJunctionTags distalJunctionTags
mutx: 5'ExonTags 5'ExonJunction1Tags 5'ExonJunction2Tags 3'ExonTags 3'ExonJunction1Tags 3'ExonJunction2Tags
iret: retainedIntronTags junctionTags

In case one would like to run all types of AS events in one command line, another wrapper is provided.

First, one needs to prepare a configuration file cz_annotation.loc that specifies the location of AS database files

#cz_annotation.loc, modify the third column accordingly
mm9 as annotation/Mm.seq.all.cass.chrom.can.bed cass
mm9 as annotation/Mm.seq.all.taca.chrom.can.bed taca
mm9 as annotation/Mm.seq.all.alt5.chrom.can.bed alt5
mm9 as annotation/Mm.seq.all.alt3.chrom.can.bed alt3
mm9 as annotation/Mm.seq.all.mutx.chrom.can.bed mutx
mm9 as annotation/Mm.seq.all.iret.chrom.can.bed iret

Then run

perl summarize_splicing_wrapper.pl -big -weight -conf cz_annotation.loc -dbkey mm9 -c ./cache -cass -taca -alt5 -alt3 -mutx -iret -v sample_out/pair.gapless.bed sample.AS.count.txt

The output file is a concatenation of multiple output files described above, except the header lines are removed.

In the command line above, you can also choose not to run all types of AS events.

Summarize alternative splicing results and perform statistical tests

After one runs OLego, gapless and countit for each sample, the output file sample.AS.count.txt is combined together, right now in excel, with statistical analysis performed in R. For statistical analysis, the read count in each sample of the same group is pooled together.

Eventually, one get an Excel spreadsheet with the following columns (using cassette exons as an example, a template can be downloaded from http://geller.rockefeller.edu/Quantas/data/example.cass.summary.xlsx):

columns A-G: LineNo	chrom	chromStart	chromEnd	name	score	strand	
columns H-I: gene_id	gene_symbol	type	isoformIDs #Gene ID can be obtained from the first number after "-" in the name column	
columns L-O: group1Isoform1Tags	group1Isoform2Tags	group2Isoform1Tags	group2soform2Tags	
columns P-U: group1InclusionJuncction1Tags	group1InclusionJunction2Tags	group1SkippingJunctionTags	group2InclusionJuncction1Tags	group2InclusionJunction2Tags	group2SkippingJunctionTags	
columns V-AC: coverage	odd_ratio	P_fisher	Rank	Benjamini	group1Inc	group2Inc	dI

Here we assume that we are comparing two groups of samples. Column L-U are sums of all samples in the same group.

gene_id can be extracted from the name column, the first number after "-". gene_symbol can gene be looked up using the database available at http://geller.rockefeller.edu/Quantas/data/gene2symbol.mm

coverage in column V is 1 if junction_in + junction_skip≥20 AND junction_group1+junction_group2≥20, or 0 otherwise, as specified in the formula in the template. Only lines with coverage of 1 are used for statistical test, to minimize multiple testing.

columns L-O are now pasted to a text file named data.txt to be loaded for Fisher's exact test in R, a script available from http://geller.rockefeller.edu/Quantas/fisher.test.batch.R. This will generate odds_ratio and P_fisher to be pasted to columns W and X.

Then rows are sorted first by coverage (descending), and then by P_fisher (ascending), and Benjamini FDR (column Z) is calculated by (P_fisher) *(#Lines with coverage=1) / (Rank of the line in column Y).

Change of exon inclusion or dI in column AC is calculated using junction reads, as specified in the formula in the template.

In general, we require FDR<0.05 (or 0.01), coverage = 1, |dI|>0.1 (or 0.2) to call significant splicing change.

For other types of AS events, some extra filtering criteria might be needed, and we can discuss when we get to the point.


Two notes:

  1. A caveat of the Fisher exact test described above is that it does not take into account the within-group variation. This is in general a not big issue for lab mice if RNA-seq libraries are prepared properly because the heterogeneity between lab mice is minimal. So far I have not seen tools that deal with this issue effectively, but this could be due to the fact that I did not track RNA-seq literature very closely.
  2. Also, a recent paper described a method MATS based on MCMC to estimate p-values and claims that the performance is somewhat better than Fisher's exact test in simulation. I have not tested this, but if this is true, the test can be swathed very easily.

Quantify gene expression

This is also done with the countit package, but first one needs to download gene definition files from http://geller.rockefeller.edu/Quantas/data/mm9.gene.tar.gz. There are two files in the tarball that will be used in the next step.

perl summarize_expression_wrapper.pl -big --cache ../cache -exon mm9.exon.uniq.core.bed -e2g mm9.exon.uniq.core.id2gene2symbol -weight -v sample_out/pair.gapless.bed sample.expr.txt

Again, the -weight option is specified because each fragment from gapless now has a probability score.

The output is like this

#gene_id    gene_symbol tag_num exon_len    RPKM

Here exon_len is the total exonic length of the gene, and tag_num is the total number of exonic reads (or more precisely fragments) in the gene. RPKM is always calculated using total number of exonic tags summed over all genes in the output file as a denominator, to avoid the complication of rRNA reads, etc.


This was done for all samples. Then an R package edgeR was used to do statistical test of differential gene expression. An example script can be downloaded from: http://geller.rockefeller.edu/Quantas/expr.edgeR.tagwise.R, which will report log2FC, p-value, and benjamini FDR

RPKM for each group needs to be recalculated after combining all samples in the same group.

We typically call significant differential expression by requiring FDR<0.01 (or 0.001), |log2FC|>1 (or 0.6 or 2), and max(group1RPKM, group2RPKM) in the top half.

Generate bedGraph files for visualization in UCSC genome browser

In most cases, bedGraph format provides excellent visualization, especially for alternative splicing events. The BED file after gapless analysis can be converted into a bedGraph file with a script in the countit package.

perl tag2profile.pl -big -weight -exact -of bedgraph -n "Sample_bed_graph" -v sample_out/pair.gapless.bed sample.bedGraph

In some cases, one might also want to show the summarized number of junction reads for each exon junction. This can be done by

perl tag2junction.pl -v -big -weight -c ./cache sample.combined.junction.bed sample.junction.count.bed

The 5th column now contain the read count for each junction, which is also shown as part of the name column. This file can also be loaded to the UCSC genome browser for visualization.