OLego Documentation
From Zhang Laboratory
Contents
What is OLego?
OLego is a program specifically designed for de novo spliced mapping of mRNA-seq reads. OLego adopts a seed-and-extend scheme, and does not rely on a separate external mapper. It achieves high sensitivity of junction detection by strategic searches with very small seeds (12-14 nt), efficiently mapped using Burrows-Wheeler transform (BWT) and FM-index. This also makes it particularly sensitive for discovering small exons. OLego is implemented in C++ with full support of multiple threading, to allow for fast processing of large-scale data.
OLego is an open source code project and released under GPLv3. The implementation of OLego relies heavily on BWA (version 0.5.9rc1, http://bio-bwa.sourceforge.net/).
Citation:
Wu,J., Anczukow,O., Krainer,A.R., Zhang,M.Q. †, Zhang,C. †, 2013. OLego: Fast and sensitive mapping of spliced mRNA-Seq reads using small seeds. Nucleic Acids Res. , Published online. ( PubMed Link )
Contact: Jie Wu (wuj (at) cshl dot e d u) and Chaolin Zhang (cz2294 (at) columbia dot e d u)
Versions
- v1.1.7 ( 3-21-2017 ), current
- accept gz input in mergePEsam.pl
- v1.1.6 ( 9-14-2015 )
- Add zebrafish model and exon junction files
- v1.1.5 ( 7-14-2014 )
- improvement in speed
- minor bug fixes
- v1.1.2 ( 7-1-2013 )
- Sensitivity improved in small exons and single anchor search.by allowing mismatch.
- Allows overlapping seeds to improve speed and seeding flexibility.
- Increase default seed size to 15 (max 1 nt overlapping ) to have significantly increased speed without noticeable sacrifice in sensitivity.
- A bug fixed (crashes when using -M 0 -w12)
- v1.1.1 ( 4-14-2013 )
- Improved speed by filtering simple repetitive anchors.
- Default options optimized.
- v1.1.0 ( 3-31-2013 )
- Bug fixed for duplicate entries for some reads, sensitivity improved.
- Optimized on option -W.
- Bug fixed in sam2bed.pl.
- Bug fixed for option -e.
- Bug fixed for regression_model_gen.
- Add support for gzip input file for sam2bed.pl.
- v1.0.8 ( 11-20-2012 )
- Improvement in hit clustering.
- Fixed an overcounting problem in mismatch counting.
- Fixed bug in merging step.
- Fixed bug in XS tag for extra exon body reads.
- Allows pipe input/output with "-" for some of the scripts.
- v1.0.6 ( 08-09-2012 )
- Added option –max-multi (default:1000) to avoid huge data in a single line.
- Added option –num-reads-batch.
- Fixed a bug in the junction connecting step.
- v1.0.5 ( 07-16-2012 )
- Minor bug fixed (the old code crashes in a very rare case).
- v1.0.4 ( 06-12-2012 )
- Option changes ( do single-anchor search by default now ).
- v1.0.3 ( 06-10-2012 )
- Now supports strand specific library
- Fixed bugs about XS
- v1.0.0 ( 05-15-2012 )
- The initial Public release
Prerequisites
The major programs of OLego ( olego and olegoindex ) can be installed and run on Unix-based systems (Linux or Mac OS X) with GCC compiler installed. We provided scripts for post analysis and regression model construction, and these codes may require Perl and R.
Download
Codes and binaries
The source codes and binaries are available at https://github.com/chaolinzhanglab/olego. This program is still under active development, so please check the site periodically for updates. The most update to date version can also be retrieved via git:
git clone https://github.com/chaolinzhanglab/olego.git
The main programs of OLego (olego and olegoindex ) can be installed and run on Unix-based system with GCC compiler installed. We also provide scripts for post analysis and regression model construction. These codes may require Perl and R installed.
The exon junction database
We have built a non-redundant, comprehensive exon junction database (only GT/AG splice sites for now) from RefSeq, mRNAs, and ESTs. We recommend one to provide the junction database for alignment (with -j) to obtain improved sensitivity.
Download exon junction database:
- human (hg19) (356,777 unique junctions)
- mouse (mm10) (309,670 unique junctions)
- rat (rn5) (285,745 unique junctions)
- zebfrafish (danRer10) (133,630 unique junctions).
In each of the files above, exon junctions from human, mouse and rat were lifted over to the other species and consolidated to remove redundancy.
Installation
To compile OLego on your computer, please go to the OLego directory and type:
make
If everything goes right, you will find two executable files olegoindex and olego in the folder.
We also provide binary executable files at http://sourceforge.net/projects/ngs-olego/files/ for x86_64 and i686 Linux systems.
Please feel free to report any problems you come up with.
Usage
Build the index for the genome sequence
To run OLego, you need a BWT index for the reference sequences. For the current version, the genome index used by OLego is in exactly the same format as the one used by BWA. However, this will likely change in the future. For your convience, you can build the index with olegoindex that comes with this package:
olegoindex [-a bwtsw|div|is] [-p STR] <in.fasta>
Arguments:
Argument | Description |
---|---|
<in.fasta> | This is the fasta format file with the reference sequence. Please put all the sequences (from different chromosomes ) in a single file. |
Options:
Option | Description |
---|---|
-a | BWT construction algorithm: bwtsw or is [default: bwtsw] |
-p | prefix of the index [default: the same as the fasta file name] |
Caution: please use “-a bwtsw” for long genome (like human or mouse genome).
There will be 8 files (prefix.pac, prefix.ann, prefix.amb, prefix.rpac, prefix.bwt, prefix.rbwt, prefix.sa, prefix.rsa) generated after olegoindex finishes.
Running OLego
Now you can align your mRNA-seq reads to the genome with olego:
olego [options] <prefix> <in.fastx>
Whenever possible, one should provide a database of annotated junction database for alignment (-j), because this will give higher sensitivity of mapping for junction reads. Other important parameters include the word size of seeds (-w) and the max number of mismatches (-M including both substitutions and indels) allowed. The defaults of these parameters were optimized for mammalian genomes to balance accuracy, sensitivity and speed.
For mammalian genomes, a seed size in the range of 12-15 nt is recommended (default =15 nt with 1-nt overlap). For small read length (e.g. < 50), the seed size should be picked to allow 3 or more seeds whenever possible (e.g., -w 12 for 36 nt reads). The number of seeds in each read has a significant impact on sensitivity of alignment.
The default of mismatches allowed varies dependent on the length of sequences:
17nt reads: max_diff = 1 20nt reads: max_diff = 2 45nt reads: max_diff = 3 73nt reads: max_diff = 4 104nt reads: max_diff = 5 137nt reads: max_diff = 6 172nt reads: max_diff = 7 208nt reads: max_diff = 8 244nt reads: max_diff = 9
The speed of alignment increases linearly as one use more threads (-t). Slight increase of word size (e.g., -w 15 vs -w 14) can dramatically increase the speed, without much loss in sensitivity.
The arguments and options are described in more detail as below:
Arguments:
Argument | Description |
---|---|
<prefix> | The prefix of the genome sequence index, including the path and the base name. |
<in.fastx> | Either fasta or fastq file would work as input. Note that gzipped file is also accepted. Addtionally, using "-" will make the program to read input from STDIN. |
Basic options:
Option | Description |
---|---|
-o,–-output-file | Name of the output file [ default: stdout ]. This file will be in SAM format, with some customized tags. Please see the details of the file format below. |
-j,–-junction-file | Annotation file for known exon junctions. It is in BED format and please see the junc format description below. |
-n,–-non-denovo | No de novo junction search. Note that if junction annotation file is provided by -j, these “known” junctions will still be searched. |
-t,–-num-threads | Number of threads (INT) [ default: 1 ]. OLego fully supports multiple threading, if you have multiple CPU cores on your computer, please specify the number of cores you want to use with this option. |
-r,–-regression-model | The file with the parameters for the logistic regression model. The mouse model will be used if no file is selected. The model file contains the parameters for the regression model (the coefficients, the PWM and the background ). We have provided model files for mouse and human (in the folder models). User-defined model can also be generated with the regression_model_gen scripts for any species. Please see its usage below. |
-M,–-max-total-diff | Maximum total difference between query read and reference sequence. Either INT or FLOAT number can be used for this option. An INT number will specify the maximum total edit distance allowed for each alignment. A FLOAT number will specify the fraction of missing alignments given 2% uniform base error rate. This parameter is the same as -n in BWA. [default: a FLOAT number 0.06 ] |
-w,–-word-size | The size of the seed used in junction search (INT) [ default: 15 nt with --word-max-overlap 1 nt ]. The default seed size is recommended for reads >100 nt. For shorter reads, a smaller number can be used. e.g., 12 nt with 0 nt seed overlap or 13 nt with 1 nt seed overlap for 36 nt reads. The seeds will be evenly distributed on the read from the start to the end with a maximum overlap defined by --word-max-overlap, so please try to cover the read as much as possible with a reasonable seed size and maximum seed overlap size. (14 nt with 1 nt seed overlap for 36 nt reads is a BAD example. ) |
-W,–-max-word-occ | Maximum number of matches of a seed (INT) [ default: max (1000*3^(14-word_size), 300)]. If a seed has more than this number of hits on the genome, then it will be considerred repeptive and all of its hits will be discarded. |
-m,–-max-word-diff | Maximum edit distance allowed for each seed (INT) [ default: 0 ]. Since our seed size is smaller than other programs, we recommend that the user use a small number for this option. |
-I,–-max-intron | Maximum intron size for de novo junction search (INT) [ default: 500000 ]. |
-i,–-min-intron | Minimum intron size for de novo junction search (INT) [ default: 20 ]. |
-e,–-min-exon | Minimum micro-exon size to be searched (INT) [ default: 9 ]. |
-a,–-min-anchor | Minimum anchor size in de novo single-anchor junction searches (INT) [ default: 8 ]. We define “anchor size” as the smaller number of matched nucleotides on the read at the end of the junction. |
-k,–-known-min-anchor | Minimum anchor size in single-anchor junction searches when the junction is in the annotation file specified by -j (INT) [ default: 5 ]. |
-v,–-verbose | Verbose mode [ default: false ]. |
Advanced options:
Option | Description |
---|---|
--word-max-overlap | Max number of overlaps between seeds in the seeding step.[ default: 1 ] |
–-non-single-anchor | Disable single-anchor de-novo junction search. [ default: enabled ]. |
--allow-rep-anchor | Allow anchors with simple repetitive sequences. [ default: not allow ] |
–-strand-mode | Strand mode (INT). This value should be selected from 1, 2 or 3. For strand specific RNA-seq data, please use 1 if the reads should be mapped to the FORWARD strand of the RNA, use 2 if the reads should be mapped to the REVERSE strand. If the library is not strand specific, please use 3 to allow mapping onto both strands. [ default: 3 ] |
–-max-multi | Maximum number of alignments reported for multiple mappers. [ default: 20 ] |
–-min-logistic-prob | Minimum logistic probablity for an alignment, calculated with the splice sites motif and intron size, in the range of [0,1) [ default: 0.50]. A higher number means more stringent filter, we don’t recommend using high value since more true de novo junctions will be filtered out. |
–-max-overhang | Maximum number of overhanging nucleotides allowed near the candidate exon boundary in junction searches (INT) [ default: 6 ]. After we extend the candidate exons, we search for splice sites in the overhanging regions around the candidate exon boundary. |
–-max-gapo | Maximum number of gap opens (INT) [ default: 1 ]. |
–-max-gape | Maximum number of gap extensions, -1 for disabling long gaps (INT)[ default: -1 ]. |
–-indel-end-skip | In BWT querying, do not put an indel within this number towards the ends [ default: 5 ]. |
–-gape-max-occ | Maximum occurrences for extending a long deletion in BWT querying [ default: 10 ]. |
–-penalty-mismatch | Mismatch penalty for querying involving BWT [ default: 3 ]. |
–-penalty-gapo | Gap open penalty for querying involving BWT [ default: 11 ] |
–-penalty-gape | Gap extension penalty for querying involving BWT [ default: 4 ] |
–-log-gap | log-scaled gap penalty for long deletions for querying involving BWT. |
–-num-reads-batch | This number of reads will be loaded into the memory for processing in each batch. [4*16**4 = 262144] |
–-none-stop | non-iterative mode: search for all n-difference hits in the BWT query (slooow). |
Other useful scripts
mergePEsam.pl
This script can be used to merge SAM format mapping results from paired-end reads. The two ends will be merged according to their distances and orientation. The script requires the two ends come from the same chromosome with proper orientation and the distance between them smaller than the threshold specified by option -d.
Usage:
perl mergePEsam.pl [options] <end1.sam> <end2.sam> <out.sam>
Arguments:
Argument | Description |
---|---|
<end1.sam> | The SAM format output from one end of the reads. |
<end2.sam> | The SAM format output from the other end of the reads. Please make sure the same lines in end1.sam and end2.sam are corresponded (i.e. from the same read pair ). |
<out.sam> | The output file. In SAM format. |
Options:
Option | Description |
---|---|
-d | Maximum distance between the two ends on the reference [ default:5000000 ]. |
–ss, –-same-strand | Require the read-pair mapped to the same strand. By default, we require the two ends mapped to different strands, which is the case in strandard Illumina RNA-seq data. |
–ns, –-no-strand | Do not use strand information as a filter. |
–nci, –-no-check-input | Do not check if the read names in the input files are matched. By default, the script will check if the read names from the two ends are similar to make sure the lines are correctly matched. Please use this option if your read names are in a uncomparable format. |
-v | Verbose mode [ default: false ]. |
xa2multi.pl
This script can be used to extract all the alignments after the tag “XA” in each line. The current version is from BWA package, with minor modification.
Usage:
perl xa2multi.pl in.sam >out.sam
sam2bed.pl
This script converts SAM format output from OLego into BED format file. Only the best alignment (major alignment) of each read will be used.
Usage:
perl sam2bed.pl [options] <in.sam> <out1.bed> [out2.bed]
Arguments:
Argument | Description |
---|---|
<in.sam> | The SAM input file from OLego. "-" can be used to input from STDIN. |
<out1.bed> [out2.bed] | Please specify two BED files if you want Paired-end reads output into separate BED files, otherwise, all the reads will be output into out1.bed. "-" can be used to output into STDOUT. |
Options:
Option | Description |
---|---|
-u,–-uniq | Only output uniquely mapped reads. The script identifies unique reads by the tag “XT:A:U”. |
-r,–-use-RNA-strand | Use the strand of the RNA based on the XS tag. By default, this script uses the strand of the read. |
-v | Verbose mode [ default: false ]. |
Using this script to convert SAM outputs from other programs might cause problems!
bed2junc.pl
This script can be used to retrieve unique junctions from BED format file. The number of supporting reads of each junction will be in the score (5th) column. The output file can be used as junction annotation file for OLego (option -j).
Usage:
perl bed2junc.pl <in.bed> <out.bed>
Arguments:
Argument | Description |
---|---|
<in.bed> | The input BED file with the mapping results. "-" can be used to input from STDIN. |
<out> | The output BED format file with the junctions. This file can be directly used as junction annotation file for olego. "-" can be used to output into STDOUT. |
regression_model_gen
This set of scripts can be used to generate the user-defined logistic regression model.
Usage:
perl regression_model_gen/OLego_regression.pl [options]
Options:
Option | Description |
---|---|
-g | The location of the Fasta files downloaded from UCSC genome browser, the names of the Fasta files should be something like chr1.fa etc. |
-a | BED format annotation files for the true transcripts. True junctions will be extracted from this file. |
-o | Output prefix [default: userdefined]. |
The model file will be generated in output_prefix.cache, the file name would be output_prefix.cfg.
Additional notes
File formats
SAM format
OLego outputs the alignments in SAM format (http://samtools.sourceforge.net). Its specification can be found on samtools’ website.
The following tags are used in OLego. Please pay attention to the X? tags, most of them were adopted from BWA:
Tag | Meaning |
---|---|
NM | Edit distance |
MD | Mismatching positions/bases |
X0 | Number of best hits |
X1 | Number of suboptimal hits |
XM | Number of mismatches in the alignment |
XN | Number of ‘N’s in the reference |
XO | Number of best hits |
XG | Number of gap extentions |
XT | Type: Unique/Repeat/N * |
XS | Strand of the RNA ** |
XA | Alternative hits; format: (chr,pos,CIGAR,NM,XS;) |
*“Unique” or “Repeat” is determined by the number of best hits ( top hits with the same edit distance, X0 ), NOT the total number of hits. “N” means there are more than 10 ‘N’s in the reference ( XN>10 ). ** For a junction read, the strand of the RNA is determined by the annotation if the junction is annotated, or by the splice signal if it’s a novel junction. For exonic read, the strand can not be determined (a ”.” is assigned ).
Addtional scripts have been provided in the package for processing OLego output: sam2bed.pl can be used for conversion from SAM to BED format; xa2multi.pl can extract alignments after XA tags; mergePEsam.pl can merge the two outputs from paired-end RNA-seq data.
For general processing of SAM files, please check SAMTools.
junc format
OLego takes junction annotations in junc (BED) format.
Column | Name | Description |
---|---|---|
1 | chrom | The name of the chromosome |
2 | chromStart | The starting position of the junction (intron) |
3 | chromEnd | The ending position of the junction (intron) |
4 | name | Name of the junction |
5 | score | This column is reserved as the score of the junction, the bed2junc.pl provided in the package will output evidence number in this column. |
6 | strand | Strand of the junction |
The score column is not essential.
Examples
Example 1: Standard Illumina RNA-seq data (without strand information)
Assume we have paired-end data of read length 100nt:
olegoindex -a bwtsw mm10.fa # build your BWT index, mm10.fa has all chromosomes combined olego -v -t 16 -r mm.cfg -j mm10.intron.hmr.bed -M 4 -o f.sam ~/mz-local/database/mm10/genome/olego/mm10.fa f.fa # do the mapping allowing 4 mismatches (or indels) with 16 CPU cores, output to f.sam olego -v -t 16 -r mm.cfg -j mm10.intron.hmr.bed -M 4 -o r.sam ~/mz-local/database/mm10/genome/olego/mm10.fa r.fa # do the same thing for the other file mergePEsam.pl f.sam r.sam merge.sam # merge both ends into merge.sam sam2bed.pl --use-RNA-strand merge.sam merge.bed # convert the SAM file to BED file bed2junc.pl merge.bed merge.junc # find the junctions in the BED file olego -v -t 16 -r mm.cfg -j merge.junc -M 4 --non-denovo -o f.remap.sam ~/mz-local/database/mm10/genome/olego/mm10.fa f.fa olego -v -t 16 -r mm.cfg -j merge.junc -M 4 --non-denovo -o r.remap.sam ~/mz-local/database/mm10/genome/olego/mm10.fa r.fa #This is optional: do a remapping to rescue more reads, no more de novo mapping here since we already used junction annotations.
Example 2: Strand-specific RNA-seq data
olego -v -t 16 -r mm.cfg -j mm10.intron.hmr.bed -M 4 -o f.sam --strand-mode 1 ~/mz-local/database/mm10/genome/olego/mm10.fa f.fa # when you know that the reads should be mapped onto the forward strand of the transcripts olego -v -t 16 -r mm.cfg -j mm10.intron.hmr.bed -M 4 -o r.sam --strand-mode 2 ~/mz-local/database/mm10/genome/olego/mm10.fa r.fa # the other end should be on the reverse strand according to the protocol
Some example commands using pipe to save space:
samtools view -uSh merge.sam |samtools sort - merge.sort # this converts the sam output into sorted bam files to save space, and the bam file should be ready for downstream analysis (e.g. Cufflinks). samtools view merge.sort.bam | sam2bed.pl -r - - | bed2junc.pl - merge.junc # this will extract all junctions from the bam output without generating sam and bed files.