OLego Documentation v1.0.6
From Zhang Laboratory
Contents
What is OLego?
OLego is a program specifically designed for de novo spliced mapping of mRNA-seq reads. OLego adopts a seed-and-extend scheme, and does not rely on a separate external mapper. It achieves high sensitivity of junction detection by strategic searches with very small seeds (12-14 nt), efficiently mapped using Burrows-Wheeler transform (BWT) and FM-index. This also makes it particularly sensitive for discovering small exons. It is implemented in C++ with full support of multiple threading, to allow for fast processing of large-scale data.
OLego is an open source code project and released under GPLv3. The implementation of OLego relies heavily on BWA (version 0.5.9rc1, http://bio-bwa.sourceforge.net/). It also uses some source code from the Jim Kent source code tree (http://genome.ucsc.edu/admin/git.html).
Versions
- v1.0.6 ( 08-09-2012 )
- Added option –max-multi (default:1000) to avoid huge data in a single line.
- Added option –num-reads-batch.
- Fixed a bug in the junction connecting step.
- v1.0.5 ( 07-16-2012 )
- Minor bug fixed (the old code crashes in a very rare case).
- v1.0.4 ( 06-12-2012 )
- Option changes ( do single-anchor search by default now ).
- v1.0.3 ( 06-10-2012 )
- Now supports strand specific library
- Fixed bugs about XS
- v1.0.0 ( 05-15-2012 )
- The initial Public release
Prerequisites
The major programs of OLego ( olego and olegoindex ) can be installed and ran on Unix-based system (Linux or MacOS) with GCC compiler installed. We provided scripts for post analysis and regression model construction, these codes may require Perl and R.
Download
The code and binary files can be found at http://sourceforge.net/projects/ngs-olego/files/ , we are regularly updating the code, so please check regularly to keep your code updated. The code can also be reached via git:
git clone git://git.code.sf.net/p/ngs-olego/code ngs-olego-code
The main programs of OLego (olego and olegoindex ) can be installed and run on Unix-based system with GCC compiler installed. We also provide scripts for post analysis and regression model construction. These codes may require Perl and R installed.
Installation
To compile OLego on your computer, please go to the OLego directory and type:
make
If everything goes right, you will find two executable files olegoindex and olego in the folder.
We also provide binary executable files at http://sourceforge.net/projects/ngs-olego/files/ for x86_64 and i686 Linux systems.
Please feel free to report any problems you come up with.
Usage
Build the index for the genome sequence
To run OLego, you need a BWT index for the reference sequences. We use exactly the same genome index used by BWA in the current version, although this will likely change in the future. For your convience, you can build the index with olegoindex that comes with this package:
olegoindex [-a bwtsw|div|is] [-p STR] <in.fasta>
Arguments:
Argument | Description |
---|---|
<in.fasta> | This is the fasta format file with the reference sequence. Please put all the sequences (from different chromosomes ) in a single file. |
Options:
Option | Description |
---|---|
-a | BWT construction algorithm: bwtsw or is [default: bwtsw] |
-p | prefix of the index [default: the same as the fasta file name] |
Caution: please use “-a bwtsw” for long genome (like human or mouse genome).
There will be 8 files (prefix.pac, prefix.ann, prefix.amb, prefix.rpac, prefix.bwt, prefix.rbwt, prefix.sa, prefix.rsa) generated after olegoindex finishes.
Running OLego
Now you can map your mRNA-seq reads to the genome with olego:
olego [options] <prefix> <in.fastx>
The arguments and options are decribed as below:
Arguments:
Argument | Description |
---|---|
<prefix> | The prefix of the genome sequence index, including the path and the base name. |
<in.fastx> | Either fasta or fastq file would work as input. |
Basic options:
Option | Description |
---|---|
-o,–-output-file | Name of the output file [ default: stdout ]. This file will be in SAM format, with some customized tags. Please see the details of the file format below. |
-j,–-junction-file | Annotation file for known exon junctions. It is in BED format and please see the junc format description below. |
-n,–-non-denovo | No de novo junction search. Note that if junction annotation file is provided by -j, these “known” junctions will still be searched. |
-t,–-num-threads | Number of threads (INT) [ default: 1 ]. OLego fully supports multiple threading, if you have multiple CPU cores on your computer, please specify the number of CPUs you want to use with this option. |
-r,–-regression-model | The file with the parameters for the logistic regression model. The mouse model will be used if no file is selected. The model file contains the parameters for the regression model (the coefficients, the PWM and the background ). We have provided model files for mouse and human. User-defined model can also be generated with the regression_model_gen scripts for any species. Please see its usage below. |
-M,–-max-total-diff | Maximum total difference between query read and reference sequence. Either INT or FLOAT number can be used for this option. An INT number will specify the maximum total edit distance allowed for each alignment. A FLOAT number will specify the fraction of missing alignments given 2% uniform base error rate. This parameter is the same as -n in BWA. [default: a FLOAT number 0.04 ] |
-w,–-word-size | The size of the seed used in junction search (INT) [ default: 14 ]. The default seed size is recommended for reads >100 nt. For shorter reads, a smaller number can be used. e.g., 12 nt for 36 nt reads. The seeds will be evenly distributed on the read from the start to the end, so please try to cover the read as much as possible with a reasonable seed size. (13 nt for 36 nt reads is a BAD example. ) |
-W,–-max-word-occ | Maximum number of matches of a seed (INT) [ default: 1000]. If a seed has more than this number of hits on the genome, then it will be considerred repeptive and all of its hits will be discarded. |
-m,–-max-word-diff | Maximum edit distance allowed for each seed (INT) [ default: 0 ]. Since our seed size is smaller than other programs, we recommend that the user use a small number for this option. |
-I,–-max-intron | Maximum intron size for de novo junction search (INT) [ default: 500000 ]. |
-i,–-min-intron | Minimum intron size for de novo junction search (INT) [ default: 20 ]. |
-e,–-min-exon | Minimum micro-exon size to be searched (INT) [ default: 9 ]. |
-a,–-min-anchor | Minimum anchor size in de novo single-anchor junction searches (INT) [ default: 8 ]. We define “anchor size” as the smaller number of matched nucleotides on the read at the end of the junction. |
-k,–-known-min-anchor | Minimum anchor size in single-anchor junction searches when the junction is in the annotation file specified by -j (INT) [ default: 5 ]. |
-v,–-verbose | Verbose mode [ default: false ]. |
Advanced options:
Option | Description |
---|---|
–-non-single-anchor | Disable single-anchor de-novo junction search. [ default: enabled ]. |
–-strand-mode | Strand mode (INT). This value should be selected from 1, 2 or 3. For strand specific RNA-seq data, please use 1 if the reads should be mapped to the FORWARD strand of the RNA, use 2 if the reads should be mapped to the REVERSE strand. If the library is not strand specific, please use 3 to allow mapping onto both strands. [ default: 3 ] |
–-max-multi | Maximum number of alignments reported for multiple mappers. [ default: 1000 ] |
–-min-logistic-prob | Minimum logistic probablity for an alignment, calculated with the splice sites motif and intron size, in the range of [0,1) [ default: 0.50]. A higher number means more stringent filter, we don’t recommend using high value since more true de novo junctions will be filtered out. |
–-max-overhang | Maximum number of overhanging nucleotides allowed near the candidate exon boundary in junction searches (INT) [ default: 6 ]. After we extend the candidate exons, we search for splice sites in the overhanging regions around the candidate exon boundary. |
–-max-gapo | Maximum number of gap opens (INT) [ default: 1 ]. |
–-max-gape | Maximum number of gap extensions, -1 for disabling long gaps (INT)[ default: -1 ]. |
–-indel-end-skip | In BWT querying, do not put an indel within this number towards the ends [ default: 5 ]. |
–-gape-max-occ | Maximum occurrences for extending a long deletion in BWT querying [ default: 10 ]. |
–-penalty-mismatch | Mismatch penalty for querying involving BWT [ default: 3 ]. |
–-penalty-gapo | Gap open penalty for querying involving BWT [ default: 11 ] |
–-penalty-gape | Gap extension penalty for querying involving BWT [ default: 4 ] |
–-log-gap | log-scaled gap penalty for long deletions for querying involving BWT. |
–-num-reads-batch | This number of reads will be loaded into the memory for processing in each batch. [4*16**4 = 262144] |
–-none-stop | non-iterative mode: search for all n-difference hits in the BWT query (slooow). |
Other useful scripts
- mergePEsam.pl
This script can be used to merge SAM format mapping results from paired-end reads. The two ends will be merged according to their distances and orientation. The script requires the two ends come from the same chromosome with proper orientation and the distance between them smaller than the threshold specified by option -d.
Usage:
perl mergePEsam.pl [options] <end1.sam> <end2.sam> <out.sam>
Arguments:
Argument | Description |
---|---|
<end1.sam> | The SAM format output from one end of the reads. |
<end2.sam> | The SAM format output from the other end of the reads. Please make sure the same lines in end1.sam and end2.sam are corresponded (i.e. from the same read pair ). |
<out.sam> | The output file. In SAM format. |
Options:
Option | Description |
---|---|
-d | Maximum distance between the two ends on the reference [ default:5000000 ]. |
–ss, –-same-strand | Require the read-pair mapped to the same strand. By default, we require the two ends mapped to different strands, which is the case in strandard Illumina RNA-seq data. |
–ns, –-no-strand | Do not use strand information as a filter. |
–nci, –no-check-input | Do not check if the read names in the input files are matched. By default, the script will check if the read names from the two ends are similar to make sure the lines are correctly matched. Please use this option if your read names are in a uncomparable format. |
-v | Verbose mode [ default: false ]. |