OLego Documentation

From Zhang Laboratory

Revision as of 13:47, 16 August 2022 by Czhang


What is OLego?

OLego is a program specifically designed for de novo spliced mapping of mRNA-seq reads. OLego adopts a seed-and-extend scheme and does not rely on a separate external mapper. It achieves high sensitivity in junction detection through strategic searches with very small seeds (12-14 nt), which are mapped efficiently using the Burrows-Wheeler transform (BWT) and FM-index. This also makes it particularly sensitive for discovering small exons. OLego is implemented in C++ with full multithreading support, allowing fast processing of large-scale data.

OLego is an open-source project released under GPLv3. The implementation of OLego relies heavily on BWA (version 0.5.9rc1, http://bio-bwa.sourceforge.net/).


Citation:

Wu, J., Anczukow, O., Krainer, A.R., Zhang, M.Q.†, Zhang, C.†, 2013. OLego: Fast and sensitive mapping of spliced mRNA-Seq reads using small seeds. Nucleic Acids Res., published online.


Contact: Jie Wu (wuj (at) cshl dot e d u) and Chaolin Zhang (cz2294 (at) columbia dot e d u)

Versions

  • v1.1.9 ( 8-16-2022 ), current
    • Together with v1.1.8, we fixed a few minor issues to improve compatibility
  • v1.1.7 ( 3-21-2017 )
    • accept gz input in mergePEsam.pl
  • v1.1.6 ( 9-14-2015 )
    • Add zebrafish model and exon junction files
  • v1.1.5 ( 7-14-2014 )
    • improvement in speed
    • minor bug fixes
  • v1.1.2 ( 7-1-2013 )
    • Improved sensitivity for small exons and single-anchor search by allowing mismatches.
    • Allows overlapping seeds to improve speed and seeding flexibility.
    • Increased the default seed size to 15 (with at most 1 nt of overlap), which significantly increases speed without a noticeable sacrifice in sensitivity.
    • Fixed a bug (crash when using -M 0 -w12).
  • v1.1.1 ( 4-14-2013 )
    • Improved speed by filtering simple repetitive anchors.
    • Default options optimized.
  • v1.1.0 ( 3-31-2013 )
    • Bug fixed for duplicate entries for some reads, sensitivity improved.
    • Optimized on option -W.
    • Bug fixed in sam2bed.pl.
    • Bug fixed for option -e.
    • Bug fixed for regression_model_gen.
    • Add support for gzip input file for sam2bed.pl.
  • v1.0.8 ( 11-20-2012 )
    • Improvement in hit clustering.
    • Fixed an overcounting problem in mismatch counting.
    • Fixed bug in merging step.
    • Fixed bug in XS tag for extra exon body reads.
    • Allows pipe input/output with "-" for some of the scripts.
  • v1.0.6 ( 08-09-2012 )
    • Added option --max-multi (default: 1000) to avoid huge data in a single line.
    • Added option --num-reads-batch.
    • Fixed a bug in the junction connecting step.
  • v1.0.5 ( 07-16-2012 )
    • Minor bug fixed (the old code crashes in a very rare case).
  • v1.0.4 ( 06-12-2012 )
    • Option changes ( do single-anchor search by default now ).
  • v1.0.3 ( 06-10-2012 )
    • Now supports strand specific library
    • Fixed bugs about XS
  • v1.0.0 ( 05-15-2012 )
    • The initial Public release

Prerequisites

The main programs of OLego (olego and olegoindex) can be installed and run on Unix-based systems (Linux or Mac OS X) with the GCC compiler installed. We also provide scripts for post-analysis and regression model construction; these scripts may require Perl and R.

Download

Codes and binaries

The source code and binaries are available at https://github.com/chaolinzhanglab/olego. This program is still under active development, so please check the site periodically for updates. The most up-to-date version can also be retrieved via git:

git clone https://github.com/chaolinzhanglab/olego.git


The exon junction database

We have built a non-redundant, comprehensive exon junction database (only GT/AG splice sites for now) from RefSeq, mRNAs, and ESTs. We recommend providing the junction database for alignment (with -j) to improve sensitivity.

Download exon junction database:


In each of the files above, exon junctions from human, mouse and rat were lifted over to the other species and consolidated to remove redundancy.

Installation

To compile OLego on your computer, please go to the OLego directory and type:

make

If everything goes right, you will find two executable files olegoindex and olego in the folder.

We also provide binary executable files at http://sourceforge.net/projects/ngs-olego/files/ for x86_64 and i686 Linux systems.

Please feel free to report any problems you encounter.


Usage

Build the index for the genome sequence

To run OLego, you need a BWT index for the reference sequences. In the current version, the genome index used by OLego is in exactly the same format as the one used by BWA. However, this will likely change in the future. For your convenience, you can build the index with olegoindex, which comes with this package:

olegoindex [-a bwtsw|div|is] [-p STR] <in.fasta>

Arguments:

Argument Description
<in.fasta> The FASTA format file with the reference sequence. Please put all the sequences (from different chromosomes) in a single file.

Options:

Option Description
-a BWT construction algorithm: bwtsw or is [default: bwtsw]
-p prefix of the index [default: the same as the fasta file name]


Caution: please use “-a bwtsw” for large genomes (such as the human or mouse genome).

There will be 8 files (prefix.pac, prefix.ann, prefix.amb, prefix.rpac, prefix.bwt, prefix.rbwt, prefix.sa, prefix.rsa) generated after olegoindex finishes.
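
As a quick sanity check after indexing, the eight files listed above can be verified with a short script. This is a minimal sketch: the helper names are hypothetical, and only the file suffixes come from this page.

```python
from pathlib import Path

# Suffixes of the eight index files listed above.
INDEX_SUFFIXES = ("pac", "ann", "amb", "rpac", "bwt", "rbwt", "sa", "rsa")


def expected_index_files(prefix):
    """Return the eight index paths expected for a given index prefix."""
    return [Path(f"{prefix}.{suffix}") for suffix in INDEX_SUFFIXES]


def missing_index_files(prefix):
    """Return the expected index files that are not present on disk."""
    return [path for path in expected_index_files(prefix) if not path.is_file()]


if __name__ == "__main__":
    # "mm10.fa" is the prefix used in the examples on this page.
    missing = missing_index_files("mm10.fa")
    if missing:
        print("missing index files:", ", ".join(str(p) for p in missing))
    else:
        print("index looks complete")
```

With the default prefix (the FASTA file name), the expected files are mm10.fa.pac, mm10.fa.ann, and so on.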

Running OLego

Now you can align your mRNA-seq reads to the genome with olego:

olego [options] <prefix> <in.fastx>

Whenever possible, provide a database of annotated junctions for alignment (-j), because this gives higher mapping sensitivity for junction reads. Other important parameters include the word size of seeds (-w) and the maximum number of mismatches allowed (-M, counting both substitutions and indels). The defaults of these parameters were optimized for mammalian genomes to balance accuracy, sensitivity and speed.

For mammalian genomes, a seed size in the range of 12-15 nt is recommended (default = 15 nt with 1-nt overlap). For short reads (e.g., < 50 nt), the seed size should be chosen to allow 3 or more seeds whenever possible (e.g., -w 12 for 36 nt reads). The number of seeds in each read has a significant impact on the sensitivity of alignment.
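
The arithmetic behind the "3 or more seeds" advice can be sketched as follows. This is only a rough lower bound that counts seeds placed end to end without overlap; OLego's actual seed placement (including --word-max-overlap) may differ, and the function name is hypothetical.

```python
def non_overlapping_seeds(read_len, word_size):
    """Number of whole seeds of word_size nt that fit end to end in a read."""
    return read_len // word_size


if __name__ == "__main__":
    # 36 nt reads with -w 12 give 3 seeds; the default -w 15 gives only 2.
    for read_len, word_size in [(36, 12), (36, 15), (100, 15)]:
        print(read_len, word_size, non_overlapping_seeds(read_len, word_size))
```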

The default number of mismatches allowed depends on the read length:

17nt reads: max_diff = 1
20nt reads: max_diff = 2
45nt reads: max_diff = 3
73nt reads: max_diff = 4
104nt reads: max_diff = 5
137nt reads: max_diff = 6
172nt reads: max_diff = 7
208nt reads: max_diff = 8
244nt reads: max_diff = 9
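
The table above is consistent with the Poisson rule that BWA uses for its -n option, applied with the 0.06 default of -M and a 2% base error rate. The sketch below assumes OLego follows that BWA rule (the -M description says the parameter is the same as -n in BWA); it is not taken from the OLego source, and the function name is hypothetical.

```python
import math


def default_max_diff(read_len, err=0.02, missing_frac=0.06):
    """Smallest k >= 1 with P(more than k errors) < missing_frac,
    where the number of errors is modelled as Poisson(read_len * err)."""
    lam = read_len * err
    term = math.exp(-lam)  # probability of exactly 0 errors
    cumulative = term
    for k in range(1, 1000):
        term *= lam / k    # probability of exactly k errors
        cumulative += term
        if 1.0 - cumulative < missing_frac:
            return k
    return 2  # fallback value used in BWA's implementation of this rule


if __name__ == "__main__":
    for read_len in (17, 20, 45, 73, 104):
        print(f"{read_len}nt reads: max_diff = {default_max_diff(read_len)}")
```

Under this reading, each row of the table gives the shortest read length at which the stated max_diff applies; for example, 44 nt reads still get max_diff = 2 and 45 nt reads get 3.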

The speed of alignment increases linearly as more threads are used (-t). A slight increase in word size (e.g., -w 15 vs. -w 14) can dramatically increase the speed without much loss in sensitivity.


The arguments and options are described in more detail below:

Arguments:

Argument Description
<prefix> The prefix of the genome sequence index, including the path and the base name.
<in.fastx> The input file, in either FASTA or FASTQ format. Gzipped files are also accepted. Additionally, using "-" makes the program read input from STDIN.

Basic options:

Option Description
-o,--output-file Name of the output file [ default: stdout ]. The output is in SAM format, with some customized tags. Please see the details of the file format below.
-j,--junction-file Annotation file for known exon junctions, in BED format. Please see the junc format description below.
-n,--non-denovo Disable de novo junction search. Note that if a junction annotation file is provided with -j, these “known” junctions will still be searched.
-t,--num-threads Number of threads (INT) [ default: 1 ]. OLego fully supports multithreading; if your computer has multiple CPU cores, please specify the number of cores you want to use with this option.
-r,--regression-model The file with the parameters for the logistic regression model. The mouse model will be used if no file is specified. The model file contains the parameters for the regression model (the coefficients, the PWM and the background). We provide model files for mouse and human (in the folder models). A user-defined model can also be generated for any species with the regression_model_gen scripts. Please see their usage below.
-M,--max-total-diff Maximum total difference between the query read and the reference sequence. Either an INT or a FLOAT number can be used for this option. An INT number specifies the maximum total edit distance allowed for each alignment. A FLOAT number specifies the fraction of missing alignments given a 2% uniform base error rate. This parameter is the same as -n in BWA. [ default: a FLOAT number 0.06 ]
-w,--word-size The size of the seed used in junction search (INT) [ default: 15 nt with --word-max-overlap 1 nt ]. The default seed size is recommended for reads >100 nt. For shorter reads, a smaller number can be used, e.g., 12 nt with 0 nt seed overlap or 13 nt with 1 nt seed overlap for 36 nt reads. The seeds are evenly distributed on the read from the start to the end, with a maximum overlap defined by --word-max-overlap, so please try to cover as much of the read as possible with a reasonable seed size and maximum seed overlap. (14 nt with 1 nt seed overlap for 36 nt reads is a BAD example.)
-W,--max-word-occ Maximum number of matches of a seed (INT) [ default: max (1000*3^(14-word_size), 300) ]. If a seed has more than this number of hits on the genome, it will be considered repetitive and all of its hits will be discarded.
-m,--max-word-diff Maximum edit distance allowed for each seed (INT) [ default: 0 ]. Since our seed size is smaller than that of other programs, we recommend using a small number for this option.
-I,--max-intron Maximum intron size for de novo junction search (INT) [ default: 500000 ].
-i,--min-intron Minimum intron size for de novo junction search (INT) [ default: 20 ].
-e,--min-exon Minimum micro-exon size to be searched (INT) [ default: 9 ].
-a,--min-anchor Minimum anchor size in de novo single-anchor junction searches (INT) [ default: 8 ]. We define “anchor size” as the smaller number of matched nucleotides on the read at the end of the junction.
-k,--known-min-anchor Minimum anchor size in single-anchor junction searches when the junction is in the annotation file specified by -j (INT) [ default: 5 ].
-v,--verbose Verbose mode [ default: false ].
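
To see how the default of -W scales with the seed size, the formula quoted above can be evaluated directly. This is a small sketch; how OLego rounds non-integer values (for example at -w 15) is not stated here, so the function returns the unrounded value.

```python
def default_max_word_occ(word_size):
    """Default of -W as given above: max(1000 * 3^(14 - word_size), 300)."""
    return max(1000 * 3.0 ** (14 - word_size), 300)


if __name__ == "__main__":
    for word_size in (12, 13, 14, 15, 16):
        print(word_size, default_max_word_occ(word_size))
```

For example, -w 14 gives 1000 and -w 12 gives 9000, while for seeds of 16 nt or longer the floor of 300 applies.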


Advanced options:

Option Description
--word-max-overlap Maximum overlap between seeds in the seeding step. [ default: 1 ]
--non-single-anchor Disable single-anchor de novo junction search. [ default: single-anchor search enabled ]
--allow-rep-anchor Allow anchors with simple repetitive sequences. [ default: not allowed ]
--strand-mode Strand mode (INT). This value should be 1, 2 or 3. For strand-specific RNA-seq data, please use 1 if the reads should be mapped to the FORWARD strand of the RNA, or 2 if the reads should be mapped to the REVERSE strand. If the library is not strand-specific, please use 3 to allow mapping onto both strands. [ default: 3 ]
--max-multi Maximum number of alignments reported for multiple mappers. [ default: 20 ]
--min-logistic-prob Minimum logistic probability for an alignment, calculated with the splice site motif and intron size, in the range of [0,1) [ default: 0.50 ]. A higher number means a more stringent filter; we do not recommend using a high value, since more true de novo junctions will be filtered out.
--max-overhang Maximum number of overhanging nucleotides allowed near the candidate exon boundary in junction searches (INT) [ default: 6 ]. After we extend the candidate exons, we search for splice sites in the overhanging regions around the candidate exon boundary.
--max-gapo Maximum number of gap opens (INT) [ default: 1 ].
--max-gape Maximum number of gap extensions, -1 for disabling long gaps (INT) [ default: -1 ].
--indel-end-skip In BWT querying, do not put an indel within this number of nucleotides from the ends [ default: 5 ].
--gape-max-occ Maximum occurrences for extending a long deletion in BWT querying [ default: 10 ].
--penalty-mismatch Mismatch penalty for querying involving BWT [ default: 3 ].
--penalty-gapo Gap open penalty for querying involving BWT [ default: 11 ].
--penalty-gape Gap extension penalty for querying involving BWT [ default: 4 ].
--log-gap Log-scaled gap penalty for long deletions for querying involving BWT.
--num-reads-batch This number of reads will be loaded into memory for processing in each batch. [ 4*16**4 = 262144 ]
--none-stop Non-iterative mode: search for all n-difference hits in the BWT query (slow).

Other useful scripts

mergePEsam.pl

This script can be used to merge SAM format mapping results from paired-end reads. The two ends are merged according to their distance and orientation. The script requires that the two ends map to the same chromosome with proper orientation and that the distance between them be smaller than the threshold specified by option -d.

Usage:

perl mergePEsam.pl [options] <end1.sam> <end2.sam> <out.sam>

Arguments:

Argument Description
<end1.sam> The SAM format output from one end of the reads.
<end2.sam> The SAM format output from the other end of the reads. Please make sure that corresponding lines in end1.sam and end2.sam come from the same read pair.
<out.sam> The output file. In SAM format.

Options:

Option Description
-d Maximum distance between the two ends on the reference [ default:5000000 ].
--ss, --same-strand Require the read pair to be mapped to the same strand. By default, we require the two ends to be mapped to different strands, which is the case in standard Illumina RNA-seq data.
--ns, --no-strand Do not use strand information as a filter.
--nci, --no-check-input Do not check whether the read names in the input files match. By default, the script checks whether the read names from the two ends are similar, to make sure the lines are correctly matched. Please use this option if your read names are in a format that cannot be compared.
-v Verbose mode [ default: false ].
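
The merging criteria described for mergePEsam.pl can be illustrated with a small predicate. This is a sketch of the stated rules only, not the script's actual code: the function name is hypothetical, and measuring the distance between the two mapped start positions is an assumption.

```python
def ends_can_merge(chrom1, pos1, strand1, chrom2, pos2, strand2,
                   max_dist=5000000, same_strand=False, no_strand=False):
    """Return True if two mapped ends satisfy the pairing rules described above."""
    # Both ends must map to the same chromosome.
    if chrom1 != chrom2:
        return False
    # Unless the strand filter is disabled (--no-strand), the ends must be on
    # different strands by default, or on the same strand with --same-strand.
    if not no_strand and (strand1 == strand2) != same_strand:
        return False
    # The distance on the reference must be below the -d threshold.
    return abs(pos2 - pos1) < max_dist


if __name__ == "__main__":
    print(ends_can_merge("chr1", 1000, "+", "chr1", 1300, "-"))  # proper pair
    print(ends_can_merge("chr1", 1000, "+", "chr2", 1300, "-"))  # different chromosomes
```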


xa2multi.pl

This script can be used to extract all the alignments listed after the tag “XA” in each line. The current version is from the BWA package, with minor modifications.

Usage:

perl xa2multi.pl in.sam >out.sam


sam2bed.pl

This script converts SAM format output from OLego into BED format file. Only the best alignment (major alignment) of each read will be used.

Usage:

perl sam2bed.pl [options] <in.sam> <out1.bed> [out2.bed]

Arguments:

Argument Description
<in.sam> The SAM input file from OLego. "-" can be used to input from STDIN.
<out1.bed> [out2.bed] Please specify two BED files if you want paired-end reads written to separate BED files; otherwise, all the reads will be written to out1.bed. "-" can be used to write to STDOUT.

Options:

Option Description
-u,--uniq Only output uniquely mapped reads. The script identifies unique reads by the tag “XT:A:U”.
-r,--use-RNA-strand Use the strand of the RNA based on the XS tag. By default, this script uses the strand of the read.
-v Verbose mode [ default: false ].


Using this script to convert SAM outputs from other programs might cause problems!
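
The idea behind --uniq can be sketched in a few lines: keep only alignment lines that carry the tag XT:A:U among the optional SAM fields. This is an illustration of that filter, not the code of sam2bed.pl; the function names and the example records are made up.

```python
def is_unique(sam_line):
    """True if an alignment line carries the XT:A:U tag in its optional fields."""
    fields = sam_line.rstrip("\n").split("\t")
    # The first 11 SAM columns are mandatory; tags start at column 12.
    return "XT:A:U" in fields[11:]


def unique_alignments(lines):
    """Yield alignment lines (header lines skipped) that are uniquely mapped."""
    for line in lines:
        if line.startswith("@"):
            continue
        if is_unique(line):
            yield line


if __name__ == "__main__":
    # Made-up records for illustration only.
    core = "\t0\tchr1\t100\t37\t4M\t*\t0\t0\tACGT\tIIII\t"
    records = ["@HD\tVN:1.0", "r1" + core + "XT:A:U", "r2" + core + "XT:A:R"]
    for line in unique_alignments(records):
        print(line.split("\t")[0])
```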

bed2junc.pl

This script can be used to retrieve unique junctions from BED format file. The number of supporting reads of each junction will be in the score (5th) column. The output file can be used as junction annotation file for OLego (option -j).

Usage:

perl bed2junc.pl <in.bed> <out.bed>

Arguments:

Argument Description
<in.bed> The input BED file with the mapping results. "-" can be used to input from STDIN.
<out.bed> The output BED format file with the junctions. This file can be directly used as the junction annotation file for olego. "-" can be used to write to STDOUT.
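
Conceptually, extracting junctions amounts to collecting the gaps between alignment blocks and counting how many reads support each gap. The sketch below assumes the input is standard 12-column BED (with blockSizes and blockStarts) and reports introns in 0-based, half-open coordinates; both are assumptions about bed2junc.pl, whose actual conventions may differ, and the function name is hypothetical.

```python
from collections import Counter


def junction_counts(bed_lines):
    """Count supporting reads for each intron implied by BED12 block structure."""
    counts = Counter()
    for line in bed_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 12:
            continue  # no block information, so no junction can be derived
        chrom, start, strand = fields[0], int(fields[1]), fields[5]
        sizes = [int(x) for x in fields[10].split(",") if x]
        starts = [int(x) for x in fields[11].split(",") if x]
        for i in range(min(len(sizes), len(starts)) - 1):
            intron_start = start + starts[i] + sizes[i]  # end of block i
            intron_end = start + starts[i + 1]           # start of block i + 1
            counts[(chrom, intron_start, intron_end, strand)] += 1
    return counts


if __name__ == "__main__":
    # Two made-up spliced reads sharing one junction, plus one unspliced read.
    reads = [
        "chr1\t100\t400\tr1\t0\t+\t100\t400\t0\t2\t50,50\t0,250",
        "chr1\t120\t380\tr2\t0\t+\t120\t380\t0\t2\t30,30\t0,230",
        "chr1\t500\t550\tr3\t0\t+\t500\t550\t0\t1\t50\t0",
    ]
    for (chrom, s, e, strand), n in junction_counts(reads).items():
        print("\t".join([chrom, str(s), str(e), "junc", str(n), strand]))
```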

regression_model_gen

This set of scripts can be used to generate the user-defined logistic regression model.

Usage:

perl regression_model_gen/OLego_regression.pl [options]

Options:

Option Description
-g The location of the FASTA files downloaded from the UCSC genome browser. The FASTA files should have names such as chr1.fa.
-a BED format annotation files for the true transcripts. True junctions will be extracted from this file.
-o Output prefix [default: userdefined].

The model file will be generated in output_prefix.cache, with the file name output_prefix.cfg.


Additional notes

File formats

SAM format

OLego outputs the alignments in SAM format (http://samtools.sourceforge.net). Its specification can be found on samtools’ website.

The following tags are used in OLego. Please pay attention to the X? tags; most of them were adopted from BWA:

Tag Meaning
NM Edit distance
MD Mismatching positions/bases
X0 Number of best hits
X1 Number of suboptimal hits
XM Number of mismatches in the alignment
XN Number of ‘N’s in the reference
XO Number of gap opens
XG Number of gap extensions
XT Type: Unique/Repeat/N *
XS Strand of the RNA **
XA Alternative hits; format: (chr,pos,CIGAR,NM,XS;)

* “Unique” or “Repeat” is determined by the number of best hits (top hits with the same edit distance, X0), NOT the total number of hits. “N” means there are more than 10 ‘N’s in the reference (XN>10).
** For a junction read, the strand of the RNA is determined by the annotation if the junction is annotated, or by the splice signal if it is a novel junction. For an exonic read, the strand cannot be determined (a “.” is assigned).
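
The XA format listed in the table, (chr,pos,CIGAR,NM,XS;), can be split into individual hits as sketched below. The example tag value is made up for illustration, and the position is kept as a string because its exact encoding (for example, a leading strand sign as in BWA) is not specified on this page.

```python
def parse_xa(tag):
    """Split an XA tag (XA:Z:chr,pos,CIGAR,NM,XS;...) into a list of hits."""
    value = tag.split(":", 2)[2]
    hits = []
    for entry in value.split(";"):
        if not entry:
            continue  # the trailing ';' leaves an empty last element
        chrom, pos, cigar, nm, xs = entry.split(",")
        hits.append({"chrom": chrom, "pos": pos, "cigar": cigar,
                     "nm": int(nm), "xs": xs})
    return hits


if __name__ == "__main__":
    # Hypothetical tag value following the format in the table above.
    example = "XA:Z:chr1,+1000,36M,1,+;chr5,-2500,20M300N16M,2,-;"
    for hit in parse_xa(example):
        print(hit["chrom"], hit["pos"], hit["cigar"], hit["nm"], hit["xs"])
```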

Additional scripts are provided in the package for processing OLego output: sam2bed.pl can be used for conversion from SAM to BED format; xa2multi.pl can extract alignments after XA tags; mergePEsam.pl can merge the two outputs from paired-end RNA-seq data.

For general processing of SAM files, please check SAMTools.

junc format

OLego takes junction annotations in junc (BED) format.

Column Name Description
1 chrom The name of the chromosome
2 chromStart The starting position of the junction (intron)
3 chromEnd The ending position of the junction (intron)
4 name Name of the junction
5 score Reserved for the score of the junction; bed2junc.pl, provided in the package, writes the number of supporting reads in this column.
6 strand Strand of the junction

The score column is not essential.
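
A junc line with the six columns above can be written and read back as follows. This is a minimal sketch: the helper names and the example junction are hypothetical, and a placeholder score of 0 is used on the assumption that OLego ignores the value of that column.

```python
def format_junc(chrom, start, end, name, strand, score=0):
    """Build one tab-separated junc (BED) line with the six columns above."""
    return "\t".join([chrom, str(start), str(end), name, str(score), strand])


def parse_junc(line):
    """Parse one junc line into a dictionary keyed by the column names."""
    chrom, start, end, name, score, strand = line.rstrip("\n").split("\t")[:6]
    return {"chrom": chrom, "chromStart": int(start), "chromEnd": int(end),
            "name": name, "score": score, "strand": strand}


if __name__ == "__main__":
    line = format_junc("chr1", 150, 350, "junc_1", "+")
    print(line)
    print(parse_junc(line))
```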

Examples

Example 1: Standard Illumina RNA-seq data (without strand information)

Assume we have paired-end data of read length 100nt:

olegoindex -a bwtsw mm10.fa
# build your BWT index, mm10.fa has all chromosomes combined

olego -v -t 16 -r mm.cfg -j mm10.intron.hmr.bed -M 4 -o f.sam ~/mz-local/database/mm10/genome/olego/mm10.fa f.fa
# do the mapping allowing 4 mismatches (or indels) with 16 CPU cores, output to f.sam

olego -v -t 16 -r mm.cfg -j mm10.intron.hmr.bed -M 4 -o r.sam ~/mz-local/database/mm10/genome/olego/mm10.fa r.fa
# do the same thing for the other file

mergePEsam.pl f.sam r.sam merge.sam
# merge both ends into merge.sam

sam2bed.pl --use-RNA-strand merge.sam merge.bed
# convert the SAM file to BED file

bed2junc.pl merge.bed merge.junc
# find the junctions in the BED file

olego -v -t 16 -r mm.cfg -j merge.junc -M 4 --non-denovo -o f.remap.sam  ~/mz-local/database/mm10/genome/olego/mm10.fa f.fa
olego -v -t 16 -r mm.cfg -j merge.junc -M 4 --non-denovo -o r.remap.sam ~/mz-local/database/mm10/genome/olego/mm10.fa r.fa
#This is optional: do a remapping to rescue more reads, no more de novo mapping here since we already used junction annotations.  

Example 2: Strand-specific RNA-seq data

olego -v -t 16 -r mm.cfg -j mm10.intron.hmr.bed -M 4 -o f.sam --strand-mode 1 ~/mz-local/database/mm10/genome/olego/mm10.fa f.fa
# when you know that the reads should be mapped onto the forward strand of the transcripts

olego -v -t 16 -r mm.cfg -j mm10.intron.hmr.bed -M 4 -o r.sam --strand-mode 2 ~/mz-local/database/mm10/genome/olego/mm10.fa r.fa
# the other end should be on the reverse strand according to the protocol

Some example commands using pipe to save space:

samtools view -uSh merge.sam |samtools sort - merge.sort
# this converts the sam output into sorted bam files to save space, and the bam file should be ready for downstream analysis (e.g. Cufflinks).

samtools view merge.sort.bam | sam2bed.pl -r - - | bed2junc.pl - merge.junc
# this will extract all junctions from the bam output without generating sam and bed files.