From Zhang Laboratory

How to install fastx_toolkit on Mac OS X

To install libgtextutils

curl -O http://hannonlab.cshl.edu/fastx_toolkit/libgtextutils-0.6.tar.bz2
tar xvjf libgtextutils-0.6.tar.bz2
cd libgtextutils-0.6
./configure
make
sudo make install

To install fastx_toolkit

curl -O http://hannonlab.cshl.edu/fastx_toolkit/fastx_toolkit-0.0.13.tar.bz2
tar xjvf fastx_toolkit-0.0.13.tar.bz2
cd fastx_toolkit-0.0.13
export PKG_CONFIG_PATH=/usr/local/lib/pkgconfig:$PKG_CONFIG_PATH
./configure
make
sudo make install

According to the Hannon Lab, if you get compilation errors regarding PKG_CONFIG or GTEXTUTILS not found, see this email (http://hannonlab.cshl.edu/fastx_toolkit/pkg_config_email.txt) for a possible solution. If you wish to install fastx-toolkit to a non-standard location (e.g. not /usr or /usr/local), see this email (http://hannonlab.cshl.edu/fastx_toolkit/prefix_email.txt) for tips.

No reads or all reads pass filtering

Specify option ‘-if solexa’ or ‘-if sanger’ for fastq_filter.pl

Historically, quality scores in fastq files were represented as numbers, as is the case for the two files used in this protocol. A more compact representation using ASCII characters with different offsets was later adopted. Illumina initially used offset 64 (i.e., Solexa fastq) but later switched to offset 33 (i.e., Sanger fastq), which is what this script assumes by default for fastq files with encoded quality scores. Use the parameter '-if' to specify a different encoding scheme.
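
To make the offset difference concrete, here is a minimal Python sketch (not part of fastq_filter.pl) that decodes an ASCII quality string under a given offset and guesses the offset from the lowest character observed. The function names decode_quality and guess_offset are hypothetical, and the threshold rests on the assumption that characters below ';' (ASCII 59) occur only in offset-33 files.

```python
def decode_quality(quality, offset):
    """Convert an ASCII-encoded quality string into a list of integer scores."""
    return [ord(ch) - offset for ch in quality]


def guess_offset(quality_strings):
    """Guess the offset (33 or 64) from the lowest quality character observed.

    Assumption: characters below ';' (ASCII 59) never appear in offset-64 files.
    A file that contains only high-quality offset-33 reads can be misclassified,
    so treat the result as a hint, not a guarantee.
    """
    lowest = min(ord(ch) for q in quality_strings for ch in q)
    return 33 if lowest < 59 else 64


if __name__ == "__main__":
    # Two made-up quality strings, for illustration only.
    quals = ["II5+", "IIII"]
    offset = guess_offset(quals)
    print(offset, decode_quality(quals[0], offset))
```

If the guess is 33, ‘-if sanger’ is the setting to try; if it is 64, try ‘-if solexa’.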

Should we include multi-mapper reads?

To minimize potential mapping errors, we require that each read maps to the genome unambiguously (no multiple hits). If multiple hits are allowed, it is important to assign a unique name to each hit. To map CLIP tags to loci known to have multiple copies or paralogs in the genome, such as some miRNAs and rRNA, it is recommended to build a reference sequence database after redundancies are collapsed instead of using the whole genome as the reference for mapping.
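
One possible reading of "collapsing redundancies" is merging reference entries whose sequences are identical. The Python sketch below illustrates that reading only; the function name collapse_identical and the '|' separator for joined names are hypothetical, and near-identical paralogs would need a proper clustering step that is not shown here.

```python
def collapse_identical(records):
    """Merge (name, sequence) records that share the same sequence.

    Sequences are compared case-insensitively. The names of merged records
    are joined with '|' (an arbitrary, illustrative choice).
    """
    groups = {}  # upper-cased sequence -> list of names, in input order
    for name, seq in records:
        groups.setdefault(seq.upper(), []).append(name)
    return [("|".join(names), seq) for seq, names in groups.items()]


if __name__ == "__main__":
    # Made-up records, for illustration only.
    records = [("copy_a", "ACGTTGCA"), ("copy_b", "acgttgca"), ("other", "GGCCAATT")]
    for name, seq in collapse_identical(records):
        print(">" + name)
        print(seq)
```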

Insufficient memory

If the input bed file is too large to be loaded into memory at once, run the command with the option ‘-big’.

Issues collapsing PCR duplicates

In the tag bed file, the fifth column must record the number of mismatches (substitutions) in each read, and the read ID in the fourth column must take the form READ#x#NNNNN, where x is the number of exact duplicates and NNNNN is the barcode nucleotide sequence. Both are required for collapsing potential PCR duplicates by coordinates and identifying unique CLIP tags.
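
A quick way to check these two requirements before collapsing is sketched below in Python. It is an illustration, not part of the pipeline: the names parse_tag_name and check_bed_line are hypothetical, and the sketch assumes a tab-delimited bed file with at least six columns and a barcode made up of the letters A, C, G, T, and N.

```python
import re

# READ#x#NNNNN: read name, number of exact duplicates, barcode sequence.
TAG_NAME = re.compile(r"^(?P<read>.+)#(?P<copies>\d+)#(?P<barcode>[ACGTN]+)$")


def parse_tag_name(name):
    """Split a tag name of the form READ#x#NNNNN into (read, copies, barcode)."""
    match = TAG_NAME.match(name)
    if match is None:
        raise ValueError("tag name is not in READ#x#NNNNN form: %r" % name)
    return match.group("read"), int(match.group("copies")), match.group("barcode")


def check_bed_line(line):
    """Validate one bed line; return (read, copies, barcode, mismatches)."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 6:
        raise ValueError("expected at least 6 tab-delimited columns")
    read, copies, barcode = parse_tag_name(fields[3])
    mismatches = int(fields[4])  # the fifth column must be an integer count
    return read, copies, barcode, mismatches


if __name__ == "__main__":
    # A made-up bed line, for illustration only.
    print(check_bed_line("chr1\t100\t130\tREAD1#3#ACGTA\t1\t+\n"))
```

Any line that raises ValueError here is a candidate explanation for duplicates that fail to collapse.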

CLIP tag distribution is diffuse or spiky

The tag distribution could be diffuse due to a problem in filtering or alignment or spiky due to PCR duplicates not being collapsed properly.

CIMS.pl complains of inconsistency or files end unexpectedly

This issue has two possible causes:

1. Mutations in non-unique tags were not removed properly. The joinWrapper.py step needs to be repeated.

2. Raw reads do not have unique names or multi-mapper reads were kept.
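
The second cause can be checked directly by looking for names that occur more than once in the tag bed file. The Python sketch below does this; the function name duplicated_names is hypothetical, and the sketch assumes a tab-delimited bed file with the read name in the fourth column.

```python
from collections import Counter


def duplicated_names(bed_lines):
    """Return the sorted names (bed column 4) that occur on more than one line."""
    counts = Counter()
    for line in bed_lines:
        line = line.rstrip("\n")
        # Skip blank lines and common bed header lines.
        if not line or line.startswith(("#", "track", "browser")):
            continue
        fields = line.split("\t")
        if len(fields) < 4:
            raise ValueError("expected at least 4 tab-delimited columns")
        counts[fields[3]] += 1
    return sorted(name for name, n in counts.items() if n > 1)


if __name__ == "__main__":
    # Made-up bed lines, for illustration only; r1 appears twice.
    lines = [
        "chr1\t10\t40\tr1#1#ACGTA\t0\t+\n",
        "chr1\t90\t120\tr2#1#ACGTT\t1\t-\n",
        "chr2\t5\t35\tr1#1#ACGTA\t0\t+\n",
    ]
    print(duplicated_names(lines))
```

An empty result means every name is unique; any name that is listed points to non-unique raw read names or to retained multi-mapper reads.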

Motif frequency is lower than expected

This could be due to a low signal-to-noise ratio or because the sequences were taken from the wrong strand. Ensure that the sense-strand sequences are obtained.
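
As an illustration of the strand issue, the Python sketch below returns the sense-strand sequence for a site. It assumes the input sequence was extracted from the plus (reference) strand of the genome, so sites on the minus strand must be reverse-complemented. The function names reverse_complement and sense_sequence are hypothetical.

```python
# Complement table for DNA; N is kept as N, and letter case is preserved.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def sense_sequence(genomic_seq, strand):
    """Return the sense-strand sequence for a site on the given strand.

    Assumption: genomic_seq was read from the plus strand of the reference.
    """
    if strand == "+":
        return genomic_seq
    if strand == "-":
        return reverse_complement(genomic_seq)
    raise ValueError("strand must be '+' or '-', got %r" % strand)


if __name__ == "__main__":
    # A made-up sequence, for illustration only.
    print(sense_sequence("AAGT", "+"), sense_sequence("AAGT", "-"))
```

If the expected motif is enriched only after this conversion, the original sequences were most likely extracted without regard to strand.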