CTK FAQ

From Zhang Laboratory

Revision as of 14:54, 8 August 2016 by Czhang

How to install fastx_toolkit on Mac OS X

To install libgtextutils

curl -O http://hannonlab.cshl.edu/fastx_toolkit/libgtextutils-0.6.tar.bz2
tar xvjf libgtextutils-0.6.tar.bz2
cd libgtextutils-0.6
./configure
make
sudo make install

To install fastx_toolkit

curl -O http://hannonlab.cshl.edu/fastx_toolkit/fastx_toolkit-0.0.13.tar.bz2
tar xjvf fastx_toolkit-0.0.13.tar.bz2
cd fastx_toolkit-0.0.13
export PKG_CONFIG_PATH=/usr/local/lib/pkgconfig:$PKG_CONFIG_PATH
./configure
make
sudo make install

According to the Hannon Lab, if you get compilation errors about PKG_CONFIG or GTEXTUTILS not being found, see this email (http://hannonlab.cshl.edu/fastx_toolkit/pkg_config_email.txt) for a possible solution. If you wish to install fastx_toolkit to a non-standard location (i.e., not /usr or /usr/local), see this email (http://hannonlab.cshl.edu/fastx_toolkit/prefix_email.txt) for tips.

No reads or all reads pass filtering

Specify option ‘-if solexa’ or ‘-if sanger’ for fastq_filter.pl.

Historically, quality scores in fastq files were written as numbers, which is the case for the two files used in this protocol. A more compact representation was later adopted, in which each score is encoded as a single ASCII character with a fixed offset. Illumina initially used offset 64 (i.e., Solexa fastq), but later switched to offset 33 (i.e., Sanger fastq), which is the default of this script for fastq files with encoded quality scores. If the wrong offset is assumed, every decoded score is shifted by 31, so either all reads or no reads pass the quality threshold. Use the parameter ‘-if’ to specify the encoding scheme.
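The effect of the offset can be illustrated with a minimal Python sketch. It is not part of CTK; the function names decode_quality and guess_offset are hypothetical, and the guess is only a heuristic based on which ASCII characters appear in a quality string.

```python
from typing import List, Optional


def decode_quality(qual: str, offset: int = 33) -> List[int]:
    """Convert an ASCII-encoded quality string to numeric scores."""
    return [ord(ch) - offset for ch in qual]


def guess_offset(qual: str) -> Optional[int]:
    """Heuristically guess the ASCII offset of one quality string.

    Assumption: characters below ';' (ASCII 59) cannot occur with
    offset 64, and characters above 'J' (ASCII 74) are unusual with
    offset 33. Returns None when the string fits both encodings.
    """
    codes = [ord(ch) for ch in qual]
    if min(codes) < 59:
        return 33
    if max(codes) > 74:
        return 64
    return None


# The same string decoded with the two offsets differs by 31 per base.
print(decode_quality("II5", offset=33))
print(guess_offset("II5"), guess_offset("hhb"))
```

If guess_offset returns 64 for most reads in a file, ‘-if solexa’ is probably the appropriate setting; if it returns 33, ‘-if sanger’ is. A single read can be ambiguous, so it is safer to check many reads.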

Should we include multi-mapper reads?

To minimize potential mapping errors, we require that each read maps to the genome unambiguously (no multiple hits). If multiple hits are allowed, it is important to assign a unique name to each hit. To map CLIP tags to loci known to have multiple copies or paralogs in the genome, such as some miRNAs and rRNAs, we recommend collapsing the redundant sequences and building a reference sequence database from them, instead of using the whole genome as the reference for mapping.
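As a rough illustration of the unambiguous-mapping requirement, the following Python sketch keeps only reads that occur once in a list of alignments. It assumes tab-delimited BED lines with the read name in the fourth column; the function keep_unique_hits is hypothetical and is not a CTK script.

```python
from collections import Counter
from typing import List


def keep_unique_hits(bed_lines: List[str]) -> List[str]:
    """Return only the BED lines whose read name (column 4) occurs once."""
    names = [line.rstrip("\n").split("\t")[3] for line in bed_lines]
    counts = Counter(names)
    return [line for line, name in zip(bed_lines, names) if counts[name] == 1]


hits = [
    "chr1\t0\t10\tr1\t0\t+",
    "chr2\t5\t15\tr2\t0\t-",  # r2 maps to two loci and is dropped
    "chr3\t7\t17\tr2\t0\t+",
]
print(keep_unique_hits(hits))
```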

Insufficient memory

If the input bed file is too large to be loaded into memory at once, run the command with the option ‘-big’.

Issues collapsing PCR duplicates

In the tag bed file, the fifth column must record the number of mismatches (substitutions) in each read, and the read ID in the fourth column must take the form READ#x#NNNNN, where x is the number of exact duplicates and NNNNN is the barcode nucleotide sequence. Both are required for collapsing potential PCR duplicates by coordinates and identifying unique CLIP tags.
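A quick way to check whether a tag bed file meets these two requirements is sketched below in Python. This is only a format check, not the collapsing step itself; the names TAG_ID, parse_tag_id and check_tag_line are hypothetical, and the sketch assumes that the barcode consists of the uppercase letters A, C, G, T and N.

```python
import re
from typing import Tuple

# READ#x#NNNNN: read name, number of exact duplicates, barcode sequence.
# Assumption: barcodes use only the uppercase letters A, C, G, T, N.
TAG_ID = re.compile(r"^(?P<read>.+)#(?P<copies>\d+)#(?P<barcode>[ACGTN]+)$")


def parse_tag_id(name: str) -> Tuple[str, int, str]:
    """Split a read ID of the form READ#x#NNNNN into its three parts."""
    match = TAG_ID.match(name)
    if match is None:
        raise ValueError(f"read ID is not of the form READ#x#NNNNN: {name!r}")
    return match.group("read"), int(match.group("copies")), match.group("barcode")


def check_tag_line(line: str) -> bool:
    """Return True if column 4 is READ#x#NNNNN and column 5 is a mismatch count."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 5:
        return False
    return TAG_ID.match(fields[3]) is not None and fields[4].isdigit()


print(parse_tag_id("read7#3#ACGTA"))
print(check_tag_line("chr1\t100\t150\tread7#3#ACGTA\t1\t+"))
```

Lines for which check_tag_line returns False are likely to cause problems when PCR duplicates are collapsed.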

CLIP tag distribution is diffuse or spiky

The tag distribution could be diffuse because of a problem in filtering or alignment, or spiky because PCR duplicates were not collapsed properly.

CIMS.pl complains of inconsistency or that a file ends unexpectedly

This is because mutations in non-unique tags were not removed properly. The joinWrapper.py step needs to be repeated.

Motif frequency is lower than expected

This could be due to a low signal-to-noise ratio or because the sequences were taken from the wrong strand. Ensure that the sequences are extracted from the sense strand.
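For sites on the minus strand, the sense sequence is the reverse complement of the genomic (plus-strand) sequence. The Python sketch below shows the conversion; the function names are hypothetical and the strand is assumed to be given as "+" or "-", as in a bed file.

```python
# Translation table for complementing DNA bases (upper and lower case).
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def sense_sequence(genomic_seq: str, strand: str) -> str:
    """Return the sense-strand sequence for a site on the given strand."""
    if strand == "+":
        return genomic_seq
    if strand == "-":
        return reverse_complement(genomic_seq)
    raise ValueError(f"strand must be '+' or '-', got {strand!r}")


print(sense_sequence("AACG", "-"))
```

A motif such as a hypothetical AACG would be missed at every minus-strand site if the plus-strand sequence were searched without this conversion.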