CTK FAQ
From Zhang Laboratory
Contents
- 1 How to install fastx_toolkit on Mac OS X
- 2 No or all reads pass filtering
- 3 Should we include multi-mapper reads
- 4 Insufficient memory
- 5 Issues collapsing PCR duplicates
- 6 CLIP tag distribution is diffuse or spiky
- 7 CIMS.pl complains of inconsistency or that a file ends unexpectedly
- 8 Motif frequency is lower than expected
How to install fastx_toolkit on Mac OS X
To install libgtextutils:
curl -O http://hannonlab.cshl.edu/fastx_toolkit/libgtextutils-0.6.tar.bz2
tar xvjf libgtextutils-0.6.tar.bz2
cd libgtextutils-0.6
./configure
make
sudo make install
To install fastx_toolkit:
curl -O http://hannonlab.cshl.edu/fastx_toolkit/fastx_toolkit-0.0.13.tar.bz2
tar xjvf fastx_toolkit-0.0.13.tar.bz2
cd fastx_toolkit-0.0.13
export PKG_CONFIG_PATH=/usr/local/lib/pkgconfig:$PKG_CONFIG_PATH
./configure
make
sudo make install
According to the Hannon Lab, if you get compilation errors saying that PKG_CONFIG or GTEXTUTILS was not found, see this email (http://hannonlab.cshl.edu/fastx_toolkit/pkg_config_email.txt) for a possible solution. If you wish to install fastx_toolkit to a non-standard location (i.e., somewhere other than /usr or /usr/local), see this email (http://hannonlab.cshl.edu/fastx_toolkit/prefix_email.txt) for tips.
No or all reads pass filtering
This symptom usually means that fastq_filter.pl is interpreting the quality scores with the wrong encoding. Specify the option ‘-if solexa’ or ‘-if sanger’ for fastq_filter.pl.
Historically, quality scores in fastq files were represented by numbers, which is the case for the two files used for this protocol. A more compact representation using ASCII characters with different offsets was later adopted. Illumina initially used offset 64 (i.e., Solexa fastq), but later switched to offset 33 (i.e., Sanger fastq), which is the default of this script for fastq files with encoded quality scores. A different encoding scheme can be specified with the parameter ‘-if’.
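If it is unclear which offset a fastq file uses, the quality lines themselves often give it away. The following is a minimal Python sketch of a common heuristic, not part of CTK: the function name and the thresholds (59 and 74) are assumptions chosen for illustration, and the result should be treated as a hint rather than a definitive answer.

```python
def guess_quality_offset(quality_lines):
    """Guess the ASCII offset (33 or 64) from fastq quality strings.

    Hypothetical helper, not part of CTK. Returns 33, 64, or None
    when the observed characters do not settle the question.
    """
    codes = [ord(ch) for line in quality_lines for ch in line.strip()]
    if not codes:
        return None
    if min(codes) < 59:
        # With offset 64 the lowest possible character is ';' (ASCII 59),
        # so anything below that points to offset 33.
        return 33
    if max(codes) > 74:
        # Assumption: offset-33 Illumina data rarely exceeds 'J' (ASCII 74).
        return 64
    return None  # ambiguous range; inspect more reads


if __name__ == "__main__":
    # Quality strings made up for illustration.
    print(guess_quality_offset(["IIII#5", "IIIIII"]))
    print(guess_quality_offset(["hhhhdd"]))
```

Following the mapping described above, a result of 33 would correspond to ‘-if sanger’ and 64 to ‘-if solexa’.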
Should we include multi-mapper reads
To minimize potential mapping errors, we require that each read maps to the genome unambiguously (no multiple hits). If multiple hits are allowed, it is important to assign a unique name to each hit. To map CLIP tags to loci known to have multiple copies or paralogs in the genome, such as some miRNAs and rRNAs, we recommend collapsing the redundant sequences and building a reference sequence database from them, instead of using the whole genome as the reference for mapping.
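If multiple hits are kept, one way to give each hit a unique name is to add a running index to the name column of the BED file. The sketch below is illustrative only: the function name and the "hitN." prefix are hypothetical and not a CTK convention. The index is placed at the front of the name so that a barcode suffix at the end of the read ID, if present, stays intact.

```python
from collections import Counter


def uniquify_hit_names(bed_lines):
    """Prefix the name (fourth column) of each BED line with a hit index.

    Hypothetical helper, not part of CTK. Every occurrence of a read
    name gets its own index, so no two output lines share a name.
    """
    seen = Counter()
    renamed = []
    for line in bed_lines:
        fields = line.rstrip("\n").split("\t")
        name = fields[3]
        seen[name] += 1
        fields[3] = f"hit{seen[name]}.{name}"
        renamed.append("\t".join(fields))
    return renamed


if __name__ == "__main__":
    # Coordinates and names made up for illustration.
    hits = [
        "chr1\t1\t10\tr1\t0\t+",
        "chr2\t5\t14\tr1\t0\t-",
        "chr3\t1\t10\tr2\t0\t+",
    ]
    for line in uniquify_hit_names(hits):
        print(line)
```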
Insufficient memory
If the input bed file is too large to be loaded into memory at once, run the command with the option ‘-big’.
Issues collapsing PCR duplicates
In the tag bed file, the fifth column must record the number of mismatches (substitutions) in each read, and the read ID in the fourth column must take the form READ#x#NNNNN, where x is the number of exact duplicates and NNNNN is the bar-code nucleotide sequence. Both are required for collapsing potential PCR duplicates by coordinates and identifying unique CLIP tags.
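Before collapsing duplicates, it can help to check that every line of the tag bed file meets these two requirements. The following Python sketch shows one way to do so. It is not part of CTK: the function names are hypothetical, and the regular expression assumes that the bar-code consists only of the letters A, C, G, T and N.

```python
import re

# READ#x#NNNNN: read name, number of exact duplicates, bar-code sequence.
# Assumption: the bar-code contains only A, C, G, T or N.
READ_ID = re.compile(r"^(?P<read>.+)#(?P<copies>\d+)#(?P<barcode>[ACGTN]+)$")


def parse_tag_name(name):
    """Split a read ID of the form READ#x#NNNNN into its three parts."""
    match = READ_ID.match(name)
    if match is None:
        raise ValueError(f"read ID not in READ#x#NNNNN form: {name!r}")
    return match.group("read"), int(match.group("copies")), match.group("barcode")


def check_bed_line(line):
    """Validate one tag bed line; return (read, copies, barcode, mismatches)."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 5:
        raise ValueError(f"expected at least 5 columns, got {len(fields)}")
    mismatches = int(fields[4])  # raises ValueError if not an integer
    read, copies, barcode = parse_tag_name(fields[3])
    return read, copies, barcode, mismatches


if __name__ == "__main__":
    # Example line made up for illustration.
    print(check_bed_line("chr1\t100\t150\tREAD1#3#ACGTA\t2\t+"))
```

A line that raises ValueError here is likely to cause trouble in the duplicate-collapsing step.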
CLIP tag distribution is diffuse or spiky
The tag distribution could be diffuse because of a problem in filtering or alignment, or spiky because PCR duplicates were not collapsed properly.
CIMS.pl complains of inconsistency or that a file ends unexpectedly
This is because mutations in non-unique tags were not removed properly. The joinWrapper.py step needs to be repeated.
Motif frequency is lower than expected
This could be due to a low signal-to-noise ratio or because the sequences were taken from the wrong strand. Ensure that the sequences of the sense strand are obtained.
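One common cause of a wrong-strand sequence is extracting the genomic plus-strand sequence for a tag on the minus strand without reverse-complementing it. The sketch below shows the conversion in Python. It is illustrative only and not part of CTK: the function names are hypothetical, and it assumes the input sequence was extracted in plus-strand orientation.

```python
# Complement table for DNA/RNA letters; U is complemented to A.
_COMPLEMENT = str.maketrans("ACGTUNacgtun", "TGCAANtgcaan")


def reverse_complement(seq):
    """Return the reverse complement of a nucleotide sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def sense_sequence(plus_strand_seq, strand):
    """Return the sense-strand sequence for a feature on '+' or '-'.

    Hypothetical helper: assumes plus_strand_seq was read from the
    plus strand of the genome.
    """
    if strand == "+":
        return plus_strand_seq
    if strand == "-":
        return reverse_complement(plus_strand_seq)
    raise ValueError(f"strand must be '+' or '-', got {strand!r}")


if __name__ == "__main__":
    print(sense_sequence("AACG", "+"))
    print(sense_sequence("AACG", "-"))
```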