Sequence analysis has played a crucial role in computational biology. With the advancement of next-generation sequencing technologies, the amount of available sequencing data is growing exponentially. Removing redundancy from such data by clustering can be crucial for reducing storage space, computational time, and noise interference in some analysis methods.
CD-HIT was originally developed to cluster protein sequences to create reference databases with reduced redundancy (Li et al., 2001) and was then extended to support clustering nucleotide sequences (Li and Godzik, 2006). Since its release, CD-HIT has become very widely used for a large variety of applications, ranging from non-redundant dataset creation (Suzek et al., 2007), protein family classification (Yooseph et al., 2008), artifact identification (Niu et al., 2010), metagenomics annotation (Sun et al., 2011), and RNA analysis (Loong and Mishra, 2007) to various prediction studies (Rubinstein and Fiser, 2008).
With sequencing data growing rapidly in public data repositories as well as in individual laboratories, there has been strong demand for an enhanced CD-HIT with greater efficiency. In response to this demand, we have developed an enhanced and parallelized version of CD-HIT that exploits the multi-core machines now common in ordinary laboratories.
A computer cluster-based parallelization procedure for CD-HIT was proposed by Suzek et al. (2007). Although not fully parallelized, this procedure provides good speedup on a computer cluster. Since computer clusters are not as readily available as multi-core machines, we propose here an alternative parallelization technique that assumes a shared-memory model and works well on multi-core machines.