Results 1-5 (5)

2.  Sensitive and fast mapping of di-base encoded reads 
Bioinformatics  2011;27(14):1915-1921.
Motivation: Discovering variation among high-throughput sequenced genomes relies on efficient and effective mapping of sequence reads. The speed, sensitivity and accuracy of read mapping are crucial to determining the full spectrum of single nucleotide variants (SNVs) as well as structural variants (SVs) in the donor genomes analyzed.
Results: We present drFAST, a read mapper designed for di-base encoded ‘color-space’ sequences generated with the AB SOLiD platform. drFAST is specially designed for better delineation of structural variants, including segmental duplications, and is able to return all possible map locations and underlying sequence variation of short reads within a user-specified distance threshold. We show that drFAST is more sensitive in comparison to all commonly used aligners such as Bowtie, BFAST and SHRiMP. drFAST is also faster than both BFAST and SHRiMP and achieves a mapping speed comparable to Bowtie.
Availability: The source code for drFAST is available at
PMCID: PMC3129524  PMID: 21586516
3.  Detection and characterization of novel sequence insertions using paired-end next-generation sequencing 
Bioinformatics  2010;26(10):1277-1283.
Motivation: In the past few years, human genome structural variation discovery has enjoyed increased attention from the genomics research community. Many studies were published to characterize short insertions, deletions, duplications and inversions, and associate copy number variants (CNVs) with disease. Detection of new sequence insertions requires sequence data; however, the ‘detectable’ sequence length with read-pair analysis is limited by the insert size. Thus, longer sequence insertions that contribute to our genetic makeup are not extensively researched.
Results: We present NovelSeq: a computational framework to discover the content and location of long novel sequence insertions using paired-end sequencing data generated by the next-generation sequencing platforms. Our framework can be built as part of a general sequence analysis pipeline to discover multiple types of genetic variation (SNPs, structural variation, etc.), and thus requires significantly fewer computational resources than de novo sequence assembly. We apply our methods to detect novel sequence insertions in the genome of an anonymous donor and validate our results by comparing with the insertions discovered in the same genome using various sources of sequence data.
Availability: The implementation of the NovelSeq pipeline is available at
PMCID: PMC2865866  PMID: 20385726
4.  Next-generation VariationHunter: combinatorial algorithms for transposon insertion discovery 
Bioinformatics  2010;26(12):i350-i357.
Recent years have witnessed an increase in research activity for the detection of structural variants (SVs) and their association with human disease. The advent of next-generation sequencing technologies makes it possible to extend the scope of structural variation studies to a point previously unimaginable, as exemplified by the 1000 Genomes Project. Although various computational methods have been described for the detection of SVs, no such algorithm is yet fully capable of discovering transposon insertions, a very important class of SVs to the study of human evolution and disease. In this article, we provide a complete and novel formulation to discover both loci and classes of transposons inserted into genomes sequenced with high-throughput sequencing technologies. In addition, we also present ‘conflict resolution’ improvements to our earlier combinatorial SV detection algorithm (VariationHunter) by taking the diploid nature of the human genome into consideration. We test our algorithms with simulated data from the Venter genome (HuRef) and are able to discover >85% of transposon insertion events with precision of >90%. We also demonstrate that our conflict resolution algorithm (denoted as VariationHunter-CR) outperforms current state-of-the-art algorithms (such as the original VariationHunter, BreakDancer and MoDIL) when tested on the genome of the Yoruba African individual (NA18507).
Availability: The implementation of the algorithm is available at
Supplementary information: Supplementary data are available at Bioinformatics online.
PMCID: PMC2881400  PMID: 20529927
5.  Targeted interrogation of copy number variation using SCIMMkit 
Bioinformatics  2009;26(1):120-122.
Summary: Copy number variants (CNVs) contribute substantially to human genomic diversity, and development of accurate and efficient methods for CNV genotyping is a central problem in exploring human genotype–phenotype associations. SCIMMkit provides a robust, integrated implementation of three previously validated algorithms [SCIMM (SNP-Conditional Mixture Modeling), SCIMM-Search and SCOUT (SNP-Conditional OUTlier detection)] for targeted interrogation of CNVs using Illumina Infinium II and GoldenGate SNP assays. SCIMMkit is applicable to standardized genome-wide SNP arrays and customized multiplexed SNP panels, providing economy, efficiency and flexibility in experimental design.
Availability: Source code and documentation are available for noncommercial use at
Supplementary information: Supplementary data are available at Bioinformatics online.
PMCID: PMC2796813  PMID: 19846438