A major part of the genetic difference between individuals is encoded in the form of structural variations, such as insertions, deletions and duplications. In previous studies, hundreds of large-scale structural variants have been identified using arrayCGH (Iafrate et al.
; Sebat et al.
). Polymorphic transposon insertions (Bennett et al.
) and other short indel polymorphisms (Mills et al.
) have been discovered using capillary sequencing reads. Clone-end capillary sequencing provided a means to interrogate the intermediate size structural variations and particularly, the method is able to find relatively large insertion events (Kidd et al.
). Whole genome complete sequencing created a more detailed catalog of structural variations for single human individuals (Levy et al.
; Wheeler et al.
, 2008). For example, the de novo assembly of the Venter genome played an essential role in detecting long insertions, deletions and structural rearrangements.
Recent efforts on variant detection have been fueled up by next-gen high throughput sequencing. Using the Illumina platform for a complete sequencing of a human individual, Bentley et al.
) reported ~4-million SNPs and ~0.4-million short indels of size 1–16 bp. However, large indel events were not extensively studied by the authors. For the next-gen sequencing data, there are a number of ways to detect long variants, notably assembly (complete de novo or using unmapped reads only), read splitting, read coverage depth analysis, inconsistencies of insert sizes through paired-end mapping, etc. Among these, de novo assembly from short reads perhaps offers the best chance for long indels and structural rearrangements. There are a number of short read assemblers based on de Bruijn graphs. However, assembly of short read data is most successful when applied to small genoms like bacterial genomes (Chaisson and Pevzner, 2008
; Zerbino and Birney, 2008
). Above the eukaryotic level, it would be problematic due to repetitive genome regions. Given the current read length of 35–75 bp in the next-gen sequencing platforms, lack of high quality de novo assembly looks like to continue in the near future. Therefore, the need is eminent in the genomic community to develop read mapping related algorithms in order to reliably identify structural variants.
In this article, we present Pindel, a method that uses pattern growth algorithm to identify the break points of large deletions (1 bp–10 kb) and medium sized insertions (1–20 bp) from 36 bp paired-end short reads. We will start with introducing pattern growth for string matching. Then Pindel, the procedure of computing medium sized insertions and large deletions from paired-end short reads, will be illustrated. We test our Pindel program with simulated paired-end short reads on human chromosome X. We also report the results of Pindel using the whole human genome data of NA18507, sequenced on an Illumina platform. Finally Runtime and peak memory usage are analyzed to demonstrate the efficiency of Pindel.