One important goal in genomics is to determine the genetic differences among individuals and to understand how they relate to phenotypic differences within a species, such as humans. These variations consist of single nucleotide polymorphisms (SNPs) and structural variations (SVs), including short insertions/deletions (indels) and more complex events such as duplications and translocations. Because of the efficiency of genotyping methods and the central role they play in genome-wide association studies, SNPs are currently the best catalogued and studied human genetic variations. Ubiquitous 1-bp indels, expansions of simple repeats, and chromosomal anomalies have long been observed and acknowledged as the genetic bases of some human diseases [1]. Apart from these early discoveries, however, indels and SVs have been much less studied because of their wide size range, their multitude of types, and the lack of an efficient genotyping method. After several recent studies, their genetic significance has begun to be appreciated: not only do they exist in large numbers in human populations, they may also have a more significant impact on phenotypic variation than SNPs [3].
The microarray technology array CGH has been widely used to detect copy number variants (CNVs), a type of SV, at kilobase resolution [5]. Advances in high-throughput sequencing technologies have enabled a new set of comparative approaches for CNV calling, such as the read-depth analysis [12], which computes the read coverage of different genomic regions, and the read-pair analysis, which focuses on cases where the distance between the two ends of a read pair, when mapped back to the reference, deviates more than expected [4]. Accompanying these experimental advances, different computational methods for SV detection and breakpoint refinement have also been developed [18].
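As a rough illustration of the read-pair idea described above, the following sketch flags read pairs whose mapped span departs strongly from the typical span of the library. It is a minimal example under stated assumptions, not the procedure of any cited study: the function name, the input layout (a plain list of mapped spans in bp), and the cutoff of three robust standard deviations are all hypothetical choices. A span much longer than expected is read here as evidence of a deletion in the sampled genome, and a much shorter one as evidence of an insertion.

```python
from statistics import median


def flag_discordant_pairs(spans, n_sd=3.0):
    """Flag read pairs whose mapped span deviates from the library norm.

    spans: mapped distance (bp) between the two ends of each read pair.
    n_sd:  hypothetical cutoff, in robust standard deviations.
    Returns a list of (index, kind) with kind "deletion" or "insertion".
    """
    center = median(spans)
    # Median absolute deviation: a spread estimate that the outliers
    # we are trying to find cannot inflate.
    mad = median([abs(s - center) for s in spans])
    robust_sd = 1.4826 * mad  # scales MAD to a normal-distribution SD
    cutoff = n_sd * robust_sd

    flagged = []
    for i, span in enumerate(spans):
        if abs(span - center) > cutoff:
            # Ends map farther apart than expected: sequence present in
            # the reference is missing from the sample (deletion).
            kind = "deletion" if span > center else "insertion"
            flagged.append((i, kind))
    return flagged


# Made-up spans for illustration only.
print(flag_discordant_pairs([300, 310, 295, 305, 290, 300, 5300]))
```

Note that such a call reports only that an event lies somewhere between the two mapped ends; it gives neither the exact breakpoints nor the inserted sequence.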
Because indels/SVs come in various sizes, their detection has an additional aspect: size coverage. The aforementioned methods each address the requirements of indel/SV detection only partially and to varying degrees. Among sequence insertions and deletions, indels and SVs are conventionally defined as micro-SVs of 1-10 bp and large events over 1 kb, respectively. In the following text, wherever the context is clear, we use SV as the encompassing term, subsuming small indels. Because of methodological limitations, SVs of intermediate length have been studied minimally, if at all. Indeed, over the full spectrum of SV sizes, only a few small size spans are covered by current methods (Figure ). Moreover, the SV detection approaches described above (e.g. array, read-pair, and read-depth based methods) cannot accurately locate the breakpoints of SV events, nor can they reveal the actual sequence content of insertions. Such information can be gained only through direct analysis of the read sequences, rather than from the statistics of how those reads map.
Figure 1. The size spectrum of SVs identifiable by different methods. No method can identify SVs of all sizes. The black bars indicate the size ranges of SVs discoverable by different methods, which include the dbSNP database, the high-resolution array ...
Here we report the split-read analysis, a sequence-based method that detects SVs by directly analyzing how high-throughput sequencing reads align to the reference genome. By aligning read sequences to the reference genome with gaps, the method precisely identifies SVs covered by such reads. Because we build our method directly upon BLAT, a well-established sequence alignment program, we take advantage of the speed and sensitivity of this popular sequence-to-genome alignment tool. More importantly, our assessment strategy considers both sequencing and mapping errors when scoring each initial SV call. The method thus takes into account the sequencing error model (especially for next-generation sequencing technologies, which were not generally available a few years ago) and distinguishes different confidence levels among SV calls based on the characteristics of their supporting reads. Compared with the read-depth and read-pair analyses, our sequence-based method can not only pinpoint the breakpoints of SV events but also reveal the actual sequence content of insertions. The split-read analysis has another advantage: it can cover the whole size spectrum for deletions (Figure ). We expect our method to become more useful as sequence reads become longer.
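To make concrete how a gapped alignment exposes breakpoints and inserted sequence, the sketch below inspects the junctions between consecutive aligned blocks of a single read. It is only an illustration under assumptions: the function name and the block layout (tuples of read start, reference start, and length, 0-based) are hypothetical, loosely resembling the block lists that gapped aligners report, and the error-aware scoring of calls described above is not modeled.

```python
def call_events_from_blocks(read_seq, blocks):
    """Derive candidate indels from the gap-free blocks of one aligned read.

    read_seq: the read sequence.
    blocks:   hypothetical list of (read_start, ref_start, length) tuples,
              0-based and sorted by position in the read.
    Returns (kind, ref_left, ref_right, inserted_seq) tuples, where
    ref_left/ref_right are the reference coordinates of the breakpoints.
    """
    events = []
    for (q0, t0, n0), (q1, t1, _) in zip(blocks, blocks[1:]):
        q_end, t_end = q0 + n0, t0 + n0   # end of the left block
        read_gap = q1 - q_end             # unaligned read bases at the junction
        ref_gap = t1 - t_end              # skipped reference bases at the junction
        if ref_gap > 0 and read_gap == 0:
            # Read is contiguous but the reference is skipped: deletion.
            events.append(("deletion", t_end, t1, ""))
        elif read_gap > 0 and ref_gap == 0:
            # Reference is contiguous but the read has extra bases: insertion,
            # whose content can be read directly off the sequence.
            events.append(("insertion", t_end, t_end, read_seq[q_end:q1]))
        # Junctions with gaps on both sides are left uncalled in this sketch.
    return events


# Made-up read and coordinates for illustration only.
read = "ACGTACGTTTGGCCAA"
print(call_events_from_blocks(read, [(0, 1000, 8), (8, 1508, 8)]))
print(call_events_from_blocks(read, [(0, 200, 8), (11, 208, 5)]))
```

In this toy setting, the first call yields a 500-bp deletion with both breakpoints at base resolution, and the second recovers the three inserted bases, which is the information that mapping statistics alone cannot provide.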
Because of both experimental and computational limitations, the call sets generated by all current SV identification methods are biased on multiple levels. In addition to identifying insertions over a significantly more restricted size range than deletions, current SV identification methods are each sensitive to SVs of different lengths (Figure ), and as a result studies using them have reported different numbers of SVs. One study using the read-pair method reported 241 SVs over 8 kb in a sampled genome [7], while another using the same approach but with a different molecular construct reported 422 and 753 SVs over 3 kb in two tested genomes [4]. In a study of whole-genome sequencing and assembly, 835,926 indels were identified in a diploid human genome [26]. It is currently not known how many SVs, small or large, are present in an individual human genome. By using empirical error models estimated from sequencing experiments to simulate high-throughput sequencing reads, we could not only parameterize our split-read method but also, more importantly, quantify both its false positive and false negative rates. Knowing these error rates enables us to estimate the total number of SVs of a given length in a human genome.
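One simple way such a correction could work is sketched below. It rests on assumptions that the text above does not spell out: the false positive rate is taken to be the fraction of reported calls that are wrong, the false negative rate is taken to be the fraction of true SVs that are missed, and both are assumed constant within the size class considered. The function name is hypothetical, and the numbers in the usage line are made up.

```python
def estimate_total_svs(n_called, false_positive_frac, false_negative_frac):
    """Estimate the true number of SVs in a size class from a call count.

    n_called:            number of SVs reported by the method.
    false_positive_frac: assumed fraction of reported calls that are false.
    false_negative_frac: assumed fraction of true SVs that go undetected.
    """
    if not 0.0 <= false_positive_frac <= 1.0:
        raise ValueError("false_positive_frac must be in [0, 1]")
    if not 0.0 <= false_negative_frac < 1.0:
        raise ValueError("false_negative_frac must be in [0, 1)")
    # Keep only the calls expected to be real ...
    true_calls = n_called * (1.0 - false_positive_frac)
    # ... then scale up to account for the real events that were missed.
    return true_calls / (1.0 - false_negative_frac)


# Made-up inputs: 1000 calls, 20% of them false, half of all true SVs missed.
print(estimate_total_svs(1000, 0.2, 0.5))
```

Under these definitions, the estimate is the number of correct calls divided by the detection sensitivity.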