The relevance of genomic structural variation (SV) to human medical disorders is well-known [1, 2] and efforts are starting to focus more systematically on characterizing SV and its implications [3]. Recent advances in technology [4], combined with the availability of the human genome sequence [5], are now opening dramatic new avenues of SV research [6-12]. These developments collectively point to the pending feasibility of investigating SV over its entire size spectrum. The most comprehensive projects will locate and identify variants, sequence them, and finally establish their statistical characteristics within a population [9].
Broadly speaking, SV encompasses translocations, inversions, and copy number variations and other types of inserted and deleted sequences (indels). Here, we focus on the last category, which is believed to occur the most frequently [6]. Historically, cytogenetic techniques were used to examine instances of SV that were coarse enough to be visible under a microscope [13]. Array technologies were later used heavily, but these platforms were still not able to reliably capture alterations well below 40 kb [14]. More recently, Volik et al. [15] proposed a procedure based on paired-end sequences that can detect much smaller variants, depending upon the type of sequence insert one employs. The scheme is remarkably straightforward in concept, relying on the fact that if the subject genome contains an insertion or deletion structural variant (ISV or DSV, respectively), the length statistics of any paired-ends aligned to a reference genome will differ from those of the progenitor library. Specifically, inserts would appear to be longer on average for DSV and shorter on average for ISV (Fig. ). In effect, the method furnishes a metaphorical caliper for observing the tell-tale length discrepancies that characterize SV.
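The length-discrepancy idea can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration rather than part of any published pipeline: the function name `apparent_lengths`, the library mean and standard deviation, and the 1 kb variant size are hypothetical, and the sketch supposes that every insert spans the variant completely, which ignores coverage and breakpoint effects.

```python
import random
import statistics


def apparent_lengths(true_lengths, sv_size, kind):
    """Return insert lengths as they would appear after alignment to the reference.

    kind="DSV": the subject genome lacks sv_size bases, so the two ends of a
    spanning insert map farther apart on the reference (apparent length grows).
    kind="ISV": the subject genome carries sv_size extra bases, so the ends map
    closer together on the reference (apparent length shrinks).
    Assumes each insert spans the whole variant (an idealization).
    """
    if kind == "DSV":
        shift = sv_size
    elif kind == "ISV":
        shift = -sv_size
    else:
        raise ValueError("kind must be 'DSV' or 'ISV'")
    return [length + shift for length in true_lengths]


# Hypothetical Gaussian library: mean 3000 bp, standard deviation 300 bp.
rng = random.Random(0)
library_lengths = [rng.gauss(3000.0, 300.0) for _ in range(200)]

dsv_lengths = apparent_lengths(library_lengths, 1000, "DSV")
isv_lengths = apparent_lengths(library_lengths, 1000, "ISV")

library_mean = statistics.mean(library_lengths)
print("DSV shift in mean:", statistics.mean(dsv_lengths) - library_mean)
print("ISV shift in mean:", statistics.mean(isv_lengths) - library_mean)
```

Under these idealized assumptions the mean aligned length moves by exactly the variant size, upward for DSV and downward for ISV, which is the discrepancy the caliper analogy refers to.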
Although investigators are actively pursuing this technique [6-12], it is still somewhat new and its conceptual simplicity actually belies a number of latent complications. Alignment tasks are not trivial [16, 17], nor are accurate descriptions of a host of statistical issues. For instance, breakpoint localization has only been examined under the idealization of constant insert lengths [18]. Gaussian length distributions provide a much better empirical fit. Indeed, projects routinely invoke precisely this assumption, subsequently exploiting elementary Gaussian thresholds to define their SV detection framework. For example, a common rule has been to declare SV if the aligned average length differs from the library average by at least 3 standard deviations [6, 10, 11, 15]. This threshold implies a confidence level of slightly better than 99%, or equivalently, a chance of committing a false positive classification error of α < 1%. Other procedures call for considering inserts more than 2 standard deviations from the average [17].
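The error rates implied by such elementary Gaussian thresholds can be checked directly from the normal tail probability. The sketch below is illustrative only: the function names are hypothetical, and whether a given project applies its rule as a two-sided or a one-sided test is an assumption, so both conventions are shown.

```python
import math


def two_sided_alpha(k):
    """P(|Z| >= k) for a standard normal Z: false positive rate of a
    two-sided rule that flags deviations of at least k standard deviations."""
    return math.erfc(k / math.sqrt(2.0))


def one_sided_alpha(k):
    """P(Z >= k): false positive rate when only one direction is tested,
    for example only longer-than-expected inserts."""
    return 0.5 * math.erfc(k / math.sqrt(2.0))


for k in (2, 3):
    print(k, two_sided_alpha(k), one_sided_alpha(k))
# k = 3: two-sided alpha is about 0.0027 (confidence of about 99.7%)
# k = 2: two-sided alpha is about 0.0455 (confidence of about 95.4%)
```

These figures hold only for a single Gaussian observation taken in isolation; they say nothing about coverage or about the false negative rate.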
In actuality, the statistical aspects of this problem are rather more complicated than the above practices would suggest. One of the outstanding issues is coverage, which current theory ignores entirely [6, 10]. Traditional fixed-length processing models [19-22] are not particularly useful here because the local covering dynamics will depend upon the variation of insert lengths in the library. While the role of variability has actually been recognized for some time [20, 23], it has not been formally investigated much beyond the elementary uniform distribution model [24, 25]. Consequently, there is little understanding of how the main statistical classifiers, α and β (Table 1), are affected by Gaussian variance through the mechanism of coverage. A subtext to this point is that the statistics of ISV versus DSV are not symmetric, contrary to what is commonly assumed [6, 10, 11, 15]. Finally, it appears that there have been no comprehensive studies related to the statistical power of the method or to how the spectrum of SV sizes can be effectively managed in a project.
Table 1. Notation for Structural Variation (SV) statistics
All of these issues have important implications for the broader enterprise of SV research, from project planning and optimization to defining detection rules within SV algorithms. Here, we report the mathematical analyses that lead to a general a priori statistical characterization of ISV and DSV when using the length-discrepancy technique in conjunction with Gaussian libraries. We describe several novel aspects of SV detection revealed by this theory and comment on their implications for pending SV projects.