The human genome contains a diverse array of genomic variants. Among the best known are single nucleotide polymorphisms (SNPs), length polymorphisms of microsatellite sequences, and several types of structural variations (SVs). SVs include dosage-altering variations such as insertions and deletions, and dosage-invariant rearrangements such as inversions and translocations. Deletions and insertions larger than 1 kb [1] are often collectively referred to as copy number variations (CNVs), while smaller (<1 kb) insertions or deletions are referred to as indels. SNPs have long been thought to be the most common class of genetic variation and have been used widely in linkage and genome-wide association studies [2]. However, it is now recognized that other types of variation are also widespread in human genomes [3], even in the genomes of phenotypically normal individuals [4]. The Database of Genomic Variants (DGV), for instance, lists about 60 000 CNVs, 850 inversions and 30 000 indels identified in healthy individuals (http://projects.tcag.ca/variation/; 25th March 2010 update).
The impact of SVs has been demonstrated in a wide range of applications including disease association studies, cancer genomics, and evolutionary studies [5–8]. Copy number changes, especially those involving dosage-sensitive genes, are likely to have phenotypic consequences. For example, initial SV studies successfully identified common CNVs in coding regions associated with several complex disease phenotypes such as autoimmune and infectious disorders [9] as well as those associated with behavioral variation [11]. Small-scale deletion or duplication of conserved regulatory regions can affect the function of cis-regulated genes, leading to a disease phenotype, as shown in the cases of DAX1
] and SOX9
]. In recent large-scale disease association studies, rare but statistically significant SVs were identified for complex diseases such as autism and early-onset obesity [14]. Because the identified SVs encompass or lie near disease-susceptibility genes, these studies not only provide potential disease markers but also elucidate the genetic architecture of genomic disorders and complex disease traits.
Until recently, microarray-based platforms were widely used to identify CNVs. Two pioneering studies used bacterial artificial chromosome (BAC) and oligonucleotide-based microarray comparative genomic hybridization (array-CGH) [16]; the first-generation genome-wide human CNV map was also constructed using these platforms [18]. However, BAC-based approaches cannot detect small CNVs or accurately map the boundaries of CNVs due to the large size of BACs [19]. Even for the newer oligonucleotide-based arrays containing more than 1 million probes, the resolution is still limited to 10–20 kb
]. Array-CGH also has several technical limitations, including intrinsic noise due to cross-hybridization and a limited dynamic range, and it cannot detect dosage-invariant changes such as chromosomal translocations or inversions. As an alternative to hybridization-based methods, Sanger sequencing was also applied to identify genomic variants in normal individual genomes, for example using fosmid libraries [22]. However, the low throughput and high cost of Sanger sequencing imposed severe limitations on the number and size of detected SVs. For example, Kidd et al.
] sequenced about 1 billion base pairs per individual for genome-wide SV discovery and identified about 4000 SVs for eight individuals. This level of sequencing by the Sanger method was too expensive and time-consuming for general use.
Next-generation sequencing (NGS) has enabled cost-effective, high-throughput sequencing [24]. The NGS platforms, first Roche 454 and later Illumina/Solexa Genome Analyzer and Applied Biosystems (ABI) SOLiD, generate orders of magnitude more sequence than the standard gel capillary-based technology. For instance, HiSeq2000, the latest model from Illumina, allows researchers to obtain 30× coverage data for two human genomes in a single run. NGS technology has been employed in all major areas of genetics and genomics. Among the major consortium projects enabled by this technology are the 1000 Genomes Project (http://www.1000genomes.org/), which aims to provide a comprehensive catalog of human genetic variation by sequencing a large number of people, and The Cancer Genome Atlas (TCGA) project (http://cancergenome.nih.gov/), which aims to generate a multi-dimensional genomic characterization of major tumor types.
These NGS platforms have also been used to examine SVs. In a pioneering SV study using NGS, Korbel et al.
] sequenced over 5 billion base pairs from two human genomes using the Roche 454 platform and identified 892 indels, 122 inversions and 283 translocations. Since then, many groups have investigated SVs using NGS data with various algorithms. Most current SV detection algorithms adopt the ‘comparison-versus-reference’ strategy: they first align the short sequencing reads from the genome of interest to a known reference genome and then analyze the mapping signatures that could indicate SVs.
One simple feature to consider is the tag density along the genomic coordinates. Regions with more reads than expected would indicate copy gains in the sequenced genome, and regions with fewer reads than expected would indicate copy losses. The signatures left by dosage-invariant SVs are more complex and generally cannot be detected by the single-end sequencing used for tag counting. In the past year or two, however, paired-end sequencing technology and protocols have become mature enough to be commonly used for detection of SVs. Briefly, paired-end reads (called ‘mate pairs’ when the insert between them is long) are generated by sequencing from both ends of DNA library fragments whose sizes are approximately known (the insert size). Some paired-end sequencing protocols involve circularization of the DNA segments, which can generate paired-end reads with a larger insert size (usually several kilobases). The other approach is to directly sequence both ends of size-selected DNA fragments, generating paired-end reads with a tighter insert size distribution. The advantage of the large insert size in the first technique is that it is better at detecting large SVs. The second technique, on the other hand, provides higher resolution and is more powerful for detecting smaller SVs.
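To make the paired-end idea concrete, the following Python sketch labels a single mapped read pair by comparing it with the expected insert size. It is an illustration under stated assumptions, not the method of any particular tool: the `MappedPair` type, the `classify_pair` function, the three-standard-deviation cutoff, and the convention that a concordant pair maps to opposite strands are all hypothetical choices made here. The span is approximated as the distance between the two mapped start positions, ignoring read length.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MappedPair:
    """Mapping of the two reads of one pair (hypothetical record type)."""
    chrom1: str
    pos1: int
    strand1: str  # '+' or '-'
    chrom2: str
    pos2: int
    strand2: str  # '+' or '-'


def classify_pair(pair, mean_insert, sd_insert, n_sd=3.0):
    """Return the SV type suggested by one read pair, or 'concordant'.

    Assumes a library in which a normal pair maps to the same chromosome,
    on opposite strands, with a span close to mean_insert.
    """
    if pair.chrom1 != pair.chrom2:
        return "translocation"
    if pair.strand1 == pair.strand2:
        return "inversion"
    span = abs(pair.pos2 - pair.pos1)
    # A span much larger than expected on the reference suggests that the
    # sequenced genome lacks material between the reads (a deletion).
    if span > mean_insert + n_sd * sd_insert:
        return "deletion"
    # A span much smaller than expected suggests extra material in the
    # sequenced genome between the reads (an insertion).
    if span < mean_insert - n_sd * sd_insert:
        return "insertion"
    return "concordant"
```

A single discordant pair is only a candidate signal; in practice one would expect several pairs supporting the same event before reporting it. The sketch also shows why a tight insert size distribution helps with small events: the smaller `sd_insert` is, the smaller the deviation that falls outside the concordant window.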
In this article, we review currently available SV-detecting algorithms that utilize NGS data. These algorithms can be classified into two types according to the read-mapping signatures they use: algorithms that search for regions with abnormal tag counts and algorithms that survey the configurations of the paired-end mappings (PEMs). In the following, we discuss these two classes of algorithms in two separate sections. We then address future research directions and conclude the article.
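As a rough illustration of the first class, the Python sketch below counts read start positions in fixed-width bins and flags bins whose count departs from the median. It is a minimal sketch, assuming a single chromosome, 0-based start positions, and arbitrary ratio cutoffs; the function names `bin_counts` and `call_depth_changes` and the thresholds 1.5 and 0.5 are hypothetical, and the sketch omits the statistical modeling and bias correction that a real method would need.

```python
from statistics import median


def bin_counts(read_starts, genome_length, bin_size):
    """Count reads whose 0-based start position falls in each bin."""
    n_bins = (genome_length + bin_size - 1) // bin_size
    counts = [0] * n_bins
    for pos in read_starts:
        if not 0 <= pos < genome_length:
            raise ValueError(f"read start {pos} is outside the sequence")
        counts[pos // bin_size] += 1
    return counts


def call_depth_changes(counts, gain_ratio=1.5, loss_ratio=0.5):
    """Label each bin 'gain', 'loss', or 'normal' relative to the median."""
    expected = median(counts)
    if expected <= 0:
        raise ValueError("median bin count is zero; coverage is too low")
    labels = []
    for count in counts:
        ratio = count / expected
        if ratio >= gain_ratio:
            labels.append("gain")    # more reads than expected
        elif ratio <= loss_ratio:
            labels.append("loss")    # fewer reads than expected
        else:
            labels.append("normal")
    return labels
```

The median is used as the expected count here only because it is insensitive to a few altered bins; with a matched control sample, the per-bin ratio between the two samples would be a natural alternative.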