Nat Genet. Author manuscript; available in PMC Jun 18, 2009.
Published in final edited form as:
PMCID: PMC2698291
UKMSID: UKMS5159
Challenges and standards in integrating surveys of structural variation
Stephen W Scherer, Charles Lee, Ewan Birney, David M Altshuler, Evan E Eichler, Nigel P Carter, Matthew E Hurles, and Lars Feuk
Stephen W. Scherer and Lars Feuk are at The Centre for Applied Genomics and Program in Genetics and Genomic Biology, The Hospital for Sick Children, 14th Floor, Toronto Medical Discovery Tower, MaRS Discovery District, 101 College Street, Room 14-701, Ontario M5G 1L7, Canada. Stephen W. Scherer is in the Department of Molecular and Medical Genetics, University of Toronto, Toronto, Ontario M5G 1L7, Canada. Charles Lee is in the Department of Pathology, Brigham and Women’s Hospital and Harvard Medical School, Boston, Massachusetts 02115, USA. Ewan Birney is at the European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SA, UK.
Nigel P. Carter and Matthew E. Hurles are at the Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SA, UK. David M. Altshuler is in the Program in Medical and Population Genetics, Broad Institute of Harvard University and the Massachusetts Institute of Technology, 7 Cambridge Center, Cambridge, Massachusetts 02142, USA. Evan E. Eichler is in the Department of Genome Sciences and Howard Hughes Medical Institute, University of Washington School of Medicine, Seattle, Washington 98195, USA.
e-mail: steve/at/genet.sickkids.on.ca
Abstract
There has been an explosion of data describing newly recognized structural variants in the human genome. In the flurry of reporting, there has been no standard approach to collecting the data, assessing its quality or describing identified features. This risks becoming a rampant problem, in particular with respect to surveys of copy number variation and their application to disease studies. Here, we consider the challenges in characterizing and documenting genomic structural variants. From this, we derive recommendations for standards to be adopted, with the aim of ensuring the accurate presentation of this form of genetic variation to facilitate ongoing research.
Structural variation in the genome refers to cytogenetically visible and (more commonly) submicroscopic variants, including deletions, insertions, duplications and large-scale copy number variants — collectively termed copy number variations (CNVs) — as well as inversions and translocations (Box 1)1-3. Genome scanning technologies are now commonplace in many laboratories, allowing new structural variation to be recognized from general population surveys4-12 or studies of diseases13-21. In fact, the Database of Genomic Variants4,22 (see list of databases in Table 1) already contains entries (mainly CNVs) covering some 538 Mb (18.8% of the euchromatic genome) derived from the study of fewer than 1,000 genomes from individuals with no obvious disease phenotype.
Box 1 Terminology
Terms that are part of the current vocabulary for structural variation are in bold type below, set into the context of some key definitions and related comments.
Structural variant
Structural variant is the umbrella term for a group of genomic alterations involving segments of DNA typically larger than 1 kb, which can be microscopic or submicroscopic1. We use the term as a neutral descriptor with nothing implied about frequency, association with disease or phenotype, or lack thereof. This definition of size, though perhaps somewhat arbitrary, was adopted to accommodate this significant class of variation that spans the gap between small variants (such as variable number of tandem repeats (VNTRs)) detected with molecular genetic assays and those recognized microscopically on karyotypes. The structural variation may be quantitative (copy number variants comprising deletions, insertions and duplications) and/or positional (translocations) or orientational (inversions).
Copy number variation/variant (CNV)
We use these terms to refer to a DNA segment of at least 1 kb in size, for which copy number differences have been observed in the comparison of two or more genomes. Without further annotation, CNV carries no implication of relative frequency or phenotypic effect. These quantitative variants can be genomic copy number gains (insertions or duplications) or losses (deletions or null genotypes) relative to a designated reference genome sequence. A copy number polymorphism (CNP) is a CNV that occurs in more than 1% of the population.
CNV locus or CNV region (CNVR)
Merging of independently ascertained, but overlapping, genomic segments creates the representation of a CNV locus (that is, a segment at a fixed chromosomal position); the accumulation of data gradually will reveal the true underlying structure of the variant segment. In some cases, this will be a discrete cassette of DNA; in others, it will be a multiplex arrangement of variant units in close proximity, forming a CNV region11. A given variable segment can be detected with multiple clones in a single array or by different arrays in different studies, and its borders gradually fine-tuned with targeted assays. By their very nature, these segments may have different forms among the individuals used for their discovery.
Insertion/deletion (‘indel’)
Indel is a collective abbreviation to describe relative gain or loss of a segment of one or more nucleotides in a genomic sequence. It allows the designation of a difference between genomes in situations where the direction of sequence change cannot be inferred: for example, when a reference or ancestral sequence has not been defined. It has typically been used to denote relatively small-scale variants (particularly those smaller than 1 kb); however, we do not propose any size restriction for its use.
Segmental duplication (also called low-copy repeat (LCR) or duplicon)
A segment of DNA >1 kb in size that occurs in two or more copies per haploid genome, with the different copies sharing >90% sequence identity44,64,65. These segments can also be CNVs. The duplicated blocks predispose to nonallelic homologous recombination.
Human genome reference assembly
The standard reference DNA sequence (or assembly) of the human genome66 that is regularly curated (successive updates named ‘builds’). The assembly is derived mostly (>60%) from the DNA of a bacterial artificial chromosome (BAC) library made from a single donor, with the rest of the sequence originating from a mosaic of other sources. The current assembly covers most of the euchromatic regions of the human genome, but there are still some gaps remaining, and many of these co-locate with segmental duplications and/or CNVs.
Aneuploidy, aneusomy and heteromorphism
These terms have origins in classical cytogenetics and describe structural variants at the largest end of the scale. Aneuploidy is the state of having an abnormal number of chromosomes. Similarly, segmental aneusomy, in reference to a portion of a chromosome, implies abnormality. Heteromorphism (literally, ‘different form’) has come to imply normal variation, or an atypical chromosome form not associated with an abnormal phenotype. Such large-scale variants are often the basis for dysfunction owing to dosage imbalance (such as for segmental aneusomy syndromes67), but may also be part of normal functional variation.
Minor-allele frequency versus altered copy-number frequency
The minor allele is the less common allele at a polymorphic locus. The use of this term is complicated when a locus is multiallelic. Locke et al.9 proposed use of altered copy-number frequency because measurements of copy number are on diploid samples and screening methods do not necessarily distinguish the two independent alleles. Redon et al.11 adopted the convention of assuming that the minor allele is the derived allele; thus, deletions have a minor allele of lower copy number and duplications have a minor allele of higher copy number.
Nonmendelian inheritance (also called mendelian incompatibilities or mendelian inconsistencies)
These terms refer to transmission from parent(s) to offspring in a manner that does not conform to expectations of classical allelic segregation. (Avoid the term ‘mendelian errors’.) Evidence in family studies (‘trios’ in the HapMap data) of apparent nonmendelian inheritance for a genomic segment indicates that copy number variation may be involved7,10.
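The frequency terms in Box 1 can be illustrated with a minimal sketch. It computes an altered copy-number frequency from diploid copy-number measurements and applies the 1% threshold that defines a CNP. The expected copy number of 2, the function names and the cohort values are assumptions made for illustration and are not taken from Locke et al. or Redon et al.; the sketch also treats the sample frequency as a stand-in for the population frequency.

```python
from typing import Sequence

EXPECTED_DIPLOID_COPIES = 2  # assumption: autosomal locus with two copies expected
CNP_THRESHOLD = 0.01         # Box 1: a CNP occurs in more than 1% of the population


def altered_copy_number_frequency(copy_numbers: Sequence[int]) -> float:
    """Fraction of diploid samples whose total copy number differs from 2.

    This counts samples, not alleles, because the measurement does not
    distinguish the two independent alleles of a diploid sample.
    """
    if not copy_numbers:
        raise ValueError("at least one sample is required")
    altered = sum(1 for cn in copy_numbers if cn != EXPECTED_DIPLOID_COPIES)
    return altered / len(copy_numbers)


def is_cnp(frequency: float) -> bool:
    """Apply the 'more than 1%' threshold to an observed frequency."""
    return frequency > CNP_THRESHOLD


# Hypothetical cohort: 190 samples with 2 copies, 7 with a loss, 3 with a gain.
samples = [2] * 190 + [1] * 7 + [3] * 3
freq = altered_copy_number_frequency(samples)
print(f"altered copy-number frequency = {freq:.3f}, CNP = {is_cnp(freq)}")
```

Counting samples rather than alleles avoids having to infer which of the two chromosomes carries the change, which is the difficulty that motivates the term.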
Table 1
Databases
This first round of observations came from several studies, each using a different technology platform and data processing algorithms, with different degrees of pre- and postexperimental standardization and validation. As a result, the data vary in quality and often have both high false-positive and false-negative rates. There is the very real possibility of the entire human genome soon being presented as ‘structurally variant’ in one form or another, based solely on studies of nondisease samples, which would be a distortion. It will be important for all future applications of structural variation information that the scope and detail of variants in the general population be accurately cataloged. In particular, medical genetics research — investigating structural variation profiles in individuals or clinical cohorts — will need a reliable foundation against which to interpret possible pathogenic findings in cytogenomic (Fig. 1), linkage and genome-wide association studies21,23-25.
Figure 1
Lexicon of genomic variation. Descriptors of variation began in the realm of cytogenetics, followed by those from the field of molecular genetics and, most recently, by technologies such as those described in this perspective, which bridge the gap for ...
The field of genomic structural variation, however, is on the cusp of change. Pioneering approaches, often fragmented or fraught with technical limitations, are being supplanted by new technologies that afford much higher resolution screening of the genome at lower cost. We anticipate that, in the next year, the quantity of structural variation data will increase by orders of magnitude owing to microarray-based experiments alone, not to mention the plethora soon to flow from clone-end6,26 or whole-genome sequencing experiments27-30. Many of these studies will survey nondisease samples for structural variation discovery to create control databases. Moreover, in little more than two years from the first description of global CNV distribution4,5, the field is poised to make structural variation analyses standard in the design of all studies of the genetic basis of phenotypic variation. At this inflection point, we examine what is known about genomic structural variation, and consider perspectives and simple standards designed to safeguard integrity and maximize data utility for the immediate future.
Research into structural variation is currently at a state of development comparable to that of the earliest SNP studies. Initiatives to discover and characterize simpler structural variants, such as small insertions, deletions (indels) and balanced inversions, are likely to yield results in proportion to investment, as was the case for SNPs31-33. However, for larger and particularly for more complex structural variants, there are additional confounding factors. To provide a framework for discussion of prospective standards, we group into five categories the major issues currently curbing progress in this field. Data quality, which has impact throughout these other issues, is discussed in the subsequent subsection. The majority of the discussion pertains to the variants classed as CNVs, as these represent the predominant form studied to date. Our comments also mostly target issues related to whole-genome discovery surveys.
Terminology
The newly recognized domain of structural variation is blurring the distinction between traditional cytogenetic and molecular analyses, as it fills the (albeit narrowing) gap between the limits of resolution of these earlier approaches to genetic variation (Fig. 1). Terminology established within each camp is sometimes unwieldy in the crossover (Box 1). Moreover, there is no standard nomenclature for structural variants that fall between those that can be classified by naming systems established from the cytogenetic34,35 or mutation literature36 (for example, indels). For some terms, such as CNV, there is added complication because they are being used regularly as a descriptor in both control and disease studies, but with different meaning. Different classes of CNVs are described in Redon et al.11 and in Supplementary Figure 1 online. Nomenclature for genes encompassed by structural variants also needs to be considered, but no rules have yet been established.
Annotating complex structural variants
Many structural variants are large in size, flanked by or encompassing complex repetitive DNA sequences. They may be unbalanced in content or highly polymorphic, characteristics that pose significant challenges for detection and analysis. There are many complexities associated with classifying and characterizing CNVs (Supplementary Figs. 1, 2 and 3 online). As the precise rearrangement breakpoints are usually not resolved (because of coincidence with large repeats or because of low resolution coverage of assays), it is typically not possible to determine whether the underlying variants are identical by descent or represent independent events in close proximity to one another. Regions of high sequence identity may also cause cross-hybridization on comparative genome hybridization (CGH) platforms, leading to CNV calls in regions that are not actually variable (Supplementary Fig. 3). Determining the meiotic and mitotic characteristics of these variants — such as the de novo mutation rate, stability and level of mosaicism — can also be confounded not only by the complex nature of the underlying sequences but by technical and comparative limitations, including the source of the DNA (described below).
Technological limitations
At present, no single approach identifies all types of structural variation. Current scans of genome-wide structural variation are screening or discovery assays, and not definitive tests. In our hands, the testing of a single sample by different platforms and ‘call’ algorithms can lead to substantially different CNV call rates, owing to differing sensitivity, specificity, probe density and type of probe used (Table 2 and Supplementary Table 1 online). This matter is underscored by the relatively small degree of overlap among published datasets2,37, even when assessing identical samples7,9-11. The progress on CNV discovery to date is largely due to the availability of numerous microarray platforms, which detect quantitative imbalances. In contrast, there is currently no high-throughput, cost-effective method to scan the genome for inversions or translocations. Short of comparing ‘finished’ sequence assemblies from independent sources38,39, it can take a multitude of approaches to identify, validate and sequence the compendium of structural variation comprehensively (Table 3 and Supplementary Table 2 online). Other issues, such as relative costs of arrays and reagents and availability of specialized equipment, often limit access to the most appropriate experiments.
Table 2
Copy number variants called on the same test sample (NA15510) using different experimental platforms and algorithms
Table 3
Summary of 12 published surveys (2004-2007) of structural variation content in human genomes
Characteristics of reference and test samples
Identification of variation requires comparison to either a reference DNA source4,5,11,40,41, a reference dataset11 or a reference genome sequence6,39,42, which has implications for experimental design and interpretation of results43. For example, at present, no standardized ‘reference’ control DNA has been adopted for laboratory experiments, and in some cases, ‘pools’ of samples or datasets are used to represent an averaged genome (Table 2). This lack of standard reference genomes can complicate both the designation of relative copy-number differences among samples from different projects and the standardization of databases (Table 1) that contain information about structural variants. Specifically, if in a single experiment it is impossible to distinguish a loss in the test sample from a gain in the reference sample, then two different studies may report the same CNV as a relative gain or loss (duplication or deletion), respectively. Moreover, using pools of DNA or their intensity outputs as hybridization controls or in comparative intensity analysis (Table 2) may lead to a decreased power to detect variants in highly polymorphic regions of the genome. In these regions, the pool will represent an intermediate between the polymorphic and nonpolymorphic states, resulting in smaller relative difference in intensity than a nonpolymorphic single reference would yield. In terms of annotating variants, the relative nature of CNV determination can pose a problem, as it leads to an overestimation of regions with both apparent gains and losses.
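The loss of power with pooled references can be made concrete with a small worked example. The sketch below assumes an idealized array on which signal intensity is exactly proportional to copy number, a test sample carrying a heterozygous deletion, and a pool in which the deletion allele has a frequency of 0.25. All of these values and function names are hypothetical.

```python
import math


def log2_ratio(test_copies: float, reference_copies: float) -> float:
    """Idealized log2 intensity ratio of test versus reference."""
    return math.log2(test_copies / reference_copies)


def pooled_mean_copies(deletion_allele_freq: float) -> float:
    """Mean diploid copy number in a pool where each allele is either
    present (1 copy) or deleted (0 copies)."""
    return 2.0 * (1.0 - deletion_allele_freq)


test_copies = 1.0  # hypothetical test sample with a heterozygous deletion

# Single reference that does not carry the deletion (2 copies).
single_ref = log2_ratio(test_copies, 2.0)

# Pooled reference in which the deletion allele frequency is 0.25,
# giving a mean of 1.5 copies.
pooled_ref = log2_ratio(test_copies, pooled_mean_copies(0.25))

print(f"single reference: {single_ref:.3f}")
print(f"pooled reference: {pooled_ref:.3f}")
```

Under these assumptions the expected signal shrinks from a log2 ratio of -1 against the single reference to about -0.585 against the pool, so the same deletion sits closer to the noise floor.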
Ultimately, the underlying sequence characteristics of any newly identified structural variant will be compared to the human genome reference assembly. The latest release from the US National Center for Biotechnology Information (NCBI), called Build 36, is a mosaic of some 708 different sources1, and covers mainly the euchromatic portion of the genome, with some 302 known gaps (http://www.ncbi.nlm.nih.gov/). Concomitance of incomplete or falsely merged regions of the reference assembly with the position of structural variants can confound comparisons of one against the other44,45. Moreover, as many technologies use the NCBI reference sequence to guide product development, structural variants residing in the unannotated segments of the human genome may be missed (Supplementary Fig. 2). Test samples can also be from a mix of untransformed or transformed tissues, all impacting on interpretation11,46. Finally, samples used to discover structural variants from control populations may have little or no genetic (for example, parent of origin) information or phenotypic assessment protocols attached to them. So, despite common presumptions, any variant described by such studies is not necessarily either neutral or benign.
Database issues
The main sources of information for human structural variation are the Database of Genomic Variants and the Human Structural Variation Database. Both are currently limited, in that variants are simply represented as they are described in publications and overlaid on the current reference assembly, without precise location of most breakpoints. There are some unpublished data at these sites, but so far there is no active effort to standardize CNV calling or characteristics through reexamination of the original primary data. Moreover, as the human reference assembly is updated in subsequent assemblies, sites of apparent structural variation can disappear and reappear, presenting a challenge for database management. Although Ensembl and UCSC Genome Browser display data from the Database of Genomic Variants, there is currently no standard requirement to submit published structural variants to any database. Further, there is no system for naming structural variants with unique accession numbers, and surprisingly, only a proportion of studies post their raw or underlying data, and full method of interpretation, for public access.
There are also many challenges in the layout and visualization of the data. For example, it is current practice to display structural variants using estimates of start- and end-points when the breakpoint(s) are suboptimally resolved. When there are two or more overlapping variants originating from the same study, they are sometimes grouped together even if they are not identical11, and misgrouping can occur, particularly near segmental duplications. Moreover, as the number of surveys continues to grow, the CNVs discovered will become more redundant.
Presenting structural variation data in relation to the reference assembly can also be problematic1,39 because the standard browsers were not designed to display these data. This issue notwithstanding, smaller variants (usually <10 kb) are present in NCBI’s dbSNP, and a goal of the Human Structural Variation Database is to integrate structural variation data, such as fosmid paired-end sequences6, with the NCBI human reference sequence (including those regions not represented in the current assembly)26. The Database of Genomic Variants will continue to display structural variation data originating from nondisease-defined samples, but stricter criteria for inclusion, as well as assessment and annotation of the quality standards described below, will become critical aspects of the curatorial process.
To assess current practices in collection and validation of discovery data, we review and comment on 12 experimentally diverse and highly cited studies, each undertaken to search for structural variation in the human genome. In Table 3 and Supplementary Table 2, we summarize selected parameters and the strengths and weaknesses of these studies.
Genomes surveyed and reference samples
The number of genomes investigated with each study ranged from one (in sequence comparisons to reference assemblies6,39) to 270 (in three studies of the HapMap collection9-11). Appropriate attention was given to samples being from unrelated individuals or from families, and ethnic diversity was usually noted. Tissue sources of DNA were heterogeneous, and whether or not they were transformed or cultured was inconsistently documented. Phenotypic information would generally have been unknown, or assumed to be unremarkable (from ‘healthy volunteers’), although Iafrate et al. included samples with known karyotypic abnormalities as controls4, and Wong et al. used some material from cancer programs41. Each study used different reference sample(s) for genome comparison. One used pooled DNA4, three compared to the reference human genome assembly6,39,42, one made a variety of comparisons5 and the other CGH approaches each used a different single male reference sample. Future studies will increase the variety of genomes surveyed, and these would benefit from a consensus standard of documented information about their sources. In contrast, a smaller number of reference sequences would facilitate the process of collective documentation.
Primary discovery methods
Table 3 is organized according to the methods used to search for structural variants. The upper portion includes seven studies that employed CGH, each with a different array platform, encompassing a range of probe size, complexity and resolution. One approach9,40 targeted regions associated with segmental duplications, but the rest spanned the genome, with arrays carrying from 2,000 up to about 26,000 clones in genome tiling-path arrays11,41. Redon et al.11 added a second complementary screening strategy based on relative fluorescence intensities with arrays designed originally for SNP genotyping. The lower portion of Table 3 summarizes five studies with completely different strategies, based on genomic sequence comparisons. These studies used existing data from either the reference human genome sequence6,39,42 or the HapMap project7,10 to mine for deletions and other relatively small structural rearrangements. The fosmid-based approach6 and sequence comparison39 were able to discern orientational as well as quantitative variants.
Experimental quality controls
Before structural variants can be revealed by genome comparisons, positive data arising from other biological or technical causes need to be filtered. Biological differences that were variously accounted for among these studies include (i) male-female X and Y chromosome dosage differences9,11,40, (ii) somatic rearrangements of the immunoglobulin genes5,11, (iii) cell-culture artifacts such as mosaic trisomies46 and (iv) results of genomic instability of virus-transformed cell lines11. Similarly, any variation relative to a reference human genome sequence in the computational approaches must be interpreted in light of the known gaps and potential assembly artifacts1,6,39.
As these screening strategies are themselves biological, with associated technical artifacts, replication is the most important experimental tool for assessing the validity of observations, and it took many forms among these studies. Within each CGH array, clones were typically in duplicate or triplicate. Interexperimental replication involved ostensibly the same conditions and/or an experimental alternate, such as ‘dye-swap’ of the two fluorochrome labels between the test and reference samples. The means of dealing with discordant replicates was inconsistent among the studies, and sometimes difficult to discern from the publications. In most studies4,9,11,40, discordant dye-swap results were eliminated, but in Wong et al.41, only 20% of samples were assayed in both orientations. Within each study, experiments also showed variable background ‘noise’, and some studies repeated and/or deleted individual assays that did not meet a defined quality threshold. When sources of ‘noise’ are nonrandom, replication alone will reproducibly yield false positive calls, which argues for replication by diverse methods.
Other controls showed the effectiveness of the respective screening methods. Self-versus-self hybridization was used4,5,9,40 to estimate somatic effects and/or numbers of false positive calls. Two studies assayed samples with previously characterized imbalances4,40. Sharp et al.40 showed the enhanced (11-fold) effectiveness of their targeted ‘hot spot’ array relative to a genome-wide assay. Redon et al.11 evaluated concordance between their two primary platforms and undertook numerous technical replicates.
Each study defined its own algorithm for ‘calling’ differences between sample and reference as putative structural variants. As for all screening assays, they were driven to optimize both sensitivity and specificity of the ascertainment, but approaches to this balance differed. Redon et al.11 set parameters in their algorithm to allow fewer than 5% false positive ‘calls’ per experiment. Other studies set thresholds and assessed numbers of false positives retrospectively. Some reported these type I errors in relation to the number of clones in the array4,40,41 and others relative to the proportion of positive calls5,7, prohibiting a direct comparison of specificity among the various studies. Sensitivity was harder to assess, and arguably impossible without knowledge of the true (or at least gold standard-based data) underlying numbers of structural variants. Estimates ranged from 5% false negatives9 to 50% power to detect 25-kb deletions7, but sensitivity was generally compromised in favor of specificity.
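A short calculation shows why the two reporting conventions cannot be compared directly. The counts below are invented for illustration and do not come from any of the studies discussed; the function names are likewise hypothetical.

```python
def false_positives_per_clone(false_positives: int, clones_on_array: int) -> float:
    """False positives expressed relative to the number of clones on the array."""
    return false_positives / clones_on_array


def false_positives_per_call(false_positives: int, positive_calls: int) -> float:
    """False positives expressed relative to the number of positive calls."""
    return false_positives / positive_calls


# Hypothetical experiment: 2,000 clones, 100 positive calls, 10 of them false.
fp, clones, calls = 10, 2000, 100

per_clone = false_positives_per_clone(fp, clones)
per_call = false_positives_per_call(fp, calls)

print(f"{per_clone:.1%} of clones, {per_call:.1%} of calls")
```

The same ten false calls appear as 0.5% under one convention and 10% under the other, so a reported error rate is interpretable only when its denominator is stated.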
Structural variants identified
Assay design had a strong impact on the type and size of structural variants detected (Fig. 1, Supplementary Fig. 2 and Table 2). All revealed quantitative variation (gains or losses), but three recognized only deletions7,8,10, and two could also detect evidence of inversions6,39. Sizes of variant segments could be as small as 1 bp with computational alignments39,42 (though many of these were smaller than our defining size threshold of 1 kb1). Small deletions were detected through haploid hybridization (70 bp-10 kb)8 or oligonucleotide (SNP) footprints (1-404 kb)7 (1-745 kb)10, and the fosmid approach revealed variants in the range of library inserts (40 kb)6. Array methods approached the larger end of the spectrum for CNVs (collectively, about 50 kb-1 Mb)4,5,9,11,40,41. BAC clone probes tend to initially overestimate the apparent size of variants, as the clones may be large relative to the variant segment(s) they harbor, and the more sensitive the platform, the greater the overestimation11,47. Oligonucleotide arrays, on the other hand, approach the boundaries of variable segments from within, and should provide more accurate size estimates as long as the region has sufficient probe density.
The architecture of a variant region can influence its apparent size. Independently discrete genomic segments whose borders overlap can form a variable region characterized as much larger than its component variants, or containing complex rearrangements of smaller independently variable elements (Supplementary Figs. 1 and 3). As a result, the definitions of overlap, variants, variant regions, merged variants, locations and so forth have been discretionary and varied. The field is probably ready for functional consensus in this area.
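The way overlapping calls inflate the apparent size of a region can be sketched with a simple interval merge. This is one possible merging rule, not the procedure used by any of the studies or databases discussed here: it assumes that any shared or touching coordinate counts as overlap, and the segment coordinates are hypothetical.

```python
from typing import List, NamedTuple


class Segment(NamedTuple):
    chrom: str
    start: int  # first base of the call
    end: int    # last base of the call (inclusive)


def merge_overlapping(segments: List[Segment]) -> List[Segment]:
    """Merge calls on the same chromosome whose coordinates overlap or touch."""
    merged: List[Segment] = []
    # Sorting a NamedTuple orders by chromosome name, then start, then end.
    for seg in sorted(segments):
        last = merged[-1] if merged else None
        if last is not None and last.chrom == seg.chrom and seg.start <= last.end:
            # Extend the current region instead of starting a new one.
            merged[-1] = Segment(last.chrom, last.start, max(last.end, seg.end))
        else:
            merged.append(seg)
    return merged


# Hypothetical calls: three chained calls on chr1 and one isolated call on chr2.
calls = [
    Segment("chr1", 4000, 9000),
    Segment("chr1", 1000, 5000),
    Segment("chr1", 8500, 12000),
    Segment("chr2", 100, 200),
]
regions = merge_overlapping(calls)
for region in regions:
    print(region.chrom, region.start, region.end, region.end - region.start + 1)
```

In this example three calls of roughly 3.5 to 5 kb chain into a single 11-kb region, even though the first and third calls do not overlap each other. A stricter rule, such as a minimum reciprocal overlap, would give a different answer, which is why the choice of definition needs to be reported.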
The earliest surveys reported about 100 variants or regions4,5; more recently, Wong et al. reported a disproportionate 3,654 CNVs, from which only 800 were considered ‘high frequency’ and more likely to be true positives41. Sequence comparisons flagged many more thousands of sites39,42, albeit ones that were much smaller and often reflected sequence assembly artifacts. Each of the 12 studies in Table 3 added a majority of apparently new variant loci, though as the catalog of genomic structural variants accumulates, the number of such new additions will eventually plateau.
Validation of putative structural variants
We reemphasize that the discovery strategies in Table 3 are screening tests, which draw attention to genome segments with an increased probability of harboring true structural variation. Eventually, comprehensive sequence data will document the breadth and detail of each variable region and individual variant, as illustrated by fosmid insert sequence data6 and direct sequence assembly comparisons39. In the meantime, various validation strategies have been applied to subsets of putative variants in each of the discovery reports. These included (i) FISH of metaphase, interphase or fiber chromosomes using various clones or PCR-amplified molecules; (ii) PCR or quantitative PCR (qPCR) for allele loss or quantitative variation; (iii) multiple ascertainment, whereby considerable weight was given to whether or not a putative variant was seen in more than one individual or had been reported in previous studies; (iv) array CGH to validate computational screening results6,7 or for finer resolution of BAC-screening results by oligonucleotide arrays9,41; (v) sequence analysis of fosmid inserts to confirm calls and to assess some discordant ones6,9; (vi) allele-specific fluorescence intensities10 and (vii) familial clustering41.
These assays were variously applied to subsets of data, and outcomes were used effectively in some studies7,10,11 to further evaluate the sensitivity and specificity and/or error rates of the primary screening methods. The proportion of putative variant loci that have been individually validated by means other than multiple ascertainments remains small, presumably due to the technical challenges of the confirmatory tests. All studies provided some information about the frequency of each putative structural variant or region, both as an argument for validation and to characterize the findings. A growing consensus in the field is for more validation of variants using two or more technologies.
Based on our enumeration of the challenges facing this new field and a thorough review of published experimental designs, we provide four broad guidelines that follow the natural progression of experimentation as an initial step toward the development of standards. As the field matures, these guidelines should serve as precursors to stricter standards that undergo regular and comprehensive vetting by the community48. We are struck by the resemblance to issues raised by the MIAME (minimum information about a microarray experiment) standards49, as well as by Lander and Kruglyak50, with recommendations to find the right balance of stringency and value judgment to avoid as much error as possible without delaying discovery. The latter paper’s recommendations for modifiers (suggestive, significant, highly significant and confirmed) might well be adapted for the statistical annotation of structural variants in databases.
In their current form, the recommended standards could also serve as a checklist for reviewers and editors as they assess manuscripts that report structural variation data. Moreover, as more structural variation data are reported and the nature of the variants becomes better understood, curators of databases would be at greater liberty to accept or reject complete or partial datasets according to established quality thresholds.
1. Describing the sample
The study should report the origin of each sample (for example, new or from a repository) and all of its characteristics, including the source (for example, blood, cell line, tissue) and karyotypic status, as well as the age, sex, ethnicity and phenotype (disease or nondisease features) of the donor. For surveys aiming to capture structural variation from the general population for control databases, there should be particular emphasis on detailing the extent of phenotype investigation. The study should also accurately document the genetic relationship of samples and any manipulation of the samples such as cell-culturing conditions or whole genome amplification, including protocols for extracting and labeling samples. Previous publications using the sample and all associated aliases should be listed.
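One way to make this checklist machine-checkable is a simple record type with one field per item listed above. The sketch below is an assumption about how such a record might look; the field names and the example values are hypothetical and do not correspond to any existing database schema.

```python
from dataclasses import dataclass, field, fields
from typing import List, Optional


@dataclass
class SampleRecord:
    """One sample, with a field for each descriptive item in the guideline."""
    sample_id: str
    origin: Optional[str] = None           # e.g. "new" or "repository"
    source: Optional[str] = None           # e.g. "blood", "cell line", "tissue"
    karyotype: Optional[str] = None
    donor_age: Optional[str] = None
    donor_sex: Optional[str] = None
    donor_ethnicity: Optional[str] = None
    donor_phenotype: Optional[str] = None  # disease or nondisease features
    relationships: Optional[str] = None    # genetic relationship to other samples
    manipulations: Optional[str] = None    # culturing, amplification, labeling
    prior_publications: List[str] = field(default_factory=list)
    aliases: List[str] = field(default_factory=list)


def missing_fields(record):
    """Return names of descriptive fields still unset (None).

    Empty lists are treated as a valid answer ("none known"), so the two
    list fields are never reported as missing.
    """
    return [f.name for f in fields(record) if getattr(record, f.name) is None]


# Hypothetical, partially described sample.
record = SampleRecord(sample_id="SAMPLE-001", origin="repository", source="cell line")
todo = missing_fields(record)
```

A reviewer or curator could use the returned list to see at a glance which items a submission has not yet reported.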
2. Reporting experiments
Upon publication, the researchers must declare all aspects of the experimental design and results, including the experimental platform (for example, all clone or sequence identifiers used in arrays), technical procedures, data extraction and processing protocols, the version of the reference genome sequence used for comparison or annotation, and all validation results. The information must be made available in a format that enables unambiguous interpretation and replication of the experiment, and that allows other researchers to reanalyze the data to verify the conclusions48,49. For example, many array CGH experiments are performed using different test and reference samples, a variable number of spot replicates and differential use of dye-swap replicates. These methodological details affect the interpretation of the data and inferences regarding the presence or absence of a particular structural variant. Most new structural variation data are being generated using microarrays; therefore, suitable repositories include the Gene Expression Omnibus (GEO)51, ArrayExpress52 and CIBEX53 databases. As more sequence data emerge in structural-variation discovery initiatives, it is important that the underlying sequences and traces be made publicly available. Similarly, methodological differences exist in alignment algorithms; in addition to simple lists of sequence differences between assemblies or traces, the underlying alignments from which these events were called should be available.
3. Quality control
All studies should apply stringent criteria to ensure an accurate empirical estimation of the performance of the detection protocol used. Ideally, the parameters of the detection should be calibrated using a limited set of test data to achieve an acceptably low rate of false positives among the regions that are called. There are several metrics for this estimation, for example, the false discovery rate54. Parameters should be set to maximize screening specificity (minimize false positive calls) without undue compromise to sensitivity. To simplify this process, we recommend that all studies include at least one (and preferably more) standard control sample to be used as a reference for comparison. Initially, we propose sample NA15510 from the US National Institute of General Medical Sciences (NIGMS) Human Genetic Cell Repository, as it has already been characterized using a number of platforms (Table 2), and is also now being sequenced. A second reference sample could be NA10851, as it has also been characterized extensively11.
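The calibration step can be sketched as follows, assuming that each call on a test dataset carries a confidence score and that its true status is known from independent validation. The function picks the most permissive score threshold at which the empirical false discovery rate among retained calls stays within a chosen limit. This is a direct empirical estimate on labeled test data, not the Benjamini-Hochberg procedure of ref. 54, and the function name and example scores are hypothetical.

```python
def calibrate_threshold(scored_calls, max_fdr):
    """Pick a score threshold from labeled test calls.

    scored_calls: iterable of (score, is_true) pairs, where is_true says
        whether independent validation confirmed the call.
    max_fdr: highest acceptable fraction of false calls among retained calls.

    Returns the lowest score t such that keeping all calls with
    score >= t gives an empirical FDR <= max_fdr, or None if no
    threshold qualifies.
    """
    ranked = sorted(scored_calls, key=lambda call: call[0], reverse=True)
    best = None
    false_pos = 0
    for kept, (score, is_true) in enumerate(ranked, start=1):
        if not is_true:
            false_pos += 1
        # Calls with tied scores are kept or dropped together, so only
        # evaluate the FDR at the last call of each tie group.
        if kept < len(ranked) and ranked[kept][0] == score:
            continue
        if false_pos / kept <= max_fdr:
            best = score
    return best


# Hypothetical scores and validation outcomes, for illustration only.
test_calls = [
    (0.9, True), (0.8, True), (0.7, False), (0.6, True),
    (0.5, True), (0.4, False), (0.3, False),
]
threshold = calibrate_threshold(test_calls, max_fdr=0.25)
```

Lowering `max_fdr` raises specificity at the cost of sensitivity, which is the trade-off described above.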
In addition to calibrating the parameters used for CNV calling, the quality of the total set of variants called across the entire sample set should be assessed. This requires unbiased sampling of the putative variants to be validated: that is, not just assessing those called most frequently, but ensuring representation of the entire frequency distribution. Good examples from the different experimental approaches outlined in Table 3 include validation of singleton and nonsingleton error rates11, estimation of fosmid read-pair error rates by sequencing the fosmid6 and estimation of error rates using a secondary technology such as oligonucleotide arrays7. It should no longer be considered sufficient to estimate the error rates by extrapolating from self-self experiments, without confirming that the estimated error rates were indeed correct and investigating how individual experimental error rates translate into study-wide error rates.
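Unbiased sampling across the frequency distribution can be approximated by stratifying putative variants by the number of individuals in which they were called and drawing a fixed number from each stratum. The sketch below assumes that simple design; the bin boundaries, function name and variant identifiers are hypothetical.

```python
import bisect
import random


def stratified_validation_sample(call_counts, bin_lower_bounds, per_bin, seed=0):
    """Draw variants for validation from every frequency stratum.

    call_counts: dict mapping variant id -> number of individuals carrying it.
    bin_lower_bounds: ascending lower bounds of the strata, e.g. [1, 2, 5, 20]
        means 1, 2-4, 5-19 and 20 or more carriers.
    per_bin: maximum number of variants to draw from each stratum.

    Returns a dict mapping each lower bound to a sorted list of sampled ids.
    """
    rng = random.Random(seed)  # fixed seed so the selection is reproducible
    bins = {lower: [] for lower in bin_lower_bounds}
    for variant_id in sorted(call_counts):
        n_carriers = call_counts[variant_id]
        idx = bisect.bisect_right(bin_lower_bounds, n_carriers) - 1
        if idx < 0:
            raise ValueError(f"{variant_id}: count {n_carriers} is below the first bin")
        bins[bin_lower_bounds[idx]].append(variant_id)
    return {
        lower: sorted(rng.sample(members, min(per_bin, len(members))))
        for lower, members in bins.items()
    }


# Hypothetical call counts: three singletons and four recurrent variants.
counts = {"a": 1, "b": 1, "c": 1, "d": 3, "e": 4, "f": 10, "g": 50}
selection = stratified_validation_sample(counts, [1, 2, 5, 20], per_bin=2)
```

Because singletons form their own stratum, they are always represented, so singleton and nonsingleton error rates can be estimated separately.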
4. Describing structural variants
The study should thoroughly report characteristics of the structural variants, including sequence content (start and end points or complete sequence content with appropriate annotation), and population frequency and distribution (if known), including samples and assays used to determine these parameters. A future challenge will be to develop standards for defining CNV regions (CNVRs), that is, for merging data from different individuals and different surveys into a single set of CNVRs. Ideally, each ‘called’ CNVR would have an audit trail of both the experimental data and the processing of the data to the final call. Robust documentation of standardized CNVRs in databases will require specific rules to be established; although their description is beyond the scope of this Perspective, we expect that it will stimulate future discussion. For CNVs and CNVRs, the definitions and criteria used by Redon et al.11 offer a good framework to build on (also see Supplementary Fig. 1). The current limitations in breakpoint resolution make it difficult to assign specific accession numbers to CNVs. However, once structural variants are described with boundaries mapped at nucleotide resolution, identifiers should be assigned using a nomenclature similar to that currently used for SNPs.
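To make the idea of a CNVR with an audit trail concrete, the sketch below merges overlapping CNV calls on the same chromosome into regions and keeps every contributing call attached to its region. The union-of-overlaps rule is a deliberately simple assumption and is not presented as the criteria of Redon et al.; the class names, coordinates and study labels are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class CNVCall:
    chrom: str
    start: int   # inclusive start coordinate
    end: int     # inclusive end coordinate
    sample: str
    study: str


@dataclass
class CNVRegion:
    chrom: str
    start: int
    end: int
    # Audit trail: every call that contributed to this region.
    supporting_calls: List[CNVCall] = field(default_factory=list)


def merge_into_cnvrs(calls):
    """Merge overlapping calls per chromosome into regions (simple union)."""
    regions = []
    for call in sorted(calls, key=lambda c: (c.chrom, c.start, c.end)):
        last = regions[-1] if regions else None
        if last and last.chrom == call.chrom and call.start <= last.end:
            # Overlaps the current region: extend it and record the call.
            last.end = max(last.end, call.end)
            last.supporting_calls.append(call)
        else:
            regions.append(CNVRegion(call.chrom, call.start, call.end, [call]))
    return regions


# Hypothetical calls from two surveys, for illustration only.
calls = [
    CNVCall("chr1", 100, 200, "NA15510", "study_A"),
    CNVCall("chr1", 150, 300, "NA10851", "study_B"),
    CNVCall("chr1", 400, 500, "NA15510", "study_A"),
    CNVCall("chr2", 100, 200, "NA10851", "study_B"),
]
cnvrs = merge_into_cnvrs(calls)
```

A production rule set would also need to handle minimum overlap fractions, breakpoint uncertainty and differing reference genome versions, which this sketch ignores.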
Many of the issues confronting the field of structural variation will be resolved as advances in technology allow robust and economical analysis of structural variants at the nucleotide level in multiple genomes. Such techniques will include ‘tiling path’-coverage oligonucleotide arrays, paired-end sequence relationship comparisons, and partial or complete sequence assembly comparisons. The ultimate standard will be sequence resolution of all structural variation in a defined set of reference individuals to establish a benchmark for genotyping platforms. We do not foresee that any one approach will capture all genetic variation reliably, nor, for at least a few more years, will a single strategy predominate over microarray-based approaches. Therefore, the main challenges from this point onward will surely include managing a huge data volume, integrating information from various discovery platforms and discerning phenotypic implications. New issues will arise, such as how to best annotate structural variation data in individual diploid genome assemblies (arising from personalized sequencing projects), as well as how to put haplotypes of structural variants (with or without SNPs) into context with respect to the latest human reference sequence. Structural variation data should also assist SNP, linkage disequilibrium and gene expression determination, but new database tools will be required to fully interpret the data.
Structural variation discoveries offer the potential to bridge a long-standing gap between cytogenetic and sequence-based investigations, and unify our understanding of genetic variation. Interestingly, at the outset of writing, we tried to sidestep the topic of terminology (and nomenclature), but kept returning to it in one way or another as we worked to define and distill the breadth of issues before us. In fact, it was the issue of terminology that highlighted the extreme heterogeneity in data being published, with the related strengths, caveats and differences in the studies being attributable in part to the different backgrounds of the researchers involved.
An equally intricate issue for data integration in the future will be categorizing structural variants in terms of whether they are ‘normal’, ‘disease-causing’ or ‘phenotype-associated’, as these designations can be part of a continuous range1,24,55,56. In Table 4, we put forward ideas of annotation modifiers that will assist in maximizing the utility of structural variation information. Molecular cytogeneticists have always been faced with this dilemma and its particular implications in the prenatal or diagnostic setting. Now, with the ability to readily recognize submicroscopic and sequence-level variation, the question of how to differentiate benign and disease-associated structural changes will be increasingly important. There are already well defined examples in which the presence of a structural variant correlates directly with a syndrome or phenotype, such as the many dosage-related microdeletions and duplications that cause genomic disorders57-63 (also see the DECIPHER database). Family-based studies can demonstrate whether a change is de novo or has been inherited and, in the latter case, whether there are likely to be associated phenotypic consequences (noting there are numerous examples of variable expression of phenotype and disease in inherited chromosomal rearrangements)1,21,55. Otherwise, large population studies and control and disease reference databases will provide the best source of information about a structural variant’s frequency and likelihood of causing a phenotypic outcome.
Table 4
Classification of modifiers used for the description of structural variation
Notwithstanding the challenges, we believe that the recommendations presented here offer necessary first steps toward standardization of many of the variables that, if ignored, will impede progress. At the same time, we recognize that consensus is important, and that standards require time to mature before adoption and implementation48. With some ground rules now set, it is also our intention to continue discussions with the genomic structural variation research community at the most relevant meeting opportunities.
Supplementary Material
Supplementary Fig. 1
Supplementary Fig. 2
Supplementary Fig. 3
Supplementary Table 1
Supplementary Table 2
ACKNOWLEDGMENTS
We thank Dr. Janet Buchanan for assistance in manuscript preparation and D. Pinto, C. Marshall, R. Redon, I. Ragoussis and A. Carson for sharing ideas and unpublished data. The work is supported by Genome Canada/Ontario Genomics Institute, The Centre for Applied Genomics, the Canadian Institutes of Health Research (CIHR), the McLaughlin Centre for Molecular Medicine, the Canadian Institute of Advanced Research and the Hospital for Sick Children Foundation. M.E.H. and N.P.C. are supported by the Wellcome Trust. L.F. is supported by CIHR and S.W.S. is an Investigator of CIHR and holds the GlaxoSmithKline/CIHR Pathfinder Chair in Genetics and Genomics at the Hospital for Sick Children and the University of Toronto.
Footnotes
COMPETING INTERESTS STATEMENT
The authors declare no competing financial interests.
Note: Supplementary information is available on the Nature Genetics website.
1. Feuk L, Carson AR, Scherer SW. Structural variation in the human genome. Nat. Rev. Genet. 2006;7:85–97. [PubMed]
2. Freeman JL, et al. Copy number variation: new insights in genome diversity. Genome Res. 2006;16:949–961. [PubMed]
3. Sharp AJ, Cheng Z, Eichler EE. Structural variation of the human genome. Annu. Rev. Genomics Hum. Genet. 2006;7:407–442. [PubMed]
4. Iafrate AJ, et al. Detection of large-scale variation in the human genome. Nat. Genet. 2004;36:949–951. [PubMed]
5. Sebat J, et al. Large-scale copy number polymorphism in the human genome. Science. 2004;305:525–528. [PubMed]
6. Tuzun E, et al. Fine-scale structural variation of the human genome. Nat. Genet. 2005;37:727–732. [PubMed]
7. Conrad DF, Andrews TD, Carter NP, Hurles ME, Pritchard JK. A high-resolution survey of deletion polymorphism in the human genome. Nat. Genet. 2006;38:75–81. [PubMed]
8. Hinds DA, Kloek AP, Jen M, Chen X, Frazer KA. Common deletions and SNPs are in linkage disequilibrium in the human genome. Nat. Genet. 2006;38:82–85. [PubMed]
9. Locke DP, et al. Linkage disequilibrium and heritability of copy-number polymorphisms within duplicated regions of the human genome. Am. J. Hum. Genet. 2006;79:275–290. [PubMed]
10. McCarroll SA, et al. Common deletion polymorphisms in the human genome. Nat. Genet. 2006;38:86–92. [PubMed]
11. Redon R, et al. Global variation in copy number in the human genome. Nature. 2006;444:444–454. [PMC free article] [PubMed]
12. Simon-Sanchez J, et al. Genome-wide SNP assay reveals structural genomic variation, extended homozygosity and cell-line induced alterations in normal individuals. Hum. Mol. Genet. 2007;16:1–14. [PubMed]
13. Vissers LE, et al. Array-based comparative genomic hybridization for the genomewide detection of submicroscopic chromosomal abnormalities. Am. J. Hum. Genet. 2003;73:1261–1270. [PubMed]
14. Locke DP, et al. BAC microarray analysis of 15q11-q13 rearrangements and the impact of segmental duplications. J. Med. Genet. 2004;41:175–182. [PMC free article] [PubMed]
15. Shaw-Smith C, et al. Microarray based comparative genomic hybridisation (array-CGH) detects submicroscopic chromosomal deletions and duplications in patients with learning disability/mental retardation and dysmorphic features. J. Med. Genet. 2004;41:241–248. [PMC free article] [PubMed]
16. de Vries BB, et al. Diagnostic genome profiling in mental retardation. Am. J. Hum. Genet. 2005;77:606–616. [PubMed]
17. Koolen DA, et al. A new chromosome 17q21.31 microdeletion syndrome associated with a common inversion polymorphism. Nat. Genet. 2006;38:999–1001. [PubMed]
18. Sharp AJ, et al. Discovery of previously unidentified genomic disorders from the duplication architecture of the human genome. Nat. Genet. 2006;38:1038–1042. [PubMed]
19. Shaw-Smith C, et al. Microdeletion encompassing MAPT at chromosome 17q21.3 is associated with developmental delay and learning disability. Nat. Genet. 2006;38:1032–1037. [PubMed]
20. Urban AE, et al. High-resolution mapping of DNA copy alterations in human chromosome 22 using high-density tiling oligonucleotide arrays. Proc. Natl. Acad. Sci. USA. 2006;103:4534–4539. [PubMed]
21. Szatmari P, et al. Mapping autism risk loci using genetic linkage and chromosomal rearrangements. Nat. Genet. 2007;39:319–328. [PubMed]
22. Zhang J, Feuk L, Duggan GE, Khaja R, Scherer SW. Development of bioinformatics resources for display and analysis of copy number and other structural variants in the human genome. Cytogenet. Genome Res. 2006;115:205–214. [PubMed]
23. Cooper GM, Nickerson DA, Eichler EE. Mutational and selective effects on copy-number variants in the human genome. Nat. Genet. 2007;39:S22–S29. [PubMed]
24. Lee C, Iafrate AJ, Brothman AR. Copy number variations and clinical cytogenetic diagnosis of constitutional disorders. Nat. Genet. 2007;39:S48–S54. [PubMed]
25. McCarroll SA, Altshuler DM. Copy-number variation and association studies of human disease. Nat. Genet. 2007;39:S37–S42. [PubMed]
26. Eichler EE, et al. Completing the map of human genetic variation. Nature. 2007;447:161–165. [PMC free article] [PubMed]
27. Shendure J, Mitra RD, Varma C, Church GM. Advanced sequencing technologies: methods and goals. Nat. Rev. Genet. 2004;5:335–344. [PubMed]
28. Bennett ST, Barnes C, Cox A, Davies L, Brown C. Toward the $1,000 human genome. Pharmacogenomics. 2005;6:373–382. [PubMed]
29. Bentley DR. Whole-genome re-sequencing. Curr. Opin. Genet. Dev. 2006;16:545–552. [PubMed]
30. Service RF. Gene sequencing. The race for the $1000 genome. Science. 2006;311:1544–1546. [PubMed]
31. Altshuler D, et al. An SNP map of the human genome generated by reduced representation shotgun sequencing. Nature. 2000;407:513–516. [PubMed]
32. Mullikin JC, et al. An SNP map of human chromosome 22. Nature. 2000;407:516–520. [PubMed]
33. Sachidanandam R, et al. A map of human genome sequence variation containing 1.42 million single nucleotide polymorphisms. Nature. 2001;409:928–933. [PubMed]
34. Report of the Standing Committee on Human Cytogenetic Nomenclature, ISCN 1985. An International System for Human Cytogenetic Nomenclature. Birth Defects Orig. Artic. Ser. 1985;21:1–117. [PubMed]
35. Heim S. Genetic nomenclature: ISCN and ISGN. Pediatr. Hematol. Oncol. 1996;13:iii. [PubMed]
36. den Dunnen JT, Antonarakis SE. Mutation nomenclature extensions and suggestions to describe complex mutations: a discussion. Hum. Mutat. 2000;15:7–12. [PubMed]
37. Eichler EE. Widening the spectrum of human genetic variation. Nat. Genet. 2006;38:9–11. [PubMed]
38. Istrail S, et al. Whole-genome shotgun assembly and comparison of human genome assemblies. Proc. Natl. Acad. Sci. USA. 2004;101:1916–1921. [PubMed]
39. Khaja R, et al. Genome assembly comparison identifies structural variants in the human genome. Nat. Genet. 2006;38:1413–1418. [PMC free article] [PubMed]
40. Sharp AJ, et al. Segmental duplications and copy-number variation in the human genome. Am. J. Hum. Genet. 2005;77:78–88. [PubMed]
41. Wong KK, et al. A comprehensive analysis of common copy-number variations in the human genome. Am. J. Hum. Genet. 2007;80:91–104. [PubMed]
42. Mills RE, et al. An initial map of insertion and deletion (INDEL) variation in the human genome. Genome Res. 2006;16:1182–1190. [PubMed]
43. Carter NP. Methods and strategies for analyzing copy number variation using DNA microarrays. Nat. Genet. 2007;39:S16–S21. [PMC free article] [PubMed]
44. Cheung J, et al. Genome-wide detection of segmental duplications and potential assembly errors in the human genome sequence. Genome Biol. 2003;4:R25. [PMC free article] [PubMed]
45. Bailey JA, Eichler EE. Primate segmental duplications: crucibles of evolution, diversity and disease. Nat. Rev. Genet. 2006;7:552–564. [PubMed]
46. Risin S, Hopwood VL, Pathak S. Trisomy 12 in Epstein-Barr virus-transformed lymphoblastoid cell lines of normal individuals and patients with nonhematologic malignancies. Cancer Genet. Cytogenet. 1992;60:164–169. [PubMed]
47. Carson AR, Feuk L, Mohammed M, Scherer SW. Strategies for the detection of copy number and other structural variants in the human genome. Hum. Genomics. 2006;2:403–414. [PMC free article] [PubMed]
48. Burgoon LD. The need for standards, not guidelines, in biological data reporting and sharing. Nat. Biotechnol. 2006;24:1369–1373. [PubMed]
49. Brazma A, et al. Minimum information about a microarray experiment (MIAME)—toward standards for microarray data. Nat. Genet. 2001;29:365–371. [PubMed]
50. Lander E, Kruglyak L. Genetic dissection of complex traits: guidelines for interpreting and reporting linkage results. Nat. Genet. 1995;11:241–247. [PubMed]
51. Barrett T, et al. NCBI GEO: mining tens of millions of expression profiles-database and tools update. Nucleic Acids Res. 2007;35:D760–D765. [PubMed]
52. Parkinson H, et al. ArrayExpress-a public database of microarray experiments and gene expression profiles. Nucleic Acids Res. 2007;35:D747–D750. [PubMed]
53. Ikeo K, Ishi-i J, Tamura T, Gojobori T, Tateno Y. CIBEX: center for information biology gene expression database. C. R. Biol. 2003;326:1079–1082. [PubMed]
54. Benjamini Y, Hochberg Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J. R. Stat. Soc. Ser. B (Methodological) 1995;57:289–300.
55. Feuk L, Marshall CR, Wintle RF, Scherer SW. Structural variants: changing the landscape of chromosomes and design of disease studies. Hum. Mol. Genet. 2006;15(special no. 1):R57–R66. [PubMed]
56. Lee JA, Lupski JR. Genomic rearrangements and gene copy-number alterations as a cause of nervous system disorders. Neuron. 2006;52:103–121. [PubMed]
57. Lupski JR, et al. DNA duplication associated with Charcot-Marie-Tooth disease type 1A. Cell. 1991;66:219–232. [PubMed]
58. Ewart AK, et al. Hemizygosity at the elastin locus in a developmental disorder, Williams syndrome. Nat. Genet. 1993;5:11–16. [PubMed]
59. Chance PF, et al. Two autosomal dominant neuropathies result from reciprocal DNA duplication/deletion of a region on chromosome 17. Hum. Mol. Genet. 1994;3:223–228. [PubMed]
60. Chen KS, et al. Homologous recombination of a flanking repeat gene cluster is a mechanism for a common contiguous gene deletion syndrome. Nat. Genet. 1997;17:154–163. [PubMed]
61. Small K, Iber J, Warren ST. Emerin deletion reveals a common X-chromosome inversion mediated by inverted repeats. Nat. Genet. 1997;16:96–99. [PubMed]
62. Potocki L, et al. Molecular mechanism for duplication 17p11.2— the homologous recombination reciprocal of the Smith-Magenis microdeletion. Nat. Genet. 2000;24:84–87. [PubMed]
63. Kurotaki N, et al. Haploinsufficiency of NSD1 causes Sotos syndrome. Nat. Genet. 2002;30:365–366. [PubMed]
64. Bailey JA, Yavor AM, Massa HF, Trask BJ, Eichler EE. Segmental duplications: organization and impact within the current human genome project assembly. Genome Res. 2001;11:1005–1017. [PubMed]
65. Bailey JA, et al. Recent segmental duplications in the human genome. Science. 2002;297:1003–1007. [PubMed]
66. International Human Genome Sequencing Consortium. Finishing the euchromatic sequence of the human genome. Nature. 2004;431:931–945. [PubMed]
67. Budarf ML, Emanuel BS. Progress in the autosomal segmental aneusomy syndromes (SASs): single or multi-locus disorders? Hum. Mol. Genet. 1997;6:1657–1665. [PubMed]
68. Fiegler H, et al. Accurate and reliable high-throughput detection of copy number variation in the human genome. Genome Res. 2006;16:1566–1574. [PubMed]
69. Komura D, et al. Genome-wide detection of human copy number variations using high-density DNA oligonucleotide arrays. Genome Res. 2006;16:1575–1584. [PubMed]
70. Lin M, et al. dChipSNP: significance curve and clustering of SNP-array-based loss-of-heterozygosity data. Bioinformatics. 2004;20:1233–1240. [PubMed]
71. Nannya Y, et al. A robust algorithm for copy number detection using high-density oligonucleotide single nucleotide polymorphism genotyping arrays. Cancer Res. 2005;65:6071–6079. [PubMed]
72. Colella S, et al. QuantiSNP: an objective Bayes hidden-Markov model to detect and accurately map copy number variation using SNP genotyping data. Nucleic Acids Res. 2007;35:2013–2025. [PMC free article] [PubMed]
73. Conrad DF, Hurles ME. The population genetics of structural variation. Nat. Genet. 2007;39:S30–S36. [PMC free article] [PubMed]