|Home | About | Journals | Submit | Contact Us | Français|
C.A. and E.E.E. conceived the study and wrote the manuscript. C.A. and S.S. analyzed the data.
High-throughput sequencing technologies promise to transform the fields of genetics and comparative biology by delivering tens of thousands of genomes in the near future. Although it is feasible to construct de novo genome assemblies in a few months, there has been relatively little attention to what is lost by sole application of short sequence reads. We compared the recent de novo assemblies using the short oligonucleotide analysis package (SOAP), generated from the genomes of a Han Chinese individual and a Yoruban individual, to experimentally validated genomic features. We found that de novo assemblies were 16.2% shorter than the reference genome and that 420.2 megabase pairs of common repeats and 99.1% of validated duplicated sequences were missing from the genome. Consequently, over 2,377 coding exons were completely missing. We conclude that high-quality sequencing approaches must be considered in conjunction with high-throughput sequencing for comparative genomics analyses and studies of genome evolution.
The plummeting costs and massive throughput of second-generation sequencing platforms are paving the way for de novo sequencing applications to characterize the genomes of thousands of species. Recently, researchers from the Beijing Genome Institute sequenced the cucumber genome using both capillary sequencing and Illumina technology1, and the panda genome was the first mammalian genome to be assembled using sequence data generated solely using next-generation sequencing (NGS) platforms2. An international consortium of scientists proposes more ambitious projects, such as the Genome 10K Project to sequence the genomes of 10,000 vertebrate species3. The information obtained from sequencing these genomes will help us better understand genome evolution, providing rapid access to gene models from many more organisms than previously anticipated. However, a critical assessment of NGS genome assemblies should be performed in comparison to known standards, and a robust classification of what is missing from the assemblies needs to be taken into account. Such analyses are essential to correctly perform comparative genomics studies. Typical genome assembly standards, such as complete cDNA libraries or sequences from large-insert genomic clones that sample the genome do not yet exist for newly sequenced genomes such as the cucumber and panda and are unlikely to be generated to test the assembly quality. We can, however, compare the recently generated de novo sequence assemblies of two human individuals4 with the human reference genome5,6 to assess the limitations of such genomes assembled primarily with short reads. Here we present a formal analysis of the de novo sequence assembly generated from the genome of a Han Chinese individual (YH; Supplementary Note) using the Illumina platform4 with an emphasis on the repeats and segmental duplications that cover approximately half of the human genome. In addition, we analyzed the new sequences discovered from the genome of another individual (Yoruban from Ibadan, Nigeria; NA18507) to test the utility of de novo assemblies in the characterization of new sequence insertions.
NGS technologies typically generate shorter sequences with higher error rates from relatively short insert libraries7,8. For example, one of the most commonly used technologies, Illumina’s sequencing by synthesis, routinely produces read lengths of 75–100 base pairs (bp) from libraries with insert sizes of 200–500 bp. It is, therefore, expected that assembly of longer repeats and duplications will suffer from this short read length. Similar to the whole-genome shotgun sequence (WGS) assembly algorithms that use capillary-based data such as the Celera assembler9, the predominant assembly methods for short reads are based on de Bruijn graph and Eulerian path approaches10, which have difficulty in assembling complex regions of the genome. As argued by groups that presented various implementations of this approach (for example, the algorithms named EULER-USR11, ABySS12 and SOAPdenovo4), paired-end sequence libraries with long inserts help to ameliorate this bias. However, even the longest currently available inserts (<17 kilobases with Roche 454 sequencing13) are insufficient to bridge across regions that harbor the majority of recently duplicated human genes. Criticisms of WGS assembly algorithms and characterization of various types of errors associated with them as well as requirements for better assemblies have been previously discussed14-16.
An important consideration of any sequencing project, including those that use Sanger sequencing, is DNA contamination from other organisms. Before analyzing the genomes, we searched for possible contaminants by comparing the repeat-masked YH genome against the US National Center for Biotechnology Information (NCBI) nucleotide (nt) database17 (Supplementary Note). We identified 1,033 contigs and 166 scaffolds (361 kbp) with high-identity matches to other species (Supplementary Note). Although this represents a small fraction of the total genome, nearly half the sequence (152 kbp) (Supplementary Table 1) was classified as new human sequences corresponding to ~15% of all reported insertion polymorphisms for YH (1,079 of 7,211 sequences or 3% of 5.12 Mb)18 (Supplementary Note). Similarly, 2.8% (136.6 kbp of 4.8 Mbp) of the new sequences reported using the genome of a Yoruban individual (NA18507), likely represent contamination. The majority of these contaminations had high sequence identity to Epstein-Barr virus, an agent commonly used to immortalize cell lines (Fig. 1a and Supplementary Table 1). (Note that the NA18507 genome was sequenced using DNA from a cell line, whereas the YH genome sequence was generated from blood DNA.) Thus, although de novo sequence assemblies may be an important source for the discovery of insertion polymorphisms and are complementary to clone-based methods (Fig. 1a), such sequences require particular scrutiny and additional validation because of their tendency to enrich for contamination artifacts. Discriminating such sequences before sequence assembly becomes particularly problematic when the underlying sequence read data are short.
Any WGS-based de novo sequence assembly algorithm will collapse identical repeats, resulting in reduced or lost genomic complexity14. We compared the repeat content of the YH genome and the human reference genome (build 36) generating summary statistics for various repeat classes19 (Supplementary Table 2). Although the repeat structure may vary between individuals, most retrotransposons are fixed in the human lineage20, thus we would expect to observe a similar number of base pairs corresponding to retrotransposon-derived repeats in the genome of any human individual and the reference genome assembly. We identified 420.2 Mbp of missing common repeat sequence from the YH assembly corresponding to 173.6 Mbp of missing LINE1 (L1) and 159.2 Mbp of missing Alu repeats. As highly identical sequences will be more problematic, we quantified this effect by comparing this depletion as a function of sequence divergence. The depletion of repeat sequences was enriched in L1 classes with lower sequence divergence (R2 = 0.86; Fig. 1b). We found that the depletion rose rapidly (>50%) for L1 repeat subfamilies where sequence identity exceeded 85%.
In general, most Alu subfamilies were underrepresented, but evolutionarily younger Alu repeats with higher identity to consensus sequences had high depletion rates although this trend was weak (R2 = 0.02, Supplementary Fig. 1), likely because of the shorter sequence length of the Alu repeat class. Most common repeat classes showed reduced representation in the YH genome (Supplementary Table 2).
We used the whole-genome assembly comparison (WGAC) method21 to analyze the segmental duplication content in the YH genome. Despite the fact that genomes typically contain 140.2 Mbp to 159.6 Mbp (25,914 pairwise alignments) of euchromatic segmental duplication22, we detected only 10 Mb of segmental duplications (1,652 pairwise alignments) in the YH assembly (Table 1). Although the depletion becomes more pronounced with increasing sequence identity, the number of pairwise alignments was dramatically reduced (>90%) for all classes of duplication (Fig. 1c). This is in stark contrast to capillary sequencing–based WGS assembly, which recovered a substantial fraction of duplications with less than 95% sequence identity22. We previously constructed a duplication map of the YH genome using read-depth methods and validated copy-number differences using array comparative genomic hybridization23. We discovered 92 Mb of segmental duplications (>94% sequence identity) and found that the duplication content was similar to that of other human genomes (Supplementary Fig. 2a). We did not observe the common human duplication pattern within the YH genome de novo assembly (Fig. 1d and Supplementary Fig. 2b). If we limit our analysis to those duplications commonly present in the human reference genome and duplications we detected through read-depth analysis of a capillary sequencing–based WGS dataset24 (Celera) and YH (total of 72 Mbp common duplications), we conclude that 99.4% of true pairwise segmental duplications were absent. We predict that 95.6% of the duplications in the YH de novo assembly are likely false because they did not correspond to duplications predicted by read depth or were not detected by array comparative genomic hybridization analysis23 of the YH genome.
Finally, we analyzed the impact of this genomic reduction on both gene coverage and fragmentation of genes into multiple scaffolds. We examined a nonredundant autosomal gene set (17,601 genes; Supplementary Note) and required ≥98% sequence identity between the assembly and the reference gene set. (At the exon level, we found that 93% of all coding exons (159,621 of 171,746 exons) were completely represented in the YH assembly. At the gene level, however, only 56.3% of the genes (9,909 of 17,601 genes) had sufficient representation in the assembly (≥95% of the gene). Not surprisingly, among the 2,377 protein-coding exons that were completely absent, 47.7% (1,133 of 2,377 exons) mapped to segmental duplications (Supplementary Tables 3 and 4), representing a tenfold enrichment of duplicated sequence. Although these losses would prevent appropriate annotation of at least 1,112 genes, we also noted 83 genes for which all exons were completely missing or had less than 1% of their protein-coding sequence represented. Of these genes, 81.9% (68 of 83 genes) corresponded to members of duplicated gene families, many of which are high in copy number in the YH genome, as we previously characterized23 (Supplementary Tables 3 and 4).
The analysis described above did not consider gene fragmentation (that is, parts of the same gene represented in different scaffolds). The presence of duplicated and repetitive sequences in introns complicates complete gene assembly and annotation, leading to genes being broken among multiple sequence scaffolds. To test for this effect of gene fragmentation, we calculated the minimum number of scaffolds in the YH de novo assembly required to reconstruct every human gene according to the reference genome (Supplementary Note). We found that 69.7% of the genes (12,268 of 17,601 genes) are contained in a single scaffold. Among the fragmented genes (those mapping to two or more scaffolds), we found that 42% intersect with segmental duplications (1,779 of 5,291 genes) or map to regions in which repeat content exceeded >50% (1,582 of 5,291 genes) (Supplementary Table 3). Of 11,766 nonfragmented genes with all protein-coding exons present (Supplementary Table 3), 255 were shuffled in their respective scaffolds (that is, the exons were out of order). We observed that 29 genes were fragmented into >100 scaffolds and most (93%) corresponded to duplicated genes (Table 2 and Supplementary Table 3). Among the most shattered genes with more than 200 scaffolds were two genes (HYDIN2 and PRIM2) that have high-identity segmental duplications in YH23,25. Although HYDIN2 was not present in the NCBI build 36 assembly, it is now partially represented in GRCh37 human genome assembly but not assigned to a chromosomal location.
This is a watershed moment in genomics. Although data production capabilities are substantially improved, accurately building genome assemblies and correctly annotating them remains challenging, especially among complex genomes with higher repeat and duplication content. The de novo assembly of the YH genome coupled with experimental validation of its duplication and repeat content allow us to quantify this effect. Other than contaminating sequence, the most noticeable casualties of a de novo NGS assembly are segmental duplications and larger common repeats. We found that this depletion became acute when the sequence identity exceeded 85% resulting in the loss of ~16% of the genome. This is a more considerable bias when compared to capillary sequencing–based WGS assembly of the human genome in which sequence misassembly and collapse occurred for only ~8% of the genome when duplications or repeats exceeded 95% sequence identity. In the absence of alternative NGS-based human genome assemblies with different algorithms, we cannot test the effects of assembly method, but we believe that the limitations we present in this work are due to the properties of the data and whole-genome shotgun sequencing approach in general, rather than algorithmic inefficiency.
Without complementary efforts to fully sequence complex genomes, the field of comparative genomics may face a crisis. There is the problem that although the genomes of many more species are now accessible, the portion of each genome that can be reliably accessed has diminished substantially (<80%). Such biases are ironically transforming our definition of what it means to sequence a genome and threaten to skew our understanding of organismal biology and genome evolution. Third-generation technologies, which increase read length or library insert sizes, promise to alleviate this deficit, but the issue is fundamentally greater than a technological gap. The expertise and motivation to sequence genomes to a high quality are disappearing. Large-insert clone library resources such as bacterial artificial chromosomes, required for accurate assembly of the human genome, were once a mainstay of genome sequencing projects but are now considered too costly to create or maintain for many organisms. Moreover, if ‘genome manuscripts’ can now be published without accounting for the 20% that is missing, what incentive remains to spend the additional effort and cost to sequence these genomes well? Such biases can be minimized when the genome of a closely related species finished with high-quality, clone-based sequencing is available (such as closely related nonhuman primate genomes compared against the reference human genome assembly). The problem is exacerbated when analyzing genomes without a reference index genome. In these cases, the portions that are missing or misassembled cannot be readily inferred and are invisible to the biologist. Biases against duplications and repeats, as well as fragmentation, raise questions related to the accuracy and completeness of similarly assembled genomes such as the panda genome2, as recently discussed26. It is the responsibility of the scientific community to enforce standards of quality that can be measured and assessed. In our opinion, it is critical to develop new hybrid sequencing approaches, such as multiplatform strategies including the third-generation long-read technologies, high-quality finished long-insert clones and new assembly algorithms that can accommodate these heterogeneous datasets. The genome assemblies themselves must be experimentally validated. Large-molecule, high-quality sequencing should not be abandoned until the balance between quantity and quality of genomes has been reestablished.
We thank E. Karakoc and P. Sudmant for helpful discussions, T. Marques-Bonet and J.M. Kidd for providing the nonredundant gene table, and T. Brown for proofreading the manuscript. This work was partly supported by US National Institutes of Health grant HG002385 to E.E.E. E.E.E. receives funds as an Investigator of the Howard Hughes Medical Institute.
Note: Supplementary information is available on the Nature Methods website.
COMPETING FINANCIAL INTERESTS
The authors declare competing financial interests: details accompany the full-text HTML version of the paper at http://www.nature.com/naturemethods/.