|Home | About | Journals | Submit | Contact Us | Français|
The sequence of Ciona savignyi was determined using a whole-genome shotgun strategy, but a high degree of polymorphism resulted in a fractured assembly wherein allelic sequences from the same genomic region assembled separately. We designed a multistep strategy to generate a nonredundant reference sequence from the original assembly by reconstructing and aligning the two 'haplomes' (haploid genomes). In the resultant 174 megabase reference sequence, each locus is represented once, misassemblies are corrected, and contiguity and continuity are dramatically improved.
We describe the generation of the reference sequence for the Ciona savignyi genome. C. savignyi is among the species of sessile marine invertebrates commonly known as sea squirts. It is an important model organism  that is amenable to a variety of molecular genetic experiments . As a urochordate, it occupies an advantageous phylogenetic position, sharing conserved developmental mechanisms with vertebrates while being a substantially simpler organism both genomically and developmentally [3,4]. In addition, a draft genome sequence of a sister Ciona sp. (C. intestinalis)  is available, further enhancing the experimental and comparative value of a high-quality C. savignyi genome sequence [6,7].
The C. savignyi genome project employed a whole-genome shotgun (WGS) strategy to sequence a single, wild-collected individual to a depth of 12.7× . Assembly was complicated by an unexpected and extreme degree of heterozygosity , because current WGS assembly algorithms (including Arachne , the assembler employed for this genome) are not designed to accommodate highly polymorphic shotgun data [9-13]. The best shotgun assemblies have thus far been produced from species that exhibit a low rate of polymorphism (for instance, human ) or from inbred laboratory or agricultural strains (for example, mouse , Drosophila melanogaster , and chicken ). Assemblies of genomes from organisms with moderate or localized heterozygosity encountered significant difficulty that resulted in lower quality than expected, given the depth of sequencing [5,18-21].
An alternate WGS assembly strategy was developed for the C. savignyi genome that leveraged the extreme heterozygosity and depth of the shotgun data to force separate assembly of the two alleles  across the entire genome. In the resulting WGS assembly, nearly all loci are therefore represented exactly twice. However, the assembler had no mechanism by which to determine which contigs were allelic. Thus, the redundant WGS assembly contains no information to indicate how the constituent contigs relate to the two 'haplomes' (haploid genomes), preventing selection of a single haplome as a reference sequence. Importantly, there is also no distinction between highly similar contigs that represent two different alleles and those resulting from paralogous regions.
The redundancy of the original WGS assembly represented a practically insurmountable problem for genome annotation. Available genome data structures and browsers require a nonredundant reference sequence, and current gene prediction pipelines are highly parameterized and dependent on a hierarchy of heuristics that cannot accommodate the presence of two alleles in a single assembly [22-24]. Additionally, if a redundant gene set were to be obtained, then the lack of distinction between alleles and paralogs would significantly complicate evolutionary analyses, which are among the primary uses of the C. savignyi genome.
It was therefore imperative to generate a reference sequence for C. savignyi that could serve as a nonredundant resource and as the basis for genome annotation. We here describe how we generated the nonredundant, high-quality reference sequence, using the original WGS assembly as a starting point. Our strategy first identified allelic contigs and supercontigs in order to reconstruct the two haplomes and enable construction of a pair-wise haplome alignment. The aligned haplomes were then utilized to identify and, where possible, to correct several types of misassembly. The alignment also allowed the bridging of contig and supercontig gaps in one haplome by the other, dramatically improving long-range contiguity. Finally, the alignment was parsed to generate a composite nonredundant reference sequence that is more complete than either haplome.
We designed a semiautomated alignment pipeline to generate a nonredundant reference sequence from the original, redundant WGS assembly (Figure (Figure1).1). The pipeline is comprised of several stages and incorporates purpose-built and existing algorithms. A fully automated pipeline was not attempted because the complexity of the polymorphic assembly required manual inspection at several stages. Our strategy is best described as consisting of seven stages: identification of alignment anchors connecting allelic contigs; binning of allelic supercontigs; assignment of allelic supercontigs to haplomes; ordering and orienting the allelic contigs and supercontigs; removal of tandem misassemblies; pair-wise alignment of allelic hypercontigs; and selection of the reference sequence.
Like all WGS assemblies, the original WGS assembly of C. savignyi consists of a set of supercontigs that are comprised of ordered and oriented contigs (Figure (Figure1a).1a). Contigs are connected into supercontigs by paired sequence reads, which are obtained from opposite ends of a single clone. The original WGS assembly contains two copies of most loci, but individual contigs contain no information to indicate which of the two haplomes they belong to or any information to identify allelic contigs.
To identify high confidence allelic regions for use as anchors in later alignment steps, the original WGS contig assembly was soft-masked with a C. savignyi de novo RECON  repeat library and aligned to itself via a stringent optimization of blastn . Regions of at least 100 consecutive base pairs with exactly one high-quality blast hit were selected as allelic anchors. The requirement for exactly one hit precludes anchors between low copy repeats or duplicated regions. Anchors were filtered to remove those that lie in predominantly masked regions and between contigs in the same supercontig. (As is discussed below, a common error in WGS assembly of polymorphic genomes is tandem misassembly of alleles into the same supercontig; the 6,864 within-supercontig anchors most likely represent instances of this error.) After the filtering step, 239,635 anchors connecting 28,930 contigs remained (Figure (Figure1b1b).
In order to weight anchors for later steps, a LAGAN  global alignment was generated for each anchored contig pair, and a modified alignment score was calculated from each such alignment. The anchored contig pairs and their alignment scores were then mapped to supercontigs. A total of 3,678 supercontigs, comprising 88% of bases in the assembly, contained at least one anchor to another supercontig (Table (Table1).1). Of a total 6,411 anchored supercontig pairs, 4,546 were connected by a single contig pair, 723 by exactly two contig pairs, and the remaining 1,142 were connected by more than two contig pairs.
The anchored supercontigs were then sorted into 'bins', defined as collections of supercontigs containing both alleles of a region (Figure (Figure1c)1c) that have no assembly connections to their neighboring regions, as follows. Anchored supercontig pairs were ranked by the sum of their contig-contig LAGAN alignment scores and iteratively grouped starting with the highest ranked pair. Summing the contig LAGAN alignment scores across supercontigs and ranking supercontig pairs in order of scores in effect creates a voting scheme, wherein a spurious alignment or a small paralogous region will be outvoted by the correct allelic alignments of the surrounding sequence. Lower ranking alignments were flagged if they were not spatially consistent with a higher ranking alignment. For example, in Figure Figure22 the alignment shown in green would be flagged because it creates a linear inconsistency with the higher ranking alignment shown in blue. A total of 2,360 supercontigs comprising 85% of the original WGS assembly were thus sorted into 374 bins (Table (Table1).1). A total of 1,318 supercontigs, representing 3% of bases in the original WGS assembly, contained anchors that were overruled during the binning process, and were therefore not assigned to a bin.
Visual inspection of all bins indicated that the majority of the flagged, spatially inconsistent alignments were indeed spurious, but it also revealed loci where the independently assembled allelic supercontigs have a disagreement in long range contiguity, and hence are indicative of a major misassembly in one supercontig (Figures (Figures22 and and3a).3a). A major misassembly occurs when two distinct regions of the genome are joined (usually in a repeat), creating an artificial translocation event [9-13]. Major misassemblies are relatively rare but they are known to occur in nearly all established WGS assemblers and are extremely difficult to detect without a finished sequence or physical map [28-31]. We identified 13 alignment conflicts that were indicative of a major misassembly, and that linked 22 bins into eight 'spiders', so-called because of the branching structure created by the misassembly (Figure (Figure22).
To determine which of the conflicting supercontigs contained the misassembly, we placed genetic markers at relevant locations surrounding each alignment conflict and typed them in a mapping panel. Markers bridging a major misassembly should have no significant linkage or a genetic distance grossly out of scale to the physical distance indicated by the supercontig assembly. In six of the 13 alignment conflicts linkage analysis indicated a clear relationship between markers spanning the putative major misassembly in only one of the supercontigs, and we broke the opposite supercontig sequence accordingly. A detailed representative example is shown in Additional data file 1. In three conflicts we could not locate appropriate fully informative markers and in four conflicts the linkage data were inconclusive. In these cases we parsimoniously broke one supercontig. We note that, given the extreme polymorphism of C. savignyi, it is possible that the inconclusive linkage data reflect a polymorphic rearrangement event segregating in the population, because the individuals from the genetic cross were unrelated to the sequenced individual.
In total, supercontigs constituting 15% of the bases in the original WGS assembly lacked a unique anchor or were unplaced during the binning process, and were therefore not included in subsequent steps. We suspect that most of the unassigned sequence is comprised of uncondensed repetitive regions and hence, if fully assembled, would represent significantly less than 15% of the genome. This view is supported by several lines of independent evidence. First, 75% of the bases in the unassigned sequence are repeat masked. This is a significant enrichment compared with the original WGS assembly, in which 30% of bases are repeat masked. Second, the unassigned sequence primarily consists of short single-contig supercontigs, whose N50 is only 6 kilobases (kb). Most importantly, the unassigned contigs exhibit a striking preponderance of low sequence coverage: 27% of unassigned contigs have a maximum read coverage of two, whereas only 1.2% of the contigs that were assigned to a bin fall into this category (Figure (Figure4).4). The mean read coverage per position in the unassigned sequence is 3.7, which is well below the mean of binned contigs of 5.3 (Additional data file 2).
Supercontigs in each bin were assigned to one or the other of the two haplomes by leveraging the alignment connections between supercontigs within each bin to assign supercontigs into allelic sub-bins 'A' and 'B' (Figure (Figure1d).1d). A bipartite graph was constructed for each bin, where nodes are supercontigs and edges are alignments between them. We arbitrarily assigned one node of the most trustworthy edge (as determined by alignment score) to sub-bin A. All nodes connected to the initial node by alignment were assigned to sub-bin B. We then traversed the graph one edge at a time and assigned each unassigned node to the opposite sub-bin of the previous node. As the designation of A and B is arbitrary within each bin, the reconstructed haplomes are necessarily a mosaic of the parental contributions.
We utilized a purpose-built Java tool to inspect bins for inconsistencies between order/orientation of contigs as suggested by the original WGS assembly and that suggested by the alignments between allelic contigs. The Java graphical user interface displayed all contigs in each bin, their alignment anchors to the other allele, paired reads between distinct supercontigs, and other pertinent information. Manual inspection of all bins revealed the vast majority of inconsistencies to be due to obviously spurious alignments. However, it became clear at this point that it would not be sufficient to chain supercontigs linearly in each sub-bin to obtain the correct order and orientation in each of the two haplomes, because a substantial number of clearly correct alignments between allelic contigs did not conform to the supercontig-imposed ordering, indicating the presence of misassemblies that should be corrected.
Most disagreements could be classified into three types of misassemblies: 'drop-in' supercontigs, interleaved supercontigs, and local contig misorderings (Figure (Figure3b).3b). The most frequent type was of the drop-in variety, in which short, usually single-contig supercontigs were ordered by the alignment to a position entirely within a gap of a different supercontig (Figure (Figure3b).3b). The size of the 'drop-in' supercontig and the gap length of the supercontig into which it would be embedded (as estimated by Arachne) were often strikingly similar, and in many cases the existence of multiple, consistent paired reads between the small supercontig and the encompassing supercontig further supported the alignment-ordered placement. Interleaved supercontigs, which are characterized by the alignment-directed ordering of the terminal contig(s) of one supercontig within a supercontig gap in the adjacent supercontig, were less frequent (Figure (Figure3b).3b). Interleaved supercontig misassemblies have been observed in other WGS assemblies . The final type of misassembly detected at this stage of the pipeline, namely local contig misordering, consists of incorrectly ordered contigs within a single supercontig (Figure (Figure3b),3b), and has been reported in assemblies by all major WGS assemblers [9-13,29].
We developed an algorithm, called the Double Draft Aligner (DDA) , to order contigs automatically within each sub-bin with respect to their allelic contigs from the other sub-bin. The DDA operates on contigs rather than supercontigs to allow for the reordering of contigs that would be necessary to correct the three types of misassembly described above (Figure (Figure3b).3b). The DDA is similar to SLAGAN , with the notable exception that there is no reference sequence according to which the other input sequence is rearranged. Instead, each of the two input sequences is a set of unordered contigs and either sequence may contain a rearrangement. The DDA constructs 'alignment links' from local alignments between contigs of opposite sub-bins, and utilizes these alignment links to chain contigs iteratively within each of the two sub-bins. In the absence of an alignment link, contigs are chained according to the order within their supercontig. A detailed description of the DDA algorithm is provided in Materials and methods (below). The DDA does not chain across 'unreliable' contigs (contigs with multiple, inconsistent alignments) and a final, manual proofreading step using the Java tool mentioned above was used to correct these cases.
The effectiveness of the DDA is illustrated in Figure Figure5.5. Before the DDA is run on a bin, supercontigs from each sub-bin are unordered. After the DDA is run, the ordering of the contigs in each allele corresponds linearly to the other allele. Once ordered by the DDA, the contigs of each sub-bin were concatenated into a single 'hypercontig' (Figure (Figure1e).1e). The allelic hypercontigs constitute the reconstructed haplome assembly.
It should be noted that some differences in contig order between the haplomes are probably the result of a polymorphic rearrangement rather than a misassembly. The DDA will force contigs of a polymorphic rearrangement to correspond to the order of the more contiguous haplome, and hence introduce an artificial rearrangement in the other haplome. However, because both orderings were present in the sequenced individual and we have no information to indicate which is 'wild type', either ordering is equally legitimate for our primary goal of selecting a nonredundant reference sequence. This also applies to polymorphic inversions, which the DDA identifies but does not re-orient. All identified inversions that spanned at least one entire contig were manually inspected and re-oriented in one hypercontig.
The ordered and oriented contigs allowed identification of tandem allele misassemblies, which are known errors of polymorphic WGS assemblies . In a tandem allele misassembly, two alleles of a region are positioned adjacent to each other in the same supercontig (Figure (Figure3c).3c). The insertion of the second, misassembled allele into the supercontig creates a disparity between the length of the assembled sequence and the estimated distance to adjacent contigs provided by paired reads, because the paired reads will 'leapfrog' the misassembled allele (Figure (Figure3c).3c). The assembler is then forced to report a contig overlap to reconcile the conflicting sequence and paired read data. Tandem allele misassemblies are probably common in the original WGS assembly, because the assembler predicted a contig overlap between 5.3% of all adjacent contigs. The predicted overlaps have a total length of 9 megabases (Mb), a median length of 3.7 kb, and an N50 length of 6.4 kb (Additional data file 3). Manual inspection of a sampling of predicted overlaps revealed a strong enrichment for paired read structures, which is indicative of tandem misassembly. An additional 36% of adjacent contigs in the original WGS assembly are predicted to have a gap of length zero, which, given the inherent error in estimating insert lengths, may also include a substantial number of overlapping contigs.
We designed a tool to identify and remove tandem allele misassemblies in adjacent contigs on the basis of alignments within an allelic bin (see Materials and methods, below). A tandem misassembly was identified and removed in 26% of contigs for which the assembler had predicted an overlap, in 5% of contigs with a predicted gap of length zero, and in only 1% of contigs that had a predicted gap of positive length (Table (Table2).2). The mean and median length of tandem allele misassemblies was significantly shorter in adjacent contigs with a predicted gap of length zero, as would be expected. In addition, contig overlaps were identified and removed in 11% of adjacent contigs for which no gap estimate was available. This includes terminal contigs in adjacent supercontigs and contigs rearranged by the DDA. Overlapping regions in this category tended to be shorter and nearly identical; these most likely represent sequence from the same allele that should have been merged in the shotgun assembly, rather than a tandem misassembly.
The two reconstructed haplomes consist of 374 pairs of allelic hypercontigs, which contain a total of 336 Mb, including 13 Mb of gaps. Each pair of allelic hypercontigs was globally aligned with LAGAN to produce the final whole genome alignment of the haplomes. The total alignment length is 214 Mb, of which 118 Mb are aligned positions, 47 Mb are gapped positions corresponding to haplome-specific sequence (polymorphic insertion/deletion events such as those resulting from mobile element activity), and 47 Mb (38 Mb of sequence plus 9 Mb of supercontig gap placeholders) of sequence aligned to an assembly break in the opposite hypercontig. The haplome alignments are available on the Sidow laboratory website .
A nonredundant reference 'reftig' combining sequence from both haplomes was parsed directly from each hypercontig alignment (Figure (Figure1f).1f). Reftigs were constructed with the following priorities: to select the more reliable sequence in any given region, to extend sequence continuity by avoiding contig breaks, to minimize unnecessary switching between the hypercontigs, and to maximize the length of the reference sequence.
Before selecting the sequence for each reftig, we partitioned the hypercontig alignments into regions of high or low similarity. In regions of high similarity the only possible difference between the aligned sequences were single nucleotide polymorphisms, because gaps and contig breaks were by definition excluded from these regions. In these regions the reference sequence was selected base by base on the basis of read coverage, which we used as a proxy for sequence quality. Approximately half of the total alignment (containing about two-thirds of the bases in each haplome) was classified as highly similar. Highly similar regions had a mean length of 185 base pairs (bp) and an N50 of 330 bp.
A low similarity region was defined as the region flanked by two highly similar regions. Many of these regions contained haplome-specific sequence (polymorphic insertion/deletion events) and assembly gaps. As such, we were not always confident that the global alignment in these regions was entirely comprised of aligned allelic positions, because global aligners such as LAGAN are required to align each base and may therefore align nonhomologous bases. To avoid the creation of an artificial allele via an alignment artifact, the sequence of one hypercontig was selected for the entirety of each low similarity region. The selection was based on a set of heuristics designed to follow the priorities listed above (see Materials and methods, below). Low similarity regions accounted for approximately half of the total alignment, but contained only about one-third of the bases in each haplome. They had a mean length of 194 bp and an N50 of 2,675 bp.
The C. savignyi reference sequence represents significant improvements in contiguity, continuity, and redundancy from the original WGS assembly (Table (Table3).3). The reference sequence has a total contig length of 174 Mb contained in 374 reftigs, of which the largest 100 contain 86% of the total sequence. The reftig N50 is 1.8 Mb and the contig N50 is 116 Kb, representing threefold and sevenfold improvements in contiguity over the original assembly (Table (Table33).
The reference sequence also compares favorably with a previous nonredundant assembly that also used the original WGS assembly as a starting point ('nonredundant 1.0 assembly') . This earlier assembly was generated by selecting a path through local alignments of the original WGS assembly with itself. Alignment discrepancies between the haplomes were resolved by breaking continuity rather than by resolution with assembly or genetic data. Compared with this earlier assembly, the reference sequence represents a twofold increase in scaffold and contig contiguity (Table (Table3).3). Additionally, the reference sequence is 10% longer (Table (Table3),3), and its largest 120 reftigs contain as many bases as all 446 supercontigs of the nonredundant 1.0 assembly.
In addition to extended contiguity, the continuity of the sequence has been improved in the reference sequence (Table (Table3).3). The frequency of gap bases ('N' placeholders whose number corresponds to the estimated size of the gap between adjacent contigs) has been decreased in the reference sequence to 1.7% of total positions, or 3 Mb. In comparison, 5.2% (22 Mb) of positions in the original WGS assembly and 4.3% (6.8 Mb) of positions in the nonredundant 1.0 assembly are gap bases. Increased continuity is also evident in the significant reduction in number of contigs, and hence decreased number of contig breaks (Table (Table33).
The redundancy and completeness of the reference sequence were estimated by aligning the then available approximately 75,000 expressed sequence tags (ESTs) from C. savignyi to each assembly. Each EST was classified on whether it aligned to an assembly no, one, two, or more than two times. The EST alignments verify a significant reduction in redundancy in the reference sequence: 85% of ESTs align exactly once to the reference sequence whereas 72% align exactly twice to the original, redundant WGS assembly (Figure (Figure6).6). By this same measure, the reference sequence is slightly less complete than the original WGS assembly, because 91% of ESTs align at least once to the reference sequence whereas 94% align at least once to the WGS original assembly. However, the reference sequence recovers 3% more ESTs than the nonredundant 1.0 assembly, to which 81% of ESTs align exactly once and 88% align at least once.
Of the reference sequence, 30% is classified as repetitive by RepeatMasker , utilizing the de novo RECON C. savignyi repeat library (see Materials and methods, below). By comparison, 38% of the original WGS assembly is classified as repetitive under the same conditions. This reduction in repeat content reflects the removal of uncondensed repetitive sequence fragments in the reference sequence pipeline. An annotated subset of the RECON library is available in the Repbase  database of mobile elements. Repeatmasker utilizing the annotated Repbase library classifies 16.7% of the reference sequence as mobile element derived and provides annotation of individual mobile element classes (Table (Table4).4). Short interspersed elements (SINEs) constitute the largest class of mobile element in the C. savignyi genome, accounting for 7.5% of bases in the reference sequence, followed by unclassified elements (3.4%), long interspersed elements (LINEs) (2.0%), DNA transposons (1.8%), and long terminal repeat (LTR) elements (1.3%).
We did not detect anything unusual about the distribution of mobile elements in the reference sequence or between the aligned haplomes. The mobile element content of the two reconstructed haplomes is similar to that of the reference sequence, indicating that there was no detectable bias for or against annotated mobile element classes in the selection of the reference sequence. Overall, 248,741 Repbase mobile elements were identified in the dual haplome assembly. In total, 175,349 elements were present in the same alignment location as an annotation of the same element in the opposite haplome, and thus indicate an insertion event before the coalescence time of the two alleles. In all, 22,321 elements were aligned to alignment gaps in the opposite haplome, and therefore probably represent haplome-specific insertion events. The number of haplome-specific instances of mobile elements in each class is directly related to the total number of that element in the genome (Table (Table4).4). The remaining elements were unclassifiable because of missing sequence in the opposite haplome, fractured repeat annotation or alignment ambiguities. Detailed characterization of polymorphisms in the C. savignyi genome will be published elsewhere.
We constructed the nonredundant reference sequence of the C. savignyi genome from the initial, redundant WGS assembly. In this reference sequence, the vast majority of loci are represented exactly once. Compared with a previous nonredundant assembly , the contiguity of the sequence has been improved and identifiable misassemblies have been corrected. The reference sequence provides a valuable resource for both the Ciona research community and comparative genomics. It is the C. savignyi assembly currently available in Ensembl  and forms the basis of all currently available C. savignyi gene annotation sets . We believe that the reference sequence is of high quality; as for all unfinished assemblies, however, users should anticipate the presence of some remaining misassemblies in the sequence. In particular, apparent duplications and copy number variation should be interpreted with caution because they could represent an undetected inclusion of both alleles of a polymorphic region. Additionally, because the reference sequence is a composite of the two haplomes of the sequenced individual, the sequence across a given region may not actually be present on the same haplotype in nature.
The C. savignyi reference sequence will facilitate comparative analysis, most importantly with the C. intestinalis genome. The two Ciona spp. are morphologically extremely similar and share nearly identical embryology . C. savignyi and C. intestinalis hybrids are viable to the tadpole stage , but comparison of their genome sequence reveals a sequence divergence approximately equivalent to that seen between the human and chicken genomes. The combination of significant sequence divergence without significant functional divergence between these two species enables particularly powerful comparative sequence analysis [6,7]. To facilitate such comparisons, a whole-genome alignment of the C. savignyi reference sequence and v2.0 of the C. intestinalis assembly has been constructed and is available in the Vista genome browser [41,42] and through the Joint Genome Institute C. intestinalis genome browser . Caution should be used in interpretation of species-specific duplications, which could be due to assembly artifacts.
A parallel goal of this work was to characterize polymorphism in the C. savignyi population. The high-quality, whole-genome alignment of the haplomes has facilitated identification of polymorphisms at multiple scales, including single nucleotide polymorphisms, insertion/deletion events and inversions, and sheds light on the population dynamics of highly polymorphic genomes .
The unusually deep raw sequence coverage accomplished by the C. savignyi genome sequencing project (>12×) allowed separate assembly of the two alleles, a critically important prerequisite for generating the reference sequence with the methodology we developed. This opportunity is unlikely to be reproduced in future genome assemblies. For example, when the recently completed Sea Urchin Genome Project was faced with a comparable level of heterozygosity within the single sequenced Strongylocentrotus purpuratus individual, they elected to adopt a hybrid approach, which combined 6× WGS sequencing data with 2× coverage of a bacterial artificial chromosome (BAC) minimal tiling path . Because each BAC can only contain sequence from one of the two haplomes, the BAC sequence could then be used to separate allelic WGS reads during the assembly process. However, insights into misassemblies and the success of the general approach we described here should prove useful in informing assembly of other polymorphic species. We expect that as genome sequencing projects continue to move beyond inbred laboratory and agricultural strains, many more projects will be forced to adapt to the difficulties of polymorphic genome assembly. This has already been seen in the C. intestinalis, Candida albicans, S. pupuratus, Anopheles spp., and to a limited extent the fugu genome projects, and is anticipated to remain a significant problem as genome sequencing projects continue their rapid expansion.
During the course of describing how we generated the nonredundant reference sequence of C. savignyi, we illustrated how the difficulties inherent in a WGS assembly of a highly polymorphic genome can be turned into an advantage with respect to the quality of the final sequence. The key step that facilitates this advantage is the alignment of the haplome assemblies, which allows correction of assembly errors that would go undetected in a standard WGS assembly, and dramatic extension of the continuity and contiguity of the reference sequence. The haplome alignment is dependent on the detection of allelic contigs, which in turn depends on having forced separate assembly of the two alleles during the course of producing the initial, redundant assembly. In the case of the C. savignyi genome, this strategy was possible because of the unprecedented depth to which its genome was sequenced. We believe that less than 12× coverage would be sufficient to pursue our strategy, but exactly where the cutoff would be is an area for further investigation. We also know that the extreme heterozygosity, which extended across the entire genome of the sequenced individual, facilitated the initial, separate assembly of the two alleles, but whether this strategy would work for less extremely polymorphic genomes is also an area for future work. We hope that the methodologic insights we generated will be as useful for future genome assemblies as the reference sequence will be for experimental work in Ciona.
Repeats were identified with RepeatMasker  utilizing a de novo repeat library constructed by the RECON  program and hand curated to remove multicopy genes, tRNA, and rRNA elements. The RECON library is available from the Sidow laboratory website .
Fully informative genetic markers were designed at relevant locations surrounding each potential major misassembly and typed in 92 meioses of an outbred cross.
The original WGS contig assembly was aligned to itself with a stringent optimization of WUBLAST . If the stringent BLAST generated no hits of 100 bp with at least 95% identity or 200 bp with at least 90% identity, a second BLAST with less stringent parameters was executed. In practice, the majority of contigs without a hit in the first BLAST did not have a hit in the second BLAST either. Contig queries were soft-masked with Repeatmasker and the RECON library and the low complexity filter dust. The initial stringent BLASTN parameters were as follows: hitdist = 20, e cutoff = 1 × e-20, wink = 3, -topComboN = 1, and -Q (gap open penalty) = 40. In the second, less stringent BLAST the soft-masking parameters, topcomboN and e cutoff parameters were retained, and all other parameters were left at default.
Regions with exactly one hit to another contig for at least 100 consecutive base pairs were selected as 'anchors'. A total of 277,075 anchors connecting 33,684 contigs were identified. Of the 69,912 pieces evaluated (from a total of 66,799 contigs, because larger contigs were split into 30 kb pieces), 15,860 were completely masked or did not have an unmasked stretch of at least 11 bp on which WUBLAST could initiate a word hit. An additional 6,628 pieces did not have a single BLAST hit greater than 1 × e-20 in either BLAST. Of the remaining 47,604 pieces, 36,404 (representing 33,684 contigs) were found to have at least one anchor and 11,200 did not contain a single anchor.
A potential source of error is an anchor between uniquely aligned masked regions. Masked regions were not included in the word generation stage of BLAST but were included in the alignment extension step. It is therefore possible that an entire anchor resides within a masked region adjacent to unmasked sequence that initiated a BLAST hit. To remove this possibility we screened all 277,075 anchors for anchors that did not contain at least 100 bp of consecutive unmasked bases. In all, 37,440 anchors (13.5% of all anchors, which connected 4,754 contigs) did not pass this test and were flagged and removed from later analysis. After the removal of the masked anchors, 239,635 anchors connecting 28,930 contigs remained.
A global alignment was generated for all anchored contig pairs with the alignment program LAGAN  using default scoring parameters, except for the gap open penalty, which was decreased to -450. The alignments were rescored with the standard Smith-Waterman scoring of match = 5, mismatch = -4, gap open = -4, and gap extend per residue = -1, with the exception that terminal gaps were ignored and gap penalties were capped at 20 bp (corresponding to a score of -24). Gap penalties were capped to prevent overly penalizing aligned sequence adjacent to expected haplome-specific insertion/deletions events. LAGAN is known to produce a stereotypical error in which nonsimilar terminal regions are forced into alignment. To avoid this we ignored aligned end fragments that were less than 20 bp or less than 80% identical. Alignments with a score of less than 1,000 were considered spurious and eliminated from further analysis.
The underlying idea behind the coassembly of two alleles is that each allele can be used to establish an ordering of the contigs and supercontigs in the other allele. Each allele is now a set of contigs (contiguous stretches of DNA sequence). The contigs are ordered into supercontigs by assembly links. These assembly links are based on paired reads, and are assumed to be less reliable than the assembled sequence in the contigs that they join . To order allele A we use all contigs of allele B as ordering information, and then repeat the step to order allele B according to the contigs of allele A.
To accomplish this we used a sparse Dynamic Programming chaining algorithm . This algorithm takes each contig of the base genome (or, in the case of C. savignyi, base haplome), and tiles it with local alignment from the second genome, taking into account not only sequence similarity but also common biologic rearrangement events, such as inversions and translocations. Because the two haplomes being co-assembled are very similar, we used a very high threshold for homology.
To order the contigs of allele A we did the following; for every pair of contigs from allele A (for instance, 1A and 2A) that are aligned next to each other in the tiled alignment of the contig XB (of allele B), we add a link joining 1A and 2A, and the two contigs are now said to be joined by an alignment link. The link is directed depending on the order and orientation of 1A and 2A hits on XB. If a contig has multiple forward or backward alignment links, it is labeled unreliable, because it could be a site of a misassembly on the contig level (or a biologic rearrangement). All links to unreliable contigs are removed. After this we use the assembly links that are not contradictory to the alignment links in order to increase the contiguity of the sequence. For any contig that is missing a forward or backward link but that has one in the original Arachne supercontig, we add this link to the link graph. After this, all connected components of the link graph are joined into a new haplome supercontig. The process is repeated in order to obtain a relative ordering of the connected components. During this step, only the reliable supercontigs of allele B are used as a basis for ordering all of the supercontigs of allele A and vice versa.
Note that during this procedure we may join with an alignment link two contigs that are in the same supercontig but that have other contigs in between them. Any such contig can be separated out into a new scaffold: if the in-between contigs match any sequence, then they will be aligned separately; and if they do not match any sequence, then we use the sequence from allele B to fill the sequence gap.
A purpose-built tool was designed to identify and remove tandemly misassembled alleles in adjacent contigs. The tool operates on an allelic 'bin' in which allelic supercontigs of a region have been collected and sorted into two sub-bins, corresponding to the two alleles of that region. Each contig was aligned with the local aligner CHAOS  to the preceding contig in its sub-bin, and all hits above a threshold of 5,000 (corresponding to about 50 aligned bases) were selected. The sequence of each selected hit was aligned with CHAOS to the entirety of both sub-bins to determine whether the sequence is unique to the adjacent contigs. If the sequence had no other instances in its own sub-bin and less than two hits on the opposite sub-bin, then it was considered a potential tandem misassembly. Real duplication events or repeated sequence motifs would be present in both sub-bins at a copy number of at least two and hence excluded at this stage. The tandemly misassembled region was removed from the preceding contig if the ratio of duplicated sequence to the length of nonduplicated sequence exceeded an empirical threshold. The tool was applied to adjacent contigs within supercontigs before the DDA step, and again on all adjacent contigs within hypercontigs after DDA. Because the contigs were repeat masked, tandemly misassembled repeat regions will not be identified.
Ordered contigs in each sub-bin were concatenated into a single hypercontig. A default gap of 10 'N's was inserted between all adjacent contigs without an Arachne gap estimate. Each pair of hypercontigs was aligned with LAGAN, using default scoring parameters with the exception of the gap open penalty, which was decreased to -450. Hypercontigs were masked with the full RECON library before alignment.
Before selection of the reference sequence the hypercontig alignments were partitioned into regions of high and low similarity. High similarity regions were identified by selecting aligned regions of perfect identity as seeds and expanding the seeds with a blast-like extension. The minimum seed length was 15 bp, the match score was set to 5, and the mismatch score to -4. If the cumulative score dropped below 95, or the extension encountered a supercontig break or a gapped alignment position, the extension was terminated and retracted to the last match. Low similarity regions were defined as the region between adjacent high similarity regions.
Read coverage was calculated for all positions in the original assembly by mapping read placement information from the Arachne output files onto contigs and counting the number of reads at each position. All hypercontig bases were mapped to their position in the original assembly and assigned the corresponding read coverage.
In high similarity regions the reference sequence was selected at each position by comparing the read coverage of the aligned allelic bases and choosing the allele with read coverage closer to 6×, based on the assumption that bases with either low or extremely high read coverage are enriched for sequencing errors and assembly artifacts . If the alleles had the same read coverage, the allele selected at the previous position was selected.
In low similarity regions the sequence of one allele was selected for the entirety of the region based on the following heuristics.
If both alleles contained a contig break then the longer allele was selected.
If only one allele contained a contig break, then the unbroken allele was selected, unless the allele containing the contig break was longer than 20 kb and greater than 10 times the length of the continuous allele. This was done to avoid selecting against long regions that may have assembled in only one of the haplomes because of the draft nature of the assembly.
If neither allele contained a contig break the median read coverage of the bases in each allele was calculated. If both alleles did not have good median read coverage (3 ≤ X ≤ 15), then the allele with read coverage closer to the expected 6× coverage was selected. In a tie the longer allele was selected. If an allele was entirely gapped the read coverage of the previous position was used as a proxy. If both alleles had good median read coverage (3 ≤ X ≤ 15), then the repeat content of the region was examined. If the region was repetitive (90% of the longer allele was repeat masked) then the shorter allele was selected; otherwise the longer allele was selected.
We used the same EST set and followed the same filtering procedure (removing about 250 ESTs of less than 100 bp and about 10,000 mitochondrial ESTs), as was employed in nonredundant assembly 1.0 . We aligned the ESTs to each assembly with WUBLAST  and BLASTN in the place of BLAT , with the following parameters: -e = 10, -noseqs, -topcomboN = 1, -links, and -Q = 20. As in the report by Vinson and coworkers , all alignments in which matching bases exceeded 80% of the length of the EST were retained. Our WUBLAST yielded virtually the same number of alignments in all categories as the BLAT analysis .
The following data are available with the online version of this paper. Additional data file 1 is a figure of a representative alignment 'spider' involving sequence from three bins. Additional data file 2 is a figure displaying heavy enrichment for low coverage bases in unassigned sequence. Additional data file 3 is a figure displaying the length distribution of predicted contig overlaps in the original WGS assembly.
In this figure supercontigs are black lines denoted with purple letters. Regions of long alignment are denoted in turquoise (approximately to scale), with the length of the alignment in megabases. Regions of short alignment are denoted in pink (grossly out of scale to make them visible), with the length of the alignment in kilobases. Approximate positions of genetic markers are given by orange ovals, and their names are in blue. Linkage is shown by a red dashed line with the genetic distance indicated. Lack of a red dashed line between any two markers indicates no detectable linkage. Positions where the assembly is broken to account for the genetic map data are shown as circled lightning.
In this figure the x-axis diplays the maximum read coverage per contig and the y-axis displays the percentage of contigs in a category. Red bars indicate unassigned contigs, and blue bars indicate contigs assigned to an allelic bin.
A figure displaying the length distribution of predicted contig overlaps in the original WGS assembly.
We thank Jade Vinson for pre-publication access to the original WGS assembly, Senthil Singaravelu for visualization software, Zhirong Bao and Sean Eddy for generating the C. savignyi RECON repeat library, Mukund Sundararajan for computational help, and William Smith and Di Jiang for providing the genetic cross.
AS is supported by NIH/NIGMS and NIH/NRGRI. KS was supported by a Stanford Graduate Fellowship and the Stanford Genome Training Program (SGTP; NIH/NHGRI). MB was supported by a NSF graduate fellowship and the NSERC Discovery Grant. MH was supported by the SGTP.