We downloaded from the Trace Archive at NCBI [1] the following numbers of raw sequences from each Drosophila species: 2,772,509 sequences from D. ananassae; 2,445,065 from D. mojavensis; 2,214,248 from D. simulans; 2,061,010 from D. yakuba; 3,359,782 from D. virilis; 2,590,703 from D. pseudoobscura; and 3,663,352 from D. melanogaster. For each project, we downloaded sequences, quality values, and ancillary data (containing clone-mate information, clone insert lengths, and sometimes trimming parameters), comprising approximately 2-3 gigabytes (GB) of compressed data per genome.
For each genome, we used the nucmer program from the MUMmer package [26] to search the complete genome of W. pipientis wMel against the files containing the sequences. We pulled out any single sequence ('read') with at least one 30-bp exact match to wMel, and with an extended match that spanned at least 65 bp. We then retrieved the 'clone mates' of each sequence: most of the reads in whole-genome sequencing projects are obtained via a double-ended shotgun method, meaning that both ends of each clone insert are sequenced. The Trace Archive contains a link to the clone mate for each read; we used this information to extract any mates that were not contained in our original screen. For example, the D. ananassae data yielded approximately 5,000 additional reads when we pulled in the mates from the original set.
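The read screen and mate recovery described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the pipeline actually used: the `Match` record, its field names, and the `mates` dictionary are hypothetical stand-ins for parsed nucmer output and the Trace Archive clone-mate links.

```python
from dataclasses import dataclass

# Thresholds taken from the text: a 30-bp exact match and a 65-bp extended match.
MIN_EXACT = 30
MIN_EXTENDED = 65


@dataclass(frozen=True)
class Match:
    """Hypothetical summary of one read's alignment to wMel."""
    read_id: str
    longest_exact: int   # longest exact match to the reference, in bp
    extended_len: int    # length of the extended match, in bp


def screen_reads(matches, mates):
    """Return (hits, recovered_mates).

    hits: read IDs that pass both length thresholds.
    recovered_mates: clone mates of those reads that did not pass the
    screen themselves, so they have to be pulled in separately.
    `mates` maps a read ID to the ID of its clone mate (assumed format).
    """
    hits = {
        m.read_id
        for m in matches
        if m.longest_exact >= MIN_EXACT and m.extended_len >= MIN_EXTENDED
    }
    recovered = {mates[r] for r in hits if r in mates} - hits
    return hits, recovered
```

Reads without an entry in `mates` are simply kept without a partner, which is one plausible way to handle unpaired reads.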
We then assembled the Wolbachia reads in two different ways: with the Celera Assembler [29], treating it as a normal (de novo) whole-genome assembly, and with the AMOS-cmp assembler [30], which assembles a genome by mapping it onto a reference. For the reference genome we used wMel. We used Celera Assembler on the relatively well-covered wAna strain; although we ran it on the wSim reads as well, the sequence coverage was too light to yield a good assembly. The high degree of sequence identity, at 95-100% across most regions that are shared between strains, allowed for an excellent comparative assembly of the wSim strain with AMOS-cmp.
The AMOS-cmp assembly of wSim contains 388 contigs plus another 241 singleton reads, covering 896,761 bp (see Table ). The largest contig contains 16,701 bp. Note that AMOS-cmp produces contigs but not scaffolds. The contigs can easily be aligned to the reference genome to produce scaffolds, with the caveat that any rearrangements will invalidate such scaffolding information. To avoid such problems, we ordered and oriented the contigs separately with Bambus [31], a stand-alone genome scaffolding program, using only the clone-mate information from the original shotgun data. Bambus created 84 multi-contig scaffolds that joined together 273 of the 388 contigs, with the largest scaffold containing 50,851 bp and spanning (including estimated gaps) 54,207 bp.
For wAna, when we compared the de novo and comparative assemblies, we observed that there were multiple rearrangements in the wAna genome as compared to wMel. Our conclusion was that a comparative assembly, which relies on the genome structure of the reference, may be less accurate than a de novo assembly in the presence of extensive rearrangements, so we used the latter for our analysis.
The wAna assembly presented special challenges because of what appear to be a large number of rearrangements and polymorphisms within the sequences. The number of Wolbachia reads provided very deep coverage, which in principle should have produced a scaffold that covered nearly the entire genome. However, a large number of clone-mate links were inconsistent with one another, indicating that the reads may have been drawn from a population in which many of the individuals had genome rearrangements with respect to one another. We also found locations spanning hundreds of nucleotides where four or five individual reads had one nucleotide and the same number had a different nucleotide. These polymorphisms made it difficult to create many consistent large scaffolds. We created multiple assemblies in which we removed many of the inconsistent links, and eventually settled on the assembly presented here as the best representative of the genome possible given the diversity in the data. The wAna assembly has three large scaffolds of 460 kb, 157 kb, and 121 kb respectively, with all remaining scaffolds less than 20 kb in length. We also include a list of all the individual sequences, including those not incorporated into contigs, in our Additional data files.
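The kind of polymorphic site described above, where two different nucleotides are each supported by several reads, can be illustrated with a small check on alignment columns. This is only a sketch: representing each column as a string of the bases that the overlapping reads carry at that position is an assumption, and the default support of four reads simply mirrors the "four or five" in the text rather than any threshold used in the actual analysis.

```python
from collections import Counter


def is_polymorphic(column, min_support=4):
    """True if the two most common bases in a column each have
    at least `min_support` supporting reads.

    `column` is a string of read bases at one position (assumed format);
    characters other than A, C, G, T (gaps, N) are ignored.
    """
    counts = Counter(base for base in column.upper() if base in "ACGT")
    top_two = counts.most_common(2)
    return len(top_two) == 2 and top_two[1][1] >= min_support


def polymorphic_sites(columns, min_support=4):
    """Return the indices of columns flagged as polymorphic."""
    return [
        i for i, col in enumerate(columns) if is_polymorphic(col, min_support)
    ]
```

Requiring support from several reads for both alleles is what separates a candidate polymorphism from an isolated sequencing error, which would normally show up in a single read.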
To annotate the resulting sets of contigs, we used Glimmer [32] to make initial gene calls and BLAST [34] to search those calls against a comprehensive protein database. Regions with no gene calls were searched as well in all six reading frames using Blastx.
All the predicted genes in wAna, wSim, and wMoj were searched against wMel using Blastn. The results of these searches were used to determine what genes are absent in the wAna, wSim, and wMoj assemblies. DNA sequence matches at 80% identity for 80% length of the smaller of the genes were determined to be conserved and are plotted in Figure . Regions A and B in Figure were identified in this manner. To identify the unique genes in the wAna, wSim, and wMoj assemblies, all predicted proteins were searched against the wMel proteins using Blastp. Proteins in the new genomes were considered unique (or highly divergent) when the best match in wMel had an E-value greater than 10^-15.
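The two classification rules above can be written down compactly. The sketch below is illustrative only: the function names and arguments are hypothetical, the thresholds are treated as inclusive, and a protein with no wMel hit at all is counted as unique. None of these details is specified in the text.

```python
def is_conserved(identity_pct, aligned_len, gene_len_a, gene_len_b,
                 min_identity_pct=80, min_coverage_pct=80):
    """Apply the 80% identity over 80% length rule to one Blastn match.

    Coverage is measured against the shorter of the two genes, as in
    the text. Integer arithmetic avoids floating-point edge cases.
    """
    shorter = min(gene_len_a, gene_len_b)
    covers_enough = aligned_len * 100 >= min_coverage_pct * shorter
    return identity_pct >= min_identity_pct and covers_enough


def is_unique(best_evalue, threshold=1e-15):
    """Flag a protein as unique (or highly divergent) relative to wMel.

    `best_evalue` is the E-value of the best Blastp match in wMel,
    or None when there is no match (assumed here to mean unique).
    """
    return best_evalue is None or best_evalue > threshold
```

For example, a match with 85% identity aligned over 800 bp between genes of 1,000 and 1,200 bp satisfies the conservation rule, because 800 bp is 80% of the shorter gene.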
To create the multiple alignments of the 90 sequences that were shared by all four organisms, we searched the 114 sequences in wMoj against the wAna and wSim genome assemblies, again using nucmer. We used the output of nucmer to extract from each genome the appropriate matching sequence, and we fed the results to the overlapper (hash-overlap) from the AMOS assembler [30] to generate all pairwise sequence alignments.
All ankyrin repeat domain proteins identified by automated annotation were compiled and an alignment and tree were constructed using ClustalW [35]. The ankyrin repeat domain is a degenerate repeat [36], so no attempt was made to cluster proteins where the ankyrin repeat motifs were removed.
The whole-genome shotgun assemblies, with annotation, have been deposited at DDBJ/EMBL/GenBank under the project accessions AAGB00000000 (wAna) and AAGC00000000 (wSim). The versions described in this paper are the first versions, AAGB01000000 and AAGC01000000. The sequences and annotation for wMoj have consecutive accessions AY897435 through AY897548. The unassembled wMoj reads are also available from the Trace Archive and from the Additional data files for this paper.