We have constructed six BAC libraries for three lepidopteran model species (two moths and one butterfly). These libraries not only have large insert sizes (150–175 kb) and deep genome coverage (13×–17×), but also have a low level of insert-empty clones (<5%) and no detected contamination with DNA from organelles or from microbes potentially living on the source insects, as indicated by BES analysis. Moreover, the genome coverage and quality of the libraries have been verified independently by screening high-density filters of the libraries with a set of single-copy genes or ESTs. The observation that none of the libraries was contaminated with microbial DNA potentially carried by the source insects was expected, because the insects used as the DNA source for library construction were at the self-contained, non-feeding pupal stage and had purged their guts at the end of larval development. However, we did observe 6, 21 and 20 short sequences in the He, Hv and Ms BESs, respectively, which were homologous to viral, bacterial, and fungal sequences (Tables and [see Additional files 5, 6 and 7]). We believe that these homologies are real and do not result from sample contamination, because the homologous segments lie in the middle of BESs. These results perhaps provide a line of preliminary evidence for the presence of microbial sequences in these lepidopteran genomes, possibly acquired by horizontal transfer. Similar findings have been obtained in
B. mori [35]. On the other hand, considering the small fraction (~0.5%) of the BAC libraries sampled, a more direct test of organelle contamination could be accomplished by using mitochondrial sequences as probes for hybridization. Furthermore, since the libraries of each species were constructed with two restriction enzymes (EcoRI and BamHI) complementary in the GC content of their restriction sites, the genome coverage should be much better distributed along the genome than that of libraries constructed with a single enzyme [13, 36]. Therefore, these libraries could provide useful resources for comprehensive genomics research of the three model lepidopterans.
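The coverage figures quoted above follow from simple arithmetic on clone number, mean insert size, and haploid genome size. As a minimal sketch (not the authors' procedure), the Python below computes coverage in genome equivalents and the standard Clarke and Carbon probability that a given single-copy sequence is represented at least once in a library. The clone count of 40,000 and the 450 Mb genome size are illustrative assumptions chosen to fall within the ranges reported here; they are not values taken from this study.

```python
import math


def genome_coverage(n_clones: int, mean_insert_kb: float, genome_mb: float) -> float:
    """Coverage in genome equivalents: (N * I) / G, with I and G both in kb."""
    return n_clones * mean_insert_kb / (genome_mb * 1000.0)


def prob_sequence_present(n_clones: int, mean_insert_kb: float, genome_mb: float) -> float:
    """Clarke and Carbon estimate: P = 1 - (1 - I/G)^N."""
    insert_fraction = mean_insert_kb / (genome_mb * 1000.0)
    return 1.0 - (1.0 - insert_fraction) ** n_clones


def clones_for_coverage(target_coverage: float, mean_insert_kb: float, genome_mb: float) -> int:
    """Smallest number of clones that reaches the target coverage."""
    return math.ceil(target_coverage * genome_mb * 1000.0 / mean_insert_kb)


# Hypothetical library: 40,000 clones, 150 kb mean insert, 450 Mb haploid genome.
coverage = genome_coverage(40_000, 150.0, 450.0)
p_present = prob_sequence_present(40_000, 150.0, 450.0)
n_needed = clones_for_coverage(13.0, 150.0, 450.0)

print(f"coverage: {coverage:.1f}x")
print(f"P(single-copy locus present): {p_present:.6f}")
print(f"clones needed for 13x: {n_needed}")
```

Both formulas assume that clones sample the genome uniformly at random. Pooling libraries made with two enzymes that differ in the GC content of their recognition sites is one way to approach that assumption more closely.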
The libraries, library filters and individual clones have been distributed to a number of laboratories and are presently being used for the following studies: 1) walking to wing colour patterning genes from closely linked AFLP sequences in H. erato [17, 37]; 2) testing synteny among M. sexta, B. mori, and H. melpomene by chromosomal fluorescence in situ hybridization using BACs containing orthologous genes as probes [5, 38]; 3) analysis of full-length coding and regulatory regions for the M. sexta Broad gene (L. Riddiford, personal communication); and 4) analysis of H. virescens HR16 putative odor receptor sequences (F. Gould, personal communication).
The results of this study (Tables , and ) have provided a snapshot of the basic characteristics of the genomes of a group of ditrysian moths and butterflies which diverged from each other at least 50–60 million years ago [18]. First, the genomes of all three species are AT-rich (64–68%), with the genome of the butterfly (H. erato) having an AT content more than 3% higher than those of the moths (M. sexta and H. virescens). Second, the results show that all three insect genomes contain relatively small fractions of repeat elements (3–10%), including retro-transposons, transposons, simple repeats, and low complexity repeats. These results are in agreement with the small genomes of the species (400–500 Mb/1C), as small genomes generally tend to contain smaller fractions of repeat elements. Of these three insect species, the butterfly genome contains 3–5-fold more repeat elements (10.01% all repeats), especially low complexity repeats, than the two moth genomes. Papa reported that total repetitive sequences accounted for about 26% of the genomic regions linked to wing pattern variation in H. erato [37]. The difference could be an effect of more H. erato-specific repeats having been documented, sampling of a specific region with a higher average repeat density, or both. Third, whereas the three insect genomes all contain a small fraction (<1%) of retro-elements, DNA transposons and simple repeats, retro-elements seem much more abundant than DNA transposons, and the butterfly genome is two-fold richer in simple repeats than the two moth genomes. Compared with published information from
B. mori, the finding of such a low percentage of repeat content in these three lepidopteran species is surprising, especially for M. sexta, which is in the same superfamily as the silkworm, Bombycoidea. Xia et al. [9] estimated about 20% of the B. mori genome to be composed of "transposable elements"; further, early work based on Cot hybridization kinetics estimated about 45% of the silkworm genome to be composed of repetitive sequences [39]. More recently, Osanai-Futahashi et al. reported that TEs made up 35% of the silkworm genome and contributed greatly to the genome size [40]. One may argue that we might simply not have identified all the relevant repeats in the BESs, but our argument is supported by the following evidence. The genome of the butterfly, H. erato, contains extremely large numbers (1059 of 1364 hits) of small duplicated sequences or "novel repeats" (not registered in GenBank) which are homologous to three completely sequenced BAC clones (118 kb of AEHM-41C10, 112 kb of AEHM-46M10, and 118 kb of AEHM-7G12) of H. melpomene. This in turn indicates the presence of novel repetitive or duplicated sequences in the H. melpomene genome [see Additional file 5]. Large-scale end sequencing of the complete BAC libraries will uncover more detailed aspects of these butterfly and moth genomes, and provide more information for fundamental studies of lepidopteran insects in general.
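The AT content figures discussed above are base-composition statistics pooled over the sampled BESs. The Python sketch below shows one plausible way to compute such a figure; it is not the authors' pipeline. The two toy reads are hypothetical, and ambiguous bases such as N are excluded from the denominator, which is an assumption about how the statistic is defined.

```python
from typing import Iterable


def at_content(seq: str) -> float:
    """AT fraction of one sequence, ignoring ambiguous bases such as N."""
    seq = seq.upper()
    at = seq.count("A") + seq.count("T")
    gc = seq.count("G") + seq.count("C")
    if at + gc == 0:
        raise ValueError("sequence contains no unambiguous bases")
    return at / (at + gc)


def pooled_at_content(seqs: Iterable[str]) -> float:
    """AT fraction over a pool of reads, so longer reads contribute more bases."""
    at = gc = 0
    for seq in seqs:
        seq = seq.upper()
        at += seq.count("A") + seq.count("T")
        gc += seq.count("G") + seq.count("C")
    if at + gc == 0:
        raise ValueError("pool contains no unambiguous bases")
    return at / (at + gc)


# Hypothetical toy reads standing in for BAC end sequences.
toy_reads = ["ATATGC", "aattNNgc"]
print(f"pooled AT content: {100 * pooled_at_content(toy_reads):.1f}%")
```

Pooling base counts rather than averaging per-read percentages keeps short reads from being over-weighted.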
The BLAST analysis of the sampled BESs has also provided insights into the evolution of these insect genomes. It is not surprising that the top hits are to sequences of lepidopteran species, but it is quite surprising that the highest numbers of M. sexta BES hits were to the sequences of other animals and plants rather than to B. mori (Tables and ). This finding suggests that although all the genomes have undergone changes since the split from their most recent common ancestor, they may have done so along different trajectories, with the M. sexta genome retaining some sequences in common with plants and animals that have been either lost or modified to a greater extent in H. virescens and H. erato. Such a hypothesis can only be tested when more genomic data are available for these lepidopteran insects. Moreover, the BESs of the butterfly (H. erato) are well-matched only to the sequences of H. melpomene. This suggests that not only is the butterfly more closely related to H. melpomene than to the two moth species, as expected, but also that this group has diverged to a greater extent, resulting in a higher level of species- or evolutionary lineage-specific sequences. This argument is further supported by the finding that 27 of the 76 species having sequence matches to the BESs of H. erato (35.5%) were from other Lepidoptera. This number is 6% higher than that of M. sexta but 9% lower than that of H. virescens. By contrast, the total number of top hits to lepidopteran species for H. virescens BESs was 41, or 17 and 14 more than for M. sexta and H. erato, respectively. Therefore, the genome of H. virescens may be a better representative of the genomes of Lepidoptera as a whole (Table ).
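The counts in the preceding paragraph can be cross-checked with a few lines of arithmetic. The Python sketch below uses only numbers stated in the text (27 of 76 species for H. erato, 41 top hits for H. virescens, and the stated differences of 17 and 14); the variable names are illustrative, and the M. sexta count is derived from the stated difference rather than quoted directly.

```python
def percent(part: int, whole: int) -> float:
    """Return part/whole as a percentage."""
    return 100.0 * part / whole


# Values stated in the text.
he_lepidopteran, he_total_species = 27, 76
hv_lepidopteran = 41

# Values implied by "17 and 14 more than for M. sexta and H. erato".
ms_lepidopteran = hv_lepidopteran - 17
he_from_difference = hv_lepidopteran - 14

he_share = round(percent(he_lepidopteran, he_total_species), 1)

print(f"H. erato lepidopteran share: {he_share}%")
print(f"M. sexta lepidopteran top hits (derived): {ms_lepidopteran}")
print(f"H. erato count consistent with difference: {he_from_difference == he_lepidopteran}")
```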
One may argue that the RepeatMasker program might not mask the repeat sequences completely because of the limited number of repeat elements available in the public database; however, this does not appear to have affected the BLAST results significantly. For instance, B. mori is the lepidopteran species with the most sequence information in GenBank; however, we found markedly different numbers of hits, 23, 264 and 41, for the BESs of H. erato, H. virescens, and M. sexta, respectively (Table ). Moreover, there are many more Drosophila spp. sequences in GenBank than sequences for any other insect; however, we observed only limited numbers of Drosophila sequence hits: 25, 16 and 36 for the BESs of H. erato, H. virescens, and M. sexta, respectively (not shown). There were no large (≥ 200 bp) hits for any Drosophila sequence, and only one large (≥ 200 bp) hit each to Apis mellifera (honey bee) for the BESs of H. virescens and M. sexta, even though the honey bee genome is also fully sequenced [see Additional files 5, 6 and 7]. Similarly, a large number of top BLASTx hits were to protein sequences in lepidopteran species (12/32) or other insects (17/32), such as Acyrthosiphon pisum (pea aphid) and Tribolium castaneum (red flour beetle), of which relatively few were to Drosophila spp. (3/32).