Although our preliminary sequencing of the Megaselia scalaris genome resulted in extremely low coverage (between 0.05× and 0.10×), we were able to perform a number of bioinformatic analyses that helped characterize this genome and generate various genomic resources. We characterized numerous repetitive sequences in the genome, including some with homology to known elements and some that have not been characterized previously. Useful resources such as a nearly complete mitochondrial genome sequence and microsatellite markers were also easily developed from the GSS data. Moreover, partial sequences for hundreds of orthologs of Drosophila and Anopheles genes were generated.
An assumption implicit in some of our analyses is that the genome survey sequences studied are "random" segments from across the genome. We cannot exclude the possibility that certain regions of the genome were more or less likely to be surveyed due to features such as GC-content. Indeed, we observed that coverage of the mitochondrial genome was lower or absent in the most A-T rich regions. This bias may have resulted from the sequencing process itself or from issues with sample preparation and/or library generation. Nonetheless, the approaches here provide a first, albeit imperfect, approximation of various features of a previously unexplored genome, and several of our conclusions do not depend upon a truly random sampling of the genome.
M. scalaris was chosen for partial genome sequencing because of its interesting natural history and potential to become a model species in ecology and evolutionary biology. Previous work on M. scalaris has already revealed much about its ecology, development, sex-determination system, and life cycle [reviewed in [9]]. The species is widely distributed and many aspects of its ecology are peculiar. For example, M. scalaris larvae are notable for the wide range of organic matter on which they can feed, reportedly the widest range of any insect [9]. Because larvae are also facultative parasites, they can enter open wounds and therefore pose some threat to human health, especially in the developing world [17].
A complete M. scalaris genome sequence would also strengthen comparative and evolutionary genomic studies of the Dipterans. While there are completed genomes for 12 Drosophila species and the mosquitoes Anopheles gambiae and Aedes aegypti, no genome sequence is currently available for any Dipteran species outside of the Drosophilids and mosquitoes. A phorid fly such as M. scalaris would also serve as a good outgroup in comparative genomic studies of the Drosophilids. For example, the genome of M. scalaris could facilitate the identification of regulatory elements and the assessment of patterns of evolution, as has also recently been suggested for Tephritids [23].
Applications of Low-coverage Genome Sequence
We anticipate that researchers studying a wide range of non-model taxa will be drawn to newer, less-expensive genome sequencing technologies, often for generating microsatellites [2] or other markers [24] to survey population variability and connectivity, phylogenetic position, and other questions. Based on our study of M. scalaris, using 454 pyrosequencing to sequence genomic DNA appears to be an effective strategy for generating low-coverage sequence data, with read-lengths amenable for assembly or BLAST [25] analyses. Sequence reads also appear to be distributed throughout the genome, allowing for partial coverage of many functional elements and hundreds of orthologs of known genes. Thus, low-depth sequencing provides mostly new sequence and avoids the high redundancy seen in large-scale genome projects.
The ability to find repetitive sequences is another important test of the applicability of survey sequencing, since identifying and masking repetitive sequences can be crucial for accurately estimating genome coverage, identifying low-copy "gene space", and assembling large contigs. We identified over 100 M. scalaris transposable element copies by homology searches, most of which were LTR retroelements and non-LTR retrotransposons. These REs could be masked in future genomic work in M. scalaris, facilitating assembly of the short sequence reads obtained through 454 or other short-read sequencing. Low-coverage genome surveys therefore appear to be an effective way to identify repetitive sequences, as several previous studies have successfully identified repetitive sequences with low genome coverage in other systems [6].
While available programs like RepeatMasker (Smit and Green, unpublished data) and others can identify previously known REs, identifying novel REs in unassembled genomes remains problematic. Our REFinder.plx program was designed to quickly identify as many novel REs in unassembled genomes as possible. We further validated this program by applying it to comparable GSS from a species with a fully sequenced and assembled genome, Drosophila pseudoobscura, and identifying known elements. However, REFinder.plx was not designed to detect all classes of transposable elements and, because it works by assembling and identifying potentially repetitive sequences in contigs, it can only identify REs in tandem arrays. It should also be noted that our program was not designed to identify higher-order repeats or the exact boundaries of REs. Other programs for de novo detection of REs, such as ReAS [28] or ReRep [29], may provide better detection of other classes of repeats, such as interspersed elements, in low-coverage genome surveys. It is also possible that some REs we detected are hybrids of different elements or that some non-repetitive flanking ends of REs were incorporated. Nonetheless, REFinder.plx provides a useful starting point for characterizing the repetitive element content of a novel genome.
Because no attempt was made to separate mtDNA from nuclear DNA prior to sequencing, mtDNA sequences were present in high copy number, which allowed us to assemble most of the M. scalaris mt genome. Even more encouraging was that we were able to assemble a complete mt genome at 20× coverage from the D. pseudoobscura bogotana GSS reads. This suggests that low-coverage genome surveys can also be an easy way of obtaining mtDNA sequences for phylogenetic studies and markers for population genetic studies. The proportion of mitochondrial traces was 0.5% (648/129,080) for the M. scalaris GSS and 1.3% (1299/98,451) for D. p. bogotana, consistent with the estimated greater nuclear genome size of the former (330-540 megabases vs. 185 megabases [30]).
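The comparison above can be checked with a few lines of arithmetic. The sketch below is only illustrative: it assumes (our assumption, not a claim tested in this study) that the two species have similar mt genome lengths and similar mtDNA content per cell, so that the ratio of mitochondrial read fractions should roughly mirror the inverse ratio of nuclear genome sizes. The variable names are hypothetical.

```python
# Read counts reported in the text: (mitochondrial reads, total reads).
mt_reads = {
    "M. scalaris": (648, 129_080),
    "D. p. bogotana": (1_299, 98_451),
}

# Fraction of reads that are mitochondrial in each survey.
fractions = {species: mt / total for species, (mt, total) in mt_reads.items()}

# How much more mt-rich the D. p. bogotana survey is than the M. scalaris one.
ratio = fractions["D. p. bogotana"] / fractions["M. scalaris"]

# Nuclear genome size ratios implied by the estimates quoted in the text
# (330-540 Mb for M. scalaris vs. 185 Mb for D. pseudoobscura).
size_ratio_low = 330 / 185
size_ratio_high = 540 / 185

for species, fraction in fractions.items():
    print(f"{species}: {fraction:.2%} mitochondrial reads")
print(f"read-fraction ratio = {ratio:.2f}")
print(f"genome-size ratio range = {size_ratio_low:.2f} to {size_ratio_high:.2f}")
```

Under these assumptions the read-fraction ratio falls inside the range of genome-size ratios, which is all that "consistent with" is meant to convey here.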
While it would be helpful to know exactly how much sequence data is needed to completely cover a mt genome, this cannot be easily quantified. Based on a binomial distribution, the expected coverage of a target sequence given a certain depth of coverage or level of redundancy, R, can be approximated by the equation: $E(\mathrm{Coverage}) = 1 - e^{-R}$
Based on this relationship, for a 15 kb mt genome and a mean sequence read length of 200 bp, approximately 500 reads of mitochondrial sequence (R ≈ 6.7, for an expected coverage above 99.8%) are needed to obtain essentially full coverage. However, this approximation will not hold if sequence reads are nonrandomly distributed over the target sequence. For instance, a bias toward reads from G-C rich regions of the M. scalaris mt genome likely explains why we did not obtain the sequence of the A-T rich mitochondrial control region even though we recovered 648 mt sequence reads, far more than theory suggests are necessary. The amount of sequence required for full coverage of a mt genome therefore depends on biases in sequencing and DNA preparation, as well as biological differences among organisms (or even tissues) in mt copy number.
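As a worked illustration of the relationship E(Coverage) = 1 - e^(-R), the following minimal sketch computes the redundancy and expected coverage for several read counts, using the 15 kb mt genome and 200 bp mean read length quoted above. The function names are hypothetical, and the calculation assumes reads are placed uniformly at random, which the text notes does not hold for the A-T rich control region.

```python
import math

MT_GENOME_LENGTH = 15_000  # bp, approximate mt genome size used in the text
MEAN_READ_LENGTH = 200     # bp, mean read length used in the text


def redundancy(n_reads, read_length=MEAN_READ_LENGTH, target_length=MT_GENOME_LENGTH):
    """Mean depth of coverage R: total sequenced bases divided by target length."""
    return n_reads * read_length / target_length


def expected_coverage(r):
    """Expected fraction of the target covered at redundancy r: 1 - e^(-r)."""
    return 1.0 - math.exp(-r)


# 75 reads correspond to 1x redundancy; 500 is the figure from the text;
# 648 is the number of mt reads actually recovered for M. scalaris.
for n_reads in (75, 500, 648):
    r = redundancy(n_reads)
    print(f"{n_reads:4d} reads: R = {r:5.2f}, expected coverage = {expected_coverage(r):.4f}")
```

Even at 1× redundancy the expected coverage is only about 63%, which is why several-fold redundancy is needed before coverage approaches completeness.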
The point raised above for mt genome sequencing brings up a more general caveat for researchers using low-coverage GSS strategies. With low depths of coverage, the probability of obtaining complete coverage of any target sequence becomes exceedingly low. This holds true for coding sequences in the nuclear genome as well as organellar genomes. If specific sequences are the ultimate goal of genome sequencing, then more directed approaches would be more appropriate than our random GSS approach.
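To give a rough sense of how unlikely complete coverage of a target is at low depth, the sketch below runs a small Monte Carlo simulation. Everything about it is an illustrative assumption rather than part of our analysis: a hypothetical 100 kb linear "genome", a 1.5 kb target, fixed 200 bp reads placed uniformly at random, and a fixed random seed.

```python
import random


def prob_full_coverage(redundancy, target_len=1_500, read_len=200,
                       genome_len=100_000, trials=200, seed=1):
    """Estimate the probability that every base of a target is covered.

    All defaults are hypothetical values chosen for illustration only.
    """
    rng = random.Random(seed)
    n_reads = round(redundancy * genome_len / read_len)
    target_start = genome_len // 2
    target_end = target_start + target_len  # half-open interval [start, end)
    successes = 0
    for _ in range(trials):
        # Draw read start positions and keep only reads overlapping the target.
        starts = sorted(
            s for s in (rng.randrange(genome_len - read_len + 1) for _ in range(n_reads))
            if s < target_end and s + read_len > target_start
        )
        covered_to = target_start  # first target position not yet known to be covered
        for s in starts:
            if s > covered_to:
                break  # gap: position covered_to is not covered by any read
            covered_to = max(covered_to, s + read_len)
        if covered_to >= target_end:
            successes += 1
    return successes / trials


for r in (0.05, 0.10, 1.0, 5.0, 10.0):
    print(f"R = {r:5.2f}: estimated P(complete coverage) = {prob_full_coverage(r):.3f}")
```

At redundancies comparable to our survey, a target of this size is essentially never fully covered in the simulation, whereas at roughly 10× it almost always is, in line with the caveat above.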