We describe the draft genome of the microcrustacean Daphnia pulex, which is only 200 Mb and contains at least 30,907 genes. The high gene count is a consequence of an elevated rate of gene duplication resulting in tandem gene clusters. More than 1/3 of Daphnia’s genes have no detectable homologs in any other available proteome, and the most amplified gene families are specific to the Daphnia lineage. The co-expansion of gene families interacting within metabolic pathways suggests that the maintenance of duplicated genes is not random, and the analysis of gene expression under different environmental conditions reveals that numerous paralogs acquire divergent expression patterns soon after duplication. Daphnia-specific genes – including many additional loci within sequenced regions that are otherwise devoid of annotations – are the most responsive genes to ecological challenges.
Daphnia pulex, or the waterflea, is a keystone species of freshwater ecosystems – a principal grazer of algae, a primary forage for fish (1) and a sentinel of lentic inland waters. Their populations are defined by the boundaries of ponds and lakes, are sensitive to modern toxicants in the environment, and thus are used to assess the ecological impact of environmental change (2-3). Daphnia exhibit a range of context-dependent development of specialized phenotypes, such as switching between clonal and sexual reproduction in response to environmental conditions (4). They are phenotypically plastic, in that some species alter diurnal migration behavior and develop exaggerated morphological defenses in response to predators (5). Physiological responses to abiotic environmental fluctuations can include the rapid rise of hemoglobin levels when ambient oxygen levels fall (6). The genus Daphnia is speciose with multiple lineages independently colonizing and adapting to diverse habitats (7). Their short generation time, large brood sizes and ease of laboratory and field manipulation have assured their importance for setting regulatory standards by environmental protection agencies, for testing chemical safety, for monitoring water quality (2-3) and as a model for ecological and evolutionary research (8).
Daphnia pulex, as a crustacean arthropod, is the closest ally to the insects (9) and thus allows the cataloguing of genes that likely evolved in the pancrustacean ancestor of at least some lineages of insects and Crustacea (Fig. S1). Although the branchiopod D. pulex represents only a single crustacean lineage – which contains over 40,000 known species with striking levels of phenotypic diversity – the genus and its order (the Cladocera) date to the Permian (10).
Because Daphnia’s ecology is superbly understood, access to its genome sequence (Fig. S2; Table S1) allows studying environmental influences on gene functions in ways that are difficult in even the best-developed genomic model species. Traits observed in laboratories are likely a small subset of the phenotypic variation that is expressed in natural ecosystems, and a focus on laboratory studies may partly explain why over 50% of many eukaryotic genomes are without experimentally determined functional annotations (11).
The D. pulex genome was assembled using JAZZ (12) from 1,554,564 quality-filtered nuclear sequence reads (8.7-fold coverage) from a naturally inbred isoclonal daphniid dubbed “The Chosen One” (TCO; SOM I.1). The v1.1 draft genome assembly comprises 19,008 contigs arranged within 5,191 scaffolds that sum to a genome size of ~200 Mb (Table S2). Two-hundred-eighty scaffolds link to construct 118 super-scaffolds (Tables S3-4). Microsatellite markers (13) place 73 large scaffolds (73.9 Mb total) on the 12 chromosomes (Table S5). We estimate that the draft assembly is high quality and includes approximately 80% of Daphnia’s nuclear genome (SOM I.2; Tables S6-7; Figs. S3-5). We determine that 3,598 missing regions (59%) contain duplicated genes while others are heterochromatic regions, including the centromeres and telomeres. We estimate that 25% of the genome may be heterochromatic (Table S8, Fig. S6). The ends of D. pulex chromosomes appear to consist of long stretches of TTAGG repeats with flanking regions (30-40 Kb) internal to these repeats consisting of repetitive sequences, including at least two kinds of satellite sequences (SOM 1).
A minimum set of 30,907 protein-encoding genes was predicted for D. pulex, with 26,867 gene models having the following support (Tables S9-14; Fig. S7): (1) 145,578 ESTs from 37 separate conditions validating 10,578 genes; (2) whole-genome tiling microarrays examining gene expression under six different conditions that detect 186,269 Transcriptionally Active Regions (TARs) validating 57,294 exons from 14,135 genes (additional TARs suggest gene models not yet included within the minimum set); (3) similarity to proteins from other (non-daphniid) genomes that detects 19,641 D. pulex genes (BLAST e < 10⁻⁵); (4) 18,765 genes identified in protein similarity searches against a preliminary draft genome sequence for D. magna (SOM 2), which belongs to a separate subgenus (7); (5) more than 11,000 D. pulex peptide sequences detected by tandem mass spectrometry, of which 93% map to 1,273 gene models in the minimum set; (6) 716 highly conserved single-copy eukaryotic genes, of which D. pulex is missing only two (Table S15), confirming that expected genes are included in the assembly; and (7) 13,105 loci identified as paralogs by nucleotide sequence similarity searches for each predicted gene against the complete gene list (e < 10⁻²⁰). Measures of the relative rate of non-synonymous nucleotide substitutions to the substitution rate at synonymous sites (Ka/Ks) indicate that the paralogs within our gene set generally show evidence of purifying selection (Fig. S8).
To ensure that the gene-count was not inflated by the erroneous assembly of alleles of the same locus as unique gene copies, we conducted comparative genomic hybridizations of labeled TCO DNA on microarrays. We detected no correlation between the read coverage and the mean fluorescing units of probes representing genes (Fig. S9). Counts can also be inflated by inclusion of pseudogenes. However, manual annotations suggest that pseudogenes account for only 4-6% of large paralogous family members in Daphnia [see companion studies (14)].
Many non-protein-encoding genes were also identified in the D. pulex genome (SOM 3). Fifty miRNA genes are annotated and 27 are validated using tiling microarrays (Table S16; Fig. S10). We estimate 468 rRNA loci and find 3,798 transfer-RNA (tRNA) genes. As in Drosophila melanogaster and Caenorhabditis elegans, these loci are clustered (Fig. S11). Transposable elements constitute 9.4% of the assembled genome (Table S17) consisting of 275 families of retrotransposons (Class I) and DNA transposons (Class II) (Table S18). Intra-element pair-wise divergence among termini for intact elements of Long Terminal Repeat (LTR) retrotransposons ranges from 0-25.3% among the three superfamilies, BEL, gypsy and copia (averaging 2%), indicating many recent transpositions (Fig. S12).
Comparison with gene-structure statistics for insects, nematode and mouse reveals reduced intron size in Daphnia (Table S19; Fig. S13), resulting in a mean gene span approximately 1,000 bp shorter than the mean Drosophila gene length. However, average protein length is similar in these two species. Aside from introns, most other structures of the D. pulex genome are approximately equal in size or in number to those of the nematode, or exceed measurements in other species. The reduced intergenic regions compared to insects may partly be attributed to smaller repeated elements (Table S19).
The average length of EST-validated D. pulex introns is 170 bp; only 10% of introns are larger than 210 bp. The intron density of Daphnia pulex genes is similar to that of Apis mellifera, having >2× more introns per gene than Drosophila. Approximately 50% of introns are shared among respective orthologs in Daphnia and Apis (Tables S19-23; Fig. S14). The Daphnia lineage shows an estimated intron gain/loss ratio substantially greater than 1 (Table S24; Fig. S15). We estimate that 78% of these intron gains are unique to this lineage and that 22% occurred in parallel with gains in other lineages (Fig. S16).
Daphnia’s gene catalog shows more universal bilaterian genes than other arthropods (8,096; black in Fig. 1A) and thus shares the highest number of genes with human (Table S25). Only 1,383 genes (4.5%) appear pancrustacean (green in Fig. 1A). Remarkably, over 36% of the minimal set of D. pulex genes have no detectable homology to those in the other species (Fig. 1A), which can partly be explained by the disproportionate expansion of gene families distinctive to this crustacean lineage (χ2 = 450.55, p < 0.0001; Table S26; Fig. 1B) and fast divergence for some genes (enlarged beige fraction in Fig. 1A). A phylogenetic accounting of the expansions and contractions of all gene families within pancrustacean and representative deuterostome genomes (Tables S27-28) suggests a net increase in the number of paralogs within the lineage leading to Daphnia (Fig. 1C). By reconstructing gene-family histories across a phylogeny (SOM IV.2), we count 17,424 new and 1,079 lost genes in the branch leading to Daphnia. By contrast, the sum of inferred gains and losses along the longest series of branches in the insect phylogeny – originating from the shared pancrustacean ancestor with Daphnia – only reaches 8,981 gained loci with 3,040 gene losses. Therefore, the overall elevated Daphnia gene count appears to result from both gaining and retaining more genes.
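The gain and loss counts quoted above can be tallied directly. The short Python sketch below only restates those four numbers; the dictionary keys and function names are illustrative and do not come from the source.

```python
# Gene gain/loss counts quoted in the text for the two lineages.
lineages = {
    "Daphnia branch": {"gained": 17_424, "lost": 1_079},
    "longest insect path": {"gained": 8_981, "lost": 3_040},
}


def net_change(counts):
    """Net change in gene number: inferred gains minus inferred losses."""
    return counts["gained"] - counts["lost"]


def losses_per_gain(counts):
    """Inferred losses relative to inferred gains on the same lineage."""
    return counts["lost"] / counts["gained"]


for name, counts in lineages.items():
    print(
        f"{name}: net {net_change(counts):+,d} genes, "
        f"{losses_per_gain(counts):.3f} losses per gain"
    )
```

On these figures the Daphnia branch shows both the larger number of gains and the smaller number of losses per gain, which is the arithmetic behind "gaining and retaining more genes".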
To better understand gene duplication in the Daphnia genome, we examined the age distribution of gene duplicates by estimating Ks for 66,502 pair-wise combinations of paralogs showing >40% sequence similarity, and by comparing this distribution to that of 12,570 nematode and 64,783 human gene pairs (Fig. 1D). The single-pair duplicates within the youngest cohort (Ks < 0.01) suggest that D. pulex genes duplicate at a rate 3× greater than those measured for fly and nematode, and 30% greater than human, even when we exclude nearly identical gene copies that may be biased by gene conversion (Table S29; Figs. S17-18).
In the genomes of many species, new duplicate genes are found in clusters (Fig. S19) (15). The D. pulex genome shows ~ 20% (Table S30) of all genes tightly arranged in clusters of 3 to 80 paralogs, and with elevated numbers of tandemly duplicated genes at intervening intervals of 1,000 to 2,000 bp (Fig. S20). The age distribution and positioning of gene duplicates indicate that Daphnia has not experienced whole-genome duplication, but the genome is instead characterized by a high and historically steady rate of tandem duplication (Fig. 1D).
Nine gene families have expanded independently in Daphnia and aquatic lineages including vertebrates (Tables S26, S31). These include photo-reactive or photo-responsive gene families (cryptochromes, opsins, G proteins). The D. pulex genome shows 46 opsins (Table S32; Figs. S21-22) of which 42 derive from two rhabdomeric subfamilies, one ciliary pteropsin subfamily, and a newly discovered lineage that forms a sister group to rhabdomeric opsins that we name arthropsins (SOM 4). Arthropsins are ancestral to the chordate melanopsin lineage and thus appear to have been retained in Daphnia, despite their loss from all other available bilateral animal genomes. The expansion of these gene families suggests that adaptations to a more complex light regime in aquatic environments (16-17) can be influential in shaping the gene content of these organisms.
Tandemly duplicated gene clusters are predisposed to homogenization by gene conversion and unequal crossing-over (18). If common, concerted evolution can maintain sequence and functional similarities among paralogs. We examined copied DNA segments among all paralogs in the genome (SOM V.1) and observed that 47% of the genes show tracts of non-allelic gene conversion compared to 12-18% of genes in five Drosophila species (Tables S33-38; Figs. S23-24). Thus, concerted evolution is affecting more than 1 Mb (8%) of all protein-coding sequences in Daphnia, especially when duplicates are oriented on the same strand, with a similar conversion rate (converted pairs of paralogs/total pairs of paralogs analyzed) and number of events per pair as Drosophila. The greater proportion of converted genes in D. pulex is mainly attributed to the greater number of targets for gene conversion within the genome, including tandemly duplicated gene clusters with intervening genes. Conversion events in Daphnia are less common among the youngest duplicates, and within gene families containing only two paralogs.
One example of widespread gene conversion is found in the di-domain hemoglobin genes. Hemoglobin levels in the hemolymph of daphniids can rise by more than one order of magnitude in response to reduced oxygen availability in aquatic habitats, which fluctuates in diurnal and seasonal cycles (Fig. 2A). In Daphnia, a tandemly duplicated gene cluster of hemoglobin (Hb) genes contributes to the protein’s varying composition (19). We sequenced and assembled the full D. magna cluster to compare with the arrangement of eight clustered D. pulex hemoglobin genes (Figs. S25-27). (D. pulex also has three non-clustered Hb genes.) Notably, the two species show almost identical gene arrangements within an interval of ~23.5 Kb (Table S39) except for the obvious absence of Hb6 from the D. magna cluster (Fig. 2B). In both species, a non-coding RNA gene interrupts the cluster between Hb4 and Hb5, and hypoxia response elements plus ancillary sequences are preserved within the regulatory regions of each gene. Thus, the duplication and subsequent divergence of hemoglobins must have occurred prior to the divergence time of D. pulex and D. magna.
However, a phylogenetic analysis of protein-coding sequences (SOM V.2) suggests that most hemoglobin genes have duplicated independently within each species (Fig. 2C). A separate phylogenetic reconstruction using sequences from intergenic regions recovers a tree that is consistent with duplication prior to speciation (Fig. 2D). Because the support values at nodes for both trees are equally strong, we conclude that gene conversion tracts are homogenizing the protein coding regions. The hemoglobin gene clusters in both species are homologs because of ancestral gene duplications, yet the duplication history of genes is obfuscated by independent gene conversions facilitated by their ordered arrangement in the genomes.
Gene duplication is an important source of evolutionary novelty. After duplication, one copy is commonly disabled by mutation and becomes a pseudogene. This fate may be avoided if selection maintains both copies via gene dosage, novel function, or by subdividing the gene’s original function into multiple components (20). We conducted microarray experiments to determine the magnitude of functional divergence among paralogs, then traced (21) and tested (22) whether their patterns of gene transcription differ in 1 to 12 ecologically relevant conditions as a function of Ks (Table S40; SOM VI.1). As expected, many recent duplicates (Ks < 0.05) have indistinguishable gene expression patterns for the tested conditions (47%; Fig. 3A). Within many gene families, divergence in expression patterns correlates with age (Figs. S28-29). We found that long-wavelength opsins most similar in sequence have the same expression patterns (correlation > 0.9) but then diverge in their response to shared conditions as they age, at an estimated rate of 0.6% per 10% synonymous nucleotide substitutions. A similar pattern is observed for the di-domain hemoglobins, albeit with more rapid divergence in expression.
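As an illustration of how the similarity of two paralogs' expression patterns might be scored, the following Python sketch computes a Pearson correlation across conditions and applies the 0.9 cutoff quoted above. The choice of Pearson correlation is an assumption (the text does not name the correlation measure), and the expression values are hypothetical, not data from the study.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length expression profiles."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("profiles must have the same length (at least 2)")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant profile")
    return sxy / sqrt(sxx * syy)


# Hypothetical log2 fold changes for three paralogs across five conditions.
paralog_a = [1.2, -0.4, 0.8, 2.1, -1.0]
paralog_b = [1.0, -0.5, 0.9, 2.3, -0.8]   # tracks paralog_a closely
paralog_c = [-0.9, 1.1, 0.2, -1.5, 0.7]   # responds in the opposite direction

SAME_PATTERN_CUTOFF = 0.9  # threshold quoted in the text

for name, profile in (("b", paralog_b), ("c", paralog_c)):
    r = pearson(paralog_a, profile)
    verdict = "same pattern" if r > SAME_PATTERN_CUTOFF else "diverged"
    print(f"a vs {name}: r = {r:.2f} ({verdict})")
```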
In contrast to the steady expression divergence of many duplicates, we observed an equally large fraction of recently arisen paralogs – with nearly identical sequences – that differ in their expression in at least one condition (Fig. 3A). While we could confidently detect locus-specific expression for only a fraction of the youngest duplicates represented on the microarray (Table S40), a plot of the maximum difference in the expression response of paralogs to an identical condition suggests that, on average, newly duplicated genes may differ in expression by as much as 1.9-fold (Fig. 3B). These may be cases where new regulatory programs were created by the gene duplication itself, through a failure to copy regulatory elements or when a duplicate is integrated within a new genomic location (23).
Gene conversion, homogenizing non-regulatory nucleotide sequences, can contribute to this class of highly Differentially Expressed (DE) paralogs at low sequence divergence (Ks). We tested whether gene conversion accounts for the differences in the evolutionary rates of expression divergence by comparing duplicates (Ks < 2) on the basis of their structural arrangements in the genome (SOM VI.2). Neighboring paralogs within tandem gene clusters were just as likely to diverge in expression as dispersed duplicates outside of clusters (χ2 = 0.027, p = 0.87). Globally, gene conversion reduces the expression-level divergence of paralogs (χ2 = 11.9, p = 0.0005; Table S41), yet we detected no significant impact on the observed fractions of divergently expressed paralogs when we removed duplicated genes with signatures of gene conversion (Table S42). Although adjacent genes are often co-expressed (24), the local placement of genes within tandem gene clusters has no clear effect on gene expression divergence in D. pulex. We thus conclude that paralogs, even in tandem, frequently acquire divergent expression patterns at, or soon after, the time of duplication.
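The comparisons above are chi-square tests on counts of paralog pairs. As a minimal sketch, the Python below computes a Pearson chi-square statistic for a 2×2 table and its p-value under one degree of freedom. The single degree of freedom is an assumption (the text does not state it), although it gives values close to the two p-values quoted above. The counts in the example table are hypothetical.

```python
from math import erfc, sqrt


def chi2_2x2(a, b, c, d):
    """Pearson chi-square statistic (no continuity correction) for [[a, b], [c, d]]."""
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    if 0 in (row1, row2, col1, col2):
        raise ValueError("every row and column needs at least one observation")
    return n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)


def p_value_df1(chi2):
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    return erfc(sqrt(chi2 / 2.0))


# Statistics quoted in the text, converted to p-values assuming 1 degree of freedom.
for chi2 in (0.027, 11.9):
    print(f"chi2 = {chi2}: p = {p_value_df1(chi2):.4f}")

# Hypothetical counts: rows = tandem / dispersed pairs,
# columns = diverged / not diverged in expression.
stat = chi2_2x2(120, 130, 118, 132)
print(f"hypothetical table: chi2 = {stat:.3f}, p = {p_value_df1(stat):.2f}")
```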
To investigate the functional role of paralogs and their preservation, we examined interacting genes with known function. A total of 1,908 genes representing 563 enzymes were charted onto the global metabolic pathway for D. pulex by referencing the metabolic enzyme networks of three insects and four vertebrates (Fig. 4; SOM VII.1). Of these, 38 gene families were amplified in Pancrustacea, of which 32 are expanded in the lineage leading to Daphnia (Figs. S30-31; Tables S43-44). Half (19/38) of the amplified genes are non-randomly clustered within seven distinct pathways (p < 0.03 by exact binomial test and p < 0.03 by network permutation analysis; Fig. 4 panels A-G; Fig. S32). These data, showing co-expansion of genes within pathways, suggest that duplicated genes can be interdependent.
A study of the expression patterns of duplicated genes from this metabolic network (SOM VII.2) reveals greater average similarity between genes from co-expanding and interacting families (same KEGG map ID in Table S43) than between genes from non-associating families (t = 3.30, p = 0.0025). This pattern suggests non-independent functional divergence of expanding genes within pathways (e.g. Tables S45-48; Figs. S33-34). One example involves nine clades of fucosyltransferase paralogs that share 95% amino acid similarity (colored lines in Fig. S35) and have independently diversified to express seven transcriptional profiles shared with interacting glycosyltransferase paralogs. Such a pattern of co-divergence suggests a decoupling of duplication history and functional association. To test this prediction, we estimated the ratio of among-group variance to total variance in differential expression (Dst) for phylogroups of fucosyltransferase paralogs and for expression profile clusters (SOM VII.2). We detect no significant subdivision of expression patterns for fucosyltransferase paralogs based on phylogeny (blue nodes in Fig. S35; Dst = 0.0042, p = 0.89). By contrast, clusters based on transcriptional profiles, and including distantly related paralogs and interacting glycosyltransferase paralogs, show significant subdivision (Dst = 0.0836, p = 0.002).
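The Dst statistic is described above only as a ratio of among-group variance to total variance; the exact estimator is given in the SOM and is not reproduced here. One simple reading of that definition, offered as an assumption, is the among-group sum of squares divided by the total sum of squares, as in the Python sketch below. The function name and example values are hypothetical.

```python
def among_group_variance_ratio(groups):
    """Among-group sum of squares divided by total sum of squares.

    `groups` is a list of lists, one list of differential-expression
    values per group (for example, per phylogroup or per profile cluster).
    """
    if any(len(g) == 0 for g in groups):
        raise ValueError("every group needs at least one value")
    values = [v for g in groups for v in g]
    grand_mean = sum(values) / len(values)
    total_ss = sum((v - grand_mean) ** 2 for v in values)
    if total_ss == 0:
        raise ValueError("ratio is undefined when all values are identical")
    among_ss = sum(
        len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups
    )
    return among_ss / total_ss


# Hypothetical differential-expression values for six paralogs.
# Grouped by phylogeny, the groups look alike (ratio near 0).
by_phylogeny = [[0.2, 1.9, 1.1], [0.3, 2.0, 1.0]]
# Grouped by expression profile, the groups separate (ratio near 1).
by_profile = [[0.2, 0.3], [1.1, 1.0], [1.9, 2.0]]

print(f"phylogroups:      {among_group_variance_ratio(by_phylogeny):.3f}")
print(f"profile clusters: {among_group_variance_ratio(by_profile):.3f}")
```

The ratio lies between 0 (group means identical) and 1 (all variation lies between groups), so the same six values can score low or high depending on how they are partitioned, which is the contrast the text draws between phylogroups and profile clusters.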
The D. pulex genome contains many duplicated genes with unknown homology. Although this may diminish with the availability of more crustacean genomes, these unknown genes appear to play important roles in the animal’s ecology. ESTs from 37 cDNA libraries representing transcriptomes of daphniids exposed to biotic ecological challenges, abiotic ecological stressors and different life-history stages in laboratory environments (Table S10) show that genes unique to the Daphnia lineage, and genes that reside within tandemly duplicated gene clusters, are significantly over-represented within transcriptomes under ecological conditions (Fig. 5A; Table S49; χ2 = 265.1, p = 2.66 × 10⁻⁵⁸ and χ2 = 41.0, p = 1.23 × 10⁻⁹, respectively). Whole-genome tiling-expression microarray experiments show differential expression to be twice as frequent in genomic regions devoid of gene models (intergenic) when D. pulex are exposed to ecological conditions compared to conditions of life-history (Table S50; Fig. S36).
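The two over-representation statistics above can be checked numerically. The source does not state the degrees of freedom; the Python sketch below assumes two, for which the chi-square upper tail has the closed form exp(-χ2/2). Under that assumption the computed values agree with the quoted p-values to within the rounding of the statistics. The dictionary labels are illustrative.

```python
from math import exp


def p_value_df2(chi2):
    """Upper-tail probability of a chi-square variable with 2 degrees of freedom."""
    return exp(-chi2 / 2.0)


# (chi-square statistic, p-value) pairs as quoted in the text.
reported = {
    "Daphnia-lineage-specific genes": (265.1, 2.66e-58),
    "genes in tandem clusters": (41.0, 1.23e-9),
}

for label, (chi2, p_reported) in reported.items():
    p_computed = p_value_df2(chi2)
    print(f"{label}: computed {p_computed:.2e}, reported {p_reported:.2e}")
```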
We count 34,844 transcriptionally active regions (TARs) within unannotated regions of the genome; these show predictable exon-intron intervals, support additional gene models not yet included within the minimum set (TAR-genes, Table S12), and are condition-dependent in their regulation. When the differentially expressed genome is partitioned by experimental condition, between 72% and 85% of the transcriptome responds uniquely to one of the three conditions (Fig. 5B). In all, 73% of differential regulation under biotic or abiotic stressors requires additional gene models or extensions.
Daphnia pulex paralogs follow different evolutionary trajectories that are determined, in part, by their initial transcriptional expression patterns. At least half appear to acquire divergent expression patterns at or near the time of origin. Interacting and co-expanding genes can also appear to be co-diverging in their responses to environmental conditions. These observations suggest that the persistence of this distinctive class of functionally divergent gene duplicates is due to preservation by entrainment (PBE). Entrainment is defined as the process of increasing the initial probability of preserving a duplicated gene through its functional interaction with existing or newly interacting genes sharing regulatory programs. Because biological processes can be governed by interdependent regulation of interacting genes, there are three likely evolutionary outcomes for these interacting duplicated genes (Fig. 6). Genes with expression patterns unchanged at the time of duplication may continue to share the condition-specific regulation of existing interacting genes (Fig. 6A). In this scenario, selection for gene dosage may increase the probability that gene duplicates are preserved (25). Alternatively, duplicates may initially have divergent expression patterns, but have inappropriate transcriptional responses to environmental conditions, or lack appropriately co-regulated interacting genes (Fig. 6B). Duplicates within this category are most likely lost. In contrast, genes with divergent expression patterns at the time of duplication, yet with regulation sufficiently similar to the expression patterns of a different interacting gene, may have combined products that are beneficial under a distinct environmental condition (Fig. 6C). In this scenario, the likelihood for preservation of these new gene duplicates is increased. 
Thus, when genes are advantageous at the time of duplication, their coding regions are subject to purifying selection from the start, and are entrained to a distinct regulatory pattern dictated by condition-specific gene-gene interactions. Although the likelihood of converging on a beneficial gene expression profile near the time of duplication is very small, in the case of Daphnia, PBE is facilitated by the high rate of gene duplication, resulting in co-regulated interacting genes that can potentially define environment-specific transcriptomes, which may increase with the complexity of interactions between organisms and their environments.
In conclusion, by examining genome structure and the functional responses of genes to environmental conditions within species with tractable ecologies, we further our understanding of gene-environment interactions in an evolutionary context. Many responsive genes to ecological conditions have unknown function, and information from laboratory model species may be insufficient because of a lack of homology or experimentally demonstrated functions in response to the environment. Thus, ecological genomics requires empirical annotations of new genome sequences from a broader diversity of species, tested under a variety of natural conditions.
We thank Marvin Frazer (JGI), Peter Cherbas (CGB), Roland Green and Tsetska Takova (Roche NimbleGen, Inc.). The work conducted by the U.S. Department of Energy Joint Genome Institute (JGI) was supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC02-05CH11231, and in collaboration with the Daphnia Genomics Consortium (DGC). This project was also supported by major NSF grants 0221837 and 0328516, and NIH grant IR24GM07827401A1. Coordination infrastructure for the DGC is provided by The Center for Genomics and Bioinformatics (CGB) at Indiana University, which is supported in part by the METACyt Initiative of Indiana University, funded in part through a major grant from the Lilly Endowment, Inc. Additional contributions and acknowledgements are provided in the SOM. Our work benefits from, and contributes to, the Daphnia Genomics Consortium.
Daphnia pulex genome assembly V1.1 and annotations are deposited at DDBJ/EMBL/GenBank under the accession ACJG00000000. ESTs (FE274839-FE425949) are in GenBank. Microarray platforms GPL11200-GPL11201 and data GSE25823 are deposited at NCBI GEO.