|Home | About | Journals | Submit | Contact Us | Français|
The social amoebae are exceptional in their ability to alternate between unicellular and multicellular forms. Here we describe the genome of the best-studied member of this group, Dictyostelium discoideum. The gene-dense chromosomes encode ~12,500 predicted proteins, a high proportion of which have long repetitive amino acid tracts. There are many genes for polyketide synthases and ABC transporters, suggesting an extensive secondary metabolism for producing and exporting small molecules. The genome is rich in complex repeats, one class of which is clustered and may serve as centromeres. Partial copies of the extrachromosomal rDNA element are found at the ends of each chromosome, suggesting a novel telomere structure and the use of a common mechanism to maintain both the rDNA and chromosomal termini. A proteome-based phylogeny shows that the amoebozoa diverged from the animal/fungal lineage after the plant/animal split, but Dictyostelium appears to have retained more of the diversity of the ancestral genome than either of these two groups.
The amoebozoa are a richly diverse group of organisms whose genomes remain largely unexplored. The soil-dwelling social amoeba Dictyostelium discoideum has been actively studied for the past fifty years and has contributed greatly to our understanding of cellular motility, signalling and interaction1. For example, studies in Dictyostelium provided the first descriptions of a eukaryotic cell chemo-attractant and a cell-cell adhesion protein2, 3.
Dictyostelium amoebae inhabit forest soil consuming bacteria and yeast, which they track by chemotaxis. Starvation, however, prompts the solitary cells to aggregate and to develop as a true multicellular organism, producing a fruiting body comprised of a cellular, cellulosic stalk supporting a bolus of spores. Thus, Dictyostelium has evolved mechanisms that direct the differentiation of a homogeneous population of cells into distinct cell types, regulate the proportions between tissues and orchestrate the construction of an effective structure for the dispersal of spores4. Many of the genes necessary for these processes in Dictyostelium were also inherited by metazoa and fashioned through evolution for use within many different modes of development.
The amoebozoa are also noteworthy as representing one of the earliest branches from the last common ancestor of all eukaryotes. Each of the surviving branches of the crown group of eukaryotes provides an example of the ways in which the ancestral genome has been sculpted and adapted by lineage-specific gene duplication, divergence and deletion. Comparison between representatives of these branches promises to shed light not only on the nature and content of the ancestral eukaryotic genome, but on the diversity of ways in which its components have been adapted to meet the needs of complex organisms. The genome of Dictyostelium, as the first free-living protozoan to be fully sequenced, should be particularly informative for these analyses.
An international initiative to sequence the genome of Dictyostelium discoideum AX4 (references 5, 6) was launched in 1998. The high repeat-content and A+T-richness of the genome (the latter rendering large-insert bacterial clones unstable) posed severe challenges for sequencing and assembly. The response to these challenges was to use a whole-chromosome shotgun (WCS) strategy, partially purifying each chromosome electrophoretically and treating it as a separate project. This approach was supported by novel statistical tools to recover chromosome specificity from the impure WCS libraries, and by highly detailed HAPPY maps that provided a framework for sequence assembly. These approaches have enabled the completion of this difficult genome to a high standard, and are likely to be valuable in tackling the many other genomes which present challenges of composition and complexity.
To support sequence assembly, we made high-resolution maps of the chromosomes using HAPPY mapping7–9, which relies on analysing the sequence content of single DNA molecules prepared by limiting dilution. A total of 3902 markers selected mostly from the emerging shotgun data were mapped, and maps of all six chromosomes were assembled (see Methods; Table 1;Fig. SI 1; Table SI 1).
Two strategies were used to recover chromosome-specific data from impure WCS libraries (see Methods). The first - employed for chromosomes 1, 2 and 3 - used enrichment of the respective libraries as the main statistical indicator of the chromosomal assignment of contigs, and on HAPPY maps to guide assembly. The second - for chromosomes 4, 5 and most of 6 - used mapping data to chromosomally assign sequences initially, with detailed HAPPY maps being used to validate final assemblies. A 1508kb portion of chromosome 6 was sequenced as a pilot project using a combination of approaches (see Methods).
Repetitive tracts complicated assembly. For chromosomes 1, 2 and 3, inspection of polymorphisms, combined with HAPPY maps, allowed unambiguous assembly in many cases. For chromosomes 4, 5 and 6, low-coverage sequencing of AX4-derived YACs alleviated the problems by providing a local dataset within which the troublesome repeat element was present as a single copy. Nevertheless, some repeat tracts proved intractable and remain as gaps. Thirty-four unlinked (‘floating’) contigs of >1kb, totalling 225,339bp, remain unpositioned in the genome, but can be provisionally assigned to specific chromosomes based on their content of reads from the WCS libraries. Most or all of these floating contigs are bounded by repetitive regions. The chromosome 2 sequence in the current assembly supercedes that previously published9, having benefited from further HAPPY mapping and manual sequence finishing. The six chromosomal assemblies span 33,817kb (Table 1), including ~156kb in the form of clone-, sequence- and repeat-gaps. Assuming that the majority of floating contigs lie beyond the termini of the assemblies, the total genome size is estimated at 34,042,810bp. In estimating the completeness of the sequence, we note that of 967 well-characterized D. discoideum genes, 957 (99%) were found initially in the assemblies. Of the remaining ten, seven (cupE, trxA, trxB, trxC, staA, staB, cinB) have close matches, suggesting that their Genbank entries may contain errors or represent alternative alleles. Only three (fcpA, wasA and roco5) had no matches in the initial assemblies, though the first two of these were recovered by searches of unincorporated sequence followed by local reassembly. Of 133,168 ‘qualified’ D. discoideum AX4 ESTs (expressed sequence tags of >200bp and >20% G+C, and not matching mitochondrial sequence; reference 10 and H. Urushihara et al. unpublished), 128,207 (96.3%) are found in the assemblies (the higher proportion of missing sequences amongst the ESTs probably reflects the higher error rate inherent in EST data).
We conclude that the current assembly represents >>95% of chromosomal sequence (less than 1% of which is in floating contigs) and ≥99% of genes, the majority of missing sequence comprising complex or simple repeats. The most stringent test of the medium- to long-range accuracy of the assembly comes from comparison with the HAPPY maps. This is particularly true for chromosomes 4, 5 and 6, where HAPPY markers were used to nucleate contigs but not to guide their assembly or ordering, specifically to allow such a comparison to be made without circularity of argument. As can be seen, good agreement between map and sequence confirms the accuracy of the assembly (Fig. 1).
The genome is A+T-rich (77.57%) and of broadly uniform composition, apart from the more G+C-rich repeat-dense regions (Fig. 2). On a finer scale, nucleotide composition tracks the distribution of exons (see below). Amongst dinucleotides, CpG is under-represented, not just in absolute terms but relative to its isomer GpC (the former occurring only 62% as often as the latter). This bias normally reflects cytosine methylation at CpG sequences, promoting their mutation to TpG (which isoverrepresented relative to GpT by 38%). Hence, these observations suggest that cytosine methylation may occur in Dictyostelium, contrary to earlier findings11.
Simple sequence repeats (SSRs) are more abundant in Dictyostelium than in any other sequenced genome, comprising > 11% of bases (Fig. SI 2). In non-coding sequence, tracts of dinucleotides or longer motifs occur every 392bp on average and comprise 6.4% of the bases. There is a bias towards repeat units of 3–6 bases, whereas dinucleotide tracts predominate in most other genomes. Homopolymer tracts are also abundant, comprising a further 16% of non-coding sequence. The base composition of non-coding SSRs and homopolymer tracts (99.2% A+T) is even more biased than that of their surrounding sequence, suggesting that either selection or the mechanism of repeat expansion favours A+T-rich repeats.
Notably, SSRs are also abundant in protein-coding sequence, occurring on average every 724bp within exons. We consider these coding SSRs in further detail below in the context of proteins.
The genome is known to be rich in transposable elements (TEs)9, 12. Completion of the sequence confirms the earlier observation that TEs of the same type are clustered, suggesting their preferential insertion within similar resident elements. However, none of the elements appears to use a specific sequence as a target for insertion: they insert at random within other elements of the same type. Non-LTR (long terminal repeat) retrotransposons are known to insert next to tRNA genes: we find many such instances (Fig. 2), but again no specific sequences were identified as insertion targets.
The sequenced genome encodes 390 tRNAs, a number at the upper end of the eukaryotic spectrum (e.g. Plasmodium falciparum=43, Drosophila melanogaster=284, human=496). Allowing for the normal wobble rules in codon-anticodon pairing13, 14 every sense codon can be decoded, apart from the rare alanine codon GCG; we infer that the missing tRNA(s) lie in one or more gaps in the sequence. We also find a possible selenocysteine tRNA in the genome, as well as corresponding selenocysteine insertion targets in two predicted proteins (Supplementary Information; Fig. SI 3).
Dictyostelium, in common only with Acanthamoeba castellanii15, has been shown to lack certain apparently essential tRNAs in its mitochondrial genome16. It therefore seems likely that at least some chromosomally-encoded tRNAs (those for valine, threonine, asparagine and glycine, as well as one arginine and two serine tRNAs) are imported into the mitochondria.
Though the gross distribution of tRNAs is uniform, their organisation on a finer scale is striking: about 20% occur as pairs or triplets with identical anticodons (and usually 100% sequence identity), separated by <20kb and often by <5kb (Fig. 2). There are 41 such groups in the genome; a random distribution would produce few, if any. This pattern is unique amongst sequenced genomes, and suggests a wave of recent duplications. However, tRNA pairs are found in tandem, converging and diverging orientations with comparable frequencies, suggesting no straightforward duplication mechanism; nor is there usually duplication of extensive flanking sequences. Whether the preference of TRE elements for inserting adjacent to tRNAs is related to the large number and unusual distribution of the latter is unclear.
In Dictyostelium, rRNA genes lie on an 88-kb palindromic extrachromosomal element17, present at ~100 copies per nucleus (Fig. 2). Evidence exists also for chromosomal copies: at least the central 3.2 kb of the element is located17 on chromosome 4, whilst chromosome 2 carries both a partial rDNA sequence and a 5S rRNA pseudogene9, 18.
In this study, two unanchored contigs assigned to chromosomes 4/5 were found to contain junctions between rDNA sequences and complex repeats - each of which would confound attempts to extend the sequence and integrate these contigs into the assemblies. We postulate that these contigs represent the junctions between a ‘master copy’ of the rDNA and the remainder of chromosome 4 (Fig. 2). One contig contains sequence matching a region of G+C-rich repeats near the centre of the palindrome, whilst the other matches sequence near the tip of the palindrome arm, adjacent to the one unclosed gap in the rDNA element sequence17. This gap is believed to represent a tandem array of short repeats, probably added post-synthetically to the extrachromosomal elements.
The structure of this master copy suggests a mechanism for generating the extrachromosomal copies by a process of transcription, hairpin formation and second-strand synthesis (Fig. 2). This process would account for the complete absence of sequence variation between the two arms of the palindrome.
Centromeres mobilize eukaryotic chromosomes during cell division but vary widely in their structure and organisation19, making them difficult to identify. Each Dictyostelium chromosome carries a single cluster of repeats rich in DIRS (Dictyostelium intermediate repeat sequence) elements20, 21 near one end22, and this sole but striking structural consistency suggests that these clusters may serve as centromeres. Although the repetitive nature of the chromosomal termini impeded their assembly, most of the cluster on Chromosome 1 was assembled (Fig. 3) and shows a complex pattern of DIRS and related Skipper elements, each preferentially associated with others of the same type. Frequent insertions and partial deletions have created a mosaic with little long-range order.
In Dictyostelium cells demonstrating condensed chromosomes characteristic of mitosis, DIRS-element probes hybridise to one end of each chromosome (Fig. SI 4), consistent with the mapping data. DIRS-like elements in other species are more uniformly scattered along the chromosomes23 suggesting that their restricted distribution in Dictyostelium chromosomes is functionally important. Further, the DIRS-containing ends of the chromosomes cluster not only during mitosis, but also during interphase (Fig. SI 4), as has been observed for centromeres in Schizosaccharomyces pombe24.
No G+T-rich telomere-like motifs were identified in the sequence. However, earlier findings22 suggested that the chromosomes terminate in the same G+A-rich repeat motif which caps the extrachromosomal rDNA element. We therefore surveyed all shotgun sequence to identify reads containing a junction between complex repetitive elements and rDNA-like sequence. Only 556 such reads were identified, of which 221 could be built into 13 contigs which we refer to as 'C/R (complex-repeat/rDNA) junctions'.
Of the 13 junctions, two represented already-known regions lying internally in the chromosomal assemblies. Of the remaining 11, one had twice the sequence coverage of the others, suggesting that it represents two distinct but identical portions of the genome (a possibility supported by the fact that another two of the junctions differed from each other by only two bases). Hence, we infer that the 11 remaining contigs represent 12 distinct junctions between repetitive elements and rDNA-like sequences - potentially one for every chromosomal end.
Based on their content of sequence-reads from each of the whole-chromosome libraries, we assigned two of the C/R contigs to each of the chromosomes. Chromosomes 4 and 5 cannot be distinguished in this way but three junctions, including the one believed to be present as two copies, are assigned to this chromosome pair. The point in the rDNA palindrome which is represented differs from one junction to the next (Fig. SI 5), but several junctions fall at common parts of the palindrome. This may reflect a preference in the mechanism which forms or maintains the junctions, or may result from an homogenizing recombination between them or with other rDNA sequences. Certainly the low frequency of differences between the rDNA components of the junction fragments and the extrachromosomal rDNA element argues for some process that limits or rectifies mutation. At each junction, we see only the rDNA sequence that immediately adjoins the complex repeat, since further assembly is precluded by the multicopy nature of rDNA. Therefore we cannot tell whether each junctional rDNA sequence extends to the telomere-repeat-carrying tip of the rDNA palindrome sequence, nor whether other sequences lie beyond the rDNA components.
HAPPY mapping of markers derived from six of these C/R junctions confirmed not only the chromosomal assignments which had been made based on the origins of their component sequences, but also their locations at the termini of the mapped regions of the chromosomes. For the other junctions, the absence of unique sequence features precluded such mapping. Taken as a whole, this evidence strongly suggests that rDNA-like elements form part of the telomere structure in D. discoideum and that common mechanisms stabilise both the extrachromosomal rDNA element and the chromosomal termini.
Chromosome 2 of D. discoideum AX4 carries a perfect inverted 1.51Mb duplication (Fig. 2; references 9, 25, and this work). This duplication, containing 608 genes, is known25 to be absent from the wild-type isolate NC4 and from one of its direct descendents (AX2), but present in another (AX3); AX4 in turn is derived from AX3. The sequences adjoining the right-hand end of the duplication - a partial copy of a DIRS element (and a partial DDT-A element) and a region identical to part of the rDNA palindrome, both at about 3.74Mb (Fig. 2) – have been implicated in centromeric and telomeric functions, respectively, elsewhere in the genome.
We propose that this duplication arose from a 'breakage-fusion-bridge' cycle as first described in maize26, and since observed in many genomes. The nearby DIRS and rDNA components, in this view, represent abortive attempts to stabilize the halves of the broken chromosome by establishing new telomeres and centromeres, followed by re-fusion of the pieces to create a restored and enlarged chromosome (Fig. SI 6).
Chromosome 2 (the largest of the chromosomes, even discounting the duplication in AX4) may be prone to breakage: in the Bonner isolate of NC4, maintained in vegetative growth for 50 years, chromosome 2 is represented by two smaller fragments27. Comparison with more recent data22 indicates that the break-point in NC4-Bonner lies in the same region as the duplication in AX4, suggesting that NC4-Bonner underwent the early stages of this process, but that the chromosome fragments were stabilized and maintained after the initial breakage. Preliminary results (not shown) from HAPPY mapping also suggest that, whilst wild-type isolates V12M2 and NC4 both lack the duplication seen in AX4, NC4 may carry a duplication of ~300kb near the opposite end of chromosome 2.
Prediction of protein-coding genes (see Methods) was performed on the complete set of chromosomes and floating contigs (Table 2). In assessing the completeness and accuracy of the predictions, we find that of the 957 well-characterized D. discoideum genes that are present in the current sequence, 823 (86%) are predicted as transcripts with structures matching the experimentally determined ones. For a further 123 (13%), the predicted transcript differs from the experimentally determined one, about half of these differing only in their 5' boundary; the remaining 11 (1%), though present in the sequence, were not predicted as transcripts. Similarly, of the 128,207 qualified ESTs present in the current sequence, 127,097 (99.1%) fall within predicted transcripts. Combining our estimate of sequence coverage (above) with these estimates of the success of gene prediction, we infer that approximately 98% of all D. discoideum genes are present in the predicted set.
The level of over-prediction, conversely, is harder to estimate: prediction was performed generously to ensure that most true genes were represented. Of the 13,541 predicted proteins, 47.5% are represented by qualified ESTs, reflecting the inevitable bias in EST sampling. Amongst the shortest predicted proteins, fewer are represented by ESTs (e.g. 21% of those of <60 amino acids), due at least partly to a higher level of overprediction. Based on the simplifying assumption that 50% of all genes <100 amino acids are mis-predictions, we estimate the true number of genes at roughly 12,500. This number is closer to that seen in multicellular organisms than in most unicellular eukaryotes (Table 2). The same relative complexity is seen in the total numbers of amino acids encoded by each genome, which are less biased by shorter (and more dubious) predictions. Introns in Dictyostelium are few and short, and intergenic regions are small, producing a compact genome of which 62% encodes protein.
Genes are distributed approximately uniformly across the genome (Fig. 2). Although we do not see widespread clustering of genes with coordinated expression patterns (see Methods), we do find statistically significant (p<0.01) clusters of genes expressed predominantly at some developmental stages or in specific cell types (Fig. 2).
Codon usage in Dictyostelium favours codons of the form NNT or NNA over their NNG or NNC synonyms, the bias being even greater than in the A+T-rich Plasmodium genome. Comparison of tRNA and codon frequencies (Table SI 2) reveals a similar picture to that in human28 and other eukaryotes, suggesting that the same use is made of 'wobble' and of base modifications (for example, of adenine to inosine in some tRNAs) to expand the effective repertoire of tRNAs.
As in Plasmodium29, the extreme A+T-richness is reflected not just in the choice of synonymous codons, but also in the amino acid composition of the proteins. Amino acids encoded solely by codons of the form WWN (W=A or T; N=any base; these are Asn, Lys, Ile, Tyr and Phe), are much commoner in Dictyostelium proteins than in human ones; the reverse is true for those encoded solely by SSN codons (S=C or G; these are Pro, Arg, Ala and Gly).
The predicted gene set of Dictyostelium is rich in relatively recent gene duplications. Of the 13,498 predicted proteins analysed, 3663 fall into 889 families clustered by BLAST-P similarities of e<10−40. Most (538) families contain only two members, but 351 families contain between three and 81 proteins (Table SI 3). Hence, 2774 (20%) of all predicted proteins have arisen by relatively recent duplication, potentially accounting for much of Dictyostelium's excess over typical unicellular eukaryotes.
We tried to infer the mechanisms by which such duplications arise and propagate in the genome. Where members of a family are clustered on one chromosome, the physical distance between family members often (23 of 86 families examined) correlates strongly with their evolutionary divergence (Methods). Where a family is split between different chromosomes, members on the same chromosome are often (23 of 50 families examined) more related to each other than to members on different chromosomes; the reverse is never observed.
These findings suggest that three processes combine to account for most duplications in Dictyostelium: tandem duplication, local inversion, and inter-chromosomal exchange. In this model, gene families expand by tandem duplication of either single genes or blocks containing several consecutive genes, as in an earlier model30; inversions within these expanding clusters may reverse local gene order. An elegant illustration of these two processes is provided by a cluster of Acetyl-coA synthetases on chromosome 2 (Fig. 4). The third process - exchange of segments between chromosomes - may fragment these clusters at any stage. If such an interchromosomal exchange splits a gene family early in its expansion, then each of the two resulting sub-families has a long subsequent period of evolution independently of the other, so similarities will be greatest between genes on the same chromosome. If, conversely, the split occurs later then all family members - whether on the same or on different chromosomes - will tend to resemble each other equally closely. We cannot exclude the possibility of duplication occasionally creating a second copy of a gene, or group of genes, directly on a different chromosome from the first. However, all instances that we have examined can be accounted for without such intermolecular duplication.
Tandem repeats of trinucleotides (and of motifs of 6, 9, 12 etc bases) are unusually abundant in Dictyostelium exons, and naturally correspond to repeated sequences of amino acids. However, at the protein level the situation is even more extreme: there are many further amino-acid repeats which use different synonymous codons, and so do not arise from perfect nucleotide repeats. Amongst the predicted proteins, there are 9582 simple-sequence repeats of amino acids (homopolymers of length ≥10, or ≥5 consecutive repeats of a motif of two or more amino-acids). Of these, the most striking are polyasparagine and polyglutamine tracts of ≥20 residues, present in 2,091 of the predicted proteins. Also abundant are low-complexity regions such as QLQLQQQQQQQLQLQQ: there are 2379 tracts of ≥15 residues composed of only two different amino acids. In total, repeats or simple-sequence tracts of amino acids (even by these conservative definitions) occur in 34% of predicted proteins and encode 3.3% of all amino acids.
It seems likely that these repeats have arisen through nucleotide expansion, but have been selected at the protein level. Evidence for the latter is that any given trinucleotide repeat occurs predominantly in only one of the three reading frames. For example, the repeat ...ACAACAACAACA... is usually translated as polyglutamine ([CAA]n) rather than as polythreonine ([ACA]n) or polyasparagine ([AAC]n). Further evidence comes from the many trinucleotide repeats which have apparently mutated to produce only synonymous codons (e.g. ...GAT,GAC,GAT,GAT,GAC,..., translated as polyaspartate). Moreover, the distribution of repeats and simple-sequence tracts is non-random: most proteins either have no such features (66% of proteins) or have two or more (18% of proteins), suggesting that they are tolerated only in certain types of protein. The polyasparagine- and polyglutamine-containing proteins appear to be over-represented in protein kinases, lipid kinases, transcription factors, RNA helicases and mRNA binding proteins such as spliceosome components (Fig. SI 9). Protein kinases and transcription factors are also over-represented in the polyasparagine- and polyglutamine-containing proteins of S. cerevisiae, so it is possible that these homopolymers serve some functional role in these protein classes. A more detailed analysis of amino acid homopolymers is given in Supplementary Information (tables SI 4–6, Figs. SI 7–10).
The organisms that diverged from the last common ancestor of all eukaryotes followed different evolutionary paths, but all retained the basic properties of eukaryotic cells. Their genomes have been sculpted by chromosomal deletions and duplications that led to lineage specific gene family expansions, reductions and losses, as well as genes with novel functions32, 33. Our analysis of Dictyostelium’s proteome shows that similar mechanisms have shaped its genome, augmented by horizontal gene transfer from bacterial species.
Using morphological criteria, early workers were unsure whether to classify Dictyostelids as fungi or protozoa34. Molecular methods indicated that they were amoebozoa and also suggested that Dictyostelium diverged from the line leading to animals at about the same time as plants35, 36. A study of more than 100 proteins suggested that Dictyostelium diverged after the plant/animal split, but before the divergence of the fungi37. The recent finding of a gene fusion encoding three pyrimidine biosynthetic enzymes, shared only by Dictyostelium, fungi and metazoa, indicates that the amoebozoa are a true sister group of the fungi and metazoa38.
To examine the phylogeny of Dictyostelium on a genomic scale, we applied an improved method for predicting orthologous protein clusters to complete eukaryotic proteomes39 (for details, see Supplementary Information). The data were used to construct a phylogenetic tree that confirms the divergence of Dictyostelium along the branch leading to the metazoa soon after the plant/animal split (Fig. 5). Despite the earlier divergence of Dictyostelium, many of its proteins are more similar to human orthologues than are those of S. cerevisiae, probably due to higher rates of evolutionary change along the fungal lineage. Whether the greater similarity between amoebozoa and metazoa proteins translates into a generally higher degree of functional conservation between them compared to the fungi remains to be seen.
To examine shared functions, we defined eukaryote-specific Superfamily and Pfam protein domains and sorted them according to their presence or absence within 12 completely sequenced genomes to arrive at their distribution amongst the major organismal groups (see Supplementary Information; Fig. SI 11; Tables SI 7–10). The plants, metazoa, fungi and Dictyostelium all share 32% of the eukaryotic Pfam domains (Fig. 6). The protein domains present in Dictyostelium, metazoa and fungi, but absent in plants, are interesting because they probably arose soon after plants diverged and before Dictyostelium diverged from the line leading to animals. The major classes of domains in this group of proteins include those involved in small and large G-protein signalling (e.g., RGS proteins), cell cycle control and other domains involved in signalling (Tables SI 8 and SI 9). It also appears that glycogen storage and utilization arose as a metabolic strategy soon after the plant/animal divergence since glycogen synthetase seems to have appeared in this evolutionary interval.
Particularly striking are the cases where otherwise ubiquitous domains appear completely absent in one group or another. For instance, Dictyostelium appears to have lost the genes that encode collagen domains, the circadian rhythm control protein Timeless and basic helix-loop-helix transcription factors (Table SI 7). Metazoa, on the other hand, appear to have lost receptor histidine kinases that are common in bacteria, plants and fungi, while Dictyostelium has retained and expanded its complement to 14 members40.
An important motivation for sequencing the Dictyostelium genome was to aid the discovery of proteins that would facilitate studies of orthologues in human, with possible implications for human health. Although orthologues of human genes implicated in disease are of course present in many species, Dictyostelium provides a potentially valuable vehicle for studying their functions in a system which is experimentally tractable and intermediate in complexity between the yeasts and the higher multicellular eukaryotes. To assess the usefulness of Dictyostelium for investigating the functions of genes related to human disease we used the protein sequences of 287 confirmed human disease genes as queries and carried out a systematic search for putative orthologues in the Dictyostelium proteome41. At a stringent threshold value of E≤10−40, we identified 64 such proteins. Of these, 33 were similar in length to the human protein and had similarity extending over >70% of the two proteins (Table 3). The number of Dictyostelium orthologues of human disease genes is lower than in D. melanogaster or C. elegans but higher than in S. cerevisiae or S. pombe. Of the 33 putative orthologues of confirmed human disease genes in Dictyostelium, five are absent in both S. cerevisiae or S. pombe (E-value ≤ 10−30), a further four are absent from S. cerevisiae and two are not found in S. pombe.
The acquisition of genes by horizontal transfer from one species to another (HGT) has become increasingly recognized as a mechanism of genome evolution42–44. We identified eighteen potential HGTs by screening Dictyostelium protein domains that are similar to bacteria-specific Pfam domains and that have phyletic relationships consistent with HGT (see Supplementary Information). They encode protein domains that appear to have replaced functions, added new functions or evolved into novel functions (Table 4). The thy1 gene, which encodes an alternative form of thymidylate synthase (ThyX), appears to have replaced the endogenous gene: the conventional thymidylate synthase (ThyA) is not present45. Other HGT domains also have established functions, which are presumably retained and give Dictyostelium the ability to degrade bacterial cell walls (dipeptidase), scavenge iron (siderophore), or resist the toxic effects of tellurite in the soil (terD). Still other HGTs have become embedded within Dictyostelium genes that encode larger proteins. An example of this is the Cna B domain that is found within four large predicted proteins, one of which, colossin A, is predicted to be 1.2 MDa (Fig. SI 12).
Dictyostelium faces many complex ecological challenges in the soil. Amoebae, fungi and bacteria compete for limited resources in the soil while defending themselves against predation and toxins. For instance, the nematode C. elegans is a competitor for bacterial food and a predator of Dictyostelium amoebae, but also a potential dispersal agent for Dictyostelium spores46. Dictyostelium has expanded its repertoire of several protein classes that are likely to be crucial for such inter-species interactions and for survival and motility in this complex ecosystem.
A small number of natural products have already been identified from Dictyostelium, but the gene content suggests it is a prolific producer of such molecules. Some of them may act as signals during development, such as the dichloro-hexanophenone DIF-1, but others are likely to mediate currently unknown ecological interactions47. Many antibiotics and secondary metabolites destined for export are produced by polyketide synthases, modular proteins of around 3,000 amino acids48. We identified 43 putative polyketide synthases in Dictyostelium (see Supplementary Information). By contrast, S. cerevisiae completely lacks polyketide synthases and Neurospora crassa has only seven. In addition, two of the Dictyostelium proteins have an additional chalcone synthase domain, representing a type of polyketide synthase most typical of higher plants and found to be exclusively shared by Dictyostelium, fungi and plants. In addition to polyketide synthases, the predicted proteome has chlorinating and dechlorinating enzymes as well as O-methyl transferases, which could increase the diversity of natural products made. Thus, Dictyostelium appears to have a large secondary metabolism which warrants further investigation.
ATP-binding cassette (ABC) transporters are prevalent in the proteomes of soil microorganisms and are thought to provide resistance to xenobiotics through their ability to translocate small molecule substrates across membranes against a substantial concentration gradient49–52. There are 66 ABC transporters encoded by the genome, which can be classified according to the subfamilies defined in humans, ABCA-ABCG, based on domain arrangement and signature sequences53. At least twenty of them are expressed during growth and are probably involved in detoxification and the export of endogenous secondary metabolites.
Curiously, many of the predicted cellulose degrading enzymes in the proteome (see Supplementary Information) that have secretion signals are expressed in growing cells that do not produce cellulose54. The proteome also has one xylanase enzyme that can degrade the xylan polymers that are often found associated with the cellulose of higher plants. Perhaps Dictyostelium uses these enzymes to degrade plant tissue into particles that are then taken up by cells. These enzymes may also aid in the breakdown of cellulose-containing microorganisms upon which Dictyostelium feeds. Alternatively, these enzymes may promote the growth of bacteria that can serve as food, since Dictyostelium’s habitat also contains cellulose-degrading bacteria.
During both growth and development, Dictyostelium amoebae display motility that is characteristic of human leukocytes55. As a consequence, studies of Dictyostelium have contributed significantly to cytoskeleton research56. Dictyostelium’s survival depends on an ability to efficiently sense, track and consume soil bacteria using sophisticated systems for chemotaxis and phagocytosis. Its multicellular development depends on chemotactic aggregation of individual amoebae and the coordinated movement of thousands of cells during fruiting body morphogenesis. The proteome reveals an astonishing assortment of proteins that are used for robust, dynamic control of the cytoskeleton during these processes. As suggested from the functional parallels to human cells, these proteins are most similar to metazoan proteins in their variety and domain arrangements (Fig. 7; Table SI 11). Surprisingly, although the actin cytoskeleton has been studied for over twenty-five years, 71 putative actin-binding proteins apparently escaped classical methods of discovery. For example, actobindins had not been previously recognized in Dictyostelium. Curiously, the actin depolymerisation factor (ADF) and calponin homology (CH) domain proteins appear to have diversified by domain shuffling, a substantial fraction having domain combinations unique to Dictyostelium (Fig. SI 13; Table SI 12). In addition to 30 actin genes, there are also orthologues of all actin-related protein (ARP) classes present in mammals, as well as three founding members of a new class (Fig. SI 14).
Cytoskeletal remodelling during chemotaxis and phagocytosis is regulated by a considerable number of upstream signaling components. Of the 15 Rho family GTPases in Dictyostelium, some are clear Rac orthologues and one belongs to the RhoBTB subfamily57. However, the Cdc42 and Rho subfamilies characteristic of metazoa and fungi are absent, as are the Rho subfamily effector proteins. The activities of these GTPases are regulated by two members of the RhoGDI family, by components of ELMO1/DOCK180 complexes and by a surprisingly large number of proteins carrying RhoGEF and RhoGAP domains (>40 of each), most of which show domain compositions not found in other organisms. Remarkably, Dictyostelium appears to be the only lower eukaryote that possesses class I PI 3-kinases, which are at the crossroad of several critical signalling pathways (for details of the regulators and their effectors, see Table SI 13)58. The diverse array of these regulators and the discovery of many additional actin-binding proteins suggests that there are many aspects of cytoskeletal regulation that have yet to be explored.
The evolution of multicellularity was arguably as significant as the origin of the eukaryotic cell in enabling the diversification of life. The common unicellular ancestor of the crown group of organisms must have posessed the basic machinery to regulate nutrient uptake, metabolism, cellular defense and reproduction, and it is likely that these mechanisms were adapted to integrate the functions of cells in multicellular organisms. Dictyostelium achieved multicellularity through a different evolutionary route from the plants and animals, yet the ancestors of these respective groups most likely started with the same endowment of genes and faced the same problem of achieving cell specialization and tissue organisation.
When starved, Dictyostelium develops as a true multicellular organism, organizing distinct tissues within a motile slug and producing a fruiting body comprised of a cellular, cellulosic stalk supporting a bolus of spores4. Thus, Dictyostelium has evolved differentiated cell types and the ability to regulate their proportions and morphogenesis. A broad survey of proteins required for multicellular development shows that Dictyostelium has retained cell adhesion and signalling modules normally associated exclusively with animals, while the structural elements of the fruiting body and terminally differentiated cells clearly derive from the control of cellulose deposition and metabolism now associated with plants. The Dictyostelium genome offers a first glimpse of how multicellularity evolved in the amoebozoan lineage. In the following sections, we consider some of the systems which are particularly relevant to cellular differentiation and integration in a multicellular organism.
The needs of multicellular development add greatly to those of chemotaxis in demanding dynamically controlled and highly selective signalling systems. G-protein coupled cell surface receptors (GPCRs) form the basis of such systems in many species, allowing the detection of a variety of environmental and intra-organismal signals such as light, Ca2+, odorants, nucleotides and peptides. They are subdivided into six families which, despite their conserved secondary domain structure, do not share significant sequence similarity59. Until recently, in Dictyostelium only the seven CAR/CRL (cAMP receptor/ cAMP receptor-like) family GPCRs had been examined in detail60, 61. Surprisingly, a detailed search uncovered 48 additional putative GPCRs of which 43 can be grouped into the secretin (family 2), metabotropic glutamate/GABA B (family 3) and the frizzled/smoothened (family 5) families of receptors (Fig. 8; see also Supplementary Information). The presence of family 2, 3 and 5 receptors in Dictyostelium was surprising because they had been thought to be animal-specific. Their occurrence in Dictyostelium suggests that they arose before the divergence of the animals and fungi and were later lost in fungi and that the radiation of GPCRs predates the divergence of the animals and fungi. The putative secretin family is particularly interesting because these proteins were thought to be of relatively recent origin, appearing closer to the time of the divergence of animals62. The Dictyostelium protein does not contain the characteristic GPCR proteolytic site, but its transmembrane domains are clearly more closely related to secretin GPCRs than to other families (Fig. 8). Many downstream signalling components that transduce GPCR signals could also be recognized in the proteome, including heterotrimeric G-protein subunits (14 Gα, two Gβ and one Gγ proteins) and seven regulators of G-protein signalling (RGS), most similar to the R4 subfamily of mammalian RGS proteins.
In animals, SH2 domains act as regulatory modules of proteins in intracellular signalling cascades, interacting with phosphotyrosine-containing peptides in a sequence-specific manner. Dictyostelium is the only organism, outside the animal kingdom, where SH2 domain-phosphotyrosine signalling has been proven to occur63. What have been lacking in Dictyostelium are the other components of such signalling pathways - equivalents of the metazoan SH2 domain-containing receptors, adaptors and targeting proteins. Three newly predicted proteins are strong candidates for these roles (Fig. SI 15). One of them, CblA, is highly related to the metazoan cbl proto-oncogene product. This is entirely unexpected because it is the first time that a cbl homologue has been observed outside the animal kingdom. The Cbl protein is a “RING finger” ubiquitin-protein ligase that recognizes activated receptor tyrosine kinases and various molecular adaptors64. Remarkably, the Cbl SH2 domain went unrecognised in the protein sequence, but it was revealed when the crystal structure of the protein was determined65. Thus, although SH2 domain proteins are less prevalent in Dictyostelium, there is the potential for the kind of complex interactions that typify metazoan SH2 signalling pathways.
Dictyostelium, like other organisms, has adapted ABC transporters to control various developmental signalling events. Several ABC transporters, TagA, B and C, are used for peptide-based signalling, akin to that previously observed for mating in S. cerevisiae and antigen presentation in human T cells66–68. The novel domain arrangement of the Tag proteins, a serine protease domain fused to a single transporter domain, suggests that they have been selected for improved efficiency in signal production. Additional ABC transporters are needed for cell fate determination in Dictyostelium, suggesting that this ubiquitous protein family may be used in similar developmental contexts within many different species69.
Much cellular signal transduction involves the regulation of protein function through phosphorylation by protein kinases, often leading to the reprogramming of gene transcription in response to extracellular signals. The Dictyostelium proteome contains 295 predicted protein kinases, representing as wide a spectrum of kinase families as that observed in metazoa (Tables SI 14–16; Fig. SI 16). Given the presence of SH2 domain-based signalling it was surprising that no receptor tyrosine kinases could be recognized in the genome. However, Dictyostelium has a number of other receptor kinases such as the histidine kinases and a group of eight novel putative receptor serine/threonine kinases, which are involved in nutrient and starvation sensing70. Most of the ubiquitous families of transcription factors are represented in Dictyostelium, with the notable exception of the otherwise ubiquitous basic helix-loop-helix proteins (Table SI 17; Fig. SI 17). Compared to other eukaryotes, Dictyostelium appears to have fewer transcription factors relative to the total number of genes, suggesting that many transcription factors are yet to be defined, or that the activities of a smaller repertoire of factors are combined and controlled to achieve complex regulation (Table SI 18; Fig. SI 18).
Throughout Dictyostelium development, cells must modulate their adhesiveness to the substrate, to the extracellular matrix and to other cells in order to create tissues and carry out morphogenesis. To accomplish this, Dictyostelium uses a surprising number of components that have been normally only associated with animals. For example, disintegrin proteins regulate cell adhesiveness and differentiation in a number of metazoa and at least one Dictyostelium disintegrin, AmpA, is needed throughout development for cell fate specification71. We also identified distant relatives of vinculin and α-catenin – normally associated with adherens junctions - that support the idea that the epithelium-like sheet of cells that surrounds the stalk tube contains such junctions72. Consistent with this, the Dictyostelium genome encodes numerous proteins previously described as components of adherens junctions in metazoa like β-catenin (Aardvark), α-actinin, formins, VASP and myosinVII.
In animals, tandem repeats of immunoglobulin, cadherin, fibronectin III or E-set domains are often present in cell adhesion proteins, although their common protein fold predates the emergence of eukaryotes. EGF/Laminin domains are also found in adhesion proteins but, prior to the analysis of the Dictyostelium genome, no non-metazoan was known to have more than two EGF repeats in a single predicted protein. Dictyostelium has 61 predicted proteins containing repeated E-set or EGF/Laminin domains and many of these contain additional domains that suggest they have roles in cell adhesion or cell recognition, such as mannose-6-phosphate receptor, fibronectin III, or growth factor receptor domains and transmembrane domains (Fig. 9). In support of this idea, four of these proteins, LagC, LagD, AmpA and ComC, have been shown to be required for cell adhesion and signalling during development71, 73–75.
During development, Dictyostelium cells produce a number of cellulose-based structural elements. Dictyostelium slugs synthesize an extracellular matrix, or sheath, around themselves that is comprised of proteins and cellulose. Several of the smaller sheath proteins bind cellulose and are believed to have a role in slug migration, while the larger, cysteine-rich EcmA protein is essential for full integrity of the sheath and for establishing correct slug shape76, 77. During terminal differentiation, cellulose is deposited in the stalk and in the cell walls of the stalk and spore cells78–80. The first confirmed eukaryotic gene for cellulose synthase was discovered in Dictyostelium and this gene has since been recognized in many plants, N. crassa and the ascidian Ciona intestinalis81. The fungal and urochordate enzymes are more closely related to the Dictyostelium homologue than to plant or bacterial cellulose synthases, indicating that the common ancestor of fungi and animals carried a gene for cellulose synthase that was subsequently lost in most animals. The Dictyostelium genome encodes more than 40 additional proteins that are likely to be involved in cellulose synthesis or degradation and probably are involved in the production and remodelling of cellulose fibres of the slug sheath, stalk tube and cell walls (see Supplementary Information).
The fundamental similarities in cellular cooperation found in Dictyostelium and in the metazoa clearly resulted in a parallel positive selection for structural and regulatory genes required for cell motility, adhesion and signalling. Dictyostelium uses a set of signals and adhesion proteins that are distinct from those employed for similar purposes in metazoa but, like the metazoa, Dictyostelium has maintained a diversity of GPCRs, protein kinases and ABC transporters which enable it to respond to those signals. Dictyostelium has also retained and modified an organizational strategy perfected in plants, basing several structural elements on cellulose. At one level Dictyostelium has achieved multicellularity by employing strategies that are similar to plants and metazoa, but the differences between them suggest convergent evolution, rather than lineal descent from an ancestor with overt or latent multicellular capacities.
The complete protein repertoire of Dictyostelium provides a new perspective for studying its cellular and developmental biology. At a systems level, Dictyostelium provides a level of complexity that is greater than the yeasts, but much simpler than plants or animals. Thus, high-resolution molecular analyses in this system may reveal control networks that are difficult to study in more complex systems and may presage regulatory strategies used by higher organism82–84. At a practical level, the comparative genomics of Dictyostelium and related pathogens, such as Entamoeba histolytica, should aid in the functional definition of amoebozoa-specific genes that may open new avenues of research aimed at controlling amoebic diseases. Dictyostelium’s adeptness at hunting bacteria also renders it susceptible to infections by intracellular bacterial pathogens85, 86. Dictyostelium and human macrophages display fundamental similarities in their cell biology, which has spurred the use of Dictyostelium as a model host for bacterial pathogenesis. It is also an attractive model in which to study other disease processes: for a number of human disease-related proteins, it provides a test-bed for studying their functions in a model organism which has greater similarity to higher eukaryotes than do the yeasts, yet shares the latter’s experimental tractability.
The high frequency of repeated amino acids tracts in Dictyostelium proteins has long been known anecdotally, but we can now survey their precise nature and number and find them to be more abundant than in any other sequenced genome. Many human diseases result from the expansion of triplet nucleotide repeats, some of which encode polyglutamine tracts that cause cell degeneration87, 88. Learning how Dictyostelium cells tolerate so many proteins with amino acid homopolymers will, we hope, help to elucidate the roles of these motifs in protein function and dysfunction.
Comparative genomic studies in eukaryotes are providing the raw material for global examinations of the evolution of cellular regulation and developmental mechanisms31. Many genes have been lost in one species but retained in others such that each new genome sequence adds to our understanding of the genetic complement of the eukaryotic progenitor. Thus, our understanding of eukaryotes will continue to be refined as more genome sequences become available from representatives of large groups of organisms whose genomes remain largely unexplored, such as the amoebozoa. The surprising molecular diversity of the Dictyostelium proteome, which includes protein assemblages usually associated with fungi, plants or animals, suggests that their last common ancestor had a greater number of genes than had been previously appreciated.
Details on the availability of reagents can be found in the Supplementary Information. All analyses described here were performed on Version 2.0 of the genome sequence. Updates to the sequence and annotation are available at http://dictybase.org and http://www.genedb.org/genedb/dicty/index.jsp. Further details of analyses not explicitly described below can be found in the Supplementary Information.
A short-range (~100kb), high-resolution (+/−8.54kb) mapping panel was prepared as described9. Briefly, 96 aliquots each containing ±0.52 haploid genome equivalents of sheared AX4 genomic DNA were pre-amplified by PEP (primer extension pre-amplification89). A total of 4913 STS markers (Table SI 1) were typed by 2-phase hemi-nested PCR (multiplexed for up to 1200 markers in the first phase) on aliquots of the diluted PEP products. Maps were assembled from good-quality data essentially as described previously8. A second, longer-range (±150kb) mapping panel was used to confirm some linkages on chromosomes 2 and 5. HAPPY map analysis and PCR primer design for HAPPY mapping was performed using various custom programmes (PHD and ATB unpublished).
Genomic DNA from D. discoideum strain AX4 was prepared and separated by PFGE essentially as described27, 9, except that gels were run in stacked pairs; one member of each pair was stained with ethidium bromide, and bands excised from its unstained counterpart by alignment.
For WCS libraries, gel slices (above) were disrupted by several passages through a 30G syringe needle, digested with beta-agarase (NEB) and phenol-extracted. DNA was concentrated by ethanol precipitation, sonicated, end-blunted using mung bean nuclease and size-fractionated on 0.8% low melting-point agarose gels. Fractions of 1.4-2kb and 2-4kb were excised, DNA extracted as before and ligated into the SmaI site of pUC18 or pUC19. Clone propagation and template preparation followed standard protocols.
For YAC subclone libraries, AX4-derived YACs were identified (and their position and integrity confirmed) by screening the set described by Loomis et al22 using markers from the HAPPY map. Subclones were prepared from PFG-purified YACs essentially as for the WCS libraries; contaminating yeast-derived sequences were filtered out in silico.
Details of the sequencing and assembly methods can be found in Supplementary Information. Generally, mapped sequence features were used to nucleate sequence contigs assembled from the WCS data, and extended using read-pair information and iterative searches for overlapping sequences, followed by directed gap-closure using a range of approaches.
In situ hybridization was performed as in reference 17
Full details are provided in the Supplementary Information. Briefly, automated gene prediction was performed using a combination of programmes which had been trained on well-characterized D. discoideum genes, and the results integrated with reference to D. discoideum cDNA sequences and homology to genes in other species. Other features in the predicted proteins, and other sequence features, were identified using a variety of software packages.
Microarray targets54, 90, 91; and N. Van Driessche & G. Shaulsky unpublished) and gene models were mapped onto the genome sequence using BLAST92 and the modified LIS algorithm93. To look for clustering of genes with correlated temporal expression profiles, pairwise correlation coefficients were calculated for genes with known expression profiles on each chromosome91. Blocks of ≥6 consecutive genes were sought for which either (a) all pairwise correlation coefficients were positive and ≥70% were >0.2 (genes with similar developmental trajectories) or (b) each gene had a partner with an absolute correlation coefficient value of >0.6 (tightly co-regulated genes); no statistically significant clusters met these criteria.
To look for clustering of genes associated with specific developmental stages94, 95 or cell types90, 96, the genome was scanned with various sized windows97 for regions with significant (p<0.01) overrepresentation of genes in any one of these groups.
Predicted protein sequences were clustered using TribeMCL98, using a BLAST-P expectation of <10−40 as a cutoff. A χ-squared test invalidated the hypothesis that members of a family are randomly distributed in the genome. Within each family, protein divergences (similarity distances computed using the ‘Protdist’ module of PHYLIP; http://evolution.genetics.washington.edu/phylip.html) and physical intergenic distances between all pairs of family members were tabulated, and the correlation coefficient between the former and latter values was calculated. Analysis was performed on the 86 gene families (representing 155 gene pairs) with at least 10 intrachromosomal distance pairings to provide robust statistical confidence.
Other sequence analyses (nucleotide and dinucleotide composition; identification of simple-sequence repeats in nucleotide and protein sequence; coding density computation; tRNA cluster identification) was performed using a range of custom software (PHD and ATB unpublished). Graphical representation of chromosomes in Fig. 2 was done primarily using Cinema4D-8.5 (Maxon Computer GmbH) after pre-processing using custom software (PHD).
Sequencing and analysis of chromosomes 1, 2 and 3 was supported by grants from the DFG and by Köln Fortune, and that of chromosomes 4, 5 and 6 in the USA by grants from NICHD/NIH. Work in the UK/EU was supported by a program grant from the MRC to J.W., R.R.K., B.B. and P.H.D. and by the EU. Analyses at DictyBase were supported by a grant from the NIGMS/NIH to R.L.C. The Dictyostelium cDNA project was supported from Research for the Future of JSPS and from Grants-in-Aid for Scientific Research on Priority Areas of MEXT of Japan.
The German team wishes to thank Silke Förste, Nadine Zeisse, Sandra Rothe, Sabine Landmann, Regine Schultz, Christiane Neuhoff and Rolf Müller for expert technical assistance. The USA team is indebted to Steven Kaminsky, Steven Klein and Tyl Hewitt for their scientific foresight in the early stages of this project and for their support of Dictyostelium as a model system, and to Hailey Hosak, Oliver Delgado, Lora Lewis, Keelan Hamilton, Jennifer Hume, Christie Kovar Smith, Dearl Neal, Paul Havlak, K. James Durbin and Paula Burch of the HGSC at Baylor College of Medicine. R.S. thanks Laura Cortez, Emily Joyner and Barry Hill for their assistance during the course of the project. C.B. thanks Michel Veron and Philippe Glaser for their support and fruitful discussions. P.H.D. thanks Helen O’Hare for early involvement in the project, Al Ivens for helpful discussions, the MRC Centre Visual Aids department, Cambridge for advice on graphics. We thank Sharen Bowman and Daniel Lawson for their contribution to the EUDICT region in the initial stages of the project, and David Martin at the Wellcome Trust Biocentre, University of Dundee for running the GOtcha search of our gene models. The Japanese cDNA project deeply thanks Naotake Ogasawara of the Nara Institute of Science and Technology and Ikuo Takeuchi for valuable comments and encouragement, and others who participated in earlier stages of the project.
Correspondence should be sent to P.H.D. (ku.ca.mac.bml-crm@dhp). Chromosomal sequences are deposited under accession number AAFI00000000.
Competing interests statement
The authors declare that they have no competing financial interests.