Assembly 19, the diploid assembly of the genome of C. albicans strain SC5314, was a very important achievement. It provided a great deal of insight into many aspects of genomic organization, especially the large amount of heterozygosity [6]. The subsequent annotation of the assembly by the community demonstrated a number of important properties of the genome, including the number of genes (6,354), the number with introns (224), the frequency and characteristics of short tandem repeats, and the characteristics of several multigene families. Braun et al. also identified putative spurious genes and genes either on overlapping contigs or truncated by the end of contigs. However, they did not address chromosome location or try to join the 266 haploid contigs of Assembly 19 into chromosome-sized assemblies. Thus, although these two projects brought the C. albicans genome to a very useful state, they still left it incomplete, lacking chromosome-size contigs and with some genes in an ambiguous state. Subsequently, Chibana et al. completed the sequence of chromosome 7, identified 404 genes, and compared the synteny to the S. cerevisiae genome. They sequenced the MRSs and the gaps left in Assembly 19. They then aligned the sequence on the chromosome as determined by the physical map [8].
We undertook to complete the assembly (on a haploid basis) of all the chromosomes of C. albicans. We ordered and aligned the existing contigs along the chromosomes and filled in the gaps by reexamining the traces at the Stanford Genome Technology Center, by gap sequencing, or by using the emerging C. albicans WO-1 sequence to correct two regions of chromosomes 1 and 4. The assembly was also based on the STS fosmid map and on an optical map. As completed, the assembly consists of 16 supercontigs, interrupted on 5 chromosomes only by large blocks of repeated DNA. The contigs for chromosomes 6 and 7, for which the MRSs have been sequenced, have no gaps, while chromosome 3 has one gap, where adjacent contigs could not be joined. Chromosome 4 has two gaps, and chromosomes 5, 2, and 1 have one gap each, corresponding to the MRS. Chromosome R has two gaps, one for the MRS and one for the rDNA. Thus, there is only one gap in the unique sequence of the C. albicans genome that cannot be filled with sequence data from either SC5314 or WO-1. We identified 85 junctions that could not be filled with the original SC5314 sequence traces. We successfully amplified 82 of these (see Additional data file 1), and work is continuing to amplify the last three junctions and to produce 'SC5314-pure' sequence. We have used this assembly to determine the size of the various chromosomes and to examine several unique aspects of the genome, including the subtelomeric regions, the gene families, and evidence for chromosome rearrangements. Our ultimate objectives are to identify aspects of the genome that affect virulence and to increase our understanding of the evolutionary mechanisms that affect the genome of this fungal pathogen.
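The supercontig count follows from the gap tally given above: a chromosome with g gaps is divided into g + 1 pieces. The short Python sketch below reproduces that arithmetic, together with the junction bookkeeping. The dictionary layout and the names `GAPS`, `count_supercontigs`, `repeat_only_chromosomes`, and `unique_sequence_gaps` are illustrative choices, not part of the assembly procedure, and attributing both gaps on chromosome 4 to the MRS is an assumption read from the description above.

```python
# Gap causes per chromosome, transcribed from the description above.
# Assumption: both gaps on chromosome 4 are MRS gaps.
GAPS = {
    "R": ["MRS", "rDNA"],
    "1": ["MRS"],
    "2": ["MRS"],
    "3": ["unjoined contigs"],  # the single gap in unique sequence
    "4": ["MRS", "MRS"],
    "5": ["MRS"],
    "6": [],
    "7": [],
}
REPEAT_CAUSES = {"MRS", "rDNA"}


def count_supercontigs(gaps):
    """A chromosome with g gaps is split into g + 1 supercontigs."""
    return sum(len(causes) + 1 for causes in gaps.values())


def repeat_only_chromosomes(gaps):
    """Chromosomes that have gaps, all of them caused by repeated DNA."""
    return sorted(
        chrom for chrom, causes in gaps.items()
        if causes and all(c in REPEAT_CAUSES for c in causes)
    )


def unique_sequence_gaps(gaps):
    """Number of gaps that are not attributable to repeated DNA."""
    return sum(c not in REPEAT_CAUSES for causes in gaps.values() for c in causes)


supercontigs = count_supercontigs(GAPS)
repeat_only = repeat_only_chromosomes(GAPS)
unique_gaps = unique_sequence_gaps(GAPS)

# Junction bookkeeping from the text: 85 identified, 82 amplified.
junctions_remaining = 85 - 82

print(supercontigs, len(repeat_only), unique_gaps, junctions_remaining)
```

Under these assumptions the tally gives 16 supercontigs, 5 chromosomes interrupted only by repeated DNA, one gap in unique sequence, and three junctions still to be amplified, in agreement with the figures stated in the text.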
There are chromosome size discrepancies between Assembly 21 and the optical map; these are attributable to several causes. Where the Assembly 21 size is smaller than the optical map size, the explanation may be the missing MRS, missing telomere-associated sequences, or size heterozygosity between the homologues. For example, we know that on chromosome 5 the MRS is 50 kb in size [4], very close to the difference between the two estimates. Where the size determined by the optical map is smaller (chromosomes 2, 3, and 4), the difference seems most likely to be heterozygosity for insertions of retrotransposon-related sequences. In these cases, the optical map of this chromosome is probably derived from the smaller homologue. For chromosome 2, the discrepancy is rather large, given that this chromosome in Assembly 21 lacks the MRS and probably some telomere sequences. Interestingly, the size estimates in Jones et al. for the various chromosomes are remarkably close to the sizes determined by the optical map in Table .
One piece of information that comes out of our assembly is the similarity of the sequence of the C. dubliniensis genome to that of C. albicans. Although the karyotypes of these two organisms are quite divergent, the arrangement of genes within the chromosomes is similar enough to be of great assistance in mapping the contigs from Assembly 19. This bears out the evidence from the presence of MRS-like sequences and the ability to produce interspecies hybrids [30] that these two species are very closely related indeed. Other studies have shown that only about 4.4% (247) of C. albicans genes have less than 60% homology to C. dubliniensis. Our results suggest that intergenic regions also show regions of significant sequence conservation.
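As a rough consistency check on the figures quoted above, the count and the percentage together imply the size of the gene set that was compared. The Python sketch below is only back-of-the-envelope arithmetic. It assumes that "about 4.4%" is a value rounded to one decimal place, and the denominator it yields is an inference, not a number reported in the text.

```python
divergent_genes = 247      # genes with < 60% homology, from the text
reported_percent = 4.4     # "about 4.4%", from the text

# Point estimate of the number of genes compared.
implied_total = divergent_genes / (reported_percent / 100)

# Assumption: the percentage was rounded to one decimal place,
# so the underlying value lies between 4.35% and 4.45%.
low = divergent_genes / (4.45 / 100)
high = divergent_genes / (4.35 / 100)

print(f"implied genes compared: ~{implied_total:.0f} (range {low:.0f}-{high:.0f})")
```

The implied total falls below the 6,354 genes annotated in Assembly 19, which may mean that the comparison covered only a subset of genes; the text does not say.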
The amount of repeated DNA in C. albicans is significant. The MRSs were a major problem and their placement on the chromosomes required the physical and optical maps. Chromosomes 4 and 7 each have two MRSs forming an inverted repeat, and in principle the internal DNA fragment could invert via mitotic recombination. In strain SC5314 and its derivative, CAI-4, this inversion seems to occur very rarely, at least in the laboratory. In spite of the fact that most of the known translocations in C. albicans occur at the MRS, suggesting that this is a hot spot for recombination, there is no evidence on either chromosome for a flip of the bracketed sequence.
The specific sequences of six of the nine MRSs are unavailable. This is only a problem if sequence variation in the MRS plays a biological role, and there is no evidence that it does. In addition to the MRSs and the subtelomeric repeats, there are more than 350 LTR sequences belonging to 34 different families scattered throughout the genome [26], and several of these are found clustered at telomeres. The subtelomeric repeat CARE-2 contains an LTR called kappa [32], which is found at the 5' end of each member of the TLO gene family. Whether this is related to the expansion of this family to the telomeres is not clear. The repeated DNA led to misassembly of some contigs in Assembly 19, including chimeras, artifactual duplications, and omitted sequence. The two physical maps and the C. dubliniensis sequence were essential in sorting out these artifacts.
The numerous gene families in C. albicans distinguish it from S. cerevisiae. A very common feature of these families is a clustering of members on a particular chromosome, which might reflect an ontology wherein a single copy undergoes tandem duplication and then sequences diverge as function diverges. There are several instances where similar but oppositely oriented gene clusters suggest that an inverted duplication of a region larger than a gene has occurred (Figure ).
The model for gene family ontology of duplication followed by dispersion would predict that, in general, similarity should be related to proximity. The arrangements of the two families we examined in detail, the SAP family and the LIP family, raise some questions about this model. In only two cases are the most similar family members the closest neighbors (SAP6). However, the members clustered on one chromosome tend to be most closely related. For the LIP gene family, Hube and coworkers [33] showed that LIP5, and 9, on chromosome 7, form a group and LIP1, and 10, on chromosome 1, are a related but distinct group. LIP7, on chromosome R, is an outlier, only distantly related, while LIP4, on chromosome 6, fits with the chromosome 7 group. For the SAP gene family, SAP4, and 6 (chromosome 6) form a highly related cluster, while the rest of the group, on chromosomes R, 3, 4, and 6, form a loose association, with the highest similarity being between SAP2 on chromosome R and SAP1 on chromosome 6. These relationships suggest that the families originate on one chromosome and expand there, and when one member is duplicated on another chromosome, the pattern may or may not be repeated. The large number of gene families whose members are dispersed but not randomly would suggest that C. albicans is efficient at gene duplication at a distance. However, there are no hints of a specific mechanism in the sequence, such as homology between flanking sequences on different chromosomes or traces of mobile genetic elements. The relatively small number of highly similar ORFs suggests that the gene family members either diverged some time ago or are under strong selection to perform specific functions.
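The prediction that similarity tracks proximity can be phrased as a simple test: for each family member, find its most similar relative and ask whether the two share a chromosome. The Python sketch below illustrates the idea on a hypothetical five-gene family. The gene names, chromosome assignments, and similarity scores are invented for illustration and are not data from the SAP or LIP families, and the function name is likewise hypothetical. The sketch tests proximity only at the level of the chromosome, not the distance between neighboring genes.

```python
def closest_relative_on_same_chromosome(chromosome, similarity):
    """For each gene, report whether its most similar family member
    lies on the same chromosome.

    chromosome: dict mapping gene name -> chromosome label
    similarity: dict mapping frozenset({gene1, gene2}) -> similarity score
    """
    result = {}
    for gene in chromosome:
        others = [g for g in chromosome if g != gene]
        closest = max(others, key=lambda g: similarity[frozenset((gene, g))])
        result[gene] = chromosome[closest] == chromosome[gene]
    return result


# Hypothetical family: two clusters plus one dispersed member.
chromosome = {"geneA": "7", "geneB": "7", "geneC": "1", "geneD": "1", "geneE": "R"}

# Hypothetical pairwise similarity scores (fraction identity).
similarity = {
    frozenset(("geneA", "geneB")): 0.90,
    frozenset(("geneA", "geneC")): 0.60,
    frozenset(("geneA", "geneD")): 0.55,
    frozenset(("geneA", "geneE")): 0.40,
    frozenset(("geneB", "geneC")): 0.58,
    frozenset(("geneB", "geneD")): 0.57,
    frozenset(("geneB", "geneE")): 0.42,
    frozenset(("geneC", "geneD")): 0.85,
    frozenset(("geneC", "geneE")): 0.45,
    frozenset(("geneD", "geneE")): 0.43,
}

same_chromosome = closest_relative_on_same_chromosome(chromosome, similarity)
fraction = sum(same_chromosome.values()) / len(same_chromosome)
print(same_chromosome, fraction)
```

In this toy family the clustered members find their closest relative on the same chromosome, while the dispersed member does not. A fraction well below 1 in real data would count against a strict duplication-then-dispersion model.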
The TLO gene family is unique in C. albicans because it is found on every chromosome, and there are no closely adjacent members. This suggests that it arose by a mechanism different from, for example, the LIP family. One clue is that in all cases it is flanked on its 5' side by the LTR kappa [26]. It seems possible that it has moved via genomic rearrangements caused by the transposon for which kappa is the LTR. An alternative possibility is that this family dispersed by telomere recombination, which is relatively frequent in S. cerevisiae and has been shown to occur in C. albicans. There are no obvious subtelomeric repeats in C. albicans, in contrast to S. cerevisiae and C. glabrata.
The two subgroups of the TLO family are differentiated by the presence of an intron. On chromosome 1, there is an interior TLO gene, as well as one near each telomere. A plausible explanation for this arrangement is that a chromosome translocation has occurred, with DNA being added to the end of a smaller precursor of chromosome 1, followed by reconstitution of the telomere at the new end generated. There are only three genes in the emerging C. dubliniensis sequence with similarity to the TLO family, and they are not located at the telomeres. On chromosomes 1 and R in C. dubliniensis, the genes adjacent to the TLO family member are present and are several kilobases from the end of the assembled sequence, suggesting that the TLO gene absence is not due to missing telomere-proximal sequence. Since there are significant differences in virulence between C. albicans and C. dubliniensis, there may be a role for the TLO gene family in some aspect of pathogenesis.
The function of the TLO genes is unknown. Although a member of this gene family was isolated as a potential trans-activating protein (and named CTA2), based on a one-hybrid screen in S. cerevisiae, there is no evidence beyond those experiments as to function [27].
Assembly 21 will be of major importance as studies of the biology and virulence of C. albicans continue. It will provide the mapping information that has been lacking due to the absence of a sexual cycle, and it should stimulate experiments in areas as different as evolution and genome dynamics. Among the unsolved questions in the latter area are the detailed structure of the centromere and the function of the MRS. The presence of chromosomal aberrations in clinical isolates was demonstrated early [37], and several laboratory strains have recently been shown to be aneuploid [2]. Genome alterations have recently been shown to play an important role in drug resistance [5], and the complete sequence of each of the chromosomes may lead to the discovery of other changes that affect pathogenesis. Assembly 21 will also be useful for studying aneuploidy in C. albicans. Finally, this assembly provides an up-to-date listing of the genes of this important pathogen and will greatly aid its ongoing molecular analysis.