We begin by describing the sequencing procedure, followed by the results of our analyses at the nucleotide and protein levels. Data files containing the results of the analyses reported here are found as Supporting Information (the full set is Dataset S1
). The C. briggsae
sequence and analytic results are also available in interactive format from WormBase (http://www.wormbase.org
To sequence the C. briggsae genome, we adopted a hybrid strategy combining whole-genome shotgun sequencing (WGS) with a high-resolution, sequence-ready physical map. The previously finished clone-based sequence was then integrated with the whole-genome assembly to produce the draft sequence.
Fingerprint Map Construction
Physical mapping and sequencing was performed on C. briggsae
strain AF16 (Fodor et al. 1983
). We constructed a physical map for C. briggsae
using a high-throughput fingerprinting scheme (Marra et al. 1997
). Our substrates were two large-insert C. briggsae
libraries: a 6.5-fold coverage fosmid library (M. A. Marra, unpublished data) and a 10-fold coverage bacterial artificial chromosome (BAC) library (M. A. Marra and P. deJong, unpublished data).
After several rounds of automated assembly and manual review, we developed a physical map consisting of 188 fingerprinted contigs (FPCs) with a mean length of approximately 450 kbp containing 17,885 BAC clones and 16,414 fosmid clones. The longest FPC contig spans more than 4 Mbp of genomic sequence. Because there is little genetic mapping information for C. briggsae, these FPC contigs cannot currently be localized to chromosomal locations.
Shotgun Sequencing, Assembly, and Physical Map Integration
We performed WGS of small-insert plasmid libraries using the paired-end sequencing strategy introduced by Edwards et al. (1990
). This was supplemented with end sequencing of clones from the BAC library in order to facilitate integration of the assembled sequence with the physical map. In all, 2.068 million shotgun reads were assembled, giving more than 10-fold coverage of the C. briggsae
genome. A total of 82% of the shotgun reads were paired.
The WGS was then assembled using the Phusion assembler (Mullikin and Ning 2003
) into a set of 5,341 contigs; these contigs were linked together using the read pair information into a total of 899 gapped “supercontigs” containing 105.6 Mbp of DNA sequence and another 1.9 Mbp of inferred gaps. The N50
(the length x
such that 50% of the genome lies in blocks of x
or longer) of the contigs was 41 kbp. The supercontigs were substantially larger, as reflected in an N50
of 474 kbp. For reasons discussed below, the actual C. briggsae
genome may be slightly smaller than 105.6 Mbp.
To integrate the sequence assembly with the physical map, we performed a conceptual restriction digest of the 5,341 sequence contigs and incorporated them into the fingerprint map using automated assembly followed by manual review. During this process, misassemblies in both the physical map and the sequence assembly were detected and corrected.
Lastly, we integrated the sequence from 272 finished fosmid and BAC clones from many different regions of the C. briggsae genome. We used this finished sequence to fill 264 gaps in the assembly, adding 269 kbp of sequence from 155 finished clones.
In the final analysis, 463 sequence supercontigs from the WGS assembly were placed in 142 FPC contigs. These 142 supercontigs had an N50 length of 1,450 kbp and a total sequence length covering 102,431,873 bp, equivalent to 94% of the summed length of all supercontigs. A total of 436 sequence supercontigs could not be placed onto the FPC map. These unplaced supercontigs were relatively small (largest, 103 kbp; mean, 13.8 kbp; SD, 16.6 kbp; N50, 34 kbp) and covered a total of 6.0 Mbp.
Completeness and Accuracy of the Draft
To assess the completeness of the C. briggsae
draft sequence, we compared the WGS contigs and supercontigs to the 12 Mbp of previously finished sequence. We did this before incorporating the finished sequence into the assembly. The SSAHA algorithm (Ning et al. 2001
), a fast search method for large DNA databases, was used to match previously finished contigs to the corresponding regions of the WGS draft. From this, we estimate that the contigs cover 98% of the previously finished clone-based sequences and that the scaffolds span 99.3% of the clone-based sequence. Thus, we estimate that the draft covers 98% of the C. briggsae
Some 65,000 of the unassembled reads matched three or more other reads in the set. These represent 2.5% of the total reads, suggesting that 2.6 Mbp of additional sequence is represented in these unplaced reads. Among these read clusters are the 5S, ribosomal DNA (rDNA), and mitochondrial sequences. Assembly of these reads yields a 14,420 base mitochondrial sequence, a 7,429–7,431 base rDNA repeat unit, and a 697 and a 938–940 base 5S repeat unit. Based on the total mitochondrial reads, we estimate on average approximately ten copies of the mitochondrial genome per haploid nuclear genome. Most other clusters appear to contain tandem copies of low complexity sequence of the type found scattered throughout the C. elegans genome. Read pair and assembly information allow both the rDNA and 5S clusters to be placed in the current assembly. The rDNA sequences lie adjacent to telomeric repeats as they do in C. elegans. The two arrays of the 5S gene lie adjacent to each other.
To assess overall quality of the assembly and to evaluate the gap size estimate, we again used the comparison of the assembly to the finished clone sequence. We detected no global misassemblies. The 264 gaps that were filled with finished sequence contained a total estimated gap size of 179 kbp before filling. After gap filling, however, the total length of the assembly decreased by 323 kbp. Because of the change in the gap size estimates, we estimate that 3% of the bases in the assembly consist of undetected overlaps. Combined with the estimated coverage of 98%, this yields an estimate for the C. briggsae genome size of 104 Mbp. However, this figure contains substantial uncertainty, in part because the finished sequence is not a random selection of the genome, and this number is certain to be revised as additional finishing is performed.
We assessed sequence accuracy in two ways. First, we aligned the finished sequence to the final WGS assembly and counted discrepancies. Using this method, the accuracy is 99.98%. Second, we examined the consensus quality scores (Ewing and Green 1998
; P. Green, unpublished data) across the assembly. These data again suggest a sequencing accuracy of approximately 99.98%.
Content of the C. briggsae Genome
To characterize the C. briggsae genome, we began with the following analyses of the content of the C. briggsae genome: (1) identification of candidate protein-coding genes and development of a canonical gene set; (2) characterization of the predicted protein-coding gene set by domain analysis; (3) comparative analysis of the C. briggsae and C. elegans proteomes using the predicted protein-coding gene set; (4) identification and characterization of candidate noncoding genes; and (5) characterization of the repeat family content of the genome. In later sections we examine the C. briggsae and C. elegans genomes at the long-range, structural level and illustrate the utility of comparative analysis in aiding genome interpretation.
For all C. briggsae characterizations described in the remainder of this section, we used the cb25.agp8 assembly described earlier. With minor exceptions noted below, all comparisons that involved C. elegans used the C. elegans genome and annotations contained in WormBase release version WS77 (WS77), which was current as of April 2002.
Different protein-coding gene prediction algorithms generally have high concordance rates for predicting exons, but they tend to disagree on the grouping of exons into genes (Reese et al. 2000
). Hence, four different gene prediction programs may give four strikingly different answers across the same region of the genome. An increasingly common procedure for overcoming this problem is to predict gene structures using several ab initio gene prediction programs, to compare their output in order to find a representative prediction, and then to partially or fully confirm the structures with experimental data from expressed sequence tags (ESTs), sequenced cDNAs, or protein similarity matches (Goff et al. 2002
; Rogic et al. 2002
). To find protein-coding genes in C. briggsae
, we developed an enhancement of this procedure that uses the concordance of predictions between C. elegans
and C. briggsae
to predict the most likely gene model.
We predicted genes in the C. briggsae
genome using the programs Genefinder (version 980506; P. Green, unpublished data), FGENESH (Salamov and Solovyev 2000
), TWINSCAN (Korf et al. 2001
), and the Ensembl annotation pipeline (Clamp et al. 2003
). These programs combine a variety of gene prediction methodologies, including ab initio predictions (Genefinder, FGENESH), EST- and protein-based comparisons (Ensembl), and sequence conservation metrics (TWINSCAN). We called genes in the C. elegans
genome by combining hand-curated data models from WS77 (Stein et al. 2001
) with the ab initio predictions from Genefinder and FGENESH. For technical reasons, we were unable to run TWINSCAN on the C. elegans
genome, while the Ensembl methodology of predicting genes based on matches to previously predicted proteins meant that an Ensembl C. elegans
gene set would essentially be a duplicate of the hand-curated WormBase set.
As expected, the output of the four gene prediction programs was largely concordant with respect to the position of C. briggsae exons (80% of exons predicted identically by two or more programs, 26% predicted identically by all four programs), but discordant with respect to gene predictions (38% of genes called identically by two or more programs, just 4% called identically by all four programs). A similar pattern was seen in the genes called in C. elegans.
To select among overlapping predictions produced by different programs, we reasoned that the most likely gene model is the one that maximizes the similarity between the gene sets in two related species (). For each C. briggsae region that had multiple overlapping but inconsistent predictions, our selection procedure chose the prediction that had the most extensive similarity to the matching C. elegans prediction. The extent of similarity was measured by the fraction of the C. briggsae prediction that aligned to the matching C. elegans prediction at the protein level. Likewise, from all the predictions for a C. elegans gene, we chose the prediction having the most extensive similarity to its C. briggsae match. This selection step produced two gene sets, one each for C. briggsae and C. elegans. The gene sets were then filtered to remove transposons and putative pseudogenes.
Joint Refinement of C. elegans and C. briggsae Gene Models: acy-4
We call the gene sets produced by our procedure “hybrid” gene sets because the final gene sets are a mixture of gene predictions from multiple programs. The procedure selected predictions for both species simultaneously, yielding a C. briggsae hybrid gene set and a C. elegans hybrid gene set.
To assess the accuracy of the gene prediction programs in C. elegans, we made a “gold standard” set of C. elegans gene predictions, consisting of 2,257 genes from WS77 for which every base and intron–exon junction had been confirmed by cDNA or EST data. Genefinder made 2,309 predictions that overlapped a gold-standard gene, of which 1,280 (53%) contained all confirmed bases and introns. FGENESH made 2,742 predictions that overlapped a gold-standard gene, of which 1,230 (45%) contained all the confirmed data. We also used the gold standard to assess our selection procedure. For C. elegans genes in the gold-standard set, the selection procedure chose the correct gene model for 92% of gold-standard genes, choosing an alternative (incorrect FGENESH or Genefinder) model 8% of the time. We could not assess the accuracy of the gene prediction programs or selection procedure directly in C. briggsae because we lacked an independent dataset to create a gold standard.
The final C. briggsae gene set contains 19,507 genes, and the hybrid C. elegans gene set contains 20,621 genes. Some of the genes taken from WS77 have alternative splices, so the 20,621 C. elegans genes have 21,578 different splice variants. In the absence of substantial EST data, we are currently unable to call or comment on patterns of alternative splicing in C. briggsae.
In order to compare the C. briggsae and C. elegans hybrid gene sets to the C. elegans WS77 gene set, we also applied our transposon and pseudogene filtering step to the C. elegans WS77 gene set. This removed 619 genes to create a “pruned” WS77 set of 18,808 genes and 19,791 splices. This pruned set is henceforth called WS77*. An important caveat is that some of the predictions discarded by our filtering step may include real exons: 29 (9%) of the 316 putative pseudogenes in C. elegans WS77 that were discarded have been partially or fully confirmed by EST or cDNA data.
Files containing the C. briggsae
, C. elegans
hybrid, and C. elegans
WS77* gene predictions, their genomic positions, and their conceptual translations are available as Dataset S2
Comparing the C. briggsae and C. elegans Gene Sets
The C. briggsae gene set (19,507 genes), the C. elegans WS77* gene set (18,808 genes), and the C. elegans hybrid gene set (20,621 genes) all contain approximately the same number of genes. The recent WormBase release WS103 (June 2003; approximately 19,600 curated genes) also has a similar number. We next set about examining these sets in more detail.
The unspliced lengths of genes are roughly the same in the two species (C. briggsae median, 1.9 kbp; C. elegans WS77*, 1.9 kbp; ), and the total length of the C. briggsae genome occupied by the 19,507 genes, including their introns, is 56 Mbp (54% of the 102 Mbp assembly)—approximately the same fraction of the C. elegans genome occupied by the WS77* gene set. Thus, the larger size of the C. briggsae genome is not due to an increase in the number or size of protein-coding genes.
Comparison of the C. briggsae and C. elegans Protein-Coding Gene Sets
The C. elegans gene sets have slightly more introns than the C. briggsae hybrid set. Some of this difference might be due to the hand-curation of the WS77 gene set, since curation may add exons that were missed by gene prediction software. However, as shown in the C. briggsae/C. elegans Orthologs section below, a portion of the intron differences can be confirmed as exhibiting true evolutionary changes.
We examined codon usage for both the predicted gene sets as a whole and for C. briggsae/C. elegans ortholog pairs (described below) and found no substantial differences between the protein-coding gene codon usage in the two species. We then used the EMBOSS tool codcmp to test for a significant difference in codon usage. The GC content for the two species differed only slightly for the genome as a whole (35.4% for C. elegans versus 37.4% for C. briggsae) and for protein coding exons (42.7% versus 44.1%).
Domain and Gene Ontology Analysis
We used Pfam analysis to search the C. briggsae
gene set for functional domains and other known sequence motifs and to assign InterPro annotation to each such feature found (Zdobnov and Apweiler 2001
). These InterPro annotations were translated into Gene Ontology (GO) functional descriptions, and the descriptions were grouped into broader categories according to molecular function (13 categories) and biological process (9 categories) (“GOslim”; http://www.ebi.ac.uk/proteome
; Gene Ontology Consortium 2001
). We did not classify proteins by intracellular compartment due to the small number of genes in these categories for the C. elegans
Of 19,507 predicted proteins in the C. briggsae
dataset, 10,606 (54.4%) had one or more Pfam annotations, 9,829 were associated with one or more InterPro terms, and 6,526 of these could be assigned one or more GO terms. This annotation process touched 1,443 InterPro terms and 706 GO terms. The results of these classifications are available as Dataset S2
For comparison, we performed an identical analysis on the C. elegans proteins in the C. elegans hybrid gene set. Of the 20,621 proteins in this set (counting only the longest protein encoded by each alternatively spliced gene), 11,116 (53.9%) had Pfam annotations, 10,460 had InterPro annotations, and 6,696 of these could be assigned GO terms, touching 1,436 InterPro terms and 699 GO terms.
As expected, the distribution of classifications is roughly the same in both species (), with 75%–76% of classifiable proteins associated with metabolic processes, 15%–16% associated with transport, and 14%–15% associated with cell communication. The molecular function classification showed the majority of proteins classified as having ligand-binding or carrier functions (63%–66%), followed by enzymatic activity (48%) and nucleic acid binding (28%–30%). There is a 4% difference in the proportion of proteins classified as involved in signal transduction in C. elegans relative to C. briggsae. As will be discussed in a later section, this difference is largely due to the difference in the number of predicted chemosensory receptor proteins in the two genomes. The significance of other small differences, such as the approximately 1% increase in C. elegans in the proportion of proteins involved in cell motility, is not known.
Classification of Predicted C. briggsae and C. elegans Proteins into GOslim Categories
C. briggsae/C. elegans Orthologs
We searched for orthologs between the 19,507 C. briggsae genes and the 18,808 C. elegans WS77* genes. Although it is possible for a gene in one species to have multiple orthologs in another species if the gene has duplicated since the two species diverged, for this analysis we used the simpler definition of a pair of genes that have a common ancestor and are in a one-to-one correspondence between the two species.
We found orthologs by searching for C. briggsae/C. elegans
gene pairs that were each other's top BLASTP (Altschul et al. 1997
) match in the opposite species. We identified 11,255 such gene pairs. We then used conserved gene order (synteny) between the two species to identify nonreciprocal best matches that were supported by the positions of flanking orthologs. This procedure netted an additional 900 orthologs. The final set of 12,155 orthologs corresponds to 62% of the C. briggsae
gene set and 65% of the C. elegans
WS77* gene set.
To assess the accuracy of this ortholog definition, we compared the results obtained by this procedure to those obtained by building phylogenetic trees of the chemosensory receptor sra protein subfamily (see the Protein Families section below). Phylogenetic tree building identified eight ortholog pairs in this set, seven of which were called identically by the mutual-best BLASTP match procedure. The mutual-best BLASTP match procedure had one false negative and one false positive involving a recent C. briggsae gene amplification. As will be seen, the chemosensory receptor family represents the worst-case scenario of a family that is undergoing rapid evolutionary change. This provides conservative estimates of false positive and false negative rates for the mutual-best BLASTP match procedure of roughly 15%.
The median percent identity between orthologs at the protein level is 80% (mean, 75%; SD, 18%), which is similar to the level of divergence between mouse/human orthologs (median identity, 78.5%; Waterston et al. 2002
). The ortholog pairs are very similar in terms of exon length (median in both species, 0.15 kbp), coding length per gene (median, 1.14 kbp in C. elegans
versus 1.11 kbp in C. briggsae
), and gene length (median, 2.29 kbp in C. elegans
versus 2.19 kbp in C. briggsae
). However, orthologs are longer than the overall set of predicted genes (median length, 1.90 kbp in C. elegans
), which suggests that the nonorthologous gene set includes a population of truncated or split genes.
To pursue the earlier observation that C. elegans has more introns than C. briggsae, we searched for cases in which a C. elegans gene has an intron absent from its C. briggsae ortholog and vice versa. To do this, we aligned orthologous proteins and searched for cases where a single exon in one species aligned to two adjacent exons in the other species.
We found 6,579 species-specific introns among the 60,775 introns in the ortholog pairs: 4,379 C. elegans
-specific introns and 2,200 C. briggsae
-specific introns. This approximately 2-fold ratio agrees with that reported by Kent and Zahler (2000
) using a smaller dataset. Intron gains or losses have occurred at a rate of at least 0.5 per gene in the 80–110 million years (MY) since C. elegans
and C. briggsae
diverged (see the Estimating the C. briggsae/C. elegans
Divergence Date section below). This is similar to the arthropod rate of approximately one intron gain or loss per gene per 125 MY since Drosophila
diverged (Zdobnov et al. 2002
). In contrast, in mouse and human there have been fewer than 0.01 losses or gains per gene in 75 MY (Roy et al. 2003
). Since the average number of introns per gene is quite different among these species, this means that 9% of Caenorhabditis
introns are species-specific, in contrast to 50% of Anopheles/Drosophila
introns and only 0.05% of mouse/human introns. Thus, intron–exon structure has apparently evolved more rapidly in nematodes and arthropods than in chordates.
The list of ortholog pairs can be found as Dataset S3
Rate of Neutral Evolution and Estimates of Selective Pressure
Using the set of C. briggsae/C. elegans
ortholog pairs, we calculated the rates of nonsynonymous (KA
) and synonymous (KS
) amino acid substitutions, using a maximum likelihood (ML) algorithm that corrects for reversion events (Yang 1997
) and removing pairs where accurate estimates of KA
Orthologous genes identified by mutual-best BLASTP hits had an average KS
of 1.78 (SD, 0.62) synonymous substitutions per synonymous site and a KA
of 0.11 (SD 0.09) nonsynonymous substitutions per nonsynonymous site, while orthologous gene pairs identified by the combination of mutual-best BLASTP hits and colinearity had average KS
of 1.73 (SD, 0.68) and 0.12 (SD, 0.131), respectively. The corrected KS
rate is almost three times as high as that reported between mouse and human (KS
= 0.6; Waterston et al. 2002
), despite the fact that the apparent C. briggsae/C. briggsae
divergence date is only 5–45 MY before the Mus musculus/Homo sapiens
divergence date (see next section). However, the reported KA
ratio for mouse/human (0.115) is similar to the ratio that we see in C. briggsae/C. elegans
(approximately 0.06), arguing that the levels of purifying selection are similar.
ratio, a measure of selective pressure, is expected to be 1.0 for genes that are under no selective pressure (for example, pseudogenes), less than 1.0 for genes under purifying selection, and greater than 1.0 for genes under positive selection. As expected, we found that nearly all the genes in ortholog pairs are under purifying selection (). However, the extent of this purifying selection is more marked in genes with essential functions. For example, ortholog pairs which exhibit an embryonic lethal phenotype in systematic RNA inhibition (RNAi) screens of the C. elegans
ortholog partner (Maeda et al. 2001
; Piano and Gunsalus 2002
; Kamath et al. 2003
) show a markedly lower KA
ratio than do pairs for which a wild-type phenotype was observed (KA
, 0.0445 ± 0.0340 versus 0.0627 ± 0.0494; p
< 1 × 10−16
by the Welch two-sample t
-test). We also confirmed the findings of Castillo-Davis and Hartl (2002
), who showed that genes involved early in development tend to be less prone to duplications and have a lower KS
value than late-development genes (data not shown).
Distribution of KA/KS Ratio among Ortholog Pairs
The trend towards higher levels of purifying selection in essential genes has been seen in organisms as distantly related as prokaryotes. For example, Jordan et al. (2002
) found similar differences when comparing rates of evolution of essential and nonessential genes in Escherichia coli
, Helicobacter pylori
, and Neisseria meningitidis
We also looked for genes with evidence of positive selection. Civetta and Singh (1998
) reported generally higher KA
ratios for sex determination genes tra-1
than for other sampled genes and interpreted this as positive selection on genes that are involved in speciation. Under the strict definition of positive selection, where KA
> 1.0, we do not find evidence for positive selection in these genes, but given the high value of substitution between the two species, genes with a ratio greater than 1.0 may no longer be recognizable as orthologs. Perhaps genes under positive selection might be found among members of gene families or in genes that are no longer recognizably similar at the primary sequence level. Sequence analysis of species of intermediate evolutionary distance, if available, would be revealing. Also, since our tests for positive selection are very conservative, we would not detect any sites under positive selection if they are small in proportion to those of the gene under purifying selection (Yang et al. 2000
Estimating the C. briggsae/C. elegans Divergence Date
Using the divergence of the nematodes from the arthropods 800–1,000 MYA (Blaxter 1998
) to calibrate the molecular clock, we estimated the C. briggsae/C. elegans
divergence date from 338 sets of orthologs. Each set comprised a C. elegans
gene and its one-to-one orthologs from C. briggsae
, A. gambiae
, and H. sapiens
. When the nematode/arthropod divergence is taken to be 800 MYA, a 95% confidence interval for the median C. briggsae/C. elegans
speciation date is 78–90 MYA. If the nematode/arthropod divergence is taken to be 1,000 MYA, the interval becomes 97–113 MYA.
Our best estimate of the C. briggsae/C. elegans
speciation date is therefore approximately 80–110 MYA. This confidence interval is tighter than a previous estimate of 50–120 MYA made using 92 sets of orthologs from the then available C. briggsae
genome (Coghlan and Wolfe 2002
). The current estimate is probably more accurate due to both a larger sample size and improved C. briggsae
gene predictions and ortholog assignments. Interestingly, recent studies date the mouse/human divergence to 65–75 MYA (Waterston et al. 2002
), so the date of the C. briggsae/C. elegans
divergence was between 5 and 45 MY before the rodent/primate divergence.
C. briggsae/C. elegans Paralogs and Orphans
In this section, we look in more detail at those C. briggsae proteins that could not be assigned to C. elegans orthologs. Roughly, a third of the C. elegans and C. briggsae proteins fall into this category. Of these, 4,545 (23%) C. elegans (WS77*) genes and 5,211 (28%) C. briggsae proteins have multiple BLASTP matches in the opposite species. These correspond to paralogous relationships within gene families and are examined in greater detail in the next section (Gene Families).
The remaining 2,108 (11%) C. elegans and 2,141 (11%) C. briggsae genes do not have any BLASTP hit of E-value <10−10 in the opposite genome and therefore represent candidate species-specific genes, or “orphans.” However, many of these are simply genes that have evolved rapidly. Lowering the BLASTP threshold to E-value <10−5 finds 785 C. briggsae proteins that have a weak C. elegans match. An additional 11 proteins have a strong TBLASTN match to the C. elegans genomic sequence, signifying either a C. elegans gene that is missing from the predicted gene set or a pseudogene. These are being examined individually. Another 538 C. briggsae genes were found by TRIBE to belong to shared C. briggsae/C. elegans gene families (see the Gene Families section below) and so are members of rapidly evolving families common to both species.
This leaves 807 C. briggsae proteins that have no BLASTP match (E-value, <10−5) in the opposite species and that do not belong to a shared C. briggsae/C. elegans gene family. A similar analysis yields 1,061 C. elegans orphans. Of these, 695 C. briggsae genes and 963 C. elegans genes have at least two exons and so are less likely than single-exon predictions to be pseudogenes or mispredictions. Of the C. elegans orphans, 208 (22%) have partial or full empirical confirmation of their gene structures in the form of ESTs or cDNA data.
Some of these orphans may be novel genes that have been generated in one of the two genomes since the species diverged (Long 2001
). However, we emphasize that some of the candidate orphans may not be real orphans at all, but are either pseudogenes that have not yet been deleted or are very rapidly evolving genes that have diverged so quickly that the BLAST and Smith–Waterman algorithms (used in the Gene Families section) cannot recognize their cross-species matches. In either case, it will be fruitful to look at the orphans in more detail because they are likely to reveal sites of rapid evolution.
A list of the candidate orphans is available as Dataset S3
In order to identify protein family structure in C. briggsae
while simultaneously identifying conserved families in both C. elegans
and C. briggsae
, we performed all-against-all Smith–Waterman (Smith and Waterman 1981
) alignments of the combined C. briggsae
and C. elegans
(WS77*) predicted protein sets. The pairwise protein similarity data were then used to cluster proteins with the TRIBE-MCL software (Enright et al. 2002
). To characterize the clusters, we correlated clusters with the protein family domain analysis of C. briggsae
and C. elegans
proteins described earlier and kept the domain descriptions that were held in common by the majority of cluster members.
The TRIBE-MCL analysis produced 7,778 clusters, 2,169 of which contained more than two genes. The largest cluster contained 775 genes with a eukaryotic protein kinase protein domain. The next largest clusters were associated with Zn finger, C4-type steroid receptor and ligand-binding domain of nuclear hormone receptor, 7TM receptors, and EGF-like domain. We found 852 clusters of more than two genes (comprising 4,567 genes in total) contained no identifiable Pfam domains. The top ten largest gene families with their percentage of C. elegans and C. briggsae proteins are shown in .
Top Ten Protein Clusters by Size in C. elegans and C. briggsae
While the great majority of TRIBE-MCL clusters had a balanced number of C. elegans and C. briggsae members, we found several interesting exceptions. For clusters of more than two proteins, a total of 328 clusters were made up of only C. briggsae genes; 283 of these did not have identifiable Pfam domains. Similarly there were 111 clusters that contained C. elegans proteins exclusively, 98 of which had no identifiable Pfam domains. Although some of these putative families are undoubtedly artifacts from the gene-calling procedure, some may be true species-specific families, at least at the limit of the sequence similarity criteria used here (see also the C. briggsae/C. elegans Paralogs and Orphans section above).
In other cases, a well-known gene family was found to have greater representation in one species than the other. A prominent example is the chemosensory receptor (also known as the olfactory receptor) family. While there are 718 putative C. elegans chemosensory receptor proteins annotated in WS77*, the TRIBE-MCL clustering and Pfam annotation detects only 429 putative C. briggsae chemosensory receptor proteins. Another example is a large family containing a cyclin-like F-box (usually associated with phosphorylation-dependent ubiquination) which is represented by 243 copies in C. elegans and 98 in C. briggsae. For clusters containing five or more members, there were 202 clusters with compositional differences of at least 2-fold between the two species (118 enriched in C. briggsae proteins, 84 clusters enriched in C. elegans proteins) and 12 clusters with compositional differences of at least 10-fold (ten C. briggsae-enriched; two C. elegans-enriched).
The chemosensory receptor family is subdivided by Pfam 9.0 (Bateman et al. 2002
) into six subfamilies: 7TM subfam4, 7TM subfam5, sra, srb, sre, and srg. To determine whether the difference in size of the chemosensory receptor families in the two species affects all subfamilies equally, we undertook a detailed analysis of the clusters using neighbor-joining trees and manual inspection. The results, summarized in , show that the size differential in the chemosensory receptor family between the two species is not distributed equally but is concentrated in two subfamilies: the 7TM subfam5 subfamily, with 311 and 151 members in C. elegans
and C. briggsae
respectively, and the sra subfamily, with 36 members in C. elegans
and 18 members in C. briggsae.
Differential Sizes of Chemosensory Receptor Subfamilies in C. elegans and C. briggsae
shows a phylogenetic tree for the sra subfamily. Many of the tree's terminal branches have exactly two members, one from the C. elegans and the other from C. briggsae. These cases correspond to putative orthologs. Several branches (arrows in ) show expansions in C. elegans that presumably represent cases in which a common ancestor in the two species underwent expansion in C. elegans but not in C. briggsae. A more modest expansion of a cluster of C. briggsae genes is also seen.
Tree View of All Chemosensory Receptor Genes in the Sra Subfamily
In the chemosensory receptor family, there is a strong correlation between C. elegans-specific expansions in the similarity tree and regions of tandem duplication in the genome. For example, all the genes found in the large C. elegans-specific expansion in the upper right of are tightly clustered in a tandem array within a single 20 kbp region of C. elegans chromosome I (inset of , lower right).
The physical clustering of protein families that was first observed in C. elegans
(C. elegans Sequencing Consortium 1998
) is also common in C. briggsae
. To compare family clustering in the two species, we evaluated sliding windows of 15 protein-coding genes and determined the average number of gene products within the window that belonged to the same TRIBE family. In C. briggsae
, the average (mean ± SD) number of family members in a sliding window was 0.37 ± 1.04, while a similar value of 0.55 ± 1.57 was observed in C. elegans
, in contrast to values of 0.00008 (C. briggsae
) and 0.00058 (C. elegans
) that are expected of families that are not clustered. A permutation test indicates that the relationship of families is highly nonrandom (p
< 0.001) compared to a reshuffling of the genome on either a chromosome or a contig level.
Noncoding RNAs (ncRNAs) are a class of gene that produce a functional transcript as a final product, rather than being translated into a protein. We used a combination of ncRNA prediction programs and similarity search algorithms to identify ncRNAs in C. briggsae. For the purposes of comparison, we repeated the analysis with C. elegans.
We found 962 ncRNA genes in C. briggsae and 838 ncRNA genes in C. elegans (), as well as an additional 191 and 212 fragmentary matches in each species, respectively, that may be pseudogenes. Although the transfer RNA (tRNA) search software we used is able to distinguish pseudogenes from true genes with high sensitivity, this is not true of the other ncRNA assignments, which should therefore be treated with caution in the absence of experimental data.
ncRNA Gene Predictions Present in C. elegans and C. briggsae Genome Assemblies
We found 777 tRNA genes and 181 tRNA-derived pseudogenes in the C. briggsae
genome. In C. elegans
we found 609 tRNA genes and 210 tRNA pseudogenes, in good agreement with previous analyses (C. elegans Sequencing Consortium 1998
The C. briggsae
tRNA set included two putative selenocysteine tRNAs. Furthermore we also identified a selenocysteine insertion sequence (SECIS) in the 3′ untranslated region (UTR) of a putative C. briggsae
thioredoxin reductase gene. This C. briggsae
, is orthologous to C. elegans
, and was previously reported to contain a conserved SECIS in the 3′ UTR (Buettner et al. 1999
The approximately 170 extra tRNA genes in C. briggsae
are distributed across the amino acyl-tRNA families. We constructed neighbor-joining trees of the amino acyl-tRNA families and identified several C. briggsae/C. elegans
orthologs (data not shown). The codon wobble patterns predicted by Guthrie and Abelson (1982
) and described for the human genome (International Human Genome Sequencing Consortium 2001
) are conserved in both nematodes.
Ribosomal RNA Genes
The genes for three of the four ribosomal RNA (rRNA) components, 18S, 5.8S, and 26S, are known to occur in large, tandem-repeat structures in C. elegans
and other higher eukaryotes (Ellis et al. 1986
). There are thought to be 100–150 copies of this repeat on chromosome I in C. elegans
. The 5S rRNA genes are strikingly conserved between C. elegans
and C. briggsae
, with the genes in the two species identical in sequence (Butler et al. 1981
; Nelson and Honda 1989
). An estimated 100 copies of 5S rRNA lie in a tandem array on chromosome V in C. elegans.
The C. briggsae
rDNA repeat measured 7,429–7,431 bases with variations in length of a poly(A) tract and single nucleotides. The unit repeat contained one copy each of 5.8S, 28S, and 18S rDNA genes. In agreement with previous reports (Nelson and Honda 1989
), two distinct 5S unit repeats were found of 697 bases and 938–940 bases. Each version contains a single copy of the 5S gene. Several minor variants of each version were apparent. Based on the number of reads present for each repeat, the rRNA array extends 410 kbp (55 copies), and the 5S arrays extend 20 kbp (30 copies) for the shorter unit and 70 kbp (70 copies) for the longer. The latter two estimates are in reasonable agreement with previous estimates (Nelson and Honda 1989
Spliced Leader RNA Genes
A characteristic of many C. elegans genes is the presence of a trans-spliced leader, a short RNA that is spliced onto the 5′ end of the primary transcript prior to further processing. Two types of spliced leader, SL1 and SL2, have been described.
RNA gene that donates the SL1 spliced leader is present on the 938 bp 5S rRNA repeat units (Nelson and Honda 1989
), as it is in C. elegans
. There are 18 known SL2
genes in C. elegans
. They vary somewhat in sequence and are scattered throughout the genome (Evans et al. 1997
; T. Blumenthal, unpublished data). We searched the C. briggsae
genome for the SL2
RNA genes and found 18 matches. These matches encode four of the eleven SL2
sequence variants known in C. elegans
as well as two variants not found in C. elegans
. In both species, roughly half of the SL2
RNA genes are found in a few small clusters. A phylogenetic tree of the SL2
RNA genes in the two species indicates that four SL2
RNA genes in the last common ancestor expanded after separation of the species to create the 18-member gene families that exist today.
Several miRNAs are clustered in the C. elegans
genome (for example, mir-42
, and mir-44
) and may be processed from the same primary transcript. Of nine clusters of two or more miRNA genes within regions of around 1 kb in C. elegans
, we find seven with conserved gene order and orientation in the C. briggsae
genome. An additional cluster contains six C. elegans
paralogs to mir-35
, versus eight in C. briggsae
, as previously reported (Lau et al. 2001
Other ncRNA Genes
Our analysis identified 178 putative non-tRNA, non-rRNA genes in C. briggsae. The majority of classes, including U1, U2, U5, and SRP RNA, showed slightly higher copy numbers in C. elegans.
The Rfam search failed to find C. elegans or C. briggsae RNAase P and U3 small nucleolar RNA (snoRNA) sequences. However, RNAase P from C. elegans has recently been identified by others (R. Klein, personal communication), and with aid of this sequence, the matching C. briggsae sequence was then easily found using BLASTN. BLASTN searches with six C. elegans U3 snoRNAs (T. Jones, personal communication) identified four matching genes in the C. briggsae genome.
It is interesting to note that there are roughly twice as many putative U6 small nuclear RNAs (snRNAs) in C. briggsae as in C. elegans. Upon further examination, we found that the U6 Rfam family appears to contain large numbers of pseudogenes, and thus the observed expansion may reflect recent pseudogene duplication. The putative U6 snRNA sequences are remarkably conserved, with 22 clusters in C. briggsae and 16 in C. elegans having identical sequences. Two regions of around 10 kbp on the same FPC contig in C. briggsae (cb25.fpc2888) contain six and three U6 snRNA genes and correspond to two similar clusters located on C. elegans chromosome IV. Additional highly similar U6 hits can be found flanking these regions and in additional clusters. However, a neighbor-joining tree of all 63 putative U6 snRNAs did not show any obvious C. briggsae-specific subfamily expansion (data not shown).
Missing ncRNA Genes
We found no homologs of the U12 spliceosome components U4atac, U6atac, U11, and U12 in the C. briggsae sequence. This spliceosome has not been previously identified in C. elegans or other nematodes. We found no obvious homologs of telomerase RNA or of several snoRNAs in the C. briggsae or C. elegans genomes, suggesting significant divergence from previously identified examples.
To characterize the repeat content of C. briggsae,
we applied RECON, an ab initio repetitive element identification algorithm (Bao and Eddy 2002
) to identify repetitive sequences with more than ten copies in the genome. Putative repetitive elements were then screened to remove RNA and protein families not associated with transposable elements. For the purposes of comparison, we applied the same algorithm to C. elegans
The RECON-constructed library for C. briggsae
contains 473 consensus repeat sequences, and the library for C. elegans
contains 377. Previously reported transposons with more than ten copies in the C. briggsae
genome, Tcb1 (Harris et al. 1990
) and Tcb2 (Prasad et al. 1991
) for C. briggsae
, as well as C. elegans
transposons Tc1, Tc2, Tc3, Tc6, and Tc7 (Plasterk and von Luenen 1997
), were all recovered in the corresponding library.
When the RECON libraries were used as the substrate to RepeatMasker, 22.4% of the C. briggsae and 16.5% of the C. elegans genomes were masked. Extrapolation to the expected 104 Mbp of the complete C. briggsae genome indicates that 23.3 Mbp of the C. briggsae genome is repetitive, as opposed to 16.5 Mbp of C. elegans, which has a contiguous genome size of 100.3 Mbp. Hence, the differential repeat content accounts almost entirely for the different observed sizes of the species' genomes.
Comparison of the repetitive portion of the two genomes confirms the early observation from c0t
curves that closely related species contain similar repetitive sequences (Britten and Kohne 1968
): the C. briggsae
library masks 4.6% of the C. elegans
genome, while the C. elegans
library masks 6.6% of the C. briggsae
genome. In contrast, the RECON-constructed library for Arabidopsis thaliana
(Z. Bao and S. R. Eddy, unpublished data) can only mask 0.2% of the C. elegans
genome and 0.1% of the C. briggsae
The top ten repeat families in the two genomes are shown in Tables and . Interestingly, a single C. briggsae repeat family, Cb000047, is present more than 20,000 times and constitutes 4.7 Mbp of the C. briggsae genome. In both genomes, the majority of repeat families are either short, nonautonomous DNA transposons or tandem arrays. The presence of Cb000047 is an interesting difference between C. briggsae and C. elegans, the largest of whose repeat families is present in roughly 3,000 copies.
Top Ten Most Abundant Repeat Element Families in C. briggsae
Top Ten Most Abundant Repeat Element Families in C. elegans
Despite their general similarities, we were not able to systematically identify ortholog pairs among the C. briggsae and C. elegans repeats. When we used RepeatMasker to compare the sequences of one library against the other, we found no simple one-to-one mapping between them. Furthermore, similarity between repeat element pairs was generally restricted to subparts of the consensus sequences, presumably reflecting domains important for the propagation of the elements. It is not yet clear whether the overall similarity in repeat composition between the two genomes is due to orthology or because related genomes tend to be permissive for similar types of repeats (in particular, transposable elements). The relatively small amount of similarity between the repeat libraries in the two species suggests that most observed dispersed repeat elements postdate the divergence of the two species.
Genome Organization of C. briggsae and C. elegans
To begin to compare the organization of the C. briggsae
and C. elegans
genomes, we created whole-genome alignments of the two genomes at the nucleotide level using the WABA algorithm (Kent and Zahler 2000
). WABA produced a total of 1,340,518 blocks of alignment. After adjusting for overlaps, the WABA alignments were found to cover 52.3% of the C. elegans
genome (52.4/100.2 Mbp) and 50.1% of the C. briggsae
assembly (52.9/105.6 Mbp).
To characterize the alignments, we compared them with the positions of annotations on the C. elegans genome, using the WS77* annotations for the positions of genes, coding regions, and introns, and the WS87 annotations (October 2002) for the positions of 5′ and 3′ UTRs (the annotations for which were not available in WS77). In addition, we also scored aligned regions that fell within the upstream and downstream regions of genes, which were defined arbitrarily as the region 1,000 bp 5′ to the translational initiation codon for upstream regions and the region 1,000 bp 3′ of the translational stop codon for downstream regions.
summarizes the relationship among the 1.3 million raw WABA alignments and annotated C. elegans genes. The WABA algorithm produces pairwise alignments that contain smaller aligning blocks of three types: “coding,” “strong,” and “weak.” Coding blocks have the characteristic match-2/skip-1 pattern of diverged coding regions, while strong and weak blocks have high and low levels of similarity, respectively. Remarkably, only a third of aligned bases overlap known coding exons or their 5′ or 3′ UTRs. Another third lie in introns, and the final third lie in intergenic regions. More than half of the latter class lie more than 1,000 bases outside of known protein-coding genes.
Bases Covered by Aligned WABA Blocks, Stratified by Relationship to Annotated C. elegans Genes
As indicated by Kent and Zahler (2000
), although there is a preference of WABA coding blocks for coding regions and strong blocks for intergenic regions, there is not a clear delineation. shows a C. elegans
and its WABA alignment to part of C. briggsae
supercontig cb25.fpc2454. WABA blocks covered the snt-1
CDS and predicted 3′ UTR. Coding WABA blocks (dark blue in ) correlate well with exons of both of the alternatively spliced transcripts of snt-1
. However, there are also small coding blocks between exons 1 and 2 and between exons 2 and 3. These could be missed exons in the C. elegans
gene model or conserved functional elements, but there are no experimental data to support either supposition.
A WABA Alignment over a Known C. elegans Gene (snt-1)
Strong blocks are found in exons and in the region immediately upstream of the gene. Weak blocks are frequently found in introns and in intergenic regions. Intriguingly, the intergenic region 2,000 bp upstream of snt-1 includes regions of alignment that contain strong, weak, and coding blocks. Such regions, which are numerous in the alignment of the two genomes, do not correspond to repeats or other known features and beg investigation to determine whether they are footprints of unknown functional elements.
A file containing the raw WABA alignments is available as Dataset S4
Colinearity between C. briggsae and C. elegans Genomes
We used the C. briggsae/C. elegans WABA alignment data and a simple set of algorithms to identify regions of long-range colinearity between the two genomes. We first merged alignments whose coordinates overlapped in both the C. elegans and C. briggsae genomes. We then filtered these data to remove regions in which an excessive number (more than five) of segments of C. briggsae sequence were aligning to the same region of C. elegans or vice versa. This was followed by a simple merge, in which adjacent blocks of colinear alignments were merged, and a second round of merging using a dynamic programming algorithm to bridge small transpositions and inversions by finding runs of monotonically increasing alignment blocks. The resulting set of 13,467 candidate synteny blocks was then filtered to remove small blocks of alignment of less than 1.8 kbp.
The final list of candidate synteny blocks contained 4,837 alignments covering 84.6% and 80.8% of the C. elegans and C. briggsae genomes (mean, 37,472 bp; median, 5,557 bp). The largest such synteny block was a 1.68 Mbp segment involving the center of C. elegans chromosome II and C. briggsae supercontig cb25.fpc0058. The next largest blocks are on C. elegans chromosome X, where there are two adjacent synteny blocks of roughly 1.3 Mbp each. However, these blocks cannot be merged without introducing a disproportionately large gap in the corresponding C. briggsae alignment.
To understand the nature of the unaligned areas, we examined the ten longest gaps, which ranged in size from 51 kbp to 85 kbp. In four cases, the gaps were occupied by expansions of one or more C. elegans gene families, none of which had strong matches to C. briggsae. Three cases corresponded to a region of genes that were unrelated to each other, but none had strong matches to C. briggsae. Two cases were gene deserts with no genes or just a single gene without a C. briggsae match. In the last case, we found that half the interval was occupied by a set of unrelated genes without strong C. briggsae matches, while the other half contained a large set of WABA alignments that overlapped a set of orthologous genes. This last case therefore represented an instance in which our procedure did not extend an alignment block as far as possible.
Changes in gene order during genome evolution are typically classified as inversions, in which a region of the genome is flipped without loss of genetic material; reciprocal translocations, in which two regions are swapped without loss of genetic material; and transpositions, in which a nonreciprocal movement of genetic material takes place. We classified the colinear blocks by searching for the signature breakpoints of each of these events and counted 1,384 putative inversions, 244 putative translocations, and 2,735 putative transposition events. The remaining 476 blocks could not be classified because they abutted the ends of C. briggsae supercontigs.
We refer to the junctions between adjacent colinear blocks within a C. briggsae supercontig as “breakpoints,” because they correspond to a putative rearrangement between ancestral chromosomes during the divergence of C. briggsae from C. elegans. is a tally of such breakpoints, broken down by the two C. elegans chromosomes matching either end of the breakpoint junction on a C. briggsae supercontig. There is a clear bias towards rearrangements within the same ancestral chromosome, which occur roughly four times more frequently than those between chromosomes (p < 10−4, χ2 test). When the relative sizes of the intra- versus interchromosomal compartments are considered, the overall density is approximately ten times higher for intrachromosomal rearrangements: 29.7 intrachromosomal rearrangements per megabase pair versus 3.6 interchromosomal rearrangements per megabase pair (p < 10−4, χ2 test). Overall, the X chromosome has a lower density of colinearity breakpoints than the autosomes (31.1 versus 52.1 rearrangements per megabase pair in X versus the autosomes). This conclusion applies equally to rearrangements within X and between X and the autosomes.
Chromosomal Partners at Colinearity Breakpoints
We also examined the frequency of breakpoints within and between C. elegans
chromosomal arms. The C. elegans
autosomes are regionally divided into a “central cluster,” in which the meiotic recombination rate is low, and two “arms,” in which the meiotic rate is high (Barnes et al. 1995
). The arms are located roughly one third of the distance from the two telomeres. Because the chromosomal arms are asymmetrical with respect to the meiotic pairing region (Sanford and Perry 2001
), for this analysis we reversed C. elegans
chromosomes I and V to place the meiotic pairing region on the left arm of all chromosomes.
We found that breakpoints between regions of colinearity are strongly biased towards junctions that are within the same arm of the same chromosome (; p
test). We also found a regional difference in the distribution of the lengths of colinear blocks. The mean block length increases gradually from a mean of 25.5 kbp in the first 10% of C. elegans
chromosome length to 44 kbp in the center and then decreases to a mean length of 27.5 kbp in the last 10% of the chromosome. This distribution is nonrandom at p
-test comparing the first 10% to the center or the center to the last 10%). This correlation between synteny block lengths and chromosomal position was first noted by Kent and Zahler (2000
). Coghlan and Wolfe (2002
) attempted to reproduce this finding but were unable to show statistical significance with the data available at the time.
Chromosomal Arms at Colinearity Breakpoints
The list of candidate synteny blocks is available as Dataset S5
Ordering of C. briggsae Supercontigs by C. elegans Synteny Block
We used the merged synteny blocks to order the C. briggsae supercontigs by placing each supercontig into the order and orientation dictated by its single largest block of C. elegans colinearity. To maximize the number of supercontigs for which there was ordering information, we used the unfiltered set of 13,467 alignment blocks. With this technique we were able to position all but one of the 142 supercontigs that had been placed on the C. briggsae physical map and 241 of the 436 unplaced contigs. The total amount of sequence that could be placed in this way was 104.5 Mbp of the 105.6 Mbp C. briggsae assembly.
The C. briggsae supercontigs that could not be placed onto C. elegans by synteny block were small (mean, 5,462 bp; median, 3,479 bp). We examined the ten largest of the unplaced supercontigs, including the supercontig that had been placed onto the C. briggsae physical map (cb25.fpc4500; 8 kbp) but not onto the reconstruction. We found the unplaced contigs to be generally gene-poor and largely devoid of nucleotide-level alignments to C. elegans.
shows a graphical representation of this ordering. Of note are the maintenance of large regions of colinearity between C. elegans chromosome X and C. briggsae supercontigs fpc0045, fpc3857, fpc2397, fpc4044, fpc4033, fpc4069, and fpc0929 and a marked preference for intrachromosomal versus interchromosomal rearrangements within the autosomes. Also of note is the high level of concordance between synteny blocks predicted by nucleotide-level synteny and the positions of protein-level orthologs. Small areas of discordance, scattered throughout the genome, may be misassignments among the orthologs or small regions of colinearity that were not detected at the nucleotide level.
Representation of the C. briggsae WGS Assembly on a C. elegans Scaffold Using Colinearity Relationships
It is important to emphasize that this ordering of C. briggsae supercontigs is arbitrary and gives only an approximate idea of their chromosomal assignments. Accurate chromosomal assignment of the supercontigs and a comparison of large-scale rearrangements between the chromosomes of the two species must await empirical determination.
The ordering of C. briggsae
supercontigs on the C. elegans
genome is available as Dataset S5
Genomic Breakpoint Rate
As noted in the previous section, we identified 1,384 putative inversions, 244 putative translocations, and 2,735 putative transposition events. As described in Sankoff (1999
), each inversion or translocation implies two breakpoint events (at either end of the conserved block), while each transposition implies three breakpoint events (two at the source and one at the destination of the conserved block). This gives an estimated 11,461 genomic breakpoint events during the divergence of C. elegans
and C. briggsae,
or 57 breakpoint events per megabase pair per species.
Using our previous estimate of a divergence time of 80–110 MYA between the two species, we calculate a breakpoint rate of between 0.5 and 0.7 breakpoints per megabase pair per million years. Although our synteny block counts must be viewed as approximate, since modest changes in the parameter choices can affect the number of small blocks counted by as much as a factor of two, our estimate is in good agreement with previous estimates based on the available finished sequence (Coughlan and Wolfe 2002). A minor difference is that we found relatively fewer translocations than were found in the earlier study. It is likely that differences in methodologies are responsible for this discrepancy, as we defined conserved blocks on the basis of merged regions of nucleotide-level similarity, whereas Coghlan and Wolfe (2002
) used protein-level similarity of sequenced genes to identify orthologous regions. The C. briggsae/C. elegans
breakpoint rate is roughly five times the rate reported by Ranz et al. (2001
) for Drosophila
Conservation of Operon Structure
contains roughly 1,000 trans
-spliced operons (Blumenthal and Gleason 2003
), most of which have been identified. On the basis of the colinearity blocks described above, we examined the conservation of the 800 operons present in WS77 and found that 768 (96%) are preserved intact in the C. briggsae
genome, as judged by the criterion that all the genes in the C. elegans
operon could be found in the same order and orientation in the corresponding region of C. briggsae
. By comparison with the preservation of order of nonoperonic genes, we would have expected that only 60% of the operons would be preserved if no selection were operating to preserve gene order. Hence, we conclude that the structures of operons are under purifying selection.
Of the 32 C. elegans operons that were disrupted in C. briggsae, seven were broken by large expansions of intercistronic distance, five had transpositions of the first gene in the operon to another location, nine had rearrangement breakpoints within two-gene operons, four were breakpoints between the last gene and the rest of the operon, two were breakpoints between the first gene and the rest of the operon, two were internal breakpoints, two were breakpoints within a gene in the operon, and one was a general scattering of all the genes in the operon.
Position-Specific Variations in Caenorhabditis Chromosomes
When the distribution of synteny blocks, orthologs, low-stringency orphans, and silent site substitution rate is projected onto the C. elegans
sequence map, an intriguing pattern appears (; see also Poster S1
). When normalized for the distribution of genes, there is a marked increase in the frequency of orthologs in the center of the chromosomes versus in the arms (C; 74.8 versus 53.2 orthologs per 100 genes, p
< 1 × 10−12
by Welch's two sample t
-test) and an increase in the frequency of “orphan” genes in the chromosome arms versus the centers (D; 6.15 versus 3.36 orphans per 100 genes in the arms versus the central regions, p
< 1 × 10−8
-test). This pattern is correlated with the ratio of nonsynonymous to synonymous substitutions (the KA
ratio) between ortholog pairs, which is higher in the chromosomal arms than in the centers (G; mean KA
, 0.065 versus 0.059; p
< 4.6 × 10−9
-test). This is due in part to elevated rates of KA
on chromosome arms versus centers (mean KA
, 0.138 versus 0.128; p
< 7.9 × 10−5
-test.). Although there is considerable regional variation in the rate of silent site substitutions (the KS
value), there is no significant difference in the mean or variance between chromosome arms and the central regions.
Evolutionary Divergence across C. elegans Chromosome V
The difference between arms and centers is also reflected in the distribution of blocks of nucleotide-level similarity between C. elegans and C. briggsae (A). As noted in the previous section, synteny blocks are longer in the centers than in the arms. These patterns are pronounced in all the autosomes, but present to a lesser degree in the X chromosome as well.
The arm/center dichotomy seen in the comparison between the two species is reflected in a number of intrinsic features of the C. elegans
chromosomes, many of which were reported at the time of whole-genome sequencing (C. elegans Sequencing Consortium 1998
). The centers are gene-rich and have a lower meiotic recombination rate than the arms (B). Transposons and other repeat elements are greatly enriched in the arms and largely excluded from the centers (F). Lastly, the frequency of essential genes, as judged by mutant phenotype or lethal outcomes in genome-wide RNAi screens (Fraser et al. 2000
; Gonczy et al. 2000
; Kamath et al. 2003
), is higher in the chromosomal centers than in the arms even after normalization for gene number (E).
Mechanistically, the increased rate of species-specific genes in the chromosomal arms could be explained by the preponderance of transposable element insertions in these locations. Transposable elements have been identified as an engine driving exon shuffling (Ejima and Yang 2003
) and gene evolution (Lev-Maor et al. 2003
). Transposable elements and other repeats are also associated with an increased rate of illegitimate chromosomal rearrangement (Gray 2000
; Stakiewicz and Lupski 2002), which may be responsible for the trend toward shorter synteny blocks in the arms.
The observation of large positional variances in neutral substitution rate (KS
) has been seen in recent mouse/human comparisons (Waterston et al. 2002
), and it appears that the same phenomenon is seen in Caenorhabditis.
However, while there is a significant difference in KA
in chromosome centers versus arms, the variation in KS
alone is more localized in nature.
Using C. briggsae Sequence to Improve C. elegans Annotation
The C. elegans
genome now totals 100,273,501 bases (WS103 release; June 2003) and consists of six contiguous segments of DNA corresponding to the C. elegans
chromosomes. The last gap in the sequence was closed in November 2002. Since the publication of the C. elegans
genome (C. elegans Sequencing Consortium 1998
), the gene set has been extensively hand curated. Between the WormBase WS17 release in April 1999 and the WS77 release in April 2002 (the release used in this paper), WormBase curators made manual changes to approximately 6,300 genes (D. Lawson, personal communication).
To investigate the potential of the C. briggsae/C. elegans comparison for improving the C. elegans gene annotations, we compared the C. elegans hybrid gene set of 20,621 genes derived from our comparison of the two species to the set of 18,808 protein-coding genes in WS77* derived from WormBase. The majority (14,011) of the hybrid gene set predictions overlapped perfectly with WS77* gene predictions. Many of these, of course, are derived indirectly through Ensembl from WS77. From the remainder we derived 1,275 well-supported suggestions for new C. elegans genes, 1,763 new exons in 1,100 existing genes, 2,093 exon deletions in 1,583 existing genes, 1,675 exon truncations in 1,502 existing genes, and 1,115 exon extensions in 1,008 existing genes ().
Updating the C. elegans Gene Set Using C. briggsae Similarity
Most of the corrections suggested for the WS77 gene set using C. briggsae
similarity are still applicable to WS103, even after the manual correction of approximately 3,800 C. elegans
genes between the WS77 and WS103 WormBase releases, prompted in part by the open reading frame (ORF) sequence tag (OST) dataset of Reboul et al. (2003
). Only 290 of the 1,275 proposed new hybrid set genes overlap new WormBase gene predictions made since WS77, and 4,802 of the 6,646 proposed exon changes are in gene structures that have not been edited between WS77 and WS103.
The automated analysis presented above indicates that the C. briggsae sequence will suggest many changes to the C. elegans set of gene predictions. To clarify the nature of these changes, we subjected several areas of colinearity to careful hand curation. In one carefully inspected area involving three C. elegans cosmids (ZK632, R10E11, and T05G5) and 33 C. elegans predicted genes, the syntenic C. briggsae region has 38 predicted genes (). Inversions have broken the syntenic region into three conserved segments, within which gene order and orientation are largely conserved, except for one single-gene inversion (ZK632.9). There are 30 one-to-one orthologs in the syntenic block and two one-C. elegans-to-two-C. briggsae orthologs (T05G5.6 versus CBG10003 and CBG09979, and T05G5.8 versus CBG10002 and CBG09978), where the two C. briggsae orthologs seem to have been duplicated as a block since speciation. The remaining C. elegans gene (CEG09285) belongs to a C. elegans-specific gene family; its closest C. elegans paralog is F40F12.3, a gene of unknown function that is located nearby on chromosome III.
A Region on C. elegans Chromosome III Containing 33 Genes, and the Syntenic C. briggsae Region, Which Has 38 Genes
The remaining four C. briggsae genes include two members of a C. briggsae-specific gene family (CBG10004 and CBG09973), one that has a C. elegans ortholog on chromosome X (CBG09992) and one that has no clear C. elegans ortholog but has a match on chromosome X (CBG09973). Compared to the C. elegans WS77 gene set, the C. elegans hybrid set has two extra gene predictions: the C. elegans-specific genes CEG09285 and CEG09299, which are the orthologs of C. briggsae gene CBG09988. The other 31 C. elegans genes in this region are in both the C. elegans WS77 and hybrid gene sets. However, for seven of these genes, there are substantial differences between the WS77 and C. elegans hybrid gene structure that are supported by C. briggsae similarity. These include extra exons (in T05G5.10 and T05G5.4), deletions of WS77 exons (in T05G5.1, T05G5.9, and ZK632.7), truncations of WS77 exons (in R10E11.5), and extensions of WS77 exons (in T05G5.1, T05G5.4, and R10E11.7). In summary, the analysis of 33 C. elegans genes suggested corrections to seven gene models, proposed two missed genes, and confirmed the other 24 gene models.
We also evaluated a set of C. elegans WS77 gene predictions where the new C. elegans hybrid gene set suggests a strong likelihood of a change to the WS77 prediction. We found examples of hybrid gene set predictions that are pseudogenes and possible splitting and fusing of existing WS77 genes, as well as the addition of new exons.
Extrapolation of these results to the whole C. elegans genome suggests that the C. elegans gene set will be increased by at least approximately 1,300 gene predictions and that approximately 2,800 exons will be extended or truncated in existing WS77 predictions, based on C. briggsae similarity. Thus, despite five years of extensive manual curation, the comparative genomic approaches made possible by the C. briggsae sequence can provide significant improvements to the C. elegans gene models. Manual inspection of these discrepancies will be necessary, but especially for less highly expressed genes, where EST and messenger RNA (mRNA) data are not available, and for initial and terminal exons where signals can be difficult to detect, sequence conservation with C. briggsae will now provide a primary pointer for C. elegans gene structure refinement.