In WormBase, gene model curation refers to the determination of the correct exon structure of protein-coding genes and, where possible, pseudogenes. All new protein-coding structures and pseudogenes are now manually curated. There are currently 49 423 curated gene loci in C. elegans, of which 20 403 are protein coding and 27 588 are non-coding RNA genes. Of the 20 387 protein-coding genes, 3034 gene loci have expressed sequence tag (EST) or trans-splicing evidence that they produce two or more protein isoforms, giving 24 891 coding sequence (CDS) structures in total. There are 1432 pseudogenes. Most of the other species in WormBase are expected to have similar numbers of genes. The non-coding genes are largely based on importing data from databases such as miRBase (5) and Rfam (6) with minimal manual curation.
Most of the initial C. elegans coding gene structures were determined by the gene prediction program Genefinder (Green,P., unpublished data), but subsequent refinement of the structures has been done manually using supporting evidence from various sources such as nucleotide and protein alignments, other experimental evidence and WormBase users’ input, as well as sequence features associated with the regulation of the genes and their transcripts.
The numbers of C. elegans coding sequences have increased fairly steadily over the last 10 years, as shown in . Of the 19 099 CDS structures curated at the time of the 1998 C. elegans paper, there are now only 8709 that remain unchanged. There were around 900 curated non-coding genes until recently when several major imports of non-coding genes occurred.
As new gene prediction programs have become available, programs such as Twinscan (7), Jigsaw (8) and mGene (9) have been used to augment the gene structures. Although the focus remains primarily on C. elegans, recently WormBase has been expanded to also include C. briggsae, C. remanei and C. brenneri, and may be expanded further to include some parasitic nematodes.
Gene prediction programs give a reasonable set of gene structures, but even the best of them predict only ~80% of complete gene structures correctly (10). Although the best gene prediction programs exhibit a similar overall level of sensitivity, they differ in which particular genes are correctly predicted.
Caenorhabditis elegans genes with a large number of exons, short exons, long introns, a weak translation start signal, weak splice sites or poorly conserved orthologs pose great difficulty for gene prediction programs (Williams,G.W. and Davis,P.A., personal observation). They can incorrectly predict a coding gene model where the gene is a pseudogene or a pseudogenic fragment and they predict the isoforms of a gene poorly, if at all. They do not use several additional types of information such as the 5′ position of genes as given by trans-splice leader sites, tiling array expression, mass spectrometry peptides or knowledge of a potential genome sequencing error as indicated by a frameshift in homologous protein alignments. The predicted gene structures, therefore, often need to be manually changed.
The WormBase genomes, gene structures and all associated data and genomic features are held in an ACeDB database (11). This is an object-orientated database that can efficiently hold a wide variety of genomic data types. The curators view and edit the gene structures and other genomic features in the ‘feature map (FMAP)’ editor of ACeDB. The data are exported as general feature format (GFF) files with each release of the database for display on the WormBase web site, using GBrowse (12).
Initial gene set
The original gene set of C. elegans was produced by using Genefinder (Green,P., unpublished data). The initial sets of genes in some of the non-C. elegans genomes in WormBase were predicted using the methods from nGASP (10), a project to find the best nematode gene prediction method. The most accurate gene finders found by nGASP were ‘combiner’ algorithms, such as Jigsaw, which made use of transcript and protein alignments and multi-genome alignments, as well as gene predictions from other gene finders.
Manual curation
Initially, the majority of C. elegans gene structure changes were based on the alignments of transcript data from large-scale transcriptome projects such as Yuji Kohara’s EST libraries (Kohara,Y., unpublished data) and the ORFeome project (13). This approach was taken because transcript data proved to be a rich source of evidence for correcting gene structure errors: the alignments indicate the exact intron boundaries and cover the exons.
There are also many sources of evidence for curation that do not depend on transcript data. This evidence is more indirect than transcript data and often requires a deduction of the likely structure of the genes based on weak or conflicting evidence. This non-transcript evidence includes protein alignments, mass spectroscopic peptides, conserved protein domains and homology to paralogs and orthologs. These are becoming increasingly useful in the refinement of the gene structures, especially in genes with a low level of expression that often lack transcript data. In the WormBase database release ‘WS220’, only 46.9% of the C. elegans CDS structures have coverage of every base of every exon with EST or mRNA transcript evidence and 8.8% of CDS structures have no transcript evidence at all. It is therefore often necessary to use indirect evidence to deduce the most likely structure of the roughly 53% of CDS structures that are not fully confirmed by transcript data.
Supporting evidence for changes to gene structures comes from a variety of sources, which curators investigate and review while attempting to improve the gene models. Some of the major types of supporting evidence include, in roughly their order of significance for curators:
User input. We receive notifications from individual users that gene models need attention. These notifications come either through forms on the WormBase web site or by email to the WormBase Help email address. These suggestions are extremely useful because they can contain data that might never make it into a publication or that come from an expert in a particular field.
Literature curation. WormBase literature curators at the California Institute of Technology flag publications where data are in conflict with current gene structures or sequence features. These are sent to sequence curators by email for examination and resolution. We encourage people to submit their sequences to public databases, such as GenBank, EMBL or DDBJ, in order to provide a public record of the evidence for any changes made. These sequences will then provide an additional means of linking a change in the gene model to the user’s publication.
Transcript data. Nematode transcript data are routinely extracted from a variety of sources. These include mRNA and EST sequences from the nucleotide databases, the ‘OST’ reverse transcription polymerase chain reaction (RT–PCR) sequences from the ORFeome project (13) and the ‘RST’ sequences which are 5′ and 3′ RACE sequence tags. Recently, we have also been adding data from next-generation sequencing platforms such as Illumina and 454 short-read RNASeq data sets (14). The EST, OST, RST and 454 reads are aligned to the genome using BLAT (15). SAGE and TEC-RED sequences are aligned using a simple Perl string-match and the short-read RNASeq data are aligned using a mixture of MAQ (16) and cross-match (Green,P., unpublished data). Errors identified by transcript alignments are generally of four types. The first, and most obvious, is the absence of a gene model where there is a transcript alignment, which indicates a possible missing gene. The second type of error comes from the comparison of introns defined by a transcript to introns in existing gene structures. If an intron that is confirmed by a transcript does not match an intron in a gene structure, then there is probably a mistake in the gene structure or a new isoform needs to be added. The third type of error comes from the paired-end read information (5′ and 3′ reads from the same clone) of transcript sequences. For instance, the mapping of 5′ and 3′ reads of a single EST clone to different gene predictions is an indication that the two gene structures may need to be merged. Features derived from the analysis of transcript alignments, such as trans-spliced leader (TSL) sequence sites and poly-A addition sites, are also used to establish gene or isoform boundaries.
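The intron comparison and the paired-end check described above can be sketched in a few lines of Python. This is an illustrative sketch only, not the WormBase pipeline: the function names, the coordinate convention (1-based, inclusive, forward strand) and the example coordinates are all assumptions made for the illustration.

```python
from typing import Dict, Iterable, List, Set, Tuple

Interval = Tuple[int, int]  # (start, end), 1-based inclusive genomic coordinates


def introns_from_exons(exons: Iterable[Interval]) -> Set[Interval]:
    """Derive the introns implied by a set of exons or alignment blocks."""
    ordered = sorted(exons)
    return {
        (left_end + 1, right_start - 1)
        for (_, left_end), (right_start, _) in zip(ordered, ordered[1:])
    }


def unmatched_introns(model_exons, transcript_blocks) -> Set[Interval]:
    """Transcript-confirmed introns that are absent from the gene model."""
    return introns_from_exons(transcript_blocks) - introns_from_exons(model_exons)


def merge_candidates(clone_hits: Dict[str, Tuple[str, str]]) -> List[Tuple[str, str]]:
    """clone_hits maps a clone name to (gene hit by 5' read, gene hit by 3' read).

    A clone whose two reads hit different gene models suggests a possible merge.
    """
    pairs = {tuple(sorted(hit)) for hit in clone_hits.values() if hit[0] != hit[1]}
    return sorted(pairs)


# Hypothetical gene model and EST alignment (toy coordinates).
model = [(100, 200), (301, 400), (501, 600)]
est = [(150, 200), (301, 380), (501, 560)]
conflicts = unmatched_introns(model, est)

# Hypothetical clone and gene names.
candidates = merge_candidates({"cloneA": ("geneX", "geneY"), "cloneB": ("geneX", "geneX")})
```

Here the EST confirms the first intron of the model but implies a different second intron, so `conflicts` holds the one intron a curator would inspect as either a model error or a new isoform.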
Protein alignments and homology. A variety of protein databases are aligned to the genome using BLASTX (17) to assist in refining gene structures and to identify unannotated genes. These databases include UniProt (18), human proteins from the International Protein Index (19), Drosophila melanogaster proteins from FlyBase (20), yeast proteins from SGD (21) and C. elegans, C. briggsae, C. brenneri, C. japonica, Pristionchus pacificus and C. remanei proteins from WormBase (2). Alignments of C. elegans proteins are particularly useful for highlighting regions where potential exons are missing in members of a gene family. Alignments to non-elegans proteins are used to identify genes that are not currently annotated and to refine existing gene models. Comparing a gene’s structure, including the position and spacing of the introns, to that of its paralogs and orthologs is often a useful means of confirming or refuting a proposed structure. This is particularly useful when curating partially sequenced nematode genomes, which are still in contigs and so may be too short or of too low quality for the gene prediction programs to successfully determine a structure. Care has to be taken when using homology to curate a gene’s structure because nematode genes can reciprocally confirm each other’s structures, leading to the material fallacy of ‘arguing in a circle’. Many of the gene structures from other species of nematodes have been based on the structure of their C. elegans ortholog, either directly by referring to the C. elegans gene while manually refining the structure of the gene or indirectly by training gene predictor programs on the C. elegans gene structures and then using these gene predictors to predict genes in other nematode species.
Repeat regions. The C. elegans repeat library is aligned against the genome using RepeatMasker (22), which also finds simple tandem repeats. The C. elegans repeat library has changed little in the last 4 years; however, several ‘repeat motifs’ have been removed because they actually represented common protein domains. Inverted repeat regions are found using the program ‘einverted’ from the EMBOSS project (23), and these regions aid in identifying transposons. Gene models that overlap with repeat regions are carefully inspected, as they are probably incorrect.
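Flagging gene models that overlap repeats amounts to an interval-overlap test. The following is a minimal sketch under assumed inputs (exon and repeat coordinates as 1-based inclusive pairs on one sequence, with hypothetical model names); it does not reflect how the check is implemented in ACeDB.

```python
from typing import Dict, List, Tuple

Interval = Tuple[int, int]  # (start, end), 1-based inclusive


def models_overlapping_repeats(
    models: Dict[str, List[Interval]], repeats: List[Interval]
) -> Dict[str, List[Tuple[int, int, int, int]]]:
    """Return, per gene model, every (exon, repeat) pair that overlaps."""
    flagged = {}
    for name, exons in models.items():
        hits = [
            (e_start, e_end, r_start, r_end)
            for e_start, e_end in exons
            for r_start, r_end in repeats
            # Two closed intervals overlap when each starts before the other ends.
            if e_start <= r_end and r_start <= e_end
        ]
        if hits:
            flagged[name] = hits
    return flagged


# Hypothetical models and repeat annotations (toy coordinates).
models = {"modelA": [(100, 200), (300, 400)], "modelB": [(1000, 1100)]}
repeats = [(350, 500), (2000, 2100)]
flagged = models_overlapping_repeats(models, repeats)
```

The quadratic scan is adequate for a handful of models; a genome-wide check would normally sort both lists or use an interval tree.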
TSL sequence sites. These are a feature of many nematode genes where 22 bp sequences are spliced onto the 5′-end of the transcript to form the mature mRNA. The TSL sequence sites are found by comparing the 5′-end of the transcript data for matches to the known TSL sequences and are also deduced from the trans-spliced exon-coupled RNA end determination (TEC-RED) project (24). These sites therefore indicate the 5′-end of an mRNA, though not the start site of transcription.
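One way to picture the comparison of transcript 5′-ends with known leader sequences is a suffix match between the unaligned 5′ overhang of a transcript and the 3′ end of a leader. The sketch below is an assumption-laden illustration, not the WormBase method: the `tsl_match` function, the minimum match length of 8 bases and the exact-match requirement are all invented here, and the SL1 sequence shown is the commonly cited one and should be verified before any real use.

```python
from typing import Dict, Optional, Tuple

# Assumption: the commonly cited 22-base C. elegans SL1 leader; verify before use.
LEADERS = {"SL1": "GGTTTAATTACCCAAGTTTGAG"}


def tsl_match(
    overhang: str, leaders: Dict[str, str] = LEADERS, min_len: int = 8
) -> Optional[Tuple[str, int]]:
    """Match the 3' end of a transcript's unaligned 5' overhang to a leader.

    Returns (leader name, number of matching bases) for the longest exact
    suffix match of at least min_len bases, or None if there is none.
    """
    overhang = overhang.upper()
    best = None
    for name, leader in leaders.items():
        n = 0
        # Walk backwards from the 3' ends while the bases agree.
        while n < min(len(overhang), len(leader)) and overhang[-1 - n] == leader[-1 - n]:
            n += 1
        if n >= min_len and (best is None or n > best[1]):
            best = (name, n)
    return best


hit = tsl_match("cccaagtttgag")    # last 12 bases of SL1 left unaligned at the 5' end
no_hit = tsl_match("ACGTACGTACGT")  # unrelated overhang
```

The genomic position immediately 3′ of the matched overhang would then be recorded as the trans-splice acceptor site. Real transcript reads contain sequencing errors, so a production check would probably tolerate mismatches.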
Poly-A sites. These are found by comparing the 3′-end of those transcript data that have a poly-A tail to the genome, confirming that there is not an A-rich genomic region at that position. The poly-A site is characteristic of the end of the processed mRNA and so is a good indicator of the end of the coding gene's structure.
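The poly-A check can be sketched as two tests: the transcript has an unaligned run of A bases at its 3′ end, and the genome immediately downstream of the alignment is not itself A-rich (which would suggest internal priming rather than a true poly-A addition site). The thresholds and the function name below are assumptions chosen for illustration only.

```python
def is_polya_site(
    unaligned_tail: str,
    genome_downstream: str,
    min_tail: int = 10,
    min_tail_a_fraction: float = 0.9,
    max_genomic_a_fraction: float = 0.7,
) -> bool:
    """Accept a poly-A site only if the tail is poly-A and the genome is not A-rich.

    unaligned_tail: transcript bases 3' of the last aligned base.
    genome_downstream: genomic bases immediately 3' of the alignment end
    (same strand as the transcript).
    """
    tail = unaligned_tail.upper()
    if len(tail) < max(min_tail, 1):
        return False
    if tail.count("A") / len(tail) < min_tail_a_fraction:
        return False
    # Compare against a genomic window the same length as the tail.
    window = genome_downstream.upper()[: len(tail)]
    if not window:
        return True
    return window.count("A") / len(window) <= max_genomic_a_fraction


genuine = is_polya_site("A" * 15, "GTCTTGACCGTTAGC")
a_rich_genome = is_polya_site("A" * 15, "AAAAAAAAAAGAAAA")
short_tail = is_polya_site("AAAA", "GTCTTGACCGTTAGC")
```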
Tiling array expression data. There are data sets of tiling array expression from He et al. (25) and Fraser et al. (unpublished data) held in the modENCODE (26) database. These are useful for indicating exons excluded from the gene structures. The size of the probes used, typically 25 bp, limits resolution, and there is no indication of the strand being transcribed. They are, however, useful because libraries from different life stages or strains can indicate changes in expression over time or in different genomic environments.
Intron splice sites. The potential of each base in the genome to form a 5′ or 3′ intronic splice site has been determined using a position weight matrix (Green,P. and Hillier,L., unpublished data). Predicted gene structures that use splice sites with a poor score should be inspected because the prediction program is possibly using the nearest available splice site to splice over a region that does not allow a good gene structure. These regions can be caused by either an error in sequencing the genome or the presence of a pseudogene.
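The matrix used by WormBase is unpublished, so the following is only a generic sketch of how a position weight matrix can score candidate splice sites. The four training sequences are toy 5′ splice-site windows invented for the example, and the uniform background frequency and pseudocount are simplifying assumptions (the C. elegans genome is AT-rich, so a real matrix would use a measured background).

```python
import math
from typing import Dict, List, Tuple


def build_pwm(
    sites: List[str], pseudocount: float = 1.0, background: float = 0.25
) -> List[Dict[str, float]]:
    """Build a log-odds matrix (one dict per position) from aligned sites."""
    pwm = []
    for i in range(len(sites[0])):
        column = {}
        for base in "ACGT":
            count = sum(1 for s in sites if s[i] == base) + pseudocount
            freq = count / (len(sites) + 4 * pseudocount)
            column[base] = math.log2(freq / background)
        pwm.append(column)
    return pwm


def score(pwm: List[Dict[str, float]], seq: str) -> float:
    """Sum the log-odds of each base in seq (must contain only A, C, G, T)."""
    return sum(column[base] for column, base in zip(pwm, seq.upper()))


def scan(pwm: List[Dict[str, float]], genome: str) -> List[Tuple[int, float]]:
    """Score every window of the genome; returns (0-based offset, score) pairs."""
    width = len(pwm)
    return [(i, score(pwm, genome[i : i + width])) for i in range(len(genome) - width + 1)]


# Toy training windows spanning an exon/intron boundary (not real data).
toy_donor_sites = ["AGGTAAGT", "AGGTGAGT", "TGGTAAGT", "AGGTAAGA"]
pwm = build_pwm(toy_donor_sites)
best_offset, best_score = max(scan(pwm, "TTTAGGTAAGTTT"), key=lambda hit: hit[1])
```

A curator-facing tool would flag any splice site used by a gene model whose score falls below some chosen cut-off; where to set that cut-off is not stated in the text.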
Conserved genomic regions. Sequence alignments to the C. briggsae genome have been made using the WABA alignment tool (27). These conserved regions provide confirmatory information about gene structures, indicate possible missed or unannotated exons and genes and indicate the presence of conserved, non-coding sequences that might have regulatory roles. Further alignments of several orthologous Caenorhabditis loci have been made using Pecan (28).
Mass spectroscopy data. There are over 115 000 C. elegans mass spectrometry peptides in WormBase, primarily from the MacCoss lab at the University of Washington (29) and the Hengartner lab at the University of Zurich (30). The measured masses of the peptide ions are matched to fragments of known or predicted C. elegans proteins or translated ORFs by the authors of these data. The locations of these mass spectroscopy peptides are then mapped back to the genome via their locations on the C. elegans proteins. These data match 10 965 gene loci and have been useful in confirming existing gene models. They are also useful in indicating genes that are currently curated as pseudogenes but may have some protein product. The mass spectroscopy peptide data have included alignments to 120 regions that previously had only an ab initio gene prediction with no further evidence, indicating that these predictions are likely to be real coding genes. The presence of a single peptide mapped to a curated gene or pseudogene is not absolute confirmatory evidence of a real protein product, because there appears to be a high frequency of errors in predicting these peptides.
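Mapping a peptide back to the genome via its position on a protein is a coordinate conversion through the CDS exon structure. The sketch below shows one way to do this; the function name, the 1-based inclusive coordinates and the toy exon structure are assumptions for illustration, and the code assumes a complete CDS that starts in frame at the first exon.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # (start, end), 1-based inclusive genomic coordinates


def peptide_to_genome(
    cds_exons: List[Interval], aa_start: int, aa_end: int, strand: str = "+"
) -> List[Interval]:
    """Map amino acids aa_start..aa_end (1-based) to genomic intervals.

    A peptide that spans an exon junction yields more than one interval.
    """
    # 0-based offsets within the spliced CDS covered by the peptide.
    cds_from = (aa_start - 1) * 3
    cds_to = aa_end * 3 - 1
    # Walk the exons in transcript order (reversed on the minus strand).
    ordered = sorted(cds_exons, reverse=(strand == "-"))
    pieces = []
    offset = 0  # CDS offset of the first base of the current exon
    for start, end in ordered:
        length = end - start + 1
        lo = max(cds_from, offset)
        hi = min(cds_to, offset + length - 1)
        if lo <= hi:
            if strand == "+":
                pieces.append((start + lo - offset, start + hi - offset))
            else:
                pieces.append((end - (hi - offset), end - (lo - offset)))
        offset += length
    return sorted(pieces)


# Toy CDS: exon 1 encodes amino acids 1-10, exon 2 encodes amino acids 11-30.
toy_exons = [(101, 130), (201, 260)]
spanning = peptide_to_genome(toy_exons, 9, 12)
minus = peptide_to_genome(toy_exons, 1, 2, strand="-")
```

In the toy example the peptide covering amino acids 9 to 12 crosses the intron, so it maps to the last six bases of the first exon and the first six bases of the second.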
Protein secretory signals and domain structure. An incomplete or fragmented protein domain in the protein product, as indicated by Pfam (31) or InterPro (32), might indicate a missing exon or incorrect splice sites. Protein secretory signals, as predicted by SIGNALP (33), in translated ORFs might indicate the start of a CDS and these locations are generally chosen in preference to other START codons where there is uncertainty about which START codon should be used. Nagy et al. (34) submitted valuable information on genes with incorrect structures, based on an analysis of incomplete and incongruous domains in C. elegans proteins. For example, they highlighted genes that contained obligatory extracellular domains but lacked appropriate sequence signals (signal peptide, signal anchor and transmembrane segments); without such signals, these domains would not be delivered to the extracellular space where they are stable and properly folded.
SAGE. There are 449 980 SAGE tags in WormBase. These have been used to indicate regions where there could be unannotated genes and have resulted in the creation of 243 new coding sequences.
Use of indirect evidence
To give an indication of the types of additional evidence that can improve the confidence that curators have in a CDS structure, a sample of 100 of the predicted CDSs from the set of 8.8% of CDSs with no EST or mRNA transcript evidence was inspected. The 100 genes had been created from a variety of evidence: most of them (93%) were created because they had a structure predicted by at least one gene prediction program and 78% of them had support for some part of their structure from the original Genefinder prediction. The others had been created because SwissProt or WormBase protein alignments indicated a probable CDS structure. Often the predicted structures appear dubious and as much supporting evidence as possible is sought, even if the extra evidence is tenuous and would not have been used for a CDS with good EST evidence. This supporting evidence is noted along with the recorded evidence for the creation of a new structure or when changing an existing structure to match new evidence, and it usually strongly influences the choice of which predicted or probable exons to include in a structure.
The 100 CDS structures often have conflicting structures predicted for them by the different gene prediction programs used and it is often not obvious from these conflicting predictions which potential exons are correct or even that the region contains a gene. In these circumstances, it is useful to seek supporting evidence from orthologs or paralogs or other indications that a protein structure is conserved. Of the 100 inspected CDS structures, 22% had supporting evidence of exons from conserved coding regions found by WABA measures of conservation with the C. briggsae genome, and 81% of them have some SwissProt or WormBase protein alignment evidence of exons.
The 5′ and 3′ exons are often small and divergent between orthologs and are easy to get wrong in structures predicted from protein alignments. Of the 100 CDSs inspected, 10% had their 5′-end confirmed by the presence of a TSL sequence site.
When a CDS structure lacks any consistent gene predictions or has an unusual structure that makes the existence of the gene dubious, it is useful to have evidence that the region is transcribed or produces a protein product. In the absence of EST or mRNA evidence for transcription, such evidence can come from more indirect corroboration of transcription or translation such as aligned SAGE tags or mass spectroscopy. Of the 100 CDSs inspected, 29% have some mass spectrometry evidence and 62% have SAGE evidence of transcription in the region.
Pseudogenes
There are currently 1432 pseudogenes in WormBase. Pseudogenes in WormBase are regions of the genome that resemble coding genes but are not expressed or cannot produce a successful protein product. These pseudogenes are manually curated and reviewed every few years. They are created when curators note EST alignment evidence for premature STOP codons or frameshifts in the open reading frames. Some pseudogenes have been created on the advice of experts in a particular gene family who note that the domains are incomplete or the likely tertiary structures of the gene products are not consistent with the rest of that family. Where possible, the exonic structure of the pseudogene is curated and the parent gene of the pseudogene is noted. Some coding genes are reclassified as pseudogenes every year as new evidence for their structure is collected and it becomes evident that the curated CDS structure is not correct and no successful protein product can be made. More rarely, a pseudogene may be reclassified as a coding gene if there appears to be good mass spectrometry evidence or other evidence from the literature for the change. The criteria for deciding whether a gene is a pseudogene are not specified very well in WormBase. In general, there should be a near-duplicate coding gene that is probably the parent gene of the pseudogene, the coding frame should be disrupted or an expert should declare it to be a pseudogene. No attention is paid to whether the pseudogene has a functioning promoter or not, as promoter regions are still poorly characterized. When there is equivocal evidence for changing a coding gene into a pseudogene, the curators tend to be slightly biased against making the change. This is because making a gene into a pseudogene effectively removes it from the scrutiny that coding genes get and removes the protein product data from the database.
Genomic sequence errors
Genomic sequence errors are also corrected when found. Genomic errors within genes can affect their structure, so correction is critical for accuracy. Over the years, there have been a number of changes to the underlying C. elegans genome sequence. These have usually been small indel modifications, but there have also been a number of large changes. The changes are based on reinterpretations of the original sequencing trace data, often done because there are mismatches between the genomic sequence and aligned EST sequence. Details of the genomic sequence changes can be found on the WormBase wiki pages (http://www.wormbase.org/wiki/index.php/Genome_sequence_changes).