In WormBase, gene model curation refers to the determination of the correct exon structure of protein-coding genes and, where possible, pseudogenes. All new protein-coding structures and pseudogenes are now manually curated. There are currently 49 423 curated gene loci in C. elegans, of which 20 403 are protein coding and 27 588 are non-coding RNA genes. Of the 20 387 protein-coding genes, 3034 gene loci have expressed sequence tag (EST) or trans-splicing evidence that they produce two or more protein isoforms, giving 24 891 coding sequence (CDS) structures in total. There are 1432 pseudogenes. Most of the other species in WormBase are expected to have similar numbers of genes. The non-coding genes are largely based on importing data from databases such as miRBase (5) and Rfam (6) with minimal manual curation.
Most of the initial C. elegans coding gene structures were determined by the gene prediction program Genefinder (Green,P., unpublished data), but subsequent refinement of the structures has been done manually using supporting evidence from various sources such as nucleotide and protein alignments, other experimental evidence and WormBase users’ input, as well as sequence features associated with the regulation of the genes and their transcripts.
The number of C. elegans coding sequences has increased fairly steadily over the last 10 years, as shown in the accompanying figure. Of the 19 099 CDS structures curated at the time of the 1998 C. elegans paper, only 8709 now remain unchanged. There were around 900 curated non-coding genes until recently, when several major imports of non-coding genes occurred.
Figure: The number of curated CDSs and non-coding genes in C. elegans.
As new gene prediction programs have become available, programs such as Twinscan (7), Jigsaw (8) and mGene (9) have been used to augment the gene structures. Although the focus remains primarily on C. elegans, recently WormBase has been expanded to also include C. briggsae, C. remanei and C. brenneri, and may be expanded further to include some parasitic nematodes.
Gene prediction programs give a reasonable set of gene structures, but even the best of them predict only ~80% of the complete gene structures correctly (10). Although the best gene prediction programs exhibit a similar overall level of sensitivity, they differ in which particular genes are correctly predicted. Caenorhabditis elegans genes with a large number of exons, short exons, long introns, a weak translation start signal, weak splice sites or poorly conserved orthologs pose great difficulty for gene prediction programs (Williams,G.W. and Davis,P.A., personal observation). The programs can incorrectly predict a coding gene model where the gene is a pseudogene or a pseudogenic fragment, and they predict the isoforms of a gene poorly, if at all. They do not use several additional types of information, such as the 5′ position of genes as given by trans-spliced leader sites, tiling array expression, mass spectrometry peptides or knowledge of a potential genome sequencing error as indicated by a frameshift in homologous protein alignments. The predicted gene structures, therefore, often need to be manually changed.
The WormBase genomes, gene structures and all associated data and genomic features are held in an ACeDB database (11). This is an object-orientated database that can efficiently hold a wide variety of genomic data types. The curators view and edit the gene structures and other genomic features in the ‘feature map (FMAP)’ editor of ACeDB. The data are exported as general feature format (GFF) files with each release of the database for display on the WormBase web site, using GBrowse (12).
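As an illustration of how the exported GFF files can be consumed, the following minimal sketch reads tab-separated GFF records and keeps the CDS features. It assumes nine columns per record and leaves the attributes column unparsed, because its syntax differs between GFF versions. The names `GffFeature`, `parse_gff_line` and `cds_features` are hypothetical and are not part of any WormBase tool.

```python
from typing import Iterable, List, NamedTuple


class GffFeature(NamedTuple):
    seqid: str
    source: str
    feature_type: str
    start: int        # 1-based, inclusive
    end: int          # 1-based, inclusive
    score: str
    strand: str
    phase: str
    attributes: str   # kept as raw text; syntax depends on the GFF version


def parse_gff_line(line: str) -> GffFeature:
    """Split one GFF record into its nine tab-separated columns."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError(f"expected 9 tab-separated columns, found {len(fields)}")
    seqid, source, ftype, start, end, score, strand, phase, attributes = fields
    return GffFeature(seqid, source, ftype, int(start), int(end),
                      score, strand, phase, attributes)


def cds_features(lines: Iterable[str]) -> List[GffFeature]:
    """Return the CDS records, skipping blank lines and '#' comment lines."""
    features = []
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        feature = parse_gff_line(line)
        if feature.feature_type == "CDS":
            features.append(feature)
    return features
```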
Initial gene set
The original gene set of C. elegans was produced by using Genefinder (Green,P., unpublished data). The initial sets of genes in some of the non-C. elegans genomes in WormBase were predicted using the methods from nGASP (10), a project to find the best nematode gene prediction method. The most accurate gene finders found by nGASP were ‘combiner’ algorithms, such as Jigsaw, which made use of transcript and protein alignments and multi-genome alignments, as well as gene predictions from other gene finders.
Initially, the majority of C. elegans gene structure changes were based on the alignments of transcript data from large-scale transcriptome projects such as Yuji Kohara’s EST libraries (Kohara,Y., unpublished data) and the ORFeome project (13). This approach was taken because transcript data proved to be a rich source of evidence for correcting gene structure errors: the alignments indicate the exact intron boundaries and cover the exons.
There are also many sources of evidence for curation that do not depend on transcript data. This evidence is more indirect than transcript data and often requires a deduction of the likely structure of the genes based on weak or conflicting evidence. This non-transcript evidence includes protein alignments, mass spectroscopic peptides, conserved protein domains and homology to paralogs and orthologs. These are becoming increasingly useful in the refinement of the gene structures, especially in genes with a low level of expression that often lack transcript data. In the WormBase database release ‘WS220’, only 46.9% of the C. elegans CDS structures have coverage of every base of every exon with EST or mRNA transcript evidence and 8.8% of CDS structures have no transcript evidence at all. It is therefore often necessary to use indirect evidence to deduce the most likely structure of the roughly 53% of CDS structures that are not fully confirmed by transcript data.
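The three coverage classes used above (every exon base covered, partly covered, no transcript evidence) can be sketched as a simple interval calculation. This is only an illustration under stated assumptions: coordinates are 1-based and inclusive, strand and intron agreement are ignored, and the function names are hypothetical rather than taken from the WormBase pipeline.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # 1-based, inclusive (start, end)


def merge_intervals(intervals: List[Interval]) -> List[Interval]:
    """Merge overlapping or directly adjacent intervals."""
    merged: List[Interval] = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1] + 1:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def covered_bases(exon: Interval, merged_blocks: List[Interval]) -> int:
    """Count the exon bases overlapped by already-merged alignment blocks."""
    total = 0
    for start, end in merged_blocks:
        overlap = min(end, exon[1]) - max(start, exon[0]) + 1
        if overlap > 0:
            total += overlap
    return total


def classify_cds(exons: List[Interval], alignment_blocks: List[Interval]) -> str:
    """Classify a CDS by how much of its exonic sequence has transcript coverage."""
    merged = merge_intervals(alignment_blocks)
    exon_length = sum(end - start + 1 for start, end in exons)
    covered = sum(covered_bases(exon, merged) for exon in exons)
    if covered == exon_length:
        return "fully confirmed"
    if covered == 0:
        return "no transcript evidence"
    return "partially confirmed"
```

Merging the alignment blocks first avoids counting a base twice when several ESTs overlap the same exon.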
Supporting evidence for changes to gene structures comes from a variety of sources, which curators investigate and review while attempting to improve the gene models. Some of the major types of supporting evidence include, in roughly their order of significance for curators:
User submissions
We receive notifications from individual users that gene models need attention. These notifications come either through forms on the WormBase web site or by email to the WormBase Help email address. These suggestions are extremely useful. They contain data that might never make it into a publication or that are from an expert in a particular field.
Literature
WormBase literature curators at the California Institute of Technology flag publications where data are in conflict with current gene structures or sequence features. These are sent to sequence curators by email for examination and resolution. We encourage people to submit their sequences to public databases, such as GenBank, EMBL or DDBJ, in order to provide a public record of the evidence for any changes made. These sequences will then provide an additional means of linking a change in the gene model to the user’s publication.
Transcript data
Nematode transcript data are routinely extracted from a variety of sources. These include mRNA and EST sequences from the nucleotide databases, the ‘OST’ reverse transcription polymerase chain reaction (RT–PCR) sequences from the ORFeome project (13) and the ‘RST’ sequences, which are 5′ and 3′ RACE sequence tags. Recently, we have also been adding data from next-generation sequencing platforms such as Illumina and 454 short-read RNASeq data sets (14). The EST, OST, RST and 454 reads are aligned to the genome using BLAT (15). SAGE and TEC-RED sequences are aligned using a simple Perl string-match and the short-read RNASeq data are aligned using a mixture of MAQ (16) and cross-match (Green,P., unpublished data). Errors identified by transcript alignments are generally of four types. The first, and most obvious, is the absence of a gene model where there is a transcript alignment, which indicates a possible missing gene. The second type of error comes from the comparison of introns defined by a transcript to introns in existing gene structures. If an intron that is confirmed by a transcript does not match an intron in a gene structure, then there is probably a mistake in the gene structure or a new isoform needs to be added. The third type of error comes from the paired-end read information (5′ and 3′ reads from the same clone) of transcript sequences. For instance, the mapping of 5′ and 3′ reads of a single EST clone to different gene predictions is an indication that the two gene structures may need to be merged. Features derived from the analysis of transcript alignments, such as trans-spliced leader (TSL) sequence sites and poly-A addition sites, are also used to establish gene or isoform boundaries.
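The second type of error, a transcript-confirmed intron with no counterpart in the curated structure, can be illustrated with the sketch below. It assumes that the aligned blocks of a spliced transcript alignment are separated only by introns (in practice, small alignment gaps would have to be filtered out first) and that all coordinates are 1-based and inclusive. The function names are hypothetical.

```python
from typing import List, Set, Tuple

Interval = Tuple[int, int]  # 1-based, inclusive (start, end)


def introns_from_blocks(blocks: List[Interval]) -> Set[Interval]:
    """Derive introns as the gaps between consecutive exons or aligned blocks."""
    ordered = sorted(blocks)
    return {
        (left_end + 1, right_start - 1)
        for (_, left_end), (right_start, _) in zip(ordered, ordered[1:])
    }


def unmatched_introns(model_exons: List[Interval],
                      transcript_blocks: List[Interval]) -> List[Interval]:
    """Return transcript-confirmed introns that are absent from the gene model."""
    model = introns_from_blocks(model_exons)
    confirmed = introns_from_blocks(transcript_blocks)
    return sorted(confirmed - model)
```

An empty result means that every intron seen in the transcript is already present in the model; a non-empty result flags either an error in the structure or a candidate new isoform, which a curator would then inspect.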
Protein alignments and homology
A variety of protein databases are aligned to the genome using BLASTX (17) to assist in refining gene structures and to identify unannotated genes. These databases include UniProt (18), human proteins from the International Protein Index (19), Drosophila melanogaster proteins from FlyBase (20), yeast proteins from SGD (21) and C. elegans, C. briggsae, C. brenneri, C. japonica, Pristionchus pacificus and C. remanei proteins from WormBase (2). Alignments of C. elegans proteins are particularly useful for highlighting regions where potential exons are missing in members of a gene family. Alignments to non-elegans proteins are used to identify genes that are not currently annotated and to refine existing gene models. Comparing a gene’s structure, including the position and spacing of the introns, to that of its paralogs and orthologs is often a useful means of confirming or refuting a proposed structure. This is particularly useful when curating partially sequenced nematode genomes, which are still in contigs and so may be too short or of too low quality for the gene prediction programs to successfully determine a structure. Care has to be taken when using homology to curate a gene’s structure because nematode genes can reciprocally confirm each other’s structures, leading to the material fallacy of ‘arguing in a circle’. Many of the gene structures from other species of nematodes have been based on the structure of their C. elegans ortholog, either directly, by referring to the C. elegans gene while manually refining the structure of the gene, or indirectly, by training gene predictor programs on the C. elegans gene structures and then using these gene predictors to predict genes in other nematode species.
Repeats
The C. elegans repeat library is aligned against the genome using RepeatMasker (22), which also finds simple tandem repeats. The C. elegans repeat library has changed little in the last 4 years; however, several ‘repeat motifs’ have been removed because they actually represented common protein domains. Inverted repeat regions are found using the program ‘einverted’ from the EMBOSS project (23), and these regions aid in identifying transposons. Gene models that overlap with repeat regions are carefully inspected, as they are probably incorrect.
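Flagging gene models that overlap repeat regions amounts to an interval-overlap test. The following minimal sketch assumes that exons and repeats lie on the same sequence and use 1-based inclusive coordinates; the gene names in the usage note are invented for illustration.

```python
from typing import Dict, List, Tuple

Interval = Tuple[int, int]  # 1-based, inclusive (start, end)


def overlaps(a: Interval, b: Interval) -> bool:
    """True if the two inclusive intervals share at least one base."""
    return a[0] <= b[1] and b[0] <= a[1]


def models_overlapping_repeats(gene_models: Dict[str, List[Interval]],
                               repeats: List[Interval]) -> List[str]:
    """Return the names of gene models with any exon overlapping a repeat."""
    flagged = []
    for name, exons in gene_models.items():
        if any(overlaps(exon, repeat) for exon in exons for repeat in repeats):
            flagged.append(name)
    return sorted(flagged)
```

For example, with the hypothetical models `{"gene_A": [(100, 200)], "gene_B": [(500, 600)]}` and a single repeat at `(180, 250)`, only `gene_A` is returned for inspection.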
TSL sequence sites
These are a feature of many nematode genes, in which a 22 bp leader sequence is spliced onto the 5′-end of the transcript to form the mature mRNA. The TSL sequence sites are found by comparing the 5′-end of the transcript data for matches to the known TSL sequences and are also deduced from the trans-spliced exon-coupled RNA end determination (TEC-RED) project (24). These sites therefore indicate the 5′-end of an mRNA, though not the start site of transcription.
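The comparison of transcript 5′-ends with known leader sequences can be sketched as a suffix match, since a transcript often carries only the 3′ part of the leader. The two leader sequences below are the commonly quoted C. elegans SL1 and SL2 sequences as we recall them and should be checked against an authoritative source before use; the minimum match length of 8 bases and the function name are assumptions made for this illustration.

```python
from typing import Optional, Tuple

# Assumed leader sequences (verify before use); both are 22 bases long.
SPLICE_LEADERS = {
    "SL1": "GGTTTAATTACCCAAGTTTGAG",
    "SL2": "GGTTTTAACCCAGTTACTCAAG",
}


def find_tsl(transcript: str, min_match: int = 8) -> Optional[Tuple[str, int]]:
    """Return (leader name, matched length) if the transcript starts with the
    3' end of a known leader, otherwise None."""
    seq = transcript.upper()
    best: Optional[Tuple[str, int]] = None
    for name, leader in SPLICE_LEADERS.items():
        # Try the longest leader suffix first and stop at the first match.
        for length in range(len(leader), min_match - 1, -1):
            if seq.startswith(leader[-length:]):
                if best is None or length > best[1]:
                    best = (name, length)
                break
    return best
```

The first transcript base after the matched leader then marks the candidate TSL site once the remainder of the transcript has been aligned to the genome.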
Poly-A sites
These are found by comparing the 3′-end of those transcript data that have a poly-A tail to the genome, confirming that there is not an A-rich genomic region at that position. The poly-A site is characteristic of the end of the processed mRNA and so is a good indicator of the end of the coding gene's structure.
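The check described above, a transcript poly-A tail that is not simply copied from an A-rich stretch of the genome, can be sketched as follows. The thresholds (a tail of at least 10 bases and at most 60% A in the genomic window) are illustrative assumptions, not values used by WormBase, and the function names are hypothetical.

```python
def poly_a_tail_length(transcript: str) -> int:
    """Length of the run of A bases at the 3' end of the transcript."""
    seq = transcript.upper()
    return len(seq) - len(seq.rstrip("A"))


def is_genuine_poly_a_site(transcript: str,
                           genomic_downstream: str,
                           min_tail: int = 10,
                           max_genomic_a_fraction: float = 0.6) -> bool:
    """Accept a poly-A site only if the genome immediately downstream of the
    aligned transcript is not itself A-rich.

    genomic_downstream is the genomic sequence that follows the last aligned,
    non-tail base of the transcript, on the transcript's strand.
    """
    tail = poly_a_tail_length(transcript)
    if tail < min_tail:
        return False
    window = genomic_downstream.upper()[:tail]
    if not window:
        # No genomic sequence is available, so no genomic A-run can explain the tail.
        return True
    a_fraction = window.count("A") / len(window)
    return a_fraction <= max_genomic_a_fraction
```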
Tiling array expression data
There are data sets of tiling array expression from He et al. and Fraser et al. (unpublished data) held in the modENCODE (26) database. These are useful for indicating exons missing from the gene structures. The size of the probes used, typically 25 bp, limits resolution, and there is no indication of the strand being transcribed. They are, however, useful because libraries from different life stages or strains can indicate changes in expression over time or in different genomic environments.
Intron splice sites
The potential of each base in the genome to form a 5′ or 3′ intronic splice site has been determined using a position weight matrix (Green,P. and Hillier,L., unpublished data). Predicted gene structures that use splice sites with a poor score should be inspected because the prediction program is possibly using the nearest available splice site to splice over a region that does not allow a good gene structure. These regions can be caused by either an error in sequencing the genome or the presence of a pseudogene.
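Scoring candidate splice sites with a position weight matrix can be illustrated as a log-odds sum over the positions of the site. The matrix below is a made-up, four-position toy for a 5′ (donor) site; it is not the unpublished matrix referred to above, and a real matrix would span more positions and be estimated from confirmed introns. The sketch assumes sequences containing only A, C, G and T.

```python
import math
from typing import Dict, List, Tuple

# Toy base frequencies for the first four intron bases of a donor site
# (illustrative values only, chosen to favour GTAA).
DONOR_FREQUENCIES: List[Dict[str, float]] = [
    {"A": 0.05, "C": 0.05, "G": 0.85, "T": 0.05},
    {"A": 0.05, "C": 0.05, "G": 0.05, "T": 0.85},
    {"A": 0.60, "C": 0.10, "G": 0.20, "T": 0.10},
    {"A": 0.70, "C": 0.10, "G": 0.10, "T": 0.10},
]


def build_pwm(frequencies: List[Dict[str, float]],
              background: float = 0.25) -> List[Dict[str, float]]:
    """Convert per-position base frequencies into log2-odds weights."""
    return [
        {base: math.log2(p / background) for base, p in column.items()}
        for column in frequencies
    ]


def score_site(pwm: List[Dict[str, float]], site: str) -> float:
    """Sum the position-specific weights for one candidate site."""
    if len(site) != len(pwm):
        raise ValueError("site length must equal the matrix width")
    return sum(column[base] for column, base in zip(pwm, site.upper()))


def scan(pwm: List[Dict[str, float]], sequence: str) -> List[Tuple[int, float]]:
    """Score every window of the sequence; positions are 0-based offsets."""
    width = len(pwm)
    return [
        (offset, score_site(pwm, sequence[offset:offset + width]))
        for offset in range(len(sequence) - width + 1)
    ]
```

A gene model whose chosen donor site scores far below nearby alternatives in such a scan would be a candidate for inspection.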
Conserved genomic regions
Sequence alignments to the C. briggsae genome have been made using the WABA alignment tool (27). These conserved regions provide confirmatory information about gene structures, indicate possible missed or unannotated exons and genes and indicate the presence of conserved, non-coding sequences that might have regulatory roles. Further alignments of several orthologous Caenorhabditis loci have been made using Pecan (28).
Mass spectroscopy data
There are over 115 000 C. elegans mass spec peptides in WormBase, primarily from the MacCoss lab at the University of Washington (29) and the Hengartner lab at the University of Zurich (30). The measured masses of the peptide ions are matched to fragments of known or predicted C. elegans proteins or translated ORFs by the authors of these data. The locations of these mass spectroscopy peptides are then mapped back to the genome via their locations on the C. elegans proteins. These data match 10 965 gene loci and have been useful in confirming existing gene models. They are also useful in indicating genes that are currently curated as pseudogenes but may have some protein product. The mass spectroscopy peptide data have included alignments to 120 regions that previously had only an ab initio gene prediction with no further evidence, indicating that these predictions are likely to be real coding genes. The presence of a single mapped peptide on a curated gene or pseudogene is not absolute confirmatory evidence of a real protein product, because there appears to be a high frequency of errors in predicting these peptides.
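Mapping a peptide back to the genome via its position on a protein is a coordinate conversion through the CDS exons. The sketch below is a minimal illustration assuming 1-based inclusive genomic coordinates, amino acid positions counted from the first codon of the CDS and no partial codons at the CDS ends. The function name is hypothetical.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # 1-based, inclusive genomic (start, end)


def peptide_to_genome(exons: List[Interval], strand: str,
                      aa_start: int, aa_end: int) -> List[Interval]:
    """Map amino acids aa_start..aa_end (1-based, inclusive) of a protein to
    the genomic intervals of the CDS exons that encode them."""
    if strand not in ("+", "-"):
        raise ValueError("strand must be '+' or '-'")
    if aa_start < 1 or aa_end < aa_start:
        raise ValueError("invalid peptide coordinates")
    # Walk the exons in the direction of translation.
    ordered = sorted(exons, reverse=(strand == "-"))
    cds_from = (aa_start - 1) * 3   # 0-based CDS offset of the first peptide base
    cds_to = aa_end * 3 - 1         # 0-based CDS offset of the last peptide base
    segments: List[Interval] = []
    consumed = 0                    # CDS bases contained in the exons visited so far
    for start, end in ordered:
        length = end - start + 1
        low = max(cds_from, consumed)
        high = min(cds_to, consumed + length - 1)
        if low <= high:
            if strand == "+":
                segments.append((start + low - consumed, start + high - consumed))
            else:
                segments.append((end - (high - consumed), end - (low - consumed)))
        consumed += length
    if cds_to >= consumed:
        raise ValueError("peptide extends beyond the end of the CDS")
    return sorted(segments)
```

A peptide that spans an exon boundary is returned as two intervals, which is also a useful confirmation of the intron between them.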
Protein secretory signals and domain structure
An incomplete or fragmented protein domain in the protein product, as indicated by Pfam (31) or InterPro (32), might indicate a missing exon or incorrect splice sites. Protein secretory signals, as predicted by SIGNALP (33), in translated ORFs might indicate the start of a CDS, and these locations are generally chosen in preference to other START codons where there is uncertainty about which START codon should be used. Nagy et al. submitted valuable information on genes with incorrect structures, based on an analysis of incomplete and incongruous domains in C. elegans proteins. For example, they highlighted genes that contained obligatory extracellular domains but lacked appropriate sequence signals (signal peptide, signal anchor and transmembrane segments); without such signals, the extracellular domains would not be delivered to the extracellular space, where they are stable and properly folded.
SAGE tags
There are 449 980 SAGE tags in WormBase. These have been used to indicate regions where there could be unannotated genes and have resulted in the creation of 243 new coding sequences.
Use of indirect evidence
To give an indication of the types of additional evidence that can improve the confidence that curators have in a CDS structure, a sample of 100 of the predicted CDSs from the set of 8.8% of CDSs with no EST or mRNA transcript evidence was inspected. The 100 genes had been created from a variety of evidence: most of them (93%) were created because they had a structure predicted by at least one gene prediction program and 78% of them had support for some part of their structure from the original Genefinder prediction. The others had been created because SwissProt or WormBase protein alignments indicated a probable CDS structure. Often the predicted structures appear dubious and as much supporting evidence as possible is sought, even if the extra evidence is tenuous and would not have been used for a CDS with good EST evidence. This supporting evidence is noted along with the recorded evidence for the creation of a new structure, or when changing an existing structure to match new evidence, and it usually strongly influences the choice of which predicted or probable exons to include in a structure.
The 100 CDS structures often have conflicting structures predicted for them by the different gene prediction programs used, and it is often not obvious from these conflicting predictions which potential exons are correct or even that the region contains a gene. In these circumstances, it is useful to seek supporting evidence from orthologs or paralogs or other indications that a protein structure is conserved. Of the 100 inspected CDS structures, 22% had supporting evidence of exons from conserved coding regions found by WABA measures of conservation with the C. briggsae genome, and 81% had some SwissProt or WormBase protein alignment evidence of exons.
The 5′ and 3′ exons are often small and divergent between orthologs and are easy to get wrong in structures predicted from protein alignments. Of the 100 CDSs inspected, 10% had their 5′-end confirmed by the presence of a TSL sequence site.
When a CDS structure lacks any consistent gene predictions or has an unusual structure that makes the existence of the gene dubious, it is useful to have evidence that the region is transcribed or produces a protein product. In the absence of EST or mRNA evidence for transcription, such evidence can come from more indirect corroboration of transcription or translation, such as aligned SAGE tags or mass spectroscopy. Of the 100 CDSs inspected, 29% have some mass spectrometry evidence and 62% have SAGE evidence of transcription in the region.
Pseudogenes
There are currently 1432 pseudogenes in WormBase. Pseudogenes in WormBase are regions of the genome that resemble coding genes but are not expressed or cannot produce a successful protein product. These pseudogenes are manually curated and reviewed every few years. They are created when curators note EST alignment evidence for premature STOP codons or frameshifts in the open reading frames. Some pseudogenes have been created on the advice of experts in a particular gene family who note that the domains are incomplete or that the likely tertiary structures of the gene products are not consistent with the rest of that family. Where possible, the exonic structure of the pseudogene is curated and the parent gene of the pseudogene is noted. Some coding genes are reclassified as pseudogenes every year as new evidence for their structure is collected and it becomes evident that the curated CDS structure is not correct and no successful protein product can be made. More rarely, a pseudogene may be reclassified as a coding gene if there appears to be good mass spectrometry evidence or other evidence from the literature for the change. The criteria for deciding whether a gene is a pseudogene are not specified very well in WormBase. In general: there should be a near-duplicate coding gene that is probably the parent gene of the pseudogene, the coding frame should be disrupted or an expert should declare it to be a pseudogene. No attention is paid to whether the pseudogene has a functioning promoter or not, as promoter regions are still poorly characterized. When there is equivocal evidence for changing a coding gene into a pseudogene, the curators tend to be slightly biased against making the change. This is because making a gene into a pseudogene effectively removes it from the scrutiny that coding genes get and removes the protein product data from the database.
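One of the criteria above, a disrupted coding frame, can be checked mechanically on a spliced CDS sequence. The following minimal sketch reports a length that is not a multiple of three (a possible frameshift) and any in-frame STOP codon before the final codon. It assumes the standard genetic code and a sequence already spliced and oriented 5′ to 3′; it does not capture the other criteria (a parent gene or expert advice), and the function name is hypothetical.

```python
from typing import List

STOP_CODONS = {"TAA", "TAG", "TGA"}


def coding_frame_problems(cds: str) -> List[str]:
    """List frame disruptions found in a spliced CDS sequence."""
    seq = cds.upper()
    problems: List[str] = []
    remainder = len(seq) % 3
    if remainder:
        problems.append("length is not a multiple of three")
    usable = len(seq) - remainder
    codons = [seq[i:i + 3] for i in range(0, usable, 3)]
    # The final codon of an intact CDS is expected to be a STOP, so it is
    # skipped only when the sequence length is in frame.
    internal = codons[:-1] if remainder == 0 else codons
    for number, codon in enumerate(internal, start=1):
        if codon in STOP_CODONS:
            problems.append(f"premature stop codon {codon} at codon {number}")
    return problems
```

An empty list does not show that a gene is functional; it only means that this particular test found no disruption.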
Genomic sequence errors
Genomic sequence errors are also corrected when found. Genomic errors within genes can affect their structure, so correction is critical for accuracy. Over the years, there have been a number of changes to the underlying C. elegans genome sequence. These have usually been small indel modifications, but there have also been a number of large changes. The changes are based on reinterpretations of the original sequencing trace data, often done because there are mismatches between the genomic sequence and aligned EST sequence. Details of the genomic sequence changes can be found on the WormBase wiki pages (http://www.wormbase.org/wiki/index.php/Genome_sequence_changes).