The Otterlace annotation client runs on a local machine and downloads all of its data from the WTSI web server. The genomic region being annotated is stored in a persistent annotation session directory on the user's computer, which can be recovered following system reboots. Annotation actions require only occasional network access, so the system is tolerant of interruptions to network connectivity.
The genomic sequence is run through an analysis pipeline that consists of homology searches, gene predictions and de novo sequence analysis. The pipeline includes BLASTX against SwissProt and TrEMBL proteins, BLASTN against ESTs and vertebrate mRNAs, Tandem Repeats Finder, and Augustus (31) and Genscan (32) gene predictions. The results are displayed in the ZMap graphical interface (Figure 1B). ZMap is written in the C programming language to give good drawing performance, and it uses threading to load multiple datasets simultaneously, resulting in much faster startup times.
Figure 1. A selection of different views of Otterlace and ZMap. (A) Assembly sequence chooser showing user’s email displayed on locked clones. (B) ZMap view of the results of pipeline analysis, namely EST (in purple) and vertebrate mRNA (in brown) homology ...
Large-scale data analyses, such as searches of mRNA libraries against the whole genome, are performed on WTSI systems, served by Otter CGI scripts, and presented in ZMap on the client, where they can then be used to construct the annotation. Additional sources of evidence, such as BAM files on FTP or web servers anywhere in the world, can be configured on the server and then loaded into ZMap for display. As many of these data sources can be very large, ZMap allows the annotator to choose which tracks are loaded and how much of each track is loaded.
Access to the Otter system is restricted to authorized users. External annotators register themselves with the WTSI SingleSignOn system, using their email address at their institute. This takes care of authentication. Access to each species (authorization) is controlled via a configuration file that lists the annotators' email addresses and is administered by the Otter support staff at WTSI.
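The separation between authentication and per-species authorization can be illustrated with a minimal sketch. The table layout, the species keys, the email addresses and the function name `is_authorized` are all hypothetical; the actual format of the Otter configuration file is not described here.

```python
# Hypothetical access table: species dataset -> authorized email addresses.
# This is an illustrative stand-in, not the real Otter configuration format.
SPECIES_ACCESS = {
    "mouse": {"annotator@example.org"},
    "pig": {"annotator@example.org", "visitor@example.edu"},
}


def is_authorized(email, species, access=SPECIES_ACCESS):
    """Return True if an already-authenticated user may annotate `species`."""
    allowed = access.get(species, set())
    # Compare case-insensitively so "User@Example.org" matches "user@example.org".
    return email.strip().lower() in {address.lower() for address in allowed}
```

In this sketch a user who has authenticated successfully but is not listed for a species is simply refused access to that species.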
Users save annotation back to the master Otter annotation database. Since it contains a relatively small quantity of valuable data, this database is carefully and frequently backed up. Saving edits to genes does not delete old versions, but writes new versions of genes into the Otter database. It is therefore possible to recover old versions of genes if mistakes are made. The author of any change to a gene or transcript is recorded, so it is possible to track who has edited what. Unchanged transcripts keep their author, but changed transcripts are given the new author, and the author of the parent gene changes to the new author too. The system tracks changes to genes and transcripts via their stable identifiers, which are also shown on the VEGA (33) and Ensembl websites. These stable identifiers remain attached to each version of the genes and transcripts stored in the database, and are independent of any changes to their names.
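The versioning and authorship rules described above can be sketched as an append-only store. This is an illustration under stated assumptions, not the Otter database schema: the class names, the in-memory dictionary and the representation of a transcript as a tuple of exon coordinates are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Transcript:
    stable_id: str
    exons: tuple  # tuple of (start, end) pairs
    author: str


@dataclass
class GeneVersion:
    stable_id: str
    name: str
    author: str
    transcripts: list
    version: int = 1


class GeneStore:
    """Append-only store: saving adds a version and never deletes old ones."""

    def __init__(self):
        self._history = {}  # gene stable_id -> list of GeneVersion, oldest first

    def save(self, stable_id, name, transcripts, editor):
        """Save a new gene version; `transcripts` is a list of (id, exons)."""
        history = self._history.setdefault(stable_id, [])
        previous = {}
        if history:
            previous = {t.stable_id: t for t in history[-1].transcripts}
        saved, changed = [], False
        for transcript_id, exons in transcripts:
            old = previous.get(transcript_id)
            if old is not None and old.exons == exons:
                saved.append(old)  # unchanged transcript keeps its author
            else:
                saved.append(Transcript(transcript_id, exons, editor))
                changed = True
        # A changed transcript also passes the new author up to the parent gene.
        if changed or not history:
            gene_author = editor
        else:
            gene_author = history[-1].author
        version = GeneVersion(stable_id, name, gene_author, saved, len(history) + 1)
        history.append(version)
        return version

    def versions(self, stable_id):
        """Return every stored version of a gene, oldest first."""
        return list(self._history.get(stable_id, []))
```

Because versions are keyed by stable identifier rather than by name, renaming a gene in this sketch adds a new version to the same history instead of starting a new one.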
Locks are used to prevent more than one annotator making changes to the same region of the genome (Figure 1A). Existing genes that are not contained entirely within the region being annotated cannot be edited in the Otterlace session and appear ‘greyed out’ (Figure 1C).
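The two rules in this paragraph, exclusive locking of a region and editability only for genes lying wholly inside it, can be sketched as follows. Representing a lock as a chromosome name plus inclusive coordinates is an assumption made for illustration; the names `RegionLocks` and `fully_contained` are hypothetical and do not come from the Otterlace code.

```python
def fully_contained(gene_start, gene_end, region_start, region_end):
    """Return True if the gene lies entirely within the annotated region."""
    return region_start <= gene_start and gene_end <= region_end


class RegionLocks:
    """Minimal lock table that refuses overlapping locks on a chromosome."""

    def __init__(self):
        self._locks = []  # list of (chromosome, start, end, owner)

    def acquire(self, chromosome, start, end, owner):
        for chrom, locked_start, locked_end, holder in self._locks:
            # Two inclusive intervals overlap if each starts before the other ends.
            if chrom == chromosome and locked_start <= end and start <= locked_end:
                raise RuntimeError(f"region already locked by {holder}")
        self._locks.append((chromosome, start, end, owner))

    def release(self, chromosome, start, end, owner):
        self._locks.remove((chromosome, start, end, owner))
```

A gene for which `fully_contained` returns False would be the one displayed greyed out and treated as read-only in this sketch.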
Quality control in Otterlace
The Otterlace client performs a number of quality and sanity checks as genes and transcripts are built by the annotator. The names of transcripts with problems are highlighted in red in the session window, and a ‘tool tip’ gives a brief description of the problem when the annotator mouses over the transcript name (Figure 1D). The transcript editing window shows the 2 bp in the intron immediately adjacent to each exon, and colours them green if they match a splice consensus and red if they do not (Figure 1E). Introns are checked to make sure that they are not too short. When present, the protein translation is checked for internal stop codons and completeness, and the transcript is checked to ensure that it is not subject to NMD (34) or, if it is subject to NMD, that it has been correctly flagged. The format of the transcript name is checked to ensure that it conforms to an approved naming convention. Transcripts must have evidence attached (accessions of the nucleotide or protein sequences used to build them), and more than one transcript in the same gene cannot share the same evidence. The locus must have the full name associated with the gene symbol added in the Full name field. A vocabulary of attributes, which can be attached to transcripts or loci, is provided to avoid keying errors, and these appear in the transcript window with green shading (Figure 1E).
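Three of the checks described above (splice consensus, minimum intron length and internal stop codons) can be sketched in a few lines. The sketch rests on assumptions not stated in the text: only the canonical GT donor and AG acceptor dinucleotides are treated as consensus, the 30 bp minimum intron length is an arbitrary placeholder, exons are given as 0-based half-open coordinates on the forward strand, and all function names are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
MIN_INTRON_LENGTH = 30  # placeholder threshold, not the value used by Otterlace


def intron_flanks(sequence, exons):
    """Yield (donor, acceptor, length) for each intron between sorted exons."""
    for (_, left_end), (right_start, _) in zip(exons, exons[1:]):
        donor = sequence[left_end:left_end + 2]         # first 2 bp of the intron
        acceptor = sequence[right_start - 2:right_start]  # last 2 bp of the intron
        yield donor, acceptor, right_start - left_end


def check_introns(sequence, exons):
    """Return a list of human-readable problems found in the introns."""
    problems = []
    for number, (donor, acceptor, length) in enumerate(intron_flanks(sequence, exons), 1):
        if donor.upper() != "GT":
            problems.append(f"intron {number}: non-consensus donor {donor}")
        if acceptor.upper() != "AG":
            problems.append(f"intron {number}: non-consensus acceptor {acceptor}")
        if length < MIN_INTRON_LENGTH:
            problems.append(f"intron {number}: too short ({length} bp)")
    return problems


def internal_stops(cds):
    """Return indices of in-frame stop codons other than the final codon."""
    codons = [cds[i:i + 3].upper() for i in range(0, len(cds) - 2, 3)]
    return [i for i, codon in enumerate(codons[:-1]) if codon in STOP_CODONS]
```

A transcript for which `check_introns` returns a non-empty list, or `internal_stops` returns any index, corresponds to one whose name would be highlighted in red in the session window.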
This integrated QC within Otterlace proved a valuable tool for external annotators as it flags errors as they occur and reduces the need for QC by Havana annotators. For the Blessed annotator model, owing to the extended training period, there is minimal manual QC over a period of several years for several thousand genes. However, for the Gatekeeper annotator model, the manual QC is much more extensive because the annotators' training period is much shorter. Thus, this model requires more frequent input by professional annotators but over a shorter timescale compared to the KOMP and NorCOMM projects. The annotators were all trained with reference to the Havana team annotation guidelines (35), which was very important for assuring the quality of the annotation.
The annotation for the KOMP and NorCOMM projects took advantage of the customized software features that were already available for the EUCOMM project (25), in particular identifying critical exons and making knock-out constructs. The number of genes targeted for annotation is 5000 for KOMP and 500 for NorCOMM, and they are complementary to the EUCOMM project. This Blessed annotation makes use of the full complement of biotypes that are available within Otterlace, and is integrated into the gene set for mouse that is available from the VEGA website. Gene targets for knockouts are identified from Ensembl predictions. Figure 2 gives an example of the importance of manual annotation for this project.
Figure 2. An example of manual annotation in mouse to identify a critical exon. (A) Dnhd1 is a KOMP target gene that is automatically chosen to create a knockout from the Ensembl prediction. A ZMap view of the Dnhd1 gene manually annotated in mouse. The Ensembl ...
The IRAG project has ~30 external annotators working through a list of ~1700 genes. For the pig project a condensed version of the biotypes was used because of the dearth of sequence evidence available for pig and the lower quality of the genome sequence. Because few of the pig mRNA and SwissProt entries required to assign a coding locus the Known_CDS biotype are available, many more loci were annotated as Novel_CDS from cross-species mRNA evidence. Working with unfinished genomic contigs was a challenge for both the software and the annotators, because for high-quality finished genomes, such as human, the annotation is added to finished BAC sequences. For the pig autosomes many BACs consist of several, often unordered, contigs that are not finished to a high quality. shows an example of how manual annotation can assist in assessing the quality of a genome assembly.
Figure 3. An example of manually annotated genes viewed in ZMap and also displayed as a DAS track in Ensembl. (A) ZMap view of copies of the REG3G gene in pig. The automated Ensembl track predicts one copy of the gene, whilst the manual annotation can resolve two ...
In order to find genuine deletions and duplications of pig genes relative to the human genome, a high-quality genome is required. The current pig assembly, 9.2, is thought to be missing ~10% of the genome. The process of gene annotation identifies assembly and sequencing errors, but as full finishing will only be performed on the X chromosome, it is unlikely that these errors will be resolved under current plans.
Despite the concerns about the quality of the genome, with reference to high-quality manual annotation, the group has already identified at least 12 genes that show genuine duplication, for example the REG3A gene. Genes that are thought to be absent in the swine genome will be re-assessed when the new genome build is available to ensure that they are not artefactual deletions.
The HUGO Gene Nomenclature Committee (HGNC) (36) naming convention for pig genes orthologous to human genes was used whenever possible, and the Havana naming convention for potentially duplicated/similar genes was followed (see guidelines). The KOMP, NorCOMM and IRAG projects are ongoing, and the numbers of de novo genes annotated to date are 1876, 378 and 1276, respectively. The full swine genome is not available in VEGA, so in order to view the manual annotation a DAS track for Havana Pig manual annotations, called ‘havana_pig’, is available in Ensembl and can be found from the DAS source http://das.sanger.ac.uk/das/havana_pig. An example of this can be seen in .
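For readers who wish to query the track directly, the following minimal sketch builds a request URL for a region of a chromosome. It assumes that the source implements the standard DAS `features` command with a `segment` parameter of the form `reference:start,stop`; the function name and the example coordinates are hypothetical, and the sketch only constructs the URL without contacting the server.

```python
from urllib.parse import urlencode

# DAS source given in the text.
DAS_SOURCE = "http://das.sanger.ac.uk/das/havana_pig"


def das_features_url(chromosome, start, end, source=DAS_SOURCE):
    """Build a DAS 'features' request URL for one genomic segment."""
    segment = f"{chromosome}:{start},{end}"
    # Keep ':' and ',' unescaped so the segment stays readable in the URL.
    query = urlencode({"segment": segment}, safe=":,")
    return f"{source}/features?{query}"
```

For example, `das_features_url("6", 1000, 2000)` yields a URL ending in `features?segment=6:1000,2000`.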
Figure 4. Unordered contigs on pig chromosome 6 viewed in ZMap. The annotation of the CRISPLD2 gene shows clearly how the annotation highlights the fragmented nature of the assembly and aids in identifying the correct contig ordering. The vertebrate mRNA homology ...