Subsystems: an Overview
It is commonly held that one central role of bioinformatics is to project a relatively small set of assertions of gene and protein function from the literature (i.e., from wet lab characterizations) to genes from other genomes. This captures a kernel of truth (that, ultimately, new assertions of function are based on wet lab characterizations), but, perhaps, elevates the role of bioinformatics beyond what is reasonable to expect. In contrast, we view projection as a 2-step process:
1. In an initial stage, an expert in a biological topic integrates what is known from the literature producing a set of expert assertions, which include the assertions from the literature, as well as a far broader set based on judgement and extrapolation.
2. Bioinformatics tools are developed to project structured collections of expert assertions (rather than just the wet lab results captured in the literature) to new genomes.
The process of integrating what is known from the literature into a set of expert assertions involves highly complex decisions and is well beyond most of the common bioinformatics tools. On the other hand, there is every reason to believe that fully automated tools can be developed to project these expert assertions. The more comprehensive and well structured the collection of expert assertions, the more rapidly accurate projection technology will be developed. Here it is worth noting that we speak of "well-structured" sets of expert assertions, since the developed tools will almost certainly need to encapsulate numerous rules covering special cases, and a careful delineation of these rules can best be achieved by domain experts.
One technology for creating and maintaining expert assertions was developed within the context of
The Project to Annotate 1000 Genomes [
10].
This technology involves an expert curator defining a subsystem as a set of abstract functional roles. Figure shows a very simple case in which a subsystem named "Tricarballylate Utilization" is composed of four functional roles. The subsystem is populated by connecting these functional roles to specific genes in specific genomes, producing a subsystem spreadsheet, where each row represents one genome and each column corresponds to one functional role as shown in Figure . The proteins encoded by the genes in one column are used to construct the subsystem-based FIGfams (discussed below). The cooperative effort to develop subsystems has produced a publicly available set of such populated subsystems that now includes over 600 subsystems. These subsystems include assertions of function for well over 500,000 protein-encoding genes in over 500 bacterial and archaeal genomes (relating to over 6200 functional roles). This manually curated collection represents sets of co-curated protein families. While it is true that the quality of the assertions varies substantially, it is also true that these structured sets of assertions represent a major resource in constructing automated annotation systems.
FIGfams: Yet Another Set of Protein Families
A number of groups have spent substantial effort building protein families that now represent resources that are widely used and valued by the community [
11-
15]; see [
16] for a more extended discussion. RAST utilizes a new collection of protein families. This collection is referred to as the set of
FIGfams, and the publication of a detailed account of them is in preparation. Each FIGfam may be thought of as a 3-tuple composed of a set of proteins, a family function, and a decision procedure. The set of proteins are believed to be globally similar (and, presumably, homologous) and the members all share a common function. The decision procedure takes as input a protein sequence and returns a decision about whether or not the protein could be added to the family (i.e., whether or not the protein is globally similar to the members and shares the common function).
Hence, the basic principles underlying FIGfams are quite similar to those corresponding to the lowest-level PIR families [
17] or the TIGRfam
equivalogs [
15].
The construction of FIGfams is done conservatively: care is taken to make sure that two proteins included in the same set actually do share a common function, but if substantial uncertainty exists about whether or not two proteins actually share the same function they are kept in distinct families. Two proteins will be placed in the same family:
1. If both occur in the same column of a manually curated subsystem spreadsheet (i.e., if they implement the same functional role) and the region of similarity shared by the two sequences covers over 70% of each sequence.
2. If they come from closely related genomes (e.g., genomes from two strains of the same species), the similarity is high (usually greater than 90% identity), and the context on the chromosome (i.e., the adjacent genes) can easily be seen to correspond, then they can be placed in the same family (even if the function they implement is yet to be determined).
These are the two cases in which we feel confident in asserting a common function between two proteins; the first reflects an expert assertion, and the second an instance in which divergence is minimal. Construction of FIGfams using these two grouping principles has led to a collection of about 17,000 FIGfams that include proteins related to subsystems (those are the FIGfams that we call subsystem-based) and over 80,000 that contain only proteins grouped using the second principle (i.e. the non-subsystem-based FIGfams). Many of the non-subsystem-based FIGfams contain just 2, 3 or 4 proteins.
Over time we expect to coalesce the non-subsystem-based FIGfams. This will be done by creating new, manually curated subsystems; these will form kernels of new families that will group the isolated families that now exist.
It is worth noting that the existing collection of FIGfams covers most of the central cellular machinery with families derived from subsystems, and the numerous small non-subsystem-based FIGfams efficiently support recognition of genes in close strains. While it is true that we cover a limited percentage of genes in newly sequenced divergent genomes, we recognize well over 90% of the genes in newly sequenced strains that are close to existing annotated genomes. It seems likely that a large percentage of newly sequenced genomes will be close to existing genomes (e.g., note projects to sequence tens and soon hundreds of closely related pathogenic strains), and the FIGfams already constitute an effective recognition framework in such cases.
The Basic Steps in Annotating a Genome Using RAST
The basic steps used to annotate a genome using RAST are described in the subsections below. Input to the process is a prokaryotic genome in the form of a set of contigs in FASTA format. As described below, the actual RAST server will allow a user to specify a set of gene calls, but in the usual case RAST will make its own calls. We now describe the basic steps in a RAST annotation in detail.
Call the tRNA and rRNA genes
We use existing tools built by other research teams to first identify both the tRNA and rRNA encoding genes. For the tRNA genes we use tRNAscan-SE [
18] and to identify the rRNA encoding genes we use a tool " search_for_rnas" developed by Niels Larsen (available from the author). We begin the process by calling these genes, which we believe can be reliably determined. Then, the server will not consider retaining any protein-encoding gene that significantly overlaps any of these regions. Unfortunately, the public archives do contain putative protein-encoding genes that are embedded in rRNAs. These gene calls are almost certainly artefacts of the period in which groups were learning how to develop proper annotations, and RAST attempts to avoid propagating these errors.
Make an Initial Effort to Call Protein-Encoding Genes
Once the tRNA and rRNA gene-encoding regions are removed from consideration, we make an initial call using GLIMMER2 [
19]. At this point we are seeking a reasonable estimate of probable genes, and GLIMMER2 is an excellent tool for that purpose. At this stage, RAST is not concerned about calling spurious genes or getting starts called accurately. What is needed is that most of the actual protein-encoding genes are represented in the initial estimate of
putative genes.
Establishing Phylogenetic Context
Once an initial set of protein-encoding genes has been established, we take representative sequences from a small set of FIGfams that have the property that they are universal or nearly universal in prokaryotes. This set includes, for example, the tRNA synthetases.
Using this small set of representatives we search the protein-encoding genes from the new genome for occurrences of these FIGfams. It should be noted that this is a very rapid step, since only the new genome is being searched, and it is being searched using a small set of representative protein sequences. The outcome of this initial scan is a small set (normally, 8–15 genes) that can be used to estimate the closest phylogenetic neighbours of the newly-sequenced genome. This can be done by taking each located gene and blasting it against the genes from the corresponding FIGfam. Normally, we attempt to locate the ten closest neighbours, but clearly the approach is insensitive to the exact number sought. For each detected gene, we adjust its starting position and move it from the set of putative genes to a set of determined genes and the function (i.e., product name) assigned to the gene is taken from the FIGfam.
A Targeted Search Based on FIGfams that Occur in Closely Related Genomes
Once the "neighbouring genomes" have been determined, we can form the set of FIGfams that are present in these genomes. This constitutes a set of FIGfams that are likely to be found in the new genome. For each of these FIGfams, we search the new genome. Note that we expect these searches to have a relatively high rate of success. Whenever we do find a gene, we adjust its starting position and move the gene from the set of putative genes to the set of determined genes. The computational costs required to locate these genes are low (since we are searching a very small set of putative genes).
Recall Protein-Encoding Genes
At this point, we have accumulated a set of determined genes within the new genome and can now use this excellent training set to recall the protein-encoding genes. In the case of a genome that is a closely related strain of one or more existing genomes, this training set may well include over 90% of the actual protein-encoding genes.
Processing the Remaining Genes Against the Entire FIGfam Collection
The putative genes that remain can be used to search against the entire collection of FIGfams. This is done by blasting against a representative set of sequences from the FIGfams to determine potential families that need to be checked, and then checking against each family. While computationally more expensive than the focused searches in the previous steps, it is still far, far cheaper than blasting against a large non-redundant protein database. Currently, the collection of representative protein sequences from FIGfams used to compute potentially relevant FIGfams includes somewhat over 100,000 protein sequences.
This step amounts to a comprehensive search of the FIGfams for each of the remaining putative genes. Once it has been completed, all of the genes that could be processed using FIGfams have been processed.
Clean Up Remaining Gene Calls (Remove Overlaps and Adjust Starting Positions)
The putative proteins that remain are processed to attempt to resolve issues relating to overlapping gene calls, starts that need to be adjusted, and so forth. In the case of the RAST server, we do blast the remaining putative genes against a large non-redundant protein database in an attempt to determine whether there is similarity-based evidence that could be used in resolving conflicts.
Process the Remaining, Unannotated Protein-encoding Genes
At this point, final assignments of function are made to the remaining putative genes. If similarities were computed in the preceding step, these similarities can be accessed and functions can be asserted. Optionally, one can employ any of the commonly employed pipeline technologies to run a suite of tools and produce a more accurate estimate. The genes processed using this approach represent most of the overhead in a RAST annotation. By first processing a majority of the genes using FIGfam-based technology and focused searches, this cost is minimized by RAST without (we believe) reducing accuracy.
Construct an Initial Metabolic Reconstruction
Once assignments of function have been made, an initial metabolic reconstruction is formed. For our purposes, this amounts to connecting genes in the new genome to functional roles in subsystems, determining when a set of connections to a specific subsystem are sufficient to support an active variant of the subsystem, and tabulating the complete set of active variants. Since the subsystems themselves are arranged in crude categories reflecting basic divisions of function, we can produce a detailed estimate of the genome contents that got successfully connected to subsystems (see Figure ). In the case of a genome like Buchnera aphidicula, in excess of 82% of the genes fall in this category; for Escherichia coli O157:H7 the percentage drops to 76%, while in a relatively diverged genome like Methanocaldococcus jannaschii DSM 2661 the percentage that can be connected (at this point in time) is only 22%. Figure offers a brief overview of the type of display a user can employ to quickly explore the contents of the new genome.
It should be emphasized that the subsystems cover all modules of cellular machinery – not just the metabolic pathways. Hence, what we are calling a
metabolic reconstruction (a collection of the active variants of subsystems that have been identified) is more properly thought of as a grouping of genes into modules, rather than the reconstruction of the metabolic network. However, besides simply compiling the set of active variants of subsystems, the RAST server uses a set of
scenarios encoded in metabolic subsystems to assemble a metabolic reaction network for the organism [
20]. These scenarios represent components of the metabolic network in which specific compounds are labelled as inputs and outputs (i.e., they may be thought of as directed modules of the metabolic network). The metabolic network is assembled using biochemical reaction information associated with functional roles in subsystems to find paths through scenarios from inputs to outputs. Scenarios that are connected by linked inputs and outputs can be composed to form larger blocks of the metabolic network, spanning processes that convert transported nutrients into biomass components. In the case of newly sequenced genomes that are close to those our team manually curates, it is possible to directly estimate what percent of the reaction network typically included in a genome-scale metabolic reconstruction [
21] can be generated automatically. Today the RAST server produces 70–95% of the reaction network, depending on the specific species and genome.