This survey demonstrates that ~80% of orphan enzymatic activities are bona fide; therefore, we conclude that of the 1,356 putative orphans extant at the time of this study, more than 1,000 are highly likely to constitute true information deficits since their lack of sequence information is not the result of a database error.
The absence of DNA or protein sequences encoding such well-characterized enzymatic activities is particularly consequential because these activities were often identified decades ago, and many have been the focus of significant research activity (Table ). Without the cognate sequences for these activities, the quality of annotation of all sequenced genomes in terms of both coverage (fraction of genes that can be recognized) and accuracy (fraction of predicted gene functions that are correct) is diminished. Many of these activities may go for years without being sequenced – for example, 1-acylglycerophosphocholine O-acyltransferase (Table ) was finally purified and sequenced nearly forty years after it was first characterized [10]. Perhaps more troubling is the unknown pool of "false positive" annotations. Phosphogluconate 2-dehydrogenase (Table ), an orphan at the time of this analysis, has since been assigned to a sequence in the human genome with no experimental evidence linking it to that or any homologous sequence, but apparently instead on the basis of the gene in question already being assigned a similar activity. This kind of "hidden orphan" would have been missed by most orphan analyses, and can be expected to propagate a potentially incorrect assignment to other genomes in the future. Computational metabolic pathway prediction [11] and metabolic engineering also depend on sequence information and are thus similarly compromised.
Conversely, ~20% of orphans surveyed were observed to be artifacts, such that ~270 of the 1,356 putative orphans examined should be resolvable entirely via literature research and database cleanup. In carrying out this process on our sample of orphans, we have reported 11 artifactual orphan activities to public sequence repositories for correction (see Table for examples).
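The extrapolation from the sampled fractions to the full set of 1,356 putative orphans is simple arithmetic. The minimal sketch below reproduces it using the rounded ~80% and ~20% figures quoted in the text; the function name and the use of rounded fractions are illustrative assumptions, not the exact computation performed in the survey.

```python
def extrapolate(total, fraction):
    """Scale a sampled fraction up to the full population (nearest integer)."""
    return round(total * fraction)


TOTAL_PUTATIVE_ORPHANS = 1356  # putative orphans at the time of the study
BONA_FIDE_FRACTION = 0.80      # approximate value quoted in the text
ARTIFACT_FRACTION = 0.20       # approximate value quoted in the text

bona_fide = extrapolate(TOTAL_PUTATIVE_ORPHANS, BONA_FIDE_FRACTION)
artifacts = extrapolate(TOTAL_PUTATIVE_ORPHANS, ARTIFACT_FRACTION)

print(f"Estimated bona fide orphans: {bona_fide}")
print(f"Estimated artifactual orphans: {artifacts}")
```

With these rounded inputs the estimates are consistent with the "more than 1,000" and "~270" figures given above.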
In addition to validating orphans, the survey was useful in capturing information from the literature to assess their salvageability: more than half of validated orphans were found to be salvageable (Figure ). Examples of salvageable orphan activities with the traits that make them salvageable are listed in Table .
As abundantly noted elsewhere, such database cleansing is essential to maximize the existing research investment and prevent the propagation of mistakes [12] (see Table for examples of artifacts that have been resolved). This necessity has not eluded the field of enzymology [3], and the present survey demonstrates the usefulness of correlating biological databases and mining the literature to enhance the value of existing research and facilitate the identification of the remaining orphan-associated genes. Until recently, there were no general repositories of orphan activity data, although some species-specific databases and pages were maintained, such as EchoBase [17] and a web page listing unidentified E. coli enzymes maintained by the EcoCyc project [18]. Consequently, we updated the MetaCyc [19] database to identify reactions that have been analyzed by this survey, and annotated them and associated database objects with results such as the validity of their orphan status, links to their cognate protein in the case of artifacts, and the properties of the protein copurifying with the activity in the case of validated orphans. Recently, Lespinet and Labedan created ORENZA [20], a database dedicated to maintaining an up-to-date listing of all enzyme activities for which no sequences are available in major sequence databases [6]. We are contributing our updated orphan information to ORENZA as well. These data, captured in MetaCyc and ORENZA, should facilitate the work of enzymologists interested in identifying the cognate genes of orphan activities. For instance, the work of Melnick et al. is an excellent example of the combined application of modern laboratory and bioinformatics techniques that would benefit from the data described here.
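A listing of activities for which no sequences are available can be thought of as a set difference between the catalog of known activities and the activities annotated on sequences. The sketch below illustrates that idea only; the function name, the data layout, and all identifiers are hypothetical placeholders, and it does not represent how ORENZA or MetaCyc is actually built.

```python
def find_putative_orphans(known_activities, sequence_annotations):
    """Return activities that are not annotated on any sequence.

    known_activities: iterable of activity identifiers (hypothetical).
    sequence_annotations: dict mapping a sequence identifier to the set
        of activities annotated on that sequence (hypothetical layout).
    """
    # Collect every activity that appears on at least one sequence.
    with_sequence = set()
    for activities in sequence_annotations.values():
        with_sequence.update(activities)
    # Anything in the catalog but not on a sequence is a putative orphan.
    return sorted(set(known_activities) - with_sequence)


# Hypothetical placeholder data, for illustration only.
known_activities = ["activity-A", "activity-B", "activity-C", "activity-D"]
sequence_annotations = {
    "seq-001": {"activity-A"},
    "seq-002": {"activity-A", "activity-C"},
}

putative_orphans = find_putative_orphans(known_activities, sequence_annotations)
print(putative_orphans)
```

A list produced this way contains putative orphans only. As the survey indicates, a fraction of such entries are artifacts that must be checked against the literature, and "hidden orphans" annotated without experimental evidence would not appear in it at all.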
Example of artifactual orphans resolved by this survey
Several proposals have been made recently aimed at producing a complete catalog of biochemical activities, biological functions, and their cognate genes [2]. Many of these proposals recommend that such a project begin with prokaryotes because of the general ease of gene cloning from these species [1]. Indeed, our data support this notion, as we find substantially more orphans with a salvageability ranking of "good" and "excellent" in prokaryotes as compared to eukaryotes. The availability of a comprehensive review of the problem achieved by this survey, combined with broad genomic sequencing and powerful computational tools, leads us to conclude that the field is in an excellent position to rectify the information gap associated with the orphan activity phenomenon.