Providing high-level views of attributes inferred from genome sequences
Genome Properties is a collection of definitions for the higher-level attributes that may be ascribed to a species when a sufficient set of molecular markers is detected in its genome, or else reported jointly absent (12). For example, if all enzymes listed as essential markers of biotin biosynthesis are detected (by HMMs in the TIGRFAMs database), then the Genome Property ‘biotin biosynthesis’ is set by rule to ‘YES’. Assertions made by Genome Properties are useful for summarizing high-level traits of species biology from genome analysis, for understanding metabolic context when interpreting the roles of other proteins from the same species, and for comparative genomics based on whole biological processes rather than single genes.
Genome Properties entries describe subsystems
The linkage between TIGRFAMs and Genome Properties is paralleled in a kindred effort, the SEED from the Fellowship for the Interpretation of Genomes (FIG) (13). FIG selected the term ‘subsystem’ to describe the aggregate biological role that a set of marker proteins working in concert enables. We support usage of that term, using ‘subsystem’ here to refer in general to any emergent property in species biology that is understood more clearly by viewing the collection of component genes together rather than separately. We use the term ‘Genome Property’ for any subsystem described as an entry in the Genome Properties database.
Metabolic reconstruction is based on evidence rather than annotation
Many sequences in public protein sequence databases are annotated incorrectly as biotin synthase (EC 2.8.1.6), to give one example, through errors of transitive annotation, in part because biotin synthase was one of the first radical SAM superfamily (14) members sequenced, characterized and named (15). In fact, misannotations that attach overly specific functions are a particularly abundant class of error (16). Any metabolic reconstruction system that takes annotations at face value runs a risk of going badly astray. Genome Properties follows the principle that whether or not a genome encodes a subsystem should be determined by evidence sufficient to drive annotation, not by pre-existing annotations of uncertain provenance. Following this principle means that creating a new Genome Property often necessitates building new entries for the TIGRFAMs database of HMMs, or else confirming that existing models in either TIGRFAMs or Pfam behave suitably.
Genome Properties emphasizes non-pathway subsystems
Genome Properties includes simple biochemical pathways, of course, but describes many additional types of subsystem. When Genome Properties describes a subsystem whose core is an enzymatic pathway, it often represents non-enzymatic proteins such as transcriptional regulators and molecular chaperones as additional components. Computation of a Genome Property may depend on a protein, a feature within a protein (such as a sorting signal), or a genetic element that does not even code for a protein, such as an array of CRISPR repeats or the selenocysteine tRNA. Genome Properties also describes physical complexes such as transporters and flagella, and systems in which one protein acts upon another, including secretion systems, protein-sorting systems, and post-translational modification systems.
Methanobactins are generating considerable interest because neither their diversity nor their biosynthesis is well understood yet (17). The very small size of the only known class of precursor peptide has meant missed gene calls (at least initially) in every species so far with an example of the system (originally just Azospirillum sp. B510 and Methylosinus trichosporium), but now including Gluconacetobacter oboediens sp. SXCC-1, Tistrella mobilis KA081020-065, Pseudomonas extremaustralis sp. SC2. Model TIGR04071 for the methanobactin precursor family belongs to a Genome Property, GenProp0962, which explains that the gene should occur in the context of two other, larger molecular markers, members of families TIGR04159 and TIGR04160. These two companion proteins are much easier to find when prospecting in newly sequenced genomes for new examples of methanobactin-like natural products.
The architecture of Genome Properties
Each Genome Property is described completely by records in a series of tables stored in a relational database. The tables are now made available for download by FTP through the TIGRFAMs/Genome Properties web pages, at ftp.jcvi.org/pub/data/TIGRFAMs/GEN_PROP. The table structure and meanings of all fields are described in the release notes. The logic of the table structure is discussed below.
Each Genome Property has a basic definition in the prop_def table that includes a name (e.g. ‘urease’), a unique accession (e.g. ‘GenProp0051’), a paragraph-long description, and some ancillary information. Because the first subsystems described in Genome Properties were simple enzymatic pathways, components are referred to in relational database tables as ‘steps.’ Thus, the components that belong to a given Genome Property are enumerated in the prop_step table. Note that one enzymatic activity, a single entity in the typical pathway definition, will be represented by multiple components if the enzyme is composed of multiple subunits.
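As an illustration of how prop_def and prop_step relate, the following minimal sketch builds a simplified in-memory version of the two tables. Only the table names, the accession and the property name come from the text; the column names, the description text and the step rows are hypothetical placeholders, and the actual fields are those documented in the release notes.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with two simplified, hypothetical tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE prop_def (
            prop_acc    TEXT PRIMARY KEY,   -- unique accession, e.g. 'GenProp0051'
            name        TEXT NOT NULL,      -- short name, e.g. 'urease'
            description TEXT                -- paragraph-long description
        );
        CREATE TABLE prop_step (
            step_id   INTEGER PRIMARY KEY,
            prop_acc  TEXT NOT NULL REFERENCES prop_def(prop_acc),
            step_name TEXT NOT NULL         -- one row per component ('step')
        );
        """
    )
    conn.execute(
        "INSERT INTO prop_def VALUES (?, ?, ?)",
        ("GenProp0051", "urease", "placeholder description"),
    )
    # Invented rows: a multi-subunit enzyme contributes one step per subunit.
    conn.executemany(
        "INSERT INTO prop_step (prop_acc, step_name) VALUES (?, ?)",
        [
            ("GenProp0051", "hypothetical subunit A"),
            ("GenProp0051", "hypothetical subunit B"),
        ],
    )
    return conn


def steps_for(conn, prop_name):
    """Return the step names enumerated for the property with this name."""
    rows = conn.execute(
        """
        SELECT s.step_name
        FROM prop_def AS d
        JOIN prop_step AS s ON s.prop_acc = d.prop_acc
        WHERE d.name = ?
        ORDER BY s.step_id
        """,
        (prop_name,),
    ).fetchall()
    return [row[0] for row in rows]
```

Calling steps_for(build_demo_db(), "urease") lists the two placeholder subunit steps, showing that a single enzymatic activity can occupy several rows of prop_step.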
A Genome Property is judged complete, and may be assigned the state ‘YES’, if every component listed as required is found. Nearly every Genome Property has two or more components, largely because the comparative genomics of subsystem reconstruction is often essential for creating and establishing trust in the protein family definitions that a Genome Property requires.
Property definitions include genes that are not necessarily essential
Some protein families occur exclusively as part of some subsystem, yet comparative genomics shows they are not always present, and not required for all instances of the subsystem to operate (although they may be required in some cases). Such a protein may be listed as a part of the system by entry into the prop_step table, but is marked as non-essential, meaning not core to the definition of the Genome Property and not used to compute its completeness. A protein that is absolutely required, but for which no reliable detection tool is available, similarly must be marked as non-essential to prevent the scoring system from calling the subsystem incomplete. Models may be unavailable for several reasons: an activity has not yet been matched to any sequence; a single known sequence example is not easily extended into a whole protein family definition; the role tends to be filled by members of different protein families in different species (as is common for phosphatases and aminotransferases); or the function is hard to assign from full-length homology rather than from select specificity-determining residues (as seems to occur with transporters).
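The effect of the non-essential flag on completeness can be sketched as follows. This is an illustrative reading of the rule described above, not the database's own code; the Step class, the function name and the component names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    name: str
    required: bool  # False for components marked non-essential


def is_complete(steps, found):
    """True if every required step is in `found`; non-essential steps are ignored."""
    required = [s.name for s in steps if s.required]
    # Assumption: a property with no required steps is never called complete.
    return bool(required) and all(name in found for name in required)


# Hypothetical property: two core components, one optional accessory protein,
# and one truly required protein for which no detection model exists.
steps = [
    Step("core_1", True),
    Step("core_2", True),
    Step("accessory", False),
    Step("undetectable", False),
]
```

With these steps, is_complete(steps, {"core_1", "core_2"}) is true even though neither non-essential component was detected, whereas a genome with only core_1 and the accessory protein is not complete.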
Genome Properties allows multiple lines of evidence for each step
Genome Properties defines types of evidence that will be treated as sufficient to show that a required component is encoded within a genome. In most cases, the evidence is a TIGRFAMs or Pfam HMM that scores some protein above the model’s cutoff. Trusted cutoffs are used rather than gathering thresholds (these differ only for Pfam). For a given canonical protein function (e.g. glutamate–cysteine ligase, EC 6.3.2.2), several different families may be known, and assignment to any one of these may constitute evidence. Thus, a separate linking table, step_ev_link, defines what evidence is sufficient to satisfy a step.
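The many-to-one relationship between evidence and steps can be sketched as a simple lookup. In this minimal example the model names (HMM_A and so on), the scores and the cutoff values are invented, and treating a score equal to the trusted cutoff as passing is an assumption about the boundary case.

```python
def passing_models(hits, trusted_cutoff):
    """Return the models with at least one hit scoring at or above the trusted cutoff.

    hits: iterable of (model_name, bit_score) pairs for one genome.
    trusted_cutoff: dict mapping model_name to its trusted cutoff score.
    """
    return {
        model
        for model, score in hits
        if score >= trusted_cutoff.get(model, float("inf"))
    }


def satisfied_steps(step_ev_link, models_found):
    """A step is satisfied if any one of its linked evidence models was found."""
    return {
        step
        for step, evidence in step_ev_link.items()
        if models_found & set(evidence)
    }


# Hypothetical links: two unrelated families can each satisfy the same step.
step_ev_link = {
    "glutamate-cysteine ligase": ["HMM_A", "HMM_B"],
    "second step": ["HMM_C"],
}
trusted_cutoff = {"HMM_A": 200.0, "HMM_B": 250.0, "HMM_C": 100.0}
hits = [("HMM_B", 310.0), ("HMM_C", 42.0)]
```

Here HMM_B alone satisfies the ligase step, while the sub-threshold hit to HMM_C leaves the second step unsatisfied.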
In some cases, Genome Properties requires that at least one member of some family of proteins be encoded in a genome, without implying that all members found from that family necessarily participate in the Genome Property in question. However, recording sets of such proteins during evaluations of Genome Properties across large numbers of genomes, and distinguishing those found in genomes encoding the other candidate markers from the rest, supports data mining approaches that may lead to the construction of new protein families. Genome Properties allows an evidence type designated HMM-CLUST, meaning that a protein must be a member of the designated protein family but also within 3000 base pairs of another marker of the same system. The HMM-CLUST method may identify, for example, members of the radical SAM domain family (PF04055 in Pfam) found in close proximity to a precursor gene for a post-translationally modified peptide. This co-clustering may mark a subfamily or equivalog group within radical SAM, which can then be ascribed a role in peptide modification. Such approaches let Genome Properties computation support both protein family development and discovery of new types of subsystems.
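A minimal sketch of an HMM-CLUST style proximity test follows. The 3000 base pair limit is taken from the text; the Gene record, the gene identifiers and the coordinates are hypothetical, and measuring distance as the gap between the nearest ends of two genes on the same contig is an assumption, since the text does not specify how distance is computed.

```python
from typing import NamedTuple


class Gene(NamedTuple):
    gene_id: str
    contig: str
    start: int  # leftmost coordinate
    end: int    # rightmost coordinate


MAX_GAP_BP = 3000


def gap_bp(a, b):
    """Gap in base pairs between two genes; 0 if they overlap or abut."""
    return max(0, max(a.start, b.start) - min(a.end, b.end))


def hmm_clust_hits(family_members, other_markers, max_gap=MAX_GAP_BP):
    """Keep family members lying within max_gap of another marker on the same contig."""
    return [
        g
        for g in family_members
        if any(
            m.gene_id != g.gene_id
            and m.contig == g.contig
            and gap_bp(g, m) <= max_gap
            for m in other_markers
        )
    ]


# Invented example: three radical SAM family members and one precursor gene.
radical_sam = [
    Gene("rsam1", "contig1", 1000, 2100),
    Gene("rsam2", "contig1", 50000, 51000),
    Gene("rsam3", "contig2", 1000, 2000),
]
precursor = [Gene("pre1", "contig1", 2300, 2400)]
```

Only rsam1 passes: it lies 200 bp from the precursor gene, whereas rsam2 is too distant and rsam3 is on a different contig.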
In the ideal case, a Genome Property has complete evidence, or no evidence, in nearly every genome. If most components are found, but not all, the YES-leaning state ‘some evidence’ may be assigned. But some proteins can play a role in any of several different properties. Finding such a marker, but no other, for a given property in question does not suggest the property is actually present. The enzyme selenide, water dikinase (SelD, TIGRFAMs entry TIGR00476), for example, is essential to at least three traits (18): selenocysteine incorporation (GenProp0016), the selenouridine tRNA modification (GenProp0692) and post-translational activation of selenium-dependent molybdenum hydroxylases (GenProp0726). For any one of these systems, finding only SelD is very weak evidence. Similarly, a Genome Property may rely in part on an HMM that was available (perhaps from Pfam) but that hits a number of homologs beyond the set that actually carry the function of interest. Absence of any member of that family from a genome would be informative, so it is useful to require a hit as an additional constraint for recognizing a subsystem to be present. But finding a member of that family as the only evidence would be very weak evidence that the entire subsystem is encoded. For these reasons, each Genome Property has a threshold value. If the number of components found does not exceed the threshold, evidence that the Genome Property on the whole is encoded by a genome should be considered weak, and the state ‘not supported’ will be assigned instead of ‘some evidence.’
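The threshold rule can be sketched as a small decision function. The three state names come from the text; the function name, the threshold value of 1 and all component names other than SelD are hypothetical and do not reproduce the actual definition of any Genome Property.

```python
def assign_state(required, found, threshold):
    """Assign one of the three states described in the text.

    required: names of the components required for completeness.
    found: names of the components detected in the genome.
    threshold: per-property minimum that the count of found components must exceed.
    """
    required = set(required)
    n_found = len(required & set(found))
    if required and n_found == len(required):
        return "YES"
    if n_found > threshold:
        return "some evidence"
    return "not supported"


# Hypothetical property that shares SelD with other selenium-related traits.
required = ["SelD", "other_1", "other_2", "other_3"]
THRESHOLD = 1  # assumed value: one shared marker alone must not count
```

With these assumptions, a genome in which only SelD is found is scored ‘not supported’, one with SelD plus two other markers is scored ‘some evidence’, and one with all four is scored ‘YES’.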
Genome Properties is hierarchical
A component required for a Genome Property to be complete may itself be a Genome Property. Urea utilization (GenProp0814), for example, consists of a urea uptake system and a urea degradation pathway. For urea degradation, either of two pathways is sufficient, urease (GenProp0051) or the urea carboxylase/allophanate hydrolase pathway (GenProp0481). Whichever of the two is the more complete is used to score the urea utilization property.
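Choosing between alternative child properties can be sketched as taking the most complete one. The two accessions are those named above; the component names are placeholders, and scoring completeness as the fraction of required components found is an assumption made for illustration.

```python
def fraction_found(required, found):
    """Fraction of the required components that were detected."""
    required = set(required)
    return len(required & set(found)) / len(required)


def best_alternative(alternatives, found):
    """Return (accession, score) for the most complete alternative.

    alternatives: dict mapping a child property accession to its required components.
    Ties resolve to the alternative listed first.
    """
    scores = {acc: fraction_found(req, found) for acc, req in alternatives.items()}
    best = max(scores, key=scores.get)
    return best, scores[best]


# Placeholder component names; not the actual step lists of these properties.
urea_degradation = {
    "GenProp0051": ["u1", "u2", "u3"],  # urease
    "GenProp0481": ["c1", "c2"],        # urea carboxylase/allophanate hydrolase
}
```

For a genome with u1, u2 and c1, the urease alternative (two of three components) outscores the carboxylase pathway (one of two), so it would supply the degradation component of urea utilization.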
Genome Properties and TIGRFAMs usually are constructed in concert
Where a set of proteins cooperate to form an enzymatic pathway, or some other type of subsystem with a fixed set of required components, each successfully completed protein family definition gives contextual clues that help identify trusted exemplars for the remaining protein families. Some protein families are straightforward to construct because essentially every detectable homolog appears to perform the identical function. The very clear boundaries of such families provide information that can guide construction of additional protein families. It appears, for example, that every detectable homolog of the first described PqqA, a peptide whose role is to serve as the precursor of pyrroloquinoline quinone (PQQ), likewise serves as a PQQ precursor peptide. In contrast to PqqA, the PQQ biosynthesis enzyme PqqE belongs to radical SAM, a family so abundant that one genome may encode over 30 members, each different in function. Correctly separating all true PqqE from their functionally distinct homologs would be difficult, except that the PqqA model (TIGR02107) ensured that the PqqE model (TIGR02109) would be constructed with no false positives among its seed members. The PqqE model, in turn, identifies all PQQ biosynthesis systems where the small (~23 amino acid) PqqA peptide was missed because of faulty gene calling.
The current release, designated 3.0, makes 628 Genome Properties available, a marked expansion over the ~200 Genome Properties released at the time of the last published database description. The current total includes new subsystem definitions plus previously unreleased ones likely to benefit from additional development.