One major objective of structural genomics efforts, including the NIH-funded Protein Structure Initiative (PSI), has been to increase the structural coverage of protein sequence space. Here, we present the target selection strategy used during the second phase of PSI (PSI-2). This strategy, jointly devised by the bioinformatics groups associated with the PSI-2 large-scale production centres, targets representatives from large, structurally uncharacterised protein domain families, and from structurally uncharacterised subfamilies in very large and diverse families with incomplete structural coverage. These very large families are extremely diverse both structurally and functionally, and are highly over-represented in known proteomes. On the basis of several metrics, we then discuss to what extent PSI-2, during its first three years, has increased the structural coverage of genomes, and contributed structural and functional novelty. Together, the results presented here suggest that PSI-2 is successfully meeting its objectives and provides useful insights into structural and functional space.
The multiple international genomics and metagenomics initiatives are providing us with sequences of hundreds of genomes and millions of genes. Analysis of this windfall is greatly aided by the fact that these millions of genes can be grouped into a much smaller number of gene families that, being related by evolution, share similarities in function and three-dimensional structure (Todd et al., 2001;Pegg et al., 2006;Reeves et al., 2006;Finn et al., 2008;Redfern et al., 2008). Gene family sizes seem to follow a power law with most families containing small numbers of members, and relatively few families being very large and diverse (Todd et al., 2001;Gerlt et al., 2001;Reeves et al., 2006;Marsden et al., 2007) (Figure 1).
Several explanations that rely on historical, functional or thermodynamic arguments have been proposed to rationalise the existence of these very large families (Goldstein, 2008). For example, it was suggested that certain types of functions in ancestral proteins made these more amenable to duplication and diversification (Ranea et al., 2006;Goldstein, 2008). Results from other analyses imply that particular structural folds are more likely to accommodate insertions and deletions, which in turn allow functional diversification during evolution (Reeves et al., 2006). Interestingly, whilst the total number of families keeps growing at an almost linear pace with new sequencing data, the number of such very large families remains essentially constant (Goldstein, 2008;Redfern et al., 2008). This phenomenon not only reflects the laws of statistics, but also seems to hint at the history of life on Earth, since several of these families contain very ancient genes that are present in organisms from all domains of life, often in multiple paralogues (Aravind et al., 2002;Goldstein, 2008). During their long evolution, genes from these families have had ample chances to diversify, both in structure and function. For example, analysis of bacterial genomes has shown that some of these ancient families have expanded linearly with genome size, and that the occurrence of multiple paralogues has allowed diversification of functions, increasing the functional repertoire of the organisms (Ranea et al., 2006). It is also worth noting that the largest domain families are often involved in essential functions (Shakhnovich et al., 2006), making them potentially interesting targets for better understanding disease-related processes, for instance.
Because of the observed modularity of proteins and the fact that many proteins, especially in eukaryotes, consist of multiple domains that can combine differently in other proteins, it is generally convenient to consider domains as the fundamental units of protein evolution (Ponting et al., 2002;Moore et al., 2008). Analyses of completed genomes have characterized the extent to which domains are duplicated and fused in different domain contexts. Whilst fewer than ten percent of the protein families in an organism are common to all kingdoms of life, over half the domain sequences in an organism are likely to belong to fewer than 200 families universal to all kingdoms of life (Lee et al., 2005;Ranea et al., 2006), appearing in diverse multi-domain contexts. A number of domain family resources, e.g. Pfam (Finn et al., 2008), CATH (Greene et al., 2007) and SCOP (Murzin et al., 1995), have emerged to capture evolutionary relationships between domains, enabling studies on the evolution of different functional roles in diverse relatives. Various large-scale efforts, among them structural genomics, are directed at attaining some level of description of all the known domain families.
Structural genomics initiatives that have been set up world-wide have undoubtedly started modifying the way protein three-dimensional structures are used to address issues in several disciplines, including enzymology (Gerlt, 2007), protein folding (Fersht, 2008) and protein function prediction (Watson et al., 2007;Allali-Hassani et al., 2007). One major historical focus of many structural genomics efforts has been to increase structural coverage of known protein space, by selecting targets from novel, structurally uncharacterised protein families (Sali, 1998;Chandonia et al., 2006;Liu et al., 2007). More recently, it has been argued that structural coverage of protein space could only be completed by concomitantly selecting targets from very large and diverse superfamilies, which often display extreme structural and functional diversity (Todd et al., 2001;Pegg et al., 2006;Reeves et al., 2006;Marsden et al., 2007). In addition, it can be expected that a more comprehensive sampling of structures from such very large superfamilies would help us to understand better the determinants and the extent of their functional diversity (Reeves et al., 2006;Redfern et al., 2008). Accordingly, as part of the Protein Structure Initiative (PSI) funded by the NIH, structural genomics production centres have committed a significant part of their resources to solving structures of proteins from such diverse families (Norvell et al., 2007).
Structural data will not only help to rationalise the mechanism of functional divergence in these extremely large families but may also explain why some families are more highly recurrent in particular organisms or environmental contexts. Recently, the structural genomics target selection strategy was extended to target protein families that were shown to be over-represented in uncultured bacteria present in specific environments, such as the human distal gut, as identified by metagenomics studies.
In this article, we present the target selection strategy that is being followed by the four large-scale production centres of the NIH Protein Structure Initiative (i.e. JCSG (www.jcsg.org), MCSG (www.mcsg.anl.gov), NESG (www.nesg.org) and NYSGXRC (www.nysgrc.org)). This strategy has been aimed at two major objectives, namely (a) to provide structures from protein families representing significant proportions of the genome sequences, and (b) to study the structural basis of functional diversity in the most diverse and highly populated families. We specifically address the benefits of increasing our sampling of structure space in these very large families.
The first phase of the Protein Structure Initiative (PSI-1), which started in September 2000 and ended in June 2005, did not specify the exact meaning of structural coverage of protein space and generally targeted ‘novel’ proteins showing no close relationship to any proteins of known structure. A simplistic general threshold of 30% sequence identity was widely adopted as a definition of novel targets (Vitkup et al., 2001). This threshold was selected based on evidence from CASP quality assessments (Moult, 2005), which suggested that 30% sequence identity was a reasonable cut-off for building homology models. The underlying idea was that each novel structure solved could in turn be exploited to provide approximate models of all close homologues (Sali, 1998). Centres participating in the PSI focused on targets from specific organisms, metabolic pathways or other medically relevant topics. For example, the JCSG focused on targets from Thermotoga maritima, the NESG on targets from human and other eukaryotes, the MCSG on various pathogenic organisms, and the NYSGXRC on proteins involved in metabolic pathways and cancer. These four large-scale centres have continued to participate in PSI-2 since July 2005.
PSI-1 succeeded in establishing a new model of structure determination whereby very large numbers of structures are solved by an automated high-throughput experimental pipeline, providing unparalleled productivity and cost savings. The four large-scale centres now involved in PSI-2 solved over 800 protein structures during the first phase of PSI, far more than conventional structural biology labs could have solved alone with a comparable amount of funding. Thereby, PSI-1 achieved one of its goals, namely to reduce significantly the cost of solving protein structures. Several reports in the literature have detailed the success of PSI-1 according to different criteria (Todd et al., 2005;Chandonia et al., 2006;Watson et al., 2007). All these publications suggest that PSI-1 was successful in significantly increasing the proportion of novel distinct protein structures deposited in the PDB (Berman et al., 2000), as well as the proportion of novel structural superfamilies and novel fold groups. The analysis by Todd et al. (2005) showed encouraging increases in numbers of structures solved in these different categories over the first 3 years. More recent analyses by Chandonia and Brenner (2006) confirmed these observations over 5 years (see Table 1).
However, it was clear from both analyses that there were still considerable levels of redundancy between the PSI large-scale centres and also between the centres and the general structural biology community, despite the adoption of a centralized target tracking system (TargetDB, http://targetdb.pdb.org/) for publicizing information on selected targets and progress with these targets. In many cases this was due to similar targets having advanced too far in different pipelines by the time conflicts surfaced. In some situations, targets were not stopped because they involved a relative from a different species or with a different ligand bound that could potentially provide useful biological insights.
In order to reduce the redundancy in structures targeted and solved by the four PSI large-scale centres, the target selection strategy from PSI-1 was reviewed at the start of PSI-2, and a new joint initiative was started involving four BioInformatics Groups (jointly referred to as the BIG4), each being associated with one of the four large-scale centres. The aim was to improve the productivity of PSI by reducing the overlap among centres as far as possible and to coordinate efforts of all the centres towards the main goal of PSI.
All four large-scale PSI centres split their efforts among three major target lists, spending about 70% of their efforts on a centralized list targeting structural novelty, 15% on community-nominated targets and 15% on bio-medically important targets. Here we present the strategies developed by the BIG4 in PSI-2 for assembling the centralized list targeting structural novelty. This list focuses on uncharacterised domain families, as well as on diverse relatives in very highly populated domain families of known structure which are predicted to be structurally and functionally dissimilar to previously determined structures. We also present our initial analyses of the structures deposited in the PDB by PSI-2 large-scale centres during the first three years of PSI-2, and examine the degree to which PSI-2 has been successful in increasing the proportion of distinct (less than 98% sequence identity to any structure pre-existing in the PDB; see Methods) and structurally novel structures solved since the beginning of the initiative. Even though it was not an explicit goal of PSI, we also assess the success of PSI-2 in contributing structural information for functional families, and thereby the degree to which PSI-2 has illuminated both structure and function space, by identifying the functional categories within the Gene Ontology classification (The Gene Ontology Consortium, 2000) for which PSI-2 has solved the first structure.
A primary aim in PSI-2 has been to increase the proportion of domain families for which one or more structures have been characterized, by a coarse-grained sampling of sequence space. One major challenge for the target selection strategy was therefore to construct a list of domain families with no representative structure. Domain families with at least 10 relatives were targeted more specifically so as to maximize the potential impact of PSI-2 structures via homology modelling. Here, these domain families are referred to as structurally uncharacterised large families, and were also referred to internally as BIG families. Even though a family with 10 relatives may arguably not qualify as large, this threshold was chosen as a result of a compromise between selecting families having a significant size and not restricting the final list to too few families given the other constraints we had in the selection procedure (e.g. no features that might affect structure determination – see below).
Some domain families are very large and very diverse both in terms of structure and function. The largest 200 CATH (Greene et al., 2007) families in Gene3D (Yeats et al., 2008) account for at least 50% of domain sequences in the genomes and yet, as Figure 2 shows, for all but two of these very large CATH superfamilies, less than 10% of the sequence subfamilies (so-called modelling subfamilies; see Methods) within them have structural representatives in the PDB (Berman et al., 2000). Previous analyses of some of these very large families reveal that some proteins can be up to five times larger than other members of the same family, sometimes to the point of actually adopting a different fold (Reeves et al., 2006). Such structural divergence is often clearly correlated with divergence in function (see Figure 3). Our structural sampling of most of these very large and diverse families is very incomplete. Here, these domain families are referred to as very large and diverse families with incomplete structural coverage, and were also referred to internally as MEGA families. Another aim of PSI-2 has therefore been to target additional relatives in these MEGA families, with the expectation that this will give us deeper insights into the nature of structural divergence within a family, and into how structural changes between related domains bring about changes in function. This, in turn, should trigger improvements in algorithms that attempt to predict functions from structures. Finally, such a fine-grained sampling of subfamilies within diverse families is required to fully characterize the structural repertoire in nature.
In recent years, metagenomics experiments have revealed the extent of previously uncovered parts of the protein universe, which are found in complex communities of uncultured microbes from various environments (e.g. ocean, soil, human skin or gastrointestinal tract). On that account, PSI-2 centres also started a pilot project in which the above-mentioned target selection strategies were applied to include domain families that are over-represented in one of the most studied environments, namely the human distal gut microbiome. Sequence information from metagenomics can illuminate important functional roles being carried out by the bacterial communities found in specific habitats (Riesenfeld et al., 2004). For example, many bacterial proteins in the human gut are essential for breaking down complex food substrates and synthesizing vital nutrients such as vitamins. Understanding how these communities function and what populations are most beneficial to the human host is likely to be important for understanding and promoting human health and diagnosing conditions likely to lead to disease. In practice, as will be shown hereafter, these Gut Metagenome Families constitute a subset of the targeted large families (BIG and MEGA) mentioned in the above paragraphs.
Bioinformatics groups (BIG4) from all four PSI centres collaborated in developing a consensus strategy for target selection. Defining domain families is a complex issue, and a number of curated domain family resources such as Pfam (Finn et al., 2008) and TIGRFAMs (Haft et al., 2003) are now publicly available, which can facilitate research in this field. In order to benefit from these existing domain family resources but also from more optimal strategies for target selection, we applied a mixed protocol to identify suitable sequence families for coarse-grained targets. A primary list of large structurally uncharacterised families was constructed using Pfam, which is one of the most comprehensive manually curated resources. Exclusion of families with fewer than 10 relatives (see Methods) or with features that might affect success in structure determination (e.g. trans-membrane regions) resulted in a total of 1369 target Pfam families, corresponding to approximately 20% of sequences in Pfam families without structural representatives.
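The exclusion step described in this paragraph can be sketched as a simple filter. This is a minimal illustration only: the `Family` record, its field names and the example entries are hypothetical, and the actual BIG4 pipeline (see Methods) is not reproduced here. Only the threshold of 10 relatives and the exclusion of trans-membrane families come from the text.

```python
from dataclasses import dataclass

MIN_RELATIVES = 10  # size threshold stated in the text


@dataclass(frozen=True)
class Family:
    """Hypothetical record for a domain family; all field names are assumptions."""
    name: str
    n_relatives: int
    has_structure: bool      # True if any member already has a solved structure
    has_transmembrane: bool  # stand-in for features that hamper structure determination


def select_big_candidates(families):
    """Keep structurally uncharacterised families with at least MIN_RELATIVES
    members and no feature flagged as problematic."""
    return [
        f for f in families
        if not f.has_structure
        and f.n_relatives >= MIN_RELATIVES
        and not f.has_transmembrane
    ]


# Invented example entries, for illustration only.
example = [
    Family("famA", 250, False, False),  # kept
    Family("famB", 7, False, False),    # too small
    Family("famC", 900, True, False),   # already has a structure
    Family("famD", 40, False, True),    # predicted trans-membrane regions
]
selected_names = [f.name for f in select_big_candidates(example)]
```

In a real setting the boolean flags would be derived from structure annotations and from sequence-based predictors; here they are simply given as inputs.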
However, several problematic features of Pfam were identified, which originate from the fact that the aims of PSI efforts and the rules guiding Pfam classifications are similar but not identical. For example, the sequences clustered into a Pfam family sometimes represent a multi-domain family rather than a single domain family. The reverse problem arises when proteins have been chopped into partial domains that are never found separately; such fragments may not constitute a proper domain and therefore cannot be solved experimentally. Since consensus approaches have historically been shown to be highly successful in bioinformatics, we attempted to solve these problems by using a collaborative approach involving several orthogonal methodologies for domain family definition, which would allow us to look for consensus families, i.e. families that were found by more than a single source. Therefore, the target list of Pfam families was supplemented by families identified using various automated protocols developed by the BIG4 groups.
A combined target list of families identified by Pfam and by the BIG4 protocols was generated, and families found by more than one source were labelled (see Methods). Each centre used their own criteria to prioritise those families that they wished to target for structure determination, for example depending on their reagent genomes, and the families were then divided amongst the four large centres using a random pick procedure. This random pick assignment was iterated four times and, in total, 2357 families were distributed to the centres (see Table 2).
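One way to picture the allocation of families to centres is a round-based pick in which the order of the centres is randomised in every round. The sketch below is an assumption about how such a procedure could work, not the protocol actually used by the centres; the function name `draft_pick`, the preference lists and the family identifiers are all hypothetical.

```python
import random


def draft_pick(preferences, seed=0):
    """Assign each family to exactly one centre.

    preferences: dict mapping centre name -> list of family ids, most wanted first.
    In every round the order of the centres is shuffled, then each centre takes
    its highest-ranked family that is still unassigned.
    """
    rng = random.Random(seed)
    centres = list(preferences)
    assigned = {}                     # family id -> centre
    cursor = {c: 0 for c in centres}  # current position in each centre's list
    progressed = True
    while progressed:
        progressed = False
        order = centres[:]
        rng.shuffle(order)
        for c in order:
            prefs = preferences[c]
            i = cursor[c]
            # Skip families already taken by another centre.
            while i < len(prefs) and prefs[i] in assigned:
                i += 1
            if i < len(prefs):
                assigned[prefs[i]] = c
                i += 1
                progressed = True
            cursor[c] = i
    return assigned


# Invented preference lists, for illustration only.
prefs = {
    "JCSG": ["F1", "F2", "F3", "F4"],
    "MCSG": ["F1", "F3", "F5"],
    "NESG": ["F2", "F1", "F6"],
    "NYSGXRC": ["F4", "F2", "F7"],
}
allocation = draft_pick(prefs, seed=1)
```

Because a family is removed from contention as soon as one centre picks it, no family is targeted twice, which is the property the joint strategy was designed to achieve.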
The Gene3D database (Yeats et al., 2008) was exploited to identify the most highly populated domain families with known structures in the genomes. This resource comprises more than five million protein sequences, including sequences from 520 completed genomes and the UniProt (UniProt Consortium, 2008) and RefSeq (Pruitt et al., 2007) databases. Putative domains are identified by scanning sequences against Hidden Markov Models (HMMs) derived from the CATH and Pfam domain databases, using conservative thresholds that have been carefully benchmarked with structural data. As of August 2008, approximately 37% of residues in protein sequences from Gene3D can be assigned to families of known structure in CATH, with a further 48% that can be assigned to Pfam families. Furthermore, approximately 55% of protein sequences in Gene3D contain at least one domain that can be assigned to a family in CATH.
Figure 4 shows that the largest 200 domain families contain more than 290,000 modelling subfamilies. PSI-2 is unlikely to solve this number of structures over the next few years, and rational approaches are clearly needed to select representatives that are structurally and functionally diverse. Therefore, a large part of the first year of PSI-2 (June 2005 to June 2006) was dedicated to designing a robust target selection strategy and to developing the clustering and analysis tools needed to improve the rational selection of targets within the very large and diverse families selected. For example, in the NESG and NYSGXRC, research into improved methods of aligning sequences and deriving homology models led to revised thresholds for clustering sequences into modelling subfamilies on the basis of predicted structural similarity.
Similarly, the MCSG consortium developed the GEMMA approach (Lee et al., submitted), which exploits HMM-HMM strategies to progressively merge subfamilies of functionally related domains to enable selection of functionally diverse representatives. For some superfamilies this approach can reduce the number of predicted functionally diverse subfamilies to target by more than ten-fold, making it more feasible to achieve structural coverage of these diverse subfamilies. Additional constraints that operate when selecting representatives are the reagent genomes available to the centre for cloning, which restrict the choice of homologues for structure determination. A measure of success of this target selection strategy will be the degree of structural and functional novelty observed in the structures that are deposited in the PDB by the four centres during this second phase of PSI. This is reviewed below for the first three years of PSI-2.
For each of the most highly populated families in CATH that have been allocated to the PSI-2 large-scale centres, Table 3 gives the number of relatives identified in Gene3D, the number of different functional terms from the Gene Ontology (GO) (The Gene Ontology Consortium, 2000), the number of different modelling subfamilies it contains (where sequences are clustered into modelling subfamilies using a 30% sequence identity threshold), and the percentage of modelling subfamilies for which there is a solved structure. Table 3 also shows to which of the large-scale centres each family was allocated, as well as the date of allocation.
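The last two quantities listed for each family (number of modelling subfamilies, and percentage of them with a solved structure) can be computed as in the following sketch. It assumes that the clustering into modelling subfamilies has already been done; the function name, the input layout and the example identifiers are hypothetical.

```python
from collections import defaultdict


def subfamily_coverage(subfamily_of, solved):
    """Return (number of modelling subfamilies, percentage with >= 1 solved structure).

    subfamily_of: dict mapping sequence id -> modelling subfamily id
    solved: set of sequence ids that have a structure in the PDB
    """
    members = defaultdict(set)
    for seq_id, sub_id in subfamily_of.items():
        members[sub_id].add(seq_id)
    if not members:
        return 0, 0.0
    # A subfamily counts as covered if any of its sequences has a structure.
    covered = sum(1 for seqs in members.values() if seqs & solved)
    return len(members), 100.0 * covered / len(members)


# Invented example: five sequences in four subfamilies, one solved structure.
toy_subfamilies = {"s1": "A", "s2": "A", "s3": "B", "s4": "C", "s5": "D"}
toy_solved = {"s2"}
n_subfamilies, pct_covered = subfamily_coverage(toy_subfamilies, toy_solved)
```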
The largest four of these very large families, also called SUPERMEGA superfamilies, were not allocated to any individual centre but instead, each centre prioritised individual modelling subfamilies within these superfamilies, largely on the basis of features which suited their experimental pipelines (e.g. presence of homologues in the reagent genomes used by the centre) and functional assignments (e.g. biologically interesting GO terms for which no structures were currently known). Modelling subfamilies from these four largest families were then assigned to each centre using the draft pick protocol. It is worth noting the disproportionately larger size of the superfamily of P-loop containing nucleotide triphosphate hydrolases (CATH code 3.40.50.300), as compared with the other very large superfamilies.
Two rounds of identification of protein families over-represented in the human gut microbiome were performed. For both rounds, protein sequences found in the gut microbiome were first grouped into homologous clusters (see Methods for further details). Comparing numbers of homologues from these clusters found in the gut microbiome and in other bacterial genomes allowed the identification of clusters that are significantly over-represented in the gut. The largest and most over-represented clusters were considered as potential targets. A subset of 1092 clusters from the first round and 136 clusters from the second round (defined by HMMs) were then selected as targets and equally divided amongst the four centres using the draft pick protocol. Many of these Gut Metagenome Families constitute a subset of the targeted large families (BIG and MEGA) mentioned in the above paragraphs; however, some represent novel families, specific to the human gut environment.
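The comparison of cluster counts between the gut microbiome and other bacterial genomes can be illustrated with a fold-enrichment ratio and a one-sided hypergeometric test. The statistical procedure actually used is described in Methods and is not reproduced here; the choice of test, the function names and the example counts below are all assumptions made for illustration.

```python
from math import comb


def fold_enrichment(gut_hits, gut_total, ref_hits, ref_total):
    """Ratio of the cluster's frequency in the gut set to its frequency
    in the reference set of other bacterial genomes."""
    if ref_hits == 0:
        return float("inf")
    return (gut_hits / gut_total) / (ref_hits / ref_total)


def hypergeom_upper_tail(gut_hits, gut_total, ref_hits, ref_total):
    """P(X >= gut_hits) when gut_total sequences are drawn without replacement
    from the pooled set, of which (gut_hits + ref_hits) belong to the cluster."""
    pooled = gut_total + ref_total
    members = gut_hits + ref_hits
    upper = min(members, gut_total)
    numerator = sum(
        comb(members, k) * comb(pooled - members, gut_total - k)
        for k in range(gut_hits, upper + 1)
    )
    return numerator / comb(pooled, gut_total)


# Invented counts: 8 of 20 gut sequences and 2 of 80 reference sequences
# fall into the cluster.
ratio = fold_enrichment(8, 20, 2, 80)
p_value = hypergeom_upper_tail(8, 20, 2, 80)
```

The exact summation is only practical for small counts such as these; with millions of sequences a numerical library or an approximation would be needed.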
It is possible to gauge the success of the structural genomics initiatives, in particular that of PSI-2, using a number of different measures. The total number of structures solved is an obvious preliminary indicator, but it must be considered with caution since it is not necessarily indicative of the actual impact and leverage of PSI-2, or of its success at meeting its objectives (Liu et al., 2007). One major aim of PSI-2 is to determine “novel” protein structures and in that context, not all newly solved structures have the same value. For example, alternative structures of a given protein with different ligands can be crucial for better understanding the mechanism of a particular protein, but do not help in terms of structural novelty.
We consider two measures to evaluate the success of PSI-2 at determining novel protein structures. First, we measure the extent to which these structures are affecting the structural coverage of known proteomes. Ultimately, this issue relies on the definition of modelling subfamilies and how the newly solved structures can be used to provide valuable structural information on related sequences. Secondly, we directly measure the structural novelty of PSI-2 structures by comparing them with previously released structures using a normalised RMSD score.
Another means of assessing the success of the structural genomics initiatives and the potential value of this data to biologists is to consider the number of diverse functions which have been characterized experimentally and captured in public resources such as GO but for which there are no structural relatives. Solving representative structures for proteins possessing these functions will help in revealing the molecular mechanisms by which these proteins function and expand our understanding of functional space as well as structural space. For this reason, we also consider the number of functional groups that were previously uncharacterised structurally and for which PSI-2 has provided a first structural representative.
Analyses were performed on all structures deposited in the PDB (Berman et al., 2000) by the four PSI-2 large-scale production centres from July 1st 2005 to July 1st 2008. Some of these analyses were conducted in collaboration with the PSI Structural Genomics Knowledgebase established at Rutgers University (http://kb.psi-structuralgenomics.org/KB/) (Berman et al., 2009). A total of 1600 structures were solved by the 4 centres in the first three years of PSI-2 and they amounted to 1502 distinct chains (~94%). This compares with a ratio of 61% of distinct chains to PDB entries (9597/15629) for the entire PDB (excluding PSI structures) over the same period of time. Of the 1502 distinct structures solved, 460 (~30%) were from BIG families of which 355 (~24%) were from Pfam families. During the first three years of PSI-2, 288 Pfam families had their first structure solved by PSI-2 large-scale centres, which is about 38% of all Pfam families (total of 748) for which a first structure was deposited in the PDB during the same period of time.
Previous analyses of structural coverage of known proteomes suggest that up to 30–40% of protein residues, and ~50% of domain sequences, can currently be assigned a structure by modelling (Liu et al., 2007;Marsden et al., 2007). This proportion varies with the sequence database used, and the prediction methods used to assign sequences to structural families (e.g. PSI-BLAST (Altschul et al., 1997), HMMs, profile-profile comparisons, threading etc.). A non-negligible proportion of the remaining domain sequences in these proteomes belong to families that are problematic for high throughput structure determination, because they are membrane associated, intrinsically disordered or have regions of low complexity, and are thus more appropriate targets for the specialized centres of PSI (Norvell et al., 2007). A significant proportion of the remaining structurally uncharacterised and non-problematic sequences were targeted by the expanded BIG list (2298 families). The remaining targets chosen by the four centres came from 48 MEGA families and 136 META families. In total, 193249 targets have been selected over the three years since the start of PSI-2.
Figure 5 shows the increase in structural coverage of fractions of proteins and residues from UniProt, obtained by solving structures since the start of PSI-2, and compares the contributions of structures from the entire PDB, PSI-2 only and PSI-2 large-scale centres only (see Methods). Altogether, the fraction of UniProt proteins (residues) that can be structurally modelled is now reaching 48% (42%). This represents an increase of about 10% (6%) over the past three years, with a contribution of more than 2% (1.3%) from PSI-2 structures. In terms of increase in structural coverage, the contribution of PSI-2 is practically entirely due to structures solved by the four large-scale centres. About 23% of the increase in structural coverage of proteins in UniProt (UniProt Consortium, 2008) is due to structures from large-scale centres. The contribution of these structures is about 19% when defining structural coverage at the residue level. These contributions are rather encouraging given that structures from PSI-2 large scale centres only account for around 13% of the distinct structures released in the PDB since July 1st 2005. This result is somewhat expected, particularly because targets have been specifically selected for the coarse-grained sampling of sequence space (i.e. BIG families) rather than to optimally increase modelling coverage. Since the data presented here only considers structures released within the first three years of PSI-2, the proportion of novel structural coverage due to PSI-2 may increase in the final 2 years as PSI-2 large-scale centres reach optimal productivity.
When considering specific proteomes, the contribution of the PSI-2 large-scale centres to the increase in structural coverage greatly depends on the type of organism. For example, there were 7049 novel human proteins (~10% of the total number of human sequences in UniProt 12.8, i.e. 72034 protein sequences) for which a structure could be modelled thanks to structures deposited in the PDB between July 1st 2005 and July 1st 2008, but only 231 of these (i.e. 3.3% of the structural coverage increase, and 0.3% of the human proteome) were due to structures solved by the four large-scale centres (for residues, the fraction of the total increase in structural coverage that is due to large-scale centres is also 3.3%). In contrast, the contribution of the large-scale centres to novel structural coverage amounts to 37% for Escherichia coli over the same period of time, i.e. 206 out of 560 proteins for which a structure can now be modelled (respectively 5% and 13% of the total number of Escherichia coli sequences, i.e. 4381 protein sequences). For residues, the fraction of the total increase in structural coverage of Escherichia coli that is due to large-scale centres is 28.2%. These discrepancies between human and E. coli are somewhat expected given that large-scale centres have preferentially targeted prokaryotic proteins.
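The percentages quoted for the two proteomes follow directly from the protein counts given in this paragraph. The short sketch below recomputes them; the function and key names are illustrative, while the counts are those stated in the text.

```python
def coverage_contribution(newly_modelled, due_to_centres, proteome_size):
    """Return three percentages for one proteome:
    - proteome_gain: newly modellable proteins as a share of the proteome
    - centre_share_of_gain: share of that gain due to the large-scale centres
    - centre_gain: proteins gained through the centres as a share of the proteome
    """
    return {
        "proteome_gain": 100.0 * newly_modelled / proteome_size,
        "centre_share_of_gain": 100.0 * due_to_centres / newly_modelled,
        "centre_gain": 100.0 * due_to_centres / proteome_size,
    }


# Counts taken from the text (UniProt 12.8, July 2005 to July 2008).
human = coverage_contribution(newly_modelled=7049, due_to_centres=231, proteome_size=72034)
e_coli = coverage_contribution(newly_modelled=560, due_to_centres=206, proteome_size=4381)
```

Rounded, these values reproduce the figures in the text: about 10%, 3.3% and 0.3% for human, and about 13%, 37% and 5% for Escherichia coli.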
Whether a new structure is deemed structurally novel depends on the criteria used to recognize structural similarity. Recent analyses of homologous domains in the CATH database revealed a mean value of 5Å for the normalised RMSD following superposition of homologous domains to be an appropriate cut-off for defining structural similarity (see Methods for definition of the normalised RMSD) (Cuff et al., submitted). Relatives superposing with higher normalised RMSD values have been observed to be structurally divergent often due to significant structural embellishments to the cores of the structural domains (Reeves et al., 2006). Therefore a normalised RMSD cut-off of 5Å was applied to determine whether structures solved by PSI-2 and traditional structural biology were significantly structurally different from those previously deposited in the PDB (Berman et al., 2000). Since improved structural alignments can sometimes be obtained by aligning the constituent domains rather than complete multi-domain structures, all the structures were scanned against the CATH non-redundant domain library (CATH version 3.2).
Figure 6 shows that 28% of the domain structures solved by PSI-2 large-scale centres are structurally novel when using these criteria. This compares with 3% of domains solved by non-Structural Genomics structural biology worldwide. These results cover domain structures solved by the PSI-2 large-scale centres, whether or not the targets were selected as part of BIG families. Of the 365 distinct domain structures (less than 98% sequence identity) from BIG families that have been solved and classified in CATH, 155 (42%) were found to be structurally novel according to the normalised RMSD cut-off of 5Å. Encouragingly, a significant proportion of structures from MEGA families were also found to be structurally novel (15%), as computed over the 282 distinct domains from MEGA families solved by PSI-2 large-scale centres. This suggests that the strategies described above for selecting structurally diverse representatives from these families are performing well.
We also evaluated structural novelty by counting the number of structures that were the first representative of their superfamily or fold in CATH. Of the 859 distinct domain structures solved by PSI-2 large-scale centres that are classified in CATH, 102 represent novel CATH superfamilies, and 28 represent novel CATH folds. Unfortunately, equivalent numbers for non-structural genomics structural biology since June 2005 cannot be readily computed for comparison, because a specific effort to classify PSI-2 structures was made by curators for the most recent release of the CATH database (CATH v3.2). In addition, of the 365 distinct domain structures from BIG families, 75 (21%) were found to represent novel CATH superfamilies (including 21 that represented novel folds), whereas 290 (79%) were found to belong to previously existing CATH domain families, among which 116 (32%) were assigned to MEGA superfamilies. These BIG families are therefore clearly diverse subfamilies of the CATH families that were no longer recognizable by sequence-based protocols but that showed clear structural similarity to relatives from previously known CATH superfamilies.
To assess how well structural genomics contributes towards the aim of increasing the number of functional groups with a representative structure, we counted the functional categories in the Gene Ontology (GO) for which PSI-2 or structural biology solved the first representative structure. Of the 1502 distinct structures solved by PSI-2 large-scale centres, 51% could be mapped to a functional category in the GO database (molecular function ontology). This contrasted with 81% of structures solved by non-structural genomics structural biology worldwide. Similar ratios were obtained when considering the GO biological process ontology. Thus a significant proportion of PSI-2 structures have been functionally annotated, suggesting a non-negligible leverage of structural information from PSI-2 in terms of functional data. More importantly, 2.2% of distinct structures (i.e. 33 structures) solved by PSI-2 large-scale centres represented the first structure solved for one of their GO terms, including 27 structures for molecular function terms and 12 for biological process terms, with 7 structures being the first representative for one term in both categories. These GO terms, which consist mostly of enzymatic functions, are listed together with their representative PSI-2 structure in Tables 4 and 5 for molecular function and biological process, respectively. For comparison, 6% of distinct structures (i.e. 374 out of 6080 structures) solved by non-Structural Genomics projects and released in the PDB between July 1st 2005 and July 1st 2008 represented the first structure solved for one of their GO terms. Thus, the proportion of structures being first representatives of a given function is of the same order of magnitude for structures from PSI-2 large-scale centres and those from standard structural biology. This is encouraging given that targeting novel functions was not an explicit aim for PSI-2.
Coverage of META families was initiated in year 3 of PSI-2, and insufficient data are available at this point to fully evaluate this target selection strategy. PSI centres solved a significant number of novel proteins from human gut microbes, including over 25 proteins involved in carbohydrate metabolism and first representatives of over 10 novel protein families first found in the human gut. These preliminary results highlight two dominant mechanisms of adaptation of microbes to the specific challenges of the gut environment, namely expansion and functional diversification of known protein families, and evolution of new specialized families (Ellrott et al., submitted).
The Protein Structure Initiative (PSI) is now more than halfway through its second phase. An important stated aim of this effort has been to make structural information available for a large proportion of genome sequences. To achieve this, a strategy has been set up to select structural genomics targets in protein domain families of substantial size for which no structural information was yet available. These families have been referred to as BIG families. This target selection strategy, which is extensively presented here, has been made possible by the joint efforts of several bioinformatics groups associated with PSI-2. Early in the second phase of PSI, analyses made it clear that a large fraction of the BIG families that were targeted turned out to be remote homologues of previously known structural families. Genomic analyses also suggest that a significant proportion of genome sequences belong to a few universal families that are highly structurally and functionally divergent. It is clear that structural genomics can make a major contribution to biology by revealing the manner in which these families diverge structurally and how this mediates changes in molecular function, biological role and interaction partners. Therefore another important aim of PSI-2 has been to increase the number of representative structures from these families (referred to as MEGA families) in a way that reveals more comprehensively their considerable diversity and that contributes new structural information for the relatives within the superfamilies that clearly have different functional roles.
The results presented here suggest that during its first three years, PSI-2 has been successful at meeting several of its stated aims, by contributing significant numbers of structural representatives of novel structures and functions, and by contributing substantially to a general increase in the number of genome sequences that can be modelled structurally. We hope that this analysis, together with previous reports on the success of structural genomics (Todd et al., 2005;Chandonia et al., 2006;Watson et al., 2007) and more specific analyses (Todd et al., 2005;Watson et al., 2007;Allali-Hassani et al., 2007), will shed light on the capacity of the Protein Structure Initiative and other similar efforts world-wide to contribute valuable data for facing the new challenges in understanding biology at the molecular and cellular levels (Gerlt, 2007;Blundell, 2007).
At the start of PSI-2, the PSI committee issued a statement publicizing the fact that PSI-2 would aim to ‘increase the number of large families for which a structure was known’. This can be described as coarse-grained coverage of protein structural space. However, it was also recognized that for some large and highly diverged families a single representative would not provide sufficient structural insight for the entire family and that in such cases, structures should be solved for several representatives. This process would be described as fine-grained coverage. Although these definitions appear intuitively obvious, practical use of the guidelines was initially hampered by the lack of universally accepted definitions. For instance, the term “protein family” is used by different authors to designate groups of proteins that share differing levels of similarity, so that coarse-grained coverage according to one author could correspond to fine-grained coverage according to another.
Various groups working on protein families and domain definitions (e.g. Pfam (Finn et al., 2008), TIGRFAMs (Haft et al., 2003)) have used different strategies and protocols to construct databases of domains and families. However, the BIG4 felt that none of the existing resources fully incorporated structural information into their family and domain definitions. Furthermore, in determining a sensible strategy for target selection for structural genomics, various experimental issues have an important bearing on the choice of a suitable approach. For example, whilst it may seem tempting to opt for a particular organism of biological significance such as yeast or human, there may be significant experimental difficulties in expressing proteins from that organism, or in restricting the selection strategy to a few organisms. In order to coordinate target selection, the BIG4 came up with the following working definitions of families:
A modelling subfamily (MS) is a group of closely related sequences in which any two targets share a “minimal similarity”. Modelling subfamilies were constructed by multi-linkage clustering, using a clustering threshold of 30% pair-wise sequence identity between any two members of the subfamily. This threshold was chosen because it ensures that once a single structure has been solved for the MS, there is a reasonable probability that homology models can be built for all other relatives with good accuracy. We anticipate that the precise definition of a modelling subfamily will probably change as modelling algorithms evolve and improve.
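The clustering rule described above (every pair of members in a subfamily shares at least 30% sequence identity) can be illustrated with a small greedy complete-linkage procedure. This is a minimal sketch under stated assumptions: it reads "multi-linkage clustering" as complete linkage, takes pre-computed pair-wise identities as input, and uses hypothetical names (`modelling_subfamilies`, `identity`); it is not the protocol actually used by the BIG4 groups.

```python
from itertools import combinations

def modelling_subfamilies(ids, identity, threshold=30.0):
    """Greedy agglomerative clustering with complete linkage.

    ids: list of sequence identifiers.
    identity: dict mapping (id_a, id_b) -> percent sequence identity;
              missing pairs are treated as 0% (an assumption of this sketch).
    Two clusters are merged only if every cross-cluster pair of sequences
    shares at least `threshold` percent identity.
    """
    def ident(a, b):
        return identity.get((a, b), identity.get((b, a), 0.0))

    clusters = [[i] for i in ids]
    while True:
        best = None
        for x, y in combinations(range(len(clusters)), 2):
            # Complete linkage: the weakest cross-cluster pair defines the link.
            link = min(ident(a, b) for a in clusters[x] for b in clusters[y])
            if link >= threshold and (best is None or link > best[0]):
                best = (link, x, y)
        if best is None:
            return [sorted(c) for c in clusters]
        _, x, y = best  # x < y, so deleting y leaves index x valid
        clusters[x] += clusters[y]
        del clusters[y]

# Toy example: C is close to B (35%) but not to A (20%), so it stays separate.
pairs = {("A", "B"): 45.0, ("B", "C"): 35.0, ("A", "C"): 20.0, ("C", "D"): 10.0}
print(modelling_subfamilies(["A", "B", "C", "D"], pairs))
```

The toy example shows why the pair-wise requirement matters: a single-linkage scheme would chain A, B and C together even though A and C fall below the threshold.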
We refer to BIG families to describe groups of related proteins, with many relatives, identified using profile-based sequence similarity search strategies. Currently a minimum of 10 relatives is being employed to define a BIG family, though this may change in the future. We hypothesize that a BIG family could consist of multiple modelling subfamilies and that members of a BIG family may display non-negligible structural diversity. Since the primary focus of PSI-2 is to solve representatives of large families with unknown structures, BIG families were validated as targets by ensuring that they contained no relative with a known structure in the PDB. Standard bioinformatics approaches were used to eliminate families that could be problematic (as in PSI-1) (Marsden et al., 2008).
Some domain families are extremely large (some are ten-fold or more larger than the average BIG family) and we can anticipate extreme structural divergence within them (Marsden et al., 2007). Multiple targets from these families would be needed to get even approximate models for all structural variants. We refer to such families as MEGA families. In practice, MEGA families were defined as the 200 most populated homologous superfamilies in CATH (H-level). Taken together, these 200 MEGA families cover at least 50% of domain sequences in genomes. Most MEGA families already have representatives of known structure, but an important goal of PSI-2 is to fully characterize structural (and functional) variability in these families.
We use the term META family to refer to clusters of homologous sequences that are over-represented in metagenomic samples from a particular environment (microbiome). This term falls into a slightly different category than MEGA, BIG or modelling subfamily since it does not refer to the size of a family nor to the presence of already determined structures. PSI targets selected from META-families were usually chosen from fully sequenced microbial genomes, but metagenomic sequences were used to calculate their over-representation ratios and, thus, to identify META-families (see below).
The final target list of BIG families was defined by looking for a consensus between the families defined from Pfam and different protocols defined by the BIG4. Consensus mapping between the different BIG family resources was achieved as follows:
Each family was defined by a multiple alignment of the seed sequences. Relatives were then identified by profile-based scans of a non-redundant version of the UniProt database (UniProt Consortium, 2008). Two families were deemed to be equivalent if at least 70% of the sequences in the larger family could be matched to sequences in the smaller family, where sequences were identified as matching if they had the same UniProt ID and at least 60% of the residues in the larger sequence were equivalent to residues in the smaller sequence. Some manual inspection was undertaken to check the quality of these family assignments. Families identified by several approaches were eventually considered for assignment to PSI-2 large-scale centres.
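The equivalence criterion above can be expressed as a short check. The sketch below assumes that each family is stored as a mapping from UniProt ID to the residue range assigned to the family, and that "equivalent residues" can be approximated by the overlap of those ranges; both the data layout and the function names are hypothetical.

```python
def regions_match(r1, r2, min_overlap=0.60):
    """True if the overlap covers at least min_overlap of the longer region.

    Regions are inclusive (start, end) residue ranges on the same protein.
    """
    overlap = min(r1[1], r2[1]) - max(r1[0], r2[0]) + 1
    longer = max(r1[1] - r1[0], r2[1] - r2[0]) + 1
    return overlap / longer >= min_overlap

def families_equivalent(fam_a, fam_b, min_fraction=0.70):
    """Each family maps UniProt ID -> (start, end) of the assigned region."""
    larger, smaller = (fam_a, fam_b) if len(fam_a) >= len(fam_b) else (fam_b, fam_a)
    if not larger:
        return False
    matched = sum(
        1
        for uid, region in larger.items()
        if uid in smaller and regions_match(region, smaller[uid])
    )
    return matched / len(larger) >= min_fraction

# Toy example with made-up identifiers and coordinates.
pfam_like = {"P1": (1, 100), "P2": (10, 110), "P3": (5, 80), "P4": (1, 50)}
other = {"P1": (1, 95), "P2": (12, 110), "P3": (5, 80)}
print(families_equivalent(pfam_like, other))
```

Because the fraction is computed over the larger family, a small family that is fully contained in a much larger one is not reported as equivalent.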
The Godzik group performed two rounds of identification of protein families over-represented in the human gut microbiome, with the underlying aim of identifying protein families that are important for the human gut flora, unique to this environment, or significantly over-represented there. Modelling subfamilies were identified in the first round, and BIG families were identified in the second round.
1) identification of modelling subfamilies over-represented in the human gut microbiome: Modelling META-subfamilies were defined as sequence clusters seeded with proteins from four bacteria isolated from human gut flora: Eubacterium rectale, Bacteroides vulgatus, Bacteroides thetaiotaomicron, and Bacteroides fragilis (made available by the Jeff Gordon laboratory, http://gordonlab.wustl.edu/). For each seed sequence, BLASTP (Altschul et al., 1990) hits were collected from two sets of sequences:
The ratio between the number of hits of a seed sequence found in the “GUT” and in “ALL” was used as a measure of over-representation of a modelling META-subfamily defined by that seed sequence. The top 20% most over-represented modelling subfamilies were distributed between the four large-scale centres using a random pick mechanism which ensured that all close homologues of any given seed sequence were assigned to the same centre.
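The ranking and distribution steps described above can be sketched as follows. The hit counts are assumed to have been obtained beforehand (for instance from BLASTP output), and the function names, the handling of seeds with no hits in “ALL”, and the fixed random seed are assumptions of this illustration rather than details given in the text.

```python
import math
import random

def over_representation(gut_hits: int, all_hits: int) -> float:
    """Ratio of hits in the "GUT" set to hits in the "ALL" set."""
    # Treating zero hits in "ALL" as infinite over-representation is an assumption.
    return gut_hits / all_hits if all_hits else float("inf")

def top_fraction(hit_counts, fraction=0.20):
    """hit_counts maps seed ID -> (gut_hits, all_hits); return the top seeds."""
    ranked = sorted(
        hit_counts,
        key=lambda seed: over_representation(*hit_counts[seed]),
        reverse=True,
    )
    return ranked[: math.ceil(fraction * len(ranked))]

def assign_to_centres(homologue_groups, centres, seed=0):
    """Randomly assign each whole group of close homologues to one centre."""
    rng = random.Random(seed)
    assignment = {}
    for group in homologue_groups:
        centre = rng.choice(centres)  # one draw per group keeps homologues together
        for member in group:
            assignment[member] = centre
    return assignment

counts = {"s1": (50, 100), "s2": (5, 100), "s3": (90, 100), "s4": (10, 100), "s5": (1, 100)}
print(top_fraction(counts))
print(assign_to_centres([["s3", "s3_homologue"], ["s1"]], ["C1", "C2", "C3", "C4"]))
```

Drawing the centre once per group, rather than once per sequence, is what guarantees that close homologues of a seed end up at the same centre.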
2) identification of novel BIG-families from human gut microbiome: the aim in this round was to identify BIG-families with no functional annotation that were over-represented in the human gut microbiome. BIG families were defined by Hidden Markov Models (HMMs) (Eddy, 1996).
Available sequences of human gut metagenomic samples were first collected (from datasets published by the Hattori lab (Kurokawa et al., 2007), and from the above-mentioned US metagenomic samples). Functionally annotated sequences were filtered out from these sets by removing all sequences with significant BLASTP hits (e-value lower than 0.001) to annotated sequences in KEGG (Kanehisa et al., 2000). The remaining sequences were clustered using PDB-Blast (Li et al., 2002) and an e-value equal to 0.001 as the clustering cut-off. The resulting clusters were expanded by collecting non-redundant homologues of all cluster members. Homologues were obtained using PSI-BLAST searches against a database that consists of the NR database and metagenomic datasets clustered at 85% sequence identity using CD-HIT (Li et al., 2006a). Multiple sequence alignments of these homologues were then constructed with CLUSTALW (Thompson et al., 1994), and were used to build HMMs (Eddy, 1996) using HMMBUILD. These HMMs, which represented BIG-families, were used to collect hits from two sets of sequences using HMMPFAM (both programs available from http://hmmer.janelia.org/):
The ratio between the number of hits in “GUT” and in “NHR” was used to define BIG families that were over-represented in the human gut microbiome. The most over-represented families were then distributed between the large-scale centres.
Lists of PSI-2 structures used for computing structural coverage of genomes were obtained directly from the large-scale centres, considering only structures deposited in the PDB (Berman et al., 2000) between July 1st 2005 and July 1st 2008. Corresponding lists of non-PSI structures were downloaded from the PDB website using identical date restrictions. Distinct structures have been defined as lists of structures sharing less than 98% pair-wise sequence identity, and were obtained by running CD-HIT with that cut-off and considering single representatives from all resulting CD-HIT clusters (Li et al., 2006b).
Increase in genome coverage in terms of structural modelling was computed by running PSI-BLAST against UniProt (release 12.8) for each PDB structure in turn, and by considering modelling subfamilies around each structure to decide on sequences for which the structure could be modelled.
Structural novelty was computed by considering domains from novel structures that have been classified in CATH release 3.2 (Greene et al., 2007). All domains in CATH v3.2 were structurally aligned against one another using SSAP (Orengo et al., 1996;Greene et al., 2007), and were clustered into structurally similar groups by complete clustering with a normalised RMSD cut-off of 5.0Å. The normalised RMSD score (normRMSD) is computed as follows:
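A form consistent with the three quantities defined in the next sentence, and with the way this score is usually written for CATH domain comparisons, is given here as a reconstruction under those assumptions rather than as a verbatim copy of the authors' equation:

$$
\mathrm{normRMSD} = \frac{\mathrm{RMSD} \times \max(L_1, L_2)}{N_{\mathrm{mat}}}
$$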
Where RMSD is the root mean square deviation of the superposition, max(L1,L2) is the length in amino acids of the largest domain in the superposition, and Nmat is the number of aligned residue pairs (Kolodny et al., 2005). Domains that are assigned to the same cluster are considered structurally similar, and the structure from each cluster that was first deposited in the PDB is considered to be structurally novel. Fold and superfamily novelty was evaluated by considering the first structure in each CATH fold and CATH superfamily to have been deposited in the PDB.
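As an illustration of the novelty rule just described, the sketch below computes a normalised RMSD from the three quantities named above and then picks the earliest-deposited domain in each structural cluster. The exact form of the score (RMSD scaled by the larger domain length over the number of aligned pairs) is an assumption, and the function names, domain identifiers and dates are hypothetical.

```python
from datetime import date

def norm_rmsd(rmsd: float, len1: int, len2: int, n_aligned: int) -> float:
    """Assumed form: RMSD * max(L1, L2) / Nmat."""
    return rmsd * max(len1, len2) / n_aligned

def first_deposited(clusters, deposited):
    """Return the earliest-deposited domain of each cluster.

    clusters: iterable of lists of domain IDs (structurally similar groups).
    deposited: dict mapping domain ID -> PDB deposition date.
    """
    return {min(cluster, key=lambda dom: deposited[dom]) for cluster in clusters}

# A 2.0 Å superposition over 100 aligned pairs of a 100- and a 150-residue domain.
print(norm_rmsd(2.0, 100, 150, 100))

dates = {"dom1": date(2006, 1, 1), "dom2": date(2004, 5, 1), "dom3": date(2007, 3, 2)}
print(first_deposited([["dom1", "dom2"], ["dom3"]], dates))
```

Under this rule a domain counts as structurally novel only if no earlier-deposited domain falls into the same 5Å cluster.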
Functional novelty was evaluated by mapping PDB structures to GO terms using the PDB to GO mapping provided by the MSD at the EBI (Velankar et al., 2005).
Results and statistics generated by the BIG4 groups, and presented in this article, are also available from the BIG4 website (http://psi-big4.org/).
This work was supported by a grant from the Protein Structure Initiative (PSI) of the National Institute of General Medical Sciences at the National Institutes of Health.