A fundamental bridge that enables us to link a protein sequence to its function is knowledge of its structure. It is well known that protein structure tends to be conserved to a greater degree than protein sequence and comparison of protein structure is often able to reveal functional relationships that are hidden at the sequence level (1
). The identification of links between structural relatives can be a powerful method for inferring function; in many cases, it has been shown that a small number of residues within a protein's active site or binding pocket are critical for biological activity, and the conservation of such residues may only become apparent through structural analysis (2
). Structural biology faces the task of characterizing the shapes and dynamics of the entire protein repertoire of whole genomes in order to facilitate an understanding of biochemical functions and their mechanisms of action within the cell. However, with the ever-growing disparity between the number of known sequences and known structures, the need to structurally and functionally annotate sequence space appears more pressing than ever. Structural genomics projects were instigated to address this issue through the large-scale determination of protein 3D structure (4
). To solve a structure for each genome sequence would be experimentally, practically and financially prohibitive (9
). Rather, many structural genomics initiatives aim to fill in areas of fold space and in doing so, provide structures that will cover surrounding sequence space by acting as a structural template for comparative modelling and fold recognition (1
). Increasing the coverage of structure annotations will reveal new insights into the relationships between protein sequence, structure and function, which in turn will expedite our understanding of protein function at the molecular level and improve the methods by which we can automatically provide structure-guided functional annotations to new protein structures (12
A variety of structural genomics initiatives are in progress around the world, including in the United States, where the Protein Structure Initiative (PSI), funded by the National Institute of General Medical Sciences (NIGMS) under the National Institutes of Health (NIH), began its pilot phase in 2000 (16
). Among its principal aims was the development of bioinformatics-based target selection and monitoring strategies that were able to meet the demands of the large amounts of data required for high-throughput genome-scale structure determination (18
). Traditional biology has now been solving protein structures for several decades; however, without a ‘global target plan’, solved structures tend to represent the interests of individual researchers, rather than specifically aiming to enrich our knowledge of structure space. Furthermore, a single structure is often solved more than once, bound to different ligands or with a range of amino acid substitutions. While these studies are fundamental to molecular biology, such endeavours would be considered redundant within the remit of many structural genomics projects. In order to map protein structure space more efficiently, most structural genomics groups apply a target selection strategy that increases the likelihood that a new structure will exhibit a novel fold or provide a new homologous superfamily in a previously observed fold group. Accordingly, a central step in target selection is the use of comparative sequence analysis to identify and exclude sequences that have a relative of known structure in the Protein Data Bank (PDB) (20
). However, there is no guarantee that all the remaining target sequences will be amenable to high-throughput analysis—the high attrition rate of target proteins in high-throughput structural genomics pipelines has been well documented [e.g. (21
)]. Many target selection protocols have attempted to reduce the number of these difficult proteins by excluding or truncating sequences that are predicted to contain low-complexity regions, coiled coils or transmembrane helices.
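The two filtering steps described above (excluding sequences with a relative of known structure, then excluding sequences dominated by predicted difficult regions) can be illustrated with a minimal sketch. The `Candidate` fields, the function name and the 0.3 cut-off are hypothetical choices made for illustration only; they are not taken from any particular structural genomics pipeline, and the predictions themselves are assumed to have been computed beforehand by external tools.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    seq_id: str
    length: int               # sequence length in residues
    has_pdb_relative: bool    # True if sequence comparison found a relative in the PDB
    difficult_residues: int   # residues predicted low-complexity, coiled-coil or transmembrane


def select_targets(candidates, max_difficult_fraction=0.3):
    """Return IDs of candidates that pass both filters (threshold is an assumption)."""
    selected = []
    for c in candidates:
        # Filter 1: skip sequences that already have a structural relative.
        if c.has_pdb_relative:
            continue
        # Filter 2: skip empty sequences and those dominated by difficult regions.
        if c.length == 0 or c.difficult_residues / c.length > max_difficult_fraction:
            continue
        selected.append(c.seq_id)
    return selected


candidates = [
    Candidate("seqA", 200, False, 10),
    Candidate("seqB", 150, True, 0),
    Candidate("seqC", 300, False, 180),
]
print(select_targets(candidates))
```

A real protocol might truncate a sequence to its tractable regions instead of discarding it, as the text notes; this sketch only shows the exclusion case.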
Increasingly, target selection is concerned with the organization of genome sequences into protein families (17
). These families can then be prioritized according to a range of properties, such as size, taxonomic distribution and suitability of family representatives for structure determination, directing efforts towards well-considered regions of sequence space. Although these principles of target selection are widely employed in the structural genomics community, a varied array of target selection strategies has also been developed to meet the particular requirements of different initiatives. Such considerations have included the prioritization of representatives from large families or the identification of ORFan sequences, about whose origin and function little is known. Proteins have also been targeted according to their species distribution, which may correlate with their general function, e.g. proteins found in all three superkingdoms of life or those found only in single pathogenic organisms.
Recently, Chandonia and Brenner (27
) proposed the Pfam5000 target selection strategy, which aims to provide a roadmap for coordinated target selection in the second phase of the PSI. This approach guides the selection of a manageable number of target proteins from a list of the 5000 largest Pfam families, many of which lack a member of known structure. By solving structures such that each of the 5000 largest Pfam families has at least one structural representative, it was shown that at least one fold assignment could be achieved for 68% and 61% of all prokaryotic and eukaryotic proteins, respectively.
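The kind of coverage figure quoted above can be illustrated with a small sketch: given family assignments for each protein, rank families by size and count the fraction of proteins that contain at least one domain from the N largest families. The function name, the input layout and the toy data are assumptions made for illustration; in particular, family size is measured here as the number of proteins containing the family, which may differ from the definition used in the Pfam5000 study.

```python
from collections import Counter


def coverage_by_largest_families(assignments, n_families):
    """Fraction of proteins with at least one domain in the n largest families.

    assignments maps a protein ID to the set of family IDs found in it.
    """
    if not assignments:
        return 0.0
    # Family size = number of proteins in which the family occurs (an assumption).
    family_sizes = Counter(fam for fams in assignments.values() for fam in fams)
    largest = {fam for fam, _ in family_sizes.most_common(n_families)}
    covered = sum(1 for fams in assignments.values() if fams & largest)
    return covered / len(assignments)


# Toy data: six proteins, three families, one protein without any assignment.
toy = {
    "p1": {"A"},
    "p2": {"A", "B"},
    "p3": {"A"},
    "p4": {"B"},
    "p5": {"C"},
    "p6": set(),
}
for n in (1, 2, 3):
    print(n, coverage_by_largest_families(toy, n))
```

Counting per protein, as here, gives whole-sequence coverage; counting per domain sequence instead would require one record per domain and not one per protein.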
A diverse range of methods have been applied to the problem of automatically clustering large collections of protein sequences, such as Swiss-Prot/TrEMBL (28
), into families. Such methods include ProDom (29
), DIVCLUS (30
), ProtoNet (31
), GeneRAGE (32
), SYSTERS (33
), ADDA (34
) and CHOP (35
). These methods first aim to define domain regions in protein sequences which are then clustered based on some measure of relatedness shared between domain sequences. The TribeMCL algorithm, developed by Enright and co-workers (36
), aims to assign complete protein sequences into families that correlate closely with overall domain architecture. This family assignment protocol has been used to create the Gene3D database (37
), which has been used extensively in the target selection pipeline of the Midwest Center for Structural Genomics (MCSG). Resources such as 3D-GENOMICS (39
) and SUPERFAMILY (40
) aim to provide domain annotations to genome sequences using sensitive sequence profile methods such as PSI-BLAST (41
) and hidden Markov models (HMMs) (42
). Similarly, sequences in the Gene3D database are annotated at the domain level, using an HMM library of CATH and Pfam domains, which enables us to provide a wide variety of whole-gene and domain-level protein family assignments.
Many families of protein domains, particularly in the case of large families, display divergence in their molecular function (43
). It has been proposed that a future direction for target selection in the second phase of the Protein Structure Initiative (PSI2) should include fine-grained targets, selected from large families of proteins in order to provide more thorough coverage of functional space as it relates to protein structure. The level of granularity required for the selection of additional targets must take into account the number of sequences that can be computationally modelled with ‘useful’ accuracy from each solved structure, based on the medical and biological relevance of individual protein families. It is generally accepted that sequences sharing 30% or more sequence identity are likely to share a similar fold (44
) and accordingly, to confidently construct models of reasonable accuracy, at least 30% sequence identity must be shared between the sequence to be modelled and the structural template (11
Here, we present an analysis of sequences from 203 completed genomes clustered in the Gene3D resource. We analyse the growth in new families together with the distributions of singleton sequences as the number of available completed genomes has increased. Over 90% of the genome sequences can be provided with CATH, Pfam, transmembrane-helix, coiled-coil, low-complexity or N-terminal signal peptide annotations. The mapping of CATH and Pfam domains to the sequences held in Gene3D has enabled us to measure the genome coverage provided by the largest CATH and Pfam domain families, where we aim to calculate coverage on the basis of domain sequences, rather than whole-gene sequences. We find that many of the remaining unannotated sequence regions appear domain-like in length and belong to a large number of small families that tend to be species specific. The comprehensive domain annotation of a large number of genomes has also enabled us to calculate more accurate estimates of the number of multi-domain proteins in eukaryotes and prokaryotes than those of previous studies. We also show that while over 70% of domain sequences in 203 completed genomes fall into just 2000 CATH and Pfam domain families, the number of structures that would be required to provide useful homology models for these sequences approaches 90 000, clearly an unachievable demand for structural genomics. With this in mind, it will be important to develop and improve methods to identify shifts in function across large superfamilies in order to suggest additional relatives for structure determination.
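To give a feel for why the number of required structures grows so quickly, the following sketch estimates how many templates a family needs if every member must share at least 30% sequence identity with some solved structure. It uses a simple greedy set-cover heuristic over precomputed pairwise identities; this is an illustrative stand-in and not necessarily the procedure used in this analysis. The function name, the `identity` lookup keyed by unordered pairs and the toy values are all hypothetical.

```python
def greedy_templates(ids, identity, threshold=0.30):
    """Greedily pick templates so each sequence is within threshold of one.

    ids is a list of sequence IDs; identity maps frozenset({a, b}) to the
    fractional sequence identity between a and b (missing pairs count as 0).
    """
    # Each sequence "covers" itself and every sequence at or above the threshold.
    covers = {
        a: {a} | {b for b in ids
                  if b != a and identity.get(frozenset((a, b)), 0.0) >= threshold}
        for a in ids
    }
    uncovered = set(ids)
    chosen = []
    while uncovered:
        # Pick the sequence that covers the most still-uncovered members.
        best = max(ids, key=lambda s: len(covers[s] & uncovered))
        chosen.append(best)
        uncovered -= covers[best]
    return chosen


ids = ["a", "b", "c", "d"]
identity = {
    frozenset(("a", "b")): 0.50,
    frozenset(("a", "c")): 0.40,
    frozenset(("b", "c")): 0.10,
    frozenset(("c", "d")): 0.20,
}
print(greedy_templates(ids, identity))
```

In this toy family, one template models three of the four sequences, while the divergent member needs a structure of its own; summed over thousands of large, divergent families, such outliers drive the total upwards.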