Mutations in dysferlin, the first protein linked with the cell membrane repair mechanism, cause a group of muscular dystrophies called dysferlinopathies. Dysferlin is a type two-anchored membrane protein, with a single C-terminal transmembrane helix and most of the protein lying in the cytoplasm. Dysferlin contains several C2 domains and two DysF domains, which are nested one inside the other. Many pathogenic point mutations fall in the DysF domain region.
We describe the crystal structure of the human dysferlin inner DysF domain at a resolution of 1.9 Ångstroms. Most of the pathogenic mutations affect residues that are part of aromatic/arginine stacks that hold the domain in a folded conformation. The high resolution of the structure shows that these interactions are a mixture of parallel ring/guanidinium stacking, perpendicular H-bond stacking and aliphatic chain packing.
The high-resolution structure of the dysferlin DysF domain provides a template on which to interpret in detail the pathogenic mutations that lead to disease.
Dysferlin; Limb girdle muscular dystrophy 2B; Arginine-tryptophan stacking; DysF domain; Crystal structure
Gene3D (http://gene3d.biochem.ucl.ac.uk) is a database of protein domain structure annotations for protein sequences. Domains are predicted using a library of profile HMMs from 2738 CATH superfamilies. Gene3D assigns domain annotations to Ensembl and UniProt sequence sets including >6000 cellular genomes and >20 million unique protein sequences. This represents an increase of 45% in the number of protein sequences since our last publication. Thanks to improvements in the underlying data and pipeline, we see large increases in the domain coverage of sequences. We have expanded this coverage by integrating Pfam and SUPERFAMILY domain annotations, and we now resolve domain overlaps to provide highly comprehensive composite multi-domain architectures. To make these data more accessible for comparative genome analyses, we have developed novel search algorithms for searching genomes to identify related multi-domain architectures. In addition to providing domain family annotations, we have now developed a pipeline for 3D homology modelling of domains in Gene3D. This has been applied to the human genome and will be rolled out to other major organisms over the next year.
The TATA binding protein (TBP) is an essential transcription initiation factor in Archaea and Eucarya. Bacteria lack TBP, and instead use sigma factors for transcription initiation. TBP has a symmetric structure comprising two repeated TBP domains. Using sequence, structural and phylogenetic analyses, we examine the distribution and evolutionary history of the TBP domain, a member of the helix-grip fold family. Our analyses reveal a broader distribution than for TBP, with TBP-domains being present across all three domains of life. In contrast to TBP, all other characterized examples of the TBP domain are present as single copies, primarily within multidomain proteins. The presence of the TBP domain in the ubiquitous DNA glycosylases suggests that this fold traces back to the ancestor of all three domains of life. The TBP domain is also found in RNase HIII, and phylogenetic analyses show that RNase HIII has evolved from bacterial RNase HII via TBP-domain fusion. Finally, our comparative genomic screens confirm and extend earlier reports of proteins consisting of a single TBP domain among some Archaea. These monopartite TBP-domain proteins suggest that this domain is functional in its own right, and that the TBP domain could have first evolved as an independent protein, which was later recruited in different contexts.
Genome3D, available at http://www.genome3d.eu, is a new collaborative project that integrates UK-based structural resources to provide a unique perspective on sequence–structure–function relationships. Leading structure prediction resources (DomSerf, FUGUE, Gene3D, pDomTHREADER, Phyre and SUPERFAMILY) provide annotations for UniProt sequences to indicate the locations of structural domains (structural annotations) and their 3D structures (structural models). Structural annotations and 3D model predictions are currently available for three model genomes (Homo sapiens, E. coli and baker’s yeast), and the project will extend to other genomes in the near future. As these resources exploit different strategies for predicting structures, the main aim of Genome3D is to enable comparisons between all the resources so that biologists can see where predictions agree and are therefore more trusted. Furthermore, as these methods differ in whether they build their predictions using CATH or SCOP, Genome3D also contains the first official mapping between these two databases. This has identified pairs of similar superfamilies from the two resources at various degrees of consensus (532 bronze pairs, 527 silver pairs and 370 gold pairs).
CATH version 3.5 (Class, Architecture, Topology, Homology, available at http://www.cathdb.info/) contains 173 536 domains, 2626 homologous superfamilies and 1313 fold groups. When focusing on structural genomics (SG) structures, we observe that the number of new folds for CATH v3.5 is slightly less than for previous releases, and this observation suggests that we may now know the majority of folds that are easily accessible to structure determination. We have improved the accuracy of our functional family (FunFams) sub-classification method and the CATH sequence domain search facility has been extended to provide FunFam annotations for each domain. The CATH website has been redesigned. We have improved the display of functional data and of conserved sequence features associated with FunFams within each CATH superfamily.
The mitotic spindle is an essential molecular machine involved in cell division, whose composition has been studied extensively by detailed cellular biology, high-throughput proteomics, and RNA interference experiments. However, because of its dynamic organization and complex regulation it is difficult to obtain a complete description of its molecular composition. We have implemented an integrated computational approach to characterize novel human spindle components and have analysed in detail the individual candidates predicted to be spindle proteins, as well as the network of predicted relations connecting known and putative spindle proteins. The subsequent experimental validation of a number of predicted novel proteins confirmed not only their association with the spindle apparatus but also their role in mitosis. We found that 75% of our tested proteins localize to the spindle apparatus, compared to a success rate of 35% when expert knowledge alone was used. We compare our results to the previously published MitoCheck study and find that our approach validates some of this consortium's findings. Further, we predict so-called “hidden spindle hubs”, proteins whose network of interactions is still poorly characterised by experimental means and which are thought to influence the functionality of the mitotic spindle on a large scale. Our analyses suggest that we are still far from knowing the complete repertoire of functionally important components of the human spindle network. Combining integrated bio-computational approaches and single gene experimental follow-ups could be key to exploring the still hidden regions of the human spindle system.
Gene3D (http://gene3d.biochem.ucl.ac.uk) is a comprehensive database of protein domain assignments for sequences from the major sequence databases. Domains are directly mapped from structures in the CATH database or predicted using a library of representative profile HMMs derived from CATH superfamilies. As previously described, Gene3D integrates many other protein family and function databases. These facilitate complex associations of molecular function, structure and evolution. Gene3D now includes a domain functional family (FunFam) level below the homologous superfamily level assignments. Additions have also been made to the interaction data. More significantly, to help with the visualization and interpretation of multi-genome scale data sets, we have developed a new, revamped website. Searching has been simplified with more sophisticated filtering of results, along with new tools based on Cytoscape Web, for visualizing protein–protein interaction networks, differences in domain composition between genomes and the taxonomic distribution of individual superfamilies.
InterPro (http://www.ebi.ac.uk/interpro/) is a database that integrates diverse information about protein families, domains and functional sites, and makes it freely available to the public via Web-based interfaces and services. Central to the database are diagnostic models, known as signatures, against which protein sequences can be searched to determine their potential function. InterPro has utility in the large-scale analysis of whole genomes and meta-genomes, as well as in characterizing individual protein sequences. Herein we give an overview of new developments in the database and its associated software since 2009, including updates to database content, curation processes and Web and programmatic interfaces.
The Gene3D structural domain database provides domain annotations for 7 million proteins, based on the manually curated structural domain superfamilies in CATH. These annotations are integrated with functional, genomic and molecular information from external resources, such as GO, EC, UniProt and the NCBI Taxonomy database. We have constructed a set of web services that provide programmatic access to this integrated database, as well as the Gene3D domain recognition tool (Gene3DScan) and protein sequence annotation pipeline for analysing novel protein sequences. Example queries include retrieving all curated GO terms for a domain superfamily or all the multi-domain architectures for the human genome. The services can be accessed using simple HTTP calls and are able to return results in a range of formats for quick downloading and easy parsing, graphical rendering and data storage. Hence, they provide a simple, but flexible means of integrating domain annotations and associated data sets into locally run pipelines and analysis software. The services can be found at http://gene3d.biochem.ucl.ac.uk/WebServices/.
Accurate modelling of biological systems requires a deeper and more complete knowledge about the molecular components and their functional associations than we currently have. Traditionally, new knowledge on protein associations generated by experiments has played a central role in systems modelling, in contrast to generally less trusted bio-computational predictions. However, we will not achieve realistic modelling of complex molecular systems if the current experimental designs lead to biased screenings of real protein networks and leave large, functionally important areas poorly characterised. To assess the likelihood of this, we have built comprehensive network models of the yeast and human proteomes by using a meta-statistical integration of diverse computationally predicted protein association datasets. We have compared these predicted networks against combined experimental datasets from seven biological resources at different levels of statistical significance. These eukaryotic predicted networks resemble all the topological and noise features of the experimentally inferred networks in both species, and we also show that this observation is not due to random behaviour. In addition, the topology of the predicted networks contains information on true protein associations, beyond the constitutive first order binary predictions. We also observe that most of the reliable predicted protein associations are experimentally uncharacterised in our models, constituting the hidden or “dark matter” of networks by analogy to astronomical systems. Some of this dark matter shows enrichment of particular functions and contains key functional elements of protein networks, such as hubs associated with important functional areas like the regulation of Ras protein signal transduction in human cells. Thus, characterising this large and functionally important dark matter, elusive to established experimental designs, may be crucial for modelling biological systems.
In any case, these predictions provide a valuable guide to these experimentally elusive regions.
To model accurate protein networks we need to extend our knowledge of protein associations in molecular systems much further. Biologists believe that high-throughput experiments will fill the gaps in our knowledge. However, if these approaches perform biased screenings, leaving important areas poorly characterized, success in modelling protein networks will require additional approaches to explore these ‘dark’ areas. We assess the value of integrating bio-computational approaches to build accurate and comprehensive network models for human and yeast proteomes and compare these models with models derived by combining multiple experimental datasets. We show that the predicted networks resemble the topological and error features of the experimental networks, and contain information on true protein associations within and beyond their constitutive first order binary predictions. We suggest that the majority of predicted network space is dark matter containing important functional areas, elusive to current experimental designs. Until novel experimental designs emerge as effective tools to screen these hidden regions, computational predictions will be a valuable approach for exploring them.
Over the last 2 years the Gene3D resource has been significantly improved and is now more accurate, with a much richer interactive display via the Gene3D website (http://gene3d.biochem.ucl.ac.uk/). Gene3D provides accurate structural domain family assignments for over 1100 genomes and nearly 10 000 000 proteins. A hidden Markov model library, constructed from the manually curated CATH structural domain hierarchy, is used to search UniProt, RefSeq and Ensembl protein sequences. The resulting matches are refined into simple multi-domain architectures using a recently developed in-house algorithm, DomainFinder 3 (available at: ftp://ftp.biochem.ucl.ac.uk/pub/gene3d_data/DomainFinder3/). The domain assignments are integrated with multiple external protein function descriptions (e.g. Gene Ontology and KEGG), structural annotations (e.g. coiled coils, disordered regions and sequence polymorphisms) and family resources (e.g. Pfam and eggNog) and displayed on the Gene3D website. The website allows users to view descriptions both for single proteins and genes and for large protein sets, such as superfamilies or genomes. Subsets can then be selected for detailed investigation, or associated functions and interactions can be used to expand explorations to new proteins. Gene3D also provides a set of services, including an interactive genome coverage graph visualizer, DAS annotation resources, sequence search facilities and SOAP services.
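The refinement of overlapping HMM matches into a single multi-domain architecture can be illustrated with a small sketch. The abstract does not describe how DomainFinder 3 works internally, so the code below is not that algorithm: it substitutes a generic weighted interval scheduling approach, which picks the non-overlapping subset of matches with the highest total score. The `Match` record, the `resolve_overlaps` function, the scores and the family labels are all hypothetical names and values chosen for illustration.

```python
from bisect import bisect_right
from typing import List, NamedTuple


class Match(NamedTuple):
    """One HMM match on a protein sequence (hypothetical record layout)."""
    family: str   # domain family label, illustrative only
    start: int    # 1-based inclusive start residue
    end: int      # 1-based inclusive end residue
    score: float  # higher means a more trusted match (assumed)


def resolve_overlaps(matches: List[Match]) -> List[Match]:
    """Pick the non-overlapping subset of matches with the highest total score.

    Generic weighted interval scheduling by dynamic programming; a stand-in
    for illustration, not the DomainFinder 3 algorithm.
    """
    ordered = sorted(matches, key=lambda m: m.end)
    ends = [m.end for m in ordered]
    best = [0.0] * (len(ordered) + 1)   # best[i]: best score using first i matches
    take = [False] * len(ordered)       # whether match i is used in best[i + 1]
    prev = [0] * len(ordered)           # how many earlier matches end before match i starts
    for i, m in enumerate(ordered):
        p = bisect_right(ends, m.start - 1, 0, i)
        prev[i] = p
        with_m = best[p] + m.score
        if with_m > best[i]:
            best[i + 1] = with_m
            take[i] = True
        else:
            best[i + 1] = best[i]
    # Walk back through the table to recover the chosen matches.
    chosen = []
    i = len(ordered)
    while i > 0:
        if take[i - 1]:
            chosen.append(ordered[i - 1])
            i = prev[i - 1]
        else:
            i -= 1
    return list(reversed(chosen))


example = [
    Match("3.40.50.300", 1, 100, 50.0),
    Match("1.10.10.10", 80, 150, 30.0),   # overlaps both neighbours
    Match("2.60.40.10", 110, 200, 40.0),
]
architecture = [m.family for m in resolve_overlaps(example)]
```

In this toy input the middle match overlaps both of its neighbours and scores lower than the pair combined, so the resulting architecture keeps the first and third matches in sequence order.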
The phenotypic effects of sequence variations in protein-coding regions come about primarily via their effects on the resulting structures, for example by disrupting active sites or affecting structural stability. To better understand the mechanisms behind known mutant phenotypes, and to predict the effects of novel variations, biologists need tools to gauge the impacts of DNA mutations in terms of their structural manifestation. Although many mutations occur within domains whose structure has been solved, many more occur within genes whose protein products have not been structurally characterized.
Here we present 3DSim (3D Structural Implication of Mutations), a database and web application facilitating the localization and visualization of single amino acid polymorphisms (SAAPs) mapped to protein structures even where the structure of the protein of interest is unknown. The server displays information on 6514 point mutations, 4865 of them known to be associated with disease. These polymorphisms are drawn from SAAPdb, which aggregates data from various sources including dbSNP and several pathogenic mutation databases. While the SAAPdb interface displays mutations on known structures, 3DSim projects mutations onto known sequence domains in Gene3D. This resource contains sequences annotated with domains predicted to belong to structural families in the CATH database. Mappings between domain sequences in Gene3D and known structures in CATH are obtained using a MUSCLE alignment. 1210 three-dimensional structures corresponding to CATH structural domains are currently included in 3DSim; these domains are distributed across 396 CATH superfamilies, and provide a comprehensive overview of the distribution of mutations in structural space.
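The projection of a mutation from a domain sequence onto a known structure via an alignment can be sketched as follows. This is a minimal illustration, assuming the alignment is already available as two equal-length gapped strings with `-` as the gap character; it does not run MUSCLE, and the function name `map_position` and the toy sequences are hypothetical.

```python
from typing import Optional


def map_position(aligned_query: str, aligned_template: str,
                 query_pos: int) -> Optional[int]:
    """Map a 1-based residue index in the query onto the template.

    Both arguments are rows of one pairwise alignment ('-' marks a gap).
    Returns the 1-based template residue index, or None when the query
    residue is aligned to a gap in the template.
    """
    if len(aligned_query) != len(aligned_template):
        raise ValueError("alignment rows must have equal length")
    q = t = 0  # residues seen so far in each row, gaps excluded
    for qc, tc in zip(aligned_query, aligned_template):
        if qc != "-":
            q += 1
        if tc != "-":
            t += 1
        if qc != "-" and q == query_pos:
            return t if tc != "-" else None
    raise IndexError("query_pos is outside the query sequence")


# Toy alignment: the query lacks the template's third residue,
# and the query's fourth residue has no template counterpart.
query_row = "MK-LVRA"
template_row = "MKTL-RA"
mapped = [map_position(query_row, template_row, p) for p in (3, 4, 5)]
```

A mutation at query residue 3 lands on template residue 4 here, while a mutation at query residue 4 falls in an insertion relative to the structure and therefore has no structural position to display.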
The server is publicly available at . In addition, the database containing the mapping between SAAPdb, Gene3D and CATH is available on request and most of the functionality is available through programmatic web service access.
The InterPro database (http://www.ebi.ac.uk/interpro/) integrates predictive models or ‘signatures’ representing protein domains, families and functional sites from multiple, diverse source databases: Gene3D, PANTHER, Pfam, PIRSF, PRINTS, ProDom, PROSITE, SMART, SUPERFAMILY and TIGRFAMs. Integration is performed manually and approximately half of the total ∼58 000 signatures available in the source databases belong to an InterPro entry. Recently, we have also started to display the remaining un-integrated signatures via our web interface. Other developments include the provision of non-signature data, such as structural data, in new XML files on our FTP site, as well as the inclusion of matchless UniProtKB proteins in the existing match XML files. The web interface has been extended and now links out to the ADAN predicted protein–protein interaction database and the SPICE and Dasty viewers. The latest public release (v18.0) covers 79.8% of UniProtKB (v14.1) and consists of 16 549 entries. InterPro data may be accessed either via the web address above, via web services, by downloading files by anonymous FTP or by using the InterProScan search software (http://www.ebi.ac.uk/Tools/InterProScan/).
Gene3D provides comprehensive structural and functional annotation of most available protein sequences, including the UniProt, RefSeq and Integr8 resources. The main structural annotation is generated through scanning these sequences against the CATH structural domain database profile-HMM library. CATH is a database of manually derived PDB-based structural domains, placed within a hierarchy reflecting topology, homology and conservation, and is able to infer more ancient and divergent homology relationships than sequence-based approaches. This data is supplemented with Pfam-A, other non-domain structural predictions (e.g. coiled coils) and experimental data from UniProt. In order to enhance the investigations possible with this data, we have also incorporated a variety of protein annotation resources, including protein–protein interaction data, GO functional assignments, KEGG pathways, FUNCAT functional descriptions and links to microarray expression data. All of this data can be accessed through a newly re-designed website that has a focus on flexibility and clarity, with searches that can be restricted to a single genome or across the entire sequence database. Currently Gene3D contains over 3.5 million domain assignments for nearly 5 million proteins including 527 completed genomes. This is available at: http://gene3d.biochem.ucl.ac.uk/
“Phylogenetic profiling” is based on the hypothesis that during evolution functionally or physically interacting genes are likely to be inherited or eliminated in a codependent manner. Creating presence–absence profiles of orthologous genes is now a common and powerful way of identifying functionally associated genes. In this approach, correctly determining orthology, as a means of identifying functional equivalence between two genes, is a critical and nontrivial step and largely explains why previous work in this area has mainly focused on using presence–absence profiles in prokaryotic species. Here, we demonstrate that eukaryotic genomes have a high proportion of multigene families whose phylogenetic profile distributions are poor in presence–absence information content. This feature makes them prone to orthology mis-assignment and unsuited to standard profile-based prediction methods. Using CATH structural domain assignments from the Gene3D database for 13 complete eukaryotic genomes, we have developed a novel modification of the phylogenetic profiling method that uses genome copy number of each domain superfamily to predict functional relationships. In our approach, superfamilies are subclustered at ten levels of sequence identity—from 30% to 100%—and phylogenetic profiles built at each level. All the profiles are compared using normalised Euclidean distances to identify those with correlated changes in their domain copy number. We demonstrate that two protein families will “auto-tune” with strong co-evolutionary signals when their profiles are compared at the similarity levels that capture their functional relationship. Our method finds functional relationships that are not detectable by the conventional presence–absence profile comparisons, and it does not require a priori any fixed criteria to define orthologous genes.
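The comparison of copy-number profiles across identity levels can be sketched in a few lines. The abstract does not state how the Euclidean distance is normalised, so this sketch assumes each profile is scaled to unit length before the distance is taken; the function names, the choice of normalisation and the toy copy numbers are all assumptions made for illustration rather than the published procedure.

```python
import math
from typing import Dict, List, Optional, Tuple


def normalise(profile: List[float]) -> List[float]:
    """Scale a copy-number profile to unit length (assumed normalisation)."""
    norm = math.sqrt(sum(x * x for x in profile))
    if norm == 0:
        return [0.0] * len(profile)
    return [x / norm for x in profile]


def profile_distance(a: List[float], b: List[float]) -> float:
    """Euclidean distance between two unit-normalised profiles."""
    if len(a) != len(b):
        raise ValueError("profiles must cover the same genomes")
    na, nb = normalise(a), normalise(b)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(na, nb)))


def best_level_pair(
    profiles_a: Dict[int, List[float]],
    profiles_b: Dict[int, List[float]],
) -> Optional[Tuple[float, int, int]]:
    """Return (distance, level_a, level_b) for the closest pair of profiles.

    Each dict maps a sequence identity level (in percent) to the copy
    number of the family's subcluster in each genome, in a fixed order.
    """
    best = None
    for level_a, pa in profiles_a.items():
        for level_b, pb in profiles_b.items():
            d = profile_distance(pa, pb)
            if best is None or d < best[0]:
                best = (d, level_a, level_b)
    return best


# Toy copy numbers over four genomes at two identity levels per family.
family_a = {30: [4, 8, 2, 6], 60: [1, 1, 1, 1]}
family_b = {30: [9, 1, 7, 2], 60: [2, 4, 1, 3]}
distance, level_a, level_b = best_level_pair(family_a, family_b)
```

In this toy data the profile of family A at 30% identity is proportional to that of family B at 60% identity, so that pair of levels gives a distance of essentially zero. This mirrors the idea that correlated copy-number changes may only become visible when each family is viewed at its own appropriate level of sequence similarity.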
The vast number of protein sequences being determined by the international genomics projects means that it is not possible to functionally characterise all the proteins through direct experimentation. One of the more successful electronic methods for detecting functionally associated genes has been through the comparison of genes' phylogenetic profiles. This method is based on the hypothesis that two functionally related genes will show very similar presence–absence profile patterns throughout different organisms. Whilst these methods have grown increasingly sophisticated, they have largely been based on detecting functionally homologous genes in different species (technically known as orthologous genes) and are thus better suited to prokaryotic genomes, where this can be done more easily. We have developed a new type of hierarchical phylogenetic profile by subdividing protein families into subclusters at different sequence identity levels. This new approach encapsulates a more realistic model of the functional variation that uneven natural selection pressure produces on different protein families and organisms, and it can detect functional relationships between protein families without the initial application of rigid sequence similarity thresholds or complex protocols for orthology assignment. These advantages are especially useful in eukaryotes since the larger average size of eukaryotic multigene families makes them more prone to orthology mis-assignment than in prokaryotes.
New directions in biology are being driven by the complete sequencing of genomes, which has given us the protein repertoires of diverse organisms from all kingdoms of life. In tandem with this accumulation of sequence data, worldwide structural genomics initiatives, advanced by the development of improved technologies in X-ray crystallography and NMR, are expanding our knowledge of structural families and increasing our fold libraries. Methods for detecting remote sequence similarities have also been made more sensitive and this means that we can map domains from these structural families onto genome sequences to understand how these families are distributed throughout the genomes and reveal how they might influence the functional repertoires and biological complexities of the organisms.
We have used robust protocols to assign sequences from completed genomes to domain structures in the CATH database, allowing up to 60% of domain sequences in these genomes, depending on the organism, to be assigned to a domain family of known structure. Analysis of the distribution of these families throughout bacterial genomes identified more than 300 universal families, some of which had expanded significantly in proportion to genome size. These highly expanded families are primarily involved in metabolism and regulation and appear to make major contributions to the functional repertoire and complexity of bacterial organisms.
When comparisons are made across all kingdoms of life, we find a smaller set of universal domain families (approx. 140), of which families involved in protein biosynthesis are the largest conserved component. Analysis of the behaviour of other families reveals that some (e.g. those involved in metabolism, regulation) have remained highly innovative during evolution, making it harder to trace their evolutionary ancestry. Structural analyses of metabolic families provide some insights into the mechanisms of functional innovation, which include changes in domain partnerships and significant structural embellishments leading to modulation of active sites and protein interactions.
protein structure; function; evolution; genome analysis
InterPro is an integrated resource for protein families, domains and functional sites, which integrates the following protein signature databases: PROSITE, PRINTS, ProDom, Pfam, SMART, TIGRFAMs, PIRSF, SUPERFAMILY, Gene3D and PANTHER. The latter two new member databases have been integrated since the last publication in this journal. There have been several new developments in InterPro, including an additional reading field, new database links, extensions to the web interface and additional match XML files. InterPro has always provided matches to UniProtKB proteins on the website and in the match XML file on the FTP site. Additional matches to proteins in UniParc (UniProt archive) are now available for download in the new match XML files only. The latest InterPro release (13.0) contains more than 13 000 entries, covering over 78% of all proteins in UniProtKB. The database is available for text- and sequence-based searches via a webserver (), and for download by anonymous FTP (). The InterProScan search tool is now also available via a web service at .
We report the latest release (version 3.0) of the CATH protein domain database (). There has been a 20% increase in the number of structural domains classified in CATH, up to 86 151 domains. Release 3.0 comprises 1110 fold groups and 2147 homologous superfamilies. To cope with the increases in diverse structural homologues being determined by the structural genomics initiatives, more sensitive methods have been developed for identifying boundaries in multi-domain proteins and for recognising homologues. The CATH classification update is now being driven by an integrated pipeline that links these automated procedures with validation steps that have been made easier by the provision of information-rich web pages summarising comparison scores and relevant links to external sites for each domain being classified. An analysis of the population of domains in the CATH hierarchy and several domain characteristics are presented for version 3.0. We also report an update of the CATH Dictionary of homologous structures (CATH-DHS), which now contains multiple structural alignments, consensus information and functional annotations for 1459 well populated superfamilies in CATH. CATH is directly linked to the Gene3D database, which is a projection of CATH structural data onto ∼2 million sequences in completed genomes and UniProt.
We present an analysis of 203 completed genomes in the Gene3D resource (including 17 eukaryotes), which demonstrates that the number of protein families is continually expanding over time and that singleton-sequences appear to be an intrinsic part of the genomes. A significant proportion of the proteomes can be assigned to fewer than 6000 well-characterized domain families with the remaining domain-like regions belonging to a much larger number of small uncharacterized families that are largely species specific. Our comprehensive domain annotation of 203 genomes enables us to provide more accurate estimates of the number of multi-domain proteins found in the three kingdoms of life than previous calculations. We find that 67% of eukaryotic sequences are multi-domain compared with 56% of sequences in prokaryotes. By measuring the domain coverage of genome sequences, we show that the structural genomics initiatives should aim to provide structures for less than a thousand structurally uncharacterized Pfam families to achieve reasonable structural annotation of the genomes. However, in large families, additional structures should be determined as these would reveal more about the evolution of the family and enable a greater understanding of how function evolves.
The Gene3D release 4 database and web portal () provide a combined structural, functional and evolutionary view of the protein world. It is focussed on providing structural annotation for protein sequences without structural representatives—including the complete proteome sets of over 240 different species. The protein sequences have also been clustered into whole-chain families so as to aid functional prediction. The structural annotation is generated using HMM models based on the CATH domain families; CATH is a repository for manually deduced protein domains. Amongst the changes from the last publication are: the addition of over 100 genomes and the UniProt sequence database, domain data from Pfam, metabolic pathway and functional data from COGs, KEGG and GO, and protein–protein interaction data from MINT and BIND. The website has been rebuilt to allow more sophisticated querying and the data returned is presented in a clearer format with greater functionality. Furthermore, all data can be downloaded in a simple XML format, allowing users to carry out complex investigations at their own computers.
Pfam is a large collection of protein families and domains. Over the past 2 years the number of families in Pfam has doubled and now stands at 6190 (version 10.0). Methodology improvements for searching the Pfam collection locally as well as via the web are described. Other recent innovations include modelling of discontinuous domains allowing Pfam domain definitions to be closer to those found in structure databases. Pfam is available on the web in the UK (http://www.sanger.ac.uk/Software/Pfam/), the USA (http://pfam.wustl.edu/), France (http://pfam.jouy.inra.fr/) and Sweden (http://Pfam.cgb.ki.se/).
Streptomyces coelicolor has long been considered a remarkable bacterium with a complex life-cycle, ubiquitous environmental distribution, linear chromosomes and plasmids, and a huge range of pharmaceutically useful secondary metabolites. Completion of the genome sequence demonstrated that this diversity carried through to the genetic level, with over 7000 genes identified. We sought to expand our understanding of this organism at the molecular level through identification and annotation of novel protein domains. Protein domains are the evolutionarily conserved units from which proteins are formed.
Two automated methods were employed to rapidly generate an optimised set of targets, which were subsequently analysed manually. A final set of 37 domains or structural repeats, represented 204 times in the genome, was developed. Using these families enabled us to correlate items of information from many different resources. Several immediately enhance our understanding both of S. coelicolor and also general bacterial molecular mechanisms, including cell wall biosynthesis regulation and streptomycete telomere maintenance.
Delineation of protein domain families enables detailed analysis of protein function, as well as identification of likely regions or residues of particular interest. Hence this kind of prior approach can increase the rate of discovery in the laboratory. Furthermore, we demonstrate that with this type of in silico method it is possible to generate new biological information fairly rapidly from previously uncorrelated data.