Due to the size of Next-Generation Sequencing data, the computational challenge of sequence alignment has been vast. Inexact alignments can take up to 90% of total CPU time in bioinformatics pipelines. High-performance Integrated Virtual Environment (HIVE), a cloud-based environment optimized for storage and analysis of extra-large data, presents an algorithmic solution: the HIVE-hexagon DNA sequence aligner. HIVE-hexagon implements novel approaches to exploit both characteristics of sequence space and CPU, RAM and Input/Output (I/O) architecture to quickly compute accurate alignments. Key components of HIVE-hexagon include non-redundification and sorting of sequences; floating diagonals of linearized dynamic programming matrices; and consideration of cross-similarity to minimize computations.
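The non-redundification and sorting step can be illustrated with a minimal sketch. This is not the HIVE-hexagon implementation; the function name `collapse_reads` and the toy reads are assumptions made purely for illustration.

```python
from collections import Counter

def collapse_reads(reads):
    """Collapse identical reads into (sequence, count) pairs, sorted by sequence."""
    counts = Counter(reads)
    # Sorting places similar sequences next to each other, so an aligner
    # could reuse work between neighbours that share a prefix.
    return sorted(counts.items())

# Made-up reads; real input would come from a FASTA/FASTQ file.
reads = ["ACGT", "TTGA", "ACGT", "ACGA", "TTGA", "ACGT"]
unique_reads = collapse_reads(reads)
for sequence, count in unique_reads:
    print(sequence, count)
```

Each distinct sequence then needs to be aligned only once, with its count carried along to weight downstream results.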
We have previously suggested a method for proteome-wide analysis of variation at functional residues wherein we identified the set of all human genes with nonsynonymous single nucleotide variation (nsSNV) in the active site residue of the corresponding proteins. 34 of these proteins were shown to have a 1:1:1 enzyme:pathway:reaction relationship, making these proteins ideal candidates for laboratory validation through creation and observation of specific yeast active site knock-outs and downstream targeted metabolomics experiments. Here we present the next step in the workflow toward using yeast metabolic modeling to predict human metabolic behavior resulting from nsSNV.
For the previously identified candidate proteins, we used the reciprocal best BLAST hits method followed by manual alignment and pathway comparison to identify 6 human proteins with yeast orthologs which were suitable for flux balance analysis (FBA). 5 of these proteins are known to be associated with diseases, including ribose 5-phosphate isomerase deficiency, myopathy with lactic acidosis and sideroblastic anaemia, anemia due to disorders of glutathione metabolism, and two porphyrias, and we suspect the sixth enzyme to have disease associations which are not yet classified or understood based on the work described herein.
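The reciprocal best hits criterion can be sketched as follows. The sketch assumes the best BLAST hit in each direction has already been extracted into a dictionary; the identifiers are hypothetical placeholders, and the manual alignment and pathway comparison described above are not modeled.

```python
def reciprocal_best_hits(best_a_to_b, best_b_to_a):
    """Return sorted (a, b) pairs where a's best hit is b and b's best hit is a."""
    pairs = []
    for a, b in best_a_to_b.items():
        if best_b_to_a.get(b) == a:
            pairs.append((a, b))
    return sorted(pairs)

# Hypothetical identifiers, not real accessions.
human_to_yeast = {"human_1": "yeast_A", "human_2": "yeast_B", "human_3": "yeast_C"}
yeast_to_human = {"yeast_A": "human_1", "yeast_B": "human_2", "yeast_C": "human_9"}

orthologs = reciprocal_best_hits(human_to_yeast, yeast_to_human)
print(orthologs)
```

Here `human_3` is dropped because its best yeast hit points back to a different human protein.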
Preliminary findings using the Yeast 7.0 FBA model show lack of growth for only one enzyme, but augmentation of the Yeast 7.0 biomass function to better simulate knockout of certain genes suggested physiological relevance of variations in three additional proteins. Thus, we suggest the following four proteins for laboratory validation: delta-aminolevulinic acid dehydratase, ferrochelatase, ribose-5-phosphate isomerase and mitochondrial tyrosyl-tRNA synthetase. This study indicates that the predictive ability of this method will improve as more advanced, comprehensive models are developed. Moreover, these findings will be useful in the development of simple downstream biochemical or mass-spectrometric assays to corroborate these predictions and detect the presence of certain known nsSNVs with deleterious outcomes. Results may also be useful in predicting as yet unknown outcomes of active site nsSNVs for enzymes that are not yet well classified or annotated.
This article was reviewed by Daniel Haft and Igor B. Rogozin.
nsSNV; Ortholog; Sequence conservation; FBA; Yeast metabolic modeling
Integrative Next Generation Sequencing (NGS) DNA and RNA analyses have very recently become feasible, and the studies published to date have discovered critical disease-implicated pathways as well as diagnostic and therapeutic targets. A growing number of exomes, genomes and transcriptomes from the same individual are quickly accumulating, providing unique venues for analysis of mechanistic and regulatory features and, at the same time, requiring new exploration strategies. In this study, we have integrated variation and expression information of four NGS datasets from the same individual: normal and tumor breast exomes and transcriptomes. Focusing on SNP-centered variant allelic prevalence, we illustrate analytical algorithms that can be applied to extract or validate potential regulatory elements, such as expression or growth advantage, imprinting, loss of heterozygosity (LOH), somatic changes, and RNA editing. In addition, we point to some critical elements that might bias the output and recommend alternative measures to maximize the confidence of findings. The need for such strategies is especially recognized within the growing appreciation of the concept of systems biology: integrative exploration of genome and transcriptome features reveals mechanistic and regulatory insights that reach far beyond linear addition of the individual datasets.
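As a rough illustration of SNP-centered allelic prevalence, the sketch below computes the variant allele fraction at one position in each of the four datasets and applies a simple LOH-like rule. The read counts, the thresholds (`het_range`, `hom_cutoff`) and the function names are all assumptions for illustration; the study's actual algorithms and cutoffs are not given in this abstract.

```python
def variant_allele_fraction(ref_reads, var_reads):
    """Fraction of reads carrying the variant allele; None if there is no coverage."""
    total = ref_reads + var_reads
    if total == 0:
        return None
    return var_reads / total

def looks_like_loh(normal_vaf, tumor_vaf, het_range=(0.3, 0.7), hom_cutoff=0.9):
    """Heterozygous in normal DNA but (near-)homozygous in tumor DNA (assumed rule)."""
    if normal_vaf is None or tumor_vaf is None:
        return False
    is_het_in_normal = het_range[0] <= normal_vaf <= het_range[1]
    is_hom_in_tumor = tumor_vaf >= hom_cutoff or tumor_vaf <= 1 - hom_cutoff
    return is_het_in_normal and is_hom_in_tumor

# Made-up (reference, variant) read counts at a single SNP position.
counts = {
    "normal_exome": (48, 52),
    "tumor_exome": (5, 95),
    "normal_transcriptome": (40, 60),
    "tumor_transcriptome": (2, 98),
}
vafs = {name: variant_allele_fraction(*rc) for name, rc in counts.items()}
loh_candidate = looks_like_loh(vafs["normal_exome"], vafs["tumor_exome"])
print(vafs, loh_candidate)
```

Comparing the exome and transcriptome fractions of the same sample in the same way would point to expression-level effects rather than DNA-level ones.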
Exome; Transcriptome; Breast Tumor; Breast Cancer; SNP; Allelic Imbalance; Allele Preferential Expression; RNA Editing; Somatic Mutations; Imprinting; LOH
Amino acid changes due to non-synonymous variation are included as annotations for individual proteins in UniProtKB/Swiss-Prot and RefSeq, which present biological data in a protein- or gene-centric fashion. Unfortunately, proteome-wide analysis of non-synonymous single-nucleotide variations (nsSNVs) is not easy to perform because information on nsSNVs and functionally important sites is not well integrated either within or between databases and their search engines. We have developed SNVDis, which allows evaluation of proteome-wide nsSNV distribution in functional sites, domains and pathways. More specifically, we have integrated human-specific data from major variation databases (UniProtKB, dbSNP and COSMIC), comprehensive sequence feature annotation from UniProtKB, Pfam, RefSeq, Conserved Domain Database (CDD) and pathway information from Protein ANalysis THrough Evolutionary Relationships (PANTHER) and mapped all of them in a uniform and comprehensive way to the human reference proteome provided by UniProtKB/Swiss-Prot. Integrated information on active sites, pathways, binding sites and domains, extracted from a number of different sources, provides a detailed overview of how nsSNVs are distributed over the human proteome and pathways and how they intersect with functional sites of proteins. Additionally, it is possible to find out whether there is an over- or under-representation of nsSNVs in specific domains, pathways or user-defined protein lists. The underlying datasets are updated once every three months. SNVDis is freely available at http://hive.biochemistry.gwu.edu/tool/snvdis.
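One common way to test for over-representation is a hypergeometric tail probability. The abstract does not state which statistic SNVDis uses, so the sketch below is an assumption, and the parameter names and example numbers are illustrative only.

```python
from math import comb

def hypergeom_upper_tail(population, successes, draws, observed):
    """P(X >= observed) when drawing `draws` items without replacement from
    `population` items, of which `successes` belong to the category of interest."""
    max_hits = min(successes, draws)
    total = comb(population, draws)
    favourable = sum(
        comb(successes, i) * comb(population - successes, draws - i)
        for i in range(observed, max_hits + 1)
    )
    return favourable / total

# Illustrative numbers: 1000 residues, 100 of them inside a domain,
# 50 nsSNV positions overall, 12 of which fall inside the domain.
p_over = hypergeom_upper_tail(1000, 100, 50, 12)
print(p_over)
```

A small value suggests more nsSNVs in the domain than expected by chance; the lower tail, P(X <= observed), would be used to test for under-representation.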
Active site; Binding site; N-linked glycosylation; nsSNP; nsSNV; Variation
Years of sequence feature curation by UniProtKB/Swiss-Prot, PIR-PSD, NCBI-CDD, RefSeq and other database biocurators have led to a rich repository of information on functional sites of genes and proteins. This information, along with variation-related annotation, can be used to scan human short sequence reads from next-generation sequencing (NGS) pipelines for the presence of non-synonymous single-nucleotide variations (nsSNVs) that affect functional sites. This and similar workflows are becoming more important because thousands of NGS data sets are being made available through projects such as The Cancer Genome Atlas (TCGA), and researchers want to evaluate their biomarkers in genomic data. BioMuta, an integrated sequence feature database, provides a framework for automated and manual curation and integration of cancer-related sequence features so that they can be used in NGS analysis pipelines. Sequence feature information in BioMuta is collected from the Catalogue of Somatic Mutations in Cancer (COSMIC), ClinVar, UniProtKB and through biocuration of information available from publications. Additionally, nsSNVs identified through automated analysis of NGS data from TCGA are included in the database. Because of the petabytes of data and information present in NGS primary repositories, a platform, HIVE (High-performance Integrated Virtual Environment), has been developed for storing, analyzing, computing and curating NGS data and associated metadata. Using HIVE, 31 979 nsSNVs were identified in TCGA-derived NGS data from breast cancer patients. All variations identified through this process are stored in a Curated Short Read archive, and the nsSNVs from the tumor samples are included in BioMuta. Currently, BioMuta has 26 cancer types with 13 896 small-scale and 308 986 large-scale study-derived variations. Integration of variation data allows identification of novel or common nsSNVs that can be prioritized in validation studies.
Database URL: BioMuta: http://hive.biochemistry.gwu.edu/tools/biomuta/index.php; CSR: http://hive.biochemistry.gwu.edu/dna.cgi?cmd=csr; HIVE: http://hive.biochemistry.gwu.edu
The asparagine-X-serine/threonine (NXS/T) motif, where X is any amino acid except proline, is the consensus motif for N-linked glycosylation. Significant numbers of high-resolution crystal structures of glycosylated proteins allow us to carry out structural analysis of the N-linked glycosylation sites (NGS). Our analysis shows that there is enough structural information from diverse glycoproteins to allow development of rules which can be used to predict NGS. A Python-based tool was developed to investigate asparagines implicated in N-glycosylation in five species: Homo sapiens, Mus musculus, Drosophila melanogaster, Arabidopsis thaliana and Saccharomyces cerevisiae. Our analysis shows that 78% of all asparagines of the NXS/T motif involved in N-glycosylation are localized in the loop/turn conformation in the human proteome. A similar distribution was revealed for all the other species examined. Comparative analysis of the occurrence of NXS/T motifs not known to be glycosylated and their reverse sequence (S/TXN) shows a similar distribution across the secondary structural elements, indicating that the NXS/T motif in itself is not biologically relevant. Based on our analysis, we have defined rules to determine NGS. Using machine learning methods based on these rules, we can predict with 93% accuracy whether a particular site will be glycosylated. If structural information is not available, the tool uses structural prediction results, yielding 74% accuracy. The tool was used to identify glycosylation sites in 108 human proteins with structures and 2247 proteins without structures that have acquired NXS/T sites due to non-synonymous variation. The tool, Structure Feature Analysis Tool (SFAT), is freely available to the public at http://hive.biochemistry.gwu.edu/tools/sfat.
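The NXS/T consensus motif itself is easy to locate in a sequence. The sketch below is not SFAT; it only finds candidate sequons with a regular expression, and the function name and example peptide are assumptions for illustration.

```python
import re

# N, then any residue except proline, then S or T. The lookahead keeps
# overlapping sequons (e.g. "NNTS") from being skipped.
SEQUON = re.compile(r"N(?=[^P][ST])")

def find_sequons(sequence):
    """Return 1-based positions of asparagines that start an NXS/T motif."""
    return [match.start() + 1 for match in SEQUON.finditer(sequence)]

# Made-up peptide: N2 (NAS) and N8/N9 (NNT, NTS) match; N5 (NPT) does not.
positions = find_sequons("MNASNPTNNTS")
print(positions)
```

As the text notes, the motif alone is not sufficient; structural context is needed to decide which of these candidates are actually glycosylated.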
N-linked glycosylation; Gain and loss of glycosylation; nsSNP; nsSNV; Variation
Next-generation sequencing (NGS) technologies have resulted in petabytes of scattered data, decentralized in archives, databases and sometimes in isolated hard-disks which are inaccessible for browsing and analysis. It is expected that curated secondary databases will help organize some of this Big Data, thereby allowing users to better navigate, search and compute on it.
To address the above challenge, we have implemented an NGS biocuration workflow and are analyzing short read sequences and associated metadata from cancer patients to better understand the human variome. Curation of variation and other related information from control (normal tissue) and case (tumor) samples will provide comprehensive background information that can be used in genomic medicine research and application studies. Our approach includes a CloudBioLinux Virtual Machine which is used upstream of a High-performance Integrated Virtual Environment (HIVE) that encapsulates the Curated Short Read archive (CSR) and a proteome-wide variation effect analysis tool (SNVDis). As a proof-of-concept, we have curated and analyzed control and case breast cancer datasets from the NCI cancer genomics program, The Cancer Genome Atlas (TCGA). Our efforts include reviewing and recording in CSR the available clinical information on patients, mapping the reads to the reference, identifying non-synonymous Single Nucleotide Variations (nsSNVs), and integrating the data with tools that allow analysis of the effect of nsSNVs on the human proteome. Furthermore, we have also developed a novel phylogenetic analysis algorithm that uses SNV positions and can be used to classify the patient population. The workflow described here lays the foundation for analysis of short read sequence data to identify rare and novel SNVs that are not present in dbSNP and therefore provides a more comprehensive understanding of the human variome. Variation results for single genes as well as the entire study are available from the CSR website (http://hive.biochemistry.gwu.edu/dna.cgi?cmd=csr).
Availability of thousands of sequenced samples from patients provides a rich repository of sequence information that can be utilized to identify individual level SNVs and their effect on the human proteome beyond what the dbSNP database provides.
SRA; TCGA; nsSNV; SNV; SNP; Next-gen; NGS; Phylogenetics; Cancer
The Second Annual Symposium of the Global Cancer Genomics Consortium (GCGC) was held at the Tata Memorial Center in Mumbai, India, from November 19 to 20, 2012. Founded in late 2010, the GCGC aims to provide a platform for highly productive, collaborative efforts on next-generation cancer research through bridging the latest scientific and technology developments with clinical oncology challenges. This year’s presenters brought together highly innovative interdisciplinary views and strategies to meet major challenges in cancer research. The symposium featured 3 major themes: OMICS approaches toward the identification of cancer molecular drivers, single-cell analysis in cancer, and clinical and translational genomics. Each theme was represented in presentations of new findings, with an obvious implication in cross-disciplinary components of OMICS and an overwhelming participation by students. In summary, the GCGC symposium provided a forum for discussion of the latest advances in basic and translational cancer research and offered participants a highly cooperative network environment for future collaboration.
genomics medicine; anticancer target; cancer therapy
The 5th International Biocuration Conference brought together over 300 scientists to exchange ideas on their work and to discuss issues relevant to the International Society for Biocuration’s (ISB) mission. Recurring themes this year included the creation and promotion of gold standards, the need for more ontologies, and more formal interactions with journals. The conference is an essential part of the ISB's goal to support exchanges among members of the biocuration community. Next year's conference will be held in Cambridge, UK, from 7 to 10 April 2013. In the meantime, the ISB website provides information about the society's activities (http://biocurator.org), as well as related events of interest.
N-linked glycosylation is one of the most frequent post-translational modifications of proteins with a profound impact on their biological function. Besides other functions, N-linked glycosylation assists in protein folding, determines protein orientation at the cell surface, or protects proteins from proteases. The N-linked glycans attach to asparagines in the sequence context Asn-X-Ser/Thr, where X is any amino acid except proline. Any variation (e.g. non-synonymous single nucleotide polymorphism or mutation) that abolishes the N-glycosylation sequence motif will lead to the loss of a glycosylation site. On the other hand, variations causing a substitution that creates a new N-glycosylation sequence motif can result in the gain of glycosylation. Although the general importance of glycosylation is well known and acknowledged, the effect of variation on the actual glycoproteome of an organism is still mostly unknown. In this study, we focus on a comprehensive analysis of non-synonymous single nucleotide variations (nsSNV) that lead to either loss or gain of the N-glycosylation motif. We find that 1091 proteins have modified N-glycosylation sequons due to nsSNVs in the genome. Based on analysis of proteins that have a solved 3D structure at the site of variation, we find that 48% of the variations that lead to changes in glycosylation sites occur at the loop and bend regions of the proteins. Pathway and function enrichment analysis show that a significant number of proteins that gained or lost the glycosylation motif are involved in kinase activity, immune response, and blood coagulation. A structure-function analysis of a blood coagulation protein, antithrombin III and a protease, cathepsin D, showcases how a comprehensive study followed by structural analysis can help better understand the functional impact of the nsSNVs.
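The gain or loss of an N-glycosylation sequon caused by a single substitution can be checked by comparing sequon positions before and after the change. This is a minimal sketch, not the pipeline used in the study; the function names and toy peptides are assumptions for illustration.

```python
import re

SEQUON = re.compile(r"N(?=[^P][ST])")  # Asn-X-Ser/Thr with X != Pro

def sequon_positions(sequence):
    """1-based positions of asparagines that start a sequon."""
    return {match.start() + 1 for match in SEQUON.finditer(sequence)}

def sequon_changes(sequence, position, new_residue):
    """Report sequons gained or lost when `position` (1-based) is substituted."""
    mutated = sequence[:position - 1] + new_residue + sequence[position:]
    before = sequon_positions(sequence)
    after = sequon_positions(mutated)
    return {"gained": sorted(after - before), "lost": sorted(before - after)}

# Toy peptides: S5A destroys the sequon at N3; K3N creates one.
loss = sequon_changes("AANASAA", 5, "A")
gain = sequon_changes("AAKASAA", 3, "N")
print(loss, gain)
```

Note that a substitution at the X position matters only when it introduces or removes proline, and a substitution two residues downstream of an asparagine can create a sequon without touching the asparagine itself.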
Motivation: Identifier (ID) mapping establishes links between various biological databases and is an essential first step for molecular data integration and functional annotation. ID mapping allows diverse molecular data on genes and proteins to be combined and mapped to functional pathways and ontologies. We have developed comprehensive protein-centric ID mapping services providing mappings for 90 IDs derived from databases on genes, proteins, pathways, diseases, structures, protein families, protein interaction, literature, ontologies, etc. The services are widely used and have been regularly updated since 2006.
The identification of orthologs (gene pairs descended from a common ancestor through speciation, rather than duplication) has emerged as an essential component of many bioinformatics applications, ranging from the annotation of new genomes to experimental target prioritization. Yet, the development and application of orthology inference methods are hampered by the lack of consensus on source proteomes, file formats and benchmarks. The second ‘Quest for Orthologs’ meeting brought together stakeholders from various communities to address these challenges. We report on achievements and outcomes of this meeting, focusing on topics of particular relevance to the research community at large. The Quest for Orthologs consortium is an open community that welcomes contributions from all researchers interested in orthology research and applications.
The accelerating growth in the number of protein sequences taxes both the computational and manual resources needed to analyze them. One approach to dealing with this problem is to minimize the number of proteins subjected to such analysis in a way that minimizes loss of information. To this end we have developed a set of Representative Proteomes (RPs), each selected from a Representative Proteome Group (RPG) containing similar proteomes calculated based on co-membership in UniRef50 clusters. A Representative Proteome is the proteome that can best represent all the proteomes in its group in terms of the majority of the sequence space and information. RPs at 75%, 55%, 35% and 15% co-membership threshold (CMT) are provided to allow users to decrease or increase the granularity of the sequence space based on their requirements. We find that a CMT of 55% (RP55) most closely follows standard taxonomic classifications. Further analysis of this set reveals that sequence space is reduced by more than 80% relative to UniProtKB, while retaining both sequence diversity (over 95% of InterPro domains) and annotation information (93% of experimentally characterized proteins). All sets can be browsed and are available for sequence similarity searches and download at http://www.proteininformationresource.org/rps, while the set of 637 RPs determined using a 55% CMT are also available for text searches. Potential applications include sequence similarity searches, protein classification and targeted protein annotation and characterization.
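The idea of grouping proteomes by co-membership in UniRef50 clusters can be sketched as below. The abstract does not give the exact formula, so the definition used here (shared clusters as a percentage of the smaller proteome's clusters) is an assumption, as are the function names and cluster identifiers.

```python
def co_membership(clusters_a, clusters_b):
    """Percentage of the smaller proteome's clusters that are shared (assumed definition)."""
    a, b = set(clusters_a), set(clusters_b)
    if not a or not b:
        return 0.0
    shared = len(a & b)
    return 100.0 * shared / min(len(a), len(b))

def same_group(clusters_a, clusters_b, cmt=55.0):
    """True if two proteomes meet the co-membership threshold (CMT)."""
    return co_membership(clusters_a, clusters_b) >= cmt

# Hypothetical cluster identifiers for two proteomes.
proteome_1 = {"c1", "c2", "c3", "c4"}
proteome_2 = {"c3", "c4", "c5"}
score = co_membership(proteome_1, proteome_2)
print(round(score, 1), same_group(proteome_1, proteome_2, cmt=55.0))
```

Raising the threshold splits proteomes into more, tighter groups; lowering it merges them into fewer, broader ones, which matches the granularity control described above.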
Attempts to engage the scientific community to annotate biological data (such as protein/gene function) stored in databases have not been overly successful. There are several hypotheses as to why, but it is not clear which of them are correct. In this study we have surveyed 50 biologists (who have recently published a paper characterizing a gene or protein) to better understand what would make them interested in providing input/contributions to biological databases. Based on our survey two things become clear: a) database managers need to proactively contact biologists to solicit contributions; and b) potential contributors need to be provided with an easy-to-use interface and clear instructions on what to annotate. Other factors such as 'reward' and 'employer/funding agency recognition', previously perceived as motivators, were found to be less important. Based on this study we propose that community annotation projects should devote resources to direct solicitation for input and streamlining of the processes or interfaces used to collect this input.
This article was reviewed by I. King Jordan, Daniel Haft and Yuriy Gusev
The NIAID (National Institute for Allergy and Infectious Diseases) Biodefense Proteomics program aims to identify targets for potential vaccines, therapeutics, and diagnostics for agents of concern in bioterrorism, including bacterial, parasitic, and viral pathogens. The program includes seven Proteomics Research Centers, generating diverse types of pathogen-host data, including mass spectrometry, microarray transcriptional profiles, protein interactions, protein structures and biological reagents. The Biodefense Resource Center (www.proteomicsresource.org) has developed a bioinformatics framework, employing a protein-centric approach to integrate and support mining and analysis of the large and heterogeneous data. Underlying this approach is a data warehouse with comprehensive protein + gene identifier and name mappings and annotations extracted from over 100 molecular databases. Value-added annotations are provided for key proteins from experimental findings using controlled vocabulary. The availability of pathogen and host omics data in an integrated framework allows global analysis of the data and comparisons across different experiments and organisms, as illustrated in several case studies presented here. (1) The identification of a hypothetical protein with differential gene and protein expressions in two host systems (mouse macrophage and human HeLa cells) infected by different bacterial (Bacillus anthracis and Salmonella typhimurium) and viral (orthopox) pathogens suggesting that this protein can be prioritized for additional analysis and functional characterization. (2) The analysis of a vaccinia-human protein interaction network supplemented with protein accumulation levels led to the identification of human Keratin, type II cytoskeletal 4 protein as a potential therapeutic target. 
(3) Comparison of complete genomes from pathogenic variants coupled with experimental information on complete proteomes allowed the identification and prioritization of ten potential diagnostic targets from Bacillus anthracis. The integrative analysis across data sets from multiple centers can reveal potential functional significance and hidden relationships between pathogen and host proteins, thereby providing a systems approach to basic understanding of pathogenicity and target identification.
The Universal Protein Resource (UniProt) provides a central resource on protein sequences and functional annotation with three database components, each addressing a key need in protein bioinformatics. The UniProt Knowledgebase (UniProtKB), comprising the manually annotated UniProtKB/Swiss-Prot section and the automatically annotated UniProtKB/TrEMBL section, is the preeminent storehouse of protein annotation. The extensive cross-references, functional and feature annotations and literature-based evidence attribution enable scientists to analyse proteins and query across databases. The UniProt Reference Clusters (UniRef) speed similarity searches via sequence space compression by merging sequences that are 100% (UniRef100), 90% (UniRef90) or 50% (UniRef50) identical. Finally, the UniProt Archive (UniParc) stores all publicly available protein sequences, containing the history of sequence data with links to the source databases. UniProt databases continue to grow in size and in availability of information. Recent and upcoming changes to database contents, formats, controlled vocabularies and services are described. New download availability includes all major releases of UniProtKB, sequence collections by taxonomic division and complete proteomes. A bibliography mapping service has been added, and an ID mapping service will be available soon. UniProt databases can be accessed online at or downloaded at .
The identification of unique proteins at different taxonomic levels has both scientific and practical value. Strain-, species- and genus-specific proteins can provide insight into the criteria that define an organism and its relationship with close relatives. Such proteins can also serve as taxon-specific diagnostic targets.
A pipeline using a combination of computational and manual analyses of BLAST results was developed to identify strain-, species-, and genus-specific proteins and to catalog the closest sequenced relative for each protein in a proteome. Proteins encoded by a given strain are preliminarily considered to be unique if BLAST, using a comprehensive protein database, fails to retrieve (with an e-value better than 0.001) any protein not encoded by the query strain, species or genus (for strain-, species- and genus-specific proteins respectively), or if BLAST, using the best hit as the query (reverse BLAST), does not retrieve the initial query protein. Results are manually inspected for homology if the initial query is retrieved in the reverse BLAST but is not the best hit. Sequences unlikely to retrieve homologs using the default BLOSUM62 matrix (usually short sequences) are re-tested using the PAM30 matrix, thereby increasing the number of retrieved homologs and increasing the stringency of the search for unique proteins. The above protocol was used to examine several food- and water-borne pathogens. We find that the reverse BLAST step filters out about 22% of proteins with homologs that would otherwise be considered unique at the genus and species levels. Analysis of the annotations of unique proteins reveals that many are remnants of prophage proteins, or may be involved in virulence. The data generated from this study can be accessed and further evaluated from the CUPID (Core and Unique Protein Identification) system web site (updated semi-annually) at .
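The forward and reverse BLAST decision logic described above can be sketched as a small classifier. It assumes BLAST results have already been parsed into simple Python structures and that self-hits are excluded from the reverse search; the function name, data layout and identifiers are hypothetical, and the PAM30 re-test is not modeled.

```python
E_VALUE_CUTOFF = 0.001  # threshold quoted in the text

def classify_protein(query_id, forward_hits, reverse_hits_by_query, taxon_ids,
                     cutoff=E_VALUE_CUTOFF):
    """Classify a protein as 'unique', 'not unique' or 'manual inspection'.

    forward_hits: list of (subject_id, e_value) from BLAST of the query.
    reverse_hits_by_query: subject_id -> list of hit ids, best hit first,
        from BLAST using that subject as the query (reverse BLAST).
    taxon_ids: ids of all proteins in the query strain, species or genus.
    """
    # Significant hits to proteins outside the taxon of interest.
    outside = [(subject, e) for subject, e in forward_hits
               if subject not in taxon_ids and e < cutoff]
    if not outside:
        return "unique"
    best_subject = min(outside, key=lambda hit: hit[1])[0]
    reverse_hits = reverse_hits_by_query.get(best_subject, [])
    if query_id not in reverse_hits:
        return "unique"
    if reverse_hits[0] != query_id:
        return "manual inspection"
    return "not unique"

taxon = {"q1", "q2", "t3"}
result = classify_protein("q1", [("x1", 1e-20)], {"x1": ["z9", "q1"]}, taxon)
print(result)
```

In this example the reverse search retrieves the query but not as the best hit, so the protein is routed to manual inspection.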
CUPID provides a set of proteins specific to a genus, species or a strain, and identifies the most closely related organism.
An increasing number of whole viral and bacterial genomes are being sequenced and deposited in public databases. In parallel to the mounting interest in whole genomes, the number of whole genome analyses software tools is also increasing. GeneOrder was originally developed to provide an analysis of genes between two genomes, allowing visualization of gene order and synteny comparisons of any small genomes. It was originally developed for comparing virus, mitochondrion and chloroplast genomes. This is now extended to small bacterial genomes of sizes less than 2 Mb.
GeneOrder3.0 has been developed and validated successfully on several small bacterial genomes (ca. 580 kb to 1.83 Mb) archived in the NCBI GenBank database. It is an updated web-based "on-the-fly" computational tool allowing gene order and synteny comparisons of any two small bacterial genomes. Analyses of several bacterial genomes show that a large amount of gene and genome re-arrangement occurs, as seen with earlier DNA software tools. This can be displayed at the protein level using GeneOrder3.0. Whole genome alignments of genes are presented in both a table and a dot plot. This allows the detection of evolutionarily more distant relationships, since protein sequences are more conserved than DNA sequences.
GeneOrder3.0 allows researchers to perform comparative analysis of gene order and synteny in genomes of sizes up to 2 Mb "on-the-fly." Availability: and .
We examined functional and evolutionary patterns in the recently constructed set of 5,873 clusters of predicted orthologs from seven eukaryotic genomes. The analysis reveals a conserved core of largely essential eukaryotic genes as well as major diversification and innovation associated with evolution of eukaryotic genomes.
Sequencing the genomes of multiple, taxonomically diverse eukaryotes enables in-depth comparative-genomic analysis which is expected to help in reconstructing ancestral eukaryotic genomes and major events in eukaryotic evolution and in making functional predictions for currently uncharacterized conserved genes.
We examined functional and evolutionary patterns in the recently constructed set of 5,873 clusters of predicted orthologs (eukaryotic orthologous groups or KOGs) from seven eukaryotic genomes: Caenorhabditis elegans, Drosophila melanogaster, Homo sapiens, Arabidopsis thaliana, Saccharomyces cerevisiae, Schizosaccharomyces pombe and Encephalitozoon cuniculi. Conservation of KOGs through the phyletic range of eukaryotes strongly correlates with their functions and with the effect of gene knockout on the organism's viability. The approximately 40% of KOGs that are represented in six or seven species are enriched in proteins responsible for housekeeping functions, particularly translation and RNA processing. These conserved KOGs are often essential for survival and might approximate the minimal set of essential eukaryotic genes. The 131 single-member, pan-eukaryotic KOGs we identified were examined in detail. For around 20 that remained uncharacterized, functions were predicted by in-depth sequence analysis and examination of genomic context. Nearly all these proteins are subunits of known or predicted multiprotein complexes, in agreement with the balance hypothesis of evolution of gene copy number. Other KOGs show a variety of phyletic patterns, which points to major contributions of lineage-specific gene loss and the 'invention' of genes new to eukaryotic evolution. Examination of the sets of KOGs lost in individual lineages reveals co-elimination of functionally connected genes. Parsimonious scenarios of eukaryotic genome evolution and gene sets for ancestral eukaryotic forms were reconstructed. The gene set of the last common ancestor of the crown group consists of 3,413 KOGs and largely includes proteins involved in genome replication and expression, and central metabolism. 
Only 44% of the KOGs, mostly from the reconstructed gene set of the last common ancestor of the crown group, have detectable homologs in prokaryotes; the remainder apparently evolved via duplication with divergence and invention of new genes.
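A phyletic pattern is simply the presence or absence of a KOG in each of the seven genomes. The sketch below shows how such patterns, and the fraction of KOGs represented in six or seven species, could be computed. The species codes, KOG names and memberships are made up for illustration and do not reproduce the study's data.

```python
# Hypothetical three-letter codes for the seven genomes, in a fixed order.
SPECIES = ("cel", "dme", "hsa", "ath", "sce", "spo", "ecu")

def phyletic_pattern(members):
    """Encode presence (1) or absence (0) of a KOG in each species."""
    return "".join("1" if species in members else "0" for species in SPECIES)

def fraction_widely_conserved(kogs, min_species=6):
    """Fraction of KOGs represented in at least `min_species` genomes."""
    if not kogs:
        return 0.0
    conserved = sum(
        1 for members in kogs.values()
        if len(set(members) & set(SPECIES)) >= min_species
    )
    return conserved / len(kogs)

# Toy KOG memberships (species in which each KOG has at least one protein).
kogs = {
    "kog_a": set(SPECIES),
    "kog_b": {"cel", "dme", "hsa"},
    "kog_c": set(SPECIES) - {"ecu"},
    "kog_d": {"sce", "spo"},
    "kog_e": {"ath"},
}
patterns = {name: phyletic_pattern(members) for name, members in kogs.items()}
conserved_fraction = fraction_widely_conserved(kogs)
print(patterns, conserved_fraction)
```

Patterns such as `1110000` (animals only) or `0000110` (fungi only) are the kind of lineage-specific signal that points to gene loss or invention.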
The KOG analysis reveals a conserved core of largely essential eukaryotic genes as well as major diversification and innovation associated with evolution of eukaryotic genomes. The results provide quantitative support for major trends of eukaryotic evolution noticed previously at the qualitative level and a basis for detailed reconstruction of evolution of eukaryotic genomes and biology of ancestral forms.
The Protein Information Resource (PIR) is an integrated public resource of protein informatics. To facilitate the sensible propagation and standardization of protein annotation and the systematic detection of annotation errors, PIR has extended its superfamily concept and developed the SuperFamily (PIRSF) classification system. Based on the evolutionary relationships of whole proteins, this classification system allows annotation of both specific biological and generic biochemical functions. The system adopts a network structure for protein classification from superfamily to subfamily levels. Protein family members are homologous (sharing common ancestry) and homeomorphic (sharing full-length sequence similarity with common domain architecture). The PIRSF database consists of two data sets, preliminary clusters and curated families. The curated families include family name, protein membership, parent–child relationship, domain architecture, and optional description and bibliography. PIRSF is accessible from the website at http://pir.georgetown.edu/pirsf/ for report retrieval and sequence classification. The report presents family annotation, membership statistics, cross-references to other databases, graphical display of domain architecture, and links to multiple sequence alignments and phylogenetic trees for curated families. PIRSF can be utilized to analyze phylogenetic profiles, to reveal functional convergence and divergence, and to identify interesting relationships between homeomorphic families, domains and structural classes.
The availability of multiple, essentially complete genome sequences of prokaryotes and eukaryotes spurred both the demand and the opportunity for the construction of an evolutionary classification of genes from these genomes. Such a classification system based on orthologous relationships between genes appears to be a natural framework for comparative genomics and should facilitate both functional annotation of genomes and large-scale evolutionary studies.
We describe here a major update of the previously developed system for delineation of Clusters of Orthologous Groups of proteins (COGs) from the sequenced genomes of prokaryotes and unicellular eukaryotes and the construction of clusters of predicted orthologs for 7 eukaryotic genomes, which we named KOGs after eukaryotic orthologous groups. The COG collection currently consists of 138,458 proteins, which form 4873 COGs and comprise 75% of the 185,505 (predicted) proteins encoded in 66 genomes of unicellular organisms. The eukaryotic orthologous groups (KOGs) include proteins from 7 eukaryotic genomes: three animals (the nematode Caenorhabditis elegans, the fruit fly Drosophila melanogaster and Homo sapiens), one plant, Arabidopsis thaliana, two fungi (Saccharomyces cerevisiae and Schizosaccharomyces pombe), and the intracellular microsporidian parasite Encephalitozoon cuniculi. The current KOG set consists of 4852 clusters of orthologs, which include 59,838 proteins, or ~54% of the 110,655 analyzed eukaryotic gene products. Compared to the coverage of the prokaryotic genomes with COGs, a considerably smaller fraction of eukaryotic genes could be included in the KOGs; addition of new eukaryotic genomes is expected to result in a substantial increase in the coverage of eukaryotic genomes with KOGs. Examination of the phyletic patterns of KOGs reveals a conserved core represented in all analyzed species and consisting of ~20% of the KOG set. This conserved portion of the KOG set is much greater than the ubiquitous portion of the COG set (~1% of the COGs). In part, this difference is probably due to the small number of included eukaryotic genomes, but it could also reflect the relative compactness of eukaryotes as a clade and the greater evolutionary stability of eukaryotic genomes.
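The coverage figures and the notion of a phyletic pattern can be made concrete with a short sketch. The protein counts are taken from the text above; the helper names (`coverage`, `phyletic_pattern`, `conserved_core`) and the toy cluster IDs are hypothetical and are not part of the COG/KOG software.

```python
from __future__ import annotations

# Protein counts quoted in the text above.
COG_PROTEINS, COG_GENOME_PROTEINS = 138_458, 185_505
KOG_PROTEINS, KOG_GENOME_PROTEINS = 59_838, 110_655

# The seven eukaryotic species named in the text, in a fixed order.
SPECIES = ("C. elegans", "D. melanogaster", "H. sapiens", "A. thaliana",
           "S. cerevisiae", "S. pombe", "E. cuniculi")


def coverage(in_clusters: int, total: int) -> float:
    """Percentage of encoded proteins that belong to some cluster."""
    return 100.0 * in_clusters / total


def phyletic_pattern(species_present: set[str]) -> str:
    """Encode presence/absence across SPECIES as a string of 1s and 0s."""
    return "".join("1" if s in species_present else "0" for s in SPECIES)


def conserved_core(clusters: dict[str, set[str]]) -> list[str]:
    """Return IDs of clusters represented in every analyzed species."""
    everywhere = "1" * len(SPECIES)
    return sorted(cid for cid, present in clusters.items()
                  if phyletic_pattern(present) == everywhere)


# Toy clusters with made-up IDs, for illustration only.
toy_clusters = {
    "toy1": set(SPECIES),                        # present in all 7 species
    "toy2": {"H. sapiens", "D. melanogaster"},   # animal-specific
    "toy3": set(SPECIES) - {"E. cuniculi"},      # lost in the parasite
}

cog_coverage = coverage(COG_PROTEINS, COG_GENOME_PROTEINS)
kog_coverage = coverage(KOG_PROTEINS, KOG_GENOME_PROTEINS)
core_ids = conserved_core(toy_clusters)
```

Rounded to whole percentages, the two coverage values reproduce the 75% and ~54% quoted above; the conserved core of the toy data contains only the cluster present in all seven species.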
The updated collection of orthologous protein sets for prokaryotes and eukaryotes is expected to be a useful platform for functional annotation of newly sequenced genomes, including those of complex eukaryotes, and genome-wide evolutionary studies.
Three-dimensional structures are now known within most protein families and it is likely, when searching a sequence database, that one will identify a homolog of known structure. The goal of Entrez's 3D-structure database is to make structure information and the functional annotation it can provide easily accessible to molecular biologists. To this end, Entrez's search engine provides several powerful features: (i) links between databases, for example between a protein's sequence and structure; (ii) pre-computed sequence and structure neighbors; and (iii) structure and sequence/structure alignment visualization. Here, we focus on a new feature of Entrez's Molecular Modeling Database (MMDB): graphical summaries of the biological annotation available for each 3D structure, based on the results of automated comparative analysis. MMDB is available at: http://www.ncbi.nlm.nih.gov/Entrez/structure.html.
The Conserved Domain Database (CDD) is now indexed as a separate database within the Entrez system and linked to other Entrez databases such as MEDLINE®. This allows users to search for domain types by name, for example, or to view the domain architecture of any protein in Entrez's sequence database. CDD can be accessed on the World Wide Web at http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=cdd. Users may also employ the CD-Search service to identify conserved domains in new sequences, at http://www.ncbi.nlm.nih.gov/Structure/cdd/wrpsb.cgi. CD-Search results, and pre-computed links from Entrez's protein database, are calculated using the RPS-BLAST algorithm and Position Specific Score Matrices (PSSMs) derived from CDD alignments. CD-Searches are also run by default for protein–protein queries submitted to BLAST® at http://www.ncbi.nlm.nih.gov/BLAST.
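To illustrate what scoring a sequence against a Position Specific Score Matrix involves, the sketch below slides a tiny PSSM along a protein sequence and reports the best ungapped window. This is not the RPS-BLAST algorithm, which is considerably more elaborate; the matrix values, the default score for unlisted residues and the function name `best_pssm_hit` are hypothetical.

```python
from __future__ import annotations

# Toy PSSM: one dict per alignment column, mapping residue -> score.
# The scores are invented for illustration, not derived from a CDD alignment.
TOY_PSSM = [
    {"G": 5, "A": 1},
    {"K": 4, "R": 2},
    {"S": 3, "T": 3},
]


def best_pssm_hit(sequence: str, pssm: list[dict[str, int]],
                  default: int = -1) -> tuple[int, int] | None:
    """Return (start, score) of the best ungapped window, or None.

    Each window is scored by summing, column by column, the score the PSSM
    assigns to the residue at that position; residues not listed in a
    column receive `default`. Ties keep the leftmost window.
    """
    width = len(pssm)
    best = None
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        score = sum(col.get(res, default) for col, res in zip(pssm, window))
        if best is None or score > best[1]:
            best = (start, score)
    return best


hit = best_pssm_hit("MAGKSL", TOY_PSSM)
```

Unlike a single substitution matrix, the score for a given residue depends on the column it is aligned to, which is what lets a PSSM capture position-specific conservation within a domain family.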
CDD mirrors the publicly available domain alignment collections SMART and PFAM, and now also contains alignment models curated at NCBI. Structure information is used to identify the core substructure likely to be present in all family members, and to produce sequence alignments consistent with structure conservation. This alignment model allows NCBI curators to annotate ‘columns’ corresponding to functional sites conserved among family members.