With the advancement of high-throughput sequencing technologies, there has been an increase in the number of genome sequencing projects worldwide, which has yielded complete genome sequences of humans, animals and plants. Subsequently, several labs have focused on genome annotation, which consists of assigning functions to gene products, mostly using Gene Ontology (GO) terms. As a consequence, there is increased heterogeneity in annotations across genomes, due both to the different approaches used by different pipelines to infer these annotations and to the nature of the GO structure itself. This makes a curator's task difficult, even if they adhere to the established guidelines for assessing these protein annotations. Here we develop a genome-scale approach for integrating GO annotations from different pipelines using semantic similarity measures. We used this approach to identify inconsistencies and similarities in functional annotations between orthologs of human and Drosophila melanogaster, to assess the quality of GO annotations derived from InterPro2GO mappings compared to manually assigned GO annotations for the Drosophila melanogaster proteome (from a FlyBase dataset) and the human proteome, and to filter GO annotation data for these proteomes. Results indicate that an efficient integration of GO annotations eliminates redundancy of up to 27.08% and 22.32% in the Drosophila melanogaster and human GO annotation datasets, respectively. Furthermore, we identified missing annotations for some orthologs, as well as annotation mismatches between the InterPro2GO and manual pipelines in these two proteomes, all of which require further curation. This approach simplifies and facilitates the tasks of curators in assessing protein annotations, reduces redundancy and eliminates inconsistencies in large annotation datasets for ease of comparative functional genomics.
functional annotation; Gene Ontology annotation; annotation pipeline; manual annotation; electronic annotation
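As a rough illustration of what redundancy in a GO annotation set can look like, the sketch below drops any term that is an ancestor of another term annotated to the same protein, since the more specific term already implies it. This is only one simple notion of redundancy and is not the semantic-similarity-based procedure of the study; the toy DAG, the term identifiers and the function names are all hypothetical.

```python
# Toy child -> parents map; term identifiers are made up for illustration.
PARENTS = {
    "GO:root": set(),
    "GO:A": {"GO:root"},
    "GO:B": {"GO:root"},
    "GO:A1": {"GO:A"},
    "GO:AB": {"GO:A", "GO:B"},
}


def ancestors(term, parents=PARENTS):
    """Return all proper ancestors of a term in the DAG."""
    seen = set()
    stack = list(parents.get(term, ()))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, ()))
    return seen


def remove_redundant(terms, parents=PARENTS):
    """Keep only terms that are not ancestors of another annotated term."""
    terms = set(terms)
    implied = set()
    for term in terms:
        implied |= ancestors(term, parents)
    return terms - implied


def redundancy_percent(terms, parents=PARENTS):
    """Percentage of a protein's annotations removed by the filter."""
    terms = set(terms)
    if not terms:
        return 0.0
    kept = remove_redundant(terms, parents)
    return 100.0 * (len(terms) - len(kept)) / len(terms)
```

Applied per protein and summed over a proteome, a filter of this kind gives one way to express redundancy as a percentage of the annotation dataset.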
“Scientific community” refers to a group of people who collaborate on activities related to scientific research and who share common goals, interests, and values. Such communities play a key role in many bioinformatics activities. Communities may be linked to a specific location or institute, or involve people working at many different institutions and locations. Education and training is typically an important component of these communities, providing a valuable context in which to develop skills and expertise, while also strengthening links and relationships within the community. Scientific communities facilitate: (i) the exchange and development of ideas and expertise; (ii) career development; (iii) coordinated funding activities; (iv) interactions and engagement with professionals from other fields; and (v) other activities beneficial to individual participants, communities, and the scientific field as a whole. It is thus beneficial at many different levels to understand the general features of successful, high-impact bioinformatics communities; how individual participants can contribute to the success of these communities; and the role of education and training within these communities. We present here a quick guide to building and maintaining a successful, high-impact bioinformatics community, along with an overview of the general benefits of participating in such communities. This article grew out of contributions made by organizers, presenters, panelists, and other participants of the ISMB/ECCB 2013 workshop “The ‘How To Guide’ for Establishing a Successful Bioinformatics Network” at the 21st Annual International Conference on Intelligent Systems for Molecular Biology (ISMB) and the 12th European Conference on Computational Biology (ECCB).
The current increase in Gene Ontology (GO) annotations of proteins in the existing genome databases and their use in different analyses have fostered the improvement of several biomedical and biological applications. To integrate this functional data into different analyses, several protein functional similarity measures based on GO term information content (IC) have been proposed and evaluated, especially in the context of annotation-based measures. In the case of topology-based measures, each approach was set with a specific functional similarity measure depending on its conception and applications for which it was designed. However, it is not clear whether a specific functional similarity measure associated with a given approach is the most appropriate, given a biological data set or an application, i.e., achieving the best performance compared to other functional similarity measures for the biological application under consideration. We show that, in general, a specific functional similarity measure often used with a given term IC or term semantic similarity approach is not always the best for different biological data and applications. We have conducted a performance evaluation of a number of different functional similarity measures using different types of biological data in order to infer the best functional similarity measure for each different term IC and semantic similarity approach. The comparisons of different protein functional similarity measures should help researchers choose the most appropriate measure for the biological application under consideration.
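To make the distinction between term-level and protein-level similarity concrete, the sketch below combines pairwise GO term similarity scores into a protein functional similarity score in three ways that are common in the GO literature: average (Avg), maximum (Max) and best-match average (BMA). The abstract does not list which measures were compared, so this selection is an assumption, and the toy term similarity table and all function names are hypothetical. Both annotation sets are assumed to be non-empty.

```python
# Hypothetical, symmetric term-term similarity scores in [0, 1].
TOY_SIM = {("a", "c"): 0.8, ("b", "c"): 0.2}


def toy_term_sim(s, t):
    """Look up a toy term similarity; identical terms score 1.0."""
    if s == t:
        return 1.0
    return TOY_SIM.get((s, t), TOY_SIM.get((t, s), 0.0))


def pairwise_scores(terms_p, terms_q, term_sim):
    """Matrix of term similarities: rows follow terms_p, columns terms_q."""
    return [[term_sim(s, t) for t in terms_q] for s in terms_p]


def sim_avg(terms_p, terms_q, term_sim):
    """Average over all term pairs."""
    flat = [x for row in pairwise_scores(terms_p, terms_q, term_sim) for x in row]
    return sum(flat) / len(flat)


def sim_max(terms_p, terms_q, term_sim):
    """Best single term pair."""
    return max(x for row in pairwise_scores(terms_p, terms_q, term_sim) for x in row)


def sim_bma(terms_p, terms_q, term_sim):
    """Best-match average: mean of the two directional best-match means."""
    scores = pairwise_scores(terms_p, terms_q, term_sim)
    row_best = [max(row) for row in scores]
    col_best = [max(col) for col in zip(*scores)]
    return 0.5 * (sum(row_best) / len(row_best) + sum(col_best) / len(col_best))
```

Because `term_sim` is passed in as a function, the same three combination schemes can be paired with any term IC or term semantic similarity approach, which is the kind of pairing the evaluation compares.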
Several approaches have been proposed for computing
term information content (IC) and semantic similarity scores
within the gene ontology (GO) directed acyclic graph (DAG).
These approaches contributed to improving protein analyses at
the functional level. Considering the recent proliferation of these
approaches, a unified theory in a well-defined mathematical
framework is necessary in order to provide a theoretical basis
for validating these approaches. We review the existing IC-based
ontological similarity approaches developed in the context
of biomedical and bioinformatics fields to propose a general
framework and unified description of all these measures. We
have conducted an experimental evaluation to assess the impact
of IC approaches, different normalization models, and correction
factors on the performance of a functional similarity metric.
Results reveal that considering only parents or only children of
terms when assessing information content or semantic similarity
scores negatively impacts the approach under consideration.
This study produces a unified framework for current and future
GO semantic similarity measures and provides theoretical basics
for comparing different approaches. The experimental evaluation
of different approaches based on different term information
content models paves the way towards a solution to the issue of scoring a term's specificity in the GO DAG.
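For readers unfamiliar with IC-based measures, the sketch below shows one widely used annotation-based model: the IC of a term is the negative logarithm of the fraction of annotations that fall on the term or any of its descendants, the Resnik similarity of two terms is the IC of their most informative common ancestor, and the Lin score is one possible normalization of it. The toy DAG and the annotation counts are hypothetical, and this is a minimal illustration of these standard definitions rather than the unified framework proposed in the study.

```python
import math

# Toy child -> parents map and hypothetical direct annotation counts.
PARENTS = {"root": [], "A": ["root"], "B": ["root"], "A1": ["A"], "A2": ["A"]}
DIRECT_COUNTS = {"root": 0, "A": 2, "B": 4, "A1": 1, "A2": 1}


def ancestors_or_self(term):
    """Return the term together with all of its ancestors."""
    seen = {term}
    stack = [term]
    while stack:
        for parent in PARENTS[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def information_content():
    """IC(t) = log(N / n_t), i.e. -log of the propagated annotation frequency."""
    totals = dict.fromkeys(PARENTS, 0)
    for term, count in DIRECT_COUNTS.items():
        for ancestor in ancestors_or_self(term):
            totals[ancestor] += count  # propagate counts up the DAG
    corpus_size = totals["root"]
    return {t: math.log(corpus_size / n) for t, n in totals.items() if n > 0}


IC = information_content()


def resnik(a, b):
    """IC of the most informative common ancestor of a and b."""
    common = ancestors_or_self(a) & ancestors_or_self(b)
    return max(IC[t] for t in common)


def lin(a, b):
    """Resnik score normalized by the IC of the two terms."""
    denominator = IC[a] + IC[b]
    return 2.0 * resnik(a, b) / denominator if denominator > 0 else 0.0
```

Note that this model looks only at annotation frequencies; topology-based models instead derive a term's specificity from its position in the DAG.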
Infectious diseases are the leading cause of death, particularly in developing countries. Although many drugs are available for treating the most common infectious diseases, in many cases the mechanism of action of these drugs or even their targets in the pathogen remain unknown. In addition, the key factors or processes in pathogens that facilitate infection and disease progression are often not well understood. Since proteins do not work in isolation, understanding biological systems requires a better understanding of the interconnectivity between proteins in different pathways and processes, which includes both physical and other functional interactions. Such biological networks can be generated within organisms or between organisms sharing a common environment using experimental data and computational predictions. Though different data sources provide different levels of accuracy, confidence in interactions can be measured using interaction scores. Connections between interacting proteins in biological networks can be represented as edges between nodes in a graph, and thus studied using existing algorithms and tools from graph theory. There are many different applications of biological networks, and here we discuss three such applications, specifically applied to the infectious disease tuberculosis, with its causative agent Mycobacterium tuberculosis and host, Homo sapiens. The applications include the use of the networks for function prediction, comparison of networks for evolutionary studies, and the generation and use of host–pathogen interaction networks.
Biological networks; Tuberculosis; Pathogen; Evolution; Protein–protein interaction
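As a minimal sketch of network-based function prediction, the code below ranks candidate functions for an uncharacterized protein by summing the interaction scores of its annotated neighbours, a simple "guilt by association" scheme. The abstract does not specify which prediction method is discussed, so this scheme is an assumption, and the protein names, scores and function labels are hypothetical.

```python
from collections import defaultdict

# Hypothetical scored interactions: (protein, protein, confidence score).
EDGES = [("X", "P1", 0.9), ("X", "P2", 0.6), ("P3", "X", 0.5), ("P1", "P2", 0.99)]

# Hypothetical known functions of already characterized proteins.
KNOWN_FUNCTIONS = {"P1": {"transport"}, "P2": {"kinase"}, "P3": {"kinase"}}


def predict_function(protein, edges, known_functions):
    """Rank functions by the summed scores of neighbours that carry them."""
    votes = defaultdict(float)
    for a, b, score in edges:
        if protein == a:
            neighbour = b
        elif protein == b:
            neighbour = a
        else:
            continue  # edge does not involve the query protein
        for function in known_functions.get(neighbour, ()):
            votes[function] += score
    # Highest total score first; ties broken alphabetically.
    return sorted(votes.items(), key=lambda item: (-item[1], item[0]))
```

Weighting each vote by the interaction score lets a few high-confidence interactions outweigh many low-confidence ones.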
The wide coverage and biological relevance of the Gene Ontology (GO), confirmed through its successful use in protein function prediction, have led to the growth in its popularity. In order to exploit the extent of biological knowledge that GO offers in describing genes or groups of genes, there is a need for an efficient, scalable similarity measure for GO terms and GO-annotated proteins. While several GO similarity measures exist, none adequately addresses all issues surrounding the design and usage of the ontology. We introduce a new metric for measuring the distance between two GO terms using the intrinsic topology of the GO-DAG, thus enabling the measurement of functional similarities between proteins based on their GO annotations. We assess the performance of this metric using a ROC analysis on human protein-protein interaction datasets and correlation coefficient analysis on the selected set of protein pairs from the CESSM online tool. This metric achieves good performance compared to the existing annotation-based GO measures. We used this new metric to assess functional similarity between orthologues, and show that it is effective at determining whether orthologues are annotated with similar functions and identifying cases where annotation is inconsistent between orthologues.
Interaction between proteins is one of the most important mechanisms in the execution of cellular functions. The study of these interactions has provided insight into the functioning of an organism’s processes. As of October 2013, Homo sapiens had over 170,000 protein-protein interactions (PPIs) registered in the Interologous Interaction Database, which is only one of the many public resources where protein interactions can be accessed. These numbers exemplify the volume of data that research on the topic has generated. Visualization of large data sets is a well-known strategy for making sense of information, and protein interaction data is no exception. There are several tools that allow the exploration of this data, providing different methods to visualize protein interaction networks. However, there is still no native web tool that allows this data to be explored interactively online.
Given the recent advances in web technologies, it is time to bring these interactive views to the web to provide an easily accessible forum for visualizing PPIs. We have created a Web-based Protein Interaction Network Visualizer: PINV, an open source, native web application that facilitates the visualization of protein interactions (http://biosual.cbio.uct.ac.za/pinv.html). We developed PINV as a set of components that follow the protocol defined in BioJS and use the D3 library to create the graphic layouts. We demonstrate the use of PINV with multi-organism interaction networks for a predicted target from Mycobacterium tuberculosis, its interacting partners and its orthologs.
The resultant tool provides an attractive view of complex, fully interactive networks with components that allow the querying, filtering and manipulation of the visible subset. Moreover, as a web resource, PINV simplifies sharing and publishing, activities which are vital in today’s collaborative research environments. The source code is freely available for download at https://github.com/4ndr01d3/biosual.
Visualization; Protein-Protein Interactions; PPI; Web development
Technological developments in large-scale biological experiments, coupled with bioinformatics tools, have opened the doors to computational approaches for the global analysis of whole genomes. This has provided the opportunity to look at genes within their context in the cell. The integration of vast
amounts of data generated by these technologies provides a strategy for identifying potential drug targets
within microbial pathogens, the causative agents of infectious diseases. As proteins are druggable targets,
functional interaction networks between proteins are used to identify proteins essential to the survival,
growth, and virulence of these microbial pathogens. Here we have integrated functional genomics data to
generate functional interaction networks between Mycobacterium tuberculosis proteins and carried out computational analyses to dissect the functional interaction network produced for identifying drug targets
using network topological properties. This study has provided the opportunity to expand the range of potential drug targets and to move towards optimal target-based strategies.
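To illustrate what a network topological property looks like in practice, the sketch below builds an undirected network from scored interactions, discards low-confidence edges, and ranks proteins by degree so that highly connected "hub" proteins stand out. The abstract does not say which topological properties were used, so degree is an assumed example, and the protein names, scores and threshold are hypothetical.

```python
from collections import defaultdict

# Hypothetical scored functional interactions: (protein, protein, score).
EDGES = [
    ("P1", "P2", 0.9),
    ("P1", "P3", 0.8),
    ("P1", "P4", 0.75),
    ("P2", "P3", 0.4),
    ("P4", "P5", 0.95),
]


def build_network(edges, min_score=0.0):
    """Undirected adjacency sets, keeping edges with score >= min_score."""
    adjacency = defaultdict(set)
    for a, b, score in edges:
        if score >= min_score and a != b:
            adjacency[a].add(b)
            adjacency[b].add(a)
    return dict(adjacency)


def degree(network):
    """Number of interaction partners of each protein."""
    return {node: len(neighbours) for node, neighbours in network.items()}


def hubs(network, top_n=1):
    """Proteins with the highest degree; ties broken alphabetically."""
    deg = degree(network)
    return sorted(deg, key=lambda node: (-deg[node], node))[:top_n]
```

Other properties, such as betweenness or closeness centrality, could be substituted for degree without changing the overall shape of the analysis.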
The use of Gene Ontology (GO) data in protein analyses has largely contributed to
the improved outcomes of these analyses. Several GO semantic similarity measures
have been proposed in recent years and provide tools that allow the integration of
biological knowledge embedded in the GO structure into different biological
analyses. There is a need for a unified tool that provides the scientific
community with the opportunity to explore these different GO similarity measure
approaches and their biological applications.
We have developed DaGO-Fun, an online tool available at
http://web.cbio.uct.ac.za/ITGOM, which incorporates many different
GO similarity measures for exploring, analyzing and comparing GO terms and
proteins within the context of GO. It uses GO data and UniProt proteins with their
GO annotations as provided by the Gene Ontology Annotation (GOA) project to
precompute GO term information content (IC), enabling rapid response to user
The DaGO-Fun online tool presents the advantage of integrating all the relevant
IC-based GO similarity measures, including topology- and annotation-based
approaches to facilitate effective exploration of these measures, thus enabling
users to choose the most relevant approach for their application. Furthermore,
this tool includes several biological applications related to GO semantic
similarity scores, including the retrieval of genes based on their GO annotations,
the clustering of functionally related genes within a set, and term enrichment
Admixed populations can make an important contribution to the discovery of disease susceptibility genes if the parental populations exhibit substantial variation in susceptibility. Admixture mapping has been used successfully, but is not designed to cope with populations that have more than two or three ancestral populations. The inference of admixture proportions and local ancestry and the imputation of missing genotypes in admixed populations are crucial in both understanding variation in disease and identifying novel disease loci. These inferences make use of reference populations, and accuracy depends on the choice of ancestral populations. Using an insufficient or inaccurate ancestral panel can result in erroneously inferred ancestry and affect the detection power of GWAS and meta-analysis when using imputation. Current algorithms are inadequate for multi-way admixed populations. To address these challenges we developed PROXYANC, an approach to select the best proxy ancestral populations. From the simulation of a multi-way admixed population we demonstrate the capability and accuracy of PROXYANC and illustrate the importance of the choice of ancestry in both estimating admixture proportions and imputing missing genotypes. We applied this approach to a complex, uniquely admixed South African population. Using genome-wide SNP data from over 764 individuals, we accurately estimate the genetic contributions from the best ancestral populations: isiXhosa, ‡Khomani SAN, European, Indian, and Chinese. We also demonstrate that the ancestral allele frequency differences correlate with increased linkage disequilibrium in the South African population, which originates from admixture events rather than population bottlenecks.
The collective term for people of mixed ancestry in southern Africa is “Coloured,” and this is officially recognized in South Africa as a census term, and for self-classification. Whilst we acknowledge that some cultures may use this term in a derogatory manner, these connotations are not present in South Africa, and are certainly not intended here.
The outcome of infection by Mycobacterium tuberculosis (Mtb) depends greatly on how the host responds to the bacteria and how the bacteria manipulate the host, which is facilitated by protein–protein interactions. Thus, to understand this process, there is a need to elucidate protein interactions between human and Mtb, which may enable us to characterize specific molecular mechanisms allowing the bacteria to persist and survive under different environmental conditions. In this work, we used the interologs method based on experimentally verified intra-species and inter-species interactions to predict human-Mtb functional interactions. These interactions were further filtered using known human-Mtb interactions and genes that are differentially expressed during infection, producing 190 interactions. Further analysis of the subcellular location of proteins involved in these human-Mtb interactions confirms the feasibility of these interactions. We also conducted functional analysis of human and Mtb proteins involved in these interactions, checking whether these proteins play a role in infection and/or disease, and testing for enrichment of Mtb proteins in a previously predicted list of drug targets. We found that the biological processes of the human interacting proteins suggested their involvement in apoptosis and production of nitric oxide, whereas those of the Mtb interacting proteins were relevant to the intracellular environment of Mtb in the host. Mapping these proteins onto KEGG pathways highlighted proteins belonging to the tuberculosis pathway and also suggested that Mtb proteins might use the host to acquire nutrients, which is in agreement with the intracellular lifestyle of Mtb. This indicates that these interactions can shed light on the interplay between Mtb and its human host and thus contribute to the process of designing novel drugs with new biological mechanisms of action.
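The core idea of the interologs method can be sketched as follows: if two proteins are known to interact, and one has a human ortholog while the other has an Mtb ortholog, the pair of orthologs is predicted to interact. The code below is a minimal illustration under that assumption; the protein identifiers and ortholog maps are hypothetical, and the filtering steps described in the abstract (known human-Mtb interactions, differential expression) are not reproduced.

```python
# Hypothetical experimentally verified interactions in source organisms.
KNOWN_INTERACTIONS = [("s1", "s2"), ("s3", "s4")]

# Hypothetical ortholog maps: source protein -> orthologs in the target species.
HUMAN_ORTHOLOGS = {"s1": {"H1"}, "s4": {"H2"}}
MTB_ORTHOLOGS = {"s2": {"M1", "M2"}, "s3": {"M3"}}


def predict_interologs(known_interactions, human_orthologs, mtb_orthologs):
    """Transfer known interactions onto (human, Mtb) ortholog pairs."""
    predicted = set()
    for a, b in known_interactions:
        # Interactions are undirected, so try both orientations.
        for first, second in ((a, b), (b, a)):
            for human in human_orthologs.get(first, ()):
                for mtb in mtb_orthologs.get(second, ()):
                    predicted.add((human, mtb))
    return predicted
```

A source protein with several orthologs yields several predicted pairs, which is one reason downstream filtering of the predictions is useful.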
The abundance of diverse biological data from various sources constitutes a rich source of knowledge, which has the power to advance our understanding of organisms. This requires computational methods in order to integrate and exploit these data effectively and elucidate local and genome-wide functional connections between protein pairs, thus enabling functional inferences for uncharacterized proteins. These biological data are primarily in the form of sequences, which determine functions, although functional properties of a protein can often be predicted from just the domains it contains. Thus, protein sequences and domains can be used to predict protein pair-wise functional relationships, and thus contribute to the function prediction process of uncharacterized proteins in order to ensure that knowledge is gained from sequencing efforts. In this work, we introduce information-theoretic approaches to score protein-protein functional interaction pairs predicted from protein sequence similarity and conserved protein signature matches. The proposed schemes are effective for data-driven scoring of connections between protein pairs. We applied these schemes to the Mycobacterium tuberculosis proteome to produce a homology-based functional network of the organism with high confidence and coverage. We use the network for predicting functions of uncharacterized proteins.
Protein pair-wise functional relationship scores for Mycobacterium tuberculosis strain CDC1551 sequence data and Python scripts to compute these scores are available at http://web.cbio.uct.ac.za/~gmazandu/scoringschemes.
In order to interpret the results obtained from a microarray experiment, researchers often shift focus from analysis of individual differentially expressed genes to analyses of sets of genes. These gene-set analysis (GSA) methods use previously accumulated biological knowledge to group genes into sets and then aim to rank these gene sets in a way that reflects their relative importance in the experimental situation in question. We suspect that the presence of paralogs affects the ability of GSA methods to accurately identify the most important sets of genes for subsequent research.
We show that paralogs, which typically have high sequence identity and similar molecular functions, also exhibit high correlation in their expression patterns. We investigate this correlation as a potential confounding factor common to current GSA methods using Indygene (http://www.cbio.uct.ac.za/indygene), a web tool that reduces a supplied list of genes so that it includes no pairwise paralogy relationships above a specified sequence similarity threshold. We use the tool to reanalyse previously published microarray datasets and determine the potential utility of accounting for the presence of paralogs.
The Indygene tool efficiently removes paralogy relationships from a given dataset and we found that such a reduction, performed prior to GSA, has the ability to generate significantly different results that often represent novel and plausible biological hypotheses. This was demonstrated for three different GSA approaches when applied to the reanalysis of previously published microarray datasets and suggests that the redundancy and non-independence of paralogs is an important consideration when dealing with GSA methodologies.
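The kind of reduction described here can be sketched with a simple greedy pass: genes are visited in the supplied order, and a gene is kept only if it has no paralogy relationship above the threshold with any gene already kept. The abstract does not describe how Indygene actually performs the reduction, so the greedy strategy is an assumption, and the gene names, similarity values and function name are hypothetical.

```python
# Hypothetical pairwise paralog sequence similarities (unordered gene pairs).
PARALOG_SIMILARITY = {
    frozenset(("g1", "g2")): 0.9,
    frozenset(("g2", "g3")): 0.95,
    frozenset(("g3", "g4")): 0.5,
}


def reduce_paralogs(genes, similarity, threshold):
    """Greedily keep genes with no kept paralog above the threshold."""
    kept = []
    for gene in genes:
        conflicts = any(
            similarity.get(frozenset((gene, other)), 0.0) > threshold
            for other in kept
        )
        if not conflicts:
            kept.append(gene)
    return kept
```

Because the pass is order dependent, sorting the input first (for example by differential expression) would determine which member of a paralog group is retained.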
InterPro is an integrated resource for protein families, domains and functional sites, which integrates the following protein signature databases: PROSITE, PRINTS, ProDom, Pfam, SMART, TIGRFAMs, PIRSF, SUPERFAMILY, Gene3D and PANTHER. The latter two new member databases have been integrated since the last publication in this journal. There have been several new developments in InterPro, including an additional reading field, new database links, extensions to the web interface and additional match XML files. InterPro has always provided matches to UniProtKB proteins on the website and in the match XML file on the FTP site. Additional matches to proteins in UniParc (UniProt archive) are now available for download in the new match XML files only. The latest InterPro release (13.0) contains more than 13 000 entries, covering over 78% of all proteins in UniProtKB. The database is available for text- and sequence-based searches via a webserver, and for download by anonymous FTP. The InterProScan search tool is now also available via a web service.
InterPro, an integrated documentation resource of protein families, domains and functional sites, was created in 1999 as a means of amalgamating the major protein signature databases into one comprehensive resource. PROSITE, Pfam, PRINTS, ProDom, SMART and TIGRFAMs have been manually integrated and curated and are available in InterPro for text- and sequence-based searching. The results are provided in a single format that rationalises the results that would be obtained by searching the member databases individually. The latest release of InterPro contains 5629 entries describing 4280 families, 1239 domains, 95 repeats and 15 post-translational modifications. Currently, the combined signatures in InterPro cover more than 74% of all proteins in SWISS-PROT and TrEMBL, an increase of nearly 15% since the inception of InterPro. New features of the database include improved searching capabilities and enhanced graphical user interfaces for visualisation of the data. The database is available via a webserver (http://www.ebi.ac.uk/interpro) and anonymous FTP (ftp://ftp.ebi.ac.uk/pub/databases/interpro).
With the large influx of raw sequence data from genome sequencing projects, there is a need for reliable automatic methods for protein sequence analysis and classification. The most useful tools use various methods for identifying motifs or domains found in previously characterized protein families. This article reviews the tools and resources available on the web for identifying signatures within proteins and discusses how they may be used in the analysis of new or unknown protein sequences.