1.  Three reasons protein disorder analysis makes more sense in the light of collagen 
Abstract
We have identified that the collagen helix has the potential to be disruptive to analyses of intrinsically disordered proteins. The collagen helix is an extended fibrous structure that is both promiscuous and repetitive. Whilst its sequence is predicted to be disordered, this type of protein structure is not typically considered as intrinsic disorder. Here, we show that collagen‐encoding proteins skew the distribution of exon lengths in genes. We find that previous results, demonstrating that exons encoding disordered regions are more likely to be symmetric, are due to the abundance of the collagen helix. Other related results, showing increased levels of alternative splicing in disorder‐encoding exons, still hold after considering collagen‐containing proteins. Aside from analyses of exons, we find that the set of proteins that contain collagen significantly alters the amino acid composition of regions predicted as disordered. We conclude that research in this area should be conducted in the light of the collagen helix.
doi:10.1002/pro.2913
PMCID: PMC4838654  PMID: 26941008
intrinsically disordered proteins; collagen helix; exons; phase symmetry; splicing
2.  An expanded evaluation of protein function prediction methods shows an improvement in accuracy 
Jiang, Yuxiang | Oron, Tal Ronnen | Clark, Wyatt T. | Bankapur, Asma R. | D’Andrea, Daniel | Lepore, Rosalba | Funk, Christopher S. | Kahanda, Indika | Verspoor, Karin M. | Ben-Hur, Asa | Koo, Da Chen Emily | Penfold-Brown, Duncan | Shasha, Dennis | Youngs, Noah | Bonneau, Richard | Lin, Alexandra | Sahraeian, Sayed M. E. | Martelli, Pier Luigi | Profiti, Giuseppe | Casadio, Rita | Cao, Renzhi | Zhong, Zhaolong | Cheng, Jianlin | Altenhoff, Adrian | Skunca, Nives | Dessimoz, Christophe | Dogan, Tunca | Hakala, Kai | Kaewphan, Suwisa | Mehryary, Farrokh | Salakoski, Tapio | Ginter, Filip | Fang, Hai | Smithers, Ben | Oates, Matt | Gough, Julian | Törönen, Petri | Koskinen, Patrik | Holm, Liisa | Chen, Ching-Tai | Hsu, Wen-Lian | Bryson, Kevin | Cozzetto, Domenico | Minneci, Federico | Jones, David T. | Chapman, Samuel | BKC, Dukka | Khan, Ishita K. | Kihara, Daisuke | Ofer, Dan | Rappoport, Nadav | Stern, Amos | Cibrian-Uhalte, Elena | Denny, Paul | Foulger, Rebecca E. | Hieta, Reija | Legge, Duncan | Lovering, Ruth C. | Magrane, Michele | Melidoni, Anna N. | Mutowo-Meullenet, Prudence | Pichler, Klemens | Shypitsyna, Aleksandra | Li, Biao | Zakeri, Pooya | ElShal, Sarah | Tranchevent, Léon-Charles | Das, Sayoni | Dawson, Natalie L. | Lee, David | Lees, Jonathan G. | Sillitoe, Ian | Bhat, Prajwal | Nepusz, Tamás | Romero, Alfonso E. | Sasidharan, Rajkumar | Yang, Haixuan | Paccanaro, Alberto | Gillis, Jesse | Sedeño-Cortés, Adriana E. | Pavlidis, Paul | Feng, Shou | Cejuela, Juan M. | Goldberg, Tatyana | Hamp, Tobias | Richter, Lothar | Salamov, Asaf | Gabaldon, Toni | Marcet-Houben, Marina | Supek, Fran | Gong, Qingtian | Ning, Wei | Zhou, Yuanpeng | Tian, Weidong | Falda, Marco | Fontana, Paolo | Lavezzo, Enrico | Toppo, Stefano | Ferrari, Carlo | Giollo, Manuel | Piovesan, Damiano | Tosatto, Silvio C.E. | del Pozo, Angela | Fernández, José M. | Maietta, Paolo | Valencia, Alfonso | Tress, Michael L. 
| Benso, Alfredo | Di Carlo, Stefano | Politano, Gianfranco | Savino, Alessandro | Rehman, Hafeez Ur | Re, Matteo | Mesiti, Marco | Valentini, Giorgio | Bargsten, Joachim W. | van Dijk, Aalt D. J. | Gemovic, Branislava | Glisic, Sanja | Perovic, Vladmir | Veljkovic, Veljko | Veljkovic, Nevena | Almeida-e-Silva, Danillo C. | Vencio, Ricardo Z. N. | Sharan, Malvika | Vogel, Jörg | Kansakar, Lakesh | Zhang, Shanshan | Vucetic, Slobodan | Wang, Zheng | Sternberg, Michael J. E. | Wass, Mark N. | Huntley, Rachael P. | Martin, Maria J. | O’Donovan, Claire | Robinson, Peter N. | Moreau, Yves | Tramontano, Anna | Babbitt, Patricia C. | Brenner, Steven E. | Linial, Michal | Orengo, Christine A. | Rost, Burkhard | Greene, Casey S. | Mooney, Sean D. | Friedberg, Iddo | Radivojac, Predrag
Genome Biology  2016;17(1):184.
Background
A major bottleneck in our understanding of the molecular underpinnings of life is the assignment of function to proteins. While molecular experiments provide the most reliable annotation of proteins, their relatively low throughput and restricted purview have led to an increasing role for computational function prediction. However, assessing methods for protein function prediction and tracking progress in the field remain challenging.
Results
We conducted the second critical assessment of functional annotation (CAFA), a timed challenge to assess computational methods that automatically assign protein function. We evaluated 126 methods from 56 research groups for their ability to predict biological functions using Gene Ontology and gene-disease associations using Human Phenotype Ontology on a set of 3681 proteins from 18 species. CAFA2 featured expanded analysis compared with CAFA1, with regard to data set size, variety, and assessment metrics. To review progress in the field, the analysis compared the best methods from CAFA1 to those of CAFA2.
Conclusions
The top-performing methods in CAFA2 outperformed those from CAFA1. This increased accuracy can be attributed to a combination of the growing number of experimental annotations and improved methods for function prediction. The assessment also revealed that the definition of top-performing algorithms is ontology specific, that different performance metrics can be used to probe the nature of accurate predictions, and the relative diversity of predictions in the biological process and human phenotype ontologies. While there was methodological improvement between CAFA1 and CAFA2, the interpretation of results and usefulness of individual methods remain context-dependent.
Electronic supplementary material
The online version of this article (doi:10.1186/s13059-016-1037-6) contains supplementary material, which is available to authorized users.
doi:10.1186/s13059-016-1037-6
PMCID: PMC5015320  PMID: 27604469
Protein function prediction; Disease gene prioritization
3.  Molecular Principles of Gene Fusion Mediated Rewiring of Protein Interaction Networks in Cancer 
Molecular Cell  2016;63(4):579-592.
Summary
Gene fusions are common cancer-causing mutations, but the molecular principles by which fusion protein products affect interaction networks and cause disease are not well understood. Here, we perform an integrative analysis of the structural, interactomic, and regulatory properties of thousands of putative fusion proteins. We demonstrate that genes that form fusions (i.e., parent genes) tend to be highly connected hub genes, whose protein products are enriched in structured and disordered interaction-mediating features. Fusion often results in the loss of these parental features and the depletion of regulatory sites such as post-translational modifications. Fusion products disproportionately connect proteins that did not previously interact in the protein interaction network. In this manner, fusion products can escape cellular regulation and constitutively rewire protein interaction networks. We suggest that the deregulation of central, interaction-prone proteins may represent a widespread mechanism by which fusion proteins alter the topology of cellular signaling pathways and promote cancer.
Highlights
•Parents of fusion proteins occupy central positions in protein interaction networks
•Parents are rich in interaction-mediating features, which are often lost via fusion
•Fusions preferentially join proteins with no previous connection in protein networks
•Fusion proteins escape regulation by losing post-translational modification sites
The molecular mechanisms of fusion-mediated interactome disruption are currently unclear. Latysheva et al. find that fusion-forming proteins occupy central positions in interaction networks. They lose much of their extensive interaction-mediating ability and capacity for regulation upon fusion. These findings provide insights into how fusion proteins could rewire networks in cancer.
doi:10.1016/j.molcel.2016.07.008
PMCID: PMC5003813  PMID: 27540857
gene fusion; fusion protein; cancer genomics; protein interaction networks
4.  Did Viruses Evolve As a Distinct Supergroup from Common Ancestors of Cells? 
Genome Biology and Evolution  2016;8(8):2474-2481.
The evolutionary origins of viruses according to marker gene phylogenies, as well as their relationships to the ancestors of host cells, remain unclear. In a recent article, Nasir and Caetano-Anollés reported that their genome-scale phylogenetic analyses based on genomic composition of protein structural-domains identify an ancient origin of the “viral supergroup” (Nasir et al. 2015. A phylogenomic data-driven exploration of viral origins and evolution. Sci Adv. 1(8):e1500527.). The article suggests that viruses and host cells evolved independently from a universal common ancestor. Examination of their data and phylogenetic methods indicates that systematic errors likely affected the results. Reanalysis of the data with additional tests shows that small-genome attraction artifacts distort their phylogenomic analyses, particularly the location of the root of the phylogenetic tree of life that is central to their conclusions. These new results indicate that their suggestion of a distinct ancestry of the viral supergroup is not well supported by the evidence.
doi:10.1093/gbe/evw175
PMCID: PMC5010908  PMID: 27497315
tree of life; origins of viruses; systematic error; rooting artifact; small genome attraction; homoplasy
5.  Evolution of the Calcium-Based Intracellular Signaling System 
Genome Biology and Evolution  2016;8(7):2118-2132.
To progress our understanding of molecular evolution from a collection of well-studied genes toward the level of the cell, we must consider whole systems. Here, we reveal the evolution of an important intracellular signaling system. The calcium-signaling toolkit is made up of different multidomain proteins that have undergone duplication, recombination, sequence divergence, and selection. The picture of evolution, considering the repertoire of proteins in the toolkit of both extant organisms and ancestors, is radically different from that of other systems. In eukaryotes, the repertoire increased in both abundance and diversity at a far greater rate than general genomic expansion. We describe how calcium-based intracellular signaling evolution differs not only in rate but in nature, and how this correlates with the disparity of plants and animals.
doi:10.1093/gbe/evw139
PMCID: PMC4987107  PMID: 27358427
protein architecture; calcium signaling; evolution; diversification; specialization
7.  Hologenome analysis of two marine sponges with different microbiomes 
BMC Genomics  2016;17:158.
Background
Sponges (Porifera) harbor distinct microbial consortia within their mesohyl interior. We herein analysed the hologenomes of Stylissa carteri and Xestospongia testudinaria, which notably differ in their microbiome content.
Results
Our analysis revealed that S. carteri has an expanded repertoire of immunological domains, specifically Scavenger Receptor Cysteine-Rich (SRCR)-like domains, compared to X. testudinaria. On the microbial side, metatranscriptome analyses revealed an overrepresentation of potential symbiosis-related domains in X. testudinaria.
Conclusions
Our findings provide genomic insights into the molecular mechanisms underlying host-symbiont coevolution and may serve as a roadmap for future hologenome analyses.
Electronic supplementary material
The online version of this article (doi:10.1186/s12864-016-2501-0) contains supplementary material, which is available to authorized users.
doi:10.1186/s12864-016-2501-0
PMCID: PMC4772301  PMID: 26926518
Sponge; Stylissa carteri; Xestospongia testudinaria; Innate immune system; Host; Microbial symbionts; Hologenome
8.  Function-selective domain architecture plasticity potentials in eukaryotic genome evolution 
Biochimie  2015;119:269-277.
To help evaluate how protein function impacts on genome evolution, we introduce a new concept of ‘architecture plasticity potential’ – the capacity to form distinct domain architectures – either for an individual domain or, more generally, for a set of domains grouped by shared function. We devise a scoring metric to measure the plasticity potential for these domain sets, and evaluate how function has changed over time for different species. Applying this metric to a phylogenetic tree of eukaryotic genomes, we find that the involvement of each function is not random but highly selective. For certain lineages there is strong bias for evolution to involve domains related to certain functions. In general, eukaryotic genomes, particularly those of animals, expand complex functional activities such as signalling and regulation, but at the cost of reducing metabolic processes. We also observe differential evolution of transcriptional regulation and a unique evolutionary role of channel regulators; crucially, this is only observable in terms of the architecture plasticity potential. Our findings provide a new layer of information to understand the significance of function in eukaryotic genome evolution. A web search tool, available at http://supfam.org/Pevo, offers a wide spectrum of options for exploring functional importance in eukaryotic genome evolution.
Highlights
•A new concept to measure domain architecture plasticity potential in a genome.
•We reveal the function-selective role in eukaryotic genome evolution.
•Eukaryotic genomes expand signalling and regulations but reduce metabolism.
•We observe differential evolution between trans- and cis-acting regulations.
•We observe a unique role of channel regulators in separating eukaryotic kingdoms.
doi:10.1016/j.biochi.2015.05.003
PMCID: PMC4679076  PMID: 25980317
Domain architectures; Eukaryotic genomes; Evolution; Function; FDR, false discovery rate; GO, Gene Ontology; HMM, hidden Markov models; PP, plasticity potential; SCOP, Structural Classification of Proteins; sTOL, tree of (sequenced) life
9.  Splice junctions are constrained by protein disorder 
Nucleic Acids Research  2015;43(10):4814-4822.
We have discovered that positions of splice junctions in genes are constrained by the tolerance for disorder-promoting amino acids in the translated protein region. It is known that efficient splicing requires nucleotide bias at the splice junction; the preferred usage produces a distribution of amino acids that is disorder-promoting. We observe that efficiency of splicing, as seen in the amino-acid distribution, is not compromised to accommodate globular structure. Thus we infer that it is the positions of splice junctions in the gene that must be under constraint by the local protein environment. Examining exonic splicing enhancers (short DNA motifs) found near the splice junction in the gene reveals that these are more prevalent in exons that encode disordered protein regions than in exons encoding structured regions. Thus we also conclude that local protein features constrain efficient splicing more in structure than in disorder.
doi:10.1093/nar/gkv407
PMCID: PMC4446445  PMID: 25934802
11.  Sequential transcriptional changes dictate safe and effective antigen-specific immunotherapy 
Nature Communications  2014;5:4741.
Antigen-specific immunotherapy combats autoimmunity or allergy by reinstating immunological tolerance to target antigens without compromising immune function. Optimization of dosing strategy is critical for effective modulation of pathogenic CD4+ T-cell activity. Here we report that dose escalation is imperative for safe, subcutaneous delivery of the high self-antigen doses required for effective tolerance induction and elicits anergic, interleukin (IL)-10-secreting regulatory CD4+ T cells. Analysis of the CD4+ T-cell transcriptome, at consecutive stages of escalating dose immunotherapy, reveals progressive suppression of transcripts positively regulating inflammatory effector function and repression of cell cycle pathways. We identify transcription factors, c-Maf and NFIL3, and negative co-stimulatory molecules, LAG-3, TIGIT, PD-1 and TIM-3, which characterize this regulatory CD4+ T-cell population and whose expression correlates with the immunoregulatory cytokine IL-10. These results provide a rationale for dose escalation in T-cell-directed immunotherapy and reveal novel immunological and transcriptional signatures as surrogate markers of successful immunotherapy.
Dose escalation in antigen-specific therapies is recognized as safe and effective, but the underlying effects of dosing variables on the immune system are not understood. Here, the authors demonstrate that dose escalation causes sequential modulation of gene expression among antigen-specific lymphocytes.
doi:10.1038/ncomms5741
PMCID: PMC4167604  PMID: 25182274
12.  Sequential transcriptional changes dictate safe and effective antigen-specific immunotherapy 
Nature communications  2014;5:4741.
Antigen-specific immunotherapy combats autoimmunity or allergy by reinstating immunological tolerance to target antigens without compromising immune function. Optimisation of dosing strategy is critical for effective modulation of pathogenic CD4+ T cell activity. Here we report that dose escalation is imperative for safe, subcutaneous delivery of the high self-antigen doses required for effective tolerance induction and elicits anergic, IL-10-secreting regulatory CD4+ T cells. Analysis of the CD4+ T cell transcriptome, at consecutive stages of escalating dose immunotherapy, reveals progressive suppression of transcripts positively regulating inflammatory effector function and repression of cell cycle pathways. We identify transcription factors, c-Maf and NFIL3, and negative co-stimulatory molecules, LAG-3, TIGIT, PD-1 and TIM-3, which characterise this regulatory CD4+ T cell population and whose expression correlates with the immunoregulatory cytokine IL-10. These results provide a rationale for dose escalation in T cell-directed immunotherapy and reveal novel immunological and transcriptional signatures as surrogate markers of successful immunotherapy.
doi:10.1038/ncomms5741
PMCID: PMC4167604  PMID: 25182274
13.  DGEclust: differential expression analysis of clustered count data 
Genome Biology  2015;16(1):39.
We present a statistical methodology, DGEclust, for differential expression analysis of digital expression data. Our method treats differential expression as a form of clustering, thus unifying these two concepts. Furthermore, it simultaneously addresses the problem of how many clusters are supported by the data and uncertainty in parameter estimation. DGEclust successfully identifies differentially expressed genes under a number of different scenarios, maintaining a low error rate and an excellent control of its false discovery rate with reasonable computational requirements. It is formulated to perform particularly well on low-replicated data and be applicable to multi-group data. DGEclust is available at http://dvav.github.io/dgeclust/.
Electronic supplementary material
The online version of this article (doi:10.1186/s13059-015-0604-6) contains supplementary material, which is available to authorized users.
doi:10.1186/s13059-015-0604-6
PMCID: PMC4365804  PMID: 25853652
14.  An integrative approach to predicting the functional effects of non-coding and coding sequence variation 
Bioinformatics  2015;31(10):1536-1543.
Motivation: Technological advances have enabled the identification of an increasingly large spectrum of single nucleotide variants within the human genome, many of which may be associated with monogenic disease or complex traits. Here, we propose an integrative approach, named FATHMM-MKL, to predict the functional consequences of both coding and non-coding sequence variants. Our method utilizes various genomic annotations, which have recently become available, and learns to weight the significance of each component annotation source.
Results: We show that our method outperforms current state-of-the-art algorithms, CADD and GWAVA, when predicting the functional consequences of non-coding variants. In addition, FATHMM-MKL is comparable to the best of these algorithms when predicting the impact of coding variants. The method includes a confidence measure to rank order predictions.
Availability and implementation: The FATHMM-MKL webserver is available at: http://fathmm.biocompute.org.uk
Contact: H.Shihab@bristol.ac.uk or Mark.Rogers@bristol.ac.uk or C.Campbell@bristol.ac.uk
Supplementary information: Supplementary data are available at Bioinformatics online.
doi:10.1093/bioinformatics/btv009
PMCID: PMC4426838  PMID: 25583119
15.  The InterPro protein families database: the classification resource after 15 years 
Nucleic Acids Research  2014;43(Database issue):D213-D221.
The InterPro database (http://www.ebi.ac.uk/interpro/) is a freely available resource that can be used to classify sequences into protein families and to predict the presence of important domains and sites. Central to the InterPro database are predictive models, known as signatures, from a range of different protein family databases that have different biological focuses and use different methodological approaches to classify protein families and domains. InterPro integrates these signatures, capitalizing on the respective strengths of the individual databases, to produce a powerful protein classification resource. Here, we report on the status of InterPro as it enters its 15th year of operation, and give an overview of new developments with the database and its associated Web interfaces and software. In particular, the new domain architecture search tool is described and the process of mapping of Gene Ontology terms to InterPro is outlined. We also discuss the challenges faced by the resource given the explosive growth in sequence data in recent years. InterPro (version 48.0) contains 36 766 member database signatures integrated into 26 238 InterPro entries, an increase of over 3993 entries (5081 signatures), since 2012.
doi:10.1093/nar/gku1243
PMCID: PMC4383996  PMID: 25428371
16.  The SUPERFAMILY 1.75 database in 2014: a doubling of data 
Nucleic Acids Research  2014;43(Database issue):D227-D233.
We present updates to the SUPERFAMILY 1.75 (http://supfam.org) online resource and protein sequence collection. The hidden Markov model library that provides sequence homology to SCOP structural domains remains unchanged at version 1.75. In the last 4 years SUPERFAMILY has more than doubled its holding of curated complete proteomes over all cellular life, from 1400 proteomes reported previously in 2010 up to 3258 at present. Outside of the main sequence collection, SUPERFAMILY continues to provide domain annotation for sequences provided by other resources such as: UniProt, Ensembl, PDB, much of JGI Phytozome and selected subcollections of NCBI RefSeq. Despite this growth in data volume, SUPERFAMILY now provides users with an expanded and daily updated phylogenetic tree of life (sTOL). This tree is built with genomic-scale domain annotation data as before, but constantly updated when new species are introduced to the sequence library. Our Gene Ontology and other functional and phenotypic annotations previously reported have stood up to critical assessment by the function prediction community. We have now introduced these data in an integrated manner online at the level of an individual sequence, and—in the case of whole genomes—with enrichment analysis against a taxonomically defined background.
doi:10.1093/nar/gku1041
PMCID: PMC4383889  PMID: 25414345
17.  Genome3D: exploiting structure to help users understand their sequences 
Nucleic Acids Research  2014;43(Database issue):D382-D386.
Genome3D (http://www.genome3d.eu) is a collaborative resource that provides predicted domain annotations and structural models for key sequences. Since introducing Genome3D in a previous NAR paper, we have substantially extended and improved the resource. We have annotated representatives from Pfam families to improve coverage of diverse sequences and added a fast sequence search to the website to allow users to find Genome3D-annotated sequences similar to their own. We have improved and extended the Genome3D data, enlarging the source data set from three model organisms to 10, and adding VIVACE, a resource new to Genome3D. We have analysed and updated Genome3D's SCOP/CATH mapping. Finally, we have improved the superposition tools, which now give users a more powerful interface for investigating similarities and differences between structural models.
doi:10.1093/nar/gku973
PMCID: PMC4384030  PMID: 25348407
18.  The ‘dnet’ approach promotes emerging research on cancer patient survival 
Genome Medicine  2014;6(8):64.
We present the ‘dnet’ package and apply it to the ‘TCGA’ mutation and clinical data of >3,000 patients. We uncover the existence of an underlying gene network that at least partially controls cancer ‘survivalness’, with mutations that are significantly correlated with patient survival, yet independent of tumour origin and type. The survivalness network has natural community structure corresponding to tumour hallmarks, and contains genes that are potentially druggable in the clinic. This network has evolutionary roots in Deuterostomia identifying PTK2 and VAV1 as under-valued relative to more studied genes from that era. The ‘dnet’ R package is available at http://cran.r-project.org/package=dnet.
Electronic supplementary material
The online version of this article (doi:10.1186/s13073-014-0064-8) contains supplementary material, which is available to authorized users.
doi:10.1186/s13073-014-0064-8
PMCID: PMC4160547  PMID: 25246945
19.  Ranking non-synonymous single nucleotide polymorphisms based on disease concepts 
Human Genomics  2014;8(1):11.
As the number of non-synonymous single nucleotide polymorphisms (nsSNPs) identified through whole-exome/whole-genome sequencing programs increases, researchers and clinicians are becoming increasingly reliant upon computational prediction algorithms designed to prioritize potential functional variants for further study. A large proportion of existing prediction algorithms are ‘disease agnostic’ but are nevertheless quite capable of predicting when a mutation is likely to be deleterious. However, most clinical and research applications of these algorithms relate to specific diseases and would therefore benefit from an approach that discriminates between functional variants specifically related to that disease from those which are not. In a whole-exome/whole-genome sequencing context, such an approach could substantially reduce the number of false positive candidate mutations. Here, we test this postulate by incorporating a disease-specific weighting scheme into the Functional Analysis through Hidden Markov Models (FATHMM) algorithm. When compared to traditional prediction algorithms, we observed an overall reduction in the number of false positives identified using a disease-specific approach to functional prediction across 17 distinct disease concepts/categories. Our results illustrate the potential benefits of making disease-specific predictions when prioritizing candidate variants in relation to specific diseases. A web-based implementation of our algorithm is available at http://fathmm.biocompute.org.uk.
doi:10.1186/1479-7364-8-11
PMCID: PMC4083756  PMID: 24980617
SNV; nsSNPs; Disease-causing; Disease-specific; FATHMM; HMMs; SIFT; PolyPhen; Bioinformatics
20.  The Evolution of Human Cells in Terms of Protein Innovation 
Molecular Biology and Evolution  2014;31(6):1364-1374.
Humans are composed of hundreds of cell types. As the genomic DNA of each somatic cell is identical, cell type is determined by what is expressed and when. Until recently, little has been reported about the determinants of human cell identity, particularly from the joint perspective of gene evolution and expression. Here, we chart the evolutionary past of all documented human cell types via the collective histories of proteins, the principal product of gene expression. FANTOM5 data provide cell-type–specific digital expression of human protein-coding genes and the SUPERFAMILY resource is used to provide protein domain annotation. The evolutionary epoch in which each protein was created is inferred by comparison with domain annotation of all other completely sequenced genomes. Studying the distribution across epochs of genes expressed in each cell type reveals insights into human cellular evolution in terms of protein innovation. For each cell type, its history of protein innovation is charted based on the genes it expresses. Combining the histories of all cell types enables us to create a timeline of cell evolution. This timeline identifies the possibility that our common ancestor Coelomata (cavity-forming animals) provided the innovation required for the innate immune system, whereas cells which now form the human brain have followed a trajectory of continually accumulating novel proteins since Opisthokonta (boundary of animals and fungi). We conclude that exaptation of existing domain architectures into new contexts is the dominant source of cell-type–specific domain architectures.
doi:10.1093/molbev/mst139
PMCID: PMC4032124  PMID: 24692656
CAGE; transcriptome; protein domains; evolution
21.  supraHex: An R/Bioconductor package for tabular omics data analysis using a supra-hexagonal map
Highlights
•supraHex is an open-source R/Bioconductor package for tabular omics data analysis.
•A supra-hexagonal map is designed to self-organise omics data.
•The supraHex map analyses both genes and samples at the same time.
•The supraHex map can be overlaid by additional data for multilayer omics data comparisons.
•supraHex can tell inherent relations between replication timing, CpG and expression.
Biologists are increasingly confronted with the challenge of quickly understanding genome-wide biological data, which usually involve a large number of genomic coordinates (e.g. genes) but a much smaller number of samples. To meet the need for data of this shape, we present an open-source package called ‘supraHex’ for training, analysing and visualising omics data. This package devises a supra-hexagonal map to self-organise the input data, offers scalable functionalities for post-analysing the map, and more importantly, allows for overlaying additional data for multilayer omics data comparisons. By applying it to DNA replication timing data of mouse embryogenesis, we demonstrate that supraHex is capable of simultaneously carrying out gene clustering and sample correlation, providing intuitive visualisation at each step of the analysis. By overlaying CpG and expression data onto the trained replication-timing map, we also show that supraHex is able to intuitively capture an inherent relationship between late replication, low CpG density promoters and low expression levels. As part of the Bioconductor project, supraHex makes accessible to a wide community, in a simple way, what would otherwise be a complex framework for the ultrafast understanding of any tabular omics data, both scientifically and artistically. This package can run on Windows, Mac and Linux, and is freely available together with many tutorials featuring real examples at http://supfam.org/supraHex.
doi:10.1016/j.bbrc.2013.11.103
PMCID: PMC3905187  PMID: 24309102
Bioinformatics; Clustering; Sample correlation; Visualisation; DNA replication timing; Gene expression
22.  A large-scale evaluation of computational protein function prediction 
Radivojac, Predrag | Clark, Wyatt T | Ronnen Oron, Tal | Schnoes, Alexandra M | Wittkop, Tobias | Sokolov, Artem | Graim, Kiley | Funk, Christopher | Verspoor, Karin | Ben-Hur, Asa | Pandey, Gaurav | Yunes, Jeffrey M | Talwalkar, Ameet S | Repo, Susanna | Souza, Michael L | Piovesan, Damiano | Casadio, Rita | Wang, Zheng | Cheng, Jianlin | Fang, Hai | Gough, Julian | Koskinen, Patrik | Törönen, Petri | Nokso-Koivisto, Jussi | Holm, Liisa | Cozzetto, Domenico | Buchan, Daniel W A | Bryson, Kevin | Jones, David T | Limaye, Bhakti | Inamdar, Harshal | Datta, Avik | Manjari, Sunitha K | Joshi, Rajendra | Chitale, Meghana | Kihara, Daisuke | Lisewski, Andreas M | Erdin, Serkan | Venner, Eric | Lichtarge, Olivier | Rentzsch, Robert | Yang, Haixuan | Romero, Alfonso E | Bhat, Prajwal | Paccanaro, Alberto | Hamp, Tobias | Kassner, Rebecca | Seemayer, Stefan | Vicedo, Esmeralda | Schaefer, Christian | Achten, Dominik | Auer, Florian | Böhm, Ariane | Braun, Tatjana | Hecht, Maximilian | Heron, Mark | Hönigschmid, Peter | Hopf, Thomas | Kaufmann, Stefanie | Kiening, Michael | Krompass, Denis | Landerer, Cedric | Mahlich, Yannick | Roos, Manfred | Björne, Jari | Salakoski, Tapio | Wong, Andrew | Shatkay, Hagit | Gatzmann, Fanny | Sommer, Ingolf | Wass, Mark N | Sternberg, Michael J E | Škunca, Nives | Supek, Fran | Bošnjak, Matko | Panov, Panče | Džeroski, Sašo | Šmuc, Tomislav | Kourmpetis, Yiannis A I | van Dijk, Aalt D J | ter Braak, Cajo J F | Zhou, Yuanpeng | Gong, Qingtian | Dong, Xinran | Tian, Weidong | Falda, Marco | Fontana, Paolo | Lavezzo, Enrico | Di Camillo, Barbara | Toppo, Stefano | Lan, Liang | Djuric, Nemanja | Guo, Yuhong | Vucetic, Slobodan | Bairoch, Amos | Linial, Michal | Babbitt, Patricia C | Brenner, Steven E | Orengo, Christine | Rost, Burkhard | Mooney, Sean D | Friedberg, Iddo
Nature methods  2013;10(3):221-227.
Automated annotation of protein function is challenging. As the number of sequenced genomes rapidly grows, the overwhelming majority of protein products can only be annotated computationally. If computational predictions are to be relied upon, it is crucial that the accuracy of these methods be high. Here we report the results from the first large-scale community-based Critical Assessment of protein Function Annotation (CAFA) experiment. Fifty-four methods representing the state-of-the-art for protein function prediction were evaluated on a target set of 866 proteins from eleven organisms. Two findings stand out: (i) today’s best protein function prediction algorithms significantly outperformed widely-used first-generation methods, with large gains on all types of targets; and (ii) although the top methods perform well enough to guide experiments, there is significant need for improvement of currently available tools.
doi:10.1038/nmeth.2340
PMCID: PMC3584181  PMID: 23353650
23.  Predicting the functional consequences of cancer-associated amino acid substitutions 
Bioinformatics  2013;29(12):1504-1510.
Motivation: The number of missense mutations being identified in cancer genomes has greatly increased as a consequence of technological advances and the reduced cost of whole-genome/whole-exome sequencing methods. However, a high proportion of the amino acid substitutions detected in cancer genomes have little or no effect on tumour progression (passenger mutations). Therefore, accurate automated methods capable of discriminating between driver (cancer-promoting) and passenger mutations are becoming increasingly important. In our previous work, we developed the Functional Analysis through Hidden Markov Models (FATHMM) software and, using a model weighted for inherited disease mutations, observed improved performances over alternative computational prediction algorithms. Here, we describe an adaptation of our original algorithm that incorporates a cancer-specific model to potentiate the functional analysis of driver mutations.
Results: The performance of our algorithm was evaluated using two separate benchmarks. In our analysis, we observed improved performances when distinguishing between driver mutations and other germ line variants (both disease-causing and putatively neutral mutations). In addition, when discriminating between somatic driver and passenger mutations, we observed performances comparable with the leading computational prediction algorithms: SPF-Cancer and TransFIC.
Availability and implementation: A web-based implementation of our cancer-specific model, including a downloadable stand-alone package, is available at http://fathmm.biocompute.org.uk.
Contact: fathmm@biocompute.org.uk
Supplementary information: Supplementary data are available at Bioinformatics online.
doi:10.1093/bioinformatics/btt182
PMCID: PMC3673218  PMID: 23620363
24.  A domain-centric solution to functional genomics via dcGO Predictor 
BMC Bioinformatics  2013;14(Suppl 3):S9.
Background
Computational/manual annotations of protein functions are one of the first routes to making sense of a newly sequenced genome. Protein domain predictions form an essential part of this annotation process. This is due to the natural modularity of proteins with domains as structural, evolutionary and functional units. Sometimes two, three, or more adjacent domains (called supra-domains) are the operational unit responsible for a function, e.g. via a binding site at the interface. These supra-domains have contributed to functional diversification in higher organisms. Traditionally, functional ontologies have been applied to individual proteins, rather than families of related domains and supra-domains. We expect, however, that to some extent functional signals can be carried by protein domains and supra-domains, and consequently used in function prediction and functional genomics.
Results
Here we present a domain-centric Gene Ontology (dcGO) perspective. We generalize a framework for automatically inferring ontological terms associated with domains and supra-domains from full-length sequence annotations. This general framework has been applied specifically to primary protein-level annotations from UniProtKB-GOA, generating GO term associations with SCOP domains and supra-domains. The resulting 'dcGO Predictor' can be used to provide functional annotation to protein sequences. The functional annotation of sequences in the Critical Assessment of Function Annotation (CAFA) has been used as a valuable opportunity to validate our method and to be assessed by the community. The functional annotation of all completely sequenced genomes has demonstrated the potential for domain-centric GO enrichment analysis to yield functional insights into newly sequenced or yet-to-be-annotated genomes. This generalized framework we have presented has also been applied to other domain classifications such as InterPro and Pfam, and other ontologies such as mammalian phenotype and disease ontology. The dcGO and its predictor are available at http://supfam.org/SUPERFAMILY/dcGO including an enrichment analysis tool.
Conclusions
As functional units, domains offer a unique perspective on function prediction regardless of whether proteins are multi-domain or single-domain. The 'dcGO Predictor' holds great promise for contributing to a domain-centric functional understanding of genomes in the next generation sequencing era.
doi:10.1186/1471-2105-14-S3-S9
PMCID: PMC3584936  PMID: 23514627
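Illustrative note on entry 24: the abstract describes inferring ontology terms for domains from protein-level annotations, but does not state the statistical procedure. The sketch below is one plausible reading only, assuming a one-sided hypergeometric test of the overlap between proteins carrying a domain and proteins annotated with a term. It is not the dcGO implementation; the function names, protein identifiers, domain labels and term labels are all hypothetical.

```python
from math import comb


def hypergeom_sf(k, N, K, n):
    """P(X >= k) when drawing n items from N, of which K are 'successes'."""
    upper = min(K, n)
    hits = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return hits / comb(N, n)


def domain_term_pvalues(protein_domains, protein_terms):
    """Score every co-occurring (domain, term) pair (assumed procedure).

    protein_domains: dict mapping protein id -> set of domain labels
    protein_terms:   dict mapping protein id -> set of ontology term labels
    Returns a dict mapping (domain, term) -> one-sided p-value.
    """
    # Only proteins with both kinds of annotation form the background.
    proteins = set(protein_domains) & set(protein_terms)
    N = len(proteins)

    by_domain, by_term = {}, {}
    for p in proteins:
        for d in protein_domains[p]:
            by_domain.setdefault(d, set()).add(p)
        for t in protein_terms[p]:
            by_term.setdefault(t, set()).add(p)

    pvalues = {}
    for d, with_d in by_domain.items():
        for t, with_t in by_term.items():
            k = len(with_d & with_t)  # proteins having both domain and term
            if k:
                pvalues[(d, t)] = hypergeom_sf(k, N, len(with_t), len(with_d))
    return pvalues


# Toy data (hypothetical): two domains, two terms, four proteins.
toy_domains = {"P1": {"dom1"}, "P2": {"dom1"}, "P3": {"dom2"}, "P4": {"dom2"}}
toy_terms = {"P1": {"TERM_A"}, "P2": {"TERM_A"}, "P3": {"TERM_B"}, "P4": {"TERM_B"}}
toy_pvalues = domain_term_pvalues(toy_domains, toy_terms)
```

In practice, p-values from many domain-term pairs would need multiple-testing correction, and the ontology hierarchy would have to be taken into account; both are omitted here for brevity.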
25.  Genome3D: a UK collaborative project to annotate genomic sequences with predicted 3D structures based on SCOP and CATH domains 
Nucleic Acids Research  2012;41(Database issue):D499-D507.
Genome3D, available at http://www.genome3d.eu, is a new collaborative project that integrates UK-based structural resources to provide a unique perspective on sequence–structure–function relationships. Leading structure prediction resources (DomSerf, FUGUE, Gene3D, pDomTHREADER, Phyre and SUPERFAMILY) provide annotations for UniProt sequences to indicate the locations of structural domains (structural annotations) and their 3D structures (structural models). Structural annotations and 3D model predictions are currently available for three model genomes (Homo sapiens, E. coli and baker’s yeast), and the project will extend to other genomes in the near future. As these resources exploit different strategies for predicting structures, the main aim of Genome3D is to enable comparisons between all the resources so that biologists can see where predictions agree and are therefore more trusted. Furthermore, as these methods differ in whether they build their predictions using CATH or SCOP, Genome3D also contains the first official mapping between these two databases. This has identified pairs of similar superfamilies from the two resources at various degrees of consensus (532 bronze pairs, 527 silver pairs and 370 gold pairs).
doi:10.1093/nar/gks1266
PMCID: PMC3531217  PMID: 23203986
