Disordered regions of proteins often bind to structured domains, mediating interactions within and between proteins. However, it is difficult to identify a priori the short disordered regions involved in binding. We set out to determine if docking such peptide regions to peptide binding domains would assist in these predictions.We assembled a redundancy reduced dataset of SLiM (Short Linear Motif) containing proteins from the ELM database. We selected 84 sequences which had an associated PDB structures showing the SLiM bound to a protein receptor, where the SLiM was found within a 50 residue region of the protein sequence which was predicted to be disordered. First, we investigated the Vina docking scores of overlapping tripeptides from the 50 residue SLiM containing disordered regions of the protein sequence to the corresponding PDB domain. We found only weak discrimination of docking scores between peptides involved in binding and adjacent non-binding peptides in this context (AUC 0.58).Next, we trained a bidirectional recurrent neural network (BRNN) using as input the protein sequence, predicted secondary structure, Vina docking score and predicted disorder score. The results were very promising (AUC 0.72) showing that multiple sources of information can be combined to produce results which are clearly superior to any single source.We conclude that the Vina docking score alone has only modest power to define the location of a peptide within a larger protein region known to contain it. However, combining this information with other knowledge (using machine learning methods) clearly improves the identification of peptide binding regions within a protein sequence. This approach combining docking with machine learning is primarily a predictor of binding to peptide-binding sites, and is not intended as a predictor of specificity of binding to particular receptors.
YAP (Yes-associated protein) is a potent oncogene and a major effector of the mammalian Hippo tumor suppressor pathway. In this review, our emphasis is on the structural basis of how YAP recognizes its various cellular partners. In particular, we discuss the role of LATS kinase and AMOTL1 junction protein, two key cellular partners of YAP that bind to its WW domain, in mediating cytoplasmic localization of YAP and thereby playing a key role in the regulation of its transcriptional activity. Importantly, the crystal structure of an amino-terminal domain of YAP in complex with the carboxy-terminal domain of TEAD transcription factor was only recently solved at atomic resolution, while the structure of WW domain of YAP in complex with a peptide containing the PPxY motif has been available for more than a decade. We discuss how such structural information may be exploited for the rational development of novel anti-cancer therapeutics harboring greater efficacy coupled with low toxicity. We also embark on a brief discussion of how recent in silico studies led to identification of the cardiac glycoside digitoxin as a potential modulator of WW domain-ligand interactions. Conversely, dobutamine was identified in a screen of known drugs as a compound that promotes cytoplasmic localization of YAP, thereby resulting in growth suppressing activity. Finally, we discuss how a recent study on the dynamics of WW domain folding on a biologically critical time scale may provide a tool to generate repertoires of WW domain variants for regulation of the Hippo pathway toward desired, non-oncogenic outputs.
TEAD transcription factor; WW domain; PDZ domain; Nuclear localization; Digitoxin; Dobutamine
Computational protein short linear motif discovery can use protein interaction information to search for motifs among proteins which share a common interactor. Cytoscape provides a visual interface for protein networks but there is no streamlined way to rapidly visualize motifs in a network of proteins, or to integrate computational discovery with such visualizations.
We present SLiMScape, a Cytoscape plugin, which enables both de novo motif discovery and searches for instances of known motifs. Data is presented using Cytoscape’s visualization features thus providing an intuitive interface for interpreting results. The distribution of discovered or user-defined motifs may be selectively displayed and the distribution of protein domains may be viewed simultaneously. To facilitate this SLiMScape automatically retrieves domains for each protein.
SLiMScape provides a platform for performing short linear motif analyses of protein interaction networks by integrating motif discovery and search tools in a network visualization environment. This significantly aids in the discovery of novel short linear motifs and in visualizing the distribution of known motifs.
We carried out a genome-wide association study (GWAS) of LDL-c response to statin using data from participants in the Collaborative Atorvastatin Diabetes Study (CARDS; n = 1,156), the Anglo-Scandinavian Cardiac Outcomes Trial (ASCOT; n = 895), and the observational phase of ASCOT (n = 651), all of whom were prescribed atorvastatin 10 mg. Following genome-wide imputation, we combined data from the three studies in a meta-analysis. We found associations of LDL-c response to atorvastatin that reached genome-wide significance at rs10455872 (P = 6.13 × 10−9) within the LPA gene and at two single nucleotide polymorphisms (SNP) within the APOE region (rs445925; P = 2.22 × 10−16 and rs4420638; P = 1.01 × 10−11) that are proxies for the ϵ2 and ϵ4 variants, respectively, in APOE. The novel association with the LPA SNP was replicated in the PROspective Study of Pravastatin in the Elderly at Risk (PROSPER) trial (P = 0.009). Using CARDS data, we further showed that atorvastatin therapy did not alter lipoprotein(a) [Lp(a)] and that Lp(a) levels accounted for all of the associations of SNPs in the LPA gene and the apparent LDL-c response levels. However, statin therapy had a similar effect in reducing cardiovascular disease (CVD) in patients in the top quartile for serum Lp(a) levels (HR = 0.60) compared with those in the lower three quartiles (HR = 0.66; P = 0.8 for interaction). The data emphasize that high Lp(a) levels affect the measurement of LDL-c and the clinical estimation of LDL-c response. Therefore, an apparently lower LDL-c response to statin therapy may indicate a need for measurement of Lp(a). However, statin therapy seems beneficial even in those with high Lp(a).
genetics; low density lipoprotein; LDL/metabolism; lipoprotein(a); statins
Intrinsically disordered regions in eukaryotic proteomes contain key signaling and regulatory modules and mediate interactions with many proteins. Many viral proteomes encode disordered proteins and modulate host factors through the use of short linear motifs (SLiMs) embedded within disordered regions. However, the degree of viral protein disorder across different viruses is not well understood, so we set out to establish the constraints acting on viruses, in terms of their use of disordered protein regions. We surveyed predicted disorder across 2,278 available viral genomes in 41 families, and correlated the extent of disorder with genome size and other factors. Protein disorder varies strikingly between viral families (from 2.9% to 23.1% of residues), and also within families. However, this substantial variation did not follow the established trend among their hosts, with increasing disorder seen across eubacterial, archaebacterial, protists, and multicellular eukaryotes. For example, among large mammalian viruses, poxviruses and herpesviruses showed markedly differing disorder (5.6% and 17.9%, respectively). Viral families with smaller genome sizes have more disorder within each of five main viral types (ssDNA, dsDNA, ssRNA+, dsRNA, retroviruses), except for negative single-stranded RNA viruses, where disorder increased with genome size. However, surveying over all viruses, which compares tiny and enormous viruses over a much bigger range of genome sizes, there is no strong association of genome size with protein disorder. We conclude that there is extensive variation in the disorder content of viral proteomes. While a proportion of this may relate to base composition, to extent of gene overlap, and to genome size within viral types, there remain important additional family and virus-specific effects. Differing disorder strategies are likely to impact on how different viruses modulate host factors, and on how rapidly viruses can evolve novel instances of SLiMs subverting host functions, such as innate and acquired immunity.
Large portions of higher eukaryotic proteomes are intrinsically disordered, and abundant evidence suggests that these unstructured regions of proteins are rich in regulatory interaction interfaces. A major class of disordered interaction interfaces are the compact and degenerate modules known as short linear motifs (SLiMs). As a result of the difficulties associated with the experimental identification and validation of SLiMs, our understanding of these modules is limited, advocating the use of computational methods to focus experimental discovery. This article evaluates the use of evolutionary conservation as a discriminatory technique for motif discovery. A statistical framework is introduced to assess the significance of relatively conserved residues, quantifying the likelihood a residue will have a particular level of conservation given the conservation of the surrounding residues. The framework is expanded to assess the significance of groupings of conserved residues, a metric that forms the basis of SLiMPrints (short linear motif fingerprints), a de novo motif discovery tool. SLiMPrints identifies relatively overconstrained proximal groupings of residues within intrinsically disordered regions, indicative of putatively functional motifs. Finally, the human proteome is analysed to create a set of highly conserved putative motif instances, including a novel site on translation initiation factor eIF2A that may regulate translation through binding of eIF4E.
Short linear protein motifs are attracting increasing attention as functionally independent sites, typically 3–10 amino acids in length that are enriched in disordered regions of proteins. Multiple methods have recently been proposed to discover over-represented motifs within a set of proteins based on simple regular expressions. Here, we extend these approaches to profile-based methods, which provide a richer motif representation.
The profile motif discovery method MEME performed relatively poorly for motifs in disordered regions of proteins. However, when we applied evolutionary weighting to account for redundancy amongst homologous proteins, and masked out poorly conserved regions of disordered proteins, the performance of MEME is equivalent to that of regular expression methods. However, the two approaches returned different subsets within both a benchmark dataset, and a more realistic discovery dataset.
Profile-based motif discovery methods complement regular expression based methods. Whilst profile-based methods are computationally more intensive, they are likely to discover motifs currently overlooked by regular expression methods.
Protein-protein interactions; Motif discovery; Peptide binding; Short linear motifs; Mini-motifs; SLiMs
Autism spectrum disorder (ASD) is a highly heritable disorder of complex and heterogeneous aetiology. It is primarily characterized by altered cognitive ability including impaired language and communication skills and fundamental deficits in social reciprocity. Despite some notable successes in neuropsychiatric genetics, overall, the high heritability of ASD (~90%) remains poorly explained by common genetic risk variants. However, recent studies suggest that rare genomic variation, in particular copy number variation, may account for a significant proportion of the genetic basis of ASD. We present a large scale analysis to identify candidate genes which may contain low-frequency recessive variation contributing to ASD while taking into account the potential contribution of population differences to the genetic heterogeneity of ASD. Our strategy, homozygous haplotype (HH) mapping, aims to detect homozygous segments of identical haplotype structure that are shared at a higher frequency amongst ASD patients compared to parental controls. The analysis was performed on 1,402 Autism Genome Project trios genotyped for 1 million single nucleotide polymorphisms (SNPs). We identified 25 known and 1,218 novel ASD candidate genes in the discovery analysis including CADM2, ABHD14A, CHRFAM7A, GRIK2, GRM3, EPHA3, FGF10, KCND2, PDZK1, IMMP2L and FOXP2. Furthermore, 10 of the previously reported ASD genes and 300 of the novel candidates identified in the discovery analysis were replicated in an independent sample of 1,182 trios. Our results demonstrate that regions of HH are significantly enriched for previously reported ASD candidate genes and the observed association is independent of gene size (odds ratio 2.10). Our findings highlight the applicability of HH mapping in complex disorders such as ASD and offer an alternative approach to the analysis of genome-wide association data.
Electronic supplementary material
The online version of this article (doi:10.1007/s00439-011-1094-6) contains supplementary material, which is available to authorized users.
Intrinsically disordered regions are enriched in short interaction motifs that play a critical role in many protein-protein interactions. Since new short interaction motifs may easily evolve, they have the potential to rapidly change protein interactions and cellular signaling. In this work we examined the dynamics of gain and loss of intrinsically disordered regions in duplicated proteins to inspect if changes after genome duplication can create functional divergence. For this purpose we used Saccharomyces cerevisiae and the outgroup species Lachancea kluyveri.
We find that genes duplicated as part of a genome duplication (ohnologs) are significantly more intrinsically disordered than singletons (p<2.2e-16, Wilcoxon), reflecting a preference for retaining intrinsically disordered proteins in duplicate. In addition, there have been marked changes in the extent of intrinsic disorder following duplication. A large number of duplicated genes have more intrinsic disorder than their L. kluyveri ortholog (29% for duplicates versus 25% for singletons) and an even greater number have less intrinsic disorder than the L. kluyveri ortholog (37% for duplicates versus 25% for singletons). Finally, we show that the number of physical interactions is significantly greater in the more intrinsically disordered ohnolog of a pair (p = 0.003, Wilcoxon).
This work shows that intrinsic disorder gain and loss in a protein is a mechanism by which a genome can also diverge and innovate. The higher number of interactors for proteins that have gained intrinsic disorder compared with their duplicates may reflect the acquisition of new interaction partners or new functional roles.
Gene and protein interactions are commonly represented as networks, with the genes or proteins comprising the nodes and the relationship between them as edges. Motifs, or small local configurations of edges and nodes that arise repeatedly, can be used to simplify the interpretation of networks.
We examined triplet motifs in a network of quantitative epistatic genetic relationships, and found a non-random distribution of particular motif classes. Individual motif classes were found to be associated with different functional properties, suggestive of an underlying biological significance. These associations were apparent not only for motif classes, but for individual positions within the motifs. As expected, NNN (all negative) motifs were strongly associated with previously reported genetic (i.e. synthetic lethal) interactions, while PPP (all positive) motifs were associated with protein complexes. The two other motif classes (NNP: a positive interaction spanned by two negative interactions, and NPP: a negative spanned by two positives) showed very distinct functional associations, with physical interactions dominating for the former but alternative enrichments, typical of biochemical pathways, dominating for the latter.
We present a model showing how NNP motifs can be used to recognize supportive relationships between protein complexes, while NPP motifs often identify opposing or regulatory behaviour between a gene and an associated pathway. The ability to use motifs to point toward underlying biological organizational themes is likely to be increasingly important as more extensive epistasis mapping projects in higher organisms begin.
Milk proteins are required to proceed through a variety of conditions of radically varying pH, which are not identical across mammalian digestive systems. We wished to investigate if the shifts in these requirements have resulted in marked changes in the isoelectric point and charge of milk proteins during evolution.
We investigated nine major milk proteins in 13 mammals. In comparison with a group of orthologous non-milk proteins, we found that 3 proteins κ-casein, lactadherin, and muc1 have undergone the highest change in isoelectric point during evolution. The pattern of non-synonymous substitutions indicate that selection has played a role in the isoelectric point shift, since residues that show significant evidence of positive selection are much more likely to be charged (p = 0.03 for κ-casein; p < 10-8 for muc1). However, this selection does not appear to be solely due to adaptation to the diversity of mammalian digestive systems, since striking changes are seen among species that resemble each other in terms of their digestion.
The changes in charge are most likely due to changes of other protein functions, rather than an adaptation to the different mammalian digestive systems. These functions may include differences in bioactive peptide releases in the gut between different mammals, which are known to be a major contributing factor in the functional and nutritional value of mammalian milk. This raises the question of whether bovine milk is optimal in terms of particular protein functions, for human nutrition and possibly disease resistance.
This article was reviewed by Fyodor Kondrashov, David Liberles (nominated by David Ardell), and Christophe Lefevre (nominated by Mark Ragan).
Short, linear motifs (SLiMs) play a critical role in many biological processes. The SLiMSearch 2.0 (Short, Linear Motif Search) web server allows researchers to identify occurrences of a user-defined SLiM in a proteome, using conservation and protein disorder context statistics to rank occurrences. User-friendly output and visualizations of motif context allow the user to quickly gain insight into the validity of a putatively functional motif occurrence. For each motif occurrence, overlapping UniProt features and annotated SLiMs are displayed. Visualization also includes annotated multiple sequence alignments surrounding each occurrence, showing conservation and protein disorder statistics in addition to known and predicted SLiMs, protein domains and known post-translational modifications. In addition, enrichment of Gene Ontology terms and protein interaction partners are provided as indicators of possible motif function. All web server results are available for download. Users can search motifs against the human proteome or a subset thereof defined by Uniprot accession numbers or GO term. The SLiMSearch server is available at: http://bioware.ucd.ie/slimsearch2.html.
Previous relatively small studies have associated particular amino acid replacements and deletions in the HIV-1 nef gene with differences in the rate of HIV disease progression. We tested more rigorously whether particular nef amino acid differences and deletions are associated with HIV disease progression. Amino acid replacements and deletions in patients' consensus sequences were investigated for 153 progressor (P), 615 long-term nonprogressor (LTNP), and 2,311 unknown progressor sequences from 582 subtype B HIV-infected patients. LTNPs had more defective nefs (interrupted by frameshifts or stop codons), but on a per-patient basis there was no excess of LTNP patients with one or more defective nef sequences compared to the Ps (P = 0.47). The high frequency of amino acid replacement at residues S8, V10, I11, A15, V85, V133, N157, S163, V168, D174, R178, E182, and R188 in LTNPs was also seen in permuted datasets, implying that these are simply rapidly evolving residues. Permutation testing revealed that residues showing the greatest excess over expectation (A15, V85, N157, S163, V168, D174, R178, and R188) were not significant (P = 0.77). Exploratory analysis suggested a hypothetical excess of frameshifting in the regions 9SVIG and 118QGYF among LTNPs. The regions V10 and 152KVEEA of nef were commonly deleted in LTNPs. However, permutation testing indicated that none of the regions displayed significantly excessive deletion in LTNPs. In conclusion, meta-analysis of HIV-1 nef sequences provides no clear evidence of whether defective nef sequences or particular regions of the protein play a significant role in disease progression.
Short, linear motifs (SLiMs) play a critical role in many biological processes, particularly in protein–protein interactions. The Short, Linear Motif Finder (SLiMFinder) web server is a de novo motif discovery tool that identifies statistically over-represented motifs in a set of protein sequences, accounting for the evolutionary relationships between them. Motifs are returned with an intuitive P-value that greatly reduces the problem of false positives and is accessible to biologists of all disciplines. Input can be uploaded by the user or extracted directly from UniProt. Numerous masking options give the user great control over the contextual information to be included in the analyses. The SLiMFinder server combines these with user-friendly output and visualizations of motif context to allow the user to quickly gain insight into the validity of a putatively functional motif. These visualizations include alignments of motif occurrences, alignments of motifs and their homologues and a visual schematic of the top-ranked motifs. Returned motifs can also be compared with known SLiMs from the literature using CompariMotif. All results are available for download. The SLiMFinder server is available at: http://bioware.ucd.ie/slimfinder.html.
Large datasets of protein interactions provide a rich resource for the discovery of Short Linear Motifs (SLiMs) that recur in unrelated proteins. However, existing methods for estimating the probability of motif recurrence may be biased by the size and composition of the search dataset, such that p-value estimates from different datasets, or from motifs containing different numbers of non-wildcard positions, are not strictly comparable. Here, we develop more exact methods and explore the potential biases of computationally efficient approximations.
A widely used heuristic for the calculation of motif over-representation approximates motif probability by assuming that all proteins have the same length and composition. We introduce pv, which calculates the probability exactly. Secondly, the recently introduced SLiMFinder statistic Sig, accounts for multiple testing (across all possible motifs) in motif discovery. However, it approximates the probability of all other possible motifs, occurring with a score of p or less, as being equal to p. Here, we show that the exhaustive calculation of the probability of all possible motif occurrences that are as rare or rarer than the motif of interest, Sig', may be carried out efficiently by grouping motifs of a common probability (i.e. those which have permuted orders of the same residues). Sig'v, which corrects both approximations, is shown to be uniformly distributed in a random dataset when searching for non-ambiguous motifs, indicating that it is a robust significance measure.
A method is presented to compute exactly the true probability of a non-ambiguous short protein sequence motif, and the utility of an approximate approach for novel motif discovery across a large number of datasets is demonstrated.
Motivation: Pairwise experimental perturbation is increasingly used to probe gene and protein function because these studies offer powerful insight into the activity and regulation of biological systems. Symmetric two-dimensional datasets, such as pairwise genetic interactions are amenable to an optimally designed measurement procedure because of the equivalence of cases and conditions where fewer experimental measurements may be required to extract the underlying structure.
Results: We show that optimal experimental design can provide improvements in efficiency when collecting data in an iterative manner. We develop a method built on a statistical clustering model for symmetric data and the Fisher information uncertainty estimates, and we also provide simple heuristic approaches that have comparable performance. Using yeast epistatic miniarrays as an example, we show that correct assignment of the major subnetworks could be achieved with <50% of the measurements in the complete dataset. Optimization is likely to become critical as pairwise functional studies extend to more complex mammalian systems where all by all experiments are currently intractable.
Supplementary data are available at Bioinformatics online.
Tandem repeat (TR) variants in the human genome play key roles in a number of diseases. However, current models predicting variability are based on limited training sets. We conducted a systematic analysis of TRs of unit lengths 2–12 nucleotides in Whole Genome Shotgun (WGS) sequences to define the extent of variation of 209,214 unique repeat loci throughout the genome.
We applied a multivariate statistical model to predict TR variability. Predicted heterozygosity correlated with heterozygosity in the CEPH polymorphism database (correlation ρ = 0.29, p < 0.0005) better than the correlation between the CEPH and WGS data (ρ = 0.17), presumably because the model smoothes noise from small sample sizes. A multivariate logistic model of 8 parameters accounted for 36% of the variation in the WGS data. Validation studies of 70 experimentally investigated TRs revealed high concordance with the model's predictions (p < 0.0001).
Variability among 2–12-mer TRs in the genome can be modeled by a few parameters, which do not markedly differ according to unit length, consistent with a common mechanism for the generation of variability among such TRs. Analysis of the distributions of observed and predicted variants across the genome showed a general concordance, indicating that the repeat variation dataset does not exhibit strong regional ascertainment biases. This revealed a deficit of variant repeats in chromosomes 19 and Y – likely to reflect a reduction in 2-mer repeats in the former and a reduced level of recombination in the latter – and excesses in chromosomes 6, 13, 20 and 21.
Short linear motifs (SLiMs) in proteins are functional microdomains of fundamental importance in many biological systems. SLiMs typically consist of a 3 to 10 amino acid stretch of the primary protein sequence, of which as few as two sites may be important for activity, making identification of novel SLiMs extremely difficult. In particular, it can be very difficult to distinguish a randomly recurring “motif” from a truly over-represented one. Incorporating ambiguous amino acid positions and/or variable-length wildcard spacers between defined residues further complicates the matter.
In this paper we present two algorithms. SLiMBuild identifies convergently evolved, short motifs in a dataset of proteins. Motifs are built by combining dimers into longer patterns, retaining only those motifs occurring in a sufficient number of unrelated proteins. Motifs with fixed amino acid positions are identified and then combined to incorporate amino acid ambiguity and variable-length wildcard spacers. The algorithm is computationally efficient compared to alternatives, particularly when datasets include homologous proteins, and provides great flexibility in the nature of motifs returned. The SLiMChance algorithm estimates the probability of returned motifs arising by chance, correcting for the size and composition of the dataset, and assigns a significance value to each motif. These algorithms are implemented in a software package, SLiMFinder. SLiMFinder default settings identify known SLiMs with 100% specificity, and have a low false discovery rate on random test data.
The efficiency of SLiMBuild and low false discovery rate of SLiMChance make SLiMFinder highly suited to high throughput motif discovery and individual high quality analyses alike. Examples of such analyses on real biological data, and how SLiMFinder results can help direct future discoveries, are provided. SLiMFinder is freely available for download under a GNU license from http://bioinformatics.ucd.ie/shields/software/slimfinder/.
Short, linear motifs (SLiMs) play a critical role in many biological processes, particularly in protein–protein interactions. Overrepresentation of convergent occurrences of motifs in proteins with a common attribute (such as similar subcellular location or a shared interaction partner) provides a feasible means to discover novel occurrences computationally. The SLiMDisc (Short, Linear Motif Discovery) web server corrects for common ancestry in describing shared motifs, concentrating on the convergently evolved motifs. The server returns a listing of the most interesting motifs found within unmasked regions, ranked according to an information content-based scoring scheme. It allows interactive input masking, according to various criteria. Scoring allows for evolutionary relationships in the data sets through treatment of BLAST local alignments. Alongside this ranked list, visualizations of the results improve understanding of the context of suggested motifs, helping to identify true motifs of interest. These visualizations include alignments of motif occurrences, alignments of motifs and their homologues and a visual schematic of the top-ranked motifs. Additional options for filtering and/or re-ranking motifs further permit the user to focus on motifs with desired attributes. Returned motifs can also be compared with known SLiMs from the literature. SLiMDisc is available at: http://bioware.ucd.ie/~slimdisc/.
Tandem repeat arrays showing variation between sequences within a population, between strains or across species may have functional effects. The increasing availability of genomic sequence data makes routine description of observed variation possible, creating a need for tools to describe such variability.
We present a set of programs that facilitate the identification of tandem repeats showing variation across multiple sequences or genomes, and the prediction of potentially polymorphic tandem repeats. The VNTRfinder (Variable Number of Tandem Repeats finder) program enables the detection of sequence length variation between arrays of inter-specific or intra-specific tandem repeats. In the absence of comparable sequences to explore observed variation, predictions are provided describing which tandem repeats are more likely to be variable, to help guide and focus further experimental evaluation.
These tools represent a resource for researchers interested in tandem repeats in nucleotide sequences that are most likely to be of clinical and evolutionary interest. The tools are available at . Downloadable versions for UNIX/LINUX and WINDOWS which permit the consideration of longer and more numerous sequences are also available.
Many important interactions of proteins are facilitated by short, linear motifs (SLiMs) within a protein's primary sequence. Our aim was to establish robust methods for discovering putative functional motifs. The strongest evidence for such motifs is obtained when the same motifs occur in unrelated proteins, evolving by convergence. In practise, searches for such motifs are often swamped by motifs shared in related proteins that are identical by descent. Prediction of motifs among sets of biologically related proteins, including those both with and without detectable similarity, were made using the TEIRESIAS algorithm. The number of motif occurrences arising through common evolutionary descent were normalized based on treatment of BLAST local alignments. Motifs were ranked according to a score derived from the product of the normalized number of occurrences and the information content. The method was shown to significantly outperform methods that do not discount evolutionary relatedness, when applied to known SLiMs from a subset of the eukaryotic linear motif (ELM) database. An implementation of Multiple Spanning Tree weighting outperformed two other weighting schemes, in a variety of settings.
The recessive disorder trimethylaminuria is caused by defects in the FMO3 gene, and may be associated with hypertension. We investigated whether common polymorphisms of the FMO3 gene confer an increased risk for elevated blood pressure and/or essential hypertension.
FMO3 genotypes (E158K, V257M, E308G) were determined in 387 healthy subjects with ambulatory systolic and diastolic blood pressure measurements, and in a cardiovascular disease population of 1649 individuals, 691(41.9%) of whom had a history of hypertension requiring drug treatment. Haplotypes were determined and their distribution noted.
There was no statistically significant association found between any of the 4 common haplotypes and daytime systolic blood pressure in the healthy population (p = 0.65). Neither was a statistically significant association found between the 4 common haplotypes and hypertension status among the cardiovascular disease patients (p = 0.80).
These results suggest that the variants in the FMO3 gene do not predispose to essential hypertension in this population.
Tandem repeat polymorphisms in human proteins were characterized using the UniGene dataset. This analysis suggests that 1 in 20 proteins are likely to contain tandem repeat copy-number polymorphisms within coding regions; these were prevalent among protein-binding proteins.
Tandem repeat variation in protein-coding regions will alter protein length and may introduce frameshifts. Tandem repeat variants are associated with variation in pathogenicity in bacteria and with human disease. We characterized tandem repeat polymorphism in human proteins, using the UniGene database, and tested whether these were associated with host defense roles.
Protein-coding tandem repeat copy-number polymorphisms were detected in 249 tandem repeats found in 218 UniGene clusters; observed length differences ranged from 2 to 144 nucleotides, with unit copy lengths ranging from 2 to 57. This corresponded to 1.59% (218/13,749) of proteins investigated carrying detectable polymorphisms in the copy-number of protein-coding tandem repeats. We found no evidence that tandem repeat copy-number polymorphism was significantly elevated in defense-response proteins (p = 0.882). An association with the Gene Ontology term 'protein-binding' remained significant after covariate adjustment and correction for multiple testing. Combining this analysis with previous experimental evaluations of tandem repeat polymorphism, we estimate the approximate mean frequency of tandem repeat polymorphisms in human proteins to be 6%. Because 13.9% of the polymorphisms were not a multiple of three nucleotides, up to 1% of proteins may contain frameshifting tandem repeat polymorphisms.
Around 1 in 20 human proteins are likely to contain tandem repeat copy-number polymorphisms within coding regions. Such polymorphisms are not more frequent among defense-response proteins; their prevalence among protein-binding proteins may reflect lower selective constraints on their structural modification. The impact of frameshifting and longer copy-number variants on protein function and disease merits further investigation.
Searching databases for distant homologues using alignments instead of individual sequences increases the power of detection. However, most methods assume that protein evolution proceeds in a regular fashion, with the inferred tree of sequences providing a good estimation of the evolutionary process. We investigated the combined HMMER search results from random alignment subsets (with three sequences each) drawn from the parent alignment (Rand-shuffle algorithm), using the SCOP structural classification to determine true similarities. At false-positive rates of 5%, the Rand-shuffle algorithm improved HMMER's sensitivity, with a 37.5% greater sensitivity compared with HMMER alone, when easily identified similarities (identifiable by BLAST) were excluded from consideration. An extension of the Rand-shuffle algorithm (Ali-shuffle) weighted towards more informative sequence subsets. This approach improved the performance over HMMER alone and PSI-BLAST, particularly at higher false-positive rates. The improvements in performance of these sequence sub-sampling methods may reflect lower sensitivity to alignment error and irregular evolutionary patterns. The Ali-shuffle and Rand-shuffle sequence homology search programs are available by request from the authors.
The prediction of ancestral protein sequences from multiple sequence alignments is useful for many bioinformatics analyses. Predicting ancestral sequences is not a simple procedure and relies on accurate alignments and phylogenies. Several algorithms exist based on Maximum Parsimony or Maximum Likelihood methods but many current implementations are unable to process residues with gaps, which may represent insertion/deletion (indel) events or sequence fragments.
Here we present a new algorithm, GASP (Gapped Ancestral Sequence Prediction), for predicting ancestral sequences from phylogenetic trees and the corresponding multiple sequence alignments. Alignments may be of any size and contain gaps. GASP first assigns the positions of gaps in the phylogeny before using a likelihood-based approach centred on amino acid substitution matrices to assign ancestral amino acids. Important outgroup information is used by first working down from the tips of the tree to the root, using descendant data only to assign probabilities, and then working back up from the root to the tips using descendant and outgroup data to make predictions. GASP was tested on a number of simulated datasets based on real phylogenies. Prediction accuracy for ungapped data was similar to three alternative algorithms tested, with GASP performing better in some cases and worse in others. Adding simple insertions and deletions to the simulated data did not have a detrimental effect on GASP accuracy.
GASP (Gapped Ancestral Sequence Prediction) will predict ancestral sequences from multiple protein alignments of any size. Although not as accurate in all cases as some of the more sophisticated maximum likelihood approaches, it can process a wide range of input phylogenies and will predict ancestral sequences for gapped and ungapped residues alike.