Short linear motifs (SLiMs) are functional stretches of protein sequence that are crucial to numerous biological processes. These motifs often comprise peptides of fewer than 10 amino acids that modulate protein–protein interactions. While well characterized in eukaryotic intracellular signaling, their role in prokaryotic signaling is less well understood. We surveyed the distribution of known motifs in prokaryotic extracellular and virulence proteins across a range of bacterial species and conducted searches for novel motifs in virulence proteins. Many known motifs in virulence effector proteins mimic eukaryotic motifs and enable the pathogen to control the intracellular processes of its host. Novel motifs were detected by finding those that had evolved independently in three or more unrelated virulence proteins. The search returned several significantly over-represented linear motifs, some of which were known motifs and others novel candidates with potential roles in bacterial pathogenesis. A putative C-terminal G[AG].$ motif found in type IV secretion system proteins was among the most significant detected. A KK$ motif, previously identified in a plasminogen-binding protein, was shown to be enriched across a number of adhesion proteins and lipoproteins. While there is some potential to develop peptide drugs against bacterial infection based on bacterial peptides that mimic host components, this could have unwanted effects on host signaling. Thus, novel SLiMs in virulence factors that do not mimic host components but are crucial for bacterial pathogenesis, such as those in the type IV secretion system, may be more useful to develop as leads for anti-microbial peptides or drugs.
short linear motifs (SLiMs); virulence factor; motif mimicry; antibacterial; bioinformatics; pathogen
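The two C-terminal motifs named in the abstract above, G[AG].$ and KK$, are ordinary regular expressions anchored to the end of the sequence. The following is a minimal sketch of how such patterns could be scanned against a set of proteins; the function name `c_terminal_hits` and the toy sequences are hypothetical and are not taken from the study.

```python
import re

# Motif definitions as quoted in the abstract; "$" anchors the match to the C-terminus.
MOTIFS = {
    "G[AG].$": re.compile(r"G[AG].$"),
    "KK$": re.compile(r"KK$"),
}


def c_terminal_hits(sequences):
    """Return, for each protein name, the list of motifs matching its C-terminus."""
    hits = {}
    for name, seq in sequences.items():
        hits[name] = [motif for motif, pattern in MOTIFS.items() if pattern.search(seq)]
    return hits


# Toy sequences, invented purely for illustration.
toy = {
    "protA": "MSTNPLLGAV",  # ends in G-A-V, so G[AG].$ matches
    "protB": "MKTAYIAKK",   # ends in K-K, so KK$ matches
    "protC": "MGAVLLSTE",   # contains GAV internally but not at the C-terminus
}

for name, motifs in c_terminal_hits(toy).items():
    print(name, motifs)
```

Matching alone does not establish enrichment; the abstract additionally requires that a motif has evolved independently in three or more unrelated proteins, which this sketch does not attempt.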
Human milk is known to contain several proteases, but little is known about whether these enzymes are active, which proteins they cleave, and their relative contribution to milk protein digestion in vivo. This study analyzed the mass spectrometry-identified protein fragments found in pooled human milk by comparing their cleavage sites with the enzyme specificity patterns of an array of enzymes. The results indicate that several enzymes are actively taking part in the digestion of human milk proteins within the mammary gland, including plasmin and/or trypsin, elastase, cathepsin D, pepsin, chymotrypsin, a glutamyl endopeptidase-like enzyme, and proline endopeptidase. Two proteins were most affected by enzyme hydrolysis: β-casein and polymeric immunoglobulin receptor. In contrast, other highly abundant milk proteins such as α-lactalbumin and lactoferrin appear to have undergone no proteolytic cleavage. A peptide sequence containing a known antimicrobial peptide is released in breast milk by elastase and cathepsin D.
hydrolysate; human milk digestion; milk; nutrition; proteolytic enzymes; bioactive peptide
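The comparison of peptide cleavage sites with enzyme specificity patterns described above can be sketched as a lookup on the residue immediately N-terminal to each cleaved bond (the P1 position). The rule table below is a deliberately simplified assumption (for example, trypsin-like cleavage after K or R), and the function name and toy sequences are hypothetical; the study's actual specificity patterns are not given in the abstract.

```python
# Simplified P1 specificity rules (an illustrative assumption, not the study's table).
P1_RULES = {
    "trypsin/plasmin": set("KR"),
    "chymotrypsin": set("FWY"),
    "glutamyl endopeptidase-like": set("E"),
    "proline endopeptidase": set("P"),
}


def candidate_enzymes(protein, peptide):
    """List enzymes whose P1 rule is consistent with each terminus of a peptide."""
    start = protein.find(peptide)
    if start == -1:
        raise ValueError("peptide not found in protein")
    end = start + len(peptide)
    result = {}
    if start > 0:  # an N-terminal cleavage occurred after the preceding residue
        p1 = protein[start - 1]
        result["N-terminal"] = sorted(e for e, res in P1_RULES.items() if p1 in res)
    if end < len(protein):  # a C-terminal cleavage occurred after the peptide's last residue
        p1 = protein[end - 1]
        result["C-terminal"] = sorted(e for e, res in P1_RULES.items() if p1 in res)
    return result


# Toy protein and peptide, invented for illustration.
print(candidate_enzymes("MAAKLLSTEVGGFQQP", "LLSTE"))
```

A terminus with an empty list is one that none of the listed rules explains, which is the kind of unexplained cleavage a fuller analysis would need to account for.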
Little is known about the digestive process in infants. In particular, the chronological activity of enzymes across the course of digestion in the infant remains largely unknown. To create a temporal picture of how milk proteins are digested, enzyme activity was compared between intact human milk samples from three mothers and the gastric samples from each of their 4–12 day postpartum infants, 2 h after breast milk ingestion. The activities of 7 distinct enzymes are predicted in the infant stomach based on their observed cleavage pattern in peptidomics data. We found that the same patterns of cleavage were evident in both intact human milk and gastric milk samples, demonstrating that the enzyme activities that begin in milk persist in the infant stomach. However, the extent of enzyme activity is found to vary greatly between the intact milk and gastric samples. Overall, we observe that milk-specific proteins are cleaved at higher levels in the stomach compared to human milk. Notably, the enzymes we predict here only explain 78% of the cleavages uniquely observed in the gastric samples, highlighting that further investigation of the specific enzyme activities associated with digestion in infants is warranted.
milk enzymes; enzyme activity; digestive enzymes; infant digestion; proteolytic enzymes; human milk; indigenous enzymes
Statins effectively lower LDL cholesterol levels in large studies and the observed interindividual response variability may be partially explained by genetic variation. Here we perform a pharmacogenetic meta-analysis of genome-wide association studies (GWAS) in studies addressing the LDL cholesterol response to statins, including up to 18,596 statin-treated subjects. We validate the most promising signals in a further 22,318 statin recipients and identify two loci, SORT1/CELSR2/PSRC1 and SLCO1B1, not previously identified in GWAS. Moreover, we confirm the previously described associations with APOE and LPA. Our findings advance the understanding of the pharmacogenetic architecture of statin response.
Statins are effectively used to prevent and manage cardiovascular disease, but patient response to these drugs is highly variable. Here, the authors identify two new genes associated with the response of LDL cholesterol to statins and advance our understanding of the genetic basis of drug response.
Large portions of higher eukaryotic proteomes are intrinsically disordered, and abundant evidence suggests that these unstructured regions of proteins are rich in regulatory interaction interfaces. A major class of disordered interaction interfaces are the compact and degenerate modules known as short linear motifs (SLiMs). As a result of the difficulties associated with the experimental identification and validation of SLiMs, our understanding of these modules is limited, advocating the use of computational methods to focus experimental discovery. This article evaluates the use of evolutionary conservation as a discriminatory technique for motif discovery. A statistical framework is introduced to assess the significance of relatively conserved residues, quantifying the likelihood a residue will have a particular level of conservation given the conservation of the surrounding residues. The framework is expanded to assess the significance of groupings of conserved residues, a metric that forms the basis of SLiMPrints (short linear motif fingerprints), a de novo motif discovery tool. SLiMPrints identifies relatively overconstrained proximal groupings of residues within intrinsically disordered regions, indicative of putatively functional motifs. Finally, the human proteome is analysed to create a set of highly conserved putative motif instances, including a novel site on translation initiation factor eIF2A that may regulate translation through binding of eIF4E.
Background and Purpose
Visit-to-visit variability in BP is associated with ischemic stroke. We sought to determine whether such variability has a genetic aetiology and whether genetic variants associated with BP variability are also associated with ischemic stroke.
A GWAS for loci influencing BP variability was undertaken in 3,802 individuals from the Anglo-Scandinavian Cardiac Outcome Trial (ASCOT) study where long-term visit-to-visit and within-visit BP measures were available. Since BP variability is strongly associated with ischemic stroke, we genotyped the sentinel SNP in an independent ischemic stroke population comprising 8,624 cases and 12,722 controls and in 3,900 additional (Scandinavian) participants from the ASCOT study in order to replicate our findings.
The ASCOT discovery GWAS identified a cluster of 17 correlated SNPs within the NLGN1 gene (3q26.31) associated with BP variability. The strongest association was with rs976683 (p=1.4×10⁻⁸). Conditional analysis on rs976683 provided no evidence of additional independent associations at the locus. Analysis of rs976683 in ischemic stroke patients found no association for overall stroke (OR 1.02; 95% CI 0.97-1.07; p=0.52) or its sub-types: CE (OR 1.07; 95% CI 0.97-1.16; p=0.17), LVD (OR 0.98; 95% CI 0.89-1.07; p=0.60) and SVD (OR 1.07; 95% CI 0.97-1.17; p=0.19). No evidence for association was found between rs976683 and BP variability in the additional (Scandinavian) ASCOT participants (p=0.18).
We identified a cluster of SNPs at the NLGN1 locus showing significant association with BP variability. Follow up analyses did not support an association with risk of ischemic stroke and its subtypes.
Blood pressure variability; stroke; GWAS; gene; polymorphism
Bioactive cyclic peptides derived from natural sources are well studied, particularly those derived from non-ribosomal synthetases in fungi or bacteria. Ribosomally synthesised bioactive disulphide-bonded loops represent a large, naturally enriched library of potential bioactive compounds, worthy of systematic investigation.
We examined the distribution of short cyclic loops on the surface of a large number of proteins, especially membrane or extracellular proteins. Available three-dimensional structures highlighted a number of disulphide-bonded loops responsible for the majority of the likely binding interactions in a variety of protein complexes, due to their location at protein-protein interfaces. We find that disulphide-bonded loops at protein-protein interfaces may, but do not necessarily, show biological activity independent of their parent protein. Examining the conservation of short disulphide bonded loops in proteins, we find a small but significant increase in conservation inside these loops compared to surrounding residues. We identify a subset of these loops that exhibit a high relative conservation, particularly among peptide hormones.
We conclude that short disulphide-bonded loops are found in a wide variety of biological interactions. They may retain biological activity outside their parent proteins. Such structurally independent peptides may be useful as biologically active templates for the development of novel modulators of protein-protein interactions.
Electronic supplementary material
The online version of this article (doi:10.1186/1471-2105-15-305) contains supplementary material, which is available to authorized users.
Cyclic peptide; Protein loop; Protein interface; Bioactive peptide; Ribosomal cyclic peptide
Elevated resting heart rate is associated with greater risk of cardiovascular disease and mortality. In a 2-stage meta-analysis of genome-wide association studies in up to 181,171 individuals, we identified 14 new loci associated with heart rate and confirmed associations with all 7 previously established loci. Experimental downregulation of gene expression in Drosophila melanogaster and Danio rerio identified 20 genes at 11 loci that are relevant for heart rate regulation and highlight a role for genes involved in signal transmission, embryonic cardiac development and the pathophysiology of dilated cardiomyopathy, congenital heart failure and/or sudden cardiac death. In addition, genetic susceptibility to increased heart rate is associated with altered cardiac conduction and reduced risk of sick sinus syndrome, and both heart rate–increasing and heart rate–decreasing variants associate with risk of atrial fibrillation. Our findings provide fresh insights into the mechanisms regulating heart rate and identify new therapeutic targets.
Disordered regions of proteins often bind to structured domains, mediating interactions within and between proteins. However, it is difficult to identify a priori the short disordered regions involved in binding. We set out to determine if docking such peptide regions to peptide binding domains would assist in these predictions.
We assembled a redundancy-reduced dataset of SLiM (short linear motif)-containing proteins from the ELM database. We selected 84 sequences which had an associated PDB structure showing the SLiM bound to a protein receptor, where the SLiM was found within a 50-residue region of the protein sequence which was predicted to be disordered. First, we investigated the Vina docking scores of overlapping tripeptides from the 50-residue SLiM-containing disordered regions of the protein sequence to the corresponding PDB domain. We found only weak discrimination of docking scores between peptides involved in binding and adjacent non-binding peptides in this context (AUC 0.58).
Next, we trained a bidirectional recurrent neural network (BRNN) using as input the protein sequence, predicted secondary structure, Vina docking score and predicted disorder score. The results were very promising (AUC 0.72), showing that multiple sources of information can be combined to produce results which are clearly superior to any single source.
We conclude that the Vina docking score alone has only modest power to define the location of a peptide within a larger protein region known to contain it. However, combining this information with other knowledge (using machine learning methods) clearly improves the identification of peptide binding regions within a protein sequence. This approach combining docking with machine learning is primarily a predictor of binding to peptide-binding sites, and is not intended as a predictor of specificity of binding to particular receptors.
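Two steps in the abstract above lend themselves to a short illustration: enumerating overlapping tripeptides from a disordered region, and summarising how well a score separates binding from non-binding peptides as an AUC. The sketch below is an assumption-laden toy, not the authors' pipeline: the function names and scores are hypothetical, no docking is performed, and scores are taken to be oriented so that higher means more likely to bind (Vina affinities, where more negative is stronger, would need to be negated first).

```python
def overlapping_peptides(region, k=3):
    """Return every overlapping window of length k from a sequence region."""
    return [region[i:i + k] for i in range(len(region) - k + 1)]


def auc(binding_scores, nonbinding_scores):
    """Rank-based AUC: the fraction of (binding, non-binding) pairs in which
    the binding peptide has the higher score; ties count as half."""
    wins = 0.0
    for b in binding_scores:
        for n in nonbinding_scores:
            if b > n:
                wins += 1.0
            elif b == n:
                wins += 0.5
    return wins / (len(binding_scores) * len(nonbinding_scores))


print(overlapping_peptides("ABCDE"))
# Invented scores for illustration only.
print(auc([0.9, 0.8], [0.1, 0.85]))
```

An AUC of 0.5 corresponds to no discrimination, which gives a sense of scale for the reported values of 0.58 and 0.72.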
YAP (Yes-associated protein) is a potent oncogene and a major effector of the mammalian Hippo tumor suppressor pathway. In this review, our emphasis is on the structural basis of how YAP recognizes its various cellular partners. In particular, we discuss the role of LATS kinase and AMOTL1 junction protein, two key cellular partners of YAP that bind to its WW domain, in mediating cytoplasmic localization of YAP and thereby playing a key role in the regulation of its transcriptional activity. Importantly, the crystal structure of an amino-terminal domain of YAP in complex with the carboxy-terminal domain of TEAD transcription factor was only recently solved at atomic resolution, while the structure of WW domain of YAP in complex with a peptide containing the PPxY motif has been available for more than a decade. We discuss how such structural information may be exploited for the rational development of novel anti-cancer therapeutics harboring greater efficacy coupled with low toxicity. We also embark on a brief discussion of how recent in silico studies led to identification of the cardiac glycoside digitoxin as a potential modulator of WW domain-ligand interactions. Conversely, dobutamine was identified in a screen of known drugs as a compound that promotes cytoplasmic localization of YAP, thereby resulting in growth suppressing activity. Finally, we discuss how a recent study on the dynamics of WW domain folding on a biologically critical time scale may provide a tool to generate repertoires of WW domain variants for regulation of the Hippo pathway toward desired, non-oncogenic outputs.
TEAD transcription factor; WW domain; PDZ domain; Nuclear localization; Digitoxin; Dobutamine
Computational protein short linear motif discovery can use protein interaction information to search for motifs among proteins which share a common interactor. Cytoscape provides a visual interface for protein networks but there is no streamlined way to rapidly visualize motifs in a network of proteins, or to integrate computational discovery with such visualizations.
We present SLiMScape, a Cytoscape plugin that enables both de novo motif discovery and searches for instances of known motifs. Data is presented using Cytoscape’s visualization features, thus providing an intuitive interface for interpreting results. The distribution of discovered or user-defined motifs may be selectively displayed and the distribution of protein domains may be viewed simultaneously. To facilitate this, SLiMScape automatically retrieves domains for each protein.
SLiMScape provides a platform for performing short linear motif analyses of protein interaction networks by integrating motif discovery and search tools in a network visualization environment. This significantly aids in the discovery of novel short linear motifs and in visualizing the distribution of known motifs.
We carried out a genome-wide association study (GWAS) of LDL-c response to statin using data from participants in the Collaborative Atorvastatin Diabetes Study (CARDS; n = 1,156), the Anglo-Scandinavian Cardiac Outcomes Trial (ASCOT; n = 895), and the observational phase of ASCOT (n = 651), all of whom were prescribed atorvastatin 10 mg. Following genome-wide imputation, we combined data from the three studies in a meta-analysis. We found associations of LDL-c response to atorvastatin that reached genome-wide significance at rs10455872 (P = 6.13 × 10⁻⁹) within the LPA gene and at two single nucleotide polymorphisms (SNP) within the APOE region (rs445925; P = 2.22 × 10⁻¹⁶ and rs4420638; P = 1.01 × 10⁻¹¹) that are proxies for the ϵ2 and ϵ4 variants, respectively, in APOE. The novel association with the LPA SNP was replicated in the PROspective Study of Pravastatin in the Elderly at Risk (PROSPER) trial (P = 0.009). Using CARDS data, we further showed that atorvastatin therapy did not alter lipoprotein(a) [Lp(a)] and that Lp(a) levels accounted for all of the associations of SNPs in the LPA gene and the apparent LDL-c response levels. However, statin therapy had a similar effect in reducing cardiovascular disease (CVD) in patients in the top quartile for serum Lp(a) levels (HR = 0.60) compared with those in the lower three quartiles (HR = 0.66; P = 0.8 for interaction). The data emphasize that high Lp(a) levels affect the measurement of LDL-c and the clinical estimation of LDL-c response. Therefore, an apparently lower LDL-c response to statin therapy may indicate a need for measurement of Lp(a). However, statin therapy seems beneficial even in those with high Lp(a).
genetics; low density lipoprotein; LDL/metabolism; lipoprotein(a); statins
Intrinsically disordered regions in eukaryotic proteomes contain key signaling and regulatory modules and mediate interactions with many proteins. Many viral proteomes encode disordered proteins and modulate host factors through the use of short linear motifs (SLiMs) embedded within disordered regions. However, the degree of viral protein disorder across different viruses is not well understood, so we set out to establish the constraints acting on viruses, in terms of their use of disordered protein regions. We surveyed predicted disorder across 2,278 available viral genomes in 41 families, and correlated the extent of disorder with genome size and other factors. Protein disorder varies strikingly between viral families (from 2.9% to 23.1% of residues), and also within families. However, this substantial variation did not follow the established trend among their hosts, with increasing disorder seen across eubacteria, archaebacteria, protists, and multicellular eukaryotes. For example, among large mammalian viruses, poxviruses and herpesviruses showed markedly differing disorder (5.6% and 17.9%, respectively). Viral families with smaller genome sizes have more disorder within each of five main viral types (ssDNA, dsDNA, ssRNA+, dsRNA, retroviruses), except for negative single-stranded RNA viruses, where disorder increased with genome size. However, surveying over all viruses, which compares tiny and enormous viruses over a much bigger range of genome sizes, there is no strong association of genome size with protein disorder. We conclude that there is extensive variation in the disorder content of viral proteomes. While a proportion of this may relate to base composition, to extent of gene overlap, and to genome size within viral types, there remain important additional family and virus-specific effects.
Differing disorder strategies are likely to impact on how different viruses modulate host factors, and on how rapidly viruses can evolve novel instances of SLiMs subverting host functions, such as innate and acquired immunity.
The conventional wisdom is that certain classes of bioactive peptides have specific structural features that endow their particular functions. Accordingly, predictions of bioactivity have focused on particular subgroups, such as antimicrobial peptides. We hypothesized that bioactive peptides may share more general features, and assessed this by contrasting the predictive power of existing antimicrobial predictors as well as a novel general predictor, PeptideRanker, across different classes of peptides.
We observed that existing antimicrobial predictors had reasonable predictive power to identify peptides of certain other classes, i.e. toxin and venom peptides. We trained two general predictors of peptide bioactivity, one focused on short peptides (4–20 amino acids) and one focused on long peptides (>20 amino acids). These general predictors had performance that was typically as good as, or better than, that of specific predictors. We noted some striking differences in the features of short peptide and long peptide predictions; in particular, high-scoring short peptides favour phenylalanine. This is consistent with the hypothesis that short and long peptides have different functional constraints, perhaps reflecting the difficulty for typical short peptides in supporting independent tertiary structure.
We conclude that there are general shared features of bioactive peptides across different functional classes, indicating that computational prediction may accelerate the discovery of novel bioactive peptides and aid in the improved design of existing peptides, across many functional classes. An implementation of the predictive method, PeptideRanker, may be used to identify among a set of peptides those that may be more likely to be bioactive.
Intracellular juxtamembrane regions of transmembrane proteins play pivotal roles in cell signalling, mediated by protein-protein interactions. Disordered protein regions, and short conserved motifs within them, are emerging as key determinants of many such interactions. Here, we investigated whether disorder and conserved motifs are enriched in the juxtamembrane area of human single-pass transmembrane proteins. Conserved motifs were defined as short disordered regions that were much more conserved than the adjacent disordered residues. Human single-pass proteins had higher mean disorder in their cytoplasmic segments than their extracellular parts. Some, but not all, of this effect reflected the shorter length of the cytoplasmic tail. A peak of cytoplasmic disorder was seen at around 30 residues from the membrane. We noted a significant increase in the incidence of conserved motifs within the disordered regions at the same location, even after correcting for the extent of disorder. We conclude that elevated disorder within the cytoplasmic tail of many transmembrane proteins is likely to be associated with enrichment for signalling interactions mediated by conserved short motifs.
Short linear protein motifs are attracting increasing attention as functionally independent sites, typically 3–10 amino acids in length that are enriched in disordered regions of proteins. Multiple methods have recently been proposed to discover over-represented motifs within a set of proteins based on simple regular expressions. Here, we extend these approaches to profile-based methods, which provide a richer motif representation.
The profile motif discovery method MEME performed relatively poorly for motifs in disordered regions of proteins. However, when we applied evolutionary weighting to account for redundancy amongst homologous proteins, and masked out poorly conserved regions of disordered proteins, the performance of MEME was equivalent to that of regular expression methods. Nevertheless, the two approaches returned different subsets within both a benchmark dataset and a more realistic discovery dataset.
Profile-based motif discovery methods complement regular expression based methods. Whilst profile-based methods are computationally more intensive, they are likely to discover motifs currently overlooked by regular expression methods.
Protein-protein interactions; Motif discovery; Peptide binding; Short linear motifs; Mini-motifs; SLiMs
Autism spectrum disorder (ASD) is a highly heritable disorder of complex and heterogeneous aetiology. It is primarily characterized by altered cognitive ability including impaired language and communication skills and fundamental deficits in social reciprocity. Despite some notable successes in neuropsychiatric genetics, overall, the high heritability of ASD (~90%) remains poorly explained by common genetic risk variants. However, recent studies suggest that rare genomic variation, in particular copy number variation, may account for a significant proportion of the genetic basis of ASD. We present a large scale analysis to identify candidate genes which may contain low-frequency recessive variation contributing to ASD while taking into account the potential contribution of population differences to the genetic heterogeneity of ASD. Our strategy, homozygous haplotype (HH) mapping, aims to detect homozygous segments of identical haplotype structure that are shared at a higher frequency amongst ASD patients compared to parental controls. The analysis was performed on 1,402 Autism Genome Project trios genotyped for 1 million single nucleotide polymorphisms (SNPs). We identified 25 known and 1,218 novel ASD candidate genes in the discovery analysis including CADM2, ABHD14A, CHRFAM7A, GRIK2, GRM3, EPHA3, FGF10, KCND2, PDZK1, IMMP2L and FOXP2. Furthermore, 10 of the previously reported ASD genes and 300 of the novel candidates identified in the discovery analysis were replicated in an independent sample of 1,182 trios. Our results demonstrate that regions of HH are significantly enriched for previously reported ASD candidate genes and the observed association is independent of gene size (odds ratio 2.10). Our findings highlight the applicability of HH mapping in complex disorders such as ASD and offer an alternative approach to the analysis of genome-wide association data.
Electronic supplementary material
The online version of this article (doi:10.1007/s00439-011-1094-6) contains supplementary material, which is available to authorized users.
Intrinsically disordered regions are enriched in short interaction motifs that play a critical role in many protein-protein interactions. Since new short interaction motifs may easily evolve, they have the potential to rapidly change protein interactions and cellular signaling. In this work we examined the dynamics of gain and loss of intrinsically disordered regions in duplicated proteins to inspect if changes after genome duplication can create functional divergence. For this purpose we used Saccharomyces cerevisiae and the outgroup species Lachancea kluyveri.
We find that genes duplicated as part of a genome duplication (ohnologs) are significantly more intrinsically disordered than singletons (p<2.2e-16, Wilcoxon), reflecting a preference for retaining intrinsically disordered proteins in duplicate. In addition, there have been marked changes in the extent of intrinsic disorder following duplication. A large number of duplicated genes have more intrinsic disorder than their L. kluyveri ortholog (29% for duplicates versus 25% for singletons) and an even greater number have less intrinsic disorder than the L. kluyveri ortholog (37% for duplicates versus 25% for singletons). Finally, we show that the number of physical interactions is significantly greater in the more intrinsically disordered ohnolog of a pair (p = 0.003, Wilcoxon).
This work shows that intrinsic disorder gain and loss in a protein is a mechanism by which a genome can also diverge and innovate. The higher number of interactors for proteins that have gained intrinsic disorder compared with their duplicates may reflect the acquisition of new interaction partners or new functional roles.
Gene and protein interactions are commonly represented as networks, with the genes or proteins comprising the nodes and the relationship between them as edges. Motifs, or small local configurations of edges and nodes that arise repeatedly, can be used to simplify the interpretation of networks.
We examined triplet motifs in a network of quantitative epistatic genetic relationships, and found a non-random distribution of particular motif classes. Individual motif classes were found to be associated with different functional properties, suggestive of an underlying biological significance. These associations were apparent not only for motif classes, but for individual positions within the motifs. As expected, NNN (all negative) motifs were strongly associated with previously reported genetic (i.e. synthetic lethal) interactions, while PPP (all positive) motifs were associated with protein complexes. The two other motif classes (NNP: a positive interaction spanned by two negative interactions, and NPP: a negative spanned by two positives) showed very distinct functional associations, with physical interactions dominating for the former but alternative enrichments, typical of biochemical pathways, dominating for the latter.
We present a model showing how NNP motifs can be used to recognize supportive relationships between protein complexes, while NPP motifs often identify opposing or regulatory behaviour between a gene and an associated pathway. The ability to use motifs to point toward underlying biological organizational themes is likely to be increasingly important as more extensive epistasis mapping projects in higher organisms begin.
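The four triplet motif classes described above (NNN, NNP, NPP, PPP) are defined by the signs of the three edges of a closed triangle, so they can be counted by enumerating node triples in a signed network. The following is a minimal sketch under that reading; the function name, the edge encoding ("N" for negative, "P" for positive) and the toy network are hypothetical, and the positional analysis mentioned in the abstract is not attempted.

```python
from itertools import combinations


def classify_triplets(edges):
    """Count closed triplets by sign class.

    `edges` maps an unordered node pair (given as a 2-tuple) to "N" or "P".
    """
    sign = {frozenset(pair): s for pair, s in edges.items()}
    nodes = sorted({node for pair in sign for node in pair})
    counts = {"NNN": 0, "NNP": 0, "NPP": 0, "PPP": 0}
    for a, b, c in combinations(nodes, 3):
        keys = [frozenset((a, b)), frozenset((a, c)), frozenset((b, c))]
        if all(k in sign for k in keys):  # only fully connected triples form a motif
            label = "".join(sorted(sign[k] for k in keys))  # "N" sorts before "P"
            counts[label] += 1
    return counts


# Toy epistasis network, invented for illustration.
toy_edges = {
    ("a", "b"): "N",
    ("b", "c"): "N",
    ("a", "c"): "P",
    ("c", "d"): "P",
    ("a", "d"): "P",
}
print(classify_triplets(toy_edges))
```

Enumerating all node triples is cubic in the number of nodes, which is adequate for a toy example; a genome-scale network would call for iterating over neighbours of each edge instead.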
Milk proteins are required to proceed through a variety of conditions of radically varying pH, which are not identical across mammalian digestive systems. We wished to investigate if the shifts in these requirements have resulted in marked changes in the isoelectric point and charge of milk proteins during evolution.
We investigated nine major milk proteins in 13 mammals. In comparison with a group of orthologous non-milk proteins, we found that three proteins (κ-casein, lactadherin, and muc1) have undergone the highest change in isoelectric point during evolution. The pattern of non-synonymous substitutions indicates that selection has played a role in the isoelectric point shift, since residues that show significant evidence of positive selection are much more likely to be charged (p = 0.03 for κ-casein; p < 10⁻⁸ for muc1). However, this selection does not appear to be solely due to adaptation to the diversity of mammalian digestive systems, since striking changes are seen among species that resemble each other in terms of their digestion.
The changes in charge are most likely due to changes of other protein functions, rather than an adaptation to the different mammalian digestive systems. These functions may include differences in bioactive peptide releases in the gut between different mammals, which are known to be a major contributing factor in the functional and nutritional value of mammalian milk. This raises the question of whether bovine milk is optimal in terms of particular protein functions, for human nutrition and possibly disease resistance.
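For readers unfamiliar with how an isoelectric point follows from sequence, the sketch below estimates net charge from the Henderson–Hasselbalch relation and finds the pH at which it crosses zero by bisection. The pKa values are illustrative assumptions (published scales differ), the function names and test peptides are hypothetical, and this is not claimed to be the method used in the study.

```python
# Illustrative pKa values for ionisable groups (an assumption; published scales differ).
PKA_POSITIVE = {"N_term": 8.6, "K": 10.8, "R": 12.5, "H": 6.5}
PKA_NEGATIVE = {"C_term": 3.6, "D": 3.9, "E": 4.1, "C": 8.5, "Y": 10.1}


def net_charge(sequence, ph):
    """Approximate net charge of a peptide at a given pH."""
    positive = ["N_term"] + [aa for aa in sequence if aa in "KRH"]
    negative = ["C_term"] + [aa for aa in sequence if aa in "DECY"]
    # Fraction protonated (charged) for basic groups, fraction deprotonated for acidic ones.
    charge = sum(1.0 / (1.0 + 10 ** (ph - PKA_POSITIVE[g])) for g in positive)
    charge -= sum(1.0 / (1.0 + 10 ** (PKA_NEGATIVE[g] - ph)) for g in negative)
    return charge


def isoelectric_point(sequence, tolerance=1e-4):
    """Find the pH of zero net charge by bisection on the interval 0 to 14."""
    low, high = 0.0, 14.0
    while high - low > tolerance:
        mid = (low + high) / 2.0
        if net_charge(sequence, mid) > 0:
            low = mid  # still positively charged, so the pI lies at higher pH
        else:
            high = mid
    return (low + high) / 2.0


# Toy peptides: one lysine-rich, one aspartate-rich.
for peptide in ("AKKKA", "ADDDA"):
    print(peptide, round(isoelectric_point(peptide), 2))
```

Because net charge decreases monotonically with pH in this model, bisection is sufficient; a substitution that replaces a neutral residue with a charged one shifts the crossing point, which is the kind of change the abstract attributes to selection.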
This article was reviewed by Fyodor Kondrashov, David Liberles (nominated by David Ardell), and Christophe Lefevre (nominated by Mark Ragan).
Short, linear motifs (SLiMs) play a critical role in many biological processes. The SLiMSearch 2.0 (Short, Linear Motif Search) web server allows researchers to identify occurrences of a user-defined SLiM in a proteome, using conservation and protein disorder context statistics to rank occurrences. User-friendly output and visualizations of motif context allow the user to quickly gain insight into the validity of a putatively functional motif occurrence. For each motif occurrence, overlapping UniProt features and annotated SLiMs are displayed. Visualization also includes annotated multiple sequence alignments surrounding each occurrence, showing conservation and protein disorder statistics in addition to known and predicted SLiMs, protein domains and known post-translational modifications. In addition, enrichment of Gene Ontology terms and of protein interaction partners is provided as an indicator of possible motif function. All web server results are available for download. Users can search motifs against the human proteome or a subset thereof defined by UniProt accession numbers or GO term. The SLiMSearch server is available at: http://bioware.ucd.ie/slimsearch2.html.
Previous relatively small studies have associated particular amino acid replacements and deletions in the HIV-1 nef gene with differences in the rate of HIV disease progression. We tested more rigorously whether particular nef amino acid differences and deletions are associated with HIV disease progression. Amino acid replacements and deletions in patients' consensus sequences were investigated for 153 progressor (P), 615 long-term nonprogressor (LTNP), and 2,311 unknown progressor sequences from 582 subtype B HIV-infected patients. LTNPs had more defective nefs (interrupted by frameshifts or stop codons), but on a per-patient basis there was no excess of LTNP patients with one or more defective nef sequences compared to the Ps (P = 0.47). The high frequency of amino acid replacement at residues S8, V10, I11, A15, V85, V133, N157, S163, V168, D174, R178, E182, and R188 in LTNPs was also seen in permuted datasets, implying that these are simply rapidly evolving residues. Permutation testing revealed that residues showing the greatest excess over expectation (A15, V85, N157, S163, V168, D174, R178, and R188) were not significant (P = 0.77). Exploratory analysis suggested a hypothetical excess of frameshifting in the regions 9SVIG and 118QGYF among LTNPs. The regions V10 and 152KVEEA of nef were commonly deleted in LTNPs. However, permutation testing indicated that none of the regions displayed significantly excessive deletion in LTNPs. In conclusion, meta-analysis of HIV-1 nef sequences provides no clear evidence that defective nef sequences or particular regions of the protein play a significant role in disease progression.
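The general idea behind the permutation testing mentioned above can be sketched in Python as a generic label-permutation test. This is an assumption-laden illustration rather than the procedure used in the study: the function name `permutation_p_value`, the one-sided test statistic (difference in group means of a per-patient 0/1 indicator) and the toy data are all hypothetical.

```python
import random
from statistics import mean


def permutation_p_value(group_a, group_b, n_permutations=2000, seed=0):
    """One-sided permutation p-value for mean(group_a) > mean(group_b).

    Each value is a per-patient indicator (1 = feature present,
    0 = absent). Group labels are shuffled to build the null
    distribution of the difference in means.
    """
    rng = random.Random(seed)
    observed = mean(group_a) - mean(group_b)
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    at_least_as_extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        permuted_diff = mean(pooled[:n_a]) - mean(pooled[n_a:])
        if permuted_diff >= observed:
            at_least_as_extreme += 1
    # Add one to numerator and denominator so the estimate is never zero.
    return (at_least_as_extreme + 1) / (n_permutations + 1)


# Hypothetical indicators: 1 if a patient carries a given replacement.
ltnp_patients = [1, 0, 1, 1, 0, 1, 0, 1]
progressor_patients = [0, 1, 0, 1, 0, 0, 1, 0]
print(permutation_p_value(ltnp_patients, progressor_patients))
```

Testing many residues at once, as in the abstract, would additionally require comparing the observed excess with the largest excess seen in each permuted dataset, which is not shown here.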
Short, linear motifs (SLiMs) play a critical role in many biological processes, particularly in protein–protein interactions. The Short, Linear Motif Finder (SLiMFinder) web server is a de novo motif discovery tool that identifies statistically over-represented motifs in a set of protein sequences, accounting for the evolutionary relationships between them. Motifs are returned with an intuitive P-value that greatly reduces the problem of false positives and is accessible to biologists of all disciplines. Input can be uploaded by the user or extracted directly from UniProt. Numerous masking options give the user great control over the contextual information to be included in the analyses. The SLiMFinder server combines these with user-friendly output and visualizations of motif context to allow the user to quickly gain insight into the validity of a putatively functional motif. These visualizations include alignments of motif occurrences, alignments of motifs and their homologues and a visual schematic of the top-ranked motifs. Returned motifs can also be compared with known SLiMs from the literature using CompariMotif. All results are available for download. The SLiMFinder server is available at: http://bioware.ucd.ie/slimfinder.html.
Large datasets of protein interactions provide a rich resource for the discovery of Short Linear Motifs (SLiMs) that recur in unrelated proteins. However, existing methods for estimating the probability of motif recurrence may be biased by the size and composition of the search dataset, such that p-value estimates from different datasets, or from motifs containing different numbers of non-wildcard positions, are not strictly comparable. Here, we develop more exact methods and explore the potential biases of computationally efficient approximations.
A widely used heuristic for the calculation of motif over-representation approximates motif probability by assuming that all proteins have the same length and composition. We introduce pv, which calculates the probability exactly. Secondly, the recently introduced SLiMFinder statistic Sig accounts for multiple testing (across all possible motifs) in motif discovery. However, it approximates the probability of all other possible motifs, occurring with a score of p or less, as being equal to p. Here, we show that the exhaustive calculation of the probability of all possible motif occurrences that are as rare or rarer than the motif of interest, Sig', may be carried out efficiently by grouping motifs of a common probability (i.e. those which have permuted orders of the same residues). Sig'v, which corrects both approximations, is shown to be uniformly distributed in a random dataset when searching for non-ambiguous motifs, indicating that it is a robust significance measure.
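One way the equal-length, equal-composition heuristic might be sketched in Python is shown below. It is an illustration under stated assumptions, not the pv, Sig or Sig' calculations of the paper: the function names, the uniform residue frequencies and the example numbers are hypothetical, and positions within a protein are treated as independent.

```python
from math import comb, prod


def site_probability(motif, freqs):
    """Probability that the motif matches at one fixed position.

    `motif` is a list of strings, one per position, each listing the
    residues allowed there (e.g. ["G", "AG"] for G[AG]).
    """
    return prod(sum(freqs[aa] for aa in allowed) for allowed in motif)


def protein_hit_probability(p_site, protein_length, motif_length):
    """Probability of one or more matches in a protein of the given
    length, treating each start position as independent (assumption)."""
    n_sites = max(protein_length - motif_length + 1, 0)
    return 1.0 - (1.0 - p_site) ** n_sites


def support_tail_probability(k, n_proteins, p_protein):
    """Binomial probability that k or more of n proteins contain the
    motif, assuming every protein has the same hit probability."""
    return sum(
        comb(n_proteins, i) * p_protein**i * (1.0 - p_protein) ** (n_proteins - i)
        for i in range(k, n_proteins + 1)
    )


# Hypothetical example: uniform residue frequencies, motif "KK",
# ten proteins all assumed to be 300 residues long.
uniform = {aa: 0.05 for aa in "ACDEFGHIKLMNPQRSTVWY"}
p_site = site_probability(["K", "K"], uniform)
p_protein = protein_hit_probability(p_site, 300, 2)
print(support_tail_probability(3, 10, p_protein))
```

Because every protein is given the same hit probability, a single binomial tail suffices. Dropping that assumption means each protein has its own probability, which is the situation the exact calculation described above is meant to address.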
A method is presented to compute exactly the true probability of a non-ambiguous short protein sequence motif, and the utility of an approximate approach for novel motif discovery across a large number of datasets is demonstrated.
Motivation: Pairwise experimental perturbation is increasingly used to probe gene and protein function because these studies offer powerful insight into the activity and regulation of biological systems. Symmetric two-dimensional datasets, such as pairwise genetic interactions, are amenable to an optimally designed measurement procedure because of the equivalence of cases and conditions where fewer experimental measurements may be required to extract the underlying structure.
Results: We show that optimal experimental design can provide improvements in efficiency when collecting data in an iterative manner. We develop a method built on a statistical clustering model for symmetric data and the Fisher information uncertainty estimates, and we also provide simple heuristic approaches that have comparable performance. Using yeast epistatic miniarrays as an example, we show that correct assignment of the major subnetworks could be achieved with <50% of the measurements in the complete dataset. Optimization is likely to become critical as pairwise functional studies extend to more complex mammalian systems where all-by-all experiments are currently intractable.
Supplementary data are available at Bioinformatics online.