Minimotif Function Elements
A disambiguated and extensible semantic basis for minimotif functionality was derived from a set of rules that characterizes the approximately 5000 minimotifs in the Minimotif Miner (MnM) database [1] without information loss. We have not created a formal grammar, but rather a set of rules that characterize minimotif descriptions. For any minimotif clause, the syntax is
Minimotif (subject),
Activity (verb), and
Target (object) which can be derived from a set of rules. We define these three major elements as follows:
Minimotifs consist of sequence definitions and sources. The sequence definition can be an instance, a consensus sequence, or a PSSM; all three classes of minimotifs are commonly reported in the literature. Instances represent primary data, whereas consensus sequences and PSSMs are interpretations of the data.
Minimotifs may require one or more post-translational modifications such as phosphorylation or proline isomerization. In each motif, these modifications can be described by one or more residue names, type(s) of modification, and position(s) in the Minimotif sequence. Another approach for modeling residue modifications could be the atomic model previously described [22]. A source is the protein or peptide that contains the minimotif sequence. For example, in '[PKTPAK in Kalirin] [binds] [Crk]', 'PKTPAK' is a sequence definition and 'Kalirin' is the minimotif source [23]. Alternatively, PxxPxK is a consensus definition that summarizes multiple instances.
Targets are proteins, nucleic acids, carbohydrates, lipids, small molecules, elements, metals, drugs, or complexes. In the case of proteins and nucleic acids, Targets may be associated with sequence definitions. Target proteins may contain domains as defined by the Conserved Domain Database [24], belong to a hierarchical classification based on fold [25], or refer to determined structure elements [26]. In the above example of the PKTPAK minimotif, the Target 'Crk' can be made more specific as '1st SH3 domain of Crk', referring to the N-terminal of the two SH3 domains in Crk.
Activities are the actions of minimotifs, and all minimotif activities can be generally classified as binds, modifies, or traffics. The 'Binds' Activity describes an interaction of a protein containing a minimotif with another molecule. The 'Modifies' Activity defines a chemical change to a minimotif sequence and can be further subcategorized into enzymatic activities such as phosphorylates, amidates, geranylgeranylates, cleaves, etc. The 'Traffics' Activity describes minimotif sequences required for a protein to be shuttled between cell compartments or other specific locations within or outside of cells.
In a number of minimotifs, a Minimotif and Activity are known, but the Target has not yet been identified, or it is not yet known whether the interaction of the Minimotif with the Target is direct. This information is still useful, so we use a 'Required' Activity category, which indicates that a minimotif sequence is necessary for a molecular or cellular activity. For example, the PNAY minimotif in Crk is required for Abl kinase activation [27]. In this case, Abl kinase activation is a subcategory of 'Required'. As in this example, the Target is null for the 'Required' Activity.
Minimotif Syntax
To combine these major minimotif elements and the minimotif sequence definition into human-interpretable semantic sentences, we have defined 22 different attributes of minimotifs (Table 1) and derived the set of syntax rules listed below. Our goal was to identify a minimal set of rules that combine minimotif elements in order to regenerate valid minimotif sentences for the ~5000 minimotifs in the Minimotif Miner database. Valid minimotif sentences are based on these syntax rules and on biological entity categories of open-ended size (e.g. protein domains, protein names, molecule names).
| Table 1. Attributes of a minimotif definition |
Syntax Rules
Format: Minimotif elements in quotes are variable and defined in Table 1. Additional definitions are shown in Table 2. Bold text does not change and italicized elements are optional. Each minimotif function conforms to one of four rules (bind, modification, traffic, required).
| Table 2. Definitions of minimotif elements |
'Minimotif' = 'Minimotif Sequence' ('Required Modification') in 'Peptide' OR 'Protein'
'Protein target' = 'Domain position' 'domain' domain of 'Protein'
'Target' = 'Molecule' OR 'Protein target'
'Required modification' = 'Amino acid' 'Position' residue is 'posttranslational modification'
'Activity modification' = 'Amino acid' 'Position' residue is 'posttranslational modification'
BIND RULE: 'Minimotif' binds 'Target'
MODIFICATION RULE: 'Minimotif' is modified by the 'enzyme activity' of the 'Protein target' ('activity modification').
TRAFFIC RULE: 'Minimotif' is trafficked by 'Target' to 'Cellular compartment' OR 'Minimotif' is trafficked to 'Cellular compartment'
REQUIRED RULE: 'Minimotif' is required for 'Chemical Process' OR 'Cellular Process'
Syntax Examples
BIND RULE: [IL]xxxxNPxY (tyrosine 497 residue is phosphorylated) in Interleukin 4 receptor binds PTB domain of IRS-1 [28].
MODIFICATION RULE: GRG in myelin basic protein is modified by the N arginine methylation activity of PRMT1 (Arginine 107 is methylated) [29].
TRAFFIC RULE: WHTL in Synaptotagmin is trafficked to synaptic vesicles [30].
REQUIRED RULE: GKFC in peptide is required for cell adhesion [31].
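The four rules can be read as sentence templates. The following Python sketch is illustrative only and is not the MnM implementation: the class and function names are hypothetical, and it assumes that the 'enzyme activity' element is passed as a complete phrase (e.g. "N arginine methylation activity") and that optional elements are simply omitted when absent.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Minimotif:
    """Hypothetical container for the 'Minimotif' element of a sentence."""
    sequence: str                                 # instance or consensus
    source: str                                   # protein name, or "peptide"
    required_modification: Optional[str] = None   # optional element

    def phrase(self) -> str:
        # 'Minimotif Sequence' ('Required Modification') in 'Peptide' OR 'Protein'
        mod = f" ({self.required_modification})" if self.required_modification else ""
        return f"{self.sequence}{mod} in {self.source}"


def bind_sentence(motif: Minimotif, target: str) -> str:
    # BIND RULE
    return f"{motif.phrase()} binds {target}"


def modification_sentence(motif: Minimotif, enzyme_activity: str,
                          protein_target: str,
                          activity_modification: Optional[str] = None) -> str:
    # MODIFICATION RULE; the trailing parenthetical is optional
    tail = f" ({activity_modification})" if activity_modification else ""
    return (f"{motif.phrase()} is modified by the {enzyme_activity} "
            f"of {protein_target}{tail}")


def traffic_sentence(motif: Minimotif, compartment: str,
                     target: Optional[str] = None) -> str:
    # TRAFFIC RULE; the Target need not be known
    by = f" by {target}" if target else ""
    return f"{motif.phrase()} is trafficked{by} to {compartment}"


def required_sentence(motif: Minimotif, process: str) -> str:
    # REQUIRED RULE; the Target is null
    return f"{motif.phrase()} is required for {process}"


il4r = Minimotif("[IL]xxxxNPxY", "Interleukin 4 receptor",
                 "tyrosine 497 residue is phosphorylated")
print(bind_sentence(il4r, "PTB domain of IRS-1"))
print(traffic_sentence(Minimotif("WHTL", "Synaptotagmin"), "synaptic vesicles"))
```

Keeping each rule as a separate function mirrors the statement that every minimotif function conforms to exactly one of the four rules.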
Minimotif Model and Implementation
The minimotif syntax was abstracted as a conceptual data model, which was used to derive logical and physical data models. An entity-relationship (ER) diagram of our conceptual data model is shown in Figure . The primary objects in the ER diagram are the Minimotif (green), Activities (orange), and Target (cyan), each of which contains details regarding its attributes. Each Minimotif has a sequence and may have a modification (e.g. tyrosine phosphorylation in the BIND RULE example). All Minimotifs are in proteins, which may have orthologues and domains. Each Minimotif can have a Target, which is a molecule (proteins, nucleic acids, and small molecules are molecules; cyan). Molecules are in cell compartments. The Target has two relationships with the Minimotif (orange). 'Modifies' refers to a change in the chemistry of the Minimotif, so the Target is an enzyme in this case (MODIFICATION RULE). For example, a Minimotif that is cut by a protease is chemically modified by an enzyme. The Target can also bind the Minimotif (BIND RULE). In the case where a Target molecule is not known, the Minimotif may be required for some Activity, as in the REQUIRED RULE above. The TRAFFIC RULE is not represented in this diagram, but a Minimotif is trafficked by a Target from one cell compartment to another; the Target need not be known for the TRAFFIC RULE.
The physical implementation of the database is shown in Figure . The design of the minimotif relational database shows an intersection table (motif_source) of the Minimotif, Activity, and Target tables. Each minimotif in the database table has its own specific attributes such as minimotif type (consensus sequence or instance), a structure from the Protein Data Bank, an affinity for the Minimotif/Target complex, and published experimental techniques that support the Minimotif/Activity/Target relationship.
We have previously reported the MnM 2 database, which contains more than 5000 minimotifs [2]. We have now refactored the MnM 2 database to use controlled vocabularies. These include the Gene Ontology (GO; the Activity term names and IDs for common molecular functions), NCBI Taxonomy (species names and IDs), NCBI Conserved Domain Database (CDD; the names and identifiers for protein domains in motif Targets), NCBI Reference Sequences (RefSeq; Target and Minimotif source protein names and IDs), Human Proteome Organization (HUPO; experimental evidence names and IDs), PSI-MOD (post-translational modifications of Minimotifs), and the Protein Data Bank (PDB; accession numbers for protein structure files). The new relational database that uses these controlled vocabularies enforces, normalizes, integrates, and explicitly defines the minimotif semantics. Details concerning the database are in Methods.
The minimotifs in the Minimotif Miner (MnM) database were refactored and implemented in MnM 2 [2]. Our implementation of this model supports an integrative, semantically rich minimotif analysis via the Structured Query Language (SQL) and, importantly, is compatible with external motif analysis algorithms. This implementation enables extraction of groups of Minimotifs that share common values for any subset or combination of subsets of the 22 different attributes in the model (Table 1). A set of 10 rules can be used to regenerate structured, unambiguous, human-readable annotations [see Additional file 1].
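As a rough illustration of how an intersection table supports attribute-based extraction, the sketch below builds a toy in-memory SQLite database and filters minimotifs by type, Activity, and Target domain. Only the table name motif_source comes from the text; every other table name, column name, and the query itself are assumptions made for this example and do not reflect the actual MnM 2 schema. The rows are the examples quoted earlier in this section.

```python
import sqlite3

# Hypothetical, heavily simplified schema (not the MnM 2 schema).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE minimotif (id INTEGER PRIMARY KEY, sequence TEXT,
                        source TEXT, motif_type TEXT);
CREATE TABLE activity  (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE target    (id INTEGER PRIMARY KEY, protein TEXT, domain TEXT);
CREATE TABLE motif_source (
    minimotif_id INTEGER REFERENCES minimotif(id),
    activity_id  INTEGER REFERENCES activity(id),
    target_id    INTEGER REFERENCES target(id)  -- NULL for 'required'
);
""")
conn.executemany("INSERT INTO minimotif VALUES (?, ?, ?, ?)", [
    (1, "PKTPAK", "Kalirin", "instance"),
    (2, "[IL]xxxxNPxY", "Interleukin 4 receptor", "consensus"),
    (3, "PNAY", "Crk", "instance"),
])
conn.executemany("INSERT INTO activity VALUES (?, ?)",
                 [(1, "binds"), (2, "required")])
conn.executemany("INSERT INTO target VALUES (?, ?, ?)",
                 [(1, "Crk", "SH3"), (2, "IRS-1", "PTB")])
conn.executemany("INSERT INTO motif_source VALUES (?, ?, ?)",
                 [(1, 1, 1), (2, 1, 2), (3, 2, None)])


def binders(conn, domain, motif_type):
    """Return sequences of the given type that bind a Target domain."""
    rows = conn.execute("""
        SELECT m.sequence
        FROM motif_source ms
        JOIN minimotif m ON m.id = ms.minimotif_id
        JOIN activity  a ON a.id = ms.activity_id
        JOIN target    t ON t.id = ms.target_id
        WHERE a.name = 'binds' AND t.domain = ? AND m.motif_type = ?
        ORDER BY m.sequence
    """, (domain, motif_type)).fetchall()
    return [row[0] for row in rows]


print(binders(conn, "SH3", "instance"))
```

Because the 'required' row has a NULL target, the inner join on target drops it, so minimotifs without a known Target never appear among the binders.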
We have built a user interface that enables users to query this database. This webpage is available as a link from the MnM 2 website. Users can select identifiers or text-based descriptions from controlled vocabularies to query the database. For example, all SH3 binding motifs can be identified by selecting this domain from the CDD controlled vocabulary for domains [24]. Many minimotif attributes can be queried from this page.
Once the query system is used to retrieve and group primary minimotif data (instances), interpretations of this data are often the next step in minimotif analysis. The interpretations of this data most commonly reported in the literature are consensus sequences, PSSMs, and groupings of families of minimotifs; these can be automatically generated based on query results generated by the aforementioned query system.
Often a single laboratory does an experiment that identifies a consensus sequence, PSSM or grouping. MnM stores individual instances as reported in the literature, as well as inferred consensus sequences as reported by the authors. Our new query page has the advantage that consensus sequences, PSSMs or families of motifs can be generated from user-selected instances from one or more independent studies. Thus, this tool can be used to study groupings, consensus sequences, and PSSMs, which can vary significantly between different studies. Once groupings of instances are selected from the new query page, users can then generate consensus sequences or PSSMs.
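To make the step from selected instances to an interpretation concrete, the sketch below derives per-position residue frequencies and a simple consensus from a group of instances. It is not the algorithm used by the query page. It assumes the instances are already aligned and of equal length, it writes 'x' for any position that falls below a conservation threshold, and it stops at raw frequencies rather than the log-odds scores of a full PSSM. The second peptide in the usage example is invented for illustration.

```python
from collections import Counter


def position_frequencies(instances):
    """Per-position residue frequencies for pre-aligned instances."""
    lengths = {len(seq) for seq in instances}
    if len(lengths) != 1:
        raise ValueError("instances must be non-empty and of equal length")
    n = len(instances)
    return [
        {residue: count / n for residue, count in Counter(column).items()}
        for column in zip(*instances)
    ]


def consensus(instances, threshold=1.0):
    """Keep a residue where its frequency meets the threshold, else 'x'."""
    letters = []
    for freqs in position_frequencies(instances):
        residue, freq = max(freqs.items(), key=lambda item: item[1])
        letters.append(residue if freq >= threshold else "x")
    return "".join(letters)


# "PKTPAK" is the Kalirin instance from the text; "PAAPSK" is a made-up
# peptide used only to show how degenerate positions collapse to 'x'.
print(consensus(["PKTPAK", "PAAPSK"]))
```

Lowering the threshold below 1.0 lets a position keep its majority residue, which is one way groupings from independent studies could yield different consensus sequences.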
Grouping SH3 Domain Binding Minimotifs
A standardized minimotif syntax and query system is expected to offer many advantages. One such advantage is the simplified clustering of data within the database based on these new syntactical rules. As a case example, we classified 1363 SH3 binding minimotifs queried from the MnM 2 database. We selected this collection of data because of both the large number of reported SH3 binding minimotifs and the growing number of reported consensus sequences (e.g. PxxP, RxxPxxP, and PxxPxx[KR]). We posed a number of questions that would have been difficult to address without the syntax, but which are now easily addressed by querying the new relational database: Which SH3 consensus sequences are most common? How many SH3 binding consensus sequences are present in different instances? Do SH3 minimotifs bind to the same site? Is there a residue preference for degenerate positions?
A number of these questions had already been answered in an ad hoc fashion, but our goal in this case study was to address these questions in a systematic manner. Additional details for this analysis are provided [see Additional file 1].
The groups of SH3 binders were extracted by custom SQL statements filtering Minimotifs by type (consensus vs. instance), Target (SH3-containing proteins), and Activity (binds). This resulted in 1363 (741 unique) SH3 binding minimotifs, which could further be segregated into 69 consensus sequences and 672 instances. These sequences were compared inside our database for similarity based on the Shannon Information Content similarity metric as implemented by the Comparimotif library [32]. This analysis resulted in 10 minimotif groups that describe all SH3 binding minimotifs in the database (Figure ). Details concerning the clustering analysis, queries, and results that led to the distinct minimotif groups are provided [see Additional file 1].
Structural analysis of SH3 ligands
In order to better understand how these 10 SH3 binding minimotif groups were related to each other, we analyzed their known SH3/ligand complex structures. We queried the Minimotif Miner database and located representative structures for eight of the 10 groups. The fit function of Molmol was used to align the backbones of the eight SH3 domains using 6 residues in the β1 sheet, 4 residues in the 3-10 helix and 6 residues in the β4 sheet [33]. The root mean squared deviation (RMSD) for alignment of the backbone residues in these regions was 0.9 Å, indicating a good alignment (Figure ). We then examined the relationships of the binding sites of the different minimotifs by adding the sidechain bonds of the conserved residue positions and backbone atoms for each minimotif. For two structures we were only able to identify the binding sites based on nuclear magnetic resonance chemical shift mapping experiments [34,35].
Our analysis revealed that although SH3 domains are most commonly discussed for their ability to bind PxxP-containing peptides, members of the SH3 domain family bind several different consensus sequences and have specialized structural interfaces. Of the 10 minimotif groups, many used different binding pockets on the SH3 domain. Four minimotifs bound in a similar region to the standard PxxP binding site (RxxPxxP, BxxB, PxxxPR, and KPTVY). The BxxB (B = basic) minimotif shares only one of two binding pockets with PxxP, as previously noted [36,37]. Two of the motifs (RxxPxxP and PxxxPR) were found to bind in two different orientations, with the peptides flipped ~180° in the binding sites. Two other consensus sequences bound previously identified alternative sites not near the PxxP site, and two had no structural information. This analysis confirms the distinction of the minimotif clusters derived by the sequence-based analysis.
Most SH3 domain binding peptides have multiple consensus sequences
Until recently, BxxB, PxxxPR, and several other types of SH3 binding minimotifs were not known. Given that there were 10 different types of SH3 binding consensus minimotifs, we wanted to know to what extent previously studied ligands had multiple consensus sequences. We designed a query (query 9) that assessed how many consensus sequences were present in each ligand, excluding the pairing of PxxP with RxxPxxP and PxxPx[KR] because these minimotifs are children of PxxP.
The average number of minimotif consensus sequences per SH3 ligand was 2.3, indicating a tendency for each ligand sequence to have multiple SH3 consensus sequences. In the most extreme examples, the SPTPPPVPRRGTHT, QPPVPSLPPRNIKP, KKPPPPVPKKPAKS, RRPPVPPR, and RRAPPPVPKKPAKG ligands each have five of the 10 different SH3 binding consensus sequences. For each consensus sequence, we have also reported the percent ambiguity in Figure , which is the percentage of each minimotif for which there are multiple consensus sequences. This analysis shows that a high proportion of previous SH3 binding experiments assessed ligands with the potential to have multiple ligand binding modes. Thus, the majority of SH3 binding data may be subject to ambiguous interpretation (Figure ). In interpreting many previous SH3 binding experiments, new ligand binding modes may now need to be considered. Our database contains only 50 of the 270 known human proteins with SH3 domains, thus the 10 SH3 minimotif groups we identified may become even more complex with a comprehensive analysis of all SH3 domains.
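The counting step can be sketched as a pattern scan over each ligand. The code below is an approximation written for illustration, not query 9 itself. It uses only the consensus sequences named in this section rather than all 10 groups, so its counts will not reproduce the reported figures. It also assumes that 'x' matches any residue, that 'B' expands to [KR], and that PxxP is not counted when one of its children matches.

```python
import re

# Subset of SH3 consensus sequences named in the text (not all 10 groups).
CONSENSUS = ["PxxP", "RxxPxxP", "PxxPx[KR]", "BxxB", "PxxxPR", "KPTVY"]

# Parent -> children, following the exclusion described in the text.
CHILDREN_OF = {"PxxP": {"RxxPxxP", "PxxPx[KR]"}}


def to_regex(consensus):
    """Translate a consensus string into a regular expression.

    Assumes 'x' is any residue and 'B' is a basic residue (K or R).
    """
    return re.compile(consensus.replace("x", ".").replace("B", "[KR]"))


def matching_consensus(ligand):
    """List the consensus sequences found in a ligand, dropping a parent
    consensus whenever one of its children is also present."""
    hits = [c for c in CONSENSUS if to_regex(c).search(ligand)]
    for parent, children in CHILDREN_OF.items():
        if parent in hits and children.intersection(hits):
            hits.remove(parent)
    return hits


for ligand in ["RRPPVPPR", "PKTPAK"]:
    found = matching_consensus(ligand)
    print(ligand, len(found), found)
```

Averaging the length of the returned list over all ligands in a query result would give a per-ligand figure analogous to the one reported above, subject to the same caveats about the incomplete pattern list.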
All SH3 domain binding peptides have basic residues
To further characterize the SH3 binding landscape, we analyzed the residue content of all SH3 ligands using queries as described in Methods. Compositional analysis showed a high preference for proline (4.2-fold), arginine (1.7-fold), and lysine (1.8-fold) (Table 3). In fact, all SH3 ligands in the database contained either a lysine or arginine, suggesting that a positive charge may be an important factor in ligand binding to SH3 domains. Another study has previously suggested a role for positively charged residues in SH3 domain interactions [38]. Consistent with this observation, the least enriched residues in SH3 ligands were the negatively charged residues.
| Table 3. Residue frequencies in SH3 domain ligands |
The overall average calculated charge of SH3-binding peptides in our database was +3.2 ± 1.4 (average length of 12.1 ± 3.1 residues); this calculation is based on summing the charges of basic and acidic residues assuming a neutral pH. Of nine other groups of minimotifs with common domain targets in MnM 2, only minimotifs for Calmodulin (n = 31) and 14-3-3 (n = 44) had net positive charges, of 3.0 ± 1.3 and 1.0 ± 0.9, respectively; PDZ (n = 1089), SH2 (n = 952), kinase (n = 206), PTB (n = 168), protease (n = 93), FHA (n = 67), WW (n = 27) and phosphatase (n = 25) domains had ligands or substrates with an average neutral or net negative charge.
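A charge estimate of this kind can be sketched in a few lines. The text does not state exactly which residues were counted, so the version below makes explicit assumptions: lysine and arginine contribute +1, aspartate and glutamate contribute -1, histidine and the peptide termini are treated as neutral, and the spread is reported as a sample standard deviation.

```python
from statistics import mean, stdev

# Assumed side-chain charges at neutral pH; histidine and the termini
# are treated as uncharged in this sketch.
CHARGE = {"K": 1, "R": 1, "D": -1, "E": -1}


def net_charge(peptide):
    """Sum the assumed charges of basic and acidic residues."""
    return sum(CHARGE.get(residue, 0) for residue in peptide.upper())


def charge_summary(peptides):
    """Mean and sample standard deviation of net charge for a group."""
    charges = [net_charge(p) for p in peptides]
    return mean(charges), stdev(charges)


# Two of the SH3 ligands named earlier in the text.
ligands = ["RRPPVPPR", "KKPPPPVPKKPAKS"]
print([net_charge(p) for p in ligands])
print(charge_summary(ligands))
```

Applying charge_summary to the peptides returned for each domain group would give the kind of group-level comparison described above, under the stated charge assumptions.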
Collectively, these query results strongly suggest that known SH3 peptide ligands have a more positive overall charge than proteins in the human proteome. It is important to note that when restricting the SH3 ligand query to non-BxxB sequences, the average ligand charge was still +2.2 ± 1.2. Only 11 of the 1363 sequences had a neutral or negative charge and several of these were for WxxxFxxLE and PxxDY minimotifs, which have few instances in the dataset.