In many countries, regulatory agencies have adopted safety guidelines, based on bioinformatics rules from the WHO/FAO and EFSA recommendations, to prevent potentially allergenic novel foods or agricultural products from reaching consumers. We created the Structural Database of Allergenic Proteins (SDAP, http://fermi.utmb.edu/SDAP/) to combine data that had previously been available only as flat files on Web pages or in the literature. SDAP was designed to be user friendly and of maximum use to regulatory agencies, clinicians, and scientists interested in assessing the potential allergenic risk of a protein. We developed methods, unique to SDAP, to compare the physicochemical properties of discrete areas of allergenic proteins to known IgE epitopes. We developed a new similarity measure, the property distance (PD) value, that can be used to detect related segments in allergens with clinically observed cross-reactivity. We have now expanded this work to obtain experimental validation of the PD index as a quantitative predictor of IgE cross-reactivity, by designing peptide variants with predetermined PD scores relative to known IgE epitopes. In complementary work we show how sequence motifs characteristic of allergenic proteins in protein families can be used as fingerprints for allergenicity.
It is well known that clinically important cross-reactivities among environmental triggers of allergy and asthma can be accounted for by proteins in those sources that have common molecular properties (Breiteneder and Ebner, 2000; Mari, 2001; Ferreira et al., 2004; Breiteneder and Mills, 2005; Jenkins et al., 2005). For example, the major allergenic proteins isolated from peanut (Burks et al., 1997; Shin et al., 1998; Rabjohn et al., 1999) have homologues in other foods that are known to elicit clinically significant responses in atopic individuals (Schein et al., 2005a), such as tree nuts (de Leon et al., 2003), soy (Eigenmann et al., 1996), and legumes (Lopez-Torrejon et al., 2003; Wensing et al., 2003). Pathogenesis response (PR) proteins of plants are another major group of protein families that are commonly found among allergens (Midoro-Horiuti et al., 2001). The cedar pollen allergen Jun a 3, classified as pathogenesis response protein group 5 (PR5) (Soman et al., 2000), was subsequently shown to be similar to allergenic proteins isolated from many different plants (Midoro-Horiuti et al., 2001; Elbez et al., 2002; Hoffmann-Sommergruber, 2002; Asensio et al., 2004). These included many food sources, such as cherries, bell pepper, apple and tomato (Ivanciuc et al., 2003a). Other cedar pollen allergens (Midoro-Horiuti et al., 1999a,b, 2003, 2006; Czerwinski et al., 2005; Varshney et al., 2007) were shown to be similar to allergens from other pollens, including birch (Fedorov et al., 1997; Ferreira et al., 1998; Spangfort et al., 1999) and grass (Lalla et al., 1996; Schramm et al., 1997; Petersen et al., 1998; Flicker et al., 2000).
There are now several databases that contain sequences and information about allergenic proteins, as reviewed in several publications (Hileman et al., 2002; Brusic et al., 2003; Gendel, 2004; Gendel and Jenkins, 2006; Goodman, 2006; Schein et al., 2007). Most of these databases are simply lists of allergenic proteins or sources with limited cross-indexing, including the IUIS (International Union of Immunological Societies) Website (http://www.allergen.org), AllAllergy (http://allallergy.net/), and the Biotechnology Information for Food Safety Database (National Center for Food Safety and Technology, http://www.iit.edu/~sgendel/fa.htm). Cross-indexed databases, more useful for identifying potentially cross-reacting allergens and their natural sources, include Allergome (http://www.allergome.org), CSL (Central Science Laboratory, UK, http://www.csl.gov.uk/allergen/index.htm), Protall (http://www.ifrn.bbsrc.ac.uk/protall/), and the FARRP database (http://allergenonline.com/asp/public/login.asp). Here, we will present details of our Structural Database of Allergenic Proteins (SDAP, http://fermi.utmb.edu/SDAP/) (Ivanciuc et al., 2002, 2003b), described as the “most ambitious of the molecular databases” in a recent review (Gendel and Jenkins, 2006), which is unique in that it contains several bioinformatics search tools beyond standard FASTA to identify cross-reactive allergens. We will discuss and highlight those specific features that have been recently added to SDAP (Ivanciuc et al., 2008a; Oezguen et al., 2008) and that are of particular relevance for regulatory purposes (Schein et al., 2007).
Current guidelines recommend the use of standard sequence comparison methods, such as BLAST or FASTA, to determine whether a protein could cause reactions in allergic individuals (Goodman, 2006). We have implemented these tools in SDAP, and one can now conveniently test their ability to: first, discriminate allergenic proteins based on their identity to officially recognized allergens, and secondly, distinguish proteins that should not be allergenic. As we show below, we have found the first task to be easier than the second one. Using the sequence information of known allergens as archived in SDAP, we used a large-scale statistical analysis to test the bioinformatics guidelines proposed by WHO and EFSA committees (WHO, 2000, 2001, 2003; EFSA, 2004). We found that, in seeking to identify all proteins that could be allergens, strict adherence to these guidelines would suggest eliminating about a third of all known proteins from our environment! Conundrums abounded in our results: proteins known to cause anaphylaxis in sensitive individuals, such as the tropomyosins of shrimp and other crustaceans, have high sequence identity to mammalian homologues that are not allergens (Schein et al., 2007). Our results emphasized that the nature of allergenicity was local, and that to identify the true allergenic potential of a protein one had to catalogue discrete areas of allergenic proteins that would bind IgE. Thus we included in SDAP a cross-referenced list of sequences known to bind IgE from patient sera, coupled with tools designed to compare their sequences to those of other allergens.
As the current bioinformatics guidelines for allergenicity, based on simple sequence comparisons, are far from optimal (van Ree et al., 2006; Schein et al., 2007; Goodman, 2008), we present alternative classification methods that use analysis of local sequence and structure to identify common features of allergenic proteins that distinguish them from related, non-allergenic proteins. Recent efforts to include structural information on the allergens in predicting cross-reactivity (Aalberse, 2007; Chapman et al., 2007; Bonds et al., 2008; Oezguen et al., 2008) require classification into discrete PFAM classes (Ivanciuc et al., 2008a; Radauer et al., 2008). In addition, we developed and validated a PD (“physicochemical property distance”) scale (Ivanciuc et al., 2002, 2003b) expressly to identify, with statistical significance, areas of allergens catalogued in SDAP that are similar to known IgE binding sequences (Schein et al., 2007; Ivanciuc et al., 2008b). The compiled information in SDAP, in addition to the sequences and epitopes for all allergens listed in the IUIS Website, from published literature and from other databases, also includes substantial 3D-structural information. Classification of all the allergens in SDAP according to their protein family (Pfam) also allowed us to characterize sequence motifs, which can be used as fingerprints for allergenicity (Ivanciuc et al., 2008a). Those sequence motifs are publicly available on the MotifMate web server (http://born.utmb.edu/motifmate/). We also explored the 3D-structural characteristics of conformational epitopes that can be of importance for more refined bioinformatics rules in the future (Oezguen et al., 2008).
SDAP has integrated search tools to allow a user to rapidly compare the molecular properties of allergenic proteins and their epitopes (Ivanciuc et al., 2002). SDAP was developed for basic research to determine common molecular characteristics of groups of allergens, and to provide regulatory agencies, food scientists and biomedical researchers with software support to determine whether a novel protein has allergenic potential (Ivanciuc et al., 2003a). No special training is needed to access the data, and the tools are implemented in a user-friendly fashion. Software tools integrated in SDAP include the FAO/WHO bioinformatics rules, standard BLAST (Schaffer et al., 2001) and FASTA (Pearson, 1994) search methods, ExPASy (Schneider et al., 2004), PIR (Barker et al., 1999) and PROSITE (Hulo et al., 2006). The special tools of SDAP, such as the PD scale (Ivanciuc et al., 2002, 2003b), were developed to compare short sequences to one another in a mathematically rigorous, unbiased fashion (as opposed to using simple sequence comparisons “by eye”, or applying limited rules with respect to identity or homology).
SDAP is also integrated with other bioinformatics servers, allowing the user to investigate structural similarity and neighbors using SCOP (Structural Classification Of Proteins) (Conte et al., 2000), TOPS (TOpological representation of Protein Structure) (Gilbert et al., 1999), CATH (Class, Architecture, Topology and Homologous superfamily) (Pearl et al., 2001), CE (Combinatorial Extension of the optimal path) (Shindyalov and Bourne, 1998), FSSP (Fold Classification based on Structure–Structure alignment of Proteins) (Holm and Sander, 1996), and VAST (Vector Alignment Search Tool) (Gibrat et al., 1996).
The information content in SDAP for a given allergen is illustrated for the Ole e 8 protein from olive trees (Fig. 1). This descriptive page is shown after a user selects the allergen of interest. The page contains a summary of all the data archived in SDAP for the selected allergen, including the official name (according to the IUIS Website listing, http://allergen.org/), scientific and common name for the species, general source of the allergens, allergen type; species; systematic name; brief description; sequence accession numbers from SwissProt, PIR, NCBI and, where available, the PDB file name for a structure. All of this information is also cross-referenced to other data sources, which can be directly accessed by clicking on the appropriate links.
Several methods implemented in the SDAP web server are designed for regulatory purposes. The most widely used method to determine the potential allergenicity of a novel protein is a global sequence search against known allergens with FASTA (Pearson, 1990). FASTA can be run automatically from any sequence file in SDAP with a mouse click, and outputs a table that lists all similar allergens in SDAP according to their “E-value”, or expectation value, relative to the target, which indicates the statistical significance of the hit. The E-value is a measure of how many matches with the same sequence similarity one would expect to occur randomly in a database of a given size. Thus a low E-value (e.g. less than 10⁻⁶) indicates a high significance of the sequence match.
The FAO/WHO reports (Bindslev-Jensen et al., 2003; WHO, 2003) proposed that cross-reactivity between a query protein and a known allergen has to be considered when there is (a) more than 35% identity in the amino acid sequence of the query protein, using a window of 80 amino acids and a suitable gap penalty, or (b) identity of six contiguous amino acids of the query protein in a known allergen. To carry out a search based on these criteria, the SDAP user only needs to cut and paste a query protein sequence into the appropriate window at the SDAP Website (Fig. 2). The output lists all similar proteins in SDAP (i.e. those that satisfy the FAO/WHO cross-reactivity conditions). Several variations of the search can be performed by altering the parameters, e.g. a full-length FASTA search in SDAP or a search for larger segments of contiguous identical residues. The output is a summary table listing allergens whose alignment with the query protein has an E-value lower than the user-set maximum. The output also contains the individual pairwise alignments and full sequence identities. The user can examine each pairwise alignment and use it as a guide to estimate the allergenic potential of the query sequence.
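The two FAO/WHO criteria can be illustrated with a short Python sketch. This is not the SDAP implementation: the function names are hypothetical, and criterion (a) is simplified to ungapped windows, whereas the guideline calls for an alignment with a suitable gap penalty (as produced by FASTA). The sketch may therefore underestimate identity when insertions or deletions are present.

```python
def shares_contiguous_identity(query: str, allergen: str, k: int = 6) -> bool:
    """Criterion (b): True if any k contiguous query residues occur in the allergen."""
    if len(query) < k or len(allergen) < k:
        return False
    allergen_kmers = {allergen[i:i + k] for i in range(len(allergen) - k + 1)}
    return any(query[i:i + k] in allergen_kmers
               for i in range(len(query) - k + 1))


def best_window_identity(query: str, allergen: str, window: int = 80) -> float:
    """Criterion (a), simplified: best percent identity over UNGAPPED windows.

    Each query window is slid along the allergen without gaps. Identical
    positions are counted and divided by the nominal window length, so
    sequences shorter than the window are scored conservatively.
    """
    best = 0.0
    for q_start in range(max(1, len(query) - window + 1)):
        q_win = query[q_start:q_start + window]
        for a_start in range(max(1, len(allergen) - len(q_win) + 1)):
            a_win = allergen[a_start:a_start + len(q_win)]
            matches = sum(q == a for q, a in zip(q_win, a_win))
            best = max(best, 100.0 * matches / window)
    return best


def flags_cross_reactivity(query: str, allergen: str,
                           identity_cutoff: float = 35.0) -> bool:
    """True if either criterion suggests possible cross-reactivity."""
    return (best_window_identity(query, allergen) > identity_cutoff
            or shares_contiguous_identity(query, allergen))
```

In practice the query would be compared against every allergen in the database and the flagged entries inspected through their pairwise alignments.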
The first questions about these rules are: how many proteins will be incorrectly determined to be allergenic and, more importantly, how many allergens will be missed? To validate the bioinformatics guidelines of the FAO/WHO committee, we used all SDAP entries as positive controls, and we filtered the SwissProt database to generate a set of non-allergenic proteins (negative control). For the negative control set, we removed all SDAP entries from the SwissProt database and then used keyword filters to remove all SwissProt records that (a) contain an allergen-related keyword (allergen, allergy, lipid transfer protein, profilin, lipocalin, pectate lyase, tropomyosin, melittin, thaumatin, seed storage protein), (b) have a sequence shorter than 80 amino acids, or (c) belong to InterPro, Pfam, or Prosite allergen-related classes. For every SDAP protein we recorded the best match among all windows of 80 amino acids. A protein is classified as an allergen if the sequence identity to an allergen is higher than a given threshold in a window of 80 amino acids (Fig. 3). A comparison between the fraction of positive controls (SDAP allergens, blue line) and the fraction of negative controls (set of non-allergenic proteins, red line) suggests that a good threshold for the sequence identity should be between 35% and 45%. We found that the threshold of 35% for sequence identity is a good estimate for separating allergens from non-allergens, but with 6.6% of non-allergenic proteins classified as allergenic there is still a relatively high number of false positives.
The sensitivity of criterion (a) was evaluated by comparing each SDAP allergen with the remaining SDAP sequences (blue line). For a threshold sequence identity of 35%, the test correctly identifies 92.29% of the allergens. Increasing this threshold to 45% decreases the fraction of SDAP allergens identified to 90.45%. Decreasing the threshold sequence identity to 15% will identify 99.10% of known allergens, but at this level 78.25% of the SwissProt proteins (95725 sequences) would also be considered allergenic! Thus, sequence identity alone cannot be used to absolutely identify allergens. The results from Fig. 3 indicate that while this bioinformatics test is able to filter out non-allergenic proteins when the sequence identity is between 35% and 45%, the overall sequence identity is not the only determinant of allergenicity. Additional quantitative descriptors need to be developed for computational predictions.
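The threshold scan behind this kind of analysis can be sketched in a few lines of Python. The function names and the toy identity values below are illustrative assumptions, not SDAP or SwissProt data; the input is taken to be the best 80-residue-window identity recorded for each protein.

```python
def fraction_flagged(best_identities, threshold):
    """Fraction of proteins whose best window identity exceeds the threshold."""
    if not best_identities:
        raise ValueError("need at least one protein")
    flagged = sum(value > threshold for value in best_identities)
    return flagged / len(best_identities)


def threshold_scan(positives, negatives, thresholds):
    """For each threshold, return (threshold, sensitivity, false-positive rate).

    positives: best window identities of known allergens (positive controls)
    negatives: best window identities of presumed non-allergens (negative controls)
    """
    return [(t, fraction_flagged(positives, t), fraction_flagged(negatives, t))
            for t in thresholds]


# Toy values for illustration only (percent identity in the best window).
toy_allergens = [90.0, 60.0, 40.0, 30.0]
toy_non_allergens = [10.0, 20.0, 36.0, 12.0]
scan = threshold_scan(toy_allergens, toy_non_allergens, [15.0, 35.0, 45.0])
```

Lowering the threshold raises both fractions at once, which is the trade-off between missed allergens and false positives described above.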
The FASTA search in SDAP is a rapid way to determine the overall similarity of large proteins. However, FASTA was not designed to compare short sequences, such as the linear IgE epitopes that have been identified by peptide mapping for many allergens (Jarvinen et al., 2001; Elsayed et al., 2004; Shreffler et al., 2004; Schein et al., 2005a). Two different tools were incorporated in SDAP to look for short sequences in other known allergens, an “exact search”, that finds short sequences identical to that of a known epitope, and a second tool, to determine sequences that are close to the IgE epitope in the PD “property-distance space” (Ivanciuc et al., 2002, 2003b). The PD tool determines similar sequences in other allergen entries in SDAP that have similar overall physicochemical properties. Peptides with identical sequences have a PD value of 0, and peptides with conservative substitutions of a few amino acids have a small PD value, typically in the range of 0–3. Peptides with a recognizable similarity in their physicochemical properties generally have PD values lower than 10, while unrelated peptides have PD values that are much higher.
The PD score is based on the amino acid descriptors E1–E5 that were determined by the multidimensional scaling of 237 physicochemical properties of amino acids (Venkatarajan and Braun, 2001). Using the descriptors E1–E5, the properties of each of the 20 naturally occurring amino acids can be numerically summarized as five values. These five dimensions define a physicochemical property space for all amino acids, with each axis representing a distinct feature. For example, the first three E descriptors correlate with the amino acid’s hydrophobicity, size, and polarity, respectively. Each amino acid is represented as a point in the five-dimensional space E1–E5, and the similarity between two amino acids is inversely correlated to the distance between the two points representing the two amino acids. The PD sequence similarity score for two sequences A and B each containing N amino acids is (Ivanciuc et al., 2002, 2003b):
$$\mathrm{PD}(A,B) = \frac{1}{N}\sum_{i=1}^{N}\left[\sum_{j=1}^{5}\lambda_j\left(E_j(A_i)-E_j(B_i)\right)^2\right]^{1/2}$$
where λj is the eigenvalue of the j-th E component, Ej(Ai) is the Ej value for the amino acid in the i-th position from sequence A, and Ej(Bi) is the Ej value for the amino acid in the i-th position from sequence B.
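A minimal Python sketch of such a calculation follows, assuming the PD score is the position-averaged, eigenvalue-weighted Euclidean distance between the E1–E5 descriptor vectors of aligned residues. The descriptor table and eigenvalues in the sketch are hypothetical placeholders chosen only to make it runnable; they are not the published values of Venkatarajan and Braun (2001), so the numbers it produces are not comparable with PD values reported by SDAP.

```python
import math

# HYPOTHETICAL placeholder descriptors (E1..E5) for a few residues only.
# Replace with the published values before drawing any conclusions.
TOY_DESCRIPTORS = {
    "V": (0.30, -0.10, 0.05, 0.00, 0.02),
    "Q": (-0.15, 0.10, -0.05, 0.08, -0.03),
    "G": (-0.05, -0.30, 0.10, -0.02, 0.04),
    "K": (-0.25, 0.20, -0.10, 0.05, 0.01),
    "E": (-0.20, 0.12, 0.08, -0.06, 0.02),
    "P": (0.02, -0.12, 0.15, 0.10, -0.05),
}
# HYPOTHETICAL eigenvalues (lambda_1..lambda_5) of the five components.
TOY_EIGENVALUES = (1.0, 0.8, 0.6, 0.4, 0.2)


def property_distance(seq_a, seq_b, descriptors=TOY_DESCRIPTORS,
                      eigenvalues=TOY_EIGENVALUES):
    """Average weighted distance between two equal-length peptides."""
    if len(seq_a) != len(seq_b) or not seq_a:
        raise ValueError("sequences must be non-empty and of equal length")
    total = 0.0
    for res_a, res_b in zip(seq_a, seq_b):
        e_a, e_b = descriptors[res_a], descriptors[res_b]
        # Weighted Euclidean distance between the two residues in E1-E5 space.
        total += math.sqrt(sum(lam * (x - y) ** 2
                               for lam, x, y in zip(eigenvalues, e_a, e_b)))
    return total / len(seq_a)


pd_identical = property_distance("VQGKEKEP", "VQGKEKEP")
pd_variant = property_distance("VQGKEKEP", "VKGEEKEP")
```

Identical peptides give a distance of zero, and each substitution contributes in proportion to how far apart the two residues lie in the property space.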
Table 1 illustrates the usefulness of the PD value for identifying related potential epitopes and potentially cross-reactive allergenic proteins, using the IgE epitope VQGKEKEP of Par j 1. The PD search identifies a fragment from the related allergen Par j 2 (VKGEEKEP; Table 1) as the most similar region to the IgE epitope of Par j 1. More distant similarities are identified in a number of SDAP allergens. We should at this point emphasize that the PD search is a computational way to define the sequence relationship between known IgE epitopes and other sequences in allergenic proteins. Our initial tests indicate that PD is a reliable index to quantify local similarities in known allergens.
To obtain experimental validation of the PD index as a quantitative predictor of IgE cross-reactivity, we designed peptide variants with predetermined PD scores relative to three linear IgE epitopes of Jun a 1 (Midoro-Horiuti et al., 2003, 2006). The peptides, synthesized on a derivatized cellulose membrane, were probed with sera from patients allergic to Jun a 1, and the experimental data were interpreted with a PD classification method, giving a percentage of correct predictions of up to 80% (Ivanciuc et al., 2008b). Peptides similar to a Jun a 1 epitope (PD < 6) were more likely to bind IgE from the sera than were those with PD values larger than 6. Control sequences, with PD values between 18 and 20 to all three epitopes, did not bind patient IgE, thus validating our procedure for identifying negative control peptides. These results demonstrate that the PD index may identify peptides that have a high probability of cross-reacting with IgE from allergic patients.
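A cut-off classification of this kind can be sketched as follows. The single cut-off of PD = 6 is taken from the text; the function names and the toy data are assumptions for illustration and do not reproduce the published classification method or its results.

```python
def predict_binding(pd_value, cutoff=6.0):
    """Predict IgE binding for a peptide from its PD value to a known epitope."""
    return pd_value < cutoff


def fraction_correct(pd_values, observed_binding, cutoff=6.0):
    """Fraction of peptides whose predicted binding matches the observation."""
    if len(pd_values) != len(observed_binding) or not pd_values:
        raise ValueError("need matching, non-empty inputs")
    correct = sum(predict_binding(pd, cutoff) == observed
                  for pd, observed in zip(pd_values, observed_binding))
    return correct / len(pd_values)


# Toy data for illustration only: PD values and whether binding was observed.
toy_pd = [0.0, 4.2, 8.0, 19.0]
toy_observed = [True, False, False, False]
toy_accuracy = fraction_correct(toy_pd, toy_observed)
```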
Classification of allergens into functional groups of proteins can indicate important relationships and has the additional advantage that structural and sequence groupings allow one to identify significant similarities in proteins with diverse origins. We annotated all allergens in SDAP according to their Pfam classification (Ivanciuc et al., 2008a). Pfam (http://www.sanger.ac.uk/Software/Pfam/) is a list of multiple sequence alignments of related protein domains, classified in two ways. The Pfam-A database lists protein families that are grouped by their common function as well as sequence, using expert knowledge and experimental data. Pfam-B is computer-generated and contains alignments of protein sequences selected based on a minimum level of sequence identity, regardless of their protein function. Most SDAP entries have now been classified to families from the Pfam-A database. The Pfam classification of any allergen can be accessed from the “List SDAP” menu item.
Allergens from the same Pfam class exhibit a high structural similarity, as shown in Fig. 4 for three pairs of allergens: Act c 1 (kiwi, PDB 2ACT) and Car p 1 (papaya, PDB 1KHQ) from the family PF00112, Papain family cysteine protease; Phl p 5 (timothy, PDB 1L3P) and Phl p 6 (PDB 1NLX) from the family PF01620, Ribonuclease (pollen allergen); and Der f 2 (American house dust mite, PDB 1XWV) and Der p 2 (European house dust mite, PDB 1KTJ) from the family PF02221, ML domain. We found that allergens populate only a small subset of all known Pfam families: all allergenic proteins in SDAP could be grouped into only 130 (of 9318 total) Pfam families, and only 31 families contain more than 4 allergens, which is consistent with results obtained by others (Radauer et al., 2008). The limited number of Pfam families suggests new criteria to estimate the potential allergenic risk of recombinant protein products. For example, if a novel protein product belongs to a Pfam class different from all Pfam classes found in SDAP, it should be considered to have little allergenic potential.
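The family-membership criterion reduces to a set comparison, sketched below in Python. The three allergen families are those named in the text; the function name and the example query identifier PF99999 are hypothetical, and a real screen would use the full list of allergen-containing families from SDAP.

```python
# The three Pfam families named in the text; SDAP lists many more.
ALLERGEN_PFAMS = {"PF00112", "PF01620", "PF02221"}


def shared_allergen_families(query_pfams, allergen_pfams=ALLERGEN_PFAMS):
    """Return the Pfam families a query protein shares with known allergens.

    An empty result suggests low allergenic potential under this criterion;
    a non-empty result only indicates that closer inspection is warranted.
    """
    return set(query_pfams) & set(allergen_pfams)


hit = shared_allergen_families(["PF00112", "PF99999"])  # PF99999 is made up
no_hit = shared_allergen_families(["PF99999"])
```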
Alternatively, one can define discrete areas of residue conservation, “motifs”, in related allergenic proteins of known clinical cross-reactivity, as possible areas for IgE binding. Several groups have defined conserved sequences in groups of allergens (Mills et al., 2002; Brusic and Petrovsky, 2003; Stadler and Stadler, 2003; Li et al., 2004; Marti et al., 2007). Unlike motifs defined by others, which can be quite long (to the point that they could more properly be called protein domains), we define areas more likely to be discrete IgE epitopes, with a typical length between 6 and 15 amino acids. In our work, we look for areas where the side chains show conserved physicochemical properties (PCPs), such as hydrophobicity, size or alpha-helical propensity, rather than strict identity. The underlying assumption is that for a group of cross-reactive allergenic proteins, the IgE epitope areas have similar binding affinities for the same antibodies, and thus have common physicochemical properties in the antibody binding sites.
Our method begins by aligning the sequences of known allergens that are related to one another, such as those in the tropomyosin or vicilin family. The PCPMer suite (available at http://landau.utmb.edu:8080/WebPCPMer/HomePage/index.html) finds sequence motifs in protein families by identifying regions with highly conserved physicochemical properties. These “PCP-motifs” are determined by conservation of the five quantitative property vectors E1–E5 which summarize many different physicochemical properties of the side chains of the amino acids, including size, hydrophobicity, and tendency to form helical or strand secondary structures (Venkatarajan and Braun, 2001; Venkatarajan et al., 2003). Sequence motifs are contiguous segments of high relative entropy values for at least one of the five descriptors. Alternatively, the program allows the user to set thresholds of relative entropy, gap cutoff and minimum motif length to balance the specificity and sensitivity of motifs. Each motif identified by PCPMer is quantitatively expressed as a profile, in this case (for a motif of length N), a series of N × 5 matrices consisting of the average values, standard deviations and the relative entropies of the descriptors E1–E5 at each position (column in the multiple sequence alignment) in the motif. This profile can be used to search for similar sequences in protein databases. Details of the algorithm for motif generation are described in our previous publications (Venkatarajan et al., 2003; Schein et al., 2005b).
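The role of the three user-set thresholds can be illustrated with a simplified Python sketch that extracts motif segments from a per-column relative entropy profile. This is not the PCPMer algorithm: it assumes one score per alignment column (for example, the maximum relative entropy over the five descriptors), and the function and parameter names are hypothetical.

```python
def find_motifs(rel_entropy, cutoff, max_gap=0, min_length=6):
    """Return (start, end) column ranges, end exclusive, of candidate motifs.

    rel_entropy: one relative-entropy score per alignment column
    cutoff:      minimum score for a column to count as conserved
    max_gap:     number of consecutive low-scoring columns tolerated inside a motif
    min_length:  minimum motif length, measured between conserved end columns
    """
    motifs = []
    start = None      # first conserved column of the segment being built
    last_high = None  # most recent conserved column in that segment
    for i, value in enumerate(rel_entropy):
        if value >= cutoff:
            if start is None:
                start = i
            last_high = i
        elif start is not None and i - last_high > max_gap:
            # The run of low-scoring columns is too long: close the segment.
            if last_high - start + 1 >= min_length:
                motifs.append((start, last_high + 1))
            start = None
    if start is not None and last_high - start + 1 >= min_length:
        motifs.append((start, last_high + 1))
    return motifs


# Toy profile for illustration only.
toy_profile = [0.1, 1.2, 1.3, 1.1, 0.2, 1.4, 1.5, 0.1, 0.1, 1.0]
with_gap = find_motifs(toy_profile, cutoff=1.0, max_gap=1, min_length=3)
without_gap = find_motifs(toy_profile, cutoff=1.0, max_gap=0, min_length=3)
```

Raising the cut-off or the minimum length makes the motifs more specific, while tolerating gaps merges neighbouring conserved stretches into longer, more sensitive motifs.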
As an illustration of the generation of sequence motifs, we show the (truncated) multiple sequence alignment for the walnut allergen Jug r 1 with other allergens in the same Pfam classification (Fig. 5A) and the corresponding output of PCPMer (Fig. 5B). The sequence of the first protein in the alignment is given as reference for those columns where the relative entropy values exceed the value given in the first column of Fig. 5B. Two motifs, CQYYLR and CCQQLS, are identified as local maxima of the relative entropy values, and are regions of high conservation of physicochemical properties. Our PCP motifs do not require that residues within a motif are identical among all sequences, just that the overall pattern of property presentation be similar.
Motifs can also be mapped onto the 3D-structure of a protein to identify epitopes and conserved functional areas (Schein et al., 2005b). Combining sequence analysis with structural representations can answer many questions about the nature of the IgE epitopes of allergens. For example, why do some individuals show cross-reactivity to homologous proteins in peanuts and tree nuts, while others react to one or another of the homologous proteins (Teuber and Beyer, 2004)? While single amino acid differences may be quite important in individual reactivity, a 3D view of the identified IgE binding sites can provide missing information about the possible relationships between structure and sequence. If IgE binding sequences of related proteins have similar properties, the proposed methods that combine PD values with structural details will have higher predictive ability, if properly calibrated. Thus we are building up a library of models, based on homology of allergens to proteins of known structure, to determine clusters of residues that are conserved on the surface of allergens.
Once similar sequences have been identified by PD values, the structural information in SDAP can be used to understand which parts of an allergen sequence are likely to be surface exposed, and thus able to form an IgE binding surface. In order to investigate the structural features of allergens we computed reliable models for more than 80% of allergens in SDAP for which the experimental structure is unknown (Oezguen et al., 2008). We initially attempted to generate 3D homology models for 645 allergens in SDAP for which no experimental structure or close homolog is deposited in the Protein Data Bank. Each model produced by our automatic procedure was evaluated critically by three quality criteria, namely: (1) negative overall conformational energy after FANTOM minimization, which indicates favorable local packing of the side chains; (2) an RMSD to the template for the aligned regions of less than 1.8 Å; and (3) not more than 5% of the ϕ/ψ dihedral angles situated in the disallowed region of a Ramachandran plot. Overall, 433 allergen sequences passed these criteria and gave reliable 3D homology models that are currently deposited in SDAP and are available for viewing or for download. These allergen models can be used to determine areas of local structure that correlate with allergenicity. For example, using our models, linear IgE epitopes taken from SDAP are mapped onto the surface of the pollen allergen Par j 1 (Fig. 6): epitope 1, VQGKEKEP, red; epitope 2, SKGCCSGAKRLD, green; epitope 3, KTGPQRV, gold; epitope 4, PKHCGIVD, blue. The surface mapping of linear epitopes on the 3D models of allergens may be used to distinguish buried residues from surface-accessible residues, thus highlighting the amino acids that may bind to the IgE.
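The three acceptance criteria translate directly into a simple filter, sketched here in Python. The record type, field names and example values are hypothetical; only the three thresholds come from the text, and computing the energy, RMSD and Ramachandran statistics themselves is outside the scope of the sketch.

```python
from dataclasses import dataclass


@dataclass
class ModelQuality:
    energy: float               # overall conformational energy after minimization
    rmsd_to_template: float     # RMSD over aligned regions, in angstroms
    disallowed_fraction: float  # fraction of phi/psi angles in disallowed regions


def is_reliable(model: ModelQuality) -> bool:
    """Apply the three quality criteria described in the text."""
    return (model.energy < 0.0
            and model.rmsd_to_template < 1.8
            and model.disallowed_fraction <= 0.05)


# Made-up example values for illustration only.
candidates = [
    ModelQuality(energy=-120.0, rmsd_to_template=1.2, disallowed_fraction=0.03),
    ModelQuality(energy=-80.0, rmsd_to_template=2.1, disallowed_fraction=0.02),
    ModelQuality(energy=15.0, rmsd_to_template=0.9, disallowed_fraction=0.01),
]
accepted = [m for m in candidates if is_reliable(m)]
```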
Phage display technology is an alternative approach to characterize conformational epitopes of proteins. It identifies a discontinuous group of amino acids on the protein surface by binding to a monoclonal antibody (Smith and Petrenko, 1997). Therefore, the interaction site on the protein surface that is mimicked by the epitope cannot be located by sequence analysis alone. We developed a fully automated method, EpiSearch (available at http://curie.utmb.edu/episearch.html), that locates the antibody binding site on the antigen surface using the peptide sequences obtained from phage display. The method is a further development of our approach to predict interface residues in a monomeric protein (Negi et al., 2006, 2007; Negi and Braun, 2007).
Bioinformatics analysis of the properties of allergens has progressed greatly in the last few years. As we have shown, SDAP has reliable tools that go beyond the initial guidelines for determining the potential allergenicity of new food products for regulatory purposes. SDAP now contains a broad array of bioinformatics and computational tools that (1) evaluate the overall sequence similarity to a known allergen based on FASTA alignments, (2) evaluate the WHO/FAO rules, (3) find regions identical to known IgE epitopes, (4) identify regions similar to known IgE epitopes and rank them with the PD score, and (5) use 3D homology models to identify the amino acids that are important in IgE binding.
We regard our studies as important in providing a solid scientific foundation for the general discussion of the potential risk of genetically modified (GM) foods. The statistical results and the novel bioinformatics tools can help regulatory agencies in the US and other countries that grow GM plants to find more specific bioinformatics guidelines for these crops. Since food allergies can result in fatal reactions, the allergenic potential of genetically engineered food products needs to be carefully assessed prior to their entry into the market. There is a vital need for faster and more reliable methods to evaluate the potential allergenicity of proteins that have not previously been part of the food supply. Our novel approaches can reduce some of the uncertainty for those crops that may be potentially allergenic for some sensitive sub-populations.
This work was supported by a contract from the US Food and Drug Administration (HHSF223200710011I) and grants from the National Institutes of Health (R01 AI 064913) and the US Environmental Protection Agency under a STAR Research Assistance Agreement (No. RD 833137).