J Biomol Tech. 2008 September; 19(4): 251–256.
PMCID: PMC2567133

ABRF-PRG05: De Novo Peptide Sequence Determination


A common request of proteomics core facilities is protein identification. However, in some instances primary sequence information for the protein in question is not present in public databases. In other cases, the amino acid sequence of a protein may differ in some way from the sequence predicted from the gene sequence in a database as a result of gene mutation, gene splicing, and/or multiple posttranslational modifications. Thus, it may be necessary to determine the sequence of one or more peptides de novo in order to identify and/or adequately characterize the protein of interest. The primary goal of this study was to give participating laboratories an opportunity to evaluate their proficiency in sequencing unknown peptides that are not included in any published database. Samples containing 3–6 pmol each of five synthetic peptides with amino acid sequences that were not present in public databases were sent to 106 laboratories. One nonstandard amino acid was present in one of the peptides. From a comparison of the results obtained by different strategies, participating laboratories will be able to gauge their own capabilities and establish realistic expectations for the approaches that can be used for this determination.

Keywords: de novo peptide sequencing, post-translational modification, Edman sequencing, mass spectrometry


Proteomics core laboratories are often presented with unknown proteins to be identified. Sometimes, these are not identifiable by commonly used strategies that involve proteolytic digestion, tandem mass spectrometry (MS) analysis, and database searching. There are several reasons why this approach might not be successful. The peptides derived from the protein might be modified in some way that is not being considered by the database search program being used, they might lack a required sequence characteristic (e.g., a C-terminal Lys or Arg from a tryptic digest), or they might come from an organism for which the primary sequence is not known. Sometimes a homologous protein can be identified, but this requires that the sequences have a sufficiently high degree of similarity. For example, if an unknown protein is 95% identical to a known one, there is approximately a 60% probability that a 20-residue peptide from the unknown protein will have at least one substitution compared to the corresponding known peptide, i.e., 1 − (0.95)^20 ≈ 0.64. Alternative approaches may be required to obtain the needed sequence(s). The primary goal of the 2005 Association of Biomolecular Resource Facilities (ABRF) Proteomics Research Group (PRG) study was to give participating laboratories a chance to evaluate their capabilities in the following areas: (a) determination of peptide sequence; (b) identification of unusual amino acids; and (c) use of software to assist in the interpretation of de novo sequence data.
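The probability quoted above can be reproduced with a short calculation. The sketch below assumes that each residue is conserved independently with probability equal to the overall sequence identity, which is a simplification; the function name is illustrative and not part of the study.

```python
def prob_any_substitution(identity: float, length: int) -> float:
    """Probability that a peptide of `length` residues carries at least one
    substitution, assuming each residue is conserved independently with
    probability `identity` (a simplifying assumption)."""
    if not 0.0 <= identity <= 1.0:
        raise ValueError("identity must lie between 0 and 1")
    if length < 0:
        raise ValueError("length must be non-negative")
    # P(no substitution) = identity ** length, so take the complement.
    return 1.0 - identity ** length


if __name__ == "__main__":
    # 95% identity, 20-residue peptide: about 0.64
    print(f"{prob_any_substitution(0.95, 20):.2f}")
```

Under the same assumption, a 10-residue peptide from the same protein would have roughly a 40% chance of containing a substitution, which illustrates why shorter peptides are more likely to match a homolog exactly.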

The sequences of the peptides synthesized for this study are shown in Table 1. No specific approaches for determining the sequences were recommended, although it was anticipated that tandem mass spectrometry and possibly Edman sequencing would be employed. Each of the laboratories that requested a sample was provided with a mixture consisting of 3–6 pmol each of the five synthetic peptides shown in Table 1; the sequences of these peptides were not present in any public database. The sample was supplied as a dried pellet that could be dissolved in most common aqueous solutions; one peptide (A1) proved somewhat difficult to dissolve. As with any “real-life” sample, there were minor contaminants present. There was either a Lys or an Arg at the C-terminus of each peptide, analogous to tryptic peptides; one peptide had a double “missed cleavage” and another contained two hydroxyproline (Hyp) residues. Participants were asked to return experimental evidence for each sequence they determined in addition to completing a Web-based questionnaire.
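To make the "missed cleavage" terminology concrete, the following sketch counts internal Lys/Arg residues in a peptide, which is how missed tryptic cleavages are usually tallied. The function name, the optional proline rule (no cleavage before Pro), and the example sequences are assumptions for illustration; the example sequences are hypothetical and are not the study peptides.

```python
def count_missed_cleavages(peptide: str, proline_rule: bool = True) -> int:
    """Count internal Lys/Arg residues, i.e. tryptic sites that were not cut.

    The C-terminal residue is ignored because it is the expected cleavage
    site. If `proline_rule` is True, K/R followed by P is not counted,
    following the common convention that trypsin does not cleave before Pro.
    """
    seq = peptide.upper()
    missed = 0
    for i, residue in enumerate(seq[:-1]):  # skip the C-terminal residue
        if residue in "KR":
            if proline_rule and seq[i + 1] == "P":
                continue
            missed += 1
    return missed


if __name__ == "__main__":
    # Hypothetical sequences, not taken from Table 1.
    for seq in ("AGLVK", "AKGRLVK", "AKPGR"):
        print(seq, count_missed_cleavages(seq))
```

A peptide with a double missed cleavage, as described above, would return 2 from this function.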

Table 1. Amino Acid Sequences of the Five Peptides in the PRG05 Sample



The peptides were synthesized and purified at the following locations: A1, A2, and A3 at the HHMI Mass Spectrometry Laboratory at University of California, Berkeley; T50 at the NYU Protein Chemistry Laboratory; and J1 at the Macromolecular Structure Facility, Michigan State University. The synthetic peptides were analyzed by reversed-phase high performance liquid chromatography (HPLC) and matrix-assisted laser desorption ionization time-of-flight mass spectrometry (MALDI-TOF-MS) to verify purity.

Composition analysis

Amino acid analysis was conducted on small portions of A2, A3, and T50, individually dissolved in the appropriate volume of water to yield 1 mg/mL stock solutions. For each of these three peptides, 3 μL of the stock solution was added to an amino acid analysis tube. The blank contained 3 μL of 1% acetic acid. The samples were dried in a vacuum centrifuge, sealed, and analyzed in duplicate for amino acid content using a Waters AccQtag AAA column in conjunction with a Waters 2690 HPLC equipped with a Waters 2475 fluorometer.

Sample distribution

For distribution to requesting laboratories, the appropriate volume corresponding to 3–6 pmol of each peptide was added to a 0.5-mL polypropylene tube and the peptide mixture was dried in a vacuum centrifuge. Dried samples were sent to 76 laboratories in North America, 20 in Europe, and 10 in other countries.


Sequence data were submitted by 40 laboratories, corresponding to a return rate of 38%, which was similar to that of other recent PRG studies (1, 2). A summary of the study results, organized according to instrument configuration and ionization method, is shown in Table 2. A compilation of all results received is shown in Table 3. The following approaches were used: MS alone (35); Edman degradation (1); Edman degradation plus MS (4).

Table 2. Summary of Instrument Configuration and Ionization Mode Utilization
Table 3. Summary of Results

The majority of laboratories reported the correct nominal peptide masses; peptide A2 was often found to contain an oxidized Met. Differences in sample preparation and use of derivatization prior to analysis did not seem to influence the success rate for sequencing, although one group used a variety of derivatization strategies and obtained the correct sequence for four of the five peptides.

Static nanoelectrospray worked as well as on-line fractionation by capillary HPLC. Laboratories using a tandem time-of-flight (TOF/TOF) mass spectrometer generally had a slightly higher success rate in obtaining the correct sequences for these peptides. These instruments typically use MALDI ionization; for this study it was not possible to assess the relative importance of ionization mode versus instrument type as related to the TOF/TOF results. In addition, the scores for laboratories reporting use of both an ion trap and another type of instrument were notably higher than those using a trap alone. Some level of manual interpretation was used by all laboratories; software alone did not appear to be sufficient to provide complete sequences. It is clear that there is a wide range of capabilities and levels of expertise among the participating laboratories. Moreover, it is important to note that the total number of responses was not very large. Therefore, it is not possible to formulate statistically rigorous conclusions about the capabilities of any specific approach or instrument used based on the results of this study.

The success rates for sequencing the individual peptides varied (Table 2 and Figure 1). This is most likely due to differences in the sequences. The internal Lys residues combined with the multiple Leu and Ile (scored as 0.5 if not distinguished) undoubtedly contributed to the low scores for peptide T50. Peptide A1 was the longest and, therefore, expected to be more difficult.
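The half-credit rule for Leu/Ile mentioned above can be illustrated with a per-residue scoring sketch. The details are assumptions: the study's exact scoring procedure is not described here, so this version compares sequences position by position, gives 1 point for an exact match, and gives 0.5 when the expected residue is Leu or Ile and the reported residue is L, I, or the ambiguity code J. The example sequences are hypothetical.

```python
def score_sequence(reported: str, expected: str) -> float:
    """Score a reported sequence against the expected one, residue by residue.

    Assumed rules (not taken from the study): 1 point for an exact match,
    0.5 points when a Leu/Ile position is reported as the other isomer or
    as the ambiguity code 'J', 0 otherwise. Positions beyond the shorter
    sequence are not scored.
    """
    total = 0.0
    for rep, exp in zip(reported.upper(), expected.upper()):
        if rep == exp:
            total += 1.0
        elif exp in "LI" and rep in "LIJ":
            total += 0.5  # Leu/Ile not distinguished
    return total


if __name__ == "__main__":
    expected = "ALGIK"  # hypothetical peptide, not from Table 1
    for reported in ("ALGIK", "AJGJK", "AIGLK", "AQGIK"):
        print(reported, score_sequence(reported, expected))
```

With a rule of this kind, a peptide rich in Leu and Ile loses up to half a point per such residue even when the rest of the sequence is correct, which is consistent with the lower scores noted for T50.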

Figure 1. Success rate for individual peptides. Solid bars denote mean score obtained by all labs for a given peptide. Empty bars denote mean correct number of amino acid residues obtained by all labs for a given peptide.


The purpose of this study was to evaluate the capabilities of core laboratories to determine the sequences of peptides not found in any published database. Overall, the results show that this is an area that is difficult for many core laboratories. A sufficient amount of each of the peptides was supplied such that sample quantity should not have been a limitation (although solubility issues might have caused problems for sequencing of peptide A1). Peptides T50 and A1 were the most challenging, probably due to specific sequence features of those peptides.

In general, laboratories that reported using more than one type of instrument did slightly better than those that used only a single instrument. It is possible that facilities with multiple instruments might have a larger staff with more overall expertise. Too few laboratories reported using Edman sequencing to draw any conclusions about that approach; quantity limitations and time constraints generally made it less feasible to separate the peptides sufficiently for Edman analysis.

Although there are a variety of computer programs that are designed to perform de novo sequencing, the versions that were available at the time of this study did not appear to be capable of determining the sequences of the study peptides. The peptides used in this study were, by design, not naturally occurring sequences. In many “real” cases, a partial sequence obtained by mass spectrometry, even one containing errors, can be linked to a protein by a BLAST search. That, however, requires that a protein of sufficient homology be present in a published database. While that strategy would not be successful for the synthetic peptides provided in this study, it should be routinely considered.

It is clear that manual interpretation was necessary in order to determine the sequences of the peptides in this study. Commercially available instruments can usually provide sufficient tandem MS information to determine the sequences of most unknown peptides. However, it is critically important not only to acquire the spectra with the requisite mass accuracy and resolution, but also to be skilled in data interpretation. For example, there are two Hyp residues in peptide J1. The residue mass of Hyp (113.04768) is 36.4 mmu less than that of Leu/Ile (113.08406). Using some commercial instruments, it is possible to measure collision-induced dissociation fragment masses with sufficient accuracy to distinguish between these residues.
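The Hyp versus Leu/Ile mass difference quoted above can be checked from elemental compositions. The sketch below uses standard monoisotopic atomic masses and the residue formulas C5H7NO2 (Hyp) and C6H11NO (Leu/Ile); the function names and the example fragment m/z of 1000 are illustrative assumptions. The ppm figure is simply the mass difference relative to the fragment m/z, so the measurement error would need to be well below it to tell the residues apart.

```python
# Standard monoisotopic atomic masses (Da).
MONOISOTOPIC = {"C": 12.0, "H": 1.00782503, "N": 14.0030740, "O": 15.9949146}

# Residue (not free amino acid) elemental compositions.
RESIDUE_FORMULAS = {
    "Hyp": {"C": 5, "H": 7, "N": 1, "O": 2},
    "Leu/Ile": {"C": 6, "H": 11, "N": 1, "O": 1},
}


def residue_mass(formula: dict) -> float:
    """Monoisotopic mass of a residue from its elemental composition."""
    return sum(MONOISOTOPIC[element] * count for element, count in formula.items())


def difference_in_ppm(delta_da: float, fragment_mz: float) -> float:
    """Express a mass difference (Da) as ppm of a given fragment m/z."""
    return delta_da / fragment_mz * 1e6


if __name__ == "__main__":
    hyp = residue_mass(RESIDUE_FORMULAS["Hyp"])
    leu = residue_mass(RESIDUE_FORMULAS["Leu/Ile"])
    delta = leu - hyp
    print(f"Hyp {hyp:.5f} Da, Leu/Ile {leu:.5f} Da, difference {delta * 1000:.1f} mmu")
    # Hypothetical singly charged fragment at m/z 1000.
    print(f"difference at m/z 1000: {difference_in_ppm(delta, 1000.0):.1f} ppm")
```

Because the difference is fixed in absolute terms, it corresponds to a larger ppm value for low-mass fragments and a smaller one for high-mass fragments, so the residues are easier to distinguish in the low-mass region of a spectrum.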

Finally, expertise in de novo sequencing is clearly essential, regardless of whether the data are acquired by mass spectrometry or Edman analysis or both. Whereas proteins that are present in a published database can be identified on a routine basis by scientists who are not experts in interpretation of mass spectra, the same cannot be said for proteins for which sequences are not included in any database. The results of this study provide excellent justification for core laboratories to have not only state-of-the-art instrumentation but also personnel with expertise in instrument operation and data analysis.


  1. The average success rate in this study was relatively low, indicating that in 2005, most core laboratories did not have the capability to perform de novo sequencing. (Note that this study addressed issues that are very different from identifying a protein that is in a database.)
  2. MALDI ionization and TOF/TOF mass analyzers appeared to be more successful than the alternatives, but too few laboratories participated in this study to reach any firm conclusions.
  3. No individual sample preparation or derivatization strategy was notably more successful than others.
  4. Laboratories that used more than one type of instrument were slightly more successful than those that only used a single type of instrument.
  5. Software available in 2005 for de novo sequencing was not sufficient on its own for successful sequence analysis of the test peptides.
  6. Expertise in MS and MS/MS data acquisition and manual interpretation was essential for success.


We thank David S. King of the HHMI Mass Spectrometry Laboratory at the University of California, Berkeley, for synthesis and purification of peptides A1, A2, and A3; Joe Leykam at the Macromolecular Structure Facility at Michigan State University for synthesizing peptide J1 and for the amino acid analyses; Ron Beavis and Janet Brostowin at the NYU Protein Chemistry Laboratory for the synthesis of peptide T50; Vivek Shetty, Chongfeng Xu, and Yun Lu of the NYU Protein Analysis Facility for mass spectrometry analysis of the samples; Dawn Maynard of the NIMH at the National Institutes of Health for mailing and receiving correspondence and for ensuring that the participants remained anonymous; and Debra Diana of the NYU Skirball Institute of Biomolecular Medicine for receiving confirmatory data.


1. Arnott D, Gawinowicz MA, Grant RA, Neubert TA, Packman LC, Speicher KD. ABRF-PRG03: Phosphorylation site determination. J Biomol Tech. 2003;14:205–215.
2. Arnott D, Gawinowicz MA, Kowalak JA, Lane WS, Speicher KD, Turck CW, et al. ABRF-PRG04: Differentiation of protein isoforms. J Biomol Tech. 2007;18:124–134.

Articles from Journal of Biomolecular Techniques : JBT are provided here courtesy of The Association of Biomolecular Resource Facilities