Asara et al. reported the detection of collagen peptides in a 68-million-year-old T. rex bone by shotgun proteomics. This finding has been called into question as a possible statistical artifact. We reanalyze Asara et al.'s tandem mass spectra using a different search engine and different statistical tools. Our reanalysis shows a sample containing common laboratory contaminants, soil bacteria, and bird-like hemoglobin and collagen.
Asara et al. reported finding seven distinct collagen sequences by shotgun proteomics in a remarkably well-preserved 68-million-year-old fossilized dinosaur bone. They later rejected one of the sequences as statistically insignificant, corrected the placement of hydroxylations on two other sequences [2, 3], and computed a phylogenetic tree that (unsurprisingly) placed T. rex with birds. The discovery of intact protein in such an ancient sample has been called into question on the grounds of plausibility and inadequate statistical analysis.
In a reanalysis of the data, Matt Fitzgibbon and Martin McIntosh (supplementary material) pointed out that five of the six T. rex sequences appear in ostrich collagen. (Ostrich collagen has not been sequenced, but Asara et al. deduced a partial sequence by mass spectrometry and a computational mutation search using chicken.) The close similarity with ostrich and the fact that the same proteomics laboratory processed both T. rex and ostrich raised a third issue: contamination from another proteomics sample. Moreover, Fitzgibbon and McIntosh found a spectrum matching a bird hemoglobin peptide with a carbamidomethylated cysteine. This bolstered the case for exogenous contamination, because Asara et al. did not originally describe alkylation as part of their sample preparation.
The actual laboratory procedures, however, were somewhat complex. As explained by John Asara (personal communications), the entire data set comprises seven chromatographic runs of T. rex bone and four chromatographic runs of sediment from around the bone, chipped away during excavation. The runs were made over the course of more than a year, and some of the T. rex injections were alkylated to produce carbamidomethylated cysteine and some were not. The confident collagen sequences appear in several different runs and none contain cysteine. The ostrich runs were performed more than a year before any of the T. rex runs, and more than 1000 other runs were performed in between.
Here we reanalyze the original mass spectra yet again using different bioinformatics tools and statistical tests. We find three distinct collagen peptides matched with E-value below 1.0, along with a number of less significant matches to collagen. Assuming statistical independence of distinct peptides, the identification of bird-like collagen at the protein level is clearly significant. We also confirm the statistical significance of the bird hemoglobin peptide reported by Fitzgibbon and McIntosh.
We obtained the entire T. rex data set as an MGF (Mascot generic format) file containing 31,367 distinct MS/MS (Thermo LTQ) spectra and 48,216 combinations of spectrum and precursor charge assignment; these spectra are publicly available. Because precursor charge assignments for ion-trap spectra can be unreliable, we ignored the charge assignments in the MGF file and considered each of the spectra with assignments of +1, +2, and +3. For our protein database, we used uniprot_sprot.fasta (downloaded 4 February 2008), containing 408,099 protein sequences.
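The charge-agnostic treatment of precursors described above can be sketched as follows. This is a minimal illustration, not the authors' code: the `Spectrum` record, the function name, and the use of the proton mass to convert m/z to neutral mass are assumptions introduced here.

```python
from typing import Iterable, Iterator, NamedTuple, Tuple

PROTON_MASS = 1.007276  # Da; standard value, assumed for this sketch


class Spectrum(NamedTuple):
    """Hypothetical minimal record for one MS/MS spectrum from an MGF file."""
    title: str
    precursor_mz: float


def charge_hypotheses(
    spectra: Iterable[Spectrum], charges: Tuple[int, ...] = (1, 2, 3)
) -> Iterator[Tuple[Spectrum, int, float]]:
    """Ignore any recorded charge and yield every (spectrum, charge, neutral mass)."""
    for spectrum in spectra:
        for z in charges:
            # Neutral peptide mass implied by this charge hypothesis.
            neutral_mass = (spectrum.precursor_mz - PROTON_MASS) * z
            yield spectrum, z, neutral_mass
```

Each spectrum then enters the search three times, once per charge hypothesis, and the scorer decides which hypothesis, if any, is supported.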
We searched the spectra against the database using ByOnic and compiled a protein list using the companion program ComByne. Previous searches used Sequest and Mascot. ByOnic uses a “matched filter” or dot-product scorer, incorporating predicted and observed peak intensities and mass deviations between predicted and observed peaks. In previous studies [5, 6], we have found ByOnic to be more sensitive than Mascot and Sequest at the same false discovery rate. We initially performed a wide search, in which we searched all of uniprot_sprot.fasta for fully tryptic peptides, with 1500 ppm precursor mass tolerance and 0.4 Dalton fragment mass tolerance, with the following modifications enabled: carbamidomethylated cysteine (camC, a fixed modification), hydroxyproline (a common modification in collagen), oxidized methionine, and pyro-glu from N-terminal glutamine, glutamic acid, and camC. Not all of the seven chromatographic runs used camC, so we also searched the data assuming unmodified cysteine. We allowed any number of missed cleavages. We also searched a decoy database containing reversals of the protein sequences in uniprot_sprot.fasta. We then made a small database containing all the proteins (forward or reverse) with at least one match scoring at least 200, approximately equivalent to a Mascot score of 20. We also added reversals of all the forward proteins in the small database, and in total the small database contained 4472 forward proteins and 7161 reversed proteins.
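The construction of the decoy and small databases can be sketched as below. This is an illustration under stated assumptions, not ByOnic or ComByne code: the `REV_` naming prefix, the function names, and the `best_scores` mapping (protein name to best match score from the wide search) are hypothetical.

```python
from typing import Dict


def reverse_decoys(proteins: Dict[str, str], prefix: str = "REV_") -> Dict[str, str]:
    """Build one decoy per protein by reversing its amino acid sequence."""
    return {prefix + name: seq[::-1] for name, seq in proteins.items()}


def build_small_database(
    forward: Dict[str, str],
    best_scores: Dict[str, float],
    threshold: float = 200.0,
    prefix: str = "REV_",
) -> Dict[str, str]:
    """Keep every protein (forward or reversed) with a wide-search match at or
    above the threshold, then add the reversal of each retained forward protein.
    Assumes no forward protein name already starts with the decoy prefix."""
    decoys = reverse_decoys(forward, prefix)
    searched = {**forward, **decoys}
    small = {
        name: seq
        for name, seq in searched.items()
        if best_scores.get(name, 0.0) >= threshold
    }
    for name in [n for n in small if not n.startswith(prefix)]:
        small[prefix + name] = decoys[prefix + name]
    return small
```

Because reversed proteins enter the small database by two routes (their own first-search hits, and as reversals of retained forward proteins), they can outnumber the forward proteins, as in the counts reported above.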
It may seem strange to have unequal numbers of forward and reversed proteins, but we did this to guard against bias. The assumption behind the forward/reversed database approach [9, 12] is that the false positives are equally divided between the forward and reversed proteins. If our small database had included only the forward proteins from the first search along with their reversals, we would have matched the numbers and lengths of forward and reversed proteins, but our forward proteins would have been statistically different from our reversed proteins, because the forward proteins had already found matches scoring at least 200. If our small database had included only those proteins, forward or reverse, with matches of at least 200 in the first search, our small database would have been unbiased at the protein level, even though the forward proteins would have outnumbered the reverse proteins. The small database, however, would have been biased at the peptide level, likely to contain more false forward peptides than false reversed peptides, because not every peptide in a true protein is true, that is, truly the peptide represented by a mass spectrum. By including both the first-search reversed proteins and reversals of the first-search forward proteins, we slightly bias against forward peptides; this is the conservative approach.
For our narrow search we searched the small database for fully tryptic peptides, with the same mass tolerances and modifications as above, along with deamidated asparagine and glutamine and at most one SNP mutation per peptide. We searched the small database one more time, using the same modifications along with a “wild-card modification” in order to ensure that nothing interesting was overlooked. ByOnic's wild-card modification allows any integer mass change to any one residue. A peptide can carry both known modifications and a wild card, and the total mass of modifications must lie within a user-settable range, in this case −50 to +80 Daltons.
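To give a sense of how much the wild card enlarges the search space, the sketch below enumerates the (residue, integer mass shift) placements allowed by the stated range. It is only an illustration of the constraint as described; ByOnic's internal algorithm is not documented here, and the function and parameter names are hypothetical.

```python
import math
from typing import Iterator, Tuple


def wildcard_candidates(
    peptide: str,
    known_mod_mass: float = 0.0,
    min_total: float = -50.0,
    max_total: float = 80.0,
) -> Iterator[Tuple[int, int]]:
    """Yield (residue index, integer delta) pairs such that the total
    modification mass, known plus wild card, stays within the allowed range."""
    lo = math.ceil(min_total - known_mod_mass)
    hi = math.floor(max_total - known_mod_mass)
    for position in range(len(peptide)):
        for delta in range(lo, hi + 1):
            if delta != 0:  # a zero shift is not a modification
                yield position, delta
```

Under these assumptions an unmodified nine-residue peptide has 130 admissible nonzero shifts at each of nine positions, which is why the wild-card search is run only against the small database.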
Rather than relying only on ByOnic and ComByne's internal p-value computations, which were built with training data from a variety of instruments and samples, we also estimated statistical significance with an empirical E-value specific to this data set. For both the wide and narrow searches, we estimated the expected number of random identifications with a given ByOnic score by running a search against a database containing only decoy proteins (reversed sequences). This search used the same mass spectra and precursor charge assignments, and searched the decoy database using the same modifications and cleavage specificity.
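One simple reading of this empirical E-value is the number of matches in the decoy-only search that score at least as high as the match in question. The sketch below implements that reading; it is an assumption about the procedure, and the function name and input list are hypothetical.

```python
from bisect import bisect_left
from typing import Sequence


def empirical_e_value(decoy_scores: Sequence[float], score: float) -> int:
    """Count decoy-only matches scoring at or above the given score.

    With the same spectra, charge assignments, modifications, and cleavage
    rules, this count estimates how many random matches reach that score.
    """
    ordered = sorted(decoy_scores)
    return len(ordered) - bisect_left(ordered, score)
```

A match whose score is exceeded by no decoy match gets an estimate of zero from this counting rule, so in practice the tail would need to be smoothed or extrapolated to assign E-values below 1.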
After searching the spectra from the bone sample, we obtained from John Asara 7085 MS/MS (Thermo LTQ) spectra from a sample of the sediment surrounding the T. rex fossil bone. The sediment serves as a control to guard against the possibility that proteins from the environment could be mistaken for proteins from the bone fossil. We searched these spectra in the same way as our wide search, again using all protein sequences in uniprot_sprot.fasta along with reversed protein sequences.
ComByne's list of protein groups from the wide search on the bone sample, ranked by logarithm of the p-value (confidence), is shown in Table 1. This list is from the search assuming camC; without this assumption hemoglobin subunit alpha at rank 9 is lost. ComByne's log p-value reflects protein length along with the p-values for all the spectra matched to that protein, so according to ComByne's probabilistic model the matches to the top protein should appear by chance with probability $10^{-48.05}$. Assuming statistical independence of distinct collagen peptides, the matches to the top two collagens should appear by chance with probability $10^{-5.47-4.96}$, or about $10^{-10}$. The assumption of independence may be questionable because collagen is highly self-similar, but even similar sequences generally have quite different mass spectra, as a single substitution shifts roughly half of the ion peaks. The best log p-value achieved by any of the 408,099 decoy proteins was −4.60 and only two other decoys had log p-value below −2.89, so we would expect about three false positives among the 40 proteins in Table 1, for a false discovery rate of about 7.5%. ComByne's p-values are generally more conservative than empirical significance testing: we would expect about 400 out of 400,000 random proteins to have p-value below 0.001 and log p-value below −3.0. The multiorganism protein database, however, is quite redundant, so its effective size is much lower than 400,000.
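The arithmetic in this paragraph can be checked with a few lines. The helper names below are introduced only for illustration; the inputs in the usage note are the figures quoted in the text.

```python
from typing import Iterable


def combined_log10_p(log10_ps: Iterable[float]) -> float:
    """Under independence, probabilities multiply, so log10 p-values add."""
    return sum(log10_ps)


def false_discovery_rate(expected_false: float, reported: int) -> float:
    """Expected fraction of reported proteins that are false positives."""
    return expected_false / reported


def expected_random_hits(n_proteins: int, p_value: float) -> float:
    """Expected number of random proteins at or below a p-value threshold,
    if p-values were exactly calibrated."""
    return n_proteins * p_value
```

With the values above, `combined_log10_p([-5.47, -4.96])` is about −10.43, `false_discovery_rate(3, 40)` is 0.075, and `expected_random_hits(400_000, 0.001)` is about 400, far more than the three decoys actually observed below −2.89, which is the sense in which ComByne's p-values are conservative.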
Table 2 gives the peptide identifications for all the collagen and hemoglobin matches from either the wide or narrow search with ByOnic scores above 250, which is roughly equivalent to a Mascot score of 25. A match with a ByOnic score of 300 has an empirical E-value of about 100 in both the narrow and wide searches (Figure 1), so it could be discounted as statistically insignificant. The probability of any given spectrum hitting a collagen protein by chance within either database, large or small, is less than 0.01, however; factoring this into the calculation would bring the E-value of such a match to below 1. (Again there is an assumption of independence: that the protein identity and the ByOnic score are statistically independent.) Indeed, reversed collagen and hemoglobin sequences received no matches with scores above 250. The number of spectra drops roughly linearly on the logarithmic scale of Figure 1; others have observed that the right-hand tail of the score distribution for most peptide scorers is well modeled by an exponential distribution [6, 10].
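The two calculations in this paragraph, restricting an E-value to a protein family and modeling the score tail as exponential, can be sketched as follows. Both functions are illustrative assumptions rather than the authors' code; the log-linear least-squares fit in particular is one simple way to model an exponential tail, not necessarily the method behind Figure 1.

```python
import math
from typing import Sequence, Tuple


def restricted_e_value(e_value: float, family_hit_probability: float) -> float:
    """Scale an E-value by the chance that a random match lands in a given
    protein family, assuming score and protein identity are independent."""
    return e_value * family_hit_probability


def fit_log_linear_tail(
    scores: Sequence[float], counts: Sequence[float]
) -> Tuple[float, float]:
    """Least-squares fit of log10(count) = a + b * score; returns (a, b).

    A straight line on a log scale corresponds to an exponential tail.
    Counts must be positive.
    """
    ys = [math.log10(c) for c in counts]
    n = len(scores)
    mean_x = sum(scores) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in scores)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(scores, ys))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b
```

For the example in the text, `restricted_e_value(100, 0.01)` gives 1, and since the stated collagen hit probability is below 0.01 the restricted E-value falls below 1. A fitted line from `fit_log_linear_tail` could be used to extrapolate E-values to scores higher than any decoy match.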
Asara et al. deposited six T. rex collagen sequences in GenBank: GATGAPGIAGAPGFPGARGAPGPQGPSGAPGPK, GSAGPPGATGFPGAAGR, GVQGPPGPQGPR, and GVVGLPGQR from collagen alpha-1 type I, GLVGAPGLRGLPGK from collagen alpha-1 type II, and GLPGESGAVGPAGPIGSR from collagen alpha-2 type I. Of these, we confirm the first three sequences with E-values below 1.0. We find the fourth sequence with an E-value on the order of 10.0, and can argue for its correctness based on the unlikelihood of hitting collagen at random, as above. We do not find the last two sequences and suggest that they be dropped from GenBank.
Several of the sequences in Table 2, for example, GLAGPQGPR, PGPQGPSGAP[+16]GPK, GAPGPQGP[+16]AGAPGP[+16]K, GVVGLP[+16]GQR, and P[+16]GC[+57]P[+16]GPMGEK, do not appear in the published partial ostrich sequence, but the published sequence has only about 30% coverage of collagen alpha-1, so we cannot rule out ostrich. The peptide P[+16]GC[+57]P[+16]GPMGEK is from mammalian collagen alpha-4, which is found in extracellular matrix but not bone. This hit (score 345) could be a false match, or it could be a collagen alpha-1 or alpha-2 peptide from an unsequenced organism, as all the collagens are similar at the sequence level.
As reported by Asara et al., the proteins in Table 1 are mainly known contaminants in biochemistry laboratories (human keratin, bovine serum albumin, etc.), proteins from soil bacteria (Acidovorax, Verminephrobacter, Polaromonas, Zymomonas, etc.) and other organisms plausibly found in soil, such as Schizosaccharomyces (a yeast), Physcomitrella (a moss), and Neospora caninum (a parasite infecting dogs and cattle). Arachis hypogaea (peanut) allergen appears out of place, but the matches to peanut may also match some protein in Hevea brasiliensis, the source of the latex in laboratory gloves and of protein 33 in Table 1, as cross-reactivity between natural latex and peanut has been reported. Also as reported by Asara et al., there are very few vertebrate proteins in Table 1, so the matches to bird-like collagen and bird hemoglobin stand out. Tubulin is a vertebrate protein, but the tubulin peptides found in the sample are also found in many invertebrates. The only high-scoring match to actin, the peptide AGFAGDDAPR, matches 273 actins in uniprot_sprot.fasta, indeed almost all sequenced eukaryotes.
The search of the spectra from the sediment sample (Table 3) yielded many fewer proteins, with keratins and trypsin the only statistically significant finds in this relatively small data set. There were four low-scoring, statistically insignificant, hits to interesting proteins: two to vertebrate collagens, one to bird mimecan, and one to bird hemoglobin subunit beta (Q[-17]LISGLWGK from ostrich hemoglobin subunit beta with score 247). Because collagen and hemoglobin are found in the surrounding sediment in such trace quantities, if they are found at all, we do not think these hits indicate a source of collagen and hemoglobin other than the fossil bone.
In summary, we find nothing obviously wrong with the Tyrannosaurus rex mass spectra: the identified peptides seem consistent with a sample containing old, quite possibly very ancient, bird-like bone, contaminated with only fairly explicable proteins. Hemoglobin and collagen are plausible proteins to find in fossil bone, because they are two of the most abundant proteins in bone and bone marrow. Schweitzer et al. previously reported multiple lines of evidence, including immunological reactions, for hemoglobin-derived compounds in T. rex bone, and collagen from younger fossil bones is well known. Contamination remains a tricky and possibly unresolvable issue for this particular sample. Perhaps a bird died on top of the T. rex excavation in the field; perhaps ostrich bone lingered in the mass spectrometry facility for a year; or perhaps avian collagen from a cosmetic or medical product found its way into the T. rex sample. Complete sequencing of ostrich collagen would help dispel one contamination scenario. In just-published work on an 80-million-year-old hadrosaur fossil, Schweitzer et al. took extra precautions against contamination, including excavation with sterilized tools and analysis of the fossil extracts by more than one mass spectrometry laboratory. So far this new study has met with much less skepticism. It is fair to say that the scientific community is still working out standards for sample handling and data analysis of fossil protein.
Marshall Bern was supported in part by NIH grant GM085718. We would like to thank Pavel Pevzner for asking for the release of this intriguing data set, and John Asara for making the data available to the scientific community.