Samples and data
We obtained fresh frozen endometrial cancer tissue specimens from 91 patients with stage I endometrial cancer, and seven age-matched normal endometrial samples from post-menopausal women. The proteomic spectral count data analyzed were the sum of four LC-MS/MS analyses from two laboratories. The combined analyses yielded 11,879 distinct UniProt protein accessions (ACCs) across all samples and both instruments. Transcript analysis using the Affymetrix U133 Plus 2.0 chip provided gene expression data. Details are in the Methods section.
We restricted the focus to bioinformatics resources providing direct mappings between UniProt ACCs and Affymetrix probeset IDs. The identifier mapping systems examined were the Affymetrix NetAffx Analysis Center[16]; ENFIN's EnVision and Ensembl resources[17]; and the DAVID resource (Database for Annotation, Visualization, and Integrated Discovery) from NIAID[18]. Details of the annotation systems and our methods for accessing them are provided in the Methods section at the end of this paper. Submitting the list of 11,879 UniProt ACCs to each resource provided putative matches to probesets on the Affymetrix U133 Plus 2.0 chip. We refer to the mapping retrievals obtained interactively from NetAffx as NetAffx_Q. We label results from processing downloaded data files as NetAffx_F or Aff_F. We refer to results from the EnVision query web services as EnVision_Q. We refer to results obtained programmatically from the Ensembl GUI as EnSembl_F. The labels DAVID_Q or D_Q refer to results obtained from the DAVID web service application programming interface (API). In addition, we obtained DAVID Knowledgebase files by request from the NIAID. We label probeset match retrievals from DAVID Knowledgebase files as DAVID_F or D_F. In general the suffix "Q" refers to methods using a direct query method, whether programmatic or interactive, while "F" refers to methods using downloaded files.
Our results concentrate primarily on the "Q" query-based retrievals; similarities and contrasts with the "F" file-based retrievals are noted as appropriate. To evaluate performance changes over time we also present earlier results, obtained variously in 2008 and 2009, labeling the earlier results as above but with the suffix "_8".
Distribution of the number of probesets retrieved by each resource
Beginning with the UniProt ACCs returned by Sequest[20] for our proteomic tandem mass spectrometry experiment, we characterized each annotation resource individually by the distribution of the number of corresponding probeset IDs returned for each ACC. Table shows the percentage of ACCs with 0, 1, 2, 3 or more probesets found.
Distribution of the number of probesets retrieved for each UniProt ACC, by bioinformatics resource.
The proportion of ACCs returning at least one corresponding probeset ID was 84.7% for DAVID_Q, 73.6% for EnVision_Q, and 72% for NetAffx_Q. EnVision_Q returned the fewest probesets, and was most likely to return only a single match (38.2% versus 27.1% for DAVID_Q and 31.2% for NetAffx_Q).
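As a minimal sketch of how such a distribution can be tabulated, the following Python fragment counts the probesets returned per ACC and converts the counts to percentages. The function name, the dictionary layout, and the toy identifiers are assumptions made for illustration; they are not the code or data used in this study.

```python
from collections import Counter


def count_distribution(mapping, accs, cap=3):
    """Percentage of ACCs returning 0, 1, 2, ... or `cap`-or-more probesets.

    `mapping` is a hypothetical dict {ACC: set of probeset IDs}; ACCs absent
    from the dict are treated as returning no probesets.
    """
    counts = Counter(min(len(mapping.get(acc, ())), cap) for acc in accs)
    total = len(accs)
    return {k: 100.0 * counts.get(k, 0) / total for k in range(cap + 1)}


# Toy data with made-up identifiers, for illustration only.
toy_mapping = {
    "ACC_A": {"ps_1"},
    "ACC_B": {"ps_2", "ps_3"},
    "ACC_C": {"ps_4", "ps_5", "ps_6", "ps_7"},
}
toy_accs = ["ACC_A", "ACC_B", "ACC_C", "ACC_D"]
dist = count_distribution(toy_mapping, toy_accs)
```

The last key of the result (3 here) collects every ACC with three or more probesets, mirroring the "3 or more" column described above.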
Figure shows the behavior of the large-cardinality retrieved match sets. EnVision_Q delivers fewer matches overall, and also a lower proportion of intermediate-sized and large sets. NetAffx_Q delivers a few large sets. For example, for one UniProt ACC, A2NYU9, NetAffx_Q returned 40 probesets while DAVID_Q and EnVision_Q each returned only one probeset. The reasons for these discrepancies may be diverse. From NetAffx_Q there were 81 large match sets (40 probesets or more), associated with ten immunoglobulin heavy chain genes and one gene coding for a zinc finger protein.
Distribution of the number of probesets retrieved for each UniProt ACC. Complementary empirical cumulative distribution functions, plotted on a vertical log scale to emphasize differences.
Pairwise comparisons of identifier mapping resources
We compared each pair of annotation resources by constructing for each UniProt ACC the intersection and set differences between the two probeset lists mapping to that ACC. The results across all ACCs were grouped according to whether there were no matches in either resource, no matches in one resource with one or more matches in the other, two identical non-empty match sets, one match set but not the other reporting extra matches (containing the other match set), or extra matches reported by each resource. The fountain plot of Figure 2 compares NetAffx_Q and DAVID_Q in this way. The ACC counts and proportions appear at the left of the figure. The figure is constructed by stacking 11,879 horizontal lines; each horizontal line is one ACC. It therefore shows the classification and probeset retrieval size, for each of the 11,879 UniProt ACCs, by category. Within each category the results are sorted vertically by the size of the intersection, followed by the sum of the two retrieval set sizes (the length of each horizontal line).
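The grouping described above can be sketched with ordinary set operations. The Python fragment below is an illustration only: the category labels and function names are hypothetical, and the case of two non-empty sets that share nothing is assigned here to the "both extra" category, which is an assumption about how such cases are handled.

```python
def classify_pair(set_a, set_b):
    """Assign one ACC to a comparison category from its two probeset sets."""
    a, b = set(set_a), set(set_b)
    if not a and not b:
        return "neither"
    if not b:
        return "A only"
    if not a:
        return "B only"
    if a == b:
        return "identical"
    if b < a:  # A contains all of B plus extra matches
        return "A extra"
    if a < b:  # B contains all of A plus extra matches
        return "B extra"
    return "both extra"  # each resource reports matches the other lacks


def sort_key(set_a, set_b):
    """Within-category ordering: intersection size, then total retrieval size."""
    a, b = set(set_a), set(set_b)
    return (len(a & b), len(a) + len(b))
```

Applying `classify_pair` to every ACC and sorting each category with `sort_key` gives the row order of a plot of this kind.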
Figure 2 Fountain plot comparing retrievals from NetAffx_Q and DAVID_Q. Horizontal axis: For each UniProt ACC, the number of probesets retrieved from NetAffx_Q (left of zero) and DAVID_Q (right of zero). Vertical axis: Each horizontal slice is one ACC (11,879 ...
The extent of the disagreements among resources is substantial. NetAffx_Q and DAVID_Q returned identical non-empty answers for 52.5% of the ACCs (for 10.8%, no probesets were found by either resource). The fountain plots comparing EnVision_Q with NetAffx_Q and with DAVID_Q show less agreement; fewer than half of the ACCs returned identical answers. Furthermore, each application returned 1, 2, and occasionally many more probesets for a substantial number of ACCs for which the other two resources produced no matches. This result is exemplified by the horizontal extent of the red and blue portions of the lines in the three fountain plots.
Fountain plot comparing web interfaces: NetAffx_Q vs EnVision_Q.
Fountain plot comparing web interfaces: DAVID_Q vs EnVision_Q.
In the past two years, the proportion of ACCs matched to identical non-empty lists has increased somewhat for NetAffx and DAVID (from 39% to 52%) and for NetAffx and EnVision (from 31% to 45%), but decreased slightly for DAVID and EnVision (from 39% to 36.5%). For NetAffx, the number of ACCs with at least one match not previously present is 1899, and the number of ACCs previously present but no longer matching is 504; the proportion of ACCs with identical non-empty NetAffx lists between 2008 and 2010 is 46% (5529/11879).
Contrasts between online query and file download services were of interest. Comparing DAVID_Q with DAVID_F (Figure ), there are respectively 2.6% and 4.6% excess matches, as well as 5% and 9.9% additional (in excess of up to 19) matches. In contrast, NetAffx_Q and NetAffx_F yield exactly the same match sets; however, this perfect agreement may be an accident of timing, since we have previously found them to differ. EnVision_Q and EnSembl_F differed for only two UniProt ACCs.
Fountain plot comparing DAVID file-based to query-based ID mappings.
From personal correspondence with the NetAffx and DAVID resource teams we learned that the tradeoffs between speed and timeliness of access on the one hand and reliability on the other differ considerably between the resources. According to the developers of DAVID, the files available on request are more accurate and more current than the DAVID web query service. On the other hand, obtaining the DAVID file data cannot at present be fully automated, since it requires sending a request to the DAVID team and waiting for a temporary download URL. The online query to the DAVID database would be preferable to the file download when the wait is not acceptable. In contrast, access to the Affymetrix files is instant. However, from discussions with an Affymetrix representative, we learned that the results of a NetAffx query gradually evolve with limited curation between releases of the Affymetrix annotation file, which is fully manually curated and released roughly quarterly. Software tools provided by Affymetrix generally use the annotation files, not live web queries. Due to the timing of our recent accesses, the two recent NetAffx retrievals were identical.
Mapping all probesets to ACCs
So far our results use the ACCs provided by the Sequest database search in the analysis of a particular set of 98 samples as inputs to the mapping resources, which represents an archetypal use case for ID mapping. Reversing the direction of the mapping, one can also use all of the probeset IDs on a microarray as inputs, to characterize all ACCs that would be mapped, independent of any particular proteomic experiment. With the U133 Plus 2.0 array, the total numbers of ACCs retrieved are seen in Table . DAVID_F returns ACCs for the human ALU probeset affx-hum_alu_at (5165). The other occasions of very high counts are probesets that map to MHC genes.
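A reverse map of this kind can be summarized in a few lines. The sketch below assumes a hypothetical table keyed by probeset ID, with the set of ACCs each probeset retrieves; the function name, the layout, and the toy identifiers are illustrative assumptions, not the data structures used in this study.

```python
def summarize_reverse_map(probeset_to_accs, top=2):
    """Summarize a hypothetical {probeset ID: set of ACCs} table."""
    # Union of every ACC retrieved by any probeset.
    distinct_accs = set().union(*probeset_to_accs.values())
    # Number of probesets that retrieved at least one ACC.
    mapped = sum(1 for accs in probeset_to_accs.values() if accs)
    # Probesets with the largest ACC sets (candidates for inspection).
    largest = sorted(
        probeset_to_accs,
        key=lambda p: len(probeset_to_accs[p]),
        reverse=True,
    )[:top]
    return {
        "distinct_accs": len(distinct_accs),
        "mapped_probesets": mapped,
        "largest": largest,
    }


# Toy data with made-up identifiers, for illustration only.
toy = {"ps_1": {"ACC_A"}, "ps_2": {"ACC_A", "ACC_B", "ACC_C"}, "ps_3": set()}
summary = summarize_reverse_map(toy)
```

Listing the probesets with the largest ACC sets is one simple way to surface cases such as the very high counts noted above.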
UniProt ACCs retrieved by mapping all probesets on U133 Plus 2.0
The large discrepancy for DAVID_Q appears related to its higher conformity to SwissProt; 74% of the ACCs returned by DAVID_Q are in SwissProt, versus 23.5% for NetAffx_Q and 25.4% for EnVision_Q. In comparison, of the 11,879 ACCs originally returned by Sequest for the MS/MS experiment, 80.0% are in SwissProt. Among those, the subsets mapping to at least one probeset match by DAVID_Q, NetAffx_Q and EnVision_Q are all primarily in SwissProt (89.7%, 86.3%, and 92.% respectively). Thus, the three resources are much more similar on a "real-world" set of ACCs from an experiment than one would expect from the comprehensive probeset-to-ACC maps tabulated in this section.
Total numbers of ACCs returned for the U133 Plus 2.0 array.
For each service, the collection of probesets with at least one mapped ACC.
Annexin 2: Example of variation of transcriptome-proteome correlations for individual proteins
To study mappings for individual proteins, we utilized the MS/MS and U133 Plus 2.0 microarray data sets described above. For each match of an ACC to a probeset ID, we merged the corresponding subsets of MS/MS and microarray data by subject ID.
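The merge-then-correlate step can be sketched as follows. This pure-Python version is an illustration under stated assumptions: the two inputs are hypothetical dictionaries keyed by subject ID, the function names are invented here, and the Spearman coefficient is computed as the Pearson correlation of average ranks. It is not the software used for the analyses in this paper.

```python
def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; raises ZeroDivisionError if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def spearman_by_subject(counts, signals):
    """Merge two {subject ID: value} dicts on shared IDs, then correlate ranks."""
    shared = sorted(set(counts) & set(signals))
    x = [counts[s] for s in shared]
    y = [signals[s] for s in shared]
    return pearson(average_ranks(x), average_ranks(y))


# Toy values for illustration only; subject "s5" has no spectral count
# and is dropped by the merge.
toy_counts = {"s1": 0, "s2": 3, "s3": 5, "s4": 9}
toy_signals = {"s1": 7.1, "s2": 7.9, "s3": 8.4, "s4": 9.6, "s5": 6.0}
rho = spearman_by_subject(toy_counts, toy_signals)
```

Restricting to shared subject IDs before ranking matters, because ranks computed on unmatched samples would not be comparable across the two data sets.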
We consider one protein that appears to be elevated in abundance in endometrial cancer relative to normal tissue, annexin 2 (UniProt ACC = P07355). Retrievals are shown in Table .
Probeset retrievals for annexin A2, and Spearman correlations with annexin A2 spectral counts.
The two scatterplots show merged data for the probesets with the best (213503_x_at) and the worst (1568126_at) correlation. (The other probesets are strongly correlated with the ANXA2 spectral counts, and with 213503_x_at. One match, 211241_at, is new; it was not a match in DAVID_Q_8, EnV_8 or NetAffx_8. It has moderate correlations with the other probesets except 1568126_at.) The presence of strong correlations between protein spectral counts and most of the probesets reinforces confidence in the correct identification of the protein, and in the validity of the cancer-associated differential expression.
Scatterplot, 213503_x_at transcript signals versus Annexin 2 spectral counts, E = endometrioid cancer, S = serous cancer, N = normal.
Scatterplot, 1568126_at transcript signals versus Annexin 2 spectral counts, E = endometrioid cancer, S = serous cancer, N = normal.
The presence of one poor correlation does not lessen that confidence. This example highlights the fact that a poor correlation does not necessarily mean that the mRNA and protein levels are truly biologically decoupled. The cause of the poor correlation may be an incorrect mapping or a probe-specific assay anomaly. In the case of 1568126_at, further investigation yields an explanation: the NetAffx annotation grade for 1568126_at is "E", indicating mapping only to an expressed sequence tag. Note that the probeset ID "_x_at" quality tag, which indicates caution because some probes hit transcripts from different genes, does not provide guidance; in fact it corresponds to the best correlations. (Affymetrix documentation affirms that the ID assignment at the time of array design is necessarily permanent and reflects the limited knowledge at that time). Analysis of the sequences of the probeset target and individual probes is warranted, and under way, but beyond the scope of this study.
Evaluation of mapping correctness by correlation analysis
To assess the quality of the identifier matches, we performed merges as described in the previous section for every UniProt-probeset pair obtained from one of the mapping strategies. Each corresponding pair of protein spectral counts with microarray expression signals yielded a Spearman correlation. The rationale for examining the entire collection of correlations is as follows. High correlations would be likely, though not guaranteed, to indicate a correct ID match. Negative correlations or correlations plausibly generated by chance might indicate any of several possibilities: (a) the ID match could be incorrect; (b) any of several biological phenomena could cause message expression to fail to manifest proportionately in protein abundance; and/or (c) measurement error variance and bias could mask a true biological correlation. With these limitations in mind, the collection of correlations was used to evaluate how well each system generates correct matches. This analysis includes only the 480 ACCs with an average of at least 0.5 MS/MS spectral events per sample.
Each ACC-probeset match is classified according to the set of annotation resources which returned the match. Figure 8 shows the distributions of these correlations, grouped by this classification. (The seven groups are mutually exclusive.) From the distributions seen in this figure, one confirms the widely reported fact that protein expression and mRNA expression often do not correlate strongly. However, there are differences among the seven match groups. The nonparametric smooth density estimates of Figure 9 motivate the mixture characterization of the next section. The mixture model will accentuate the meaningful inter-group differences, which are between large positive correlations and all other correlations.
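With three resources, a pair can be returned by any non-empty subset of them, which gives 2^3 - 1 = 7 mutually exclusive groups. The Python sketch below shows one way to assign each pair to its group; the function name, the input layout, and the toy identifiers are assumptions made for illustration.

```python
def match_groups(pairs_by_resource):
    """Map each (ACC, probeset) pair to the frozenset of resources returning it.

    `pairs_by_resource` is a hypothetical dict
    {resource label: set of (ACC, probeset) pairs}.
    """
    groups = {}
    for resource, pairs in pairs_by_resource.items():
        for pair in pairs:
            groups.setdefault(pair, set()).add(resource)
    return {pair: frozenset(res) for pair, res in groups.items()}


# Toy data with made-up identifiers, for illustration only.
toy_pairs = {
    "NetAffx_Q": {("ACC_A", "ps_1"), ("ACC_A", "ps_2")},
    "DAVID_Q": {("ACC_A", "ps_1")},
    "EnVision_Q": {("ACC_A", "ps_1"), ("ACC_B", "ps_3")},
}
groups = match_groups(toy_pairs)
```

Because each pair receives exactly one frozenset, the resulting groups cannot overlap, which is what makes per-group box plots comparable.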
Figure 8 Correlation distributions by match group. Correlations between log(mRNA) levels and spectral counts. Box extends to the first and third quartile, with thick horizontal line at the median. The group "all" constitutes matches that all three resources returned ...
Figure 9 Estimated correlation distributions, nonparametric. Nonparametric smooth density estimates for selected ID pair subsets; for example, "DQ" labels the density estimate for the union of these disjoint groups from Figure 8: "D_Q only", "Affx &D_Q", ...
Evaluation of mapping correctness by mixture modeling
The totality of observed Spearman correlations for all ID pairings was fitted to a two-component mixture distribution, where one density component was centered very near zero and the other had a positive mean. The posterior probability of membership in the second component was used as the target variable in regression analyses for evaluating the ability of each system to identify possibly correct matches. Using this posterior probability rather than the correlation itself focuses the effort of prediction on the part of the correlation distribution of interest.
One component, centered at 0.032 with standard deviation 0.124, has weight 66%. The other component, centered at 0.260 with standard deviation 0.189, has weight 34%.
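To make the posterior probability concrete, the sketch below evaluates it by Bayes' rule from the reported weights, centers, and standard deviations. Treating both components as Gaussian densities is an assumption made here for illustration (the text above gives only centers and standard deviations), and the function names are hypothetical.

```python
import math

# Reported component parameters: weight, center, standard deviation.
W1, MU1, SD1 = 0.66, 0.032, 0.124
W2, MU2, SD2 = 0.34, 0.260, 0.189


def normal_pdf(x, mean, sd):
    """Gaussian density (assumed component shape, not stated in the text)."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def posterior_component2(r):
    """Posterior probability that correlation r belongs to the second component."""
    d1 = W1 * normal_pdf(r, MU1, SD1)
    d2 = W2 * normal_pdf(r, MU2, SD2)
    return d2 / (d1 + d2)


p_at_zero = posterior_component2(0.0)
p_at_high = posterior_component2(0.6)
```

Under this Gaussian assumption, a correlation near zero receives a low second-component probability and a correlation of 0.6 a probability close to one. Because the second component is the wider of the two, it also regains weight at strongly negative correlations, so the function is not monotone in r.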
Figure 10 shows that the mixture model fits remarkably well. We do not claim that the correlations in reality come from a mixture distribution, though that is possible. Even if the mixture model is correct, membership in the first component may represent an incorrect match or a true biological disconnect between mRNA and protein abundance, and membership in the second component may or may not represent correct matches, since chance can generate extreme values. Nevertheless, the mixture model is extremely useful in this setting since the probability of membership in the second group, compared to using the correlation itself, is more sensitive to large correlations and relatively insensitive to differences between correlations that are not among the larger values. Therefore it makes a more useful dependent variable for the regression analyses to follow. The box plots of Figure 11 are similar to the box plots of Figure 8, but display the second component posterior probabilities rather than the correlations. The differences between match groups are considerably enhanced.
Figure 10 Estimated correlation distributions, mixture model. The correlation distribution smoothed, as a mixture. The black line is the estimated mixture distribution; the red and green are the estimated mixture components. For comparison, the orange line is an ...
Figure 11 Distributions of the "large correlation component" probability, for each match group. Transformation of Figure 8, replacing the vertical correlation axis by the estimated probability of belonging to the second ("large correlation") component of the mixture ...
Linear regression analyses provided evaluations of the ability of each matching application to predict a high correlation. A linear regression model was fitted relating the presence or absence of a match in each mapping system to the component #2 probabilities.
The coefficient estimates from this model were: 0.119 for EnVision (P < 4 × 10^-10), 0.039 for DAVID (P = 0.13), and 0.038 for NetAffx (P = 0.08). So, for example, if a match is returned by EnVision, the second component probability increases by 12.6% (= exp(0.119) - 1). (Addition of total protein identification spectral count to the model did not affect the results. Here the weights were from the bootstrap analysis described in Methods. Similar results were returned when the normal theory weights were used.)
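The percentage quoted for EnVision follows from exponentiating the coefficient. The short Python check below reproduces that arithmetic for all three reported coefficients; reading the coefficients on a log scale is an assumption implied by the exp(0.119) - 1 conversion, not something stated independently, and the variable names are illustrative.

```python
import math

# Reported regression coefficients for each mapping system.
coefficients = {"EnVision": 0.119, "DAVID": 0.039, "NetAffx": 0.038}

# Relative change in the second-component probability, in percent,
# assuming the coefficients act on a log scale.
percent_change = {
    name: 100.0 * (math.exp(beta) - 1.0) for name, beta in coefficients.items()
}
```

For small coefficients the conversion changes little, since exp(b) - 1 is close to b, so the DAVID and NetAffx values stay near 4%.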
The results of this analysis suggest that a match in EnVision is more predictive of a good positive proteomic-genomic correlation, compared to matches provided by NetAffx or DAVID. However, this suggestion did not receive corroboration in head-to-head comparisons (example: pairs returned by EnVision but not NetAffx, compared to those returned by NetAffx but not by EnVision). Comparing pairs of disjoint groups from Figure 8, the one clear comparison, supported by large sample sizes, shows that a pair returned by DAVID and NetAffx is more likely to belong to the high-correlation cluster if it is also returned by EnVision (mean probabilities: 0.407 versus 0.290, P < 2 × 10^-7). (We have used Spearman correlations throughout. Pearson correlations yield a somewhat higher second component probability, 48% instead of 34%, but shifted to lower correlation values within that component; overall the conclusions of the regressions are very similar.)
Filtering the ACCs further by restricting to SwissProt changed these results little; in fact this dropped only 3% (5 out of 480) of the ACCs, and 3% (43 out of 1573) of the ACC-probeset pairs. This reflects the fact that, over all ACCs, the association between SwissProt status and total spectral count is strong (Figure ). SwissProt status is also associated with stronger correlations (Figure ).
QQ plot of total spectral counts by protein.
QQ plot for Pearson correlations between spectral counts and mRNA signal, restricting to 1573 pairs discussed above and shown in Figures 8, 9, 10, and 11.
Changes in the bioinformatics resources over time
In the past two years, there have been substantial changes in most of the services. The following table shows the numbers of probeset mappings gained, lost, and maintained.
Changes in probeset maps from proteomic experiment to U133 Plus 2.0.
A variety of analyses comparing added pairs to dropped pairs, or kept pairs to dropped pairs, revealed no evidence for NetAffx or for EnVision that the frequency of high correlations is changing. For DAVID_F, the ID pairs kept had significantly better correlations than those dropped (P = 2 × 10^-6); similarly for DAVID_Q (P = 0.0001). However, the 267 pairs added recently in DAVID_Q were not superior to those dropped.
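The gained, lost, and maintained categories used in this section amount to set differences and an intersection between two retrievals from the same resource. A minimal Python sketch follows; the function name and the toy identifiers are assumptions made for illustration and do not correspond to pairs from this study.

```python
def compare_releases(old_pairs, new_pairs):
    """Split (ACC, probeset) pairs into gained, lost, and maintained sets."""
    old, new = set(old_pairs), set(new_pairs)
    return {
        "gained": new - old,      # present now, absent in the earlier retrieval
        "lost": old - new,        # present earlier, absent now
        "maintained": old & new,  # present in both retrievals
    }


# Toy data with made-up identifiers, for illustration only.
earlier = {("ACC_A", "ps_1"), ("ACC_A", "ps_2"), ("ACC_B", "ps_3")}
recent = {("ACC_A", "ps_1"), ("ACC_B", "ps_4")}
changes = compare_releases(earlier, recent)
```

The three resulting sets are disjoint, so their correlation distributions can be compared directly, as in the added-versus-dropped and kept-versus-dropped comparisons above.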