Pioneering studies in environmental proteomics have revealed links between protein diversity and ecological function in simple ecological communities such as microbial biofilms. In the near future, high throughput proteomic methods will be applied to more complex ecological systems in which microbes and macrobes interact. Data structures in biodiversity and protein surveys have many similarities, so the statistical methods that ecologists use for analyzing biodiversity data should be adapted for use with quantitative surveys of protein diversity. However, increasing quantities of protein and bioinformatics data will not, by themselves, reveal the functional significance of proteins. Instead, ecologists should be measuring changes in the abundance of protein cohorts in response to replicated field manipulations, including nutrient enrichment and removal of top predators.
High throughput methods for studying and characterizing large numbers of proteins of uncultured biological samples (metaproteomics) now allow for cataloguing the proteins of component species of assemblages (community proteomics) or of parts of ecosystems, without precise knowledge of the organism that produced the protein (environmental proteomics) [1,2]. Pioneering studies have surveyed relatively simple, microbial-dominated assemblages (reviewed in [3,4]), sampled from habitats such as acid-mine drainages, microfilm surfaces of leaves, seawater and soil (Box 1).
Acid-rock and acid-mine drainages (AMDs) can be found wherever water moves through pyrite (iron sulfide: FeS2); examples include mineral-rich deposits in caves and piles of tailings from locations where sulfide minerals (including uranium and coal) are mined. AMDs have very low pH (often reaching 0.5–1.0), but can host highly productive ecosystems consisting of diverse bacteria, Archaea, and eukaryotes. Many of these taxa are chemoautotrophs that have evolved adaptations to heavy-metal environments. The chemical isolation of these communities from external sources of carbon and nitrogen, and their physical segregation from higher-pH assemblages in specific drainages, make them an ideal self-contained microecosystem for assessing proteomic diversity and function in a microbial system.
Banfield et al. sampled the AMD biofilm community at six sites along an environmental gradient (defined by distance from ore deposits and seepage flow) at the Richmond Iron Mountain Mine Complex in northern California. Diversity of microbes from water samples and biofilm scraped from rock surfaces was determined using a combination of in situ hybridization with fluorescent probes screened with epifluorescence and confocal microscopy, followed by direct DNA sequencing of entire genomes. The sequence data were then used to identify proteins from particular dominant organisms. Unknown sequences were screened against proteomic databases to identify common proteins. Functions of novel proteins were characterized further with biological observation, mass spectrometry, biochemical assays (including microtiter plates that can assay ~100,000 samples at a time for multiple biochemical functions) and protein-structural modeling.
These methods and results should be of great interest to all community and ecosystem ecologists, because patterns of differential protein production and expression reflect physiological responses to changing or stressful environmental conditions, including climatic change and the presence of predators. In addition, shifts in protein abundance and composition (best characterized through analyses of microbial processes) can indicate changes in magnitude or rates of material and energetic fluxes within and between ecosystems.
Although pioneering studies of microbes suggest directions for future work in environmental proteomics, it is unclear how easily results from microbial assemblages can be scaled to more complex ecological systems. Microbial systems appear to be structurally simpler than most terrestrial and aquatic ecosystems. Biofilm and microbial systems generally lack the photosynthetic primary producers, detritivores, and higher trophic levels that characterize typical “green” and “brown” food webs of interacting multicellular (“macrobial”) eukaryotes and prokaryotes. Technical challenges (including the identification of proteins from species without sequenced genomes, variation in the physical and biological conformation of proteins, in situ activity of isolated proteins and the difficulty of reliably extracting proteins from complex media such as seawater and soil) also hinder the application of proteomic (and genomic) approaches to macrobial assemblages and food webs. Nevertheless, as these challenges are overcome with new technologies and bioinformatics tools, proteomic surveys of a range of ecological communities and ecosystems will become an important complement to parallel metagenomic, metatranscriptomic and metametabolomic surveys. All of these methods generate complementary data, although from our perspective, proteins are the most desirable unit of study because they are most closely related to the functioning of ecosystems and because they are a more direct measure of the “molecular phenotype”.
Perhaps the most important practical result of ongoing technological advances is that ecologists will be able to survey proteomes efficiently and repeatedly and describe their variability in natural ecosystems, rather than ignoring variation and classifying species, communities and their proteomes using fixed typologies. For example, Figure 1 shows three proteomic “fingerprints” of samples from three replicate food webs inhabiting water-filled leaves of the pitcher plant Sarracenia purpurea. Each pitcher supports an independent, intact food web consisting of a resource base of captured insect prey (mostly ants and flies), a sub-web of bacteria, protozoa, algae, and fungi, and a suite of aquatic invertebrate larvae (flies, midges and mosquitoes) that interact as filter feeders, detritivores, omnivores, and top predators. The obvious variability in protein profiles among replicate food webs in the microbial component of each web (Figure 1) likely reflects differences in species composition and abundance of invertebrates and microbes within each leaf, along with differences in the quantity and composition of captured insect prey. At the same time, consistent protein bands suggest repeatable patterns and protein structure among replicate food webs. In an associated community proteomics survey, species-specific proteins and peptide signatures of the three most abundant macroinvertebrates were revealed, a first step towards proteomic profiling of an entire microbe-macrobe ecosystem.
In just two decades, cost-effective methods have been developed to increase throughput and comprehensive identification of proteins [20,21]; accurately quantify protein relative abundance [22,23]; enrich rare protein types and protein modifications [24,25]; and expand the capacity and precision of bioinformatic tools that enhance the functional interpretation of proteomic datasets [26,27]. Although some of these methods are still in their infancy, rapid technological and statistical advances are increasing daily the quantity and quality of information about proteins that can be sampled from functioning ecosystems. However, by itself, this flood of bioinformatics data is unlikely to reveal which particular proteins are important causes and consequences of community structure and ecosystem function.
An analogous disconnection between data density and functional inference plagued community ecology in the initial stages of documenting the diversity of species, their assembly, and their interactions within food webs. Pioneering naturalists and taxonomists (Rumphius, Linnaeus, Darwin, Wallace, and von Humboldt, among many others) collected, catalogued and named species from around the world. Beginning in the late 1800s, ecologists assembled detailed records of food web structure (extensive lists of “who eats whom”) in plant and animal assemblages. Now, species diversity and food web structure are encapsulated, respectively, in summary statistics and indices, and in network diagrams with nodes representing species and links representing trophic interactions including predation and parasitism.
But pattern identification and summary statistics are not the same as identifying causal processes, which requires experimental manipulations and statistical modeling of the results. In studies of biodiversity [30,31] and ecological food webs [32,33], there are long dialectical traditions of thesis, antithesis, and synthesis. The same dialectic could be applied profitably to proteomics research in ecology: characterizing and summarizing the diversity of proteins and their potential interactions, analyzing proteomic assemblages of experimentally manipulated food webs in the field and synthesizing the resulting data with statistical modeling.
Just as the first step to working with an ecosystem is identifying its constituent organisms, a first step towards understanding the functional significance of protein diversity of an ecosystem will be identifying and enumerating its distinct proteins and estimating their relative abundances. However, just as it is impossible to detect all of the species in a microbial or macrobial assemblage [34,35], so too it is unlikely that we will be able to detail an entire ecological proteome in the foreseeable future. Estimates of proteomic diversity in microbial ecosystems range from 10^4 to 10^9 expressed proteins. There are approximately 6,000–60,000 protein-encoding genes in a single prokaryote or eukaryote species' genome, and even a single tissue sampled from a single multicellular species contains hundreds or even thousands of distinct proteins. In ecological assemblages that contain both microbes and macrobes, proteomic diversity likely will exceed 10^10 expressed proteins because, unlike the genome, the proteome of a multicellular organism is not constant during its lifetime, but changes ontogenetically and in response to changing biotic and abiotic conditions.
For proteins, as for species, ecologists must remain satisfied with sampling only a small fraction of the potential diversity. The resulting data from surveys of both ecosystems and proteomes are lists of the component species (some of which might be taxonomically ambiguous) or proteins (some of which might be incompletely characterized because a single tryptic peptide might map to several, typically related, proteins) and an estimate of relative abundances (usually from counts of individual organisms or counts of peptides from different proteins). Statistical distributions of both species and proteins are also similar: there are usually a small number of relatively common species (or proteins), a large number of relatively rare species (or proteins) and an unknown number of species (or proteins) that are present, but are not detected in the sample. The observed distribution of relative abundances of species is roughly log-normal in shape, although the precise statistical characterization continues to be controversial. However, the observed distribution of the relative abundance of proteins has not yet been analyzed with the same statistical tools that ecologists have used to analyze the relative abundance of species.
Over the past several decades, ecologists have developed a unified toolbox of non-parametric statistical methods for analyzing such data. Rarefaction methods interpolate diversity estimates of small random subsamples of data, allowing investigators to compare multiple, standardized data sets on the basis of common sampling effort. Asymptotic estimators can extrapolate diversity data to estimate the minimum number of undetected species (or proteins) in a sample, based on the frequencies of the rare species that are present. The amount of sampling effort that would be needed to reveal these undetected elements also can be calculated. Finally, sample variances and confidence intervals characterize the uncertainty associated with such interpolations and extrapolations.
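As a minimal sketch of the interpolation and extrapolation steps described above, the following Python functions compute individual-based rarefaction (using Hurlbert's expected-richness formula) and the Chao1 lower-bound estimator from a list of abundance counts. These two formulas are standard choices and are assumed here for illustration; the function names and the example counts are hypothetical and are not taken from the studies cited.

```python
from math import comb


def rarefied_richness(counts, n):
    """Expected number of distinct types (species or proteins) in a random
    subsample of n individuals drawn without replacement."""
    total = sum(counts)
    if not 0 <= n <= total:
        raise ValueError("n must lie between 0 and the total count")
    denom = comb(total, n)
    # comb(total - c, n) is 0 when fewer than n individuals belong to other
    # types, i.e. when the type is certain to appear in the subsample.
    return sum(1 - comb(total - c, n) / denom for c in counts if c > 0)


def chao1(counts):
    """Chao1 estimate of the minimum total richness, including undetected
    types, based on the numbers of singletons and doubletons."""
    observed = sum(1 for c in counts if c > 0)
    f1 = sum(1 for c in counts if c == 1)  # types seen exactly once
    f2 = sum(1 for c in counts if c == 2)  # types seen exactly twice
    if f2 > 0:
        return observed + f1 * f1 / (2 * f2)
    # Bias-corrected form, used when there are no doubletons.
    return observed + f1 * (f1 - 1) / 2


# Hypothetical counts for seven detected types.
example = [10, 5, 2, 2, 1, 1, 1]
print(rarefied_richness(example, 10))  # expected richness in a subsample of 10
print(chao1(example))                  # lower bound on total richness
```

Comparing two samples at the same subsample size n is what places them on a common sampling effort; the Chao1 value is a lower bound, so the true number of undetected types may be larger.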
These statistical tools have not yet been widely adapted for protein studies; Koziol et al. is an exemplary study, and the only published example we know of in which biodiversity sampling models were applied to proteomic data (Figure 2). Although it might seem premature to discuss statistical methods for data that have not yet been collected at the ecosystem and community level, we think it is critical to look forward and anticipate methods of analysis for proteomic data that will emerge in future studies. Such analyses might reveal, for example, whether the relationship between protein diversity and ecosystem function mirrors that of species diversity and ecosystem function.
Analysis of proteomic data differs from analysis of biodiversity samples in one crucial way, however. To characterize a proteome, the proteins are proteolytically digested into smaller peptides prior to analysis; this is analogous to taking the species sampled, shredding them into small pieces, and then re-assembling, identifying, and counting the reconstituted organisms! Considerable efforts in bioinformatics are devoted towards statistically reconstructing the original protein(s) from these peptide fragments. This added twist makes it difficult to apply traditional analyses of random samples of “individuals” to the estimation of relative abundances of different protein types. In particular, standard rarefaction of peptide frequency data will be biased: assuming a similar distribution in the size and ionization efficiency of peptides from each protein, the frequency of large proteins will be overestimated and the frequency of small proteins will be underestimated. However, using the number of amino acids in each identified protein, a simple modification to rarefaction is possible that will help to correct such biases (Figure 3).
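One way to express a length correction of this kind, offered as a sketch and not necessarily the modification illustrated in Figure 3, is to divide each protein's peptide count by its number of amino acids and then renormalize, so that rescaled abundances (not raw peptide frequencies) feed into any later resampling. The function name and the example values below are hypothetical.

```python
def length_corrected_abundance(peptide_counts, protein_lengths):
    """Relative abundance of each protein after dividing its peptide count
    by its length in amino acids (an assumed, simple correction)."""
    if peptide_counts.keys() != protein_lengths.keys():
        raise ValueError("both mappings must cover the same proteins")
    per_residue = {
        protein: peptide_counts[protein] / protein_lengths[protein]
        for protein in peptide_counts
    }
    total = sum(per_residue.values())
    if total == 0:
        raise ValueError("at least one peptide count must be positive")
    return {protein: value / total for protein, value in per_residue.items()}


# Hypothetical example: two proteins present in equal molar amounts.
# The larger protein yields three times as many peptides.
counts = {"large_protein": 30, "small_protein": 10}
lengths = {"large_protein": 900, "small_protein": 300}
print(length_corrected_abundance(counts, lengths))
```

In this example the raw peptide proportions are 0.75 and 0.25, whereas the length-corrected proportions are equal, which is the direction of correction the text describes. The sketch relies on the same simplifying assumption stated above: peptides from different proteins are similar in size and ionization efficiency.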
Proteomics data, like biodiversity data, increasingly are available over the Internet from institutional data repositories such as the Proteomics Identifications database (PRIDE; http://www.ebi.ac.uk/pride). Unlike ecological data, however, proteomic data come in only a few, highly standardized forms (mass spectra and protein identifications supported by peptide identifications) that are routinely (and mandatorily) submitted to data archives as part of the manuscript submission and publication process. Proteomic data repositories require detailed metadata about the origins, processing, and analysis of samples, and they have controlled vocabularies (ontologies) for annotating data and metadata that can be searched with predefined computational algorithms and structured work-flows. Thus, these data repositories provide an additional, as yet untapped, resource for exploring patterns of proteomic diversity.
In summary, there are remarkable similarities between biodiversity data and environmental and community proteomics data: in the constraints and challenges of sampling, in the form of the data that result from such sampling, and in the availability of archived data and work-flows for analyzing them. With adjustments to account for bias in protein size, the statistical framework developed by ecologists to characterize and summarize biological diversity will be applicable to protein diversity data.
Summary metrics and statistical analyses are only the first step forward with proteomic data. We are more interested in how proteins interact in an assemblage of organisms or in a complete ecosystem, and expect that an understanding of these interactions will provide new insights into the workings of ecological processes. As noted earlier, the sheer number of proteins in even a single tissue suggests that identifying each and every function for each and every protein in an ecological system will be a long time coming, and might not be informative for ecologists working with typical food webs of macrobes and microbes. Mechanistic studies of food webs have grappled with the analogous question: do we need to know the identity and role of every species in a food web, or can we reduce the inherent complexity of food webs by grouping species into trophic categories (consumers, detritivores, predators), functional feeding groups (shredders, scrapers, filter-feeders), or taxonomic/ecological guilds (tube-building polychaetes, seed-eating finches) that still provide robust insights into mechanisms driving ecological dynamics? When these simplified groupings have been organized into abstracted webs that have few nodes and links, they have proven to be more amenable to mechanistic modeling. The predictions of these simplified models have been tested directly by adding or removing species at different trophic levels, manipulating nutrients or basal resources, or quantifying the responses of the rest of the assemblage to such perturbations.
A similar approach might profitably be used to understand the functional significance of proteins in ecological systems. This would be achieved most easily by grouping proteins (or their homologs) by class using gene ontology databases in which proteins are classified into functional categories based on similar biological processes, biochemical activities, and subcellular localization [53,54]. Indeed, one of the key reasons that proteomics might yield new ecological insights is that many species, particularly of microbes, might be functionally equivalent because they produce similar proteins with similar functions in similar environments [55,56]. This kind of ecological redundancy is a key concept in the study of ecosystem function that already allows ecologists to simplify the vast diversity that characterizes natural assemblages.
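The grouping step can be sketched as a simple aggregation of protein counts by functional category. The snippet below assumes a pre-built lookup from protein identifier to category label; in practice that lookup would come from a gene ontology annotation source, which is not modeled here. The identifiers, category names, and function name are all hypothetical.

```python
from collections import defaultdict


def counts_by_category(protein_counts, category_of, unknown="unclassified"):
    """Sum protein counts within each functional category.

    protein_counts: mapping of protein identifier -> observed count
    category_of:    mapping of protein identifier -> category label
                    (assumed to be prepared beforehand from an ontology)
    Proteins missing from category_of are pooled under `unknown`.
    """
    totals = defaultdict(int)
    for protein, count in protein_counts.items():
        totals[category_of.get(protein, unknown)] += count
    return dict(totals)


# Hypothetical example: proteins from different species that share a function
# collapse into one node, much as species collapse into a trophic category.
observed = {"sppA_transporter": 5, "sppB_transporter": 3, "orphan_protein": 2}
lookup = {"sppA_transporter": "transport", "sppB_transporter": "transport"}
print(counts_by_category(observed, lookup))
```

Keeping an explicit "unclassified" pool makes visible how much of the sample cannot yet be assigned a function, instead of silently dropping those proteins.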
Thus, rather than trying to characterize all of the uncommon proteins in an assemblage, ecologists should first concentrate on the small number of identified proteins that change in abundance either from common to rare or rare to common in response to experimental manipulations. By adding or removing species and species groups, ecologists will discover which kinds of proteins are involved in the response. This molecular phenotyping of entire assemblages should provide insight into how these proteins function, and repeated sampling might reveal temporal dynamics as well. Molecular phenotyping and recognizing proteins' functions do not require that functionally important proteins be common in natural assemblages. When proteins that are absent or rare in control assemblages become abundant in response to experimental perturbations, there is good reason to suspect they have an important function in the ecosystem.
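A minimal sketch of this screening idea follows: given protein counts from a control and a manipulated assemblage, it reports the proteins whose status switches between "rare" and "common". The cutoff that separates rare from common is an arbitrary assumption made for illustration (the text does not define one), as are the function name and the example data.

```python
def status_switches(control, treatment, threshold=0.01):
    """Return proteins whose relative abundance crosses `threshold`
    between the control and treatment samples.

    control, treatment: mappings of protein identifier -> count
    threshold: relative abundance at or above which a protein is called
               "common" (a hypothetical cutoff, to be chosen by the analyst)
    """
    proteins = set(control) | set(treatment)

    def relative(counts):
        total = sum(counts.values())
        if total == 0:
            raise ValueError("each sample needs at least one positive count")
        return {p: counts.get(p, 0) / total for p in proteins}

    before, after = relative(control), relative(treatment)
    switches = {}
    for protein in sorted(proteins):
        was_common = before[protein] >= threshold
        is_common = after[protein] >= threshold
        if was_common != is_common:
            switches[protein] = "rare -> common" if is_common else "common -> rare"
    return switches


# Hypothetical counts from a control web and a predator-removal web.
control = {"A": 90, "B": 10, "C": 0}
removal = {"A": 50, "B": 0, "C": 50}
print(status_switches(control, removal, threshold=0.05))
```

A single pair of samples cannot separate a treatment effect from natural variability among replicates, so a screen like this would be applied across replicated manipulations and interpreted together with changes in species abundance.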
Moreover, as recombinant DNA technology and protein expression systems become widely available to ecologists, entirely new avenues of ecological experiments will become possible, at least for secreted proteins. Rather than manipulating species or trophic groups, ecologists will be able to add synthesized proteins directly to experimental mesocosms. There is a long tradition of experimentally (or unintentionally) enriching food webs with nitrogen, phosphorus, and other critical nutrients that limit the growth of phytoplankton, algae, and terrestrial plants [59,60]. This idea can be extended to enriching ecosystems with potentially important proteins and metabolites, and then measuring food-web responses. Measuring the ecological proteome in response to species removals and additions, and measuring the biodiversity response to the addition of synthetic proteins, should lead ecologists to understand the functions of proteins in ecosystem organization.
Of course, these kinds of experiments will generate new statistical challenges. For example, if 500 proteins are surveyed in an ecological field experiment with a control and a predator removal treatment, a traditional frequentist analysis would potentially generate 500 t-tests and associated P-values. Even if there are no effects of the top predator on the ecological proteome, at least some of these comparisons will be statistically significant by chance alone. However, promising new analyses based on empirical Bayesian and other approaches have been effective in screening large data sets such as those generated by proteomic surveys, microarrays and species occurrence matrices.
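To make the scale of the problem concrete: with 500 tests at a significance level of 0.05 and no true effects, about 500 × 0.05 = 25 comparisons are expected to be significant by chance. The sketch below applies the Benjamini-Hochberg false discovery rate procedure, a standard frequentist correction used here as a simpler stand-in; it is not the empirical Bayesian approach cited above. The function name and the example P-values are illustrative assumptions.

```python
def benjamini_hochberg(p_values, fdr=0.05):
    """Return a list of booleans marking which tests are declared
    significant while controlling the false discovery rate at `fdr`."""
    m = len(p_values)
    # Indices of the tests, ordered from smallest to largest P-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k whose P-value satisfies p_(k) <= (k / m) * fdr.
    cutoff_rank = 0
    for rank, index in enumerate(order, start=1):
        if p_values[index] <= rank / m * fdr:
            cutoff_rank = rank
    # Every test ranked at or below the cutoff is declared significant.
    significant = [False] * m
    for rank, index in enumerate(order, start=1):
        if rank <= cutoff_rank:
            significant[index] = True
    return significant


# Expected number of chance "discoveries" without any correction.
print(500 * 0.05)

# Hypothetical P-values from ten protein-level comparisons.
p = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
print(benjamini_hochberg(p))
```

In the ten-test example, five P-values fall below 0.05 but only the two smallest survive the correction, which illustrates how many nominally significant comparisons a screen of this kind can discard.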
Pioneering studies of environmental proteomics have recently revealed the functional significance of proteins in simple biofilm communities. However, it is not clear how well the results can be generalized to more typical communities of interacting microbes and macrobes in structurally complex environments. Technological advances and improvements in bioinformatics and data archiving will allow for proteomic characterization of a variety of terrestrial and aquatic habitats. However, protein diversity is vast, and there will always be undetected rare proteins in biological material. Rather than trying to exhaustively sample this rare tail of protein diversity, ecologists should concentrate on measuring more common proteins in replicated samples, and quantifying the variability inherent in natural food webs.
If the relative abundances of different protein types from a biological sample can be estimated, the resulting data are remarkably similar in form to counts of individuals and species that already are familiar to ecologists who analyze patterns of biodiversity. For such analyses, a statistical toolbox of rarefaction methods for interpolation and asymptotic estimators for extrapolation can be applied to summarize protein diversity data effectively and will allow ecologists to standardize samples for meaningful comparisons.
However, as the history of food web studies has illustrated, statistical methods alone usually cannot reveal important functional roles. Therefore, ecologists should begin using proteomics in combination with traditional experimental field manipulations, such as nutrient enrichment, species additions and removals, and the modification of habitat complexity. In such experiments, proteins that change status from “common” to “rare” or “rare” to “common” are good candidates for functional importance to food web structure, although in some cases they might simply mirror changes in the relative abundance of constituent species. More innovative experiments would involve the addition to food webs of synthetic proteins first identified by proteomic analysis.
Although there are still serious methodological challenges to extracting proteins from complex media such as soil and seawater, proteomics will eventually become an important tool for community and ecosystem ecologists. The analysis of environmental proteomic data, in combination with statistical biodiversity methods and experimental field manipulations, has the potential to greatly increase our understanding of the role of proteins in the organization and function of food webs and ecosystems.
We thank Rachel Brooks for generating the data in Figure 1. N.J.G. was supported by the U.S. National Science Foundation (DEB-136703 and DEB-0541936) and the U.S. Department of Energy (DE-FG02-08ER64510), and a University of Vermont Research Award to N.J.G. and B.A.B. A.M.E. was supported by grants from the U.S. National Science Foundation (DEB-136703, DEB-0541680, DEB 06-20443, and DBI-1003938), and the U.S. Department of Energy (DE-FG02-08ER64510). B.A.B. was also supported by the U.S. National Science Foundation (DEB-136703) and the Vermont Genetics Network through U.S. National Institutes of Health grant P20 RR16462 from the INBRE Program of the NCRR.