1.  MOPED 2.5—An Integrated Multi-Omics Resource: Multi-Omics Profiling Expression Database Now Includes Transcriptomics Data 
Multi-omics data-driven scientific discovery crucially rests on high-throughput technologies and data sharing. Currently, data are scattered across single omics repositories, stored in varying raw and processed formats, and are often accompanied by limited or no metadata. The Multi-Omics Profiling Expression Database (MOPED), version 2.5, is a freely accessible multi-omics expression database. Continual improvement and expansion of MOPED is driven by feedback from the Life Sciences Community. In order to meet the emergent need for an integrated multi-omics data resource, MOPED 2.5 now includes gene relative expression data in addition to protein absolute and relative expression data from over 250 large-scale experiments. To facilitate accurate integration of experiments and increase reproducibility, MOPED provides extensive metadata through the Data-Enabled Life Sciences Alliance (DELSA Global) metadata checklist. MOPED 2.5 has greatly increased the number of proteomics absolute and relative expression records to over 500,000, in addition to adding more than four million transcriptomics relative expression records. MOPED has an intuitive user interface with tabs for querying different types of omics expression data and new tools for data visualization. Summary information including expression data, pathway mappings, and direct connection between proteins and genes can be viewed on Protein and Gene Details pages. These connections in MOPED provide a context for multi-omics expression data exploration. Researchers are encouraged to submit omics data which will be consistently processed into expression summaries. MOPED as a multi-omics data resource is a pivotal public database, interdisciplinary knowledge resource, and platform for multi-omics understanding.
PMCID: PMC4048574  PMID: 24910945
2.  MOPED enables discoveries through consistently processed proteomics data 
Journal of proteome research  2013;13(1):107-113.
The Model Organism Protein Expression Database (MOPED) is an expanding proteomics resource to enable biological and biomedical discoveries. MOPED aggregates simple, standardized and consistently processed summaries of protein expression and metadata from proteomics (mass spectrometry) experiments from human and model organisms (mouse, worm and yeast). The latest version of MOPED adds new estimates of protein abundance and concentration, as well as relative (differential) expression data. MOPED provides a new updated query interface that allows users to explore information by organism, tissue, localization, condition, experiment, or keyword. MOPED supports the Human Proteome Project’s efforts to generate chromosome- and disease-specific proteomes by providing links from proteins to chromosome and disease information, as well as many complementary resources. MOPED supports a new omics metadata checklist in order to harmonize data integration, analysis and use. MOPED’s development is driven by the user community, which spans 90 countries and guides future development that will transform MOPED into a multi-omics resource. MOPED encourages users to submit data in a simple format. They can use the metadata checklist to generate a data publication for this submission. As a result, MOPED will provide even greater insights into complex biological processes and systems and enable deeper and more comprehensive biological and biomedical discoveries.
PMCID: PMC4039175  PMID: 24350770
3.  Toward More Transparent and Reproducible Omics Studies Through a Common Metadata Checklist and Data Publications 
Biological processes are fundamentally driven by complex interactions between biomolecules. Integrated high-throughput omics studies enable multifaceted views of cells, organisms, or their communities. With the advent of new post-genomics technologies, omics studies are becoming increasingly prevalent; yet the full impact of these studies can only be realized through data harmonization, sharing, meta-analysis, and integrated research. These essential steps require consistent generation, capture, and distribution of metadata. To ensure transparency, facilitate data harmonization, and maximize reproducibility and usability of life sciences studies, we propose a simple common omics metadata checklist. The proposed checklist is built on the rich ontologies and standards already in use by the life sciences community. The checklist will serve as a common denominator to guide experimental design, capture important parameters, and be used as a standard format for stand-alone data publications. The omics metadata checklist and data publications will create efficient linkages between omics data and knowledge-based life sciences innovation and, importantly, allow for appropriate attribution to data generators and infrastructure science builders in the post-genomics era. We ask that the life sciences community test the proposed omics metadata checklist and data publications and provide feedback for their use and improvement.
PMCID: PMC3903324  PMID: 24456465
4.  Beyond protein expression, MOPED goes multi-omics 
Nucleic Acids Research  2014;43(Database issue):D1145-D1151.
MOPED (Multi-Omics Profiling Expression Database) has transitioned from solely a protein expression database to a multi-omics resource for human and model organisms. Through a web-based interface, MOPED presents consistently processed data for gene, protein and pathway expression. To improve data quality, consistency and use, MOPED includes metadata detailing experimental design and analysis methods. The multi-omics data are integrated through direct links between genes and proteins and further connected to pathways and experiments. MOPED now contains over 5 million records, information for approximately 75 000 genes and 50 000 proteins from four organisms (human, mouse, worm, yeast). These records correspond to 670 unique combinations of experiment, condition, localization and tissue. MOPED includes the following new features: pathway expression, Pathway Details pages, experimental metadata checklists, experiment summary statistics and more advanced searching tools. Advanced searching enables querying for genes, proteins, experiments, pathways and keywords of interest. The system is enhanced with visualizations for comparing across different data types. In the future MOPED will expand the number of organisms, increase integration with pathways and provide connections to disease.
PMCID: PMC4383969  PMID: 25404128
5.  Optimizing high performance computing workflow for protein functional annotation 
Functional annotation of newly sequenced genomes is one of the major challenges in modern biology. With modern sequencing technologies, the protein sequence universe is rapidly expanding. Newly sequenced bacterial genomes alone contain over 7.5 million proteins. The rate of data generation has far surpassed that of protein annotation. The volume of protein data makes manual curation infeasible, whereas a high compute cost limits the utility of existing automated approaches. In this work, we present an improved and optimized automated workflow to enable large-scale protein annotation. The workflow uses high performance computing architectures and a low complexity classification algorithm to assign proteins into existing clusters of orthologous groups of proteins. On the basis of the Position-Specific Iterative Basic Local Alignment Search Tool, the algorithm ensures at least 80% specificity and sensitivity of the resulting classifications. The workflow utilizes highly scalable parallel applications for classification and sequence alignment. Using Extreme Science and Engineering Discovery Environment supercomputers, the workflow processed 1,200,000 newly sequenced bacterial proteins. With the rapid expansion of the protein sequence universe, the proposed workflow will enable scientists to annotate big genome data.
PMCID: PMC4194055  PMID: 25313296
science gateways; petascale; data-enabled life sciences; sequence similarity; computational bioinformatics; protein annotation; protein sequence universe; COG; BLAST; PSI-BLAST; HSPp-BLAST; XSEDE; PS
6.  Modeling sequence and function similarity between proteins for protein functional annotation 
A common task in biological research is to predict function for proteins by comparing sequences between proteins of known and unknown function. This is often done using pair-wise sequence alignment algorithms (e.g. BLAST). A problem with this approach is the assumption of a simple equivalence between a minimum sequence similarity threshold and the function similarity between proteins. This assumption is based on the binary concept of homology, in that proteins either are or are not homologous. The relationship between sequence and function, however, is more complex as well as pertinent for predicting protein function, e.g. evaluating BLAST alignments or developing training sets for profile models based on functional rather than homologous groupings. Our motivation for this study was to model sequence and function similarity between proteins to gain insights into the “sequence-function similarity” relationship between proteins for predicting function. Using our model we found that function similarity generally increases with sequence similarity but with a high degree of variability. This result has implications for pair-wise approaches in that it appears sequence similarity must be very high to ensure high function similarity. Profile models, which enable higher sensitivity, are a potential solution. However, multiple sequence alignments (a necessary prerequisite) are a problem in that current algorithms have difficulty aligning sequences with very low sequence similarity, which is common in our data set, or are intractable for high numbers of sequences. Given the importance of predicting protein function and the need for multiple sequence alignments, algorithms for accomplishing this task should be further refined and developed.
PMCID: PMC4120521  PMID: 25101328
Experimentation; Biostatistics; Bioinformatics; Multiple Sequence Alignment
7.  Integrative Analysis of Longitudinal Metabolomics Data from a Personal Multi-Omics Profile  
Metabolites  2013;3(3):741-760.
The integrative personal omics profile (iPOP) is a pioneering study that combines genomics, transcriptomics, proteomics, metabolomics and autoantibody profiles from a single individual over a 14-month period. The observation period includes two episodes of viral infection: a human rhinovirus and a respiratory syncytial virus. The profile studies give an informative snapshot into the biological functioning of an organism. We hypothesize that pathway expression levels are associated with disease status. To test this hypothesis, we use biological pathways to integrate metabolomics and proteomics iPOP data. The approach computes the pathways’ differential expression levels at each time point, while taking into account the pathway structure and the longitudinal design. The resulting pathway levels show strong association with the disease status. Further, we identify temporal patterns in metabolite expression levels. The changes in metabolite expression levels also appear to be consistent with the disease status. The results of the integrative analysis suggest that changes in biological pathways may be used to predict and monitor the disease. The iPOP experimental design, data acquisition and analysis issues are discussed within the broader context of personal profiling.
PMCID: PMC3901289  PMID: 24958148
metabolomics; integrative pathway analysis; DEAP; dendrogram sharpening; DELSA; iPOP; longitudinal design; multi-omics data; single linkage
8.  Correction: Differential Expression Analysis for Pathways 
PLoS Computational Biology  2013;9(4):10.1371/annotation/58cf4d21-f9b0-4292-94dd-3177f393a284.
PMCID: PMC3648644
9.  Differential Expression Analysis for Pathways 
PLoS Computational Biology  2013;9(3):e1002967.
Life science technologies generate a deluge of data that hold the keys to unlocking the secrets of important biological functions and disease mechanisms. We present DEAP, Differential Expression Analysis for Pathways, which capitalizes on information about biological pathways to identify important regulatory patterns from differential expression data. DEAP makes significant improvements over existing approaches by including information about pathway structure and discovering the most differentially expressed portion of the pathway. On simulated data, DEAP significantly outperformed traditional methods: with high differential expression, DEAP increased power by two orders of magnitude; with very low differential expression, DEAP doubled the power. DEAP performance was illustrated on two different gene and protein expression studies. DEAP discovered fourteen important pathways related to chronic obstructive pulmonary disease and interferon treatment that existing approaches omitted. On the interferon study, DEAP guided focus towards a four protein path within the 26 protein Notch signalling pathway.
Author Summary
The data deluge represents a growing challenge for life sciences. Within this sea of data surely lie many secrets to understanding important biological and medical systems. To quantify important patterns in this data, we present DEAP (Differential Expression Analysis for Pathways). DEAP amalgamates information about biological pathway structure and differential expression to identify important patterns of regulation. On both simulated and biological data, we show that DEAP is able to identify key mechanisms while making significant improvements over existing methodologies. For example, on the interferon study, DEAP uniquely identified both the interferon gamma signalling pathway and the JAK STAT signalling pathway.
PMCID: PMC3597535  PMID: 23516350
10.  Temporoparietal hypometabolism is common in FTLD and is associated with imaging diagnostic errors 
Archives of neurology  2010;68(3):329-337.
To evaluate the cause of diagnostic errors in the visual interpretation of positron emission tomography scans with 18F-fluorodeoxyglucose (FDG-PET) in patients with frontotemporal lobar degeneration (FTLD) and Alzheimer's disease (AD).
Twelve trained raters unaware of clinical and autopsy information independently reviewed FDG-PET scans and provided their diagnostic impression and confidence of either FTLD or AD. Six of these raters also recorded whether metabolism appeared normal or abnormal in 5 predefined brain regions in each hemisphere – frontal cortex, anterior cingulate cortex, anterior temporal cortex, temporoparietal cortex and posterior cingulate cortex. Results were compared to neuropathological diagnoses.
Academic medical centers
45 patients with pathologically confirmed FTLD (n=14) or AD (n=31)
Raters had a high degree of diagnostic accuracy in the interpretation of FDG-PET scans; however, raters consistently found some scans more difficult to interpret than others. Unanimity of diagnosis among the raters was more frequent in patients with AD (27/31, 87%) than in patients with FTLD (7/14, 50%) (p = 0.02). Disagreements in interpretation of scans in patients with FTLD largely occurred when there was temporoparietal hypometabolism, which was present in 7 of the 14 FTLD scans and 6 of the 7 lacking unanimity. Hypometabolism of anterior cingulate and anterior temporal regions had higher specificities and positive likelihood ratios for FTLD than temporoparietal hypometabolism had for AD.
Temporoparietal hypometabolism in FTLD is common and may cause inaccurate interpretation of FDG-PET scans. An interpretation paradigm that focuses on the absence of hypometabolism in regions typically affected in AD before considering FTLD is likely to misclassify a significant portion of FTLD scans. Anterior cingulate and/or anterior temporal hypometabolism indicates a high likelihood of FTLD, even when temporoparietal hypometabolism is present. Ultimately, the accurate interpretation of FDG-PET scans in patients with dementia cannot rest on the presence or absence of a single region of hypometabolism, but must take into account the relative hypometabolism of all brain regions.
PMCID: PMC3058918  PMID: 21059987
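The unanimity comparison reported in this abstract (27/31 AD scans versus 7/14 FTLD scans) can be checked with a small worked example. The abstract does not say which test produced p = 0.02; the sketch below assumes a two-sided Fisher exact test on the 2x2 table of unanimous versus non-unanimous scans, and the function name is illustrative only.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(k):
        # Hypergeometric probability of seeing k in the top-left cell
        return comb(row1, k) * comb(row2, col1 - k) / denom

    lo, hi = max(0, col1 - row2), min(row1, col1)
    p_obs = prob(a)
    # Sum over every table that is at least as unlikely as the observed one
    return sum(p for p in map(prob, range(lo, hi + 1)) if p <= p_obs * (1 + 1e-9))

# Rows: AD, FTLD. Columns: unanimous, not unanimous (counts from the abstract).
p_value = fisher_exact_two_sided(27, 4, 7, 7)
print(round(p_value, 3))
```

Under this assumed test the result is consistent with the reported significance level; a different test (for example chi-square) would give a somewhat different value.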
11.  Design and Initial Characterization of the SC-200 Proteomics Standard Mixture 
High-throughput (HTP) proteomics studies generate large amounts of data. Interpretation of these data requires effective approaches to distinguish noise from biological signal, particularly as instrument and computational capacity increase and studies become more complex. Resolving this issue requires validated and reproducible methods and models, which in turn require complex experimental and computational standards. The absence of appropriate standards and data sets for validating experimental and computational workflows hinders the development of HTP proteomics methods. Most protein standards are simple mixtures of proteins or peptides, or undercharacterized reference standards in which the identity and concentration of the constituent proteins are unknown. The Seattle Children's 200 (SC-200) proposed proteomics standard mixture is the next step toward developing realistic, fully characterized HTP proteomics standards. The SC-200 exhibits a unique modular design to extend its functionality, and consists of 200 proteins of known identities and molar concentrations from 6 microbial genomes, distributed into 10 molar concentration tiers spanning a 1,000-fold range. We describe the SC-200's design, potential uses, and initial characterization. We identified 84% of SC-200 proteins with an LTQ-Orbitrap and 65% with an LTQ-Velos (false discovery rate = 1% for both). There were obvious trends in success rate, sequence coverage, and spectral counts with protein concentration; however, protein identification, sequence coverage, and spectral counts vary greatly within concentration levels.
PMCID: PMC3110723  PMID: 21250827
12.  The necessity of adjusting tests of protein category enrichment in discovery proteomics 
Bioinformatics  2010;26(24):3007-3011.
Motivation: Enrichment tests are used in high-throughput experimentation to measure the association between gene or protein expression and membership in groups or pathways. The Fisher's exact test is commonly used. We specifically examined the associations produced by the Fisher test between protein identification by mass spectrometry discovery proteomics, and their Gene Ontology (GO) term assignments in a large yeast dataset. We found that direct application of the Fisher test is misleading in proteomics due to the bias in mass spectrometry to preferentially identify proteins based on their biochemical properties. False inference about associations can be made if this bias is not corrected. Our method adjusts Fisher tests for these biases and produces associations more directly attributable to protein expression rather than experimental bias.
Results: Using logistic regression, we modeled the association between protein identification and GO term assignments while adjusting for identification bias in mass spectrometry. The model accounts for five biochemical properties of peptides: (i) hydrophobicity, (ii) molecular weight, (iii) transfer energy, (iv) beta turn frequency and (v) isoelectric point. The model was fit on 181 060 peptides from 2678 proteins identified in 24 yeast proteomics datasets with a 1% false discovery rate. In analyzing the association between protein identification and their GO term assignments, we found that 25% (134 out of 544) of Fisher tests that showed significant association (q-value ≤0.05) were non-significant after adjustment using our model. Simulations generating yeast protein sets enriched for identification propensity show that unadjusted enrichment tests were biased while our approach worked well.
Supplementary information: Supplementary data are available at Bioinformatics online.
PMCID: PMC2995116  PMID: 21068002
13.  Validation of Consensus Panel Diagnosis in Dementia 
Archives of neurology  2010;67(12):1506-1512.
The clinical diagnosis of dementing diseases largely depends upon the subjective interpretation of patient symptoms. Consensus panels are frequently used in research to determine diagnoses when definitive pathological findings are unavailable. Nevertheless, research on group decision-making indicates many factors can adversely influence panel performance.
To determine conditions that improve consensus panel diagnosis.
Comparison of neuropathological diagnoses with individual and consensus panel diagnoses based on clinical summaries, FDG-PET scans, and summaries with scans.
Expert and trainee individual and consensus panel deliberations using a modified Delphi method in a pilot research study of the diagnostic utility of FDG-PET imaging.
Patients and Methods
Forty-five patients with pathologically confirmed Alzheimer’s disease or frontotemporal dementia. Statistical measures of diagnostic accuracy, agreement, and confidence for individual raters and panelists before and after consensus deliberations.
The consensus protocol using trainees and experts surpassed the accuracy of individual expert diagnoses when clinical information elicited diverse judgments. In these situations, consensus was 3.5 times more likely to produce positive rather than negative changes in the accuracy and diagnostic certainty of individual panelists. A rule that forced group consensus was at least as accurate as majority and unanimity rules.
Using a modified Delphi protocol to arrive at a consensus diagnosis is a reasonable substitute for pathologic information. This protocol improves diagnostic accuracy and certainty when panelist judgments differ and is easily adapted to other research and clinical settings while avoiding potential pitfalls of group decision-making.
PMCID: PMC3178413  PMID: 21149812
14.  MOPED: Model Organism Protein Expression Database 
Nucleic Acids Research  2011;40(Database issue):D1093-D1099.
Large numbers of mass spectrometry proteomics studies are being conducted to understand all types of biological processes. The size and complexity of proteomics data hinder efforts to easily share, integrate, query and compare the studies. The Model Organism Protein Expression Database (MOPED) is a new and expanding proteomics resource that enables rapid browsing of protein expression information from publicly available studies on humans and model organisms. MOPED is designed to simplify the comparison and sharing of proteomics data for the greater research community. MOPED uniquely provides protein level expression data, meta-analysis capabilities and quantitative data from standardized analysis. Data can be queried for specific proteins, browsed based on organism, tissue, localization and condition and sorted by false discovery rate and expression. MOPED empowers users to visualize their own expression data and compare it with existing studies. Further, MOPED links to various protein and pathway databases, including GeneCards, Entrez, UniProt, KEGG and Reactome. The current version of MOPED contains over 43 000 proteins with at least one spectral match and more than 11 million high certainty spectra.
PMCID: PMC3245040  PMID: 22139914
16.  Meta-analysis for Protein Identification: A Case Study on Yeast Data 
Large amounts of mass spectrometry (MS) proteomics data are now publicly available; however, little attention has been given to how to best combine these data and assess the error rates for protein identification. The objective of this article is to show how variation in the type and amount of data included with each study impacts coverage of the yeast proteome and estimation of the false discovery rate (FDR). Our analysis of a subset of the publicly available yeast data showed that failure to reevaluate the FDR when combining protein IDs from different experiments resulted in an underestimation of the FDR by approximately threefold. A worst-case approximation of the FDR was only slightly larger than estimating the FDR by randomized database matches. The use of a weighted model to emphasize the most informative experimental data provided an increase in the number of IDs at a 1% FDR when compared to other meta-analysis approaches. Also, using an FDR higher than 1% results in a very high rate of false discoveries for IDs above the 1% threshold. Ideally, raw MS data will be made publicly available for complete and consistent reanalysis. In the circumstance that raw data is not available, determining a combined FDR on the basis of the worst-case estimation provides a reasonable approximation of the FDR. When combining experimental results, adding additional experiments results in diminishing and in some cases negative returns on protein identifications. It may be beneficial to include only those experiments generating the most unique identifications due to solid experimental design and sensitive instrumentation.
PMCID: PMC3133781  PMID: 20569183
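A small sketch may help show why reusing a per-experiment FDR underestimates the FDR of a combined list. The abstract does not define its worst-case estimate; the version below is an assumption in which no false identification is shared between experiments, so expected false IDs add up while true IDs overlap. The function name and the toy ID sets are hypothetical.

```python
def naive_and_worst_case_fdr(experiments):
    """experiments: list of (set_of_protein_ids, fdr) pairs.

    Returns (naive_fdr, worst_case_fdr) for the union of all IDs.
    """
    union = set().union(*(ids for ids, _ in experiments))
    # Naive: reuse the largest per-experiment FDR for the combined list
    naive = max(fdr for _, fdr in experiments)
    # Assumed worst case: false IDs never coincide across experiments,
    # so their expected counts simply add up
    expected_false = sum(len(ids) * fdr for ids, fdr in experiments)
    worst = min(1.0, expected_false / len(union))
    return naive, worst

# Three hypothetical experiments, 1000 IDs each at 1% FDR, heavily overlapping
toy = [
    (set(range(0, 1000)), 0.01),
    (set(range(100, 1100)), 0.01),
    (set(range(200, 1200)), 0.01),
]
naive, worst = naive_and_worst_case_fdr(toy)
print(naive, worst)
```

In this toy case the union holds 1200 IDs but about 30 expected false ones, so the combined estimate is well above the 1% that each experiment reports on its own.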
17.  Interplay of heritage and habitat in the distribution of bacterial signal transduction systems 
Molecular bioSystems  2010;6(4):721-728.
Comparative analysis of the complete genome sequences from a variety of poorly studied organisms aims to predict ecological and behavioral properties of these organisms and to help characterize their habitats. This task requires finding appropriate descriptors that could be correlated with the core traits of each system and would allow meaningful comparisons. Using the relatively simple bacterial models, first attempts have been made to introduce suitable metrics to describe the complexity of an organism’s signaling machinery, which included introducing the “bacterial IQ” score. Here, we use an updated census of prokaryotic signal transduction systems to improve this parameter and evaluate its consistency within selected bacterial phyla. We also introduce a more elaborate descriptor, a set of profiles of relative abundance of members of each family of signal transduction proteins encoded in each genome. We show that these family profiles are well conserved within each genus and are often consistent within families of bacteria. Thus, they reflect evolutionary relationships between organisms as well as individual adaptations of each organism to its specific ecological niche.
PMCID: PMC3071642  PMID: 20237650
comparative genomics; evolution; protein phosphorylation; receptor; Mycobacterium; Shewanella
18.  The United States of America and Scientific Research 
PLoS ONE  2010;5(8):e12203.
To gauge the current commitment to scientific research in the United States of America (US), we compared federal research funding (FRF) with the US gross domestic product (GDP) and industry research spending during the past six decades. In order to address the recent globalization of scientific research, we also focused on four key indicators of research activities: research and development (R&D) funding, total science and engineering doctoral degrees, patents, and scientific publications. We compared these indicators across three major population and economic regions: the US, the European Union (EU) and the People's Republic of China (China) over the past decade. We discovered a number of interesting trends with direct relevance for science policy. The level of US FRF has varied between 0.2% and 0.6% of the GDP during the last six decades. Since the 1960s, the US FRF contribution has fallen from twice that of industrial research funding to roughly equal. Also, in the last two decades, the portion of the US government R&D spending devoted to research has increased. Although well below the US and the EU in overall funding, the current growth rate for R&D funding in China greatly exceeds that of both. Finally, the EU currently produces more science and engineering doctoral graduates and scientific publications than the US in absolute terms, but not per capita. This study's aim is to facilitate a serious discussion of key questions by the research community and federal policy makers. In particular, our results raise two questions with respect to: a) the increasing globalization of science: “What role is the US playing now, and what role will it play in the future of international science?”; and b) the ability to produce beneficial innovations for society: “How will the US continue to foster its strengths?”
PMCID: PMC2922381  PMID: 20808949
19.  Neuropathology of Nondemented Aging: Presumptive Evidence for Preclinical Alzheimer Disease 
Neurobiology of aging  2009;30(7):1026-1036.
To determine the frequency and possible cognitive effect of histological Alzheimer’s disease (AD) in autopsied older nondemented individuals.
Senile plaques (SPs) and neurofibrillary tangles (NFTs) were assessed quantitatively in 97 cases from 7 Alzheimer’s Disease Centers (ADCs). Neuropathological diagnoses of AD (npAD) were also made with four sets of criteria. Adjusted linear mixed models tested differences between participants with and without npAD on the quantitative neuropathology measures and psychometric test scores prior to death. Spearman rank-order correlations between AD lesions and psychometric scores at last assessment were calculated for cases with pathology in particular regions.
Washington University Alzheimer’s Disease Research Center.
Ninety-seven nondemented participants who were age 60 years or older at death (mean = 84 years).
About 40% of nondemented individuals met at least some level of criteria for npAD; when strict criteria were used, about 20% of cases had npAD. Substantial overlap of Braak neurofibrillary stages occurred between npAD and no-npAD cases. Although there was no measurable cognitive impairment prior to death for either the no-npAD or npAD groups, cognitive function in nondemented aging appears to be degraded by the presence of NFTs and SPs.
Neuropathological processes related to AD in persons without dementia appear to be associated with subtle cognitive dysfunction and may represent a preclinical stage of the illness. By age 80–85 years, many nondemented older adults have substantial AD pathology.
PMCID: PMC2737680  PMID: 19376612
preclinical Alzheimer’s disease; nondemented aging; neuropathological Alzheimer’s disease
20.  Quantifying Protein Function Specificity in the Gene Ontology 
Standards in Genomic Sciences  2010;2(2):238-244.
Quantitative or numerical metrics of protein function specificity made possible by the Gene Ontology are useful in that they enable development of distance or similarity measures between protein functions. Here we describe how to calculate four measures of function specificity for GO terms: 1) number of ancestor terms; 2) number of offspring terms; 3) proportion of terms; and 4) Information Content (IC). We discuss the relationship between the metrics and the strengths and weaknesses of each.
PMCID: PMC3035283  PMID: 21304708
protein annotation; protein function; function specificity
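The four specificity measures named in the abstract above can be illustrated on a toy ontology. This is only a minimal sketch: the ontology, the annotation counts, and the exact definitions of "proportion of terms" and Information Content used here (share of terms covered by a term and its offspring, and the base-2 log of the inverse annotation frequency) are assumptions for illustration, not necessarily the definitions used in the paper.

```python
import math

# Hypothetical toy ontology (not real GO): term -> set of direct parents.
PARENTS = {
    "root": set(),
    "binding": {"root"},
    "catalysis": {"root"},
    "ion_binding": {"binding"},
    "kinase": {"catalysis", "binding"},
}

# Hypothetical number of proteins annotated directly to each term.
DIRECT_COUNTS = {"root": 0, "binding": 2, "catalysis": 2, "ion_binding": 3, "kinase": 1}


def ancestors(term, parents=PARENTS):
    """Measure 1: all terms reachable by following parent links."""
    seen = set()
    stack = list(parents[term])
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents[current])
    return seen


def offspring(term, parents=PARENTS):
    """Measure 2: all terms that have `term` among their ancestors."""
    return {t for t in parents if term in ancestors(t, parents)}


def proportion_of_terms(term, parents=PARENTS):
    """Measure 3 (assumed definition): share of all terms that are
    the term itself or one of its offspring."""
    return (len(offspring(term, parents)) + 1) / len(parents)


def information_content(term, counts=DIRECT_COUNTS, parents=PARENTS):
    """Measure 4 (assumed definition): log2(total / covered), where `covered`
    counts annotations to the term or any offspring. Assumes covered > 0."""
    total = sum(counts.values())
    covered = counts[term] + sum(counts[t] for t in offspring(term, parents))
    return math.log2(total / covered)


for term in ("root", "binding", "kinase"):
    print(term, len(ancestors(term)), len(offspring(term)),
          proportion_of_terms(term), round(information_content(term), 3))
```

In this toy example, more specific terms have more ancestors, fewer offspring, a smaller proportion of terms, and a higher Information Content, which is the kind of relationship between the metrics that the abstract says the paper discusses.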
21.  A Statistical Model of Protein Sequence Similarity and Function Similarity Reveals Overly-Specific Function Predictions 
PLoS ONE  2009;4(10):e7546.
Predicting protein function from primary sequence is an important open problem in modern biology. Not only are there many thousands of proteins of unknown function, but current approaches for predicting function must also be improved. One problem in particular is overly-specific function predictions, which we address here with a new statistical model of the relationship between protein sequence similarity and protein function similarity.
Our statistical model is based on sets of proteins with experimentally validated functions and numeric measures of function specificity and function similarity derived from the Gene Ontology. The model predicts the similarity of function between two proteins given their amino acid sequence similarity measured by statistics from the BLAST sequence alignment algorithm. A novel aspect of our model is that it predicts the degree of function similarity shared between two proteins over a continuous range of sequence similarity, facilitating prediction of function with an appropriate level of specificity.
Our model shows nearly exact function similarity for proteins with high sequence similarity (bit score >244.7, e-value <1e−62, non-redundant NCBI protein database (NRDB)) and only a small likelihood of a specific function match for proteins with low sequence similarity (bit score <54.6, e-value >1e−05, NRDB). For sequence similarity ranges in between, our annotation model shows an increasing relationship between function similarity and sequence similarity, but with considerable variability. We applied the model to a large set of proteins of unknown function and predicted functions for thousands of these proteins, ranging from general to very specific. We also applied the model to a data set of proteins with previously assigned, specific functions that were electronically based. We show that, on average, these prior function predictions are more specific (quite possibly overly-specific) than predictions from our model, which is based on proteins with experimentally determined function.
PMCID: PMC2760442  PMID: 19844580
22.  Validating Annotations for Uncharacterized Proteins in Shewanella oneidensis 
Proteins of unknown function are a barrier to our understanding of molecular biology. Assigning function to these “uncharacterized” proteins is imperative, but challenging. The usual approach is to run similarity searches against annotation databases, which are useful for predicting function. However, since the performance of these databases on uncharacterized proteins is largely unknown, the accuracy of their predictions is suspect, making annotation difficult. To address this challenge, we developed a benchmark annotation dataset of 30 proteins in Shewanella oneidensis. The proteins in the dataset were originally uncharacterized after the initial annotation of the S. oneidensis proteome in 2002. In the intervening 5 years, the accumulation of new experimental evidence has enabled specific functions to be predicted. We used this benchmark dataset to evaluate several commonly used annotation databases. According to our criteria, six annotation databases accurately predicted functions for at least 60% of proteins in our dataset. Two of these six even had a “conditional accuracy” of 90%. Conditional accuracy is another evaluation metric we developed; it excludes results from databases where no function was predicted. Also, the functions of 27 of the 30 proteins were correctly predicted by at least one database. This represents one of the first performance evaluations of annotation databases on uncharacterized proteins. Our evaluation indicates that these databases readily incorporate new information and are accurate in predicting functions for uncharacterized proteins, provided that experimental function evidence exists.
PMCID: PMC3189009  PMID: 18687039
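The difference between plain accuracy and the "conditional accuracy" described in the abstract above can be sketched as follows. The function names, the outcome labels, and the example counts are hypothetical and chosen only for illustration; they are not data from the study, and the paper's exact scoring criteria are not reproduced here.

```python
def accuracy(outcomes):
    """Fraction of all benchmark proteins whose function was predicted correctly."""
    return sum(o == "correct" for o in outcomes) / len(outcomes)


def conditional_accuracy(outcomes):
    """Fraction correct among proteins for which the database made any prediction.

    Proteins labeled "none" (no function predicted) are excluded, as the
    abstract describes. Returns None if the database predicted nothing at all.
    """
    predicted = [o for o in outcomes if o != "none"]
    if not predicted:
        return None
    return sum(o == "correct" for o in predicted) / len(predicted)


# Made-up outcomes for one hypothetical database on a 30-protein benchmark:
# 18 correct predictions, 2 incorrect predictions, 10 proteins with no prediction.
example = ["correct"] * 18 + ["incorrect"] * 2 + ["none"] * 10

print("accuracy:", accuracy(example))
print("conditional accuracy:", conditional_accuracy(example))
```

A database that declines to predict for many proteins can therefore score modestly on accuracy while scoring high on conditional accuracy, which is why the two numbers are reported separately.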
23.  Staphylococcus aureus Elicits Marked Alterations in the Airway Proteome during Early Pneumonia
Infection and Immunity  2008;76(12):5862-5872.
Pneumonia caused by Staphylococcus aureus is a growing concern in the health care community. We hypothesized that characterization of the early innate immune response to bacteria in the lungs would provide insight into the mechanisms used by the host to protect itself from infection. An adult mouse model of Staphylococcus aureus pneumonia was utilized to define the early events in the innate immune response and to assess the changes in the airway proteome during the first 6 h of pneumonia. S. aureus actively replicated in the lungs of mice inoculated intranasally under anesthesia to cause significant morbidity and mortality. By 6 h postinoculation, the release of proinflammatory cytokines caused effective recruitment of neutrophils to the airway. Neutrophil influx, loss of alveolar architecture, and consolidated pneumonia were observed histologically 6 h postinoculation. Bronchoalveolar lavage fluids from mice inoculated with phosphate-buffered saline (PBS) or S. aureus were depleted of overabundant proteins and subjected to strong cation exchange fractionation followed by liquid chromatography and tandem mass spectrometry to identify the proteins present in the airway. No significant changes in response to PBS inoculation or 30 min following S. aureus inoculation were observed. However, a dramatic increase in extracellular proteins was observed 6 h postinoculation with S. aureus, with the increase dominated by inflammatory and coagulation proteins. The data presented here provide a comprehensive evaluation of the rapid and vigorous innate immune response mounted in the host airway during the earliest stages of S. aureus pneumonia.
PMCID: PMC2583584  PMID: 18852243
24.  Host Airway Proteins Interact with Staphylococcus aureus during Early Pneumonia
Infection and Immunity  2008;76(3):888-898.
Staphylococcus aureus is a major cause of hospital-acquired pneumonia and is emerging as an important etiological agent of community-acquired pneumonia. Little is known about the specific host-pathogen interactions that occur when S. aureus first enters the airway. A shotgun proteomics approach was utilized to identify the airway proteins associated with S. aureus during the first 6 h of infection. Host proteins eluted from bacteria recovered from the airways of mice 30 min or 6 h following intranasal inoculation under anesthesia were subjected to liquid chromatography and tandem mass spectrometry. A total of 513 host proteins were associated with S. aureus 30 min and/or 6 h postinoculation. A majority of the identified proteins were host cytosolic proteins, suggesting that S. aureus was rapidly internalized by phagocytes in the airway and that significant host cell lysis occurred during early infection. In addition, extracellular matrix and secreted proteins, including fibronectin, antimicrobial peptides, and complement components, were associated with S. aureus at both time points. The interaction of 12 host proteins shown to bind to S. aureus in vitro was demonstrated in vivo for the first time. The association of hemoglobin, which is thought to be the primary staphylococcal iron source during infection, with S. aureus in the airway was validated by immunoblotting. Thus, we used our recently developed S. aureus pneumonia model and shotgun proteomics to validate previous in vitro findings and to identify nearly 500 other proteins that interact with S. aureus in vivo. The data presented here provide novel insights into the host-pathogen interactions that occur when S. aureus enters the airway.
PMCID: PMC2258841  PMID: 18195024
25.  Gene expression correlates of neurofibrillary tangles in Alzheimer’s disease 
Neurobiology of aging  2005;27(10):1359-1371.
Neurofibrillary tangles (NFT) constitute one of the cardinal histopathological features of Alzheimer’s disease (AD). To explore in vivo molecular processes involved in the development of NFTs, we compared gene expression profiles of NFT-bearing entorhinal cortex neurons from 19 AD patients, adjacent non-NFT-bearing entorhinal cortex neurons from the same patients, and non-NFT-bearing entorhinal cortex neurons from 14 non-demented, histopathologically normal controls (ND). Of the differentially expressed genes, 225 showed progressively increased expression (AD NFT neurons > AD non-NFT neurons > ND non-NFT neurons) or progressively decreased expression (AD NFT neurons < AD non-NFT neurons < ND non-NFT neurons), raising the possibility that they may be related to the early stages of NFT formation. Immunohistochemical studies confirmed that many of the implicated proteins are dysregulated and preferentially localized to NFTs, including apolipoprotein J, interleukin-1 receptor-associated kinase 1, tissue inhibitor of metalloproteinase 3, and casein kinase 2, beta. Functional validation studies are underway to determine which candidate genes may be causally related to NFT neuropathology, thus providing therapeutic targets for the treatment of AD.
PMCID: PMC2259291  PMID: 16242812
Alzheimer’s disease; Neurofibrillary tangles; Microarray; Gene expression; Dementia; Neurodegeneration; NFT; Laser capture microdissection
