The Cancer Genome Atlas (TCGA) project is a large-scale effort with the goal of identifying novel molecular aberrations in glioblastoma (GBM).
Here, we describe an in-depth analysis of gene expression data and copy number aberration (CNA) data to classify GBMs into prognostic groups to determine correlates of subtypes that may be biologically significant.
To identify predictive survival models, we searched TCGA data from 173 patients and identified 42 probe sets (P = .0005) that could be used to divide the tumor samples into 3 groups that differed significantly (P = .0006) in overall survival. Kaplan-Meier plots showed that the median survival of group 3 was markedly longer (127 weeks) than that of groups 1 and 2 (47 and 52 weeks, respectively). We then validated the ability of the 42 probe sets to stratify patients according to survival in other public GBM gene expression datasets (eg, GSE4290 dataset). An overall analysis of the gene expression and copy number aberration using a multivariate Cox regression model showed that the 42 probe sets had a significant (P < .018) prognostic value independent of other variables.
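The median survival figures above come from Kaplan-Meier curves. As a minimal sketch of how such a median is read off a curve, the following pure-Python estimator uses hypothetical toy follow-up times (in weeks), not TCGA data; the function names are illustrative assumptions and not part of the study's software.

```python
def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct time with at least one death.

    times  -- follow-up durations (e.g. weeks)
    events -- 1 if death was observed at that time, 0 if censored
    """
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    surv = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        removed = 0
        # Group all subjects that share the same follow-up time.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            removed += 1
            i += 1
        if deaths:
            surv *= 1 - deaths / n_at_risk
            curve.append((t, surv))
        n_at_risk -= removed
    return curve


def median_survival(curve):
    """First time at which S(t) drops to 0.5 or below; None if never reached."""
    for t, s in curve:
        if s <= 0.5:
            return t
    return None


# Hypothetical toy cohort: six patients, two of them censored.
toy_curve = kaplan_meier([10, 20, 20, 35, 50, 60], [1, 1, 0, 1, 0, 1])
toy_median = median_survival(toy_curve)
```

Comparing the medians of several groups, as done for groups 1 to 3 above, would amount to calling the estimator once per group; significance testing (log-rank, Cox regression) is outside this sketch.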
By integrating multidimensional genomic data from TCGA, we identified a specific survival model in a new prognostic group of GBM and suggest that molecular stratification of patients with GBM into homogeneous subgroups may provide opportunities for the development of new treatment modalities.
comparative genomic hybridization; EMT; gene expression; glioblastoma; prognostic marker; TCGA
Ongoing advancements in cloud computing provide novel opportunities in scientific computing, especially for distributed workflows. Modern web browsers can now be used as high-performance workstations for querying, processing, and visualizing genomics’ “Big Data” from sources like The Cancer Genome Atlas (TCGA) and the International Cancer Genome Consortium (ICGC) without local software installation or configuration. The design of QMachine (QM) was driven by the opportunity to use this pervasive computing model in the context of the Web of Linked Data in Biomedicine.
QM is an open-sourced, publicly available web service that acts as a messaging system for posting tasks and retrieving results over HTTP. The illustrative application described here distributes the analyses of 20 Streptococcus pneumoniae genomes for shared suffixes. Because all analytical and data retrieval tasks are executed by volunteer machines, few server resources are required. Any modern web browser can submit those tasks and/or volunteer to execute them without installing any extra plugins or programs. A client library provides high-level distribution templates including MapReduce. This stark departure from the current reliance on expensive server hardware running “download and install” software has already gathered substantial community interest, as QM received more than 2.2 million API calls from 87 countries in 12 months.
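The MapReduce template mentioned above can be illustrated with a local, single-process stand-in. This sketch is not the QM client library (which runs in web browsers); it assumes, purely for illustration, that "shared suffixes" can be approximated by the set of k-long substrings common to all sequences, and the sequences shown are made up.

```python
from functools import reduce


def kmers(seq, k):
    """Map step: the set of all k-long substrings of one sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def map_reduce(items, mapper, reducer):
    """Apply mapper to every item, then fold the results with reducer.

    In a volunteer-computing setting each mapper call could run on a
    different machine; here everything runs locally.
    """
    return reduce(reducer, map(mapper, items))


# Hypothetical toy sequences standing in for genomes.
genomes = ["ACGTACGT", "TTACGTAA", "GGACGTCC"]
shared = map_reduce(genomes, lambda s: kmers(s, 4), set.intersection)
```

The map step is independent per genome, which is what makes it a natural unit of work to hand to volunteer machines; only the comparatively small intermediate sets need to be brought together in the reduce step.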
QM was found adequate to deliver the sort of scalable bioinformatics solutions that computation- and data-intensive workflows require. Paradoxically, the sandboxed execution of code by web browsers was also found to enable them, as compute nodes, to address critical privacy concerns that characterize biomedical environments.
Next-generation sequencing (NGS) technologies have resulted in petabytes of scattered data, decentralized in archives, databases and sometimes in isolated hard-disks which are inaccessible for browsing and analysis. It is expected that curated secondary databases will help organize some of this Big Data, thereby allowing users to better navigate, search and compute on it.
To address the above challenge, we have implemented a NGS biocuration workflow and are analyzing short read sequences and associated metadata from cancer patients to better understand the human variome. Curation of variation and other related information from control (normal tissue) and case (tumor) samples will provide comprehensive background information that can be used in genomic medicine research and application studies. Our approach includes a CloudBioLinux Virtual Machine which is used upstream of an integrated High-performance Integrated Virtual Environment (HIVE) that encapsulates Curated Short Read archive (CSR) and a proteome-wide variation effect analysis tool (SNVDis). As a proof-of-concept, we have curated and analyzed control and case breast cancer datasets from the NCI cancer genomics program - The Cancer Genome Atlas (TCGA). Our efforts include reviewing and recording in CSR available clinical information on patients, mapping of the reads to the reference followed by identification of non-synonymous Single Nucleotide Variations (nsSNVs) and integrating the data with tools that allow analysis of the effect of nsSNVs on the human proteome. Furthermore, we have also developed a novel phylogenetic analysis algorithm that uses SNV positions and can be used to classify the patient population. The workflow described here lays the foundation for analysis of short read sequence data to identify rare and novel SNVs that are not present in dbSNP and therefore provides a more comprehensive understanding of the human variome. Variation results for single genes as well as the entire study are available from the CSR website (http://hive.biochemistry.gwu.edu/dna.cgi?cmd=csr).
Availability of thousands of sequenced samples from patients provides a rich repository of sequence information that can be utilized to identify individual level SNVs and their effect on the human proteome beyond what the dbSNP database provides.
SRA; TCGA; nsSNV; SNV; SNP; Next-gen; NGS; Phylogenetics; Cancer
Genetics and genomics have radically altered our understanding of breast cancer progression. However, the genomic basis of various histopathologic features of breast cancer is not yet well-defined.
Materials and Methods:
The Cancer Genome Atlas (TCGA) is an international database containing a large collection of human cancer genome sequencing data. cBioPortal is a web tool developed for mining these sequencing data. We performed mining of TCGA sequencing data in an attempt to characterize the genomic features correlated with breast cancer histopathology. We first assessed the quality of the TCGA data using a group of genes with known alterations in various cancers. Both genome-wide gene mutation and copy number changes as well as a group of genes with a high frequency of genetic changes were then correlated with various histopathologic features of invasive breast cancer.
Validation of TCGA data using a group of genes with known alterations in breast cancer suggests that the TCGA has accurately documented the genomic abnormalities of multiple malignancies. Further analysis of TCGA breast cancer sequencing data shows that accumulation of specific genomic defects is associated with higher tumor grade, larger tumor size and receptor negativity. Distinct groups of genomic changes were found to be associated with the different grades of invasive ductal carcinoma. The mutator role of the TP53 gene was validated by genomic sequencing data of invasive breast cancer and TP53 mutation was found to play a critical role in defining high tumor grade.
Data mining of the TCGA genome sequencing data is an innovative and reliable method to help characterize the genomic abnormalities associated with histopathologic features of invasive breast cancer.
Breast cancer; cBioPortal; data mining; histopathology; the cancer genome atlas; tumor grade
Our ultimate goal is to identify and target modifiable risk factors that will reduce major cardiovascular events in African-American lupus patients. As a first step toward achieving this goal, this study was designed to explore risk factor models of preclinical atherosclerosis in a predominantly African-American group of SLE patients using variables historically associated with endothelial function in non-lupus populations.
51 subjects with SLE but without a history of clinical cardiovascular events were enrolled. At entry, a Framingham risk factor history and medication list were recorded. Sera and plasma samples were analyzed for lipids, lupus activity markers, and total 25-hydroxyvitamin D (25(OH)D) levels. Carotid ultrasound measurements were performed to determine total plaque area (TPA) in both carotids. Cases had TPA values above age-matched controls from a vascular prevention clinic population. Logistic regression and machine learning analyses were performed to create predictive models.
25(OH)D levels were significantly lower and SLE disease duration was significantly higher in cases. 25(OH)D levels inversely correlated with age-adjusted TPA. ACE-inhibitor non-use was associated with case status. Logistic regression models containing ACE-inhibitor use, 25(OH)D levels, and LDL levels had a diagnostic accuracy of 84% for predicting accelerated atherosclerosis. Similar results were obtained with machine learning models, but hydroxychloroquine use was associated with control status in these models.
This is the first study to demonstrate an association between atherosclerotic burden and 25(OH)D insufficiency or ACE-inhibitor non-use in lupus patients. These findings provide strong rationale for the study of ACE-inhibitors and vitamin D replenishment as preventive therapies in this high-risk population.
Systemic lupus erythematosus; Atherosclerosis; Vitamin D deficiency; Angiotensin converting enzyme inhibitors; Hypercholesterolemia
Motivation: Since 2011, The Cancer Genome Atlas’ (TCGA) files have been accessible through HTTP from a public site, creating entirely new possibilities for cancer informatics by enhancing data discovery and retrieval. Significantly, these enhancements enable the reporting of analysis results that can be fully traced to and reproduced using their source data. However, to realize this possibility, a continually updated road map of files in the TCGA is required. Creation of such a road map represents a significant data modeling challenge, due to the size and fluidity of this resource: each of the 33 cancer types is instantiated in only partially overlapping sets of analytical platforms, while the number of data files available doubles approximately every 7 months.
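To make the stated growth rate concrete, the short calculation below assumes simple exponential growth with the 7-month doubling time quoted above. The function names and the example file count are illustrative assumptions, not figures from the TCGA.

```python
DOUBLING_MONTHS = 7.0  # doubling time quoted in the text


def growth_factor(months, doubling=DOUBLING_MONTHS):
    """Multiplicative increase in file count after `months` months."""
    return 2.0 ** (months / doubling)


def projected_files(n_now, months, doubling=DOUBLING_MONTHS):
    """Projected number of files, starting from n_now files today."""
    return n_now * growth_factor(months, doubling)


# Under this assumption the file count roughly triples within a year.
one_year = growth_factor(12)
```

A road map that is refreshed only once a year would therefore miss most of the files available at the end of that year, which is why continual updating is treated as a requirement.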
Availability: A prepared dashboard, including links to source code and a SPARQL endpoint, is available at http://bit.ly/TCGARoadmap. A video tutorial is available at http://bit.ly/TCGARoadmapTutorial.
The dramatic fall in the cost of genomic sequencing, and the increasing convenience of distributed cloud computing resources, positions the MapReduce coding pattern as a cornerstone of scalable bioinformatics algorithm development. In some cases an algorithm will find a natural distribution via use of map functions to process vectorized components, followed by a reduce of aggregate intermediate results. However, for some data analysis procedures such as sequence analysis, a more fundamental reformulation may be required.
In this report we describe a solution to sequence comparison that can be thoroughly decomposed into multiple rounds of map and reduce operations. The route taken makes use of iterated maps, a fractal analysis technique, that has been found to provide an "alignment-free" solution to sequence analysis and comparison. That is, a solution that does not require dynamic programming, relying on a numeric Chaos Game Representation (CGR) data structure. This claim is demonstrated in this report by calculating the length of the longest similar segment by inspecting only the USM coordinates of two analogous units, with no resort to dynamic programming.
The procedure described is an attempt at extreme decomposition and parallelization of sequence alignment in anticipation of a volume of genomic sequence data that cannot be met by current algorithmic frameworks. The solution found is delivered with a browser-based application (webApp), highlighting the browser's emergence as an environment for high performance distributed computing.
Public distribution of accompanying software library with open source and version control at http://usm.github.com. Also available as a webApp through Google Chrome's WebStore http://chrome.google.com/webstore: search with "usm".
Chaos Game Representation (CGR) is an iterated function that bijectively maps discrete sequences into a continuous domain. As a result, discrete sequences can be the object of statistical and topological analyses otherwise reserved to numerical systems. Characteristically, CGR coordinates of substrings sharing an L-long suffix will be located within a distance of 2^-L of each other. In the two decades since its original proposal, CGR has been generalized beyond its original focus on genomic sequences and has been successfully applied to a wide range of problems in bioinformatics. This report explores the possibility that it can be further extended to approach algorithms that rely on discrete, graph-based representations.
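The suffix property stated above can be checked numerically. The sketch below assumes the common two-dimensional CGR for DNA, with each nucleotide assigned to a corner of the unit square and each step moving halfway toward that corner; the corner assignment, starting point, function names, and example sequences are illustrative assumptions rather than details taken from this report. Exact rational arithmetic is used so that the bound is not blurred by rounding.

```python
from fractions import Fraction

# Corner assignment is one common convention, assumed here for illustration.
CORNERS = {"A": (0, 0), "C": (0, 1), "G": (1, 1), "T": (1, 0)}


def cgr(seq, start=(Fraction(1, 2), Fraction(1, 2))):
    """Return the CGR coordinate reached after reading the whole sequence."""
    x, y = start
    for base in seq:
        cx, cy = CORNERS[base]
        # Move halfway from the current point toward the base's corner.
        x = (x + cx) / 2
        y = (y + cy) / 2
    return x, y


def chebyshev(p, q):
    """Largest coordinate-wise distance between two points."""
    return max(abs(p[0] - q[0]), abs(p[1] - q[1]))


a = "GATTACAGGCT"
b = "CCGGACAGGCT"  # shares the 7-long suffix "ACAGGCT" with a
L = 7
distance = chebyshev(cgr(a), cgr(b))
within_bound = distance <= Fraction(1, 2 ** L)
```

Each shared symbol halves the gap between the two points, so after L shared symbols the gap in each coordinate is at most 2^-L. The converse does not hold in general: two points can be close without sharing a long suffix when they sit on either side of a quadrant boundary.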
The exploratory analysis described here consisted of selecting foundational string problems and refactoring them using CGR-based algorithms. We found that CGR can take the role of suffix trees and emulate sophisticated string algorithms, efficiently solving exact and approximate string matching problems such as finding all palindromes and tandem repeats, and matching with mismatches. The common feature of these problems is that they use longest common extension (LCE) queries as subtasks of their procedures, which we show to have a constant time solution with CGR. Additionally, we show that CGR can be used as a rolling hash function within the Rabin-Karp algorithm.
The analysis of biological sequences relies on algorithmic foundations facing mounting challenges, both logistic (performance) and analytical (lack of unifying mathematical framework). CGR is found to provide the latter and to promise the former: graph-based data structures for sequence analysis operations are entailed by numerical-based data structures produced by CGR maps, providing a unifying analytical framework for a diversity of pattern matching problems.
There are currently no reliable markers of acute domoic acid toxicosis (DAT) for California sea lions. We investigated whether patterns of serum peptides could diagnose acute DAT. Serum peptides were analyzed by MALDI-TOF mass spectrometry from 107 sea lions (acute DAT n = 34; non-DAT n = 73). Artificial neural networks (ANN) were trained using MALDI-TOF data. Individual peaks and neural networks were qualified using an independent test set (n = 20).
No single peak was a good classifier of acute DAT, and ANN models were the best predictors of acute DAT. Performance measures for a single median ANN were: sensitivity, 100%; specificity, 60%; positive predictive value, 71%; negative predictive value, 100%. When 101 ANNs were combined and allowed to vote for the outcome, the performance measures were: sensitivity, 30%; specificity, 100%; positive predictive value, 100%; negative predictive value, 59%.
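The four performance measures reported above, and the idea of letting a population of models vote, can be sketched as follows. The voting rule is assumed here to be a simple majority (the abstract does not state the rule used), and the labels and predictions are hypothetical toy values, not the sea lion data.

```python
def majority_vote(model_predictions):
    """Combine 0/1 predictions from several models by simple majority.

    model_predictions -- one list of 0/1 predictions per model,
                         all of the same length.
    """
    n_models = len(model_predictions)
    return [int(2 * sum(col) > n_models) for col in zip(*model_predictions)]


def diagnostic_metrics(y_true, y_pred):
    """Sensitivity, specificity, PPV and NPV from 0/1 labels and predictions."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    def ratio(num, den):
        # A measure is undefined (None) when its denominator is zero.
        return num / den if den else None

    return {
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "ppv": ratio(tp, tp + fp),
        "npv": ratio(tn, tn + fn),
    }


# Hypothetical toy data: 1 = acute DAT, 0 = non-DAT; three models vote.
y_true = [1, 1, 1, 0, 0]
votes = [
    [1, 1, 0, 0, 1],
    [1, 0, 0, 0, 0],
    [1, 1, 1, 1, 0],
]
ensemble = majority_vote(votes)
metrics = diagnostic_metrics(y_true, ensemble)
```

Using an odd number of models, as with the 101 networks above, avoids tied votes under a majority rule. Changing the fraction of votes required for a positive call is one way to trade sensitivity against specificity.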
These results suggest that MALDI-TOF peptide profiling and neural networks can perform either as a highly sensitive (100% negative predictive value) or a highly specific (100% positive predictive value) diagnostic tool for acute DAT. This also suggests that machine learning directed by populations of predictive models offers the ability to modulate the predictive effort into a specific type of error.
Serum peptides; Neural network; Zalophus californianus; Neurotoxin
Image bioinformatics infrastructure typically relies on a combination of server-side high-performance computing and client desktop applications tailored for graphic rendering. On the server side, matrix manipulation environments are often used as the back-end where deployment of specialized analytical workflows takes place. However, neither the server-side nor the client-side desktop solution, by themselves or combined, is conducive to the emergence of open, collaborative, computational ecosystems for image analysis that are both self-sustained and user driven.
Materials and Methods:
ImageJS was developed as a browser-based webApp, untethered from a server-side backend, by making use of recent advances in the modern web browser such as a very efficient compiler, high-end graphical rendering capabilities, and I/O tailored for code migration.
Multiple versioned code hosting services were used to develop distinct ImageJS modules to illustrate its amenability to collaborative deployment without compromise of reproducibility or provenance. The illustrative examples include modules for image segmentation, feature extraction, and filtering. The deployment of image analysis by code migration is in sharp contrast with the more conventional, heavier, and less safe reliance on data transfer. Accordingly, code and data are loaded into the browser by exactly the same script tag loading mechanism, which offers a number of interesting applications that would be hard to attain with more conventional platforms, such as NIH's popular ImageJ application.
The modern web browser was found to be advantageous for image bioinformatics in both the research and clinical environments. This conclusion reflects advantages in deployment scalability and analysis reproducibility, as well as the critical ability to deliver advanced computational statistical procedures to machines where access to sensitive data is controlled, that is, without local “download and installation”.
Cloud computing; image analysis; webApp
The Human Proteome Organisation’s Proteomics Standards Initiative (HUPO-PSI) has developed the GelML data exchange format for representing gel electrophoresis experiments performed in proteomics investigations. The format closely follows the reporting guidelines for gel electrophoresis, which are part of the Minimum Information About a Proteomics Experiment (MIAPE) set of modules. GelML supports the capture of metadata (such as experimental protocols) and data (such as gel images) resulting from gel electrophoresis so that laboratories can be compliant with the MIAPE Gel Electrophoresis guidelines, while allowing such data sets to be exchanged or downloaded from public repositories. The format is sufficiently flexible to capture data from a broad range of experimental processes, and complements other PSI formats for mass spectrometry data and the results of protein and peptide identifications to capture entire gel-based proteome workflows. GelML has resulted from the open standardisation process of PSI consisting of both public consultation and anonymous review of the specifications.
data standard; gel electrophoresis; database; ontology
The objective of this investigation was to evaluate the effect of maternal obesity, as measured by prepregnancy body mass index (BMI), on the mode of delivery in women undergoing indicated induction of labor for preeclampsia.
Following IRB approval, patients with preeclampsia who underwent an induction of labor from 1997–2007 were identified from a perinatal information database, which included historical and clinical information. Data analysis included bivariable and multivariable analyses of predictor variables by mode of delivery. An artificial neural network was trained and externally validated to independently examine predictors of mode of delivery among women with preeclampsia.
Six hundred and eight women met eligibility criteria and were included in this investigation. Based on multivariable logistic regression (MLR) modeling, a five-unit increase in BMI yields a 16% increase in the odds of cesarean delivery. An artificial neural network trained and externally validated confirmed the importance of obesity in the prediction of mode of delivery among women undergoing labor induction for preeclampsia.
Among patients who are affected by preeclampsia, obesity complicates labor induction. The risk of cesarean delivery is enhanced by obesity, even with small increases in BMI. Prediction of mode of delivery by an artificial neural network performs similarly to MLR among patients undergoing labor induction for preeclampsia.
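The reported effect size can be rescaled to other BMI differences. The arithmetic below assumes the usual logistic regression form in which the log-odds are linear in BMI, so that odds ratios multiply; only the figure of 16% per five units comes from the text, and the names are illustrative.

```python
import math

OR_PER_5_UNITS = 1.16  # 16% increase in odds per five BMI units (from the text)

# Implied logistic regression coefficient per single BMI unit.
beta_per_unit = math.log(OR_PER_5_UNITS) / 5


def odds_ratio(delta_bmi):
    """Odds ratio for cesarean delivery for a BMI difference of delta_bmi."""
    return math.exp(beta_per_unit * delta_bmi)


per_unit = odds_ratio(1)   # roughly a 3% increase in odds per BMI unit
per_ten = odds_ratio(10)   # equals 1.16 squared under this assumption
```

Because the effect is multiplicative on the odds scale, a ten-unit difference corresponds to about a 35% increase in odds, not 32%.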
Obesity; severe preeclampsia; cesarean delivery; body mass index
In the path towards personalized medicine, the integrative bioinformatics infrastructure is a critical enabling resource. Until large-scale reference data became available, the attributes of the computational infrastructure were postulated by many, but have mostly remained unverified. Now that large-scale initiatives such as The Cancer Genome Atlas (TCGA) are in full swing, the opportunity is at hand to find out what analytical approaches and computational architectures are really effective. A recent report did just that: first a software development environment was assembled as part of an informatics research program, and only then was the analysis of TCGA's glioblastoma multiforme multi-omic data pursued at the multi-omic scale. The results of this complex analysis are the focus of the report highlighted here. However, what is reported in the analysis is also the validating corollary for an infrastructure development effort guided by the iterative identification of sound design criteria for the architecture of the integrative computational infrastructure. The work is at least as valuable as the data analysis results themselves: computational ecosystems with their own high-level abstractions rather than rigid pipelines with prescriptive recipes appear to be the critical feature of an effective infrastructure. Only then can analytical workflows benefit from experimentation just like any other component of the biomedical research program.
The value and usefulness of data increases when it is explicitly interlinked with related data. This is the core principle of Linked Data. For life sciences researchers, harnessing the power of Linked Data to improve biological discovery is still challenged by a need to keep pace with rapidly evolving domains and requirements for collaboration and control as well as with the reference semantic web ontologies and standards. Knowledge organization systems (KOSs) can provide an abstraction for publishing biological discoveries as Linked Data without complicating transactions with contextual minutia such as provenance and access control.
We have previously described the Simple Sloppy Semantic Database (S3DB) as an efficient model for creating knowledge organization systems using Linked Data best practices with explicit distinction between domain and instantiation and support for a permission control mechanism that automatically migrates between the two. In this report we present a domain specific language, the S3DB query language (S3QL), to operate on its underlying core model and facilitate management of Linked Data.
Reflecting the data driven nature of our approach, S3QL has been implemented as an application programming interface for S3DB systems hosting biomedical data, and its syntax was subsequently generalized beyond the S3DB core model. This achievement is illustrated with the assembly of an S3QL query to manage entities from the Simple Knowledge Organization System. The illustrative use cases include gastrointestinal clinical trials, genomic characterization of cancer by The Cancer Genome Atlas (TCGA) and molecular epidemiology of infectious diseases.
S3QL was found to provide a convenient mechanism to represent context for interoperation between public and private datasets hosted at biomedical research institutions and linked data formalisms.
S3DB; Linked Data; KOS; RDF; SPARQL; knowledge organization system; policy
The Cancer Genome Atlas (TCGA) is a multidisciplinary, multi-institutional effort to characterize several types of cancer. Datasets from biomedical domains such as TCGA present a particularly challenging task for those interested in dynamically aggregating its results because the data sources are typically both heterogeneous and distributed. The Linked Data best practices offer a solution to integrate and discover data with those characteristics, namely through exposure of data as Web services supporting SPARQL, the Resource Description Framework query language. Most SPARQL endpoints, however, cannot easily be queried by data experts. Furthermore, exposing experimental data as SPARQL endpoints remains a challenging task because, in most cases, data must first be converted to Resource Description Framework triples. In line with those requirements, we have developed an infrastructure to expose clinical, demographic and molecular data elements generated by TCGA as a SPARQL endpoint by assigning elements to entities of the Simple Sloppy Semantic Database (S3DB) management model. All components of the infrastructure are available as independent Representational State Transfer (REST) Web services to encourage reusability, and a simple interface was developed to automatically assemble SPARQL queries by navigating a representation of the TCGA domain. A key feature of the proposed solution that greatly facilitates assembly of SPARQL queries is the distinction between the TCGA domain descriptors and data elements. Furthermore, the use of the S3DB management model as a mediator enables queries to both public and protected data without the need for prior submission to a single data source.
TCGA; SPARQL; RDF; Linked Data; Data integration
Acute kidney injury (AKI) is an important cause of death among hospitalized patients. The two most common causes of AKI are acute tubular necrosis (ATN) and prerenal azotemia (PRA). Appropriate diagnosis of the disease is important but often difficult. We analyzed urine proteins by 2-DE from 38 patients with AKI. Patients were randomly assigned to a training set, an internal test set or an external validation set. Spot abundances were analyzed by artificial neural networks (ANN) to identify biomarkers which differentiate between ATN and PRA. When the trained neural network algorithm was tested against the training data it identified the diagnosis for 16/18 patients in the training set and all 10 patients in the internal test set. The accuracy was validated in the novel external set of patients where 9/10 subjects were correctly diagnosed including 5/5 with ATN and 4/5 with PRA. Plasma retinol binding protein (PRBP) was identified in one spot and a fragment of albumin and PRBP in the other. These proteins are candidate markers for diagnostic assays of AKI.
Acute kidney injury; Biomarkers; Diagnosis; Kidney; Urine
AGUIA is a front-end web application originally developed to manage clinical, demographic and biomolecular patient data collected during clinical trials at MD Anderson Cancer Center. The diversity of methods involved in patient screening and sample processing generates a variety of data types that require a resource-oriented architecture to capture the associations between the heterogeneous data elements. AGUIA uses a semantic web formalism, resource description framework (RDF), and a bottom-up design of knowledge bases that employ the S3DB tool as the starting point for the client's interface assembly.
The data web service, S3DB, meets the necessary requirements of generating the RDF and of explicitly distinguishing the description of the domain from its instantiation, while allowing for continuous editing of both. Furthermore, it uses an HTTP-REST protocol, has a SPARQL endpoint, and has open source availability in the public domain, which facilitates the development and dissemination of this application. However, S3DB alone does not address the issue of representing content in a form that makes sense for domain experts.
We propose a distribution-free approach to detect nonlinear relationships by reporting local correlation. The effect of our proposed method is analogous to piece-wise linear approximation although the method does not utilize any linear dependency. The proposed metric, maximum local correlation, was applied to both simulated cases and expression microarray data comparing the rd mouse with age-matched control animals. The rd mouse is an animal model (with a mutation for the gene Pde6b) for photoreceptor degeneration. Using simulated data, we show that maximum local correlation detects nonlinear association, which could not be detected using other correlation measures. In the microarray study, our proposed method detects nonlinear association between the expression levels of different genes, which could not be detected using the conventional linear methods. The simulation dataset, microarray expression data, and the Nonparametric Nonlinear Correlation (NNC) software library, implemented in Matlab, are included as part of the online supplemental materials.
Biomedical research is set to greatly benefit from the use of semantic web technologies in the design of computational infrastructure. However, beyond well defined research initiatives, substantial issues of data heterogeneity, source distribution, and privacy currently stand in the way towards the personalization of Medicine.
A computational framework for bioinformatic infrastructure was designed to deal with the heterogeneous data sources and the sensitive mixture of public and private data that characterizes the biomedical domain. This framework consists of a logical model built with semantic web tools, coupled with a Markov process that propagates user operator states. An accompanying open source prototype was developed to meet a series of applications that range from collaborative multi-institution data acquisition efforts to data analysis applications that need to quickly traverse complex data structures. This report describes the two abstractions underlying the S3DB-based infrastructure, logical and numerical, and discusses its generality beyond the immediate confines of existing implementations.
The emergence of the "web as a computer" requires a formal model for the different functionalities involved in reading and writing to it. The S3DB core model proposed was found to address the design criteria of biomedical computational infrastructure, such as those supporting large scale multi-investigator research, clinical trials, and molecular epidemiology.
Heavy metals, such as copper, zinc and cadmium, represent some of the most common and serious pollutants in coastal estuaries. In the present study, we used a combination of linear and artificial neural network (ANN) modelling to detect and explore interactions among low-dose mixtures of these heavy metals and their impacts on fundamental physiological processes in tissues of the Eastern oyster, Crassostrea virginica. Animals were exposed to Cd (0.001–0.400 µM), Zn (0.001–3.059 µM) or Cu (0.002–0.787 µM), either alone or in combination for 1 to 27 days. We measured indicators of acid–base balance (hemolymph pH and total CO2), gas exchange (Po2), immunocompetence (total hemocyte counts, numbers of invasive bacteria), antioxidant status (glutathione, GSH), oxidative damage (lipid peroxidation; LPx), and metal accumulation in the gill and the hepatopancreas. Linear analysis showed that oxidative membrane damage from tissue accumulation of environmental metals was correlated with impaired acid–base balance in oysters. ANN analysis revealed interactions of metals with hemolymph acid–base chemistry in predicting oxidative damage that were not evident from linear analyses. These results highlight the usefulness of machine learning approaches, such as ANNs, for improving our ability to recognize and understand the effects of subacute exposure to contaminant mixtures.
Heavy metals; Artificial neural networks; Crassostrea virginica; Lipid peroxidation; Glutathione; Acid–base balance; Hemolymph PO2
Two-dimensional gel electrophoresis (2DE) offers high-resolution separation for intact proteins. However, variability in the appearance of spots can limit the ability to identify true differences between conditions. Variability can occur at a number of levels. Individual samples can differ because of biological variability. Technical variability can occur during protein extraction, processing, or storage. Another potential source of variability occurs during analysis of the gels and is not a result of any of the causes of variability named above. We performed a study designed to focus only on the variability caused by analysis. We separated three aliquots of rat left ventricle and analyzed differences in protein abundance on the replicate 2D gels. As the samples loaded on each gel were identical, differences in protein abundance are caused by variability in separation or interpretation of the gels. Protein spots were compared across gels by quantile values to determine differences. Fourteen percent of spots had a maximum difference in intensity of 0.4 quantile values or more between replicates. We then looked individually at the spots to determine the cause of differences between the measured intensities. Reasons for differences were: failure to identify a spot (59%), differences in spot boundaries (13%), difference in the peak height (6%), and a combination of these factors (21%). This study demonstrates that spot identification and characterization make major contributions to variability seen with 2DE. Methods to highlight why measured protein spot abundance is different could reduce these errors.
heart; proteomics; reproducibility; protein
DAS is a widely adopted protocol for providing syntactic interoperability among biological databases. The popularity of DAS is due to a simplified and elegant mechanism for data exchange that consists of sources exposing their RESTful interfaces for data access. As a growing number of DAS services are available for molecular biology resources, there is an incentive to explore this protocol in order to advance data discovery and integration among these resources.
We developed DASMiner, a Matlab toolkit for querying DAS data sources that enables creation of integrated biological models using the information available in DAS-compliant repositories. DASMiner is composed of a browser application and an API that work together to facilitate gathering of data from different DAS sources, which can be used for creating enriched datasets from multiple sources.
The browser is used to formulate queries and navigate data contained in DAS sources. Users can execute queries against these sources in an intuitive fashion, without needing to know the specific DAS syntax for the particular source. Using the source's metadata provided by the DAS Registry, the browser's layout adapts to expose only the set of commands and coordinate systems supported by the specific source. For this reason, the browser can interrogate any DAS source, regardless of the type of data being served.
The API component of DASMiner may be used for programmatic access to DAS sources by programs in Matlab. Once the desired data is found during navigation, the query is exported in the format of an API call to be used within any Matlab application. We illustrate the use of DASMiner by creating integrative models of histone modification maps and protein-protein interaction networks. These enriched datasets were built by retrieving and integrating distributed genomic and proteomic DAS sources using the API.
Support for the DAS protocol allows hundreds of molecular biology databases to be treated as a federated, online collection of resources. DASMiner enables full exploration of these resources, and can be used to deploy applications and create integrated views of biological systems using the information deposited in DAS repositories.
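To make the "specific DAS syntax" mentioned above concrete, the following minimal sketch builds a request URL for the DAS `features` command, which in DAS 1.x takes the form `SERVER/das/SOURCE/features?segment=REFERENCE:START,STOP`. It is written in Python rather than Matlab and is not DASMiner's API; the function name, server address, and source name are hypothetical placeholders, and no network request is made.

```python
from urllib.parse import urlencode


def das_features_url(server, source, reference, start, stop):
    """Build a DAS 1.x 'features' request URL for one segment.

    Hypothetical helper: illustrates the URL pattern only and
    does not contact any server.
    """
    # Keep ':' and ',' unescaped so the segment reads REFERENCE:START,STOP.
    query = urlencode({"segment": f"{reference}:{start},{stop}"}, safe=":,")
    return f"{server.rstrip('/')}/das/{source}/features?{query}"


# Placeholder server and source names, not real DAS endpoints.
url = das_features_url("http://das.example.org", "hypothetical_source", "1", 1000, 2000)
print(url)
```

A tool that exports queries as API calls, as described above, hides this string construction from the user; the sketch only shows what such a query looks like on the wire.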
Many mass spectrometry-based studies, as well as other biological experiments, produce cluster-correlated data. Failure to account for correlation among observations may result in a classification algorithm overfitting the training data and producing overoptimistic estimated error rates and may make subsequent classifications unreliable. Current common practice for dealing with replicated data is to average each subject's replicate sample set, reducing the dataset size and incurring loss of information. In this manuscript we compare three approaches to dealing with cluster-correlated data: unmodified Breiman's Random Forest (URF), forest grown using subject-level averages (SLA), and RF++ with subject-level bootstrapping (SLB). RF++, a novel Random Forest-based algorithm implemented in C++, handles cluster-correlated data through a modification of the original resampling algorithm and accommodates subject-level classification. Subject-level bootstrapping is an alternative sampling method that obviates the need to average or otherwise reduce each set of replicates to a single independent sample. Our experiments show nearly identical median classification and variable selection accuracy for SLB forests and URF forests when applied to both simulated and real datasets. However, the run-time estimated error rate was severely underestimated for URF forests. Predictably, SLA forests were found to be more severely affected by the reduction in sample size, which led to poorer classification and variable selection accuracy. Perhaps most importantly, our results suggest that it is reasonable to utilize URF for the analysis of cluster-correlated data. Two caveats should be noted: first, correct classification error rates must be obtained using a separate test dataset, and second, an additional post-processing step is required to obtain subject-level classifications. RF++ is shown to be an effective alternative for classifying both clustered and non-clustered data.
Source code and stand-alone compiled versions of command-line and easy-to-use graphical user interface (GUI) versions of RF++ for Windows and Linux as well as a user manual (Supplementary File S2) are available for download at: http://sourceforge.org/projects/rfpp/ under the GNU public license.
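The idea behind subject-level bootstrapping can be sketched in a few lines. This is an illustration in Python, not the RF++ C++ implementation: it assumes that subjects are drawn with replacement and that every replicate of a drawn subject enters the in-bag sample, so that no subject contributes replicates to both the in-bag and out-of-bag sets. The function name and the subject labels are hypothetical.

```python
import random
from collections import defaultdict


def subject_level_bootstrap(subject_ids, rng):
    """Return (in_bag, out_of_bag) row indices, resampling whole subjects.

    subject_ids[i] is the subject that replicate (row) i belongs to.
    Assumption: all replicates of a drawn subject are placed in-bag.
    """
    rows_by_subject = defaultdict(list)
    for row, subject in enumerate(subject_ids):
        rows_by_subject[subject].append(row)

    subjects = sorted(rows_by_subject)
    # Draw as many subjects as there are, with replacement.
    drawn = [rng.choice(subjects) for _ in subjects]

    in_bag = [row for s in drawn for row in rows_by_subject[s]]
    drawn_set = set(drawn)
    out_of_bag = [
        row for s in subjects if s not in drawn_set for row in rows_by_subject[s]
    ]
    return in_bag, out_of_bag


# Illustrative design: four subjects with two or three replicates each.
subject_ids = ["a", "a", "b", "b", "b", "c", "d", "d"]
in_bag, out_of_bag = subject_level_bootstrap(subject_ids, random.Random(0))
print(in_bag, out_of_bag)
```

Because a subject's replicates never straddle the two sets, an error rate estimated on the out-of-bag rows is not flattered by correlated replicates of subjects the tree has already seen, which is the failure mode the abstract reports for the unmodified resampling scheme.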
Diagnosis of the type of glomerular disease that causes the nephrotic syndrome is necessary for appropriate treatment and typically requires a renal biopsy. The goal of this study was to identify candidate protein biomarkers to diagnose glomerular diseases. Proteomic methods and informatic analysis were used to identify patterns of urine proteins that are characteristic of the diseases. Urine proteins were separated by two-dimensional electrophoresis in 32 patients with FSGS, lupus nephritis, membranous nephropathy, or diabetic nephropathy. Protein abundances from 16 patients were used to train an artificial neural network to create a prediction algorithm. The remaining 16 patients were used as an external validation set to test the accuracy of the prediction algorithm. In the validation set, the model predicted the presence of the diseases with sensitivities between 75 and 86% and specificities between 67 and 92%. The probability of obtaining these results in the novel set by chance is 5 × 10−8. Twenty-one gel spots were most important for the differentiation of the diseases. The spots were cut from the gel, and 20 were identified by mass spectrometry as charge forms of 11 plasma proteins: orosomucoid, transferrin, α-1 microglobulin, zinc α-2 glycoprotein, α-1 antitrypsin, complement factor B, haptoglobin, transthyretin, plasma retinol binding protein, albumin, and hemopexin. These data show that diseases that cause nephrotic syndrome change glomerular protein permeability in characteristic patterns. The fingerprint of urine protein charge forms identifies the glomerular disease. The identified proteins are candidate biomarkers that can be tested in assays that are more amenable to clinical testing.
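Per-disease sensitivity and specificity of the kind reported above can be computed from a validation set by treating each disease in turn as the positive class (one versus rest). The sketch below shows the arithmetic only; the function name is hypothetical and the labels are invented for illustration, not the study's data.

```python
def sensitivity_specificity(true_labels, predicted_labels, disease):
    """One-vs-rest sensitivity and specificity for a single disease."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(t == disease and p == disease for t, p in pairs)  # true positives
    fn = sum(t == disease and p != disease for t, p in pairs)  # false negatives
    tn = sum(t != disease and p != disease for t, p in pairs)  # true negatives
    fp = sum(t != disease and p == disease for t, p in pairs)  # false positives
    return tp / (tp + fn), tn / (tn + fp)


# Invented labels for eight hypothetical validation patients.
truth = ["FSGS", "FSGS", "FSGS", "FSGS", "lupus", "lupus", "membranous", "diabetic"]
predicted = ["FSGS", "FSGS", "FSGS", "lupus", "lupus", "FSGS", "membranous", "diabetic"]

sens, spec = sensitivity_specificity(truth, predicted, "FSGS")
print(sens, spec)
```

Here three of four FSGS patients are detected and one of four non-FSGS patients is wrongly called FSGS, giving a sensitivity and a specificity of 0.75 each.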