1.  Identification of prognostic gene signatures of glioblastoma: a study based on TCGA data analysis 
Neuro-Oncology  2013;15(7):829-839.
Background
The Cancer Genome Atlas (TCGA) project is a large-scale effort with the goal of identifying novel molecular aberrations in glioblastoma (GBM).
Methods
Here, we describe an in-depth analysis of gene expression data and copy number aberration (CNA) data to classify GBMs into prognostic groups to determine correlates of subtypes that may be biologically significant.
Results
To identify predictive survival models, we analyzed TCGA data from 173 patients and identified 42 probe sets (P = .0005) that could be used to divide the tumor samples into 3 groups, one of which showed significantly (P = .0006) improved overall survival. Kaplan-Meier plots showed that the median survival of group 3 was markedly longer (127 weeks) than that of groups 1 and 2 (47 and 52 weeks, respectively). We then validated the ability of the 42 probe sets to stratify patients according to survival in other public GBM gene expression datasets (eg, the GSE4290 dataset). An overall analysis of the gene expression and copy number aberration data using a multivariate Cox regression model showed that the 42 probe sets had a significant (P < .018) prognostic value independent of other variables.
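As a hedged illustration of the stratification-plus-Cox workflow described above, the sketch below clusters a placeholder 42-probe-set matrix into three groups, computes Kaplan-Meier median survival per group, and fits a multivariate Cox model; the simulated data, the k-means clustering choice, and the covariates are illustrative assumptions, not the authors' pipeline.

```python
# Minimal sketch of signature-based stratification followed by a Cox model.
# `expr` (173 samples x 42 probe sets) and `clin` are hypothetical placeholders,
# not the TCGA data or the authors' pipeline.
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from lifelines import CoxPHFitter, KaplanMeierFitter

rng = np.random.default_rng(0)
expr = pd.DataFrame(rng.normal(size=(173, 42)))            # placeholder expression matrix
clin = pd.DataFrame({
    "weeks": rng.exponential(60, size=173),                # placeholder survival times
    "event": rng.integers(0, 2, size=173),                 # 1 = death observed
    "age": rng.normal(58, 10, size=173),                   # stand-in covariate
})

# Divide tumors into 3 groups using the 42-probe-set signature (k-means here).
clin["group"] = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(expr)

# Kaplan-Meier median survival per group.
for g, sub in clin.groupby("group"):
    km = KaplanMeierFitter().fit(sub["weeks"], sub["event"], label=f"group {g}")
    print(g, km.median_survival_time_)

# Multivariate Cox model: is the signature group prognostic independently of
# the other covariates?
df = pd.get_dummies(clin, columns=["group"], drop_first=True, dtype=float)
cph = CoxPHFitter().fit(df, duration_col="weeks", event_col="event")
cph.print_summary()
```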
Conclusions
By integrating multidimensional genomic data from TCGA, we identified a specific survival model in a new prognostic group of GBM and suggest that molecular stratification of patients with GBM into homogeneous subgroups may provide opportunities for the development of new treatment modalities.
doi:10.1093/neuonc/not024
PMCID: PMC3688008  PMID: 23502430
comparative genomic hybridization; EMT; gene expression; glioblastoma; prognostic marker; TCGA
2.  QMachine: commodity supercomputing in web browsers 
BMC Bioinformatics  2014;15:176.
Background
Ongoing advancements in cloud computing provide novel opportunities in scientific computing, especially for distributed workflows. Modern web browsers can now be used as high-performance workstations for querying, processing, and visualizing genomics’ “Big Data” from sources like The Cancer Genome Atlas (TCGA) and the International Cancer Genome Consortium (ICGC) without local software installation or configuration. The design of QMachine (QM) was driven by the opportunity to use this pervasive computing model in the context of the Web of Linked Data in Biomedicine.
Results
QM is an open-sourced, publicly available web service that acts as a messaging system for posting tasks and retrieving results over HTTP. The illustrative application described here distributes the analyses of 20 Streptococcus pneumoniae genomes for shared suffixes. Because all analytical and data retrieval tasks are executed by volunteer machines, few server resources are required. Any modern web browser can submit those tasks and/or volunteer to execute them without installing any extra plugins or programs. A client library provides high-level distribution templates including MapReduce. This stark departure from the current reliance on expensive server hardware running “download and install” software has already gathered substantial community interest, as QM received more than 2.2 million API calls from 87 countries in 12 months.
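As a hedged illustration of the map/reduce decomposition mentioned above (QM's actual HTTP API and message formats are not reproduced here), a shared-suffix analysis can be split into independent map tasks followed by a single reduce:

```python
# Sketch of the map/reduce pattern for finding substrings shared across genomes.
# The toy sequences below are placeholders; QMachine's HTTP task queue and
# volunteer browser workers are not modeled, only the decomposition of the work.
from functools import reduce

genomes = {                      # hypothetical stand-ins for S. pneumoniae genomes
    "g1": "ACGTACGTTTGCA",
    "g2": "TTGACGTTTGCA",
    "g3": "GGGACGTTTGCA",
}

def map_task(seq, max_len=8):
    """Each worker emits the substrings (up to max_len) ending at every position."""
    return {seq[i - k:i] for i in range(1, len(seq) + 1)
            for k in range(1, min(max_len, i) + 1)}

def reduce_task(a, b):
    """Combine partial results by intersection: keep only strings seen in all genomes."""
    return a & b

partials = [map_task(s) for s in genomes.values()]   # these calls could run on volunteers
shared = reduce(reduce_task, partials)
print(max(shared, key=len))                          # longest substring shared by all
```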
Conclusions
QM was found adequate to deliver the sort of scalable bioinformatics solutions that computation- and data-intensive workflows require. Paradoxically, the sandboxed execution of code by web browsers was also found to enable them, as compute nodes, to address critical privacy concerns that characterize biomedical environments.
doi:10.1186/1471-2105-15-176
PMCID: PMC4063228  PMID: 24913605
Cloud computing; Crowdsourcing; Distributed computing; JavaScript; MapReduce; PaaS; Sequence analysis; Web service
3.  Non-synonymous variations in cancer and their effects on the human proteome: workflow for NGS data biocuration and proteome-wide analysis of TCGA data 
BMC Bioinformatics  2014;15:28.
Background
Next-generation sequencing (NGS) technologies have resulted in petabytes of scattered data, decentralized in archives, databases and sometimes in isolated hard disks that are inaccessible for browsing and analysis. It is expected that curated secondary databases will help organize some of this Big Data, thereby allowing users to better navigate, search and compute on it.
Results
To address the above challenge, we have implemented an NGS biocuration workflow and are analyzing short read sequences and associated metadata from cancer patients to better understand the human variome. Curation of variation and other related information from control (normal tissue) and case (tumor) samples will provide comprehensive background information that can be used in genomic medicine research and application studies. Our approach includes a CloudBioLinux Virtual Machine which is used upstream of the High-performance Integrated Virtual Environment (HIVE), which encapsulates the Curated Short Read archive (CSR) and a proteome-wide variation effect analysis tool (SNVDis). As a proof-of-concept, we have curated and analyzed control and case breast cancer datasets from the NCI cancer genomics program - The Cancer Genome Atlas (TCGA). Our efforts include reviewing and recording in CSR available clinical information on patients, mapping of the reads to the reference, followed by identification of non-synonymous Single Nucleotide Variations (nsSNVs), and integrating the data with tools that allow analysis of the effect of nsSNVs on the human proteome. Furthermore, we have also developed a novel phylogenetic analysis algorithm that uses SNV positions and can be used to classify the patient population. The workflow described here lays the foundation for analysis of short read sequence data to identify rare and novel SNVs that are not present in dbSNP and therefore provides a more comprehensive understanding of the human variome. Variation results for single genes as well as the entire study are available from the CSR website (http://hive.biochemistry.gwu.edu/dna.cgi?cmd=csr).
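A minimal sketch of the "novel nsSNV" filtering step described above; the variant records and the dbSNP identifier set are hypothetical placeholders, not output of the CSR/HIVE pipeline:

```python
# Sketch: flag non-synonymous SNVs that lack a dbSNP identifier, i.e. candidate
# novel variants. Input records are hypothetical tuples, not CSR/HIVE output.
from collections import namedtuple

Variant = namedtuple("Variant", "chrom pos ref alt consequence rsid")

variants = [
    Variant("chr17", 7577121, "G", "A", "missense", "rs28934578"),
    Variant("chr13", 32914438, "T", "C", "missense", None),        # no dbSNP entry
    Variant("chr17", 41245466, "G", "A", "synonymous", None),
]

known_rsids = {"rs28934578"}        # stand-in for a dbSNP lookup

novel_nssnvs = [v for v in variants
                if v.consequence == "missense"                      # non-synonymous only
                and (v.rsid is None or v.rsid not in known_rsids)]

for v in novel_nssnvs:
    print(f"{v.chrom}:{v.pos} {v.ref}>{v.alt} not in dbSNP")
```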
Conclusions
Availability of thousands of sequenced samples from patients provides a rich repository of sequence information that can be utilized to identify individual level SNVs and their effect on the human proteome beyond what the dbSNP database provides.
doi:10.1186/1471-2105-15-28
PMCID: PMC3916084  PMID: 24467687
SRA; TCGA; nsSNV; SNV; SNP; Next-gen; NGS; Phylogenetics; Cancer
4.  Mining genome sequencing data to identify the genomic features linked to breast cancer histopathology 
Background:
Genetics and genomics have radically altered our understanding of breast cancer progression. However, the genomic basis of various histopathologic features of breast cancer is not yet well-defined.
Materials and Methods:
The Cancer Genome Atlas (TCGA) is an international database containing a large collection of human cancer genome sequencing data. cBioPortal is a web tool developed for mining these sequencing data. We performed mining of TCGA sequencing data in an attempt to characterize the genomic features correlated with breast cancer histopathology. We first assessed the quality of the TCGA data using a group of genes with known alterations in various cancers. Both genome-wide gene mutation and copy number changes as well as a group of genes with a high frequency of genetic changes were then correlated with various histopathologic features of invasive breast cancer.
Results:
Validation of TCGA data using a group of genes with known alterations in breast cancer suggests that the TCGA has accurately documented the genomic abnormalities of multiple malignancies. Further analysis of TCGA breast cancer sequencing data shows that accumulation of specific genomic defects is associated with higher tumor grade, larger tumor size and receptor negativity. Distinct groups of genomic changes were found to be associated with the different grades of invasive ductal carcinoma. The mutator role of the TP53 gene was validated by genomic sequencing data of invasive breast cancer and TP53 mutation was found to play a critical role in defining high tumor grade.
Conclusions:
Data mining of the TCGA genome sequencing data is an innovative and reliable method to help characterize the genomic abnormalities associated with histopathologic features of invasive breast cancer.
doi:10.4103/2153-3539.126147
PMCID: PMC3952399  PMID: 24672738
Breast cancer; cBioPortal; data mining; histopathology; the cancer genome atlas; tumor grade
5.  Premature atherosclerosis is associated with hypovitaminosis D and angiotensin converting enzyme inhibitor non-use in lupus patients 
Our ultimate goal is to identify and target modifiable risk factors that will reduce major cardiovascular events in African-American lupus patients. As a first step toward achieving this goal, this study was designed to explore risk factor models of preclinical atherosclerosis in a predominantly African-American group of SLE patients using variables historically associated with endothelial function in non-lupus populations.
51 subjects with SLE but without a history of clinical cardiovascular events were enrolled. At entry, a Framingham risk factor history and medication list were recorded. Sera and plasma samples were analyzed for lipids, lupus activity markers, and total 25-hydroxyvitamin D (25(OH)D) levels. Carotid ultrasound measurements were performed to determine total plaque area (TPA) in both carotids. Cases had TPA values above age-matched controls from a vascular prevention clinic population. Logistic regression and machine learning analyses were performed to create predictive models.
25(OH)D levels were significantly lower and SLE disease duration was significantly longer in cases. 25(OH)D levels inversely correlated with age-adjusted TPA. ACE-inhibitor non-use was associated with case status. Logistic regression models containing ACE-inhibitor use, 25(OH)D levels, and LDL levels had a diagnostic accuracy of 84% for predicting accelerated atherosclerosis. Similar results were obtained with machine learning models, but hydroxychloroquine use was associated with control status in these models.
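A hedged sketch of a logistic-regression risk model over the three reported predictors; all feature values and labels below are simulated placeholders, and the published model and its 84% accuracy are not reproduced:

```python
# Sketch: logistic regression on ACE-inhibitor use, 25(OH)D, and LDL to predict
# case (accelerated atherosclerosis) vs control. All values below are simulated.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)
n = 51
X = np.column_stack([
    rng.integers(0, 2, n),        # ACE-inhibitor use (1 = yes)
    rng.normal(22, 8, n),         # 25(OH)D, ng/mL
    rng.normal(110, 30, n),       # LDL, mg/dL
])
y = rng.integers(0, 2, n)         # 1 = case (TPA above age-matched controls)

model = LogisticRegression(max_iter=1000)
acc = cross_val_score(model, X, y, cv=5, scoring="accuracy").mean()
print(f"cross-validated accuracy: {acc:.2f}")   # the paper reports ~84% with real data
```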
This is the first study to demonstrate an association between atherosclerotic burden and 25(OH)D insufficiency or ACE-inhibitor non-use in lupus patients. These findings provide strong rationale for the study of ACE-inhibitors and vitamin D replenishment as preventive therapies in this high-risk population.
doi:10.1097/MAJ.0b013e31823fa7d9
PMCID: PMC3323721  PMID: 22222338
Systemic lupus erythematosus; Atherosclerosis; Vitamin D deficiency; Angiotensin converting enzyme inhibitors; Hypercholesterolemia
6.  A self-updating road map of The Cancer Genome Atlas 
Bioinformatics  2013;29(10):1333-1340.
Motivation: Since 2011, The Cancer Genome Atlas’ (TCGA) files have been accessible through HTTP from a public site, creating entirely new possibilities for cancer informatics by enhancing data discovery and retrieval. Significantly, these enhancements enable the reporting of analysis results that can be fully traced to and reproduced using their source data. However, to realize this possibility, a continually updated road map of files in the TCGA is required. Creation of such a road map represents a significant data modeling challenge, due to the size and fluidity of this resource: each of the 33 cancer types is instantiated in only partially overlapping sets of analytical platforms, while the number of data files available doubles approximately every 7 months.
Results: We developed an engine to index and annotate the TCGA files, relying exclusively on third-generation web technologies (Web 3.0). Specifically, this engine uses JavaScript in conjunction with the World Wide Web Consortium’s (W3C) Resource Description Framework (RDF), and SPARQL, the query language for RDF, to capture metadata of files in the TCGA open-access HTTP directory. The resulting index may be queried using SPARQL, and enables file-level provenance annotations as well as discovery of arbitrary subsets of files, based on their metadata, using web standard languages. In turn, these abilities enhance the reproducibility and distribution of novel results delivered as elements of a web-based computational ecosystem. The development of the TCGA Roadmap engine was found to provide specific clues about how biomedical big data initiatives should be exposed as public resources for exploratory analysis, data mining and reproducible research. These specific design elements align with the concept of knowledge reengineering and represent a sharp departure from top-down approaches in grid initiatives such as CaBIG. They also present a much more interoperable and reproducible alternative to the still pervasive use of data portals.
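To make the querying pattern concrete, the hedged sketch below issues a SPARQL query against an RDF index of file metadata using Python's SPARQLWrapper; the endpoint URL and predicate names are illustrative assumptions, not the Roadmap's actual vocabulary:

```python
# Sketch: query a SPARQL endpoint that indexes TCGA open-access files by metadata.
# The endpoint URL and predicates below are illustrative assumptions only.
from SPARQLWrapper import SPARQLWrapper, JSON

endpoint = SPARQLWrapper("http://example.org/tcga-roadmap/sparql")   # hypothetical
endpoint.setQuery("""
    PREFIX ex: <http://example.org/tcga#>
    SELECT ?file ?platform WHERE {
        ?file ex:diseaseStudy "GBM" ;
              ex:platform     ?platform ;
              ex:accessTier   "open" .
    } LIMIT 10
""")
endpoint.setReturnFormat(JSON)
for row in endpoint.query().convert()["results"]["bindings"]:
    print(row["file"]["value"], row["platform"]["value"])
```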
Availability: A prepared dashboard, including links to source code and a SPARQL endpoint, is available at http://bit.ly/TCGARoadmap. A video tutorial is available at http://bit.ly/TCGARoadmapTutorial.
Contact: robbinsd@uab.edu
doi:10.1093/bioinformatics/btt141
PMCID: PMC3654710  PMID: 23595662
7.  Fractal MapReduce decomposition of sequence alignment 
Background
The dramatic fall in the cost of genomic sequencing, and the increasing convenience of distributed cloud computing resources, positions the MapReduce coding pattern as a cornerstone of scalable bioinformatics algorithm development. In some cases an algorithm will find a natural distribution via use of map functions to process vectorized components, followed by a reduce of aggregate intermediate results. However, for some data analysis procedures such as sequence analysis, a more fundamental reformulation may be required.
Results
In this report we describe a solution to sequence comparison that can be thoroughly decomposed into multiple rounds of map and reduce operations. The route taken makes use of iterated maps, a fractal analysis technique that has been found to provide an "alignment-free" solution to sequence analysis and comparison. That is, a solution that does not require dynamic programming, relying instead on a numeric Chaos Game Representation (CGR) data structure. This claim is demonstrated in this report by calculating the length of the longest similar segment by inspecting only the USM coordinates of two analogous units, with no resort to dynamic programming.
Conclusions
The procedure described is an attempt at extreme decomposition and parallelization of sequence alignment in anticipation of a volume of genomic sequence data that cannot be met by current algorithmic frameworks. The solution found is delivered with a browser-based application (webApp), highlighting the browser's emergence as an environment for high performance distributed computing.
Availability
The accompanying software library is publicly distributed, with open source and version control, at http://usm.github.com. It is also available as a webApp through Google Chrome's WebStore (http://chrome.google.com/webstore; search with "usm").
doi:10.1186/1748-7188-7-12
PMCID: PMC3394223  PMID: 22551205
8.  Pattern matching through Chaos Game Representation: bridging numerical and discrete data structures for biological sequence analysis 
Background
Chaos Game Representation (CGR) is an iterated function that bijectively maps discrete sequences into a continuous domain. As a result, discrete sequences can be the object of statistical and topological analyses otherwise reserved to numerical systems. Characteristically, CGR coordinates of substrings sharing an L-long suffix will be located within a distance of 2^-L of each other. In the two decades since its original proposal, CGR has been generalized beyond its original focus on genomic sequences and has been successfully applied to a wide range of problems in bioinformatics. This report explores the possibility that it can be further extended to approach algorithms that rely on discrete, graph-based representations.
Results
The exploratory analysis described here consisted of selecting foundational string problems and refactoring them using CGR-based algorithms. We found that CGR can take the role of suffix trees and emulate sophisticated string algorithms, efficiently solving exact and approximate string matching problems such as finding all palindromes and tandem repeats, and matching with mismatches. The common feature of these problems is that they use longest common extension (LCE) queries as subtasks of their procedures, which we show to have a constant time solution with CGR. Additionally, we show that CGR can be used as a rolling hash function within the Rabin-Karp algorithm.
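The 2^-L suffix property and the coordinate-only LCE idea can be illustrated with a minimal sketch of a unit-square CGR for DNA; the exact USM/CGR formulation and LCE procedure used in the paper may differ in detail:

```python
# Sketch of a Chaos Game Representation for DNA: each symbol halves the distance
# to its corner, so two sequences whose last L symbols (suffixes) agree end up
# within 2**-L of each other on every axis. A shared-suffix length can then be
# bounded from the coordinate difference alone, without dynamic programming.
import math

CORNERS = {"A": (0.0, 0.0), "C": (0.0, 1.0), "G": (1.0, 1.0), "T": (1.0, 0.0)}

def cgr(seq):
    """Return the CGR coordinate after reading the whole sequence."""
    x, y = 0.5, 0.5
    for base in seq:
        cx, cy = CORNERS[base]
        x, y = (x + cx) / 2.0, (y + cy) / 2.0
    return x, y

def shared_suffix_estimate(a, b):
    """Upper bound on the common-suffix length of a and b from CGR coordinates."""
    (xa, ya), (xb, yb) = cgr(a), cgr(b)
    d = max(abs(xa - xb), abs(ya - yb))
    return math.inf if d == 0 else int(math.floor(-math.log2(d)))

# Last 5 symbols ("TTACA") agree, so the distance is <= 2**-5 and the estimate >= 5.
print(shared_suffix_estimate("GATTACA", "CCTTACA"))
```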
Conclusions
The analysis of biological sequences relies on algorithmic foundations facing mounting challenges, both logistic (performance) and analytical (lack of unifying mathematical framework). CGR is found to provide the latter and to promise the former: graph-based data structures for sequence analysis operations are entailed by numerical-based data structures produced by CGR maps, providing a unifying analytical framework for a diversity of pattern matching problems.
doi:10.1186/1748-7188-7-10
PMCID: PMC3402988  PMID: 22551152
9.  Serum profiling by MALDI-TOF mass spectrometry as a diagnostic tool for domoic acid toxicosis in California sea lions 
Proteome Science  2012;10:18.
Background
There are currently no reliable markers of acute domoic acid toxicosis (DAT) for California sea lions. We investigated whether patterns of serum peptides could diagnose acute DAT. Serum peptides were analyzed by MALDI-TOF mass spectrometry from 107 sea lions (acute DAT n = 34; non-DAT n = 73). Artificial neural networks (ANN) were trained using MALDI-TOF data. Individual peaks and neural networks were qualified using an independent test set (n = 20).
Results
No single peak was a good classifier of acute DAT, and ANN models were the best predictors of acute DAT. Performance measures for a single median ANN were: sensitivity, 100%; specificity, 60%; positive predictive value, 71%; negative predictive value, 100%. When 101 ANNs were combined and allowed to vote for the outcome, the performance measures were: sensitivity, 30%; specificity, 100%; positive predictive value, 100%; negative predictive value, 59%.
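A hedged sketch of the ensemble-voting idea: simulated spectra, small scikit-learn networks standing in for the trained ANNs, and a unanimity vote chosen for illustration (the study's architecture and vote threshold may differ):

```python
# Sketch: many small neural networks vote on "acute DAT"; requiring every network
# to agree trades sensitivity for specificity, as described above. Data are simulated.
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(2)
X_train, y_train = rng.normal(size=(87, 40)), rng.integers(0, 2, 87)
X_test = rng.normal(size=(20, 40))

ensemble = []
for seed in range(101):                     # 101 independently initialized networks
    net = MLPClassifier(hidden_layer_sizes=(8,), max_iter=500, random_state=seed)
    ensemble.append(net.fit(X_train, y_train))

votes = np.array([net.predict(X_test) for net in ensemble])   # shape (101, 20)
single_net_call = votes[0]                 # one net: high sensitivity, lower specificity
unanimous_call = votes.min(axis=0)         # positive only if all 101 agree: high specificity
print(single_net_call, unanimous_call, sep="\n")
```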
Conclusions
These results suggest that MALDI-TOF peptide profiling and neural networks can perform either as a highly sensitive (100% negative predictive value) or a highly specific (100% positive predictive value) diagnostic tool for acute DAT. This also suggests that machine learning directed by populations of predictive models offers the ability to steer the predictive effort toward a specific type of error.
doi:10.1186/1477-5956-10-18
PMCID: PMC3338078  PMID: 22429742
Serum peptides; Neural network; Zalophus californianus; Neurotoxin
10.  ImageJS: Personalized, participated, pervasive, and reproducible image bioinformatics in the web browser  
Background:
Image bioinformatics infrastructure typically relies on a combination of server-side high-performance computing and client desktop applications tailored for graphic rendering. On the server side, matrix manipulation environments are often used as the back-end where deployment of specialized analytical workflows takes place. However, neither the server-side nor the client-side desktop solution, by themselves or combined, is conducive to the emergence of open, collaborative, computational ecosystems for image analysis that are both self-sustained and user driven.
Materials and Methods:
ImageJS was developed as a browser-based webApp, untethered from a server-side backend, by making use of recent advances in the modern web browser such as a very efficient compiler, high-end graphical rendering capabilities, and I/O tailored for code migration.
Results:
Multiple versioned code hosting services were used to develop distinct ImageJS modules to illustrate its amenability to collaborative deployment without compromise of reproducibility or provenance. The illustrative examples include modules for image segmentation, feature extraction, and filtering. The deployment of image analysis by code migration is in sharp contrast with the more conventional, heavier, and less safe reliance on data transfer. Accordingly, code and data are loaded into the browser by exactly the same script tag loading mechanism, which offers a number of interesting applications that would be hard to attain with more conventional platforms, such as NIH's popular ImageJ application.
Conclusions:
The modern web browser was found to be advantageous for image bioinformatics in both the research and clinical environments. This conclusion reflects advantages in deployment scalability and analysis reproducibility, as well as the critical ability to deliver advanced computational statistical procedures to machines where access to sensitive data is controlled, that is, without local “download and installation”.
doi:10.4103/2153-3539.98813
PMCID: PMC3424663  PMID: 22934238
Cloud computing; image analysis; webApp
11.  The Gel Electrophoresis Markup Language (GelML) from the Proteomics Standards Initiative 
Proteomics  2010;10(17):3073-3081.
The Human Proteome Organisation’s Proteomics Standards Initiative (HUPO-PSI) has developed the GelML data exchange format for representing gel electrophoresis experiments performed in proteomics investigations. The format closely follows the reporting guidelines for gel electrophoresis, which are part of the Minimum Information About a Proteomics Experiment (MIAPE) set of modules. GelML supports the capture of metadata (such as experimental protocols) and data (such as gel images) resulting from gel electrophoresis so that laboratories can be compliant with the MIAPE Gel Electrophoresis guidelines, while allowing such data sets to be exchanged or downloaded from public repositories. The format is sufficiently flexible to capture data from a broad range of experimental processes, and complements other PSI formats for mass spectrometry data and the results of protein and peptide identifications to capture entire gel-based proteome workflows. GelML has resulted from the open standardisation process of PSI consisting of both public consultation and anonymous review of the specifications.
doi:10.1002/pmic.201000120
PMCID: PMC3193076  PMID: 20677327
data standard; gel electrophoresis; database; ontology
12.  Examining the effect of maternal obesity on outcome of labor induction in patients with preeclampsia 
OBJECTIVE
The objective of this investigation was to evaluate the effect of maternal obesity, as measured by prepregnancy body mass index (BMI), on the mode of delivery in women undergoing indicated induction of labor for preeclampsia.
STUDY DESIGN
Following IRB approval, patients with preeclampsia who underwent an induction of labor from 1997 to 2007 were identified from a perinatal information database, which included historical and clinical information. Data analysis included bivariable and multivariable analyses of predictor variables by mode of delivery. An artificial neural network was trained and externally validated to independently examine predictors of mode of delivery among women with preeclampsia.
RESULTS
Six hundred and eight women met eligibility criteria and were included in this investigation. Based on multivariable logistic regression (MLR) modeling, a five unit increase in BMI yields a 16% increase in the odds of cesarean delivery. An artificial neural network trained and externally validated confirmed the importance of obesity in the prediction of mode of delivery among women undergoing labor induction for preeclampsia.
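For concreteness, the reported per-5-unit effect can be converted to a per-unit odds ratio with a small derived calculation (the 1.16 figure comes from the abstract; the per-unit value is arithmetic, not a reported result):

```python
# A 16% increase in the odds of cesarean delivery per 5-unit BMI increase
# corresponds to this per-unit odds ratio (derived, not reported in the paper).
import math

or_per_5_units = 1.16
beta_per_unit = math.log(or_per_5_units) / 5          # logistic regression coefficient
or_per_unit = math.exp(beta_per_unit)                 # ~1.030
print(f"per-unit OR: {or_per_unit:.3f}, coefficient: {beta_per_unit:.4f}")
```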
CONCLUSION
Among patients who are affected by preeclampsia, obesity complicates labor induction. The risk of cesarean delivery is enhanced by obesity, even with small increases in BMI. Prediction of mode of delivery by an artificial neural network performs similar to MLR among patients undergoing labor induction for preeclampsia.
doi:10.3109/10641950903452386
PMCID: PMC3192401  PMID: 20818957
Obesity; severe preeclampsia; cesarean delivery; body mass index
13.  Computational ecosystems for data-driven medical genomics 
Genome Medicine  2010;2(9):67.
In the path towards personalized medicine, the integrative bioinformatics infrastructure is a critical enabling resource. Until large-scale reference data became available, the attributes of the computational infrastructure were postulated by many, but have mostly remained unverified. Now that large-scale initiatives such as The Cancer Genome Atlas (TCGA) are in full swing, the opportunity is at hand to find out what analytical approaches and computational architectures are really effective. A recent report did just that: first a software development environment was assembled as part of an informatics research program, and only then was the analysis of TCGA's glioblastoma multiforme data pursued at the multi-omic scale. The results of this complex analysis are the focus of the report highlighted here. However, what is reported in the analysis is also the validating corollary for an infrastructure development effort guided by the iterative identification of sound design criteria for the architecture of the integrative computational infrastructure. The work is at least as valuable as the data analysis results themselves: computational ecosystems with their own high-level abstractions rather than rigid pipelines with prescriptive recipes appear to be the critical feature of an effective infrastructure. Only then can analytical workflows benefit from experimentation just like any other component of the biomedical research program.
doi:10.1186/gm188
PMCID: PMC3092118  PMID: 20854645
14.  S3QL: A distributed domain specific language for controlled semantic integration of life sciences data 
BMC Bioinformatics  2011;12:285.
Background
The value and usefulness of data increases when it is explicitly interlinked with related data. This is the core principle of Linked Data. For life sciences researchers, harnessing the power of Linked Data to improve biological discovery is still challenged by a need to keep pace with rapidly evolving domains and requirements for collaboration and control as well as with the reference semantic web ontologies and standards. Knowledge organization systems (KOSs) can provide an abstraction for publishing biological discoveries as Linked Data without complicating transactions with contextual minutia such as provenance and access control.
We have previously described the Simple Sloppy Semantic Database (S3DB) as an efficient model for creating knowledge organization systems using Linked Data best practices with explicit distinction between domain and instantiation and support for a permission control mechanism that automatically migrates between the two. In this report we present a domain specific language, the S3DB query language (S3QL), to operate on its underlying core model and facilitate management of Linked Data.
Results
Reflecting the data driven nature of our approach, S3QL has been implemented as an application programming interface for S3DB systems hosting biomedical data, and its syntax was subsequently generalized beyond the S3DB core model. This achievement is illustrated with the assembly of an S3QL query to manage entities from the Simple Knowledge Organization System. The illustrative use cases include gastrointestinal clinical trials, genomic characterization of cancer by The Cancer Genome Atlas (TCGA) and molecular epidemiology of infectious diseases.
Conclusions
S3QL was found to provide a convenient mechanism to represent context for interoperation between public and private datasets hosted at biomedical research institutions and linked data formalisms.
doi:10.1186/1471-2105-12-285
PMCID: PMC3155508  PMID: 21756325
S3DB; Linked Data; KOS; RDF; SPARQL; knowledge organization system; policy
15.  Exposing the cancer genome atlas as a SPARQL endpoint 
Journal of biomedical informatics  2010;43(6):998-1008.
The Cancer Genome Atlas (TCGA) is a multidisciplinary, multi-institutional effort to characterize several types of cancer. Datasets from biomedical domains such as TCGA present a particularly challenging task for those interested in dynamically aggregating its results because the data sources are typically both heterogeneous and distributed. The Linked Data best practices offer a solution to integrate and discover data with those characteristics, namely through exposure of data as Web services supporting SPARQL, the Resource Description Framework query language. Most SPARQL endpoints, however, cannot easily be queried by data experts. Furthermore, exposing experimental data as SPARQL endpoints remains a challenging task because, in most cases, data must first be converted to Resource Description Framework triples. In line with those requirements, we have developed an infrastructure to expose clinical, demographic and molecular data elements generated by TCGA as a SPARQL endpoint by assigning elements to entities of the Simple Sloppy Semantic Database (S3DB) management model. All components of the infrastructure are available as independent Representational State Transfer (REST) Web services to encourage reusability, and a simple interface was developed to automatically assemble SPARQL queries by navigating a representation of the TCGA domain. A key feature of the proposed solution that greatly facilitates assembly of SPARQL queries is the distinction between the TCGA domain descriptors and data elements. Furthermore, the use of the S3DB management model as a mediator enables queries to both public and protected data without the need for prior submission to a single data source.
doi:10.1016/j.jbi.2010.09.004
PMCID: PMC3071752  PMID: 20851208
TCGA; SPARQL; RDF; Linked Data; Data integration
16.  Identification of Diagnostic Urinary Biomarkers for Acute Kidney Injury 
Acute kidney injury (AKI) is an important cause of death among hospitalized patients. The two most common causes of AKI are acute tubular necrosis (ATN) and prerenal azotemia (PRA). Appropriate diagnosis of the disease is important but often difficult. We analyzed urine proteins by 2-DE from 38 patients with AKI. Patients were randomly assigned to a training set, an internal test set or an external validation set. Spot abundances were analyzed by artificial neural networks (ANN) to identify biomarkers which differentiate between ATN and PRA. When the trained neural network algorithm was tested against the training data it identified the diagnosis for 16/18 patients in the training set and all 10 patients in the internal test set. The accuracy was validated in the novel external set of patients where 9/10 subjects were correctly diagnosed including 5/5 with ATN and 4/5 with PRA. Plasma retinol binding protein (PRBP) was identified in one spot and a fragment of albumin and PRBP in the other. These proteins are candidate markers for diagnostic assays of AKI.
doi:10.231/JIM.0b013e3181d473e7
PMCID: PMC2864920  PMID: 20224435
Acute kidney injury; Biomarkers; Diagnosis; Kidney; Urine
17.  AGUIA: autonomous graphical user interface assembly for clinical trials semantic data services 
Background
AGUIA is a front-end web application originally developed to manage clinical, demographic and biomolecular patient data collected during clinical trials at MD Anderson Cancer Center. The diversity of methods involved in patient screening and sample processing generates a variety of data types that require a resource-oriented architecture to capture the associations between the heterogeneous data elements. AGUIA uses a semantic web formalism, resource description framework (RDF), and a bottom-up design of knowledge bases that employ the S3DB tool as the starting point for the client's interface assembly.
Methods
The data web service, S3DB, meets the necessary requirements of generating the RDF and of explicitly distinguishing the description of the domain from its instantiation, while allowing for continuous editing of both. Furthermore, it uses an HTTP-REST protocol, has a SPARQL endpoint, and has open source availability in the public domain, which facilitates the development and dissemination of this application. However, S3DB alone does not address the issue of representing content in a form that makes sense for domain experts.
Results
We identified an autonomous set of descriptors, the GBox, that provides user and domain specifications for the graphical user interface. This was achieved by identifying a formalism that makes use of an RDF schema to enable the automatic assembly of graphical user interfaces in a meaningful manner while using only resources native to the client web browser (JavaScript interpreter, document object model). We defined a generalized RDF model such that changes in the graphic descriptors are automatically and immediately (locally) reflected into the configuration of the client's interface application.
Conclusions
The design patterns identified for the GBox benefit from and reflect the specific requirements of interacting with data generated by clinical trials, and they contain clues for a general purpose solution to the challenge of having interfaces automatically assembled for multiple and volatile views of a domain. By coding AGUIA in JavaScript, for which all browsers include a native interpreter, a solution was found that assembles interfaces that are meaningful to the particular user, and which are also ubiquitous and lightweight, allowing the computational load to be carried by the client's machine.
doi:10.1186/1472-6947-10-65
PMCID: PMC2987967  PMID: 20977768
18.  A nonparametric approach to detect nonlinear correlation in gene expression 
We propose a distribution-free approach to detect nonlinear relationships by reporting local correlation. The effect of our proposed method is analogous to piece-wise linear approximation although the method does not utilize any linear dependency. The proposed metric, maximum local correlation, was applied to both simulated cases and expression microarray data comparing the rd mouse with age-matched control animals. The rd mouse is an animal model (with a mutation for the gene Pde6b) for photoreceptor degeneration. Using simulated data, we show that maximum local correlation detects nonlinear association, which could not be detected using other correlation measures. In the microarray study, our proposed method detects nonlinear association between the expression levels of different genes, which could not be detected using the conventional linear methods. The simulation dataset, microarray expression data, and the Nonparametric Nonlinear Correlation (NNC) software library, implemented in Matlab, are included as part of the online supplemental materials.
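A hedged sketch of the local-correlation idea using a sliding window over x-sorted data; the published metric's exact windowing and significance assessment are not reproduced here:

```python
# Sketch: compute correlation within local windows of x-sorted data and report
# the maximum absolute value, so a relationship that is strong only locally
# (i.e. nonlinear overall) is still detected. Windowing details are an assumption.
import numpy as np

def max_local_correlation(x, y, window=20):
    order = np.argsort(x)
    x, y = np.asarray(x)[order], np.asarray(y)[order]
    best = 0.0
    for start in range(0, len(x) - window + 1):
        r = np.corrcoef(x[start:start + window], y[start:start + window])[0, 1]
        if np.isfinite(r):
            best = max(best, abs(r))
    return best

# Toy example: y depends on x only over part of its range (nonlinear overall).
rng = np.random.default_rng(3)
x = rng.uniform(0, 10, 200)
y = np.where(x < 5, np.sin(x), 0.0) + rng.normal(0, 0.1, 200)
print("global |r|:", abs(np.corrcoef(x, y)[0, 1]))
print("max local |r|:", max_local_correlation(x, y))
```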
doi:10.1198/jcgs.2010.08160
PMCID: PMC2945392  PMID: 20877445
19.  S3DB core: a framework for RDF generation and management in bioinformatics infrastructures 
BMC Bioinformatics  2010;11:387.
Background
Biomedical research is set to greatly benefit from the use of semantic web technologies in the design of computational infrastructure. However, beyond well defined research initiatives, substantial issues of data heterogeneity, source distribution, and privacy currently stand in the way of the personalization of medicine.
Results
A computational framework for bioinformatic infrastructure was designed to deal with the heterogeneous data sources and the sensitive mixture of public and private data that characterizes the biomedical domain. This framework consists of a logical model built with semantic web tools, coupled with a Markov process that propagates user operator states. An accompanying open source prototype was developed to meet a series of applications that range from collaborative multi-institution data acquisition efforts to data analysis applications that need to quickly traverse complex data structures. This report describes the two abstractions underlying the S3DB-based infrastructure, logical and numerical, and discusses its generality beyond the immediate confines of existing implementations.
Conclusions
The emergence of the "web as a computer" requires a formal model for the different functionalities involved in reading and writing to it. The S3DB core model proposed was found to address the design criteria of biomedical computational infrastructure, such as those supporting large scale multi-investigator research, clinical trials, and molecular epidemiology.
doi:10.1186/1471-2105-11-387
PMCID: PMC2918582  PMID: 20646315
20.  Modelling interactions of acid–base balance and respiratory status in the toxicity of metal mixtures in the American oyster Crassostrea virginica 
Heavy metals, such as copper, zinc and cadmium, represent some of the most common and serious pollutants in coastal estuaries. In the present study, we used a combination of linear and artificial neural network (ANN) modelling to detect and explore interactions among low-dose mixtures of these heavy metals and their impacts on fundamental physiological processes in tissues of the Eastern oyster, Crassostrea virginica. Animals were exposed to Cd (0.001–0.400 µM), Zn (0.001–3.059 µM) or Cu (0.002–0.787 µM), either alone or in combination for 1 to 27 days. We measured indicators of acid–base balance (hemolymph pH and total CO2), gas exchange (Po2), immunocompetence (total hemocyte counts, numbers of invasive bacteria), antioxidant status (glutathione, GSH), oxidative damage (lipid peroxidation; LPx), and metal accumulation in the gill and the hepatopancreas. Linear analysis showed that oxidative membrane damage from tissue accumulation of environmental metals was correlated with impaired acid–base balance in oysters. ANN analysis revealed interactions of metals with hemolymph acid–base chemistry in predicting oxidative damage that were not evident from linear analyses. These results highlight the usefulness of machine learning approaches, such as ANNs, for improving our ability to recognize and understand the effects of subacute exposure to contaminant mixtures.
doi:10.1016/j.cbpa.2009.11.019
PMCID: PMC2906223  PMID: 19958840
Heavy metals; Artificial neural networks; Crassostrea virginica; Lipid peroxidation; Glutathione; Acid–base balance; Hemolymph PO2
21.  Sources of Variability among Replicate Samples Separated by Two-Dimensional Gel Electrophoresis 
Two-dimensional gel electrophoresis (2DE) offers high-resolution separation for intact proteins. However, variability in the appearance of spots can limit the ability to identify true differences between conditions. Variability can occur at a number of levels. Individual samples can differ because of biological variability. Technical variability can occur during protein extraction, processing, or storage. Another potential source of variability occurs during analysis of the gels and is not a result of any of the causes of variability named above. We performed a study designed to focus only on the variability caused by analysis. We separated three aliquots of rat left ventricle and analyzed differences in protein abundance on the replicate 2D gels. As the samples loaded on each gel were identical, differences in protein abundance are caused by variability in separation or interpretation of the gels. Protein spots were compared across gels by quantile values to determine differences. Fourteen percent of spots had a maximum difference in intensity of 0.4 quantile values or more between replicates. We then looked individually at the spots to determine the cause of differences between the measured intensities. Reasons for differences were: failure to identify a spot (59%), differences in spot boundaries (13%), difference in the peak height (6%), and a combination of these factors (21%). This study demonstrates that spot identification and characterization make major contributions to variability seen with 2DE. Methods to highlight why measured protein spot abundance is different could reduce these errors.
PMCID: PMC2841997  PMID: 20357976
heart; proteomics; reproducibility; protein
23.  DASMiner: discovering and integrating data from DAS sources 
BMC Systems Biology  2009;3:109.
Background
DAS is a widely adopted protocol for providing syntactic interoperability among biological databases. The popularity of DAS is due to a simplified and elegant mechanism for data exchange that consists of sources exposing their RESTful interfaces for data access. As a growing number of DAS services are available for molecular biology resources, there is an incentive to explore this protocol in order to advance data discovery and integration among these resources.
Results
We developed DASMiner, a Matlab toolkit for querying DAS data sources that enables creation of integrated biological models using the information available in DAS-compliant repositories. DASMiner is composed of a browser application and an API that work together to facilitate gathering of data from different DAS sources, which can be used for creating enriched datasets from multiple sources.
The browser is used to formulate queries and navigate data contained in DAS sources. Users can execute queries against these sources in an intuitive fashion, without the need to know the specific DAS syntax for the particular source. Using the source's metadata provided by the DAS Registry, the browser's layout adapts to expose only the set of commands and coordinate systems supported by the specific source. For this reason, the browser can interrogate any DAS source, independently of the type of data being served.
The API component of DASMiner may be used for programmatic access of DAS sources by programs in Matlab. Once the desired data is found during navigation, the query is exported in the format of an API call to be used within any Matlab application. We illustrate the use of DASMiner by creating integrative models of histone modification maps and protein-protein interaction networks. These enriched datasets were built by retrieving and integrating distributed genomic and proteomic DAS sources using the API.
Conclusion
Support for the DAS protocol allows hundreds of molecular biology databases to be treated as a federated, online collection of resources. DASMiner enables full exploration of these resources, and can be used to deploy applications and create integrated views of biological systems using the information deposited in DAS repositories.
doi:10.1186/1752-0509-3-109
PMCID: PMC2789070  PMID: 19919683
24.  An Introspective Comparison of Random Forest-Based Classifiers for the Analysis of Cluster-Correlated Data by Way of RF++ 
PLoS ONE  2009;4(9):e7087.
Many mass spectrometry-based studies, as well as other biological experiments, produce cluster-correlated data. Failure to account for correlation among observations may result in a classification algorithm overfitting the training data and producing overoptimistic estimated error rates and may make subsequent classifications unreliable. Current common practice for dealing with replicated data is to average each subject replicate sample set, reducing the dataset size and incurring loss of information. In this manuscript we compare three approaches to dealing with cluster-correlated data: unmodified Breiman's Random Forest (URF), forest grown using subject-level averages (SLA), and RF++ with subject-level bootstrapping (SLB). RF++, a novel Random Forest-based algorithm implemented in C++, handles cluster-correlated data through a modification of the original resampling algorithm and accommodates subject-level classification. Subject-level bootstrapping is an alternative sampling method that obviates the need to average or otherwise reduce each set of replicates to a single independent sample. Our experiments show nearly identical median classification and variable selection accuracy for SLB forests and URF forests when applied to both simulated and real datasets. However, the run-time estimated error rate was severely underestimated for URF forests. Predictably, SLA forests were found to be more severely affected by the reduction in sample size, which led to poorer classification and variable selection accuracy. Perhaps most importantly, our results suggest that it is reasonable to utilize URF for the analysis of cluster-correlated data. Two caveats should be noted: first, correct classification error rates must be obtained using a separate test dataset, and second, an additional post-processing step is required to obtain subject-level classifications. RF++ is shown to be an effective alternative for classifying both clustered and non-clustered data. Source code and stand-alone compiled command-line and easy-to-use graphical user interface (GUI) versions of RF++ for Windows and Linux, as well as a user manual (Supplementary File S2), are available for download at http://sourceforge.org/projects/rfpp/ under the GNU public license.
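A hedged Python sketch of the subject-level bootstrapping idea (RF++ itself is a C++ implementation; here an ordinary scikit-learn forest is fit to a resample drawn at the subject level rather than the replicate level, and subject-level calls are obtained by majority vote over each subject's replicates):

```python
# Sketch: subject-level bootstrap for cluster-correlated (replicated) data.
# Whole subjects are resampled with replacement, keeping all replicates of a
# chosen subject together; predictions are then aggregated per subject.
# This illustrates the idea only, not the RF++ resampling code.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(4)
n_subjects, reps, n_features = 30, 3, 25
subject_ids = np.repeat(np.arange(n_subjects), reps)          # 3 replicates per subject
X = rng.normal(size=(n_subjects * reps, n_features))
y_subject = rng.integers(0, 2, n_subjects)                    # one label per subject
y = y_subject[subject_ids]

# Subject-level bootstrap: draw subjects, then take every replicate of each draw.
drawn = rng.choice(n_subjects, size=n_subjects, replace=True)
rows = np.concatenate([np.where(subject_ids == s)[0] for s in drawn])

forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X[rows], y[rows])

# Subject-level classification: majority vote over each subject's replicate predictions.
replicate_pred = forest.predict(X)
subject_pred = np.array([np.round(replicate_pred[subject_ids == s].mean())
                         for s in range(n_subjects)]).astype(int)
print(subject_pred[:10])
```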
doi:10.1371/journal.pone.0007087
PMCID: PMC2739274  PMID: 19763254
25.  Urine Biomarkers Predict the Cause of Glomerular Disease 
Diagnosis of the type of glomerular disease that causes the nephrotic syndrome is necessary for appropriate treatment and typically requires a renal biopsy. The goal of this study was to identify candidate protein biomarkers to diagnose glomerular diseases. Proteomic methods and informatic analysis were used to identify patterns of urine proteins that are characteristic of the diseases. Urine proteins were separated by two-dimensional electrophoresis in 32 patients with FSGS, lupus nephritis, membranous nephropathy, or diabetic nephropathy. Protein abundances from 16 patients were used to train an artificial neural network to create a prediction algorithm. The remaining 16 patients were used as an external validation set to test the accuracy of the prediction algorithm. In the validation set, the model predicted the presence of the diseases with sensitivities between 75 and 86% and specificities from 92 to 67%. The probability of obtaining these results in the novel set by chance is 5 × 10^-8. Twenty-one gel spots were most important for the differentiation of the diseases. The spots were cut from the gel, and 20 were identified by mass spectrometry as charge forms of 11 plasma proteins: Orosomucoid, transferrin, α-1 microglobulin, zinc α-2 glycoprotein, α-1 antitrypsin, complement factor B, haptoglobin, transthyretin, plasma retinol binding protein, albumin, and hemopexin. These data show that diseases that cause nephrotic syndrome change glomerular protein permeability in characteristic patterns. The fingerprint of urine protein charge forms identifies the glomerular disease. The identified proteins are candidate biomarkers that can be tested in assays that are more amenable to clinical testing.
doi:10.1681/ASN.2006070767
PMCID: PMC2733832  PMID: 17301191
