Among alignment-free methods, Iterated Maps (IMs) are on a particular extreme: they are also scale free (order free). The use of IMs for sequence analysis is also distinct from other alignment-free methodologies in being rooted in statistical mechanics instead of computational linguistics. Both of these roots go back over two decades to the use of fractal geometry in the characterization of phase-space representations. The time series analysis origin of the field is betrayed by the title of the manuscript that started this alignment-free subdomain in 1990, ‘Chaos Game Representation’. The clash between the analysis of sequences as continuous series and the better established use of Markovian approaches to discrete series was almost immediate, with a defining critique published in same journal 2 years later. The rest of that decade would go by before the scale-free nature of the IM space was uncovered. The ensuing decade saw this scalability generalized for non-genomic alphabets as well as an interest in its use for graphic representation of biological sequences. Finally, in the past couple of years, in step with the emergence of BigData and MapReduce as a new computational paradigm, there is a surprising third act in the IM story. Multiple reports have described gains in computational efficiency of multiple orders of magnitude over more conventional sequence analysis methodologies. The stage appears to be now set for a recasting of IMs with a central role in processing nextgen sequencing results.
sequence analysis; iterated maps; chaos game; mapreduce; big data; alignment-free
Domoic acid toxicosis (DAT) in California sea lions (Zalophus californianus) is caused by exposure to the marine biotoxin domoic acid and has been linked to massive stranding events and mortality. Diagnosis is based on clinical signs in addition to the presence of domoic acid in body fluids. Chronic DAT further is characterized by reoccurring seizures progressing to status epilepticus. Diagnosis of chronic DAT is often slow and problematic, and minimally invasive tests for DAT have been the focus of numerous recent biomarker studies. The goal of this study was to retrospectively profile plasma proteins in a population of sea lions with chronic DAT and those without DAT using two dimensional gel electrophoresis to discover whether individual, multiple, or combinations of protein and clinical data could be utilized to identify sea lions with DAT. Using a training set of 32 sea lion sera, 20 proteins and their isoforms were identified that were significantly different between the two groups (p<0.05). Interestingly, 11 apolipoprotein E (ApoE) charge forms were decreased in DAT samples, indicating that ApoE charge form distributions may be important in the progression of DAT. In order to develop a classifier of chronic DAT, an independent blinded test set of 20 sea lions, seven with chronic DAT, was used to validate models utilizing ApoE charge forms and eosinophil counts. The resulting support vector machine had high sensitivity (85.7% with 92.3% negative predictive value) and high specificity (92.3% with 85.7% positive predictive value). These results suggest that ApoE and eosinophil counts along with machine learning can perform as a robust and accurate tool to diagnose chronic DAT. Although this analysis is specifically focused on blood biomarkers and routine clinical data, the results demonstrate promise for future studies combining additional variables in multidimensional space to create robust classifiers.
Though treatment of the ventilated premature infant has experienced many advances over the past decades, determining the best time point for extubation of these infants remains challenging and the incidence of extubation failures largely unchanged. The objective was to provide clinicians with a decision-support tool to determine whether to extubate a mechanically ventilated premature infant by using a set of machine learning algorithms on a dataset assembled from 486 premature infants receiving mechanical ventilation.
Algorithms included artificial neural networks (ANN), support vector machine (SVM), naïve Bayesian classifier (NBC), boosted decision trees (BDT), and multivariable logistic regression (MLR). Results for ANN, MLR, and NBC were satisfactory (area under the curve [AUC]: 0.63–0.76); however, SVM and BDT consistently showed poor performance (AUC ~0.5).
Complex medical data such as the data set used for this study require further preprocessing steps before prediction models can be developed that achieve similar or better performance than clinicians.
The Cancer Genome Atlas (TCGA) is a multidisciplinary, multi-institutional effort to catalogue genetic mutations responsible for cancer using genome analysis techniques. One of the aims of this project is to create a comprehensive and open repository of cancer related molecular analysis, to be exploited by bioinformaticians towards advancing cancer knowledge. However, devising bioinformatics applications to analyse such large dataset is still challenging, as it often requires downloading large archives and parsing the relevant text files. Therefore, it is making it difficult to enable virtual data integration in order to collect the critical co-variates necessary for analysis.
We address these issues by transforming the TCGA data into the Semantic Web standard Resource Description Format (RDF), link it to relevant datasets in the Linked Open Data (LOD) cloud and further propose an efficient data distribution strategy to host the resulting 20.4 billion triples data via several SPARQL endpoints. Having the TCGA data distributed across multiple SPARQL endpoints, we enable biomedical scientists to query and retrieve information from these SPARQL endpoints by proposing a TCGA tailored federated SPARQL query processing engine named TopFed.
We compare TopFed with a well established federation engine FedX in terms of source selection and query execution time by using 10 different federated SPARQL queries with varying requirements. Our evaluation results show that TopFed selects on average less than half of the sources (with 100% recall) with query execution time equal to one third to that of FedX.
With TopFed, we aim to offer biomedical scientists a single-point-of-access through which distributed TCGA data can be accessed in unison. We believe the proposed system can greatly help researchers in the biomedical domain to carry out their research effectively with TCGA as the amount and diversity of data exceeds the ability of local resources to handle its retrieval and parsing.
Federated queries; SPARQL; TCGA; RDF
Though treatment of the prematurely born infant breathing with assistance of a mechanical ventilator has much advanced in the past decades, predicting extubation outcome at a given point in time remains challenging. Numerous studies have been conducted to identify predictors for extubation outcome; however, the rate of infants failing extubation attempts has not declined.
To develop a decision-support tool for the prediction of extubation outcome in premature infants using a set of machine learning algorithms
A dataset assembled from 486 premature infants on mechanical ventilation was used to develop predictive models using machine learning algorithms such as artificial neural networks (ANN), support vector machine (SVM), naïve Bayesian classifier (NBC), boosted decision trees (BDT), and multivariable logistic regression (MLR). Performance of all models was evaluated using area under the curve (AUC).
For some of the models (ANN, MLR and NBC) results were satisfactory (AUC: 0.63–0.76); however, two algorithms (SVM and BDT) showed poor performance with AUCs of ~0.5.
Clinician's predictions still outperform machine learning due to the complexity of the data and contextual information that may not be captured in clinical data used as input for the development of the machine learning algorithms. Inclusion of preprocessing steps in future studies may improve the performance of prediction models.
Premature infant; mechanical ventilation; extubation; prediction; machine learning
The Cancer Genome Atlas (TCGA) project is a large-scale effort with the goal of identifying novel molecular aberrations in glioblastoma (GBM).
Here, we describe an in-depth analysis of gene expression data and copy number aberration (CNA) data to classify GBMs into prognostic groups to determine correlates of subtypes that may be biologically significant.
To identify predictive survival models, we searched TCGA in 173 patients and identified 42 probe sets (P = .0005) that could be used to divide the tumor samples into 3 groups and showed a significantly (P = .0006) improved overall survival. Kaplan-Meier plots showed that the median survival of group 3 was markedly longer (127 weeks) than that of groups 1 and 2 (47 and 52 weeks, respectively). We then validated the 42 probe sets to stratify the patients according to survival in other public GBM gene expression datasets (eg, GSE4290 dataset). An overall analysis of the gene expression and copy number aberration using a multivariate Cox regression model showed that the 42 probe sets had a significant (P < .018) prognostic value independent of other variables.
By integrating multidimensional genomic data from TCGA, we identified a specific survival model in a new prognostic group of GBM and suggest that molecular stratification of patients with GBM into homogeneous subgroups may provide opportunities for the development of new treatment modalities.
comparative genomic hybridization; EMT; gene expression; glioblastoma; prognostic marker; TCGA
Ongoing advancements in cloud computing provide novel opportunities in scientific computing, especially for distributed workflows. Modern web browsers can now be used as high-performance workstations for querying, processing, and visualizing genomics’ “Big Data” from sources like The Cancer Genome Atlas (TCGA) and the International Cancer Genome Consortium (ICGC) without local software installation or configuration. The design of QMachine (QM) was driven by the opportunity to use this pervasive computing model in the context of the Web of Linked Data in Biomedicine.
QM is an open-sourced, publicly available web service that acts as a messaging system for posting tasks and retrieving results over HTTP. The illustrative application described here distributes the analyses of 20 Streptococcus pneumoniae genomes for shared suffixes. Because all analytical and data retrieval tasks are executed by volunteer machines, few server resources are required. Any modern web browser can submit those tasks and/or volunteer to execute them without installing any extra plugins or programs. A client library provides high-level distribution templates including MapReduce. This stark departure from the current reliance on expensive server hardware running “download and install” software has already gathered substantial community interest, as QM received more than 2.2 million API calls from 87 countries in 12 months.
QM was found adequate to deliver the sort of scalable bioinformatics solutions that computation- and data-intensive workflows require. Paradoxically, the sandboxed execution of code by web browsers was also found to enable them, as compute nodes, to address critical privacy concerns that characterize biomedical environments.
Next-generation sequencing (NGS) technologies have resulted in petabytes of scattered data, decentralized in archives, databases and sometimes in isolated hard-disks which are inaccessible for browsing and analysis. It is expected that curated secondary databases will help organize some of this Big Data thereby allowing users better navigate, search and compute on it.
To address the above challenge, we have implemented a NGS biocuration workflow and are analyzing short read sequences and associated metadata from cancer patients to better understand the human variome. Curation of variation and other related information from control (normal tissue) and case (tumor) samples will provide comprehensive background information that can be used in genomic medicine research and application studies. Our approach includes a CloudBioLinux Virtual Machine which is used upstream of an integrated High-performance Integrated Virtual Environment (HIVE) that encapsulates Curated Short Read archive (CSR) and a proteome-wide variation effect analysis tool (SNVDis). As a proof-of-concept, we have curated and analyzed control and case breast cancer datasets from the NCI cancer genomics program - The Cancer Genome Atlas (TCGA). Our efforts include reviewing and recording in CSR available clinical information on patients, mapping of the reads to the reference followed by identification of non-synonymous Single Nucleotide Variations (nsSNVs) and integrating the data with tools that allow analysis of effect nsSNVs on the human proteome. Furthermore, we have also developed a novel phylogenetic analysis algorithm that uses SNV positions and can be used to classify the patient population. The workflow described here lays the foundation for analysis of short read sequence data to identify rare and novel SNVs that are not present in dbSNP and therefore provides a more comprehensive understanding of the human variome. Variation results for single genes as well as the entire study are available from the CSR website (http://hive.biochemistry.gwu.edu/dna.cgi?cmd=csr).
Availability of thousands of sequenced samples from patients provides a rich repository of sequence information that can be utilized to identify individual level SNVs and their effect on the human proteome beyond what the dbSNP database provides.
SRA; TCGA; nsSNV; SNV; SNP; Next-gen; NGS; Phylogenetics; Cancer
Genetics and genomics have radically altered our understanding of breast cancer progression. However, the genomic basis of various histopathologic features of breast cancer is not yet well-defined.
Materials and Methods:
The Cancer Genome Atlas (TCGA) is an international database containing a large collection of human cancer genome sequencing data. cBioPortal is a web tool developed for mining these sequencing data. We performed mining of TCGA sequencing data in an attempt to characterize the genomic features correlated with breast cancer histopathology. We first assessed the quality of the TCGA data using a group of genes with known alterations in various cancers. Both genome-wide gene mutation and copy number changes as well as a group of genes with a high frequency of genetic changes were then correlated with various histopathologic features of invasive breast cancer.
Validation of TCGA data using a group of genes with known alterations in breast cancer suggests that the TCGA has accurately documented the genomic abnormalities of multiple malignancies. Further analysis of TCGA breast cancer sequencing data shows that accumulation of specific genomic defects is associated with higher tumor grade, larger tumor size and receptor negativity. Distinct groups of genomic changes were found to be associated with the different grades of invasive ductal carcinoma. The mutator role of the TP53 gene was validated by genomic sequencing data of invasive breast cancer and TP53 mutation was found to play a critical role in defining high tumor grade.
Data mining of the TCGA genome sequencing data is an innovative and reliable method to help characterize the genomic abnormalities associated with histopathologic features of invasive breast cancer.
Breast cancer; cBioPortal; data mining; histopathology; the cancer genome atlas; tumor grade
Our ultimate goal is to identify and target modifiable risk factors that will reduce major cardiovascular events in African-American lupus patients. As a first step toward achieving this goal, this study was designed to explore risk factor models of preclinical atherosclerosis in a predominantly African-American group of SLE patients using variables historically associated with endothelial function in non-lupus populations.
51 subjects with SLE but without a history of clinical cardiovascular events were enrolled. At entry, a Framingham risk factor history and medication list were recorded. Sera and plasma samples were analyzed for lipids, lupus activity markers, and total 25-hydroxyvitamin D (25(OH)D) levels. Carotid ultrasound measurements were performed to determine total plaque area (TPA) in both carotids. Cases had TPA values above age-matched controls from a vascular prevention clinic population. Logistic regression and machine learning analyses were performed to create predictive models.
25(OH)D levels were significantly lower and SLE disease duration was significantly higher in cases. 25(OH)D levels inversely correlated with age-adjusted TPA. ACE-inhibitor non-use associated with case status. Logistic regression models containing ACE-inhibitor use, 25(OH)D levels, and LDL levels had a diagnostic accuracy of 84% for predicting accelerated atherosclerosis. Similar results were obtained with machine learning models, but hydroxychloroquine use associated with controls in these models.
This is the first study to demonstrate an association between atherosclerotic burden and 25(OH)D insufficiency or ACE-inhibitor non-use in lupus patients. These findings provide strong rationale for the study of ACE-inhibitors and vitamin D replenishment as preventive therapies in this high-risk population.
Systemic lupus erythematosus; Atherosclerosis; Vitamin D deficiency; Angiotensin converting enzyme inhibitors; Hypercholesterolemia
Motivation: Since 2011, The Cancer Genome Atlas’ (TCGA) files have been accessible through HTTP from a public site, creating entirely new possibilities for cancer informatics by enhancing data discovery and retrieval. Significantly, these enhancements enable the reporting of analysis results that can be fully traced to and reproduced using their source data. However, to realize this possibility, a continually updated road map of files in the TCGA is required. Creation of such a road map represents a significant data modeling challenge, due to the size and fluidity of this resource: each of the 33 cancer types is instantiated in only partially overlapping sets of analytical platforms, while the number of data files available doubles approximately every 7 months.
Availability: A prepared dashboard, including links to source code and a SPARQL endpoint, is available at http://bit.ly/TCGARoadmap. A video tutorial is available at http://bit.ly/TCGARoadmapTutorial.
The dramatic fall in the cost of genomic sequencing, and the increasing convenience of distributed cloud computing resources, positions the MapReduce coding pattern as a cornerstone of scalable bioinformatics algorithm development. In some cases an algorithm will find a natural distribution via use of map functions to process vectorized components, followed by a reduce of aggregate intermediate results. However, for some data analysis procedures such as sequence analysis, a more fundamental reformulation may be required.
In this report we describe a solution to sequence comparison that can be thoroughly decomposed into multiple rounds of map and reduce operations. The route taken makes use of iterated maps, a fractal analysis technique, that has been found to provide a "alignment-free" solution to sequence analysis and comparison. That is, a solution that does not require dynamic programming, relying on a numeric Chaos Game Representation (CGR) data structure. This claim is demonstrated in this report by calculating the length of the longest similar segment by inspecting only the USM coordinates of two analogous units: with no resort to dynamic programming.
The procedure described is an attempt at extreme decomposition and parallelization of sequence alignment in anticipation of a volume of genomic sequence data that cannot be met by current algorithmic frameworks. The solution found is delivered with a browser-based application (webApp), highlighting the browser's emergence as an environment for high performance distributed computing.
Public distribution of accompanying software library with open source and version control at http://usm.github.com. Also available as a webApp through Google Chrome's WebStore http://chrome.google.com/webstore: search with "usm".
Chaos Game Representation (CGR) is an iterated function that bijectively maps discrete sequences into a continuous domain. As a result, discrete sequences can be object of statistical and topological analyses otherwise reserved to numerical systems. Characteristically, CGR coordinates of substrings sharing an L-long suffix will be located within 2-L distance of each other. In the two decades since its original proposal, CGR has been generalized beyond its original focus on genomic sequences and has been successfully applied to a wide range of problems in bioinformatics. This report explores the possibility that it can be further extended to approach algorithms that rely on discrete, graph-based representations.
The exploratory analysis described here consisted of selecting foundational string problems and refactoring them using CGR-based algorithms. We found that CGR can take the role of suffix trees and emulate sophisticated string algorithms, efficiently solving exact and approximate string matching problems such as finding all palindromes and tandem repeats, and matching with mismatches. The common feature of these problems is that they use longest common extension (LCE) queries as subtasks of their procedures, which we show to have a constant time solution with CGR. Additionally, we show that CGR can be used as a rolling hash function within the Rabin-Karp algorithm.
The analysis of biological sequences relies on algorithmic foundations facing mounting challenges, both logistic (performance) and analytical (lack of unifying mathematical framework). CGR is found to provide the latter and to promise the former: graph-based data structures for sequence analysis operations are entailed by numerical-based data structures produced by CGR maps, providing a unifying analytical framework for a diversity of pattern matching problems.
There are currently no reliable markers of acute domoic acid toxicosis (DAT) for California sea lions. We investigated whether patterns of serum peptides could diagnose acute DAT. Serum peptides were analyzed by MALDI-TOF mass spectrometry from 107 sea lions (acute DAT n = 34; non-DAT n = 73). Artificial neural networks (ANN) were trained using MALDI-TOF data. Individual peaks and neural networks were qualified using an independent test set (n = 20).
No single peak was a good classifier of acute DAT, and ANN models were the best predictors of acute DAT. Performance measures for a single median ANN were: sensitivity, 100%; specificity, 60%; positive predictive value, 71%; negative predictive value, 100%. When 101 ANNs were combined and allowed to vote for the outcome, the performance measures were: sensitivity, 30%; specificity, 100%; positive predictive value, 100%; negative predictive value, 59%.
These results suggest that MALDI-TOF peptide profiling and neural networks can perform either as a highly sensitive (100% negative predictive value) or a highly specific (100% positive predictive value) diagnostic tool for acute DAT. This also suggests that machine learning directed by populations of predictive models offer the ability to modulate the predictive effort into a specific type of error.
Serum peptides; Neural network; Zalophus californianus; Neurotoxin
Image bioinformatics infrastructure typically relies on a combination of server-side high-performance computing and client desktop applications tailored for graphic rendering. On the server side, matrix manipulation environments are often used as the back-end where deployment of specialized analytical workflows takes place. However, neither the server-side nor the client-side desktop solution, by themselves or combined, is conducive to the emergence of open, collaborative, computational ecosystems for image analysis that are both self-sustained and user driven.
Materials and Methods:
ImageJS was developed as a browser-based webApp, untethered from a server-side backend, by making use of recent advances in the modern web browser such as a very efficient compiler, high-end graphical rendering capabilities, and I/O tailored for code migration.
Multiple versioned code hosting services were used to develop distinct ImageJS modules to illustrate its amenability to collaborative deployment without compromise of reproducibility or provenance. The illustrative examples include modules for image segmentation, feature extraction, and filtering. The deployment of image analysis by code migration is in sharp contrast with the more conventional, heavier, and less safe reliance on data transfer. Accordingly, code and data are loaded into the browser by exactly the same script tag loading mechanism, which offers a number of interesting applications that would be hard to attain with more conventional platforms, such as NIH's popular ImageJ application.
The modern web browser was found to be advantageous for image bioinformatics in both the research and clinical environments. This conclusion reflects advantages in deployment scalability and analysis reproducibility, as well as the critical ability to deliver advanced computational statistical procedures machines where access to sensitive data is controlled, that is, without local “download and installation”.
Cloud computing; image analysis; webApp
The Human Proteome Organisation’s Proteomics Standards Initiative (HUPO-PSI) has developed the GelML data exchange format for representing gel electrophoresis experiments performed in proteomics investigations. The format closely follows the reporting guidelines for gel electrophoresis, which are part of the Minimum Information About a Proteomics Experiment (MIAPE) set of modules. GelML supports the capture of metadata (such as experimental protocols) and data (such as gel images) resulting from gel electrophoresis so that laboratories can be compliant with the MIAPE Gel Electrophoresis guidelines, while allowing such data sets to be exchanged or downloaded from public repositories. The format is sufficiently flexible to capture data from a broad range of experimental processes, and complements other PSI formats for mass spectrometry data and the results of protein and peptide identifications to capture entire gel-based proteome workflows. GelML has resulted from the open standardisation process of PSI consisting of both public consultation and anonymous review of the specifications.
data standard; gel electrophoresis; database; ontology
The objective of this investigation was to evaluate the effect of maternal obesity, as measured by prepregnancy body mass index (BMI), on the mode of delivery in women undergoing indicated induction of labor for preeclampsia.
Following IRB approval, patients with preeclampsia who underwent an induction of labor from 1997–2007 were identified from a perinatal information database, which included historical and clinical information. Data analysis included bivariable and multivariable analyses of predictor variables by mode of delivery. An artificial neural network was trained and externally validated to independently examine predictors of mode of delivery among women with preeclampsia.
Six hundred and eight women met eligibility criteria and were included in this investigation. Based on multivariable logistic regression (MLR) modeling, a five unit increase in BMI yields a 16% increase in the odds of cesarean delivery. An artificial neural network trained and externally validated confirmed the importance of obesity in the prediction of mode of delivery among women undergoing labor induction for preeclampsia.
Among patients who are affected by preeclampsia, obesity complicates labor induction. The risk of cesarean delivery is enhanced by obesity, even with small increases in BMI. Prediction of mode of delivery by an artificial neural network performs similar to MLR among patients undergoing labor induction for preeclampsia.
Obesity; severe preeclampsia; cesarean delivery; body mass index
In the path towards personalized medicine, the integrative bioinformatics infrastructure is a critical enabling resource. Until large-scale reference data became available, the attributes of the computational infrastructure were postulated by many, but have mostly remained unverified. Now that large-scale initiatives such as The Cancer Genome Atlas (TCGA) are in full swing, the opportunity is at hand to find out what analytical approaches and computational architectures are really effective. A recent report did just that: first a software development environment was assembled as part of an informatics research program, and only then was the analysis of TCGA's glioblastoma multiforme multi-omic data pursued at the multi-omic scale. The results of this complex analysis are the focus of the report highlighted here. However, what is reported in the analysis is also the validating corollary for an infrastructure development effort guided by the iterative identification of sound design criteria for the architecture of the integrative computational infrastructure. The work is at least as valuable as the data analysis results themselves: computational ecosystems with their own high-level abstractions rather than rigid pipelines with prescriptive recipes appear to be the critical feature of an effective infrastructure. Only then can analytical workflows benefit from experimentation just like any other component of the biomedical research program.
The value and usefulness of data increases when it is explicitly interlinked with related data. This is the core principle of Linked Data. For life sciences researchers, harnessing the power of Linked Data to improve biological discovery is still challenged by a need to keep pace with rapidly evolving domains and requirements for collaboration and control as well as with the reference semantic web ontologies and standards. Knowledge organization systems (KOSs) can provide an abstraction for publishing biological discoveries as Linked Data without complicating transactions with contextual minutia such as provenance and access control.
We have previously described the Simple Sloppy Semantic Database (S3DB) as an efficient model for creating knowledge organization systems using Linked Data best practices with explicit distinction between domain and instantiation and support for a permission control mechanism that automatically migrates between the two. In this report we present a domain specific language, the S3DB query language (S3QL), to operate on its underlying core model and facilitate management of Linked Data.
Reflecting the data driven nature of our approach, S3QL has been implemented as an application programming interface for S3DB systems hosting biomedical data, and its syntax was subsequently generalized beyond the S3DB core model. This achievement is illustrated with the assembly of an S3QL query to manage entities from the Simple Knowledge Organization System. The illustrative use cases include gastrointestinal clinical trials, genomic characterization of cancer by The Cancer Genome Atlas (TCGA) and molecular epidemiology of infectious diseases.
S3QL was found to provide a convenient mechanism to represent context for interoperation between public and private datasets hosted at biomedical research institutions and linked data formalisms.
S3DB; Linked Data; KOS; RDF; SPARQL; knowledge organization system, policy
The Cancer Genome Atlas (TCGA) is a multidisciplinary, multi-institutional effort to characterize several types of cancer. Datasets from biomedical domains such as TCGA present a particularly challenging task for those interested in dynamically aggregating its results because the data sources are typically both heterogeneous and distributed. The Linked Data best practices offer a solution to integrate and discover data with those characteristics, namely through exposure of data as Web services supporting SPARQL, the Resource Description Framework query language. Most SPARQL endpoints, however, cannot easily be queried by data experts. Furthermore, exposing experimental data as SPARQL endpoints remains a challenging task because, in most cases, data must first be converted to Resource Description Framework triples. In line with those requirements, we have developed an infrastructure to expose clinical, demographic and molecular data elements generated by TCGA as a SPARQL endpoint by assigning elements to entities of the Simple Sloppy Semantic Database (S3DB) management model. All components of the infrastructure are available as independent Representational State Transfer (REST) Web services to encourage reusability, and a simple interface was developed to automatically assemble SPARQL queries by navigating a representation of the TCGA domain. A key feature of the proposed solution that greatly facilitates assembly of SPARQL queries is the distinction between the TCGA domain descriptors and data elements. Furthermore, the use of the S3DB management model as a mediator enables queries to both public and protected data without the need for prior submission to a single data source.
TCGA; SPARQL; RDF; Linked Data; Data integration
Acute kidney injury (AKI) is an important cause of death among hospitalized patients. The two most common causes of AKI are acute tubular necrosis (ATN) and prerenal azotemia (PRA). Appropriate diagnosis of the disease is important but often difficult. We analyzed urine proteins by 2-DE from 38 patients with AKI. Patients were randomly assigned to a training set, an internal test set or an external validation set. Spot abundances were analyzed by artificial neural networks (ANN) to identify biomarkers which differentiate between ATN and PRA. When the trained neural network algorithm was tested against the training data it identified the diagnosis for 16/18 patients in the training set and all 10 patients in the internal test set. The accuracy was validated in the novel external set of patients where 9/10 subjects were correctly diagnosed including 5/5 with ATN and 4/5 with PRA. Plasma retinol binding protein (PRBP) was identified in one spot and a fragment of albumin and PRBP in the other. These proteins are candidate markers for diagnostic assays of AKI.
Acute kidney injury; Biomarkers; Diagnosis; Kidney; Urine
AGUIA is a front-end web application originally developed to manage clinical, demographic and biomolecular patient data collected during clinical trials at MD Anderson Cancer Center. The diversity of methods involved in patient screening and sample processing generates a variety of data types that require a resource-oriented architecture to capture the associations between the heterogeneous data elements. AGUIA uses a semantic web formalism, resource description framework (RDF), and a bottom-up design of knowledge bases that employ the S3DB tool as the starting point for the client's interface assembly.
The data web service, S3DB, meets the necessary requirements of generating the RDF and of explicitly distinguishing the description of the domain from its instantiation, while allowing for continuous editing of both. Furthermore, it uses an HTTP-REST protocol, has a SPARQL endpoint, and has open source availability in the public domain, which facilitates the development and dissemination of this application. However, S3DB alone does not address the issue of representing content in a form that makes sense for domain experts.
We propose a distribution-free approach to detect nonlinear relationships by reporting local correlation. The effect of our proposed method is analogous to piece-wise linear approximation although the method does not utilize any linear dependency. The proposed metric, maximum local correlation, was applied to both simulated cases and expression microarray data comparing the rd mouse with age-matched control animals. The rd mouse is an animal model (with a mutation for the gene Pde6b) for photoreceptor degeneration. Using simulated data, we show that maximum local correlation detects nonlinear association, which could not be detected using other correlation measures. In the microarray study, our proposed method detects nonlinear association between the expression levels of different genes, which could not be detected using the conventional linear methods. The simulation dataset, microarray expression data, and the Nonparametric Nonlinear Correlation (NNC) software library, implemented in Matlab, are included as part of the online supplemental materials.
Biomedical research is set to greatly benefit from the use of semantic web technologies in the design of computational infrastructure. However, beyond well defined research initiatives, substantial issues of data heterogeneity, source distribution, and privacy currently stand in the way towards the personalization of Medicine.
A computational framework for bioinformatic infrastructure was designed to deal with the heterogeneous data sources and the sensitive mixture of public and private data that characterizes the biomedical domain. This framework consists of a logical model build with semantic web tools, coupled with a Markov process that propagates user operator states. An accompanying open source prototype was developed to meet a series of applications that range from collaborative multi-institution data acquisition efforts to data analysis applications that need to quickly traverse complex data structures. This report describes the two abstractions underlying the S3DB-based infrastructure, logical and numerical, and discusses its generality beyond the immediate confines of existing implementations.
The emergence of the "web as a computer" requires a formal model for the different functionalities involved in reading and writing to it. The S3DB core model proposed was found to address the design criteria of biomedical computational infrastructure, such as those supporting large scale multi-investigator research, clinical trials, and molecular epidemiology.
Heavy metals, such as copper, zinc and cadmium, represent some of the most common and serious pollutants in coastal estuaries. In the present study, we used a combination of linear and artificial neural network (ANN) modelling to detect and explore interactions among low-dose mixtures of these heavy metals and their impacts on fundamental physiological processes in tissues of the Eastern oyster, Crassostrea virginica. Animals were exposed to Cd (0.001–0.400 µM), Zn (0.001–3.059 µM) or Cu (0.002–0.787 µM), either alone or in combination for 1 to 27 days. We measured indicators of acid–base balance (hemolymph pH and total CO2), gas exchange (Po2), immunocompetence (total hemocyte counts, numbers of invasive bacteria), antioxidant status (glutathione, GSH), oxidative damage (lipid peroxidation; LPx), and metal accumulation in the gill and the hepatopancreas. Linear analysis showed that oxidative membrane damage from tissue accumulation of environmental metals was correlated with impaired acid–base balance in oysters. ANN analysis revealed interactions of metals with hemolymph acid–base chemistry in predicting oxidative damage that were not evident from linear analyses. These results highlight the usefulness of machine learning approaches, such as ANNs, for improving our ability to recognize and understand the effects of subacute exposure to contaminant mixtures.
Heavy metals; Artificial neural networks; Crassostrea virginica; Lipid peroxidation; Glutathione; Acid–base balance; Hemolymph PO2