1.  Fractal MapReduce decomposition of sequence alignment 
Background
The dramatic fall in the cost of genomic sequencing, and the increasing convenience of distributed cloud computing resources, positions the MapReduce coding pattern as a cornerstone of scalable bioinformatics algorithm development. In some cases an algorithm will find a natural distribution via use of map functions to process vectorized components, followed by a reduce of aggregate intermediate results. However, for some data analysis procedures such as sequence analysis, a more fundamental reformulation may be required.
Results
In this report we describe a solution to sequence comparison that can be thoroughly decomposed into multiple rounds of map and reduce operations. The route taken makes use of iterated maps, a fractal analysis technique that has been found to provide an "alignment-free" solution to sequence analysis and comparison, that is, a solution that does not require dynamic programming and relies instead on a numeric Chaos Game Representation (CGR) data structure. This claim is demonstrated in this report by calculating the length of the longest similar segment by inspecting only the USM coordinates of two analogous units, with no resort to dynamic programming.
Conclusions
The procedure described is an attempt at extreme decomposition and parallelization of sequence alignment in anticipation of a volume of genomic sequence data that cannot be met by current algorithmic frameworks. The solution found is delivered with a browser-based application (webApp), highlighting the browser's emergence as an environment for high performance distributed computing.
Availability
Public distribution of accompanying software library with open source and version control at http://usm.github.com. Also available as a webApp through Google Chrome's WebStore http://chrome.google.com/webstore: search with "usm".
doi:10.1186/1748-7188-7-12
PMCID: PMC3394223  PMID: 22551205
2.  Pattern matching through Chaos Game Representation: bridging numerical and discrete data structures for biological sequence analysis 
Background
Chaos Game Representation (CGR) is an iterated function that bijectively maps discrete sequences into a continuous domain. As a result, discrete sequences can be the object of statistical and topological analyses otherwise reserved to numerical systems. Characteristically, CGR coordinates of substrings sharing an L-long suffix will be located within a distance of 2^−L of each other. In the two decades since its original proposal, CGR has been generalized beyond its original focus on genomic sequences and has been successfully applied to a wide range of problems in bioinformatics. This report explores the possibility that it can be further extended to approach algorithms that rely on discrete, graph-based representations.
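The suffix property can be checked numerically. The following is a minimal sketch, assuming the common DNA convention of a unit square with one nucleotide per corner, a start point at the centre, and distance measured as the largest per-coordinate difference; the corner assignment and the names `CORNERS`, `cgr`, and `chebyshev` are illustrative choices, not taken from the report.

```python
# Illustrative corner assignment (an assumption, not the report's).
CORNERS = {"A": (0.0, 0.0), "C": (0.0, 1.0), "G": (1.0, 1.0), "T": (1.0, 0.0)}


def cgr(seq, start=(0.5, 0.5)):
    """Return the CGR coordinate reached after reading the whole sequence."""
    x, y = start
    for base in seq:
        cx, cy = CORNERS[base]
        # Each symbol moves the point halfway towards that symbol's corner.
        x, y = (x + cx) / 2, (y + cy) / 2
    return x, y


def chebyshev(p, q):
    """Largest per-coordinate difference between two points."""
    return max(abs(p[0] - q[0]), abs(p[1] - q[1]))


if __name__ == "__main__":
    suffix = "GATTACA"  # shared suffix, L = 7
    a = cgr("ACGTTGCA" + suffix)
    b = cgr("TTTT" + suffix)
    print(chebyshev(a, b), "<=", 2 ** -len(suffix))
```

Each shared trailing symbol halves whatever separation the two prefixes had, which is why L shared symbols bound the separation by 2^−L.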
Results
The exploratory analysis described here consisted of selecting foundational string problems and refactoring them using CGR-based algorithms. We found that CGR can take the role of suffix trees and emulate sophisticated string algorithms, efficiently solving exact and approximate string matching problems such as finding all palindromes and tandem repeats, and matching with mismatches. The common feature of these problems is that they use longest common extension (LCE) queries as subtasks of their procedures, which we show to have a constant time solution with CGR. Additionally, we show that CGR can be used as a rolling hash function within the Rabin-Karp algorithm.
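For orientation, the Rabin-Karp algorithm mentioned above slides a window over the text and updates a hash in constant time per step. The sketch below uses the conventional polynomial rolling hash, not the CGR-based hash proposed in the report; the function name `rabin_karp` and the `base` and `mod` defaults are assumptions chosen for illustration.

```python
def rabin_karp(text, pattern, base=256, mod=1_000_000_007):
    """Return every start index at which pattern occurs in text."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    high = pow(base, m - 1, mod)  # weight of the window's leading symbol
    p_hash = t_hash = 0
    for i in range(m):
        p_hash = (p_hash * base + ord(pattern[i])) % mod
        t_hash = (t_hash * base + ord(text[i])) % mod
    hits = []
    for i in range(n - m + 1):
        # Equal hashes can collide, so confirm with a direct comparison.
        if p_hash == t_hash and text[i:i + m] == pattern:
            hits.append(i)
        if i < n - m:
            # Roll the window: drop text[i], append text[i + m].
            t_hash = ((t_hash - ord(text[i]) * high) * base
                      + ord(text[i + m])) % mod
    return hits


if __name__ == "__main__":
    print(rabin_karp("GATTACAGATTACA", "TTAC"))
```

Any hash that can be updated from one window to the next in constant time fits this loop, which is the slot the report proposes to fill with a CGR coordinate.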
Conclusions
The analysis of biological sequences relies on algorithmic foundations facing mounting challenges, both logistic (performance) and analytical (lack of unifying mathematical framework). CGR is found to provide the latter and to promise the former: graph-based data structures for sequence analysis operations are entailed by numerical-based data structures produced by CGR maps, providing a unifying analytical framework for a diversity of pattern matching problems.
doi:10.1186/1748-7188-7-10
PMCID: PMC3402988  PMID: 22551152
3.  Serum profiling by MALDI-TOF mass spectrometry as a diagnostic tool for domoic acid toxicosis in California sea lions 
Proteome Science  2012;10:18.
Background
There are currently no reliable markers of acute domoic acid toxicosis (DAT) for California sea lions. We investigated whether patterns of serum peptides could diagnose acute DAT. Serum peptides were analyzed by MALDI-TOF mass spectrometry from 107 sea lions (acute DAT n = 34; non-DAT n = 73). Artificial neural networks (ANN) were trained using MALDI-TOF data. Individual peaks and neural networks were qualified using an independent test set (n = 20).
Results
No single peak was a good classifier of acute DAT, and ANN models were the best predictors of acute DAT. Performance measures for a single median ANN were: sensitivity, 100%; specificity, 60%; positive predictive value, 71%; negative predictive value, 100%. When 101 ANNs were combined and allowed to vote for the outcome, the performance measures were: sensitivity, 30%; specificity, 100%; positive predictive value, 100%; negative predictive value, 59%.
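The four performance measures follow directly from a 2×2 confusion matrix. The sketch below is illustrative only: the abstract does not report the counts, so the values used here are hypothetical, chosen under the assumption that the 20-animal test set was split 10 acute DAT and 10 non-DAT, which happens to reproduce the reported percentages after rounding.

```python
def diagnostic_metrics(tp, fp, tn, fn):
    """Compute standard diagnostic measures from confusion-matrix counts."""
    return {
        "sensitivity": tp / (tp + fn),  # true positives among diseased
        "specificity": tn / (tn + fp),  # true negatives among non-diseased
        "ppv": tp / (tp + fp),          # positive predictive value
        "npv": tn / (tn + fn),          # negative predictive value
    }


# Hypothetical counts (assumed 10/10 split of the n = 20 test set).
single_ann = diagnostic_metrics(tp=10, fp=4, tn=6, fn=0)
voting_anns = diagnostic_metrics(tp=3, fp=0, tn=10, fn=7)

if __name__ == "__main__":
    for name, metrics in (("single", single_ann), ("voting", voting_anns)):
        print(name, {k: round(100 * v) for k, v in metrics.items()})
```

Under these assumed counts, the single network never misses a case but raises false alarms, while the voting ensemble never raises a false alarm but misses most cases.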
Conclusions
These results suggest that MALDI-TOF peptide profiling and neural networks can perform either as a highly sensitive (100% negative predictive value) or a highly specific (100% positive predictive value) diagnostic tool for acute DAT. This also suggests that machine learning directed by populations of predictive models offers the ability to modulate the predictive effort into a specific type of error.
doi:10.1186/1477-5956-10-18
PMCID: PMC3338078  PMID: 22429742
Serum peptides; Neural network; Zalophus californianus; Neurotoxin
4.  ImageJS: Personalized, participated, pervasive, and reproducible image bioinformatics in the web browser  
Background:
Image bioinformatics infrastructure typically relies on a combination of server-side high-performance computing and client desktop applications tailored for graphic rendering. On the server side, matrix manipulation environments are often used as the back-end where deployment of specialized analytical workflows takes place. However, neither the server-side nor the client-side desktop solution, by themselves or combined, is conducive to the emergence of open, collaborative, computational ecosystems for image analysis that are both self-sustained and user driven.
Materials and Methods:
ImageJS was developed as a browser-based webApp, untethered from a server-side backend, by making use of recent advances in the modern web browser such as a very efficient compiler, high-end graphical rendering capabilities, and I/O tailored for code migration.
Results:
Multiple versioned code hosting services were used to develop distinct ImageJS modules to illustrate its amenability to collaborative deployment without compromise of reproducibility or provenance. The illustrative examples include modules for image segmentation, feature extraction, and filtering. The deployment of image analysis by code migration is in sharp contrast with the more conventional, heavier, and less safe reliance on data transfer. Accordingly, code and data are loaded into the browser by exactly the same script tag loading mechanism, which offers a number of interesting applications that would be hard to attain with more conventional platforms, such as NIH's popular ImageJ application.
Conclusions:
The modern web browser was found to be advantageous for image bioinformatics in both the research and clinical environments. This conclusion reflects advantages in deployment scalability and analysis reproducibility, as well as the critical ability to deliver advanced computational statistical procedures to machines where access to sensitive data is controlled, that is, without local “download and installation”.
doi:10.4103/2153-3539.98813
PMCID: PMC3424663  PMID: 22934238
Cloud computing; image analysis; webApp
5.  The Gel Electrophoresis Markup Language (GelML) from the Proteomics Standards Initiative 
Proteomics  2010;10(17):3073-3081.
The Human Proteome Organisation’s Proteomics Standards Initiative (HUPO-PSI) has developed the GelML data exchange format for representing gel electrophoresis experiments performed in proteomics investigations. The format closely follows the reporting guidelines for gel electrophoresis, which are part of the Minimum Information About a Proteomics Experiment (MIAPE) set of modules. GelML supports the capture of metadata (such as experimental protocols) and data (such as gel images) resulting from gel electrophoresis so that laboratories can be compliant with the MIAPE Gel Electrophoresis guidelines, while allowing such data sets to be exchanged or downloaded from public repositories. The format is sufficiently flexible to capture data from a broad range of experimental processes, and complements other PSI formats for mass spectrometry data and the results of protein and peptide identifications to capture entire gel-based proteome workflows. GelML has resulted from the open standardisation process of PSI consisting of both public consultation and anonymous review of the specifications.
doi:10.1002/pmic.201000120
PMCID: PMC3193076  PMID: 20677327
data standard; gel electrophoresis; database; ontology
6.  Examining the effect of maternal obesity on outcome of labor induction in patients with preeclampsia 
OBJECTIVE
The objective of this investigation was to evaluate the effect of maternal obesity, as measured by prepregnancy body mass index (BMI), on the mode of delivery in women undergoing indicated induction of labor for preeclampsia.
STUDY DESIGN
Following IRB approval, patients with preeclampsia who underwent an induction of labor from 1997–2007 were identified from a perinatal information database, which included historical and clinical information. Data analysis included bivariable and multivariable analyses of predictor variables by mode of delivery. An artificial neural network was trained and externally validated to independently examine predictors of mode of delivery among women with preeclampsia.
RESULTS
Six hundred and eight women met eligibility criteria and were included in this investigation. Based on multivariable logistic regression (MLR) modeling, a five unit increase in BMI yields a 16% increase in the odds of cesarean delivery. An artificial neural network trained and externally validated confirmed the importance of obesity in the prediction of mode of delivery among women undergoing labor induction for preeclampsia.
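The reported effect size can be rescaled to other BMI differences. This is a minimal sketch, assuming BMI enters the logistic model as a single linear term on the log-odds scale and treating the 16% figure as a point estimate; the names `OR_PER_5_BMI`, `beta_per_unit`, and `odds_ratio` are illustrative.

```python
import math

OR_PER_5_BMI = 1.16  # odds ratio for a five-unit BMI increase (from the abstract)

# Under a linear log-odds term, the per-unit coefficient is log(OR) / 5.
beta_per_unit = math.log(OR_PER_5_BMI) / 5


def odds_ratio(delta_bmi):
    """Odds ratio of cesarean delivery for a BMI difference of delta_bmi."""
    return math.exp(beta_per_unit * delta_bmi)


if __name__ == "__main__":
    for delta in (1, 5, 10):
        print(delta, round(odds_ratio(delta), 3))
```

Because the effect is multiplicative in the odds, a ten-unit difference corresponds to 1.16 squared rather than a 32% increase.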
CONCLUSION
Among patients who are affected by preeclampsia, obesity complicates labor induction. The risk of cesarean delivery is enhanced by obesity, even with small increases in BMI. Prediction of mode of delivery by an artificial neural network performs similarly to MLR among patients undergoing labor induction for preeclampsia.
doi:10.3109/10641950903452386
PMCID: PMC3192401  PMID: 20818957
Obesity; severe preeclampsia; cesarean delivery; body mass index
7.  Computational ecosystems for data-driven medical genomics 
Genome Medicine  2010;2(9):67.
In the path towards personalized medicine, the integrative bioinformatics infrastructure is a critical enabling resource. Until large-scale reference data became available, the attributes of the computational infrastructure were postulated by many, but have mostly remained unverified. Now that large-scale initiatives such as The Cancer Genome Atlas (TCGA) are in full swing, the opportunity is at hand to find out what analytical approaches and computational architectures are really effective. A recent report did just that: first a software development environment was assembled as part of an informatics research program, and only then was the analysis of TCGA's glioblastoma multiforme multi-omic data pursued at the multi-omic scale. The results of this complex analysis are the focus of the report highlighted here. However, what is reported in the analysis is also the validating corollary for an infrastructure development effort guided by the iterative identification of sound design criteria for the architecture of the integrative computational infrastructure. The work is at least as valuable as the data analysis results themselves: computational ecosystems with their own high-level abstractions rather than rigid pipelines with prescriptive recipes appear to be the critical feature of an effective infrastructure. Only then can analytical workflows benefit from experimentation just like any other component of the biomedical research program.
doi:10.1186/gm188
PMCID: PMC3092118  PMID: 20854645
8.  S3QL: A distributed domain specific language for controlled semantic integration of life sciences data 
BMC Bioinformatics  2011;12:285.
Background
The value and usefulness of data increase when it is explicitly interlinked with related data. This is the core principle of Linked Data. For life sciences researchers, harnessing the power of Linked Data to improve biological discovery is still challenged by a need to keep pace with rapidly evolving domains and requirements for collaboration and control as well as with the reference semantic web ontologies and standards. Knowledge organization systems (KOSs) can provide an abstraction for publishing biological discoveries as Linked Data without complicating transactions with contextual minutiae such as provenance and access control.
We have previously described the Simple Sloppy Semantic Database (S3DB) as an efficient model for creating knowledge organization systems using Linked Data best practices with explicit distinction between domain and instantiation and support for a permission control mechanism that automatically migrates between the two. In this report we present a domain specific language, the S3DB query language (S3QL), to operate on its underlying core model and facilitate management of Linked Data.
Results
Reflecting the data driven nature of our approach, S3QL has been implemented as an application programming interface for S3DB systems hosting biomedical data, and its syntax was subsequently generalized beyond the S3DB core model. This achievement is illustrated with the assembly of an S3QL query to manage entities from the Simple Knowledge Organization System. The illustrative use cases include gastrointestinal clinical trials, genomic characterization of cancer by The Cancer Genome Atlas (TCGA) and molecular epidemiology of infectious diseases.
Conclusions
S3QL was found to provide a convenient mechanism to represent context for interoperation between public and private datasets hosted at biomedical research institutions and linked data formalisms.
doi:10.1186/1471-2105-12-285
PMCID: PMC3155508  PMID: 21756325
S3DB; Linked Data; KOS; RDF; SPARQL; knowledge organization system; policy
9.  Exposing the cancer genome atlas as a SPARQL endpoint 
Journal of biomedical informatics  2010;43(6):998-1008.
The Cancer Genome Atlas (TCGA) is a multidisciplinary, multi-institutional effort to characterize several types of cancer. Datasets from biomedical domains such as TCGA present a particularly challenging task for those interested in dynamically aggregating its results because the data sources are typically both heterogeneous and distributed. The Linked Data best practices offer a solution to integrate and discover data with those characteristics, namely through exposure of data as Web services supporting SPARQL, the Resource Description Framework query language. Most SPARQL endpoints, however, cannot easily be queried by data experts. Furthermore, exposing experimental data as SPARQL endpoints remains a challenging task because, in most cases, data must first be converted to Resource Description Framework triples. In line with those requirements, we have developed an infrastructure to expose clinical, demographic and molecular data elements generated by TCGA as a SPARQL endpoint by assigning elements to entities of the Simple Sloppy Semantic Database (S3DB) management model. All components of the infrastructure are available as independent Representational State Transfer (REST) Web services to encourage reusability, and a simple interface was developed to automatically assemble SPARQL queries by navigating a representation of the TCGA domain. A key feature of the proposed solution that greatly facilitates assembly of SPARQL queries is the distinction between the TCGA domain descriptors and data elements. Furthermore, the use of the S3DB management model as a mediator enables queries to both public and protected data without the need for prior submission to a single data source.
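To make "Web services supporting SPARQL" concrete, the SPARQL protocol lets a client send a query as the `query` parameter of an HTTP GET request. The sketch below only builds such a request URL; the endpoint address is a placeholder and the generic triple pattern is illustrative, neither being the actual TCGA or S3DB service or vocabulary.

```python
from urllib.parse import parse_qs, urlencode, urlparse

# Placeholder endpoint (an assumption, not the service described here).
ENDPOINT = "http://example.org/sparql"

# Generic query: list ten arbitrary subject-predicate-object triples.
QUERY = """
SELECT ?s ?p ?o
WHERE { ?s ?p ?o }
LIMIT 10
""".strip()


def sparql_get_url(endpoint, query):
    """Build a SPARQL protocol GET URL carrying the query text."""
    return endpoint + "?" + urlencode({"query": query})


if __name__ == "__main__":
    url = sparql_get_url(ENDPOINT, QUERY)
    print(url)
    # Decoding the URL recovers the original query text.
    print(parse_qs(urlparse(url).query)["query"][0])
```

Writing even this generic query requires knowing the triple-pattern syntax, which is the barrier for domain experts that the query-assembly interface described above is meant to lower.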
doi:10.1016/j.jbi.2010.09.004
PMCID: PMC3071752  PMID: 20851208
TCGA; SPARQL; RDF; Linked Data; Data integration
10.  Identification of Diagnostic Urinary Biomarkers for Acute Kidney Injury 
Acute kidney injury (AKI) is an important cause of death among hospitalized patients. The two most common causes of AKI are acute tubular necrosis (ATN) and prerenal azotemia (PRA). Appropriate diagnosis of the disease is important but often difficult. We analyzed urine proteins by 2-DE from 38 patients with AKI. Patients were randomly assigned to a training set, an internal test set or an external validation set. Spot abundances were analyzed by artificial neural networks (ANN) to identify biomarkers which differentiate between ATN and PRA. When the trained neural network algorithm was tested against the training data it identified the diagnosis for 16/18 patients in the training set and all 10 patients in the internal test set. The accuracy was validated in the novel external set of patients where 9/10 subjects were correctly diagnosed including 5/5 with ATN and 4/5 with PRA. Plasma retinol binding protein (PRBP) was identified in one spot and a fragment of albumin and PRBP in the other. These proteins are candidate markers for diagnostic assays of AKI.
doi:10.231/JIM.0b013e3181d473e7
PMCID: PMC2864920  PMID: 20224435
Acute kidney injury; Biomarkers; Diagnosis; Kidney; Urine
11.  AGUIA: autonomous graphical user interface assembly for clinical trials semantic data services 
Background
AGUIA is a front-end web application originally developed to manage clinical, demographic and biomolecular patient data collected during clinical trials at MD Anderson Cancer Center. The diversity of methods involved in patient screening and sample processing generates a variety of data types that require a resource-oriented architecture to capture the associations between the heterogeneous data elements. AGUIA uses a semantic web formalism, resource description framework (RDF), and a bottom-up design of knowledge bases that employ the S3DB tool as the starting point for the client's interface assembly.
Methods
The data web service, S3DB, meets the necessary requirements of generating the RDF and of explicitly distinguishing the description of the domain from its instantiation, while allowing for continuous editing of both. Furthermore, it uses an HTTP-REST protocol, has a SPARQL endpoint, and has open source availability in the public domain, which facilitates the development and dissemination of this application. However, S3DB alone does not address the issue of representing content in a form that makes sense for domain experts.
Results
We identified an autonomous set of descriptors, the GBox, that provides user and domain specifications for the graphical user interface. This was achieved by identifying a formalism that makes use of an RDF schema to enable the automatic assembly of graphical user interfaces in a meaningful manner while using only resources native to the client web browser (JavaScript interpreter, document object model). We defined a generalized RDF model such that changes in the graphic descriptors are automatically and immediately (locally) reflected into the configuration of the client's interface application.
Conclusions
The design patterns identified for the GBox benefit from and reflect the specific requirements of interacting with data generated by clinical trials, and they contain clues for a general purpose solution to the challenge of having interfaces automatically assembled for multiple and volatile views of a domain. By coding AGUIA in JavaScript, for which all browsers include a native interpreter, a solution was found that assembles interfaces that are meaningful to the particular user, and which are also ubiquitous and lightweight, allowing the computational load to be carried by the client's machine.
doi:10.1186/1472-6947-10-65
PMCID: PMC2987967  PMID: 20977768
12.  A nonparametric approach to detect nonlinear correlation in gene expression 
We propose a distribution-free approach to detect nonlinear relationships by reporting local correlation. The effect of our proposed method is analogous to piece-wise linear approximation although the method does not utilize any linear dependency. The proposed metric, maximum local correlation, was applied to both simulated cases and expression microarray data comparing the rd mouse with age-matched control animals. The rd mouse is an animal model (with a mutation for the gene Pde6b) for photoreceptor degeneration. Using simulated data, we show that maximum local correlation detects nonlinear association, which could not be detected using other correlation measures. In the microarray study, our proposed method detects nonlinear association between the expression levels of different genes, which could not be detected using the conventional linear methods. The simulation dataset, microarray expression data, and the Nonparametric Nonlinear Correlation (NNC) software library, implemented in Matlab, are included as part of the online supplemental materials.
doi:10.1198/jcgs.2010.08160
PMCID: PMC2945392  PMID: 20877445
13.  S3DB core: a framework for RDF generation and management in bioinformatics infrastructures 
BMC Bioinformatics  2010;11:387.
Background
Biomedical research is set to greatly benefit from the use of semantic web technologies in the design of computational infrastructure. However, beyond well defined research initiatives, substantial issues of data heterogeneity, source distribution, and privacy currently stand in the way towards the personalization of Medicine.
Results
A computational framework for bioinformatic infrastructure was designed to deal with the heterogeneous data sources and the sensitive mixture of public and private data that characterizes the biomedical domain. This framework consists of a logical model built with semantic web tools, coupled with a Markov process that propagates user operator states. An accompanying open source prototype was developed to meet a series of applications that range from collaborative multi-institution data acquisition efforts to data analysis applications that need to quickly traverse complex data structures. This report describes the two abstractions underlying the S3DB-based infrastructure, logical and numerical, and discusses its generality beyond the immediate confines of existing implementations.
Conclusions
The emergence of the "web as a computer" requires a formal model for the different functionalities involved in reading and writing to it. The S3DB core model proposed was found to address the design criteria of biomedical computational infrastructure, such as those supporting large scale multi-investigator research, clinical trials, and molecular epidemiology.
doi:10.1186/1471-2105-11-387
PMCID: PMC2918582  PMID: 20646315
14.  Modelling interactions of acid–base balance and respiratory status in the toxicity of metal mixtures in the American oyster Crassostrea virginica 
Heavy metals, such as copper, zinc and cadmium, represent some of the most common and serious pollutants in coastal estuaries. In the present study, we used a combination of linear and artificial neural network (ANN) modelling to detect and explore interactions among low-dose mixtures of these heavy metals and their impacts on fundamental physiological processes in tissues of the Eastern oyster, Crassostrea virginica. Animals were exposed to Cd (0.001–0.400 µM), Zn (0.001–3.059 µM) or Cu (0.002–0.787 µM), either alone or in combination for 1 to 27 days. We measured indicators of acid–base balance (hemolymph pH and total CO2), gas exchange (Po2), immunocompetence (total hemocyte counts, numbers of invasive bacteria), antioxidant status (glutathione, GSH), oxidative damage (lipid peroxidation; LPx), and metal accumulation in the gill and the hepatopancreas. Linear analysis showed that oxidative membrane damage from tissue accumulation of environmental metals was correlated with impaired acid–base balance in oysters. ANN analysis revealed interactions of metals with hemolymph acid–base chemistry in predicting oxidative damage that were not evident from linear analyses. These results highlight the usefulness of machine learning approaches, such as ANNs, for improving our ability to recognize and understand the effects of subacute exposure to contaminant mixtures.
doi:10.1016/j.cbpa.2009.11.019
PMCID: PMC2906223  PMID: 19958840
Heavy metals; Artificial neural networks; Crassostrea virginica; Lipid peroxidation; Glutathione; Acid–base balance; Hemolymph PO2
15.  Sources of Variability among Replicate Samples Separated by Two-Dimensional Gel Electrophoresis 
Two-dimensional gel electrophoresis (2DE) offers high-resolution separation for intact proteins. However, variability in the appearance of spots can limit the ability to identify true differences between conditions. Variability can occur at a number of levels. Individual samples can differ because of biological variability. Technical variability can occur during protein extraction, processing, or storage. Another potential source of variability occurs during analysis of the gels and is not a result of any of the causes of variability named above. We performed a study designed to focus only on the variability caused by analysis. We separated three aliquots of rat left ventricle and analyzed differences in protein abundance on the replicate 2D gels. As the samples loaded on each gel were identical, differences in protein abundance are caused by variability in separation or interpretation of the gels. Protein spots were compared across gels by quantile values to determine differences. Fourteen percent of spots had a maximum difference in intensity of 0.4 quantile values or more between replicates. We then looked individually at the spots to determine the cause of differences between the measured intensities. Reasons for differences were: failure to identify a spot (59%), differences in spot boundaries (13%), difference in the peak height (6%), and a combination of these factors (21%). This study demonstrates that spot identification and characterization make major contributions to variability seen with 2DE. Methods to highlight why measured protein spot abundance is different could reduce these errors.
PMCID: PMC2841997  PMID: 20357976
heart; proteomics; reproducibility; protein
17.  DASMiner: discovering and integrating data from DAS sources 
BMC Systems Biology  2009;3:109.
Background
DAS is a widely adopted protocol for providing syntactic interoperability among biological databases. The popularity of DAS is due to a simplified and elegant mechanism for data exchange that consists of sources exposing their RESTful interfaces for data access. As a growing number of DAS services are available for molecular biology resources, there is an incentive to explore this protocol in order to advance data discovery and integration among these resources.
Results
We developed DASMiner, a Matlab toolkit for querying DAS data sources that enables creation of integrated biological models using the information available in DAS-compliant repositories. DASMiner is composed of a browser application and an API that work together to facilitate gathering of data from different DAS sources, which can be used for creating enriched datasets from multiple sources.
The browser is used to formulate queries and navigate data contained in DAS sources. Users can execute queries against these sources in an intuitive fashion, without needing to know the specific DAS syntax for the particular source. Using the source's metadata provided by the DAS Registry, the browser's layout adapts to expose only the set of commands and coordinate systems supported by the specific source. For this reason, the browser can interrogate any DAS source, independently of the type of data being served.
The API component of DASMiner may be used for programmatic access of DAS sources by programs in Matlab. Once the desired data is found during navigation, the query is exported in the format of an API call to be used within any Matlab application. We illustrate the use of DASMiner by creating integrative models of histone modification maps and protein-protein interaction networks. These enriched datasets were built by retrieving and integrating distributed genomic and proteomic DAS sources using the API.
Conclusion
Support for the DAS protocol allows hundreds of molecular biology databases to be treated as a federated, online collection of resources. DASMiner enables full exploration of these resources, and can be used to deploy applications and create integrated views of biological systems using the information deposited in DAS repositories.
doi:10.1186/1752-0509-3-109
PMCID: PMC2789070  PMID: 19919683
18.  An Introspective Comparison of Random Forest-Based Classifiers for the Analysis of Cluster-Correlated Data by Way of RF++ 
PLoS ONE  2009;4(9):e7087.
Many mass spectrometry-based studies, as well as other biological experiments, produce cluster-correlated data. Failure to account for correlation among observations may result in a classification algorithm overfitting the training data and producing overoptimistic estimated error rates and may make subsequent classifications unreliable. Current common practice for dealing with replicated data is to average each subject replicate sample set, reducing the dataset size and incurring loss of information. In this manuscript we compare three approaches to dealing with cluster-correlated data: unmodified Breiman's Random Forest (URF), forest grown using subject-level averages (SLA), and RF++ with subject-level bootstrapping (SLB). RF++, a novel Random Forest-based algorithm implemented in C++, handles cluster-correlated data through a modification of the original resampling algorithm and accommodates subject-level classification. Subject-level bootstrapping is an alternative sampling method that obviates the need to average or otherwise reduce each set of replicates to a single independent sample. Our experiments show nearly identical median classification and variable selection accuracy for SLB forests and URF forests when applied to both simulated and real datasets. However, the run-time estimated error rate was severely underestimated for URF forests. Predictably, SLA forests were found to be more severely affected by the reduction in sample size, which led to poorer classification and variable selection accuracy. Perhaps most importantly, our results suggest that it is reasonable to utilize URF for the analysis of cluster-correlated data. Two caveats should be noted: first, correct classification error rates must be obtained using a separate test dataset, and second, an additional post-processing step is required to obtain subject-level classifications. RF++ is shown to be an effective alternative for classifying both clustered and non-clustered data.
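The idea of resampling whole subjects rather than individual replicates can be sketched as follows. This is an illustration of the general technique only, written in Python rather than C++, and it is not the RF++ implementation; the function name `subject_level_bootstrap` and its arguments are hypothetical.

```python
import random
from collections import defaultdict


def subject_level_bootstrap(subject_ids, rng):
    """Draw subjects with replacement and keep all of their replicates.

    subject_ids[i] is the subject that row i belongs to. Returns
    (in_bag_rows, out_of_bag_rows); a subject drawn more than once
    contributes its rows once per draw.
    """
    rows_by_subject = defaultdict(list)
    for row, sid in enumerate(subject_ids):
        rows_by_subject[sid].append(row)
    subjects = sorted(rows_by_subject)
    # Resample at the subject level, not the row level.
    drawn = [rng.choice(subjects) for _ in subjects]
    in_bag = [row for sid in drawn for row in rows_by_subject[sid]]
    # Subjects never drawn form the out-of-bag set, replicates included.
    left_out = sorted(set(subjects) - set(drawn))
    out_of_bag = [row for sid in left_out for row in rows_by_subject[sid]]
    return in_bag, out_of_bag


if __name__ == "__main__":
    ids = ["s1", "s1", "s1", "s2", "s2", "s3", "s3", "s3", "s4"]
    print(subject_level_bootstrap(ids, random.Random(0)))
```

Because replicates of one subject never straddle the in-bag and out-of-bag sets, an out-of-bag error estimate computed this way is not flattered by correlated replicates, which is the failure the abstract attributes to the unmodified forest.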
Source code and stand-alone compiled versions of command-line and easy-to-use graphical user interface (GUI) versions of RF++ for Windows and Linux as well as a user manual (Supplementary File S2) are available for download at: http://sourceforge.org/projects/rfpp/ under the GNU public license.
doi:10.1371/journal.pone.0007087
PMCID: PMC2739274  PMID: 19763254
19.  Urine Biomarkers Predict the Cause of Glomerular Disease 
Diagnosis of the type of glomerular disease that causes the nephrotic syndrome is necessary for appropriate treatment and typically requires a renal biopsy. The goal of this study was to identify candidate protein biomarkers to diagnose glomerular diseases. Proteomic methods and informatic analysis were used to identify patterns of urine proteins that are characteristic of the diseases. Urine proteins were separated by two-dimensional electrophoresis in 32 patients with FSGS, lupus nephritis, membranous nephropathy, or diabetic nephropathy. Protein abundances from 16 patients were used to train an artificial neural network to create a prediction algorithm. The remaining 16 patients were used as an external validation set to test the accuracy of the prediction algorithm. In the validation set, the model predicted the presence of the diseases with sensitivities between 75 and 86% and specificities from 92 to 67%. The probability of obtaining these results in the novel set by chance is 5 × 10⁻⁸. Twenty-one gel spots were most important for the differentiation of the diseases. The spots were cut from the gel, and 20 were identified by mass spectrometry as charge forms of 11 plasma proteins: Orosomucoid, transferrin, α-1 microglobulin, zinc α-2 glycoprotein, α-1 antitrypsin, complement factor B, haptoglobin, transthyretin, plasma retinol binding protein, albumin, and hemopexin. These data show that diseases that cause nephrotic syndrome change glomerular protein permeability in characteristic patterns. The fingerprint of urine protein charge forms identifies the glomerular disease. The identified proteins are candidate biomarkers that can be tested in assays that are more amenable to clinical testing.
doi:10.1681/ASN.2006070767
PMCID: PMC2733832  PMID: 17301191
20.  Entropic Profiler – detection of conservation in genomes using information theory 
BMC Research Notes  2009;2:72.
Background
In recent decades, with the successive availability of whole genome sequences, many research efforts have been made to mathematically model DNA. Entropic Profiles (EP) were proposed recently as a new measure of continuous entropy of genome sequences. EP represent local information plots related to DNA randomness and are based on information theory and statistical concepts. They express the weighted relative abundance of motifs for each position in genomes. Their study is very relevant because under- or over-represented segments are often associated with significant biological meaning.
Findings
The Entropic Profiler application presented here is a new tool designed to detect and extract under- and over-represented DNA segments in genomes by using EP. It computes EP very efficiently by resorting to improved algorithms and data structures, which include modified suffix trees. Available through a web interface and as downloadable source code, it allows users to study positions and to search for motifs inside the whole sequence or within a specified range. DNA sequences can be entered from different sources, including FASTA files, pre-loaded examples, or previously saved work. Besides the EP value plots, p-values and z-scores for each motif are also computed, along with the Chaos Game Representation of the sequence.
Conclusion
EP are directly related to the statistical significance of motifs and can be considered as a new method to extract and classify significant regions in genomes and estimate local scales in DNA. The present implementation establishes an efficient and useful tool for whole genome analysis.
doi:10.1186/1756-0500-2-72
PMCID: PMC2686720  PMID: 19416538
21.  Identification of neutral biochemical network models from time series data 
BMC Systems Biology  2009;3:47.
Background
The major difficulty in modeling biological systems from multivariate time series is the identification of parameter sets that endow a model with dynamical behaviors sufficiently similar to the experimental data. Directly related to this parameter estimation issue is the task of identifying the structure and regulation of ill-characterized systems. Both tasks are simplified if the mathematical model is canonical, i.e., if it is constructed according to strict guidelines.
Results
In this report, we propose a method for the identification of admissible parameter sets of canonical S-systems from biological time series. The method is based on a Monte Carlo process that is combined with an improved version of our previous parameter optimization algorithm. The method maps the parameter space into the network space, which characterizes the connectivity among components, by creating an ensemble of decoupled S-system models that imitate the dynamical behavior of the time series with sufficient accuracy. The concept of sloppiness is revisited in the context of these S-system models with an exploration not only of different parameter sets that produce similar dynamical behaviors but also different network topologies that yield dynamical similarity.
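For readers unfamiliar with the canonical form referred to above, the standard S-system representation can be written as follows. The notation (X_i for the n component concentrations, α_i and β_i for rate constants, g_ij and h_ij for kinetic orders, S_i(t_k) for a slope estimated from the time series) is the conventional one and is an assumption here, since the abstract does not define symbols.

```latex
% Canonical S-system: one production and one degradation power-law term per component
\frac{dX_i}{dt} \;=\; \alpha_i \prod_{j=1}^{n} X_j^{\,g_{ij}} \;-\; \beta_i \prod_{j=1}^{n} X_j^{\,h_{ij}},
\qquad i = 1, \dots, n

% Decoupled form: the derivative is replaced by an estimated slope S_i(t_k),
% so each equation becomes algebraic and can be fitted independently
S_i(t_k) \;\approx\; \alpha_i \prod_{j=1}^{n} X_j(t_k)^{\,g_{ij}} \;-\; \beta_i \prod_{j=1}^{n} X_j(t_k)^{\,h_{ij}}
```

In this form a nonzero kinetic order g_ij or h_ij corresponds to a regulatory connection from component j to component i, which is why distinct admissible parameter sets can imply distinct network topologies.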
Conclusion
The proposed parameter estimation methodology was applied to actual time series data from the glycolytic pathway of the bacterium Lactococcus lactis and led to ensembles of models with different network topologies. In parallel, the parameter optimization algorithm was applied to the same dynamical data upon imposing a pre-specified network topology derived from prior biological knowledge, and the results from both strategies were compared. The results suggest that the proposed method may serve as a powerful exploration tool for testing hypotheses and the design of new experiments.
doi:10.1186/1752-0509-3-47
PMCID: PMC2694766  PMID: 19416537
22.  Prediction of urinary protein markers in lupus nephritis 
Kidney international  2005;68(6):2588-2592.
Background
Lupus nephritis is divided into six classes and scored according to activity and chronicity indices based on histologic findings. Treatment differs based on the pathologic findings. Renal biopsy is currently the only way to accurately predict class and activity and chronicity indices. We propose to use patterns of abundance of urine proteins to identify class and disease indices.
Methods
Urine was collected from 20 consecutive patients immediately prior to biopsy for evaluation of lupus nephritis. The International Society of Nephrology/Renal Pathology Society (ISN/RPS) class of lupus nephritis, activity, and chronicity indices were determined by a renal pathologist. Proteins were separated by two-dimensional gel electrophoresis. Artificial neural networks were trained on normalized spot abundance values.
Results
Biopsy specimens were classified in the database according to ISN/RPS class, activity, and chronicity. Nine samples had characteristics of more than one class present. Receiver operating characteristic (ROC) curves of the trained networks demonstrated areas under the curve ranging from 0.85 to 0.95. The sensitivity and specificity for the ISN/RPS classes were class II 100%, 100%; III 86%, 100%; IV 100%, 92%; and V 92%, 50%. Activity and chronicity indices had r values of 0.77 and 0.87, respectively. A list of spots was obtained that provided diagnostic sensitivity to the analysis.
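The per-class sensitivity and specificity figures reported above are one-versus-rest quantities. The following Python sketch shows how such values are computed from true and predicted labels. The function name and the example labels are made up for illustration and are not data from the study.

```python
def sensitivity_specificity(true_labels, predicted_labels, positive_class):
    """One-versus-rest sensitivity and specificity for a single class."""
    tp = fn = tn = fp = 0
    for truth, pred in zip(true_labels, predicted_labels):
        if truth == positive_class:
            if pred == positive_class:
                tp += 1  # class present and detected
            else:
                fn += 1  # class present but missed
        else:
            if pred == positive_class:
                fp += 1  # class absent but predicted
            else:
                tn += 1  # class absent and not predicted
    # Guard against classes with no positive or no negative examples.
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity


# Hypothetical labels, for illustration only.
truth = ["IV", "IV", "V", "II"]
predicted = ["IV", "V", "V", "II"]
print(sensitivity_specificity(truth, predicted, "IV"))  # (0.5, 1.0)
```

With only 20 patients, each percentage rests on a handful of cases, so a single misclassification moves these values substantially.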
Conclusion
We have identified a list of protein spots that can be used to develop a clinical assay to predict ISN/RPS class and chronicity for patients with lupus nephritis. An assay based on antibodies against these spots could eliminate the need for renal biopsy, allow frequent evaluation of disease status, and begin specific therapy for patients with lupus nephritis.
doi:10.1111/j.1523-1755.2005.00730.x
PMCID: PMC2667626  PMID: 16316334
lupus nephritis; biomarkers; urine; electrophoresis; two-dimensional gel
23.  Biological sequences as pictures – a generic two dimensional solution for iterated maps 
BMC Bioinformatics  2009;10:100.
Background
Representing symbolic sequences graphically using iterated maps has enjoyed an enduring popularity since it was first proposed in Jeffrey 1990 as chaos game representation (CGR). The usefulness of this representation goes beyond the convenience of a scale independent representation. It provides a variable memory length representation of transition. This includes the representation of succession with non-integer order, which comes with the promise of generalizing Markovian formalisms. The original proposal targeted genomic sequences only but since then several generalizations have been proposed, many specifically designed to handle protein data.
Results
The challenge of a general solution is that of deriving a bijective transformation of symbolic sequences into bi-dimensional planes. More specifically, it requires the regular fractal nesting of polygons. A first attempt at a general solution was proposed by Fiser 1994 by using non-overlapping circles that contain the polygons. This was used as a starting point to identify a more efficient solution where the encapsulating circles can overlap without the same happening for the sequence maps which are circumscribed to fractal polygon domains.
Conclusion
We identified the optimal inscribed packing solution for iterated maps of any biological sequence, indeed of any symbolic sequence. The new solution maintains the prized bijective mapping property and includes the Sierpinski triangle and the CGR square as particular solutions of the more encompassing formulation.
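The CGR square mentioned above as a particular case can be sketched in a few lines of Python. This covers only the classic four-symbol DNA square, not the generalized polygon packing proposed in the paper. The corner assignment in `CORNERS`, the starting point, and the function name are illustrative assumptions.

```python
# Assumed corner assignment for the four nucleotides on the unit square.
CORNERS = {"A": (0.0, 0.0), "C": (0.0, 1.0), "G": (1.0, 1.0), "T": (1.0, 0.0)}


def cgr_points(sequence, start=(0.5, 0.5)):
    """Return the CGR coordinates visited while reading `sequence`.

    Each symbol moves the current point halfway toward that symbol's
    corner, so every prefix of the sequence maps to one point.
    """
    x, y = start
    points = []
    for symbol in sequence.upper():
        cx, cy = CORNERS[symbol]  # raises KeyError for non-ACGT symbols
        x, y = (x + cx) / 2.0, (y + cy) / 2.0
        points.append((x, y))
    return points


print(cgr_points("AC"))  # [(0.25, 0.25), (0.125, 0.625)]
```

The quadrant of a point identifies the most recent symbol, the sub-quadrant identifies the one before it, and so on, which is what makes the mapping invertible at any finite resolution.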
doi:10.1186/1471-2105-10-100
PMCID: PMC2678093  PMID: 19335894
24.  PrepMS: TOF MS Data Graphical Preprocessing Tool 
Bioinformatics (Oxford, England)  2006;23(2):264-265.
Summary
We introduce a simple-to-use graphical tool that enables researchers to easily prepare time-of-flight mass spectrometry data for analysis. For ease of use, the graphical executable provides default parameter settings experimentally determined to work well in most situations. These values can be changed by the user if desired. PrepMS is a stand-alone application made freely available (open source), and is under the General Public License (GPL). Its graphical user interface, default parameter settings, and display plots allow PrepMS to be used effectively for data preprocessing, peak detection, and visual data quality assessment.
doi:10.1093/bioinformatics/btl583
PMCID: PMC2633108  PMID: 17121773
25.  Exploratory Analysis of the Copy Number Alterations in Glioblastoma Multiforme 
PLoS ONE  2008;3(12):e4076.
Background
The Cancer Genome Atlas project (TCGA) has initiated the analysis of multiple samples of a variety of tumor types, starting with glioblastoma multiforme. The analytical methods encompass genomic and transcriptomic information, as well as demographic and clinical data about the sample donors. The data create the opportunity for a systematic screening of the components of the molecular machinery for features that may be associated with tumor formation. The wealth of existing mechanistic information about cancer cell biology provides a natural reference for the exploratory exercise.
Methodology/Principal Findings
Glioblastoma multiforme DNA copy number data were generated by The Cancer Genome Atlas project for 167 patients using 227 aCGH experiments, and were analyzed to build a catalog of aberrant regions. Genome screening was performed using an information theory approach in order to quantify aberration as a deviation from a centrality without the bias of untested assumptions about its parametric nature. A novel Cancer Genome Browser software application was developed and is made public to provide a user-friendly graphical interface in which the reported results can be reproduced. The application source code and stand-alone executable are available at http://code.google.com/p/cancergenome and http://bioinformaticstation.org, respectively.
Conclusions/Significance
The most important known copy number alterations for glioblastoma were correctly recovered using entropy as a measure of aberration. Additional alterations were identified in different pathways, such as cell proliferation, cell junctions and neural development. Moreover, novel candidates for oncogenes and tumor suppressors were also detected. A detailed map of aberrant regions is provided.
doi:10.1371/journal.pone.0004076
PMCID: PMC2605252  PMID: 19115005
