1.  Data model, dictionaries, and desiderata for biomolecular simulation data indexing and sharing 
Background
Few environments have been developed or deployed to widely share biomolecular simulation data or to enable collaborative networks to facilitate data exploration and reuse. As the amount and complexity of data generated by these simulations increase dramatically and the methods are being applied more widely, the need for new tools to manage and share these data has become obvious. In this paper we present the results of a process aimed at assessing the needs of the community for data representation standards to guide the implementation of future repositories for biomolecular simulations.
Results
We introduce a list of common data elements, inspired by previous work, and updated according to feedback from the community collected through a survey and personal interviews. These data elements integrate the concepts for multiple types of computational methods, including quantum chemistry and molecular dynamics. The identified core data elements were organized into a logical model to guide the design of new databases and application programming interfaces. Finally, a set of dictionaries was implemented to be used via SQL queries or locally via a Java API built upon the Apache Lucene text-search engine.
Conclusions
The model and its associated dictionaries provide a simple yet rich representation of the concepts related to biomolecular simulations, which should guide future developments of repositories and more complex terminologies and ontologies. The model remains extensible through the decomposition of virtual experiments into tasks and parameter sets, and through the use of extended attributes. The benefits of a common logical model for biomolecular simulations were illustrated through various use cases, including data storage, indexing, and presentation. All the models and dictionaries introduced in this paper are available for download at http://ibiomes.chpc.utah.edu/mediawiki/index.php/Downloads.
doi:10.1186/1758-2946-6-4
PMCID: PMC3915074  PMID: 24484917
Biomolecular simulations; Molecular dynamics; Computational chemistry; Data model; Repository; XML; UML
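As a rough illustration of the extensibility described in the conclusions above, the sketch below models a virtual experiment decomposed into tasks and parameter sets with extended attributes. The class and field names are assumptions made for this example, not the authors' published model, and the actual dictionaries are accessed through SQL or a Lucene-backed Java API rather than Python.

from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class ParameterSet:
    # e.g. {"ensemble": "NPT", "timestep_fs": "2"}
    parameters: Dict[str, str] = field(default_factory=dict)

@dataclass
class Task:
    # a single computational step (e.g. minimization, equilibration, production MD)
    name: str
    parameter_set: ParameterSet
    # extended attributes allow site- or method-specific metadata without schema changes
    extended_attributes: Dict[str, str] = field(default_factory=dict)

@dataclass
class VirtualExperiment:
    title: str
    method_family: str          # e.g. "molecular dynamics" or "quantum chemistry"
    tasks: List[Task] = field(default_factory=list)

# Example: a molecular dynamics experiment decomposed into two tasks
exp = VirtualExperiment(
    title="Solvated protein benchmark",
    method_family="molecular dynamics",
    tasks=[
        Task("minimization", ParameterSet({"algorithm": "steepest descent"})),
        Task("production", ParameterSet({"ensemble": "NPT", "timestep_fs": "2"}),
             extended_attributes={"gpu": "yes"}),
    ],
)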
2.  Automatic Extraction of Nanoparticle Properties Using Natural Language Processing: NanoSifter an Application to Acquire PAMAM Dendrimer Properties 
PLoS ONE  2014;9(1):e83932.
In this study, we demonstrate the use of natural language processing methods to extract, from nanomedicine literature, numeric values of biomedical property terms of poly(amidoamine) dendrimers. We have developed a method for extracting these values for properties taken from the NanoParticle Ontology, using the General Architecture for Text Engineering and a Nearly-New Information Extraction System. We also created a method for associating the identified numeric values with their corresponding dendrimer properties, called NanoSifter.
We demonstrate that our system can correctly extract numeric values of dendrimer properties reported in the cancer treatment literature with high recall, precision, and f-measure. The micro-averaged recall was 0.99, precision was 0.84, and f-measure was 0.91. Similarly, the macro-averaged recall was 0.99, precision was 0.87, and f-measure was 0.92. To our knowledge, these results are the first application of text mining to extract and associate dendrimer property terms and their corresponding numeric values.
doi:10.1371/journal.pone.0083932
PMCID: PMC3879259  PMID: 24392101
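The abstract above describes associating numeric values with dendrimer property terms. The following sketch shows one simple way such an association step could look; the property vocabulary, regular expression, and first-value-after-mention heuristic are illustrative assumptions and do not reproduce the GATE/ANNIE pipeline used by NanoSifter.

import re

# Tiny illustrative property vocabulary; the real system draws its terms from the NanoParticle Ontology
PROPERTY_TERMS = ["zeta potential", "hydrodynamic diameter", "molecular weight"]
NUMBER = re.compile(r"[-+]?\d+(?:\.\d+)?")

def associate(sentence: str):
    """Pair each property term found in the sentence with the first numeric value that follows it."""
    lowered = sentence.lower()
    numbers = [(m.start(), float(m.group())) for m in NUMBER.finditer(lowered)]
    pairs = []
    for term in PROPERTY_TERMS:
        pos = lowered.find(term)
        if pos >= 0:
            following = [value for (start, value) in numbers if start > pos]
            if following:
                pairs.append((term, following[0]))
    return pairs

print(associate("The dendrimer had a zeta potential of 12.3 mV and a hydrodynamic diameter of 4.5 nm."))
# [('zeta potential', 12.3), ('hydrodynamic diameter', 4.5)]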
3.  A Parallel Genetic Algorithm to Discover Patterns in Genetic Markers that Indicate Predisposition to Multifactorial Disease 
Computers in biology and medicine  2008;38(7):826-836.
This paper describes a novel algorithm to analyze genetic linkage data using pattern recognition techniques and genetic algorithms (GA). The method allows a search for regions of the chromosome that may contain genetic variations that jointly predispose individuals to a particular disease. The method uses correlation analysis, filtering theory, and GA to achieve this goal. Because current genome scans use from hundreds to hundreds of thousands of markers, two versions of the method have been implemented. The first is an exhaustive analysis version that can be used to visualize, explore, and analyze small genetic data sets for two-marker correlations; the second is a GA version, which uses a parallel implementation allowing searches of higher-order correlations in large data sets. Results on simulated data sets indicate that the method can be informative in the identification of major disease loci and gene-gene interactions in genome-wide linkage data and that further exploration of these techniques is justified. The results presented for both variants of the method show that it can help genetic epidemiologists identify promising combinations of genetic factors that might predispose to complex disorders. In particular, the correlation analysis of IBD expression patterns might hint at possible gene-gene interactions, and the filtering might be a fruitful approach to distinguish true correlation signals from noise.
doi:10.1016/j.compbiomed.2008.04.011
PMCID: PMC2532987  PMID: 18547558
Gene-Gene Interactions; Multifactorial Diseases; Pattern Recognition; Data Mining; Correlation Analysis; Parallel Genetic Algorithm
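As a sketch of the exhaustive two-marker variant described above, the code below scores every marker pair by the correlation of a simple interaction term with a binary phenotype. The random data, 0/1/2 coding, and interaction score are assumptions for illustration; the parallel GA version replaces the exhaustive loop for large marker panels.

import numpy as np
from itertools import combinations

rng = np.random.default_rng(0)
n_individuals, n_markers = 200, 50
markers = rng.integers(0, 3, size=(n_individuals, n_markers))   # e.g. IBD sharing coded 0/1/2
phenotype = rng.integers(0, 2, size=n_individuals)              # affected / unaffected

def pair_score(i, j):
    """Absolute correlation of a simple marker-pair interaction term with the phenotype."""
    interaction = markers[:, i] * markers[:, j]
    if interaction.std() == 0:
        return 0.0
    return abs(np.corrcoef(interaction, phenotype)[0, 1])

# Exhaustive two-marker scan (feasible for small panels; the GA replaces this loop for large ones)
best = max(combinations(range(n_markers), 2), key=lambda p: pair_score(*p))
print("best pair:", best, "score:", round(pair_score(*best), 3))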
4.  Utility of gene-specific algorithms for predicting pathogenicity of uncertain gene variants 
The rapid advance of gene sequencing technologies has produced an unprecedented rate of discovery of genome variation in humans. A growing number of authoritative clinical repositories archive gene variants and disease phenotypes, yet there are currently many more gene variants that lack clear annotation or disease association. To date, there has been very limited coverage of gene-specific predictors in the literature. Here, an evaluation is presented of “gene-specific” predictor models based on a naïve Bayesian classifier for 20 gene–disease datasets containing 3986 variants with clinically characterized patient conditions. The utility of gene-specific prediction is then compared with “all-gene” generalized prediction and with existing popular predictors. Gene-specific computational prediction models derived from clinically curated gene variant disease datasets often outperform established generalized algorithms for novel and uncertain gene variants.
doi:10.1136/amiajnl-2011-000309
PMCID: PMC3277614  PMID: 22037892
Amino acid properties; gene variant classification; machine learning; phenotype prediction; bioinformatics; gene variants classification; gene disease database; developing/using computerized provider order entry; designing usable (responsive) resources and systems; methods for integration of information from disparate sources; high-performance and large-scale computing; distributed systems; agents; software engineering: architecture; data exchange; communication; integration across care settings (inter- and intra-enterprise); system implementation and management issues; languages; computational methods; statistical analysis of large datasets; advanced algorithms; identifying genome and protein structure and function; detecting disease outbreaks and biological threats; visualization of data and knowledge
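A minimal sketch of the gene-specific versus all-gene comparison described above, assuming a naïve Bayesian classifier over a few numeric amino-acid property features. The gene names, features, and random training data are placeholders, not the curated gene–disease datasets used in the study.

import numpy as np
from sklearn.naive_bayes import GaussianNB

# Hypothetical training table: one row per variant, amino-acid property features, pathogenic/benign label
rng = np.random.default_rng(1)
genes = np.array(["MYH7", "MYBPC3"] * 50)       # illustrative gene names
X = rng.normal(size=(100, 3))                   # e.g. change in hydrophobicity, volume, charge
y = rng.integers(0, 2, size=100)                # 1 = pathogenic, 0 = benign

# "Gene-specific" prediction: one classifier per gene-disease dataset ...
gene_models = {g: GaussianNB().fit(X[genes == g], y[genes == g]) for g in np.unique(genes)}

# ... versus an "all-gene" generalized model trained on the pooled data.
all_gene_model = GaussianNB().fit(X, y)

new_variant = rng.normal(size=(1, 3))
print("gene-specific p(pathogenic):", gene_models["MYH7"].predict_proba(new_variant)[0, 1])
print("all-gene      p(pathogenic):", all_gene_model.predict_proba(new_variant)[0, 1])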
5.  Consensus: a framework for evaluation of uncertain gene variants in laboratory test reporting 
Genome Medicine  2012;4(5):48.
Accurate interpretation of gene testing is a key component in customizing patient therapy. Where confirming evidence for a gene variant is lacking, computational prediction may be employed. A standardized framework, however, does not yet exist for quantitative evaluation of disease association for uncertain or novel gene variants in an objective manner. Here, complementary predictors for missense gene variants were incorporated into a weighted Consensus framework that includes calculated reference intervals from known disease outcomes. Data visualization for clinical reporting is also discussed.
doi:10.1186/gm347
PMCID: PMC3506914  PMID: 22640420
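The following sketch illustrates the general idea of a weighted consensus score compared against reference intervals derived from variants with known outcomes. The predictor names, weights, and interval boundaries are made-up values, not those of the published framework.

def consensus_score(scores: dict, weights: dict) -> float:
    """Weighted combination of individual predictor scores, each assumed scaled to [0, 1]."""
    total_weight = sum(weights[name] for name in scores)
    return sum(weights[name] * s for name, s in scores.items()) / total_weight

# Hypothetical weights and per-predictor scores for one missense variant
weights = {"predictor_A": 0.5, "predictor_B": 0.3, "predictor_C": 0.2}
variant = {"predictor_A": 0.91, "predictor_B": 0.74, "predictor_C": 0.88}
score = consensus_score(variant, weights)

# Reference intervals calculated from variants with known disease outcomes (illustrative values)
benign_interval = (0.05, 0.35)
pathogenic_interval = (0.70, 0.98)
if pathogenic_interval[0] <= score <= pathogenic_interval[1]:
    print(f"{score:.2f}: within the pathogenic reference interval")
elif benign_interval[0] <= score <= benign_interval[1]:
    print(f"{score:.2f}: within the benign reference interval")
else:
    print(f"{score:.2f}: uncertain / outside both reference intervals")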
6.  Identification of pneumonia and influenza deaths using the death certificate pipeline 
Background
Death records are a rich source of data, which can be used to assist with public health surveillance and/or decision support. However, to use this type of data for such purposes it has to be transformed into a coded, computable format. Because the cause of death in the certificates is reported as free text, encoding the data is currently the single largest barrier to using death certificates for surveillance. Therefore, the purpose of this study was to demonstrate the feasibility of using a pipeline, composed of a detection rule and a natural language processor, for the real-time encoding of death certificates, using the identification of pneumonia and influenza cases as an example and demonstrating that its accuracy is comparable to that of existing methods.
Results
A Death Certificates Pipeline (DCP) was developed to automatically code death certificates and identify pneumonia and influenza cases. The pipeline used MetaMap to code death certificates from the Utah Department of Health for the year 2008. The output of MetaMap was then processed by detection rules, which flagged pneumonia and influenza cases based on the Centers for Disease Control and Prevention (CDC) case definition. The output from the DCP was compared with the current method used by the CDC and with a keyword search. Recall, precision, positive predictive value, and F-measure with respect to the CDC method were calculated for the two other methods considered here. The two techniques compared here with the CDC method showed the following recall/precision results: DCP, 0.998/0.98; keyword searching, 0.96/0.96. The F-measures were 0.99 and 0.96, respectively. Both the keyword search and the DCP can run in interactive form with modest computer resources, but the DCP showed superior performance.
Conclusion
The pipeline proposed here for coding death certificates and the detection of cases is feasible and can be extended to other conditions. This method provides an alternative that allows for coding free-text death certificates in real time that may increase its utilization not only in the public health domain but also for biomedical researchers and developers.
Trial Registration
This study did not involve any clinical trials.
doi:10.1186/1472-6947-12-37
PMCID: PMC3444937  PMID: 22569097
Public health informatics; Natural language processing; Surveillance; Pneumonia and influenza
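A toy sketch of the two pieces the pipeline combines: a detection rule applied to coded concepts, and the evaluation metrics reported above. The concept identifiers and records are placeholders; the real DCP applies its rule to MetaMap output against the CDC case definition.

def flag_p_and_i(concepts: set) -> bool:
    """Detection rule: flag a certificate whose coded concepts include pneumonia or influenza.
    The concept identifiers below are placeholders, not the actual concepts used by the DCP."""
    return bool(concepts & {"PNEUMONIA", "INFLUENZA"})

def recall_precision_f(flags, reference):
    """Recall, precision, and F-measure of the flags against a reference labeling."""
    tp = sum(1 for f, r in zip(flags, reference) if f and r)
    fp = sum(1 for f, r in zip(flags, reference) if f and not r)
    fn = sum(1 for f, r in zip(flags, reference) if not f and r)
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    f = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return recall, precision, f

# Toy comparison against a reference labeling (standing in for the CDC method)
records = [{"PNEUMONIA", "SEPSIS"}, {"DIABETES"}, {"INFLUENZA"}, {"COPD"}]
reference = [True, False, True, False]
print(recall_precision_f([flag_p_and_i(c) for c in records], reference))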
7.  Chemical shift tensors: Theory and application to molecular structural problems 
doi:10.1016/j.pnmrs.2010.10.003
PMCID: PMC3058154  PMID: 21397119
NMR chemical shifts; NMR shielding; Molecular structure
8.  Transition from exo- to endo- Cu absorption in CuSin clusters: A Genetic Algorithms Density Functional Theory (DFT) Study 
Molecular simulation  2011;37(8):678-688.
The characterization and prediction of the structures of metal-silicon clusters is important for nanotechnology research because these clusters can be used as building blocks for nanodevices, integrated circuits, and solar cells. Several authors have postulated that there is a transition from exo to endo absorption of Cu in Sin clusters and have shown that for n larger than 9 it is possible to find endohedral clusters. Unfortunately, no global searches have confirmed this observation, which is based on local optimizations of plausible structures. Here we use parallel Genetic Algorithms (GA), as implemented in our MGAC software, directly coupled with DFT energy calculations to show that the global search of CuSin cluster structures does not find endohedral clusters for n < 8 but finds them for n ≥ 10.
doi:10.1080/08927020903583830
PMCID: PMC3139224  PMID: 21785526
copper-silicon clusters; genetic algorithms; global optimization
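The sketch below shows the general shape of a genetic algorithm whose fitness comes from an external energy evaluation, which is the coupling the abstract describes. The toy energy function stands in for the DFT call, and the representation and operators are simplified assumptions rather than the MGAC implementation.

import random

def evaluate_energy(structure):
    """Placeholder for the external DFT energy call; here just a toy quadratic function."""
    return sum((x - 0.5) ** 2 for x in structure)

def crossover(a, b):
    cut = random.randrange(1, len(a))
    return a[:cut] + b[cut:]

def mutate(s, rate=0.1):
    return [x + random.gauss(0, 0.05) if random.random() < rate else x for x in s]

# Minimal GA loop: candidate "structures" are coordinate vectors; lower energy = fitter
random.seed(0)
population = [[random.random() for _ in range(6)] for _ in range(20)]
for generation in range(30):
    population.sort(key=evaluate_energy)
    parents = population[:10]                      # keep the lowest-energy half
    children = [mutate(crossover(random.choice(parents), random.choice(parents)))
                for _ in range(10)]
    population = parents + children
print("best energy:", round(min(evaluate_energy(s) for s in population), 4))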
10.  Towards crystal structure prediction of complex organic compounds – a report on the fifth blind test 
The results of the fifth blind test of crystal structure prediction, which show important success with more challenging large and flexible molecules, are presented and discussed.
Following on from the success of the previous crystal structure prediction blind tests (CSP1999, CSP2001, CSP2004 and CSP2007), a fifth such collaborative project (CSP2010) was organized at the Cambridge Crystallographic Data Centre. A range of methodologies was used by the participating groups in order to evaluate the ability of the current computational methods to predict the crystal structures of the six organic molecules chosen as targets for this blind test. The first four targets, two rigid molecules, one semi-flexible molecule and a 1:1 salt, matched the criteria for the targets from CSP2007, while the last two targets belonged to two new challenging categories – a larger, much more flexible molecule and a hydrate with more than one polymorph. Each group submitted three predictions for each target it attempted. There was at least one successful prediction for each target, and two groups were able to successfully predict the structure of the large flexible molecule as their first place submission. The results show that while not as many groups successfully predicted the structures of the three smallest molecules as in CSP2007, there is now evidence that methodologies such as dispersion-corrected density functional theory (DFT-D) are able to reliably do so. The results also highlight the many challenges posed by more complex systems and show that there are still issues to be overcome.
doi:10.1107/S0108768111042868
PMCID: PMC3222142  PMID: 22101543
prediction; blind test; polymorph; crystal structure prediction
11.  Predicting the start week of respiratory syncytial virus outbreaks using real time weather variables 
Background
Respiratory Syncytial Virus (RSV), a major cause of bronchiolitis, has a large impact on the census of pediatric hospitals during outbreak seasons. Reliable prediction of the week these outbreaks will start, based on readily available data, could help pediatric hospitals better prepare for large outbreaks.
Methods
Naïve Bayes (NB) classifier models were constructed using weather data from 1985-2008, considering only variables that are available in real time and that could be used to forecast the week in which an RSV outbreak will occur in Salt Lake County, Utah. Outbreak start dates were determined by a panel of experts using 32,509 records with ICD-9-coded RSV and bronchiolitis diagnoses from Intermountain Healthcare hospitals and clinics for the RSV seasons from 1985 to 2008.
Results
NB models predicted RSV outbreaks up to 3 weeks in advance with an estimated sensitivity of up to 67% and estimated specificities as high as 94% to 100%. Temperature and wind speed were the best overall predictors, but other weather variables also showed relevance depending on how far in advance the predictions were made. The weather conditions predictive of an RSV outbreak in our study were similar to those that lead to temperature inversions in the Salt Lake Valley.
Conclusions
We demonstrate that Naïve Bayes (NB) classifier models based on weather data available in real time have the potential to be used as effective predictive models. These models may be able to predict the week that an RSV outbreak will occur with clinical relevance. Their clinical usefulness will be field tested during the next five years.
doi:10.1186/1472-6947-10-68
PMCID: PMC2987968  PMID: 21044325
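A minimal sketch of how a naïve Bayes outbreak-start predictor with a fixed lead time might be set up, assuming weekly weather features and a binary outbreak-start label. The random data, feature set, and three-week lead are illustrative stand-ins for the Salt Lake County data described above.

import numpy as np
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(2)
n_weeks = 300
weather = rng.normal(size=(n_weeks, 3))          # e.g. temperature, wind speed, humidity (stand-ins)
outbreak_start = rng.random(n_weeks) < 0.05      # True = an RSV outbreak starts this week

lead = 3                                         # predict the start 3 weeks in advance
X = weather[:-lead]
y = outbreak_start[lead:]                        # label each week with what happens `lead` weeks later

model = GaussianNB().fit(X[:250], y[:250])
pred = model.predict(X[250:])
truth = y[250:]
sensitivity = (pred & truth).sum() / max(truth.sum(), 1)
specificity = (~pred & ~truth).sum() / max((~truth).sum(), 1)
print(f"sensitivity={sensitivity:.2f} specificity={specificity:.2f}")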
12.  Crystal Structure Prediction (CSP) of Flexible Molecules using Parallel Genetic Algorithms with a Standard Force Field 
Journal of computational chemistry  2009;30(13):1973-1985.
This paper describes the application of our distributed computing framework for crystal structure prediction (CSP), Modified Genetic Algorithms for Crystal and Cluster Prediction (MGAC), to predict the crystal structure of flexible molecules using the General Amber Force Field (GAFF) and the CHARMM program. The MGAC distributed computing framework includes a series of tightly integrated computer programs for generating the molecule’s force field, sampling crystal structures using a distributed parallel genetic algorithm, and locally minimizing the energy of the structures, followed by the classifying, sorting, and archiving of the most relevant structures. Our results indicate that the method can consistently find the experimentally known crystal structures of flexible molecules, but the number of missing structures and the poor ranking observed in some crystals show the need for further improvement of the potential.
doi:10.1002/jcc.21189
PMCID: PMC2720422  PMID: 19130496
14.  SaTScan on a Cloud: On-Demand Large Scale Spatial Analysis of Epidemics 
Online Journal of Public Health Informatics  2010;2(1):ojphi.v2i1.2910.
By using cloud computing it is possible to provide on-demand resources for epidemic analysis using computationally intensive applications like SaTScan. Using 15 virtual machines (VMs) on the Nimbus cloud, we were able to reduce the total execution time for the same ensemble run from 8896 seconds on a single machine to 842 seconds in the cloud. Using the caBIG tools and our iterative software development methodology, implementing the SaTScan cloud system took approximately 200 man-hours, which represents an effort that can be secured within the resources available at state health departments. The approach proposed here is technically advantageous and practically possible.
doi:10.5210/ojphi.v2i1.2910
PMCID: PMC3615753  PMID: 23569576
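From the timings reported above, the implied speedup and parallel efficiency can be computed directly:

# Speedup and parallel efficiency implied by the reported timings
serial_seconds = 8896   # single-machine ensemble run
cloud_seconds = 842     # same ensemble on 15 Nimbus virtual machines
vms = 15

speedup = serial_seconds / cloud_seconds      # ≈ 10.6x
efficiency = speedup / vms                    # ≈ 0.70
print(f"speedup = {speedup:.1f}x, parallel efficiency = {efficiency:.0%}")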
15.  Characterization of uncertainty in the classification of multivariate assays: application to PAM50 centroid-based genomic predictors for breast cancer treatment plans 
Background
Multivariate assays (MVAs) for assisting clinical decisions are becoming commonly available, but due to complexity, are often considered a high-risk approach. A key concern is that uncertainty on the assay's final results is not well understood. This study focuses on developing a process to characterize error introduced in the MVA's results from the intrinsic error in the laboratory process: sample preparation and measurement of the contributing factors, such as gene expression.
Methods
Using the PAM50 Breast Cancer Intrinsic Classifier, we show how to characterize error within an MVA, and how these errors may affect results reported to clinicians. First we estimated the error distribution for measured factors within the PAM50 assay by performing repeated measures on four archetypal samples representative of the major breast cancer tumor subtypes. Then, using the error distributions and the original archetypal sample data, we used Monte Carlo simulations to generate a sufficient number of simulated samples. The effect of these errors on the PAM50 tumor subtype classification was estimated by measuring subtype reproducibility after classifying all simulated samples. Subtype reproducibility was measured as the percentage of simulated samples classified identically to the parent sample. The simulation was thereafter repeated on a large, independent data set of samples from the GEICAM 9906 clinical trial. Simulated samples from the GEICAM sample set were used to explore a more realistic scenario where, unlike archetypal samples, many samples are not easily classified.
Results
All simulated samples derived from the archetypal samples were classified identically to the parent sample. Subtypes for simulated samples from the GEICAM set were also highly reproducible, but there was a non-negligible number of samples that exhibited significant variability in their classification.
Conclusions
We have developed a general methodology to estimate the effects of intrinsic errors within MVAs. We have applied the method to the PAM50 assay, showing that the PAM50 results are resilient to intrinsic errors within the assay, but also finding that in non-archetypal samples, experimental errors can lead to quite different classifications of a tumor. Finally, we propose a way to provide the uncertainty information in a usable way for clinicians.
doi:10.1186/2043-9113-1-37
PMCID: PMC3275466  PMID: 22196354
Multivariate Assays; PAM50; Monte Carlo Simulations; Breast Cancer
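A compact sketch of the Monte Carlo procedure described above, assuming a nearest-centroid subtype call by correlation and Gaussian measurement error. The centroids, gene count, subtype names, and noise level are illustrative stand-ins, not the PAM50 assay or its estimated error distributions.

import numpy as np

rng = np.random.default_rng(3)
n_genes = 50
# Illustrative subtype centroids and a parent sample (random stand-ins for the real gene set)
centroids = {name: rng.normal(size=n_genes) for name in ["LumA", "LumB", "Her2", "Basal"]}
parent = centroids["LumA"] + rng.normal(scale=0.2, size=n_genes)

def classify(profile):
    """Nearest-centroid call by Pearson correlation, as in centroid-based predictors."""
    return max(centroids, key=lambda s: np.corrcoef(profile, centroids[s])[0, 1])

parent_call = classify(parent)

# Monte Carlo: perturb the parent with the estimated measurement error and re-classify
noise_sd = 0.3                      # stands in for the error distribution estimated from replicates
n_sim = 1000
same = sum(classify(parent + rng.normal(scale=noise_sd, size=n_genes)) == parent_call
           for _ in range(n_sim))
print(f"subtype reproducibility: {same / n_sim:.1%}")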
16.  A case for using grid architecture for state public health informatics: the Utah perspective 
This paper presents the rationale for designing and implementing the next generation of public health information systems using grid computing concepts and tools. We attempt to evaluate all grid types, including data grids for sharing information and computational grids for accessing computational resources on demand. Public health is a broad domain that requires coordinated use of disparate and heterogeneous information systems. System interoperability in public health is limited. The next generation of public health information systems must overcome barriers to integration and interoperability, leverage advances in information technology, address emerging requirements, and meet the needs of all stakeholders. Grid-based architecture provides one potential technical solution that deserves serious consideration. Within this context, we describe three discrete public health information system problems and the process by which the Utah Department of Health (UDOH) and the Department of Biomedical Informatics at the University of Utah in the United States have approached the exploration and eventual deployment of a Utah Public Health Informatics Grid. These three problems are: i) integrating internal and external data sources with analytic tools and computational resources; ii) providing external stakeholders with access to public health data and services; and iii) accessing, integrating, and analyzing internal data for the timely monitoring of population health status and health services. After one year of experience, we have successfully implemented federated queries across disparate administrative domains and have identified challenges and potential solutions concerning the selection of candidate analytic grid services, data sharing concerns, security models, and strategies for reducing the expertise required at a public health agency to implement a public health grid.
doi:10.1186/1472-6947-9-32
PMCID: PMC2707374  PMID: 19545428
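As a rough illustration of the federated-query pattern mentioned above, the sketch below fans one query out to several stub "nodes" and merges the partial counts. The node functions and result format are assumptions for this example; a real grid deployment would query remote services across administrative domains with appropriate security controls.

from concurrent.futures import ThreadPoolExecutor

# Stub "nodes" standing in for data services at different administrative domains;
# real grid nodes would be remote services, not local functions.
def hospital_node(criteria):
    return {"cases": 12}

def lab_node(criteria):
    return {"cases": 7}

def registry_node(criteria):
    return {"cases": 3}

NODES = [hospital_node, lab_node, registry_node]

def federated_query(criteria):
    """Fan the same query out to every node and merge the partial results."""
    with ThreadPoolExecutor() as pool:
        partials = list(pool.map(lambda node: node(criteria), NODES))
    return {"cases": sum(p["cases"] for p in partials), "nodes_queried": len(partials)}

print(federated_query({"condition": "influenza", "week": "2009-W42"}))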
