1.  Software platform virtualization in chemistry research and university teaching 
Background
Modern chemistry laboratories operate with a wide range of software applications under different operating systems, such as Windows, Linux, or Mac OS X. Instead of installing software on different computers, it is possible to install those applications on a single computer using virtual machine software. Software platform virtualization allows a single host operating system to execute multiple guest operating systems on the same computer. We apply and discuss the use of virtual machines in chemistry research and teaching laboratories.
Results
Virtual machines are commonly used for cheminformatics software development and testing. By benchmarking multiple chemistry software packages, we have confirmed that the computational speed penalty for using virtual machines is low, at around 5% to 10%. Software virtualization in a teaching environment allows faster deployment and easy use of commercial and open-source software in hands-on computer teaching labs.
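Such a penalty is straightforward to quantify: time the same workload natively and inside the VM, then take the relative difference. A minimal sketch in Python, with a toy kernel standing in for a real chemistry workload (all names and numbers illustrative):

```python
import time

def kernel(n=2_000_000):
    # Stand-in for a chemistry compute workload (illustrative only).
    s = 0.0
    for i in range(1, n):
        s += (i ** 0.5) / i
    return s

def best_time(fn, repeats=5):
    # Best-of-N wall-clock time, reducing scheduler noise.
    times = []
    for _ in range(repeats):
        t0 = time.perf_counter()
        fn()
        times.append(time.perf_counter() - t0)
    return min(times)

if __name__ == "__main__":
    print(f"kernel time: {best_time(kernel):.3f} s")
    # Run once on the host OS and once inside the VM, then:
    # penalty = (t_vm - t_native) / t_native   # the abstract reports ~5-10%
```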
Conclusion
Software virtualization in chemistry, mass spectrometry and cheminformatics is needed for software testing and the development of software for different operating systems. In order to obtain maximum performance, the virtualization software should be multi-core enabled and allow the use of multiprocessor configurations in the virtual machine environment. Server consolidation, by running multiple tasks and operating systems on a single physical machine, can lead to lower maintenance and hardware costs, especially in small research labs. The use of virtual machines can prevent software virus infections and security breaches when used as a sandbox system for internet access and software testing. Complex software setups can be created with virtual machines and easily deployed later to multiple computers for hands-on teaching classes. We discuss the popularity of bioinformatics compared to cheminformatics, as well as the lack of cheminformatics education at universities worldwide.
doi:10.1186/1758-2946-1-18
PMCID: PMC2820496  PMID: 20150997
2.  MOLA: a bootable, self-configuring system for virtual screening using AutoDock4/Vina on computer clusters 
Background
Virtual screening of small molecules using molecular docking has become an important tool in drug discovery. However, large-scale virtual screening is time-consuming and usually requires dedicated computer clusters. There are a number of software tools that perform virtual screening using AutoDock4, but they require access to dedicated Linux computer clusters. Also, no software is available for performing virtual screening with Vina on computer clusters. In this paper we present MOLA, an easy-to-use graphical user interface tool that automates parallel virtual screening using AutoDock4 and/or Vina on bootable, non-dedicated computer clusters.
Implementation
MOLA automates several tasks, including ligand preparation, distribution of parallel AutoDock4/Vina jobs, and result analysis. When the virtual screening project finishes, an OpenOffice spreadsheet file opens with the ligands ranked by binding energy and distance to the active site. All result files can automatically be recorded to a USB flash drive or to the hard disk drive using VirtualBox. MOLA works inside a customized Live CD GNU/Linux operating system, developed by us, that bypasses the operating system installed on the computers used in the cluster. This operating system boots from a CD on the master node and then clusters other computers as slave nodes via Ethernet connections.
Conclusion
MOLA is an ideal virtual screening tool for non-experienced users with a limited number of multi-platform heterogeneous computers available and no access to dedicated Linux computer clusters. When a virtual screening project finishes, the computers can simply be restarted to their original operating system. The originality of MOLA lies in the fact that any available computer, regardless of platform, can be added to the cluster without ever using the computer's hard disk drive and without interfering with the installed operating system. With a cluster of 10 processors, and a potential maximum speed-up of 10×, the parallel algorithm of MOLA achieved a speed-up of 8.64× using AutoDock4 and 8.60× using Vina.
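Read as parallel efficiency, these figures work out to roughly 86%; a two-line check (the speed-ups and processor count come from the abstract):

```python
def parallel_efficiency(speedup, n_procs):
    # Efficiency = achieved speed-up / ideal speed-up (one per processor).
    return speedup / n_procs

print(parallel_efficiency(8.64, 10))  # AutoDock4: 0.864 -> 86.4%
print(parallel_efficiency(8.60, 10))  # Vina:      0.860 -> 86.0%
```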
doi:10.1186/1758-2946-2-10
PMCID: PMC2987878  PMID: 21029419
3.  A comparative analysis of dynamic grids vs. virtual grids using the A3pviGrid framework 
Bioinformation  2010;5(5):186-190.
With the proliferation of quad-/multi-core microprocessors in mainstream platforms such as desktops and workstations, a large number of unused CPU cycles can be utilized for running virtual machines (VMs) as dynamic nodes in distributed environments. Grid services and their service-oriented business broker, now termed cloud computing, could deploy image-based virtualization platforms, enabling agent-based resource management and dynamic fault management. In this paper we present an efficient way of utilizing heterogeneous virtual machines on idle desktops as an environment for the consumption of high-performance grid services. Spurious and exponential increases in the size of datasets are constant concerns in the medical and pharmaceutical industries due to the constant discovery and publication of large sequence databases. Traditional algorithms are not designed to handle large data sizes under sudden and dynamic changes in the execution environment, as previously discussed. This research was undertaken to compare our previous results with those obtained by running the same test dataset on a virtual grid platform using virtual machines (virtualization). The implemented architecture, A3pviGrid, utilizes game-theoretic optimization and agent-based team formation (coalition) algorithms to improve scalability with respect to team formation. Due to the dynamic nature of distributed systems (as discussed in our previous work), all interactions were made local within a team, transparently. This paper is a proof of concept of an experimental mini-grid test-bed compared with running the platform on local virtual machines on a local test cluster. This was done to give every agent its own execution platform, enabling anonymity and better control of the dynamic environmental parameters. We also analyze the performance and scalability of BLAST in a multiple-virtual-node setup and present our findings. This paper is an extension of our previous research on improving the BLAST application framework using dynamic grids on virtualization platforms such as VirtualBox.
PMCID: PMC3040497  PMID: 21364795
Agents; Blast; Coalition; Grids; Virtual Machines and Virtualization
4.  CIS4/403: Design and Implementation of an Intranet-based system for Real-Time Tele-Consultation in Oncology 
Introduction
This study describes a tele-consultation system (TCS) developed to provide a computing environment over a Wide Area Network (WAN) in North Italy (Province of Trento), that can be used by two or more physicians to share medical data and to work co-operatively on medical records. A pilot study has been carried out in oncology to assess the effectiveness of the system. The aim of this project is to facilitate the management of oncology patients by improving communication among the specialists of central and district hospitals.
Methods and Results
The TCS is an Intranet-based solution. The Intranet is based on a PC WAN with Windows NT Server, Microsoft SQL Server, and Internet Information Server. TCS is composed of native and custom applications developed in the Microsoft Windows (9x and NT) environment. The basic component of the system is the multimedia digital medical record, structured as a collection of HTML and ASP pages. A distributed relational database allows users to store and retrieve medical records, accessed by a dedicated Web browser via the Web server. The medical data to be stored and the presentation architecture of the clinical record were determined in close collaboration with the clinicians involved in the project. TCS allows multi-point tele-consultation (TC) among two or more participants on remote computers, providing synchronized surfing through the clinical report. A set of collaborative and personal tools (a whiteboard with drawing tools, point-to-point digital audio-conferencing, chat, a local notepad, and an e-mail service) is integrated in the system to provide a user-friendly environment. TCS has a client-server architecture. The client part of the system is based on the Microsoft Web Browser control and provides the user interface and the tools described above. The server part, running at all times on a dedicated computer, accepts connection requests and manages the connections among the participants in a TC, allowing multiple TCs to run simultaneously. TCS was developed in the Visual C++ environment using the MFC library and COM technology; ActiveX controls were written in Visual Basic to perform dedicated tasks from inside the HTML clinical report. Before deployment in the hospital departments involved in the project, TCS was tested in our laboratory by clinicians involved in the project to evaluate its usability.
Discussion
TCS has the potential to support a "multi-disciplinary distributed virtual oncological meeting". Specialists from different departments and different hospitals can attend "virtual meetings" and interactively discuss medical data. An expected benefit of the "virtual meeting" is the possibility of providing expert remote advice from oncologists to peripheral cancer units in formulating treatment plans, conducting follow-up sessions and supporting clinical research.
doi:10.2196/jmir.1.suppl1.e9
PMCID: PMC1761746
Intranet; Teleconsultation; Oncology
5.  Cloud BioLinux: pre-configured and on-demand bioinformatics computing for the genomics community 
BMC Bioinformatics  2012;13:42.
Background
A steep drop in the cost of next-generation sequencing during recent years has made the technology affordable to the majority of researchers, but downstream bioinformatic analysis still poses a resource bottleneck for smaller laboratories and institutes that do not have access to substantial computational resources. Sequencing instruments are typically bundled with only the minimal processing and storage capacity required for data capture during sequencing runs. Given the scale of sequence datasets, scientific value cannot be obtained from acquiring a sequencer unless it is accompanied by an equal investment in informatics infrastructure.
Results
Cloud BioLinux is a publicly accessible Virtual Machine (VM) that enables scientists to quickly provision on-demand infrastructures for high-performance bioinformatics computing using cloud platforms. Users have instant access to a range of pre-configured command line and graphical software applications, including a full-featured desktop interface, documentation and over 135 bioinformatics packages for applications including sequence alignment, clustering, assembly, display, editing, and phylogeny. Each tool's functionality is fully described in the documentation directly accessible from the graphical interface of the VM. Besides the Amazon EC2 cloud, we have started instances of Cloud BioLinux on a private Eucalyptus cloud installed at the J. Craig Venter Institute, and demonstrated access to the bioinformatic tools interface through a remote connection to EC2 instances from a local desktop computer. Documentation for using Cloud BioLinux on EC2 is available from our project website, while a Eucalyptus cloud image and a VirtualBox appliance are also publicly available for download and use by researchers with access to private clouds.
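Provisioning such a VM on EC2 today reduces to a few API calls. A minimal sketch using boto3 (which postdates this paper; the AMI ID, key pair, and instance type below are placeholders, not values from the project — consult the project website for current images):

```python
import boto3  # AWS SDK for Python; assumes credentials are already configured

ec2 = boto3.client("ec2", region_name="us-east-1")

response = ec2.run_instances(
    ImageId="ami-XXXXXXXX",    # hypothetical Cloud BioLinux AMI ID; look up the current one
    InstanceType="m3.xlarge",  # illustrative size for bioinformatics workloads
    KeyName="my-keypair",      # hypothetical key pair for SSH access
    MinCount=1,
    MaxCount=1,
)
print("launched", response["Instances"][0]["InstanceId"])
```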
Conclusions
Cloud BioLinux provides a platform for developing bioinformatics infrastructures on the cloud. An automated and configurable process builds Virtual Machines, allowing the development of highly customized versions from a shared code base. This shared community toolkit enables application specific analysis platforms on the cloud by minimizing the effort required to prepare and maintain them.
doi:10.1186/1471-2105-13-42
PMCID: PMC3372431  PMID: 22429538
6.  Performance Analysis of the Microsoft Kinect Sensor for 2D Simultaneous Localization and Mapping (SLAM) Techniques 
Sensors (Basel, Switzerland)  2014;14(12):23365-23387.
This paper presents a performance analysis of two open-source, laser scanner-based Simultaneous Localization and Mapping (SLAM) techniques (i.e., Gmapping and Hector SLAM) using a Microsoft Kinect to replace the laser sensor. Furthermore, the paper proposes a new system integration approach whereby a Linux virtual machine is used to run the open-source SLAM algorithms. The experiments were conducted in two different environments: a small room with no features and a typical office corridor with desks and chairs. Using the data logged from real-time experiments, each SLAM technique was simulated and tested with different parameter settings. The results show that the system is able to achieve real-time SLAM operation. The system implementation offers a simple and reliable way to compare the performance of a Windows-based SLAM algorithm with the algorithms typically implemented in a Robot Operating System (ROS). The results also indicate that certain modifications to the default laser scanner-based parameters are able to improve the map accuracy. However, the limited field of view and range of the Kinect's depth sensor often cause the map to be inaccurate, especially in featureless areas; therefore, the Kinect sensor is not a direct replacement for a laser scanner, but rather offers a feasible alternative for 2D SLAM tasks.
doi:10.3390/s141223365
PMCID: PMC4299068  PMID: 25490595
Microsoft Kinect sensor; 2D SLAM; robotics; integrated system; sensor; virtual machine; Robot Operating System
7.  Maestro: An Orchestration Framework for Large-Scale WSN Simulations 
Sensors (Basel, Switzerland)  2014;14(3):5392-5414.
Contemporary wireless sensor networks (WSNs) have evolved into large and complex systems and are one of the main technologies used in cyber-physical systems and the Internet of Things. Extensive research on WSNs has led to the development of diverse solutions at all levels of software architecture, including protocol stacks for communications. This multitude of solutions is due to the limited computational power and restrictions on energy consumption that must be accounted for when designing typical WSN systems. It is therefore challenging to develop, test and validate even small WSN applications, and this process can easily consume significant resources. Simulations are inexpensive tools for testing, verifying and generally experimenting with new technologies in a repeatable fashion. Consequently, as the size of the systems to be tested increases, so does the need for large-scale simulations. This article describes a tool called Maestro for the automation of large-scale simulation and investigates the feasibility of using cloud computing facilities for such a task. Using tools that are built into Maestro, we demonstrate a feasible approach for benchmarking cloud infrastructure in order to identify cloud Virtual Machine (VM) instances that provide an optimal balance of performance and cost for a given simulation.
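Once per-instance benchmarks exist, the performance/cost balance is a simple ranking. A minimal sketch (instance names, throughputs, and prices below are hypothetical, not Maestro output):

```python
# (instance type, simulated events per second, USD per hour) -- hypothetical benchmarks
candidates = [
    ("m3.medium", 120.0, 0.067),
    ("c3.large",  310.0, 0.105),
    ("c3.xlarge", 590.0, 0.210),
]

def events_per_dollar(throughput, price_per_hour):
    # Work delivered per unit cost: higher is a better performance/cost balance.
    return throughput * 3600.0 / price_per_hour

best = max(candidates, key=lambda c: events_per_dollar(c[1], c[2]))
print("best cost/performance:", best[0])
```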
doi:10.3390/s140305392
PMCID: PMC4003997  PMID: 24647123
wireless sensor networks; simulations; cloud computing; Amazon AWS
8.  Simple re-instantiation of small databases using cloud computing 
BMC Genomics  2013;14(Suppl 5):S13.
Background
Small bioinformatics databases, unlike institutionally funded large databases, are vulnerable to discontinuation and many reported in publications are no longer accessible. This leads to irreproducible scientific work and redundant effort, impeding the pace of scientific progress.
Results
We describe a Web-accessible system, available online at http://biodb100.apbionet.org, for archival and future on-demand re-instantiation of small databases within minutes. Depositors can rebuild their databases by downloading a Linux live operating system (http://www.bioslax.com), preinstalled with bioinformatics and UNIX tools. The database and its dependencies can be compressed into an ".lzm" file for deposition. End-users can search for archived databases and activate them on dynamically re-instantiated BioSlax instances, run as virtual machines on two popular full-virtualization cloud-computing platforms, Xen Hypervisor or vSphere. The system is adaptable to increasing demand for disk storage or computational load and allows database developers to use the re-instantiated databases for integration and the development of new databases.
Conclusions
Herein, we demonstrate that a relatively inexpensive solution can be implemented for archival of bioinformatics databases and their rapid re-instantiation should the live databases disappear.
doi:10.1186/1471-2164-14-S5-S13
PMCID: PMC3852246  PMID: 24564380
Database archival; Re-instantiation; Cloud computing; BioSLAX; biodb100; MIABi
9.  A Distributed Parallel Genetic Algorithm of Placement Strategy for Virtual Machines Deployment on Cloud Platform 
The Scientific World Journal  2014;2014:259139.
The cloud platform provides various services to users. More and more cloud centers provide infrastructure as a service as their main way of operating. To improve the utilization rate of the cloud center and to decrease the operating cost, the cloud center provides services according to the requirements of users by partitioning (sharding) resources through virtualization. Considering both QoS for users and cost savings for cloud computing providers, we try to maximize performance and minimize energy cost as well. In this paper, we propose a distributed parallel genetic algorithm (DPGA) as a placement strategy for virtual machine deployment on the cloud platform. In the first stage, it executes the genetic algorithm in parallel and in a distributed manner on several selected physical hosts. It then continues to execute the genetic algorithm of the second stage, with the solutions obtained from the first stage as the initial population. The solution calculated by the genetic algorithm of the second stage is the final solution of the proposed approach. The experimental results show that the proposed placement strategy for VM deployment can ensure QoS for users and is more effective and more energy efficient than other placement strategies on the cloud platform.
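A minimal, single-process sketch of the two-stage idea, assuming a toy fitness that counts powered-on hosts (an energy proxy) and penalizes overload (a QoS proxy); the paper's actual encoding, operators, and objective are richer than this:

```python
import random

N_VMS, N_HOSTS = 12, 4
HOST_CAPACITY = 8                                   # CPU units per host (illustrative)
VM_DEMAND = [random.randint(1, 4) for _ in range(N_VMS)]

def fitness(placement):
    # Lower is better: energy proxy = number of powered-on hosts,
    # plus a heavy penalty for overloaded hosts (QoS proxy).
    load = [0] * N_HOSTS
    for vm, host in enumerate(placement):
        load[host] += VM_DEMAND[vm]
    hosts_on = sum(1 for l in load if l > 0)
    overload = sum(max(0, l - HOST_CAPACITY) for l in load)
    return hosts_on + 10 * overload

def random_placement():
    return [random.randrange(N_HOSTS) for _ in range(N_VMS)]

def crossover(a, b):
    cut = random.randrange(1, N_VMS)                # one-point crossover
    return a[:cut] + b[cut:]

def mutate(p, rate=0.1):
    return [random.randrange(N_HOSTS) if random.random() < rate else h for h in p]

def ga(population, generations=200):
    for _ in range(generations):
        population.sort(key=fitness)
        elite = population[: len(population) // 2]  # keep the better half
        population = elite + [mutate(crossover(*random.sample(elite, 2)))
                              for _ in elite]
    return min(population, key=fitness)

# Stage 1: independent GA runs, standing in for runs on separate physical hosts.
stage1 = [ga([random_placement() for _ in range(20)]) for _ in range(4)]

# Stage 2: a final GA seeded with the stage-1 winners in its initial population.
best = ga(stage1 + [random_placement() for _ in range(16)])
print("best placement:", best, "fitness:", fitness(best))
```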
doi:10.1155/2014/259139
PMCID: PMC4109368  PMID: 25097872
10.  Design and Analysis of Self-Adapted Task Scheduling Strategies in Wireless Sensor Networks 
Sensors (Basel, Switzerland)  2011;11(7):6533-6554.
In a wireless sensor network (WSN), the usage of resources is usually highly related to the execution of tasks which consume a certain amount of computing and communication bandwidth. Parallel processing among sensors is a promising solution to provide the demanded computation capacity in WSNs. Task allocation and scheduling is a typical problem in the area of high performance computing. Although task allocation and scheduling in wired processor networks has been well studied in the past, their counterparts for WSNs remain largely unexplored. Existing traditional high performance computing solutions cannot be directly implemented in WSNs due to limitations of WSNs such as limited resource availability and the shared communication medium. In this paper, a self-adapted task scheduling strategy for WSNs is presented. First, a multi-agent-based architecture for WSNs is proposed and a mathematical model of dynamic alliance is constructed for the task allocation problem. Then an effective discrete particle swarm optimization (PSO) algorithm for the dynamic alliance (DPSO-DA), with a well-designed particle position code and fitness function, is proposed. A mutation operator, which can effectively improve the algorithm's global search ability and population diversity, is also introduced. Finally, the simulation results show that the proposed solution can achieve significantly better performance than other algorithms.
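A compact illustration of a discrete PSO of this flavor, where each particle is a task-to-sensor assignment and a mutation step preserves diversity; this is a simplified, position-only variant for illustration, not the published DPSO-DA:

```python
import random

N_TASKS, N_SENSORS = 10, 4
# Hypothetical cost of running each task on each sensor node.
COST = [[random.uniform(1, 5) for _ in range(N_SENSORS)] for _ in range(N_TASKS)]

def fitness(pos):
    # Makespan proxy: total cost on the most-loaded sensor (lower is better).
    load = [0.0] * N_SENSORS
    for task, sensor in enumerate(pos):
        load[sensor] += COST[task][sensor]
    return max(load)

def step(pos, pbest, gbest, c1=0.25, c2=0.25, mut=0.05):
    # Discrete move: each component mutates, copies the personal best,
    # copies the global best, or keeps its value, with fixed probabilities.
    new = []
    for i in range(N_TASKS):
        r = random.random()
        if r < mut:
            new.append(random.randrange(N_SENSORS))  # mutation: keeps diversity
        elif r < mut + c1:
            new.append(pbest[i])
        elif r < mut + c1 + c2:
            new.append(gbest[i])
        else:
            new.append(pos[i])
    return new

swarm = [[random.randrange(N_SENSORS) for _ in range(N_TASKS)] for _ in range(20)]
pbests = [p[:] for p in swarm]
gbest = min(pbests, key=fitness)
for _ in range(100):
    for k in range(len(swarm)):
        swarm[k] = step(swarm[k], pbests[k], gbest)
        if fitness(swarm[k]) < fitness(pbests[k]):
            pbests[k] = swarm[k]
    gbest = min(pbests, key=fitness)
print("best allocation:", gbest, "makespan:", round(fitness(gbest), 2))
```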
doi:10.3390/s110706533
PMCID: PMC3231676  PMID: 22163971
wireless sensor networks; task scheduling; particle swarm optimization; dynamic alliance
11.  Large-scale automated image analysis for computational profiling of brain tissue surrounding implanted neuroprosthetic devices using Python 
In this article, we describe the use of Python for large-scale automated server-based bio-image analysis in FARSIGHT, a free and open-source toolkit of image analysis methods for quantitative studies of complex and dynamic tissue microenvironments imaged by modern optical microscopes, including confocal, multi-spectral, multi-photon, and time-lapse systems. The core FARSIGHT modules for image segmentation, feature extraction, tracking, and machine learning are written in C++, leveraging widely used libraries including ITK, VTK, Boost, and Qt. For solving complex image analysis tasks, these modules must be combined into scripts using Python. As a concrete example, we consider the problem of analyzing 3-D multi-spectral images of brain tissue surrounding implanted neuroprosthetic devices, acquired using high-throughput multi-spectral spinning disk step-and-repeat confocal microscopy. The resulting images typically contain 5 fluorescent channels. Each channel consists of 6000 × 10,000 × 500 voxels with 16 bits/voxel, implying image sizes exceeding 250 GB. These images must be mosaicked, pre-processed to overcome imaging artifacts, and segmented to enable cellular-scale feature extraction. The features are used to identify cell types, and perform large-scale analysis for identifying spatial distributions of specific cell types relative to the device. Python was used to build a server-based script (Dell 910 PowerEdge servers with 4 sockets/server with 10 cores each, 2 threads per core and 1TB of RAM running on Red Hat Enterprise Linux linked to a RAID 5 SAN) capable of routinely handling image datasets at this scale and performing all these processing steps in a collaborative multi-user multi-platform environment. Our Python script enables efficient data storage and movement between computers and storage servers, logs all the processing steps, and performs full multi-threaded execution of all codes, including open and closed-source third party libraries.
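The orchestration layer described here is, at its core, Python gluing compiled modules together with logging and thread-level parallelism. A minimal sketch of that pattern (the tile names and stage functions are placeholders, not FARSIGHT's actual API):

```python
import logging
from concurrent.futures import ThreadPoolExecutor

logging.basicConfig(level=logging.INFO,
                    format="%(asctime)s %(levelname)s %(message)s")
log = logging.getLogger("pipeline")

def mosaic(tile):
    # Placeholder for invoking a compiled mosaicking module on one tile.
    log.info("mosaicking %s", tile)
    return tile

def segment(tile):
    # Placeholder for invoking a compiled segmentation module.
    log.info("segmenting %s", tile)
    return tile

tiles = [f"tile_{i:03d}" for i in range(8)]   # hypothetical tile names
with ThreadPoolExecutor(max_workers=4) as pool:
    mosaicked = list(pool.map(mosaic, tiles))
    segmented = list(pool.map(segment, mosaicked))
log.info("done: %d tiles processed", len(segmented))
```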
doi:10.3389/fninf.2014.00039
PMCID: PMC4010742  PMID: 24808857
Python; neuroprosthetic device; C++; image processing software; segmentation; microglia tracing; neuroscience
12.  User-centered virtual environment design for virtual rehabilitation 
Background
As physical and cognitive rehabilitation protocols utilizing virtual environments transition from single applications to comprehensive rehabilitation programs there is a need for a new design cycle methodology. Current human-computer interaction designs focus on usability without benchmarking technology within a user-in-the-loop design cycle. The field of virtual rehabilitation is unique in that determining the efficacy of this genre of computer-aided therapies requires prior knowledge of technology issues that may confound patient outcome measures. Benchmarking the technology (e.g., displays or data gloves) using healthy controls may provide a means of characterizing the "normal" performance range of the virtual rehabilitation system. This standard not only allows therapists to select appropriate technology for use with their patient populations, it also allows them to account for technology limitations when assessing treatment efficacy.
Methods
An overview of the proposed user-centered design cycle is given. Comparisons of two optical see-through head-worn displays provide an example of benchmarking techniques. Benchmarks were obtained using a novel vision test capable of measuring a user's stereoacuity while wearing different types of head-worn displays. Results from healthy participants who performed both virtual and real-world versions of the stereoacuity test are discussed with respect to virtual rehabilitation design.
Results
The user-centered design cycle argues for benchmarking to precede virtual environment construction, especially for therapeutic applications. Results from real-world testing illustrate the general limitations in stereoacuity attained when viewing content using a head-worn display. Further, the stereoacuity vision benchmark test highlights differences in user performance when utilizing a similar style of head-worn display. These results support the need for including benchmarks as a means of better understanding user outcomes, especially for patient populations.
Conclusions
The stereoacuity testing confirms that, without benchmarking in the design cycle, poor user performance could be misconstrued as resulting from the participant's injury state. Thus, a user-centered design cycle that includes benchmarking for the different sensory modalities is recommended for accurate interpretation of the efficacy of virtual environment-based rehabilitation programs.
doi:10.1186/1743-0003-7-11
PMCID: PMC2837672  PMID: 20170519
13.  Quadcopter control in three-dimensional space using a noninvasive motor imagery based brain-computer interface 
Journal of neural engineering  2013;10(4):10.1088/1741-2560/10/4/046003.
Objective
At the balanced intersection of human and machine adaptation is found the optimally functioning brain-computer interface (BCI). In this study, we report a novel experiment of BCI controlling a robotic quadcopter in three-dimensional physical space using noninvasive scalp EEG in human subjects. We then quantify the performance of this system using metrics suitable for asynchronous BCI. Lastly, we examine the impact that operation of a real world device has on subjects’ control with comparison to a two-dimensional virtual cursor task.
Approach
Five human subjects were trained to modulate their sensorimotor rhythms to control an AR Drone navigating a three-dimensional physical space. Visual feedback was provided via a forward facing camera on the hull of the drone. Individual subjects were able to accurately acquire up to 90.5% of all valid targets presented while travelling at an average straight-line speed of 0.69 m/s.
Significance
Freely exploring and interacting with the world around us is a crucial element of autonomy that is lost in the context of neurodegenerative disease. Brain-computer interfaces are systems that aim to restore or enhance a user’s ability to interact with the environment via a computer and through the use of only thought. We demonstrate for the first time the ability to control a flying robot in the three-dimensional physical space using noninvasive scalp recorded EEG in humans. Our work indicates the potential of noninvasive EEG based BCI systems to accomplish complex control in three-dimensional physical space. The present study may serve as a framework for the investigation of multidimensional non-invasive brain-computer interface control in a physical environment using telepresence robotics.
doi:10.1088/1741-2560/10/4/046003
PMCID: PMC3839680  PMID: 23735712
Brain-Computer Interface; BCI; EEG; 3D control; motor imagery; telepresence robotics
14.  Towards a HPC-oriented parallel implementation of a learning algorithm for bioinformatics applications 
BMC Bioinformatics  2014;15(Suppl 5):S2.
Background
The huge quantity of data produced in biomedical research needs sophisticated algorithmic methodologies for its storage, analysis, and processing. High Performance Computing (HPC) appears as a magic bullet in this challenge. However, several hard-to-solve parallelization and load-balancing problems arise in this context. Here we discuss the HPC-oriented implementation of a general-purpose learning algorithm, originally conceived for DNA analysis and recently extended to treat uncertainty on data (U-BRAIN). The U-BRAIN algorithm is a learning algorithm that finds a Boolean formula in disjunctive normal form (DNF), of approximately minimum complexity, that is consistent with a set of data (instances) which may have missing bits. The conjunctive terms of the formula are computed in an iterative way by identifying, from the given data, a family of sets of conditions that must be satisfied by all the positive instances and violated by all the negative ones; such conditions allow the computation of a set of coefficients (relevances) for each attribute (literal), forming a probability distribution that guides the selection of the term literals. The great versatility that characterizes it makes U-BRAIN applicable in many fields in which there are data to be analyzed. However, the memory and execution time required are of order O(n³) and O(n⁵), respectively, so the algorithm is unaffordable for huge data sets.
Results
We find mathematical and programming solutions able to lead us towards the implementation of the U-BRAIN algorithm on parallel computers. First we give a dynamic programming model of the U-BRAIN algorithm, then we minimize the representation of the relevances. When the data are of great size we are forced to use mass memory, and depending on where the data are actually stored, the access times can be quite different. Following the evaluation of algorithmic efficiency based on the Disk Model, in order to reduce the cost of communications between different memories (RAM, cache, mass, virtual) and to achieve efficient I/O performance, we design a mass storage structure able to access its data with a high degree of temporal and spatial locality. We then develop a parallel implementation of the algorithm. We model it as an SPMD system combined with a message-passing programming paradigm. Here, we adopt the high-level message-passing system MPI (Message Passing Interface) in its version for the Java programming language, MPJ. The parallel processing is organized into four stages: partitioning, communication, agglomeration and mapping. The decomposition of the U-BRAIN algorithm determines the necessity of designing a communication protocol among the processors involved. Efficient synchronization design is also discussed.
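The partition/communicate/agglomerate pattern is easy to see in a few lines. The paper uses MPJ (MPI for Java); below is a minimal stand-in using Python's mpi4py, with a summation as a toy substitute for the relevance computation (run with, e.g., mpiexec -n 4 python script.py):

```python
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

# Partitioning: the root splits the instances into one chunk per processor.
chunks = ([list(range(i * 100, (i + 1) * 100)) for i in range(size)]
          if rank == 0 else None)

# Communication: scatter one chunk to each processor.
chunk = comm.scatter(chunks, root=0)

# Local computation: a toy stand-in for computing per-attribute relevances.
partial = sum(chunk)

# Agglomeration: combine the partial results on the root.
total = comm.reduce(partial, op=MPI.SUM, root=0)
if rank == 0:
    print("combined result:", total)
```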
Conclusions
In the context of a collaboration between public and private institutions, the parallel model of U-BRAIN has been implemented and tested on the Intel Xeon E7xxx and E5xxx family of the CRESCO structure of the Italian National Agency for New Technologies, Energy and Sustainable Economic Development (ENEA), developed in the framework of the European Grid Infrastructure (EGI), a series of efforts to provide access to high-throughput computing resources across Europe using grid computing techniques. The implementation is able to minimize both the memory space and the execution time. The test data used in this study are IPDATA (Irvine Primate splice-junction DATA set), a subset of HS3D (Homo Sapiens Splice Sites Dataset) and a subset of COSMIC (the Catalogue of Somatic Mutations in Cancer). The execution time and the speed-up on IPDATA reach their best values at about 90 processors; beyond that, the parallelization advantage is offset by the greater cost of non-local communications between the processors. Similar behaviour is evident on HS3D, but at a greater number of processors, demonstrating the direct relationship between data size and parallelization gain. This behaviour is confirmed on COSMIC. Overall, the results obtained show that the parallel version is up to 30 times faster than the serial one.
doi:10.1186/1471-2105-15-S5-S2
PMCID: PMC4095002  PMID: 25077818
15.  Development of a HIPAA-compliant environment for translational research data and analytics 
High-performance computing centers (HPC) traditionally have far less restrictive privacy management policies than those encountered in healthcare. We show how an HPC can be re-engineered to accommodate clinical data while retaining its utility in computationally intensive tasks such as data mining, machine learning, and statistics. We also discuss deploying protected virtual machines. A critical planning step was to engage the university's information security operations and the information security and privacy office. Access to the environment requires a double authentication mechanism. The first level of authentication requires access to the university's virtual private network and the second requires that the users be listed in the HPC network information service directory. The physical hardware resides in a data center with controlled room access. All employees of the HPC and its users take the university's local Health Insurance Portability and Accountability Act training series. In the first 3 years, researcher count has increased from 6 to 58.
doi:10.1136/amiajnl-2013-001769
PMCID: PMC3912719  PMID: 23911553
High-performance Computing; Translational Medical Research; Clinical Research Informatics; HIPAA
16.  Deep Neural Networks Rival the Representation of Primate IT Cortex for Core Visual Object Recognition 
PLoS Computational Biology  2014;10(12):e1003963.
The primate visual system achieves remarkable visual object recognition performance even in brief presentations, and under changes to object exemplar, geometric transformations, and background variation (a.k.a. core visual object recognition). This remarkable performance is mediated by the representation formed in inferior temporal (IT) cortex. In parallel, recent advances in machine learning have led to ever higher performing models of object recognition using artificial deep neural networks (DNNs). It remains unclear, however, whether the representational performance of DNNs rivals that of the brain. To accurately produce such a comparison, a major difficulty has been a unifying metric that accounts for experimental limitations, such as the amount of noise, the number of neural recording sites, and the number of trials, and computational limitations, such as the complexity of the decoding classifier and the number of classifier training examples. In this work, we perform a direct comparison that corrects for these experimental limitations and computational considerations. As part of our methodology, we propose an extension of “kernel analysis” that measures the generalization accuracy as a function of representational complexity. Our evaluations show that, unlike previous bio-inspired models, the latest DNNs rival the representational performance of IT cortex on this visual object recognition task. Furthermore, we show that models that perform well on measures of representational performance also perform well on measures of representational similarity to IT, and on measures of predicting individual IT multi-unit responses. Whether these DNNs rely on computational mechanisms similar to the primate visual system is yet to be determined, but, unlike all previous bio-inspired models, that possibility cannot be ruled out merely on representational performance grounds.
Author Summary
Primates are remarkable at determining the category of a visually presented object even in brief presentations, and under changes to object exemplar, position, pose, scale, and background. To date, this behavior has been unmatched by artificial computational systems. However, the field of machine learning has made great strides in producing artificial deep neural network systems that perform highly on object recognition benchmarks. In this study, we measured the responses of neural populations in inferior temporal (IT) cortex across thousands of images and compared the performance of neural features to features derived from the latest deep neural networks. Remarkably, we found that the latest artificial deep neural networks achieve performance equal to the performance of IT cortex. Both deep neural networks and IT cortex create representational spaces in which images with objects of the same category are close, and images with objects of different categories are far apart, even in the presence of large variations in object exemplar, position, pose, scale, and background. Furthermore, we show that the top-level features in these models exceed previous models in predicting the IT neural responses themselves. This result indicates that the latest deep neural networks may provide insight into understanding primate visual processing.
doi:10.1371/journal.pcbi.1003963
PMCID: PMC4270441  PMID: 25521294
17.  Analysis of multiple compound–protein interactions reveals novel bioactive molecules 
The authors use machine learning of compound-protein interactions to explore drug polypharmacology and to efficiently identify bioactive ligands, including novel scaffold-hopping compounds for two pharmaceutically important protein families: G-protein coupled receptors and protein kinases.
We have demonstrated that machine learning of multiple compound–protein interactions is useful for efficient ligand screening and for assessing drug polypharmacology. This approach successfully identified novel scaffold-hopping compounds for two pharmaceutically important protein families: G-protein-coupled receptors and protein kinases. These bioactive compounds were not detected by existing computational ligand-screening methods in comparative studies. The results of this study indicate that data derived from chemical genomics can be highly useful for exploring chemical space, and this systems biology perspective could accelerate drug discovery processes.
The discovery of novel bioactive molecules advances our systems-level understanding of biological processes and is crucial for innovation in drug development. Perturbations of biological systems by chemical probes provide broader applications not only for analysis of complex systems but also for intentional manipulations of these systems. Nevertheless, the lack of well-characterized chemical modulators has limited their use. Recently, chemical genomics has emerged as a promising area of research applicable to the exploration of novel bioactive molecules, and researchers are currently striving toward the identification of all possible ligands for all target protein families (Wang et al, 2009). Chemical genomics studies have shown that patterns of compound–protein interactions (CPIs) are too diverse to be understood as simple one-to-one events. There is an urgent need to develop appropriate data mining methods for characterizing and visualizing the full complexity of interactions between chemical space and biological systems. However, no existing screening approach has so far succeeded in identifying novel bioactive compounds using multiple interactions among compounds and target proteins.
High-throughput screening (HTS) and computational screening have greatly aided in the identification of early lead compounds for drug discovery. However, the large number of assays required for HTS to identify drugs that target multiple proteins render this process very costly and time-consuming. Therefore, interest in using in silico strategies for screening has increased. The most common computational approaches, ligand-based virtual screening (LBVS) and structure-based virtual screening (SBVS; Oprea and Matter, 2004; Muegge and Oloff, 2006; McInnes, 2007; Figure 1A), have been used for practical drug development. LBVS aims to identify molecules that are very similar to known active molecules and generally has difficulty identifying compounds with novel structural scaffolds that differ from reference molecules. The other popular strategy, SBVS, is constrained by the number of three-dimensional crystallographic structures available. To circumvent these limitations, we have shown that a new computational screening strategy, chemical genomics-based virtual screening (CGBVS), has the potential to identify novel, scaffold-hopping compounds and assess their polypharmacology by using a machine-learning method to recognize conserved molecular patterns in comprehensive CPI data sets.
The CGBVS strategy used in this study was made up of five steps: CPI data collection, descriptor calculation, representation of interaction vectors, predictive model construction using training data sets, and predictions from test data (Figure 1A). Importantly, step 1, the construction of a data set of chemical structures and protein sequences for known CPIs, did not require the three-dimensional protein structures needed for SBVS. In step 2, compound structures and protein sequences were converted into numerical descriptors. These descriptors were used to construct chemical or biological spaces in which decreasing distance between vectors corresponded to increasing similarity of compound structures or protein sequences. In step 3, we represented multiple CPI patterns by concatenating these chemical and protein descriptors. Using these interaction vectors, we could quantify the similarity of molecular interactions for compound–protein pairs, despite the fact that the ligand and protein similarity maps differed substantially. In step 4, concatenated vectors for CPI pairs (positive samples) and non-interacting pairs (negative samples) were input into an established machine-learning method. In the final step, the classifier constructed using training sets was applied to test data.
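A minimal sketch of steps 2-5, using random vectors in place of real descriptors and an SVM as the machine-learning method (the descriptor sizes, data, and classifier choice here are illustrative assumptions, not the study's exact setup):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Hypothetical stand-ins: 128-bit compound fingerprints and
# 64-dimensional protein sequence descriptors.
def compound_descriptor():
    return rng.integers(0, 2, 128)

def protein_descriptor():
    return rng.normal(size=64)

def interaction_vector(c, p):
    # Step 3: concatenate chemical and protein descriptors.
    return np.concatenate([c, p])

# Steps 1-2 and 4: assemble labelled pairs and train a classifier.
X = np.array([interaction_vector(compound_descriptor(), protein_descriptor())
              for _ in range(200)])
y = rng.integers(0, 2, 200)        # 1 = interacting pair, 0 = non-interacting
clf = SVC(kernel="rbf").fit(X, y)

# Step 5: score a new compound-protein pair.
pair = interaction_vector(compound_descriptor(), protein_descriptor())
print("predicted interaction:", clf.predict([pair])[0])
```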
To evaluate the predictive value of CGBVS, we first compared its performance with that of LBVS by fivefold cross-validation. CGBVS performed with considerably higher accuracy (91.9%) than did LBVS (84.4%; Figure 1B). We next compared CGBVS and SBVS in a retrospective virtual screening based on the human β2-adrenergic receptor (ADRB2). Figure 1C shows that CGBVS provided higher hit rates than did SBVS. These results suggest that CGBVS is more successful than conventional approaches for prediction of CPIs.
We then evaluated the ability of the CGBVS method to predict the polypharmacology of ADRB2 by attempting to identify novel ADRB2 ligands from a group of G-protein-coupled receptor (GPCR) ligands. We ranked the prediction scores for the interactions of 826 reported GPCR ligands with ADRB2 and then analyzed the 50 highest-ranked compounds in greater detail. Of 21 commercially available compounds, 11 showed ADRB2-binding activity and were not previously reported to be ADRB2 ligands. These compounds included ligands not only for aminergic receptors but also for neuropeptide Y-type 1 receptors (NPY1R), which have low protein homology to ADRB2. Most ligands we identified were not detected by LBVS and SBVS, which suggests that only CGBVS could identify this unexpected cross-reaction for a ligand developed as a target to a peptidergic receptor.
The true value of CGBVS in drug discovery must be tested by assessing whether this method can identify scaffold-hopping lead compounds from a set of compounds that is structurally more diverse. To assess this ability, we analyzed 11 500 commercially available compounds to predict compounds likely to bind to two GPCRs and two protein kinases. Functional assays revealed that nine ADRB2 ligands, three NPY1R ligands, five epidermal growth factor receptor (EGFR) inhibitors, and two cyclin-dependent kinase 2 (CDK2) inhibitors were concentrated in the top-ranked compounds (hit rate=30, 15, 25, and 10%, respectively). We also evaluated the extent of scaffold hopping achieved in the identification of these novel ligands. One ADRB2 ligand, two NPY1R ligands, and one CDK2 inhibitor exhibited scaffold hopping (Figure 4), indicating that CGBVS can use this characteristic to rationally predict novel lead compounds, a crucial and very difficult step in drug discovery. This feature of CGBVS is critically different from existing predictive methods, such as LBVS, which depend on similarities between test and reference ligands, and focus on a single protein or highly homologous proteins. In particular, CGBVS is useful for targets with undefined ligands because this method can use CPIs with target proteins that exhibit lower levels of homology.
In summary, we have demonstrated that data mining of multiple CPIs is of great practical value for exploration of chemical space. As a predictive model, CGBVS could provide an important step in the discovery of such multi-target drugs by identifying the group of proteins targeted by a particular ligand, leading to innovation in pharmaceutical research.
The discovery of novel bioactive molecules advances our systems-level understanding of biological processes and is crucial for innovation in drug development. For this purpose, the emerging field of chemical genomics is currently focused on accumulating large assay data sets describing compound–protein interactions (CPIs). Although new target proteins for known drugs have recently been identified through mining of CPI databases, using these resources to identify novel ligands remains unexplored. Herein, we demonstrate that machine learning of multiple CPIs can not only assess drug polypharmacology but can also efficiently identify novel bioactive scaffold-hopping compounds. Through a machine-learning technique that uses multiple CPIs, we have successfully identified novel lead compounds for two pharmaceutically important protein families, G-protein-coupled receptors and protein kinases. These novel compounds were not identified by existing computational ligand-screening methods in comparative studies. The results of this study indicate that data derived from chemical genomics can be highly useful for exploring chemical space, and this systems biology perspective could accelerate drug discovery processes.
doi:10.1038/msb.2011.5
PMCID: PMC3094066  PMID: 21364574
chemical genomics; data mining; drug discovery; ligand screening; systems chemical biology
18.  A Virtual Sensor for Online Fault Detection of Multitooth-Tools 
Sensors (Basel, Switzerland)  2011;11(3):2773-2795.
The installation of suitable sensors close to the tool tip on milling centres is not possible in industrial environments. It is therefore necessary to design virtual sensors for these machines to perform online fault detection in many industrial tasks. This paper presents a virtual sensor for online fault detection of multitooth tools based on a Bayesian classifier. The device that performs this task applies mathematical models that function in conjunction with physical sensors. Only two experimental variables are collected from the milling centre that performs the machining operations: the electrical power consumption of the feed drive and the time required for machining each workpiece. The task of achieving reliable signals from a milling process is especially complex when multitooth tools are used, because each kind of cutting insert in the milling centre only works on each workpiece during a certain time window. Great effort has gone into designing a robust virtual sensor that can avoid re-calibration due to, e.g., maintenance operations. The virtual sensor developed as a result of this research was successfully validated under real conditions on a milling centre used for the mass production of automobile engine crankshafts. Recognition accuracy, calculated with k-fold cross-validation, averaged 0.957 for the true-positive rate and 0.986 for the true-negative rate. Moreover, measured accuracy was 98%, which suggests that the virtual sensor correctly identifies new cases.
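With only two inputs (feed-drive power and machining time per workpiece), the classifier side of such a sensor is compact. A sketch with a Gaussian naive Bayes model on synthetic data (all numbers hypothetical; the paper's Bayesian classifier may differ):

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(1)

# Hypothetical training data: [feed-drive power (kW), machining time (s)]
normal = rng.normal([5.0, 60.0], [0.3, 2.0], size=(100, 2))
faulty = rng.normal([6.2, 66.0], [0.5, 3.0], size=(20, 2))
X = np.vstack([normal, faulty])
y = np.array([0] * 100 + [1] * 20)   # 0 = tool OK, 1 = fault

clf = GaussianNB().fit(X, y)
print(clf.predict([[6.0, 65.0]]))    # classify a new workpiece
```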
doi:10.3390/s110302773
PMCID: PMC3231587  PMID: 22163766
virtual sensor; Bayesian classifier; industrial applications; tool condition monitoring; multitooth-tools
19.  Extending the BEAGLE library to a multi-FPGA platform 
BMC Bioinformatics  2013;14:25.
Background
Maximum Likelihood (ML)-based phylogenetic inference using Felsenstein’s pruning algorithm is a standard method for estimating the evolutionary relationships amongst a set of species based on DNA sequence data, and is used in popular applications such as RAxML, PHYLIP, GARLI, BEAST, and MrBayes. The Phylogenetic Likelihood Function (PLF) and its associated scaling and normalization steps comprise the computational kernel for these tools. These computations are data intensive but contain fine grain parallelism that can be exploited by coprocessor architectures such as FPGAs and GPUs. A general purpose API called BEAGLE has recently been developed that includes optimized implementations of Felsenstein’s pruning algorithm for various data parallel architectures. In this paper, we extend the BEAGLE API to a multiple Field Programmable Gate Array (FPGA)-based platform called the Convey HC-1.
Results
The core calculation of our implementation, which includes both the phylogenetic likelihood function (PLF) and the tree likelihood calculation, has an arithmetic intensity of 130 floating-point operations per 64 bytes of I/O, or 2.03 ops/byte. Its performance can thus be calculated as a function of the host platform’s peak memory bandwidth and the implementation’s memory efficiency, as 2.03 × peak bandwidth × memory efficiency. Our FPGA-based platform has a peak bandwidth of 76.8 GB/s and our implementation achieves a memory efficiency of approximately 50%, which gives an average throughput of 78 Gflops. This represents a ~40X speedup when compared with BEAGLE’s CPU implementation on a dual Xeon 5520 and 3X speedup versus BEAGLE’s GPU implementation on a Tesla T10 GPU for very large data sizes. The power consumption is 92 W, yielding a power efficiency of 1.7 Gflops per Watt.
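The throughput claim follows directly from the stated arithmetic intensity and bandwidth figures, which a few lines can verify:

```python
flops_per_block = 130      # floating-point ops per 64 bytes of I/O (from the abstract)
bytes_per_block = 64
intensity = flops_per_block / bytes_per_block   # ~2.03 ops/byte

peak_bandwidth_gbs = 76.8  # Convey HC-1 peak memory bandwidth, GB/s
memory_efficiency = 0.50   # achieved fraction of peak

throughput = intensity * peak_bandwidth_gbs * memory_efficiency
print(f"{throughput:.0f} Gflops")  # -> 78 Gflops, matching the reported figure
```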
Conclusions
The use of data parallel architectures to achieve high performance for likelihood-based phylogenetic inference requires high memory bandwidth and a design methodology that emphasizes high memory efficiency. To achieve this objective, we integrated 32 pipelined processing elements (PEs) across four FPGAs. For the design of each PE, we developed a specialized synthesis tool to generate a floating-point pipeline with resource and throughput constraints to match the target platform. We have found that using low-latency floating-point operators can significantly reduce FPGA area and still meet timing requirement on the target platform. We found that this design methodology can achieve performance that exceeds that of a GPU-based coprocessor.
doi:10.1186/1471-2105-14-25
PMCID: PMC3599256  PMID: 23331707
20.  APP: an Automated Proteomics Pipeline for the analysis of mass spectrometry data based on multiple open access tools 
BMC Bioinformatics  2014;15(1):441.
Background
Mass spectrometry analyses of complex protein samples yield large amounts of data and specific expertise is needed for data analysis, in addition to a dedicated computer infrastructure. Furthermore, the identification of proteins and their specific properties require the use of multiple independent bioinformatics tools and several database search algorithms to process the same datasets. In order to facilitate and increase the speed of data analysis, there is a need for an integrated platform that would allow a comprehensive profiling of thousands of peptides and proteins in a single process through the simultaneous exploitation of multiple complementary algorithms.
Results
We have established a new proteomics pipeline designated as APP that fulfills these objectives using a complete series of tools freely available from open sources. APP automates the processing of proteomics tasks such as peptide identification, validation and quantitation from LC-MS/MS data and allows easy integration of many separate proteomics tools. Distributed processing is at the core of APP, allowing the processing of very large datasets using any combination of Windows/Linux physical or virtual computing resources.
Conclusions
APP provides distributed computing nodes that are simple to set up, greatly relieving the need for separate IT competence when handling large datasets. The modular nature of APP allows complex workflows to be managed and distributed, speeding up throughput and setup. Additionally, APP logs execution information on all executed tasks and generated results, simplifying information management and validation.
Electronic supplementary material
The online version of this article (doi:10.1186/s12859-014-0441-8) contains supplementary material, which is available to authorized users.
doi:10.1186/s12859-014-0441-8
PMCID: PMC4314934  PMID: 25547515
Proteomics; Automation; Validation; Distributed processing
21.  CloVR: A virtual machine for automated and portable sequence analysis from the desktop using cloud computing 
BMC Bioinformatics  2011;12:356.
Background
Next-generation sequencing technologies have decentralized sequence acquisition, increasing the demand for new bioinformatics tools that are easy to use, portable across multiple platforms, and scalable for high-throughput applications. Cloud computing platforms provide on-demand access to computing infrastructure over the Internet and can be used in combination with custom-built virtual machines to distribute pre-packaged, pre-configured software.
Results
We describe the Cloud Virtual Resource, CloVR, a new desktop application for push-button automated sequence analysis that can utilize cloud computing resources. CloVR is implemented as a single portable virtual machine (VM) that provides several automated analysis pipelines for microbial genomics, including 16S, whole-genome and metagenome sequence analysis. The CloVR VM runs on a personal computer, utilizes local computer resources and requires minimal installation, addressing key challenges in deploying bioinformatics workflows. In addition, CloVR supports the use of remote cloud computing resources to improve performance for large-scale sequence processing. In a case study, we demonstrate the use of CloVR to automatically process next-generation sequencing data on multiple cloud computing platforms.
Conclusion
The CloVR VM and associated architecture lowers the barrier of entry for utilizing complex analysis protocols on both local single- and multi-core computers and cloud systems for high throughput data processing.
doi:10.1186/1471-2105-12-356
PMCID: PMC3228541  PMID: 21878105
22.  TiArA: A Virtual Appliance for the Analysis of Tiling Array Data 
PLoS ONE  2010;5(4):e9993.
Background
Genomic tiling arrays have been described in the scientific literature since 2003, yet there is a shortage of user-friendly applications available for their analysis.
Methodology/Principal Findings
Tiling Array Analyzer (TiArA) is a software program that provides a user-friendly graphical interface for the background subtraction, normalization, and summarization of data acquired through the Affymetrix tiling array platform. The background signal is empirically measured using a group of nonspecific probes with varying levels of GC content and normalization is performed to enforce a common dynamic range.
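A toy version of the two steps named here, background subtraction keyed to GC content followed by dynamic-range normalization, in numpy (the percentile-based background and min-max rescaling are illustrative stand-ins for TiArA's actual method):

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical probe intensities and per-probe GC counts.
signal = rng.lognormal(6.0, 0.5, 1000)
gc = rng.integers(5, 20, 1000)

# Background subtraction: an empirical background per GC level; the 5th
# percentile of each GC group stands in for the signal of nonspecific
# probes with matching GC content.
corrected = signal.copy()
for g in np.unique(gc):
    mask = gc == g
    corrected[mask] -= np.percentile(signal[mask], 5)

# Normalization: enforce a common dynamic range (min-max as a stand-in).
normalized = (corrected - corrected.min()) / np.ptp(corrected)
print(f"dynamic range: {normalized.min():.2f}..{normalized.max():.2f}")
```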
Conclusions/Significance
TiArA is implemented as a standalone program for Linux systems and is available as a cross-platform virtual machine that will run under most modern operating systems using virtualization software such as Sun VirtualBox or VMware. The software is available as a Debian package or a virtual appliance at http://purl.org/NET/tiara.
doi:10.1371/journal.pone.0009993
PMCID: PMC2848623  PMID: 20376318
23.  Virtual Patients on the Semantic Web: A Proof-of-Application Study 
Background
Virtual patients are interactive computer simulations that are increasingly used as learning activities in modern health care education, especially in teaching clinical decision making. A key challenge is how to retrieve and repurpose virtual patients as unique types of educational resources between different platforms because of the lack of standardized content-retrieving and repurposing mechanisms. Semantic Web technologies provide the capability, through structured information, for easy retrieval, reuse, repurposing, and exchange of virtual patients between different systems.
Objective
An attempt to address this challenge has been made through the mEducator Best Practice Network, which provisioned frameworks for the discovery, retrieval, sharing, and reuse of medical educational resources. We have extended the OpenLabyrinth virtual patient authoring and deployment platform to facilitate the repurposing and retrieval of existing virtual patient material.
Methods
A standalone Web distribution and Web interface, which contains an extension for the OpenLabyrinth virtual patient authoring system, was implemented. This extension was designed to semantically annotate virtual patients to facilitate intelligent searches, complex queries, and easy exchange between institutions. The OpenLabyrinth extension enables OpenLabyrinth authors to integrate and share virtual patient case metadata within the mEducator3.0 network. Evaluation included 3 successive steps: (1) expert reviews; (2) evaluation of the ability of health care professionals and medical students to create, share, and exchange virtual patients through specific scenarios in extended OpenLabyrinth (OLabX); and (3) evaluation of the repurposed learning objects that emerged from the procedure.
Results
We evaluated 30 repurposed virtual patient cases. The evaluation, with a total of 98 participants, demonstrated the system's main strength: its core repurposing capacity. The extensive metadata schema presentation facilitated user exploration and filtering of resources. Usability weaknesses were primarily related to the ease-of-use provisions expected of standard computer applications. Most evaluators provided positive feedback regarding educational experiences on both content and system usability. Evaluation results were replicated across several independent evaluation events.
Conclusions
The OpenLabyrinth extension, as part of the semantic mEducator3.0 approach, is a virtual patient sharing approach that builds on a collection of Semantic Web services and federates existing sources of clinical and educational data. It is an effective sharing tool for virtual patients and has been merged into the next version of the app (OpenLabyrinth 3.3). Such tool extensions may enhance the medical education arsenal with capacities of creating simulation/game-based learning episodes, massive open online courses, curricular transformations, and a future robust infrastructure for enabling mobile learning.
doi:10.2196/jmir.3933
PMCID: PMC4319094  PMID: 25616272
semantics; medical education; problem-based learning; data sharing; patient simulation; educational assessment
24.  RES1/384: Spreading the Use of Kinetic Modeling Techniques by JAVA Analysis Software 
Introduction
Kinetic modeling is the method of choice for assessing the behaviour of new PET (Positron Emission Tomography) tracers. For suitable tracers, kinetic models allow the derivation of unique functional information from the acquired PET data, for instance the absolute perfusion or the density of specific receptors in brain tissue. However, the processing steps required are sophisticated. As no comprehensive modeling software has been available in the past, kinetic models could only be developed and applied by a limited number of sites. This paper presents such a software package, called PMOD. Being developed with Internet technologies, it can easily be distributed and may thus help establish more widespread use of kinetic modeling.
Methods
Aiming at maximal portability, the entire software was programmed in Java 2. An interface was defined such that new models can easily and seamlessly be added through a kind of plug-in programming. It is general enough to cope with virtually all models published so far. Innovative models may therefore be implemented directly in PMOD, or they may easily be incorporated even by external researchers. The supported features include weighted least-squares fitting, parameter coupling among models, Monte Carlo simulations for assessing parameter identifiability, and batch processing for scheduling a sequence of time-consuming trials. The software can be configured as a local Java application, but can also be installed on an Internet server and run from any Java 2-enabled Web browser.
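The core fitting feature is ordinary weighted least squares over a compartment model. A sketch in Python (PMOD itself is Java; the one-tissue model, synthetic input curve, and noise model below are illustrative assumptions):

```python
import numpy as np
from scipy.optimize import curve_fit

t = np.linspace(0, 60, 61)            # frame mid-times, minutes
cp = 10 * t * np.exp(-t / 3)          # hypothetical plasma input curve
dt = t[1] - t[0]

def one_tissue(t, K1, k2):
    # One-tissue compartment model: Ct = K1 * (Cp convolved with exp(-k2*t)).
    irf = np.exp(-k2 * t)
    return K1 * np.convolve(cp, irf)[: len(t)] * dt

# Synthetic noisy tissue curve with known parameters.
true_ct = one_tissue(t, 0.3, 0.1)
sd = 0.05 * true_ct.max()
ct = true_ct + np.random.default_rng(3).normal(0, sd, len(t))

# Weighted least squares: sigma assigns each frame its weight.
(K1, k2), _ = curve_fit(one_tissue, t, ct, p0=[0.1, 0.05],
                        sigma=np.full(len(t), sd))
print(f"K1 = {K1:.3f} /min, k2 = {k2:.3f} /min")
```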
Results
The modeling software currently supports 19 different models ranging from simple tissue ratio methods to complex multi-injection protocols with two input curves plus metabolite correction. It has been tested on different platforms such as HP-UX, Sun Solaris, Linux, and Windows. At present it has been adopted by 7 sites which run the software on PC/NT. Experiences on this platform demonstrate:
The Java Virtual Machine runs with high reliability.
Despite just-in-time compilation, there still exists a significant performance penalty for Java applications, especially with respect to memory management.
The kinetic modeling environment in its present form is used on a daily basis at several sites for scientific studies and even for some types of clinical studies.
Discussion
Earlier kinetic modeling programs were typically based on high-level languages such as Matlab or IDL and tailored to the needs of individual sites. Every attempt to port them to a different environment was a major undertaking. This is in contrast to the present modeling software, which runs on any platform as an application and supports easy data input.
Acknowledgement:
This work was supported by the Swiss National Science Foundation, Project 7PLPJ048289.
doi:10.2196/jmir.1.suppl1.e77
PMCID: PMC1761818
Medical Informatics Applications; PET; Kinetic Modeling; Java
25.  Discrepancy between mRNA and protein abundance: Insight from information retrieval process in computers 
Discrepancy between the abundance of cognate protein and RNA molecules is frequently observed. A theoretical understanding of this discrepancy remains elusive, and it is frequently described in the literature as a surprise and/or a technical difficulty. Protein and RNA represent different steps of the multi-step cellular genetic information flow process, in which they are dynamically produced and degraded. This paper explores a comparison with a similar process in computers: the multi-step flow of information from the storage level to the execution level. Functional similarities can be found in almost every facet of the retrieval process. Firstly, a common architecture is shared, as the ribonome (RNA space) and the proteome (protein space) are functionally similar to the computer's primary memory and cache memory, respectively. Secondly, the retrieval process functions, in both systems, to support the operation of dynamic networks: biochemical regulatory networks in cells and, in computers, the virtual networks (of CPU instructions) that the CPU travels through while executing programs. Moreover, many regulatory techniques are implemented in computers at each step of the information retrieval process, with the goal of optimizing system performance, and cellular counterparts can easily be identified for these techniques. In other words, this comparative study attempts to use theoretical insight from computer system design principles as a catalyst to sketch an integrative view of the gene expression process, that is, how it functions to ensure efficient operation of the overall cellular regulatory network. In the context of this bird's-eye view, the discrepancy between protein and RNA abundance becomes a logical observation one would expect. It is suggested that this discrepancy, when interpreted in the context of system operation, serves as a potential source of information for deciphering the regulatory logic underlying biochemical network operation.
doi:10.1016/j.compbiolchem.2008.07.014
PMCID: PMC2637108  PMID: 18757239
