Access to public data sets is important to the scientific community as a resource for developing new experiments or validating new data. Projects such as the PeptideAtlas, Ensembl and The Cancer Genome Atlas (TCGA) offer both access to public data and a repository through which researchers can share their own data. Access to these data sets is often provided through a web page form and a web service API. Access technologies based on web protocols (e.g. HTTP) have been in use for over a decade and are widely adopted across the industry for a variety of functions (e.g. search, commercial transactions, and social media). Each architecture adapts these technologies to provide users with tools to access and share data. Both commonly used web service technologies (e.g. REST and SOAP) and custom-built solutions over HTTP are used to provide access to research data. Providing multiple access points ensures that the community can access the data in the simplest and most effective manner for its particular needs. This article examines three common access mechanisms for web-accessible data: BioMart, caBIG, and Google Data Sources. These are illustrated by implementing each over the PeptideAtlas repository and reviewed for their suitability based on specific usages common to research. BioMart, Google Data Sources, and caBIG are each suitable for certain uses. The tradeoffs made in the development of each technology depend on the uses it was designed for (e.g. security versus speed). An understanding of specific requirements and tradeoffs is therefore necessary before selecting an access technology.
BioMart; Google Data Sources; caBIG; data access; proteomics
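Programmatic access of the kind described above typically reduces to constructing a parameterized HTTP request against a web service. The sketch below shows the general pattern; the endpoint, resource name and parameters are hypothetical illustrations, not the actual PeptideAtlas, BioMart or caBIG APIs.

```python
# Minimal sketch of REST-style programmatic access to a web-accessible
# data repository. BASE_URL and all parameter names are hypothetical.
import urllib.parse

BASE_URL = "https://example.org/peptideatlas/api"  # hypothetical endpoint

def build_query_url(resource, **params):
    """Construct a REST query URL for a resource and filter parameters."""
    query = urllib.parse.urlencode(sorted(params.items()))
    return f"{BASE_URL}/{resource}?{query}"

# A client would then fetch this URL (e.g. with urllib.request.urlopen)
# and parse the JSON or XML payload returned by the service.
url = build_query_url("peptides", protein="P01308", min_probability=0.9)
```

The same URL-plus-parameters pattern underlies form-based web pages and service APIs alike, which is why a single repository can expose several access points over one data store.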
With the development of increasingly large and complex genomic and proteomic data sets, more capable Venn diagram analysis programs are becoming increasingly important. Currently available free Venn diagram programs often fail to represent extra complexity among datasets, such as differences in regulation patterns between groups. Here we describe the development of VennPlex, a program that illustrates the often diverse numerical interactions among up to four high-complexity datasets. VennPlex includes versatile output features, whereby grouped data points in specific regions can be easily exported into a spreadsheet. The program facilitates the analysis of two to four gene sets and their corresponding expression values in a user-friendly manner. To demonstrate its experimental utility we applied VennPlex to a complex paradigm: a comparison of the effect of multiple oxygen tension environments (1–20% ambient oxygen) upon gene transcription of primary rat astrocytes. VennPlex reliably dissects complex data sets into easily identifiable groups for straightforward analysis and data output. The program, which improves on currently available Venn diagram software, rapidly extracts the datasets that represent the variety of expression patterns within the data, with potential applications in fields such as genomics, proteomics, and bioinformatics.
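The core computation behind any such Venn program is partitioning elements into exclusive regions: each region holds the elements that belong to exactly one combination of the input sets. A minimal sketch of that partitioning for up to four (or any number of) sets follows; this illustrates the general idea, not VennPlex's actual implementation.

```python
from itertools import combinations

def venn_regions(named_sets):
    """Partition elements into exclusive Venn regions.

    named_sets: dict mapping set name -> set of elements.
    Returns a dict mapping a frozenset of set names (one region) to the
    elements belonging to exactly those sets and no others.
    """
    names = list(named_sets)
    regions = {}
    for r in range(1, len(names) + 1):
        for combo in combinations(names, r):
            # Elements in every set of this combination...
            inside = set.intersection(*(named_sets[n] for n in combo))
            # ...minus elements appearing in any set outside it.
            outside = set.union(set(), *(named_sets[n] for n in names if n not in combo))
            exclusive = inside - outside
            if exclusive:
                regions[frozenset(combo)] = exclusive
    return regions
```

In a tool like VennPlex each region's members would then be exported to a spreadsheet, optionally annotated with expression values to distinguish, for example, up- versus down-regulated overlaps.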
For shotgun mass spectrometry-based proteomics, the most computationally expensive step is matching the observed spectra against an increasingly large database of sequences, and their post-translational modifications, of known mass. Each mass spectrometer can generate data at an astonishingly high rate, and the scope of what is searched for is continually increasing. Solutions for improving our ability to perform these searches are therefore needed.
We present a sequence database search engine that is specifically designed to run efficiently on the Hadoop MapReduce distributed computing framework. The search engine implements the K-score algorithm, generating comparable output for the same input files as the original implementation. The scalability of the system is shown, and the architecture required for the development of such distributed processing is discussed.
The software is scalable in its ability to handle a large peptide database, numerous modifications and large numbers of spectra. Performance scales with the number of processors in the cluster, allowing throughput to expand with the available resources.
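The distribution strategy described above can be illustrated with a toy map/reduce pair: the mapper scores each spectrum against candidate peptides falling within a precursor-mass tolerance, and the reducer keeps the best-scoring match per spectrum. The scoring and fragment-prediction functions below are simplified placeholders for illustration only, not the K-score algorithm.

```python
# Toy sketch of a spectrum-database search expressed as MapReduce.
# score() and theoretical_peaks() are illustrative placeholders.

def theoretical_peaks(peptide):
    # Hypothetical stand-in for fragment-ion mass prediction.
    return [sum(ord(c) for c in peptide[:i]) for i in range(1, len(peptide))]

def score(spectrum, peptide):
    # Placeholder: count shared peaks. The real engine would use the
    # K-score comparison of observed vs. theoretical fragment ions.
    return len(set(spectrum["peaks"]) & set(theoretical_peaks(peptide)))

def mapper(spectrum, peptide_db, tolerance=0.5):
    """Emit (spectrum_id, (peptide, score)) for each candidate peptide
    whose mass lies within the precursor tolerance."""
    for peptide, mass in peptide_db:
        if abs(mass - spectrum["precursor_mass"]) <= tolerance:
            yield spectrum["id"], (peptide, score(spectrum, peptide))

def reducer(spectrum_id, candidates):
    """Keep the best-scoring peptide for each spectrum."""
    return spectrum_id, max(candidates, key=lambda c: c[1])
```

Because each spectrum is scored independently, the mapper work partitions cleanly across a Hadoop cluster, which is what lets throughput scale with the number of processors.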
Genomic studies are now being undertaken on thousands of samples requiring new computational tools that can rapidly analyze data to identify clinically important features. Inferring structural variations in cancer genomes from mate-paired reads is a combinatorially difficult problem. We introduce Fastbreak, a fast and scalable toolkit that enables the analysis and visualization of large amounts of data from projects such as The Cancer Genome Atlas.
Cancer genomics; Structural variation; Translocation
Rationale: Clinical reports describe life-threatening cardiac arrhythmias after environmental exposure to carbon monoxide (CO) or accidental CO poisoning. Numerous case studies describe disruption of repolarization and prolongation of the QT interval, yet the mechanisms underlying CO-induced arrhythmias are unknown.
Objectives: To understand the cellular basis of CO-induced arrhythmias and to identify an effective therapeutic approach.
Methods: Patch-clamp electrophysiology and confocal Ca2+ and nitric oxide (NO) imaging in isolated ventricular myocytes were performed together with protein S-nitrosylation to investigate the effects of CO at the cellular and molecular levels, whereas telemetry was used to investigate effects of CO on electrocardiogram recordings in vivo.
Measurements and Main Results: CO increased the sustained (late) component of the inward Na+ current, resulting in prolongation of the action potential and the associated intracellular Ca2+ transient. In more than 50% of myocytes these changes progressed to early after-depolarization–like arrhythmias. CO elevated NO levels in myocytes and caused S-nitrosylation of the Na+ channel, Nav1.5. All proarrhythmic effects of CO were abolished by the NO synthase inhibitor l-NAME, and reversed by ranolazine, an inhibitor of the late Na+ current. Ranolazine also corrected QT variability and arrhythmias induced by CO in vivo, as monitored by telemetry.
Conclusions: Our data indicate that the proarrhythmic effects of CO arise from activation of NO synthase, leading to NO-mediated nitrosylation of Nav1.5 and to induction of the late Na+ current. We also show that the antianginal drug ranolazine can abolish CO-induced early after-depolarizations, highlighting a novel approach to the treatment of CO-induced arrhythmias.
carbon monoxide; arrhythmia; late Na+ channel; nitric oxide; S-nitrosylation
As the volume, complexity and diversity of the information that scientists work with on a daily basis continues to rise, so too does the requirement for new analytic software. This software must resolve the dichotomy between the need to support a high level of scientific reasoning and the requirement for an intuitive, easy-to-use tool that does not demand specialist, and often arduous, training. Information visualization provides a solution to this problem, as it allows direct manipulation of and interaction with diverse and complex data. The challenge facing bioinformatics researchers is how to apply this knowledge to data sets that are continually growing in a field that is rapidly changing.
This paper discusses an approach to the development of visual mining tools capable of supporting the mining of massive data collections used in systems biology research, and lessons learned in providing tools for both local researchers and the wider community. Example tools were developed to enable the exploration and analysis of both proteomics- and genomics-based atlases. These atlases represent large repositories of raw and processed experiment data generated to support the identification of biomarkers through mass spectrometry (the PeptideAtlas) and the genomic characterization of cancer (The Cancer Genome Atlas). Specifically, the tools are designed to allow for: the visual mining of thousands of mass spectrometry experiments, to assist in designing informed targeted protein assays; and the interactive analysis of hundreds of genomes, to explore the variations across different cancer genomes and cancer types.
The mining of massive repositories of biological data requires the development of new tools and techniques. Visual exploration of the large-scale atlas data sets allows researchers to mine data to find new meaning and make sense of it at scales from single samples to entire populations. Providing linked, task-specific views that allow a user to start from points of interest (from diseases to single genes) enables targeted exploration of thousands of spectra and genomes. As the composition of the atlases changes, and our understanding of the biology increases, new tasks will continually arise. It is therefore important to make the data available in a suitable manner in as short a time as possible. We have done this through the use of common visualization workflows, into which we rapidly deploy visual tools. These visualizations follow common metaphors where possible to assist users in understanding the displayed data. Rapid development of tools and task-specific views allows researchers to mine large-scale data almost as quickly as it is produced. Ultimately these visual tools enable new inferences, new analyses and further refinement of the large-scale data being provided in atlases such as PeptideAtlas and The Cancer Genome Atlas.
In computational biology, permutation tests have become a widely used tool to assess the statistical significance of an event under investigation. However, the common way of computing the P-value, which expresses the statistical significance, requires a very large number of permutations when small (and thus interesting) P-values are to be accurately estimated. This is computationally expensive and often infeasible. Recently, we proposed an alternative estimator, which requires far fewer permutations compared to the standard empirical approach while still reliably estimating small P-values.
The proposed P-value estimator has been enriched with additional functionalities and is made available to the general community through a public website and web service, called EPEPT. This means that the EPEPT routines can be accessed not only via a website, but also programmatically using any programming language that can interact with the web. Examples of web service clients in multiple programming languages can be downloaded. Additionally, EPEPT accepts data of various common experiment types used in computational biology. For these experiment types EPEPT first computes the permutation values and then performs the P-value estimation. Finally, the source code of EPEPT can be downloaded.
Different types of users, such as biologists, bioinformaticians and software engineers, can use the method in an appropriate and simple way.
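For reference, the standard empirical estimator that the above approach improves upon can be sketched as follows. Because it can only resolve P-values down to roughly one over the number of permutations, estimating a P-value of 10^-6 requires on the order of a million permutations, which is what makes small P-values so expensive to compute.

```python
def empirical_pvalue(observed, null_stats):
    """Standard empirical P-value: the fraction of permutation statistics
    at least as extreme as the observed one. The +1 correction prevents
    reporting an estimate of exactly zero."""
    b = sum(1 for s in null_stats if s >= observed)
    return (b + 1) / (len(null_stats) + 1)
```

An alternative estimator such as EPEPT's can extrapolate into the tail of the permutation distribution, trading this brute-force resolution limit for a modeling assumption and thereby needing far fewer permutations.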
To assess the effect of wireless telephone substitution in a survey of health care reform opinions.
Survey of New Jersey adults conducted by landline and wireless telephones from June 1 to July 9, 2007.
Eighty-one survey measures are compared by wireless status. Logistic regression is used to confirm landline–wireless gaps in support for coverage reforms, controlling for population differences. Weights adjust for selection probability, complex sample design, and demographic distributions.
Significant differences by wireless status were found in many survey measures. Wireless users were significantly more likely to favor coverage reforms. Higher support for government-sponsored universal coverage, income-related state coverage subsidies, and an individual mandate remain after adjustment for demographic variables.
Opinion polls excluding wireless users are likely to understate support for coverage reforms.
Survey research; state health reform; wireless substitution
The advances in high-throughput sequencing technologies and growth in data sizes have highlighted the need for scalable tools to perform quality assurance testing. These tests are necessary to ensure that data is of a minimum necessary standard for use in downstream analysis. In this paper we present the SAMQA tool to rapidly and robustly identify errors in population-scale sequence data.
SAMQA has been used on samples from three separate sets of cancer genome data from The Cancer Genome Atlas (TCGA) project. Using technical standards provided by the SAM specification and biological standards defined by researchers, we have classified errors in these sequence data sets relative to individual reads within a sample. Due to an observed linearithmic speedup from running the majority of tasks on a high-performance computing (HPC) framework, poor-quality data was identified prior to secondary analysis in significantly less time than with alternative parallelization strategies on a single server.
The SAMQA toolset validates a minimum set of data quality standards across whole-genome and exome sequences. It is tuned to run on a high-performance computational framework, enabling QA across hundreds of gigabytes of samples regardless of coverage or sample type.
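One concrete example of a technical standard from the SAM specification that such a tool can validate per read: the query length implied by a read's CIGAR string must equal the length of its SEQ field. The sketch below implements that single check; SAMQA's actual rule set is broader.

```python
# Sketch of one SAM-spec check: CIGAR-implied query length vs. SEQ length.
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
# Operations that consume bases of the query sequence, per the SAM spec
# (deletions 'D', skips 'N', hard clips 'H' and padding 'P' do not).
QUERY_CONSUMING = set("MIS=X")

def cigar_query_length(cigar):
    """Total query bases implied by a CIGAR string."""
    return sum(int(n) for n, op in CIGAR_OP.findall(cigar) if op in QUERY_CONSUMING)

def check_read(cigar, seq):
    """Flag reads whose CIGAR and SEQ fields are inconsistent."""
    if cigar == "*" or seq == "*":
        return True  # unavailable fields are exempt from this check
    return cigar_query_length(cigar) == len(seq)
```

Checks of this form are independent per read, so they distribute trivially across an HPC framework, which is what makes population-scale QA tractable.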
Stable incorporation of labeled amino acids in cell culture is a simple approach to label proteins in vivo for mass spectrometric quantification. Full incorporation of isotopically heavy amino acids facilitates accurate quantification of proteins from different cultures, yet analysis methods for determination of incorporation are cumbersome and time-consuming. We present QTIPS, Quantification by Total Identified Peptides for SILAC, a straightforward, accurate method to determine the level of heavy amino acid incorporation throughout a population of peptides detected by mass spectrometry. Using QTIPS, we show that the incorporation of heavy amino acids in baker’s yeast is unaffected by the use of prototrophic strains, indicating that auxotrophy is not a requirement for SILAC experiments in this organism. This method has general utility for multiple applications where isotopic labeling is used for quantification in mass spectrometry.
QTIPS; SILAC; auxotrophy; yeast
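The pooled-ratio idea behind quantifying incorporation from total identified peptides can be sketched as follows. This is an illustration of the general approach only, not the published QTIPS procedure in detail.

```python
def incorporation_level(peptide_counts):
    """Estimate heavy-label incorporation pooled over a peptide population.

    peptide_counts: iterable of (heavy, light) identification counts, one
    pair per peptide. Returns heavy / (heavy + light) over the whole
    population, i.e. the fraction of identifications carrying the label.
    """
    heavy = sum(h for h, _ in peptide_counts)
    light = sum(l for _, l in peptide_counts)
    return heavy / (heavy + light)
```

Pooling across all identified peptides, rather than inspecting individual isotope pairs by hand, is what makes the determination fast enough to run routinely before a quantitative SILAC experiment.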
The central nervous system normally functions at O2 levels which would be regarded as hypoxic by most other tissues. However, most in vitro studies of neurons and astrocytes are conducted under hyperoxic conditions without consideration of O2-dependent cellular adaptation. We analyzed the reactivity of astrocytes to 1, 4 and 9% O2 tensions compared to the cell culture standard of 20% O2, to investigate their ability to sense and translate this O2 information to transcriptional activity. Varying the ambient O2 tension for rat astrocytes resulted in profound changes in ribosomal activity, cytoskeletal and energy-regulatory mechanisms and cytokine-related signaling. Clustering of transcriptional regulation patterns revealed four distinct response pattern groups that directionally pivoted around the 4% O2 tension, or demonstrated coherently ascending/descending gene expression patterns in response to diverse oxygen tensions. Immune response and cell cycle/cancer-related signaling pathway transcriptomic subsets were significantly activated with increasing hypoxia, whilst hemostatic and cardiovascular signaling mechanisms were attenuated with increasing hypoxia. Our data indicate that variant O2 tensions induce specific and physiologically focused transcript regulation patterns that may underpin important physiological mechanisms connecting higher neurological activity to astrocytic function and ambient oxygen environments. These well-defined patterns demonstrate a strong bias for physiological transcript programs to pivot around the 4% O2 tension, while uni-modal programs that do not show such a pivot appear more related to pathological actions. The functional interaction of these transcriptional ‘programs’ may serve to regulate the dynamic vascular responsivity of the central nervous system during periods of stress or heightened activity.
Peroxisomes are intracellular organelles that house a number of diverse metabolic processes, notably those required for β-oxidation of fatty acids. Peroxisome biogenesis can be induced by the presence of peroxisome proliferators, including fatty acids, which activate complex cellular programs that underlie the induction process. Here, we used multi-parameter quantitative phenotype analyses of an arrayed mutant collection of yeast cells induced to proliferate peroxisomes, to establish a comprehensive inventory of genes required for peroxisome induction and function. The assays employed include growth in the presence of fatty acids, and confocal imaging and flow cytometry through the induction process. In addition to the classical phenotypes associated with loss of peroxisomal functions, these studies identified 169 genes required for robust signaling, transcription, normal peroxisomal development and morphologies, and transmission of peroxisomes to daughter cells. These gene products are localized throughout the cell, and many have indirect connections to peroxisome function. By integration with extant data sets, we present a total of 211 genes linked to peroxisome biogenesis and highlight the complex networks through which information flows during peroxisome biogenesis and function.
High throughput sequencing has become an increasingly important tool for biological research. However, the existing software systems for managing and processing these data have not provided the flexible infrastructure that research requires.
Existing software solutions provide static and well-established algorithms in a restrictive package. However, as high-throughput sequencing is a rapidly evolving field, such static approaches lack the ability to readily adopt the latest advances and techniques that researchers often require. We have used a loosely coupled, service-oriented infrastructure to develop SeqAdapt. This system streamlines data management and allows for rapid integration of novel algorithms. Our approach also allows computational biologists to focus on developing and applying new methods instead of writing boilerplate infrastructure code.
The system is based around the Addama service architecture and is available at our website as a demonstration web application, an installable single download and as a collection of individual customizable services.
Public proteomics databases such as PeptideAtlas contain peptides and proteins identified in mass spectrometry experiments. However, these databases lack information about human disease for researchers studying disease-related proteins. We have developed mspecLINE, a tool that combines knowledge about human disease in MEDLINE with empirical data about the detectable human proteome in PeptideAtlas. mspecLINE associates diseases with proteins by calculating the semantic distance between annotated terms from a controlled biomedical vocabulary. We used an established semantic distance measure that is based on the co-occurrence of disease and protein terms in the MEDLINE bibliographic database.
The mspecLINE web application allows researchers to explore relationships between human diseases and parts of the proteome that are detectable using a mass spectrometer. Given a disease, the tool will display proteins and peptides from PeptideAtlas that may be associated with the disease. It will also display relevant literature from MEDLINE. Furthermore, mspecLINE allows researchers to select proteotypic peptides for specific protein targets in a mass spectrometry assay.
Although mspecLINE applies an information retrieval technique to the MEDLINE database, it is distinct from previous MEDLINE query tools in that it combines the knowledge expressed in scientific literature with empirical proteomics data. The tool provides valuable information about candidate protein targets to researchers studying human disease and is freely available on a public web server.
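The abstract does not spell out the distance formula, so as an illustration only: co-occurrence-based semantic distances often take a normalized logarithmic form like the one below (a generic sketch in the style of the normalized Google distance, not necessarily mspecLINE's exact measure). Two terms that always appear together score 0; terms that never co-occur score infinity.

```python
# Generic sketch of a co-occurrence-based semantic distance between two
# terms (e.g. a disease term and a protein term) in a document corpus.
import math

def cooccurrence_distance(n_x, n_y, n_xy, n_total):
    """Normalized co-occurrence distance.

    n_x, n_y: documents mentioning term x (resp. y); n_xy: documents
    mentioning both; n_total: documents in the corpus. Smaller values
    indicate a stronger association between the terms.
    """
    if n_xy == 0:
        return float("inf")
    lx, ly, lxy, ln = (math.log(v) for v in (n_x, n_y, n_xy, n_total))
    return (max(lx, ly) - lxy) / (ln - min(lx, ly))
```

Applied to MEDLINE, such a measure lets a tool rank proteins by how tightly their literature overlaps with a given disease's literature, which is the association principle the abstract describes.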
Since many children with X-linked agammaglobulinemia (XLA) can now be expected to reach adulthood, knowledge of the status of adults with XLA would be of importance to the patients, their families, and the physicians caring for these patients. We performed the current study in adults with XLA to examine the impact of XLA on their daily lives and quality of life, their educational and socioeconomic status, their knowledge of the inheritance of their disorder, and their reproductive attitudes. Physicians who had entered adult patients with XLA in a national registry were asked to pass on a survey instrument to their patients. The patients then filled out the survey instrument and returned it directly to the investigators. Adults with XLA were hospitalized more frequently and missed more work and/or school than did the general United States population. However, their quality of life was comparable to that of the general United States population. They achieved a higher level of education and had a higher income than did the general United States population. Their knowledge of the inheritance of their disease was excellent. Sixty percent of them would not exercise any reproductive planning options as a result of their disease. The results of the current study suggest that although the disease impacts the daily lives of adults with XLA, they still become productive members of society and excel in many areas.
There continues to be significant controversy related to diagnostic testing for gastroesophageal reflux disease (GERD). Clearly, barium contrast fluoroscopy is superior to any other test in defining the anatomy of the upper gastrointestinal (UGI) tract. Although fluoroscopy can demonstrate gastroesophageal reflux (GER), this observation does not equate to GERD. Fluoroscopy time should not be prolonged to attempt to demonstrate GER during barium contrast radiography. There are no data to justify prolonging fluoroscopy time to perform provocative maneuvers to demonstrate reflux during barium contrast UGI series. Symptoms of GERD may be associated with physiologic esophageal acid exposure measured by intraesophageal pH monitoring, and a significant percentage of patients with abnormal esophageal acid exposure have no or minimal clinical symptoms of reflux. Abnormal acid exposure defined by pH monitoring over a 24-h period does not equate to GERD. In clinical practice, a presumptive diagnosis of GERD is reasonably made when suspected reflux symptoms are substantially reduced or eliminated during a therapeutic trial of acid reduction therapy.
Gastroesophageal reflux disease; Gastroesophageal reflux
Within research, each experiment is different: the focus changes, and the data is generated from a continually evolving barrage of technologies. New techniques are continually introduced, with usage ranging from in-house protocols through to high-throughput instrumentation. To support these requirements, data management systems are needed that can be rapidly built and readily adapted for new usage.
The adaptable data management system discussed is designed to support the seamless mining and analysis of biological experiment data that is commonly used in systems biology (e.g. ChIP-chip, gene expression, proteomics, imaging, flow cytometry). We use different content graphs to represent different views upon the data. These views are designed for different roles: equipment specific views are used to gather instrumentation information; data processing oriented views are provided to enable the rapid development of analysis applications; and research project specific views are used to organize information for individual research experiments. This management system allows for both the rapid introduction of new types of information and the evolution of the knowledge it represents.
Data management is an important aspect of any research enterprise. It is the foundation on which most applications are built, and must be easily extended to serve new functionality for new scientific areas. We have found that adopting a three-tier architecture for data management, built around distributed standardized content repositories, allows us to rapidly develop new applications to support a diverse user community.
Reversible phosphorylation is the most common posttranslational modification used in the regulation of cellular processes. This study of phosphatases and kinases required for peroxisome biogenesis is the first genome-wide analysis of phosphorylation events controlling organelle biogenesis. We evaluate signaling molecule deletion strains of the yeast Saccharomyces cerevisiae for presence of a green fluorescent protein chimera of peroxisomal thiolase, formation of peroxisomes, and peroxisome functionality. We find that distinct signaling networks involving glucose-mediated gene repression, derepression, oleate-mediated induction, and peroxisome formation promote stages of the biogenesis pathway. Additionally, separate classes of signaling proteins are responsible for the regulation of peroxisome number and size. These signaling networks specify the requirements of early and late events of peroxisome biogenesis. Among the numerous signaling proteins involved, Pho85p is exceptional, with functional involvements in both gene expression and peroxisome formation. Our study represents the first global study of signaling networks regulating the biogenesis of an organelle.
In systems biology, and many other areas of research, there is a need for interoperability between tools and data sources that were not originally designed to be integrated. Due to the interdisciplinary nature of systems biology, and its association with high-throughput experimental platforms, there is an additional need to continually integrate new technologies. Because scientists often work in isolated groups, integration with other groups is rarely a consideration when building the required software tools.
We illustrate an approach, through the discussion of a purpose-built software architecture, which allows disparate groups to reuse tools and access data sources in a common manner. The architecture allows for: the rapid development of distributed applications; interoperability, so it can be used by a wide variety of developers and computational biologists; development using standard tools, so that it is easy to maintain and does not require a large development effort; extensibility, so that new technologies and data types can be incorporated; and non-intrusive development, insofar as researchers need not adhere to a pre-existing object model.
By using a relatively simple integration strategy, based upon a common identity system and dynamically discovered interoperable services, a lightweight software architecture can become the focal point through which scientists can both access and analyse the plethora of experimentally derived data.
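The integration strategy just described, common identifiers plus run-time discovery of interoperable services, can be sketched with a toy in-process registry. The data types, service names and handlers below are purely illustrative, not part of the actual architecture.

```python
# Toy sketch of service discovery keyed by data type, with dispatch by a
# common identifier. All names here are hypothetical illustrations.

class ServiceRegistry:
    def __init__(self):
        self._services = {}

    def register(self, data_type, name, handler):
        """Advertise a service able to handle a given data type."""
        self._services.setdefault(data_type, {})[name] = handler

    def discover(self, data_type):
        """Return the names of services registered for a data type."""
        return sorted(self._services.get(data_type, {}))

    def invoke(self, data_type, name, identifier):
        """Dispatch a request, by common identifier, to a service."""
        return self._services[data_type][name](identifier)

registry = ServiceRegistry()
registry.register("gene_expression", "normalize", lambda ident: f"normalized:{ident}")
registry.register("gene_expression", "cluster", lambda ident: f"clustered:{ident}")
```

Because callers depend only on the identifier scheme and the registry, new tools can be plugged in without requiring every group to adopt a shared object model, which is the non-intrusive property the architecture aims for.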
Periods of prolonged hypoxia are associated clinically with an increased incidence of dementia, the most common form of which is Alzheimer's disease. Here, we review recent studies aimed at providing a cellular basis for this association. Hypoxia promoted an enhanced secretory response of excitable cells via formation of a novel Ca2+ influx pathway associated with the formation of amyloid peptides of Alzheimer's disease. More strikingly, hypoxia potentiated Ca2+ influx specifically through L-type Ca2+ channels in three distinct cellular systems. This effect was post-transcriptional, and evidence suggests it occurred via increased formation of amyloid peptides which alter Ca2+ channel trafficking via a mechanism involving increased production of reactive oxygen species by mitochondria. This action of hypoxia is likely to contribute to dysregulation of Ca2+ homeostasis, which has been proposed as a mechanism of cell death in Alzheimer's disease. We suggest, therefore, that our data provide a cellular basis to account for the known increased incidence of Alzheimer's disease in patients who have suffered prolonged hypoxic episodes.
hypoxia; calcium channel; Alzheimer's disease; reactive oxygen species
Detachment from biofilms is an important consideration in the dissemination of infection and the contamination of industrial systems but is the least-studied biofilm process. By using digital time-lapse microscopy and biofilm flow cells, we visualized localized growth and detachment of discrete cell clusters in mature mixed-species biofilms growing under steady conditions in turbulent flow in situ. The detaching biomass ranged from single cells to an aggregate with a diameter of approximately 500 μm. Direct evidence of local cell cluster detachment from the biofilms was supported by microscopic examination of filtered effluent. Single cells and small clusters detached more frequently, but larger aggregates contained a disproportionately high fraction of total detached biomass. These results have significance in the establishment of an infectious dose and public health risk assessment.
Medium supplements were examined for their effect on the growth of channel catfish ovary cells. It was found that the usual serum supplement of 10% fetal calf serum could be successfully replaced with a combination of 5% fetal calf serum and a mixture of insulin, transferrin, and selenous acid. It was also found that these cells could be grown in a more efficient manner on microcarrier beads. This type of culture produced 14 times the number of cells per milliliter of total medium used compared with the usual tissue culture flasks used for cell growth. The microcarrier system also provided for greater production efficiency of DNA from channel catfish virus, a virus that infects this cell line.