Motivations: Spreadsheet-like tabular formats are ever more popular in the biomedical field as a means of experimental reporting. The problem of converting the graph of an experimental workflow into a table-based representation occurs in many such formats and is not easy to solve.
Results: We describe graph2tab, a library that implements methods to realise such a conversion in a size-optimised way. Our solution is generic and can be adapted to specific cases of data exporters or data converters that need to be implemented.
Availability and Implementation: The library source code and documentation are available at http://github.com/ISA-tools/graph2tab.
Supplementary Information: A supplementary document describes the theoretical and technical details about the library implementation.
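To make the conversion problem concrete, the following minimal Python sketch flattens a small workflow graph into table rows by enumerating every source-to-sink path. It is a naive illustration only and is not graph2tab's size-optimised method; the function name `paths_to_rows` and the node names are hypothetical.

```python
from typing import Dict, List


def paths_to_rows(graph: Dict[str, List[str]]) -> List[List[str]]:
    """Return one row per source-to-sink path of a directed acyclic graph.

    `graph` maps each node to the list of its successors. The graph is
    assumed to be acyclic; a cycle would make the recursion never end.
    """
    targets = {t for outs in graph.values() for t in outs}
    nodes = set(graph) | targets
    # Sources are nodes that no edge points to.
    sources = sorted(n for n in nodes if n not in targets)
    rows: List[List[str]] = []

    def walk(node: str, prefix: List[str]) -> None:
        path = prefix + [node]
        successors = graph.get(node, [])
        if not successors:  # sink reached: the path becomes a table row
            rows.append(path)
            return
        for nxt in successors:
            walk(nxt, path)

    for source in sources:
        walk(source, [])
    return rows


# Hypothetical workflow: one source yields two samples that share one assay.
workflow = {
    "source1": ["sample1", "sample2"],
    "sample1": ["assay1"],
    "sample2": ["assay1"],
}
rows = paths_to_rows(workflow)
for row in rows:
    print("\t".join(row))
```

Even in this tiny example the shared nodes `source1` and `assay1` are repeated in both rows. Path enumeration can blow up in size on densely connected workflows, which is the kind of redundancy a size-optimised conversion aims to limit.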
Summary: SIMBioMS is a web-based open source software system for managing data and information in biomedical studies. It provides a solution for the collection, storage, management and retrieval of information about research subjects and biomedical samples, as well as experimental data obtained using a range of high-throughput technologies, including gene expression, genotyping, proteomics and metabonomics. The system can easily be customized and has proven to be successful in several large-scale multi-site collaborative projects. It is compatible with emerging functional genomics data standards and provides data import and export in accepted standard formats. Protocols for transferring data to durable archives at the European Bioinformatics Institute have been implemented.
Availability: The source code, documentation and initialization scripts are available at http://simbioms.org.
Contact: email@example.com; firstname.lastname@example.org
Summary: Bioclipse, a graphical workbench for the life sciences, provides functionality for managing and visualizing life science data. We introduce Bioclipse-R, which integrates Bioclipse and the statistical programming language R. The synergy between Bioclipse and R is demonstrated by the construction of a decision support system for anticancer drug screening and mutagenicity prediction, which shows how Bioclipse-R can be used to perform complex tasks from within a single software system.
Availability and implementation: Bioclipse-R is implemented as a set of Java plug-ins for Bioclipse based on the R-package rj. Source code and binary packages are available from https://github.com/bioclipse and http://www.bioclipse.net/bioclipse-r, respectively.
Supplementary data are available at Bioinformatics online.
Biobanks are essential tools in diagnostics and therapeutics research and development related to personalized medicine. Several
international recommendations, standards and guidelines exist that discuss the legal, ethical, technological, and management
requirements of biobanks. Today's biobanks are much more than just collections of biospecimens. They also store a huge amount of
data related to biological samples which can be either clinical data or data coming from biochemical experiments. A well-designed
biobank software system also provides the possibility of finding associations between stored elements. Modern research biobanks
are able to manage multicenter sample collections while fulfilling all requirements of data protection and security. While
developing several biobanks and analyzing the data stored in them, our research group recognized the need for a well-organized,
easy-to-check requirements guideline that can be used to develop biobank software systems. International best practices along with
relevant ICT standards were integrated into a comprehensive guideline: The Model Requirements for the Management of Biological
Repositories (BioReq), which covers the full range of activities related to biobank development. The guideline is freely available on
the Internet for the research community.
The database is available for free at http://bioreq.astridbio.com/bioreq_v2.0.pdf
Biobank Software System; guideline; model requirement; personalized medicine
Motivation: Data collection in spreadsheets is ubiquitous, but current solutions lack support for collaborative semantic annotation that would promote shared and interdisciplinary annotation practices, supporting geographically distributed players.
Results: OntoMaton is an open source solution that brings ontology lookup and tagging capabilities into a cloud-based collaborative editing environment, harnessing Google Spreadsheets and the NCBO Web services. It is a general purpose, format-agnostic tool that may serve as a component of the ISA software suite. OntoMaton can also be used to assist the ontology development process.
Availability: OntoMaton is freely available from Google widgets under the CPAL open source license; documentation and examples at: https://github.com/ISA-tools/OntoMaton.
Long-distance ocean voyages may have substantial impacts on seamen's health, possibly causing malnutrition and other illnesses. If a detailed understanding can be gained of the specific health effects such voyages have on seamen, measures such as preparing a special diet and taking special precautions before or during the voyage could help prevent these problems.
We present a computational study on 200 seamen using 41 chemistry indicators measured on their blood samples collected before and after the sailing. Our computational study is done using a data classification approach with a support vector machine-based classifier in conjunction with feature selections using a recursive feature elimination procedure.
Our analysis results suggest that among the 41 blood chemistry measures, nine are most likely to be affected during the sailing, which provides important clues about the specific effects of ocean voyages on seamen's health.
The identification of the nine blood chemistry measures provides important clues about the effects of long-distance voyages on seamen's health. These findings should help guide improvements to the living and working environment, as well as food preparation, on ships.
Vast quantities of electronic data are collected about patients and service users as they pass through health service and other public sector organisations, and these data present enormous potential for research and policy evaluation. The Health Information Research Unit (HIRU) aims to realise the potential of electronically-held, person-based, routinely-collected data to conduct and support health-related studies. However, there are considerable challenges that must be addressed before such data can be used for these purposes, to ensure compliance with the legislation and guidelines generally known as Information Governance.
A set of objectives was identified to address the challenges and establish the Secure Anonymised Information Linkage (SAIL) system in accordance with Information Governance. These were to: 1) ensure data transportation is secure; 2) operate a reliable record matching technique to enable accurate record linkage across datasets; 3) anonymise and encrypt the data to prevent re-identification of individuals; 4) apply measures to address disclosure risk in data views created for researchers; 5) ensure data access is controlled and authorised; 6) establish methods for scrutinising proposals for data utilisation and approving output; and 7) gain external verification of compliance with Information Governance.
The SAIL databank has been established and it operates on a DB2 platform (Data Warehouse Edition on AIX) running on an IBM 'P' series Supercomputer: Blue-C. The findings of an independent internal audit were favourable and concluded that the systems in place provide adequate assurance of compliance with Information Governance. This expanding databank already holds over 500 million anonymised and encrypted individual-level records from a range of sources relevant to health and well-being. This includes national datasets covering the whole of Wales (approximately 3 million population) and local provider-level datasets, with further growth in progress. The utility of the databank is demonstrated by increasing engagement in high quality research studies.
Through the pragmatic approach that has been adopted, we have been able to address the key challenges in establishing a national databank of anonymised person-based records, so that the data are available for research and evaluation whilst meeting the requirements of Information Governance.
Summary: Methyl-Analyzer is a python package that analyzes genome-wide DNA methylation data produced by the Methyl-MAPS (methylation mapping analysis by paired-end sequencing) method. Methyl-MAPS is an enzymatic-based method that uses both methylation-sensitive and -dependent enzymes covering >80% of CpG dinucleotides within mammalian genomes. It combines enzymatic-based approaches with high-throughput next-generation sequencing technology to provide whole genome DNA methylation profiles. Methyl-Analyzer processes and integrates sequencing reads from methylated and unmethylated compartments and estimates CpG methylation probabilities at single base resolution.
Availability and implementation: Methyl-Analyzer is available at http://github.com/epigenomics/methylmaps. Sample dataset is available for download at http://epigenomicspub.columbia.edu/methylanalyzer_data.html.
Supplementary information: Supplementary data are available at Bioinformatics online.
Summary: Drug versus Disease (DvD) provides a pipeline, available through R
or Cytoscape, for the comparison of drug and disease gene expression profiles from public
microarray repositories. Negatively correlated profiles can be used to generate hypotheses
of drug-repurposing, whereas positively correlated profiles may be used to infer side
effects of drugs. DvD allows users to compare drug and disease signatures with dynamic
access to databases Array Express, Gene Expression Omnibus and data from the Connectivity
Availability and implementation: R package (submitted to Bioconductor) under
GPL 3 and Cytoscape plug-in freely available for download at www.ebi.ac.uk/saezrodriguez/DVD/.
Supplementary data are available at Bioinformatics
Summary: We report CRdata.org, a cloud-based, free, open-source web server for running analyses and sharing data and R scripts with others. In addition to using the free, public service, CRdata users can launch their own private Amazon Elastic Computing Cloud (EC2) nodes and store private data and scripts on Amazon's Simple Storage Service (S3) with user-controlled access rights. All CRdata services are provided via point-and-click menus.
Availability and Implementation: CRdata is open-source and free under the permissive MIT License (opensource.org/licenses/mit-license.php). The source code is in Ruby (ruby-lang.org/en/) and available at: github.com/seerdata/crdata.
Motivation: MethylCoder is a software program that generates per-base methylation data given a set of bisulfite-treated reads. It provides the option to use either of two existing short-read aligners, each with different strengths. It accounts for soft-masked alignments and overlapping paired-end reads. MethylCoder outputs data in text and binary formats in addition to the final alignment in SAM format, so that common high-throughput sequencing tools can be used on the resulting output. It is more flexible than existing software and competitive in terms of speed and memory use.
Availability: MethylCoder requires only a python interpreter and a C compiler to run. Extensive documentation and the full source code are available under the MIT license at: https://github.com/brentp/methylcode.
Biobanks and archived datasets collecting samples and data have become crucial engines of genetic and genomic research. Unresolved, however, is what responsibilities biobanks should shoulder to manage incidental findings (IFs) and individual research results (IRRs) of potential health, reproductive, or personal importance to individual contributors (using “biobank” here to refer to both collections of samples and collections of data). This paper reports recommendations from a 2-year, NIH-funded project. The authors analyze responsibilities to manage return of IFs and IRRs in a biobank research system (primary research or collection sites, the biobank itself, and secondary research sites). They suggest that biobanks shoulder significant responsibility for seeing that the biobank research system addresses the return question explicitly. When re-identification of individual contributors is possible, the biobank should work to enable the biobank research system to discharge four core responsibilities: to (1) clarify the criteria for evaluating findings and roster of returnable findings, (2) analyze a particular finding in relation to this, (3) re-identify the individual contributor, and (4) recontact the contributor to offer the finding. The authors suggest that findings that are analytically valid, reveal an established and substantial risk of a serious health condition, and that are clinically actionable should generally be offered to consenting contributors. The paper specifies 10 concrete recommendations, addressing new biobanks and biobanks already in existence.
incidental findings; return of results; biobanks; research ethics; bioethics; genetics; genomics
Motivation: Organic enzyme cofactors are involved in many enzyme reactions. Therefore, the analysis of cofactors is crucial to gain a better understanding of enzyme catalysis. To aid this, we have created the CoFactor database.
Results: CoFactor provides a web interface to access hand-curated data extracted from the literature on organic enzyme cofactors in biocatalysis, as well as automatically collected information. CoFactor includes information on the conformational and solvent accessibility variation of the enzyme-bound cofactors, as well as mechanistic and structural information about the hosting enzymes.
Availability: The database is publicly available and can be accessed at http://www.ebi.ac.uk/thornton-srv/databases/CoFactor
Supplementary information: Supplementary data are available at Bioinformatics online.
Motivation: Dasty3 is a highly interactive and extensible Web-based framework. It provides a rich Application Programming Interface upon which it is possible to develop specialized clients capable of retrieving information from DAS sources as well as from data providers not using the DAS protocol. Dasty3 provides significant improvements on previous Web-based frameworks and is implemented using the 1.6 DAS specification.
Availability: Dasty3 is an open-source tool freely available at http://www.ebi.ac.uk/dasty/ under the terms of the GNU General Public License. Source and documentation can be found at http://code.google.com/p/dasty/.
The Sol Genomics Network (SGN; http://solgenomics.net/) is a clade-oriented database (COD) containing biological data for species in the Solanaceae and their close relatives, with data types ranging from chromosomes and genes to phenotypes and accessions. SGN hosts several genome maps and sequences, including a pre-release of the tomato (Solanum lycopersicum cv Heinz 1706) reference genome. A new transcriptome component has been added to store RNA-seq and microarray data. SGN is also an open source software project, continuously developing and improving a complex system for storing, integrating and analyzing data. All code and development work is publicly visible on GitHub (http://github.com). The database architecture combines SGN-specific schemas and the community-developed Chado schema (http://gmod.org/wiki/Chado) for compatibility with other genome databases. The SGN curation model is community-driven, allowing researchers to add and edit information using simple web tools. Currently, over a hundred community annotators help curate the database. SGN can be accessed at http://solgenomics.net/.
Summary: A tool to predict the effect that newly discovered genomic variants have on known transcripts is indispensable in prioritizing and categorizing such variants. In Ensembl, a web-based tool (the SNP Effect Predictor) and an API can now functionally annotate variants in all Ensembl and Ensembl Genomes supported species.
Availability: The Ensembl SNP Effect Predictor can be accessed via the Ensembl website at http://www.ensembl.org/. The Ensembl API (http://www.ensembl.org/info/docs/api/api_installation.html for installation instructions) is open source software.
Contact: email@example.com; firstname.lastname@example.org
Supplementary information: Supplementary data are available at Bioinformatics online.
Although dietary supplements (DSs) in sports are considered a natural need resulting from athletes’ increased physical demands, and although they are often consumed by athletes, data on DS usage in Olympic sailing are scarce. The aim of this study was to investigate the use of and attitudes towards DSs and doping problems in high-level competitive sailing.
The sample consisted of 44 high-level sailing athletes (5 of whom were female; total mean age 24.13 ± 6.67 years) and 34 coaches (1 of whom was female; total mean age 37.01 ± 11.70). An extensive, self-administered questionnaire of substance use was used, and the subjects were asked about sociodemographic data, sport-related factors, DS-related factors (i.e., usage of and knowledge about DSs, sources of information), and doping-related factors. The Kruskal-Wallis ANOVA was used to determine the differences in group characteristics, and Spearman’s rank order correlation and a logistic regression analysis were used to define the relationships between the studied variables.
DS usage is relatively high. More than 77% of athletes consume DSs, and 38% do so on a regular basis (daily). The athletes place a high degree of trust in their coaches and/or physicians regarding DSs and doping. The most important reasons for not consuming DSs are the opinion that DSs are useless and a lack of knowledge about DSs. The likelihood of doping is low, and one-third of the subjects believe that doping occurs in sailing (no significant differences between athletes and coaches). The logistic regression found crew number (i.e., single vs. double crew) to be the single significant predictor of DS usage, with a higher probability of DS consumption among single crews.
Because of the high consumption of DSs, future investigations should focus on real nutritional needs in sailing sport. Also, since athletes reported that their coaches are the primary source of information about nutrition and DSs, further studies are necessary to determine the knowledge about nutrition, DSs and doping problems among athletes and their support teams (i.e., coaches, physicians, and strength and conditioning specialists).
Nutritional supplementation; Substances; Testing design; Athlete; Coach
Summary: Analysis of microbial genomes often requires the general organization and comparison of tens to thousands of genomes both from public repositories and unpublished sources. MicrobeDB provides a foundation for such projects by the automation of downloading published, completed bacterial and archaeal genomes from key sources, parsing annotations of all genomes (both public and private) into a local database, and allowing interaction with the database through an easy to use programming interface. MicrobeDB creates a simple to use, easy to maintain, centralized local resource for various large-scale comparative genomic analyses and a back-end for future microbial application design.
Availability: MicrobeDB is freely available under the GNU-GPL at: http://github.com/mlangill/microbedb/
CircaDB (http://circadb.org) is a new database of circadian transcriptional profiles from time course expression experiments from mice and humans. Each transcript’s expression was evaluated by three separate algorithms, JTK_Cycle, Lomb Scargle and DeLichtenberg. Users can query the gene annotations using simple and powerful full text search terms, restrict results to specific data sets and provide probability thresholds for each algorithm. Visualizations of the data are intuitive charts that convey profile information more effectively than a table of probabilities. The CircaDB web application is open source and available at http://github.com/itmat/circadb.
Perdeuteration, selective deuteration, and stereo array isotope labeling (SAIL) are valuable strategies for NMR studies of larger proteins and membrane proteins. To minimize scrambling of the label, it is best to use cell-free methods to prepare selectively labeled proteins. However, when proteins are prepared from deuterated amino acids by cell-free translation in H2O, exchange reactions can lead to contamination of 2H sites by 1H from the solvent. Examination of a sample of SAIL-chlorella ubiquitin prepared by Escherichia coli cell-free synthesis revealed that exchange had occurred at several residues (mainly at Gly, Ala, Asp, Asn, Glu, and Gln). We present results from a study aimed at identifying the exchanging sites and level of exchange and at testing a strategy for minimizing 1H contamination during wheat germ cell-free translation of proteins produced from deuterated amino acids by adding known inhibitors of transaminases (1 mM aminooxyacetic acid) and glutamate synthetase (0.1 mM L-methionine sulfoximine). By using a wheat germ cell-free expression system, we produced [U-2H,15N]-chlorella ubiquitin without and with added inhibitors, and [U-15N]-chlorella ubiquitin as a reference to determine the extent of deuterium incorporation. We also prepared a sample of [U-13C,15N]-chlorella ubiquitin, for use in assigning the sites of exchange. The added inhibitors did not reduce the protein yield and were successful in blocking hydrogen exchange at Cα sites with the exception of Gly. We discovered, in addition, that partial exchange occurred with or without the inhibitors at certain side-chain methyl and methylene groups: Ala-Hβ, Asn-Hβ, Asp-Hβ, Gln-Hγ, Glu-Hγ, and Lys-Hε. The side-chain labeling pattern, in particular the mixed chiral labeling resulting from partial exchange at certain sites, should be of interest in studies of large proteins, protein complexes, and membrane proteins.
Cell-free translation; chlorella ubiquitin; SAIL; perdeuterated protein; proton back exchange; transamination; transaminase inhibitor
We have previously demonstrated that routinely collected primary care data can be used to identify potential participants for trials in depression. Here we demonstrate how patients with psychotic disorders can be identified from primary care records for potential inclusion in a cohort study. We discuss the strengths and limitations of this approach, assess its potential value and report challenges encountered.
We designed an algorithm with which we searched for patients with a lifetime diagnosis of psychotic disorders within the Secure Anonymised Information Linkage (SAIL) database of routinely collected health data. The algorithm was validated against the "gold standard" of a well-established operational criteria checklist for psychotic and affective illness (OPCRIT). Case notes of 100 patients from a community mental health team (CMHT) in Swansea were studied, of whom 80 had matched GP records.
The algorithm had favourable test characteristics, with a very good ability to detect patients with psychotic disorders (sensitivity > 0.7) and an excellent ability not to falsely identify patients with psychotic disorders (specificity > 0.9).
With certain limitations our algorithm can be used to search the general practice data and reliably identify patients with psychotic disorders. This may be useful in identifying candidates for potential inclusion in cohort studies.
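The test characteristics reported above follow from a comparison of the algorithm's output with the gold standard. A minimal Python sketch of the two calculations is given below; the counts are hypothetical and chosen only to illustrate the formulas, not taken from the study.

```python
def sensitivity(true_pos: int, false_neg: int) -> float:
    """Proportion of gold-standard cases that the algorithm detects."""
    return true_pos / (true_pos + false_neg)


def specificity(true_neg: int, false_pos: int) -> float:
    """Proportion of gold-standard non-cases that the algorithm rejects."""
    return true_neg / (true_neg + false_pos)


# Hypothetical confusion-matrix counts (NOT the study's data).
tp, fn, tn, fp = 15, 5, 19, 1

sens = sensitivity(tp, fn)
spec = specificity(tn, fp)
print(f"sensitivity = {sens:.2f}, specificity = {spec:.2f}")
```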
Vast amounts of data are collected about patients and service users in the course of health and social care service delivery. Electronic data systems for patient records have the potential to revolutionise service delivery and research. But in order to achieve this, it is essential that the ability to link the data at the individual record level be retained whilst adhering to the principles of information governance. The SAIL (Secure Anonymised Information Linkage) databank has been established using disparate datasets, and over 500 million records from multiple health and social care service providers have been loaded to date, with further growth in progress.
Having established the infrastructure of the databank, the aim of this work was to develop and implement an accurate matching process to enable the assignment of a unique Anonymous Linking Field (ALF) to person-based records to make the databank ready for record-linkage research studies. An SQL-based matching algorithm (MACRAL, Matching Algorithm for Consistent Results in Anonymised Linkage) was developed for this purpose. Firstly the suitability of using a valid NHS number as the basis of a unique identifier was assessed using MACRAL. Secondly, MACRAL was applied in turn to match primary care, secondary care and social services datasets to the NHS Administrative Register (NHSAR), to assess the efficacy of this process, and the optimum matching technique.
The validation of using the NHS number yielded specificity values > 99.8% and sensitivity values > 94.6% using probabilistic record linkage (PRL) at the 50% threshold, and error rates were < 0.2%. A range of techniques for matching datasets to the NHSAR were applied and the optimum technique resulted in sensitivity values of: 99.9% for a GP dataset from primary care, 99.3% for a PEDW dataset from secondary care and 95.2% for the PARIS database from social care.
With the infrastructure that has been put in place, the reliable matching process that has been developed enables an ALF to be consistently allocated to records in the databank. The SAIL databank represents a research-ready platform for record-linkage studies.
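The abstract does not define what makes an NHS number "valid". One common meaning, assumed here, is that the ten-digit number passes the standard Modulus 11 check-digit test. The Python sketch below illustrates that check only; it is not part of MACRAL, and the function name is hypothetical.

```python
def is_valid_nhs_number(value: str) -> bool:
    """Check a ten-digit NHS number against its Modulus 11 check digit."""
    cleaned = value.replace(" ", "")
    if len(cleaned) != 10 or not (cleaned.isascii() and cleaned.isdigit()):
        return False
    # Weight the first nine digits by 10, 9, ..., 2 and sum the products.
    total = sum(int(d) * w for d, w in zip(cleaned[:9], range(10, 1, -1)))
    check = 11 - (total % 11)
    if check == 11:
        check = 0
    if check == 10:  # no single digit can represent 10: number is invalid
        return False
    return check == int(cleaned[9])


# A widely used example number that satisfies the check.
print(is_valid_nhs_number("943 476 5919"))
```

A check of this kind only filters out mistyped or malformed numbers. It says nothing about whether the number belongs to the right person, which is why the matching against a reference register described above is still needed.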
Recruitment to clinical trials can be challenging. We identified anonymous potential participants to an existing pragmatic randomised controlled depression trial to assess the feasibility of using routinely collected data to identify potential trial participants. We discuss the strengths and limitations of this approach, assess its potential value, and report challenges and ethical issues encountered.
Swansea University's Health Information Research Unit's Secure Anonymised Information Linkage (SAIL) database of routinely collected health records was interrogated using Structured Query Language (SQL). Read codes were used to create an algorithm of inclusion/exclusion criteria with which to identify suitable anonymous participants. Two independent clinicians rated the eligibility of the potential participants identified. Inter-rater reliability was assessed using the kappa statistic and intra-class correlation.
The study population (N = 37263) comprised all adults registered at five general practices in Swansea, UK. Using the algorithm, 867 anonymous potential participants were identified. The sensitivity and specificity results (> 0.9) suggested a high degree of accuracy from the algorithm. The inter-rater reliability results indicated strong agreement between the confirming raters: the Intra Class Correlation Coefficient (Cronbach's Alpha) > 0.9 suggested excellent agreement, and the Kappa coefficient > 0.8 suggested almost perfect agreement.
This proof of concept study showed that routinely collected primary care data can be used to identify potential participants for a pragmatic randomised controlled trial of folate augmentation of antidepressant therapy for the treatment of depression. Further work will be needed to assess generalisability to other conditions and settings and the inclusion of this approach to support Electronic Enhanced Recruitment (EER).
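The kappa statistic used above corrects the raw agreement between two raters for the agreement expected by chance. A minimal Python sketch of Cohen's kappa for two raters follows; the ratings are hypothetical and serve only to show the calculation.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohen_kappa(rater_a: Sequence[Hashable], rater_b: Sequence[Hashable]) -> float:
    """Cohen's kappa for two raters who labelled the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty item list")
    n = len(rater_a)
    # Observed agreement: share of items given the same label by both raters.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's label frequencies, summed.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    labels = set(counts_a) | set(counts_b)
    expected = sum(counts_a[k] * counts_b[k] for k in labels) / (n * n)
    # Undefined (division by zero) when both raters use one identical label.
    return (observed - expected) / (1 - expected)


# Hypothetical eligibility ratings (1 = eligible, 0 = not eligible).
ratings_a = [1, 1, 1, 1, 0, 0, 0, 0, 1, 0]
ratings_b = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
kappa = cohen_kappa(ratings_a, ratings_b)
print(f"kappa = {kappa:.2f}")
```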
Methods: The observation period for this study started on October 1, 2003 and ended on May 1, 2004 and included 30 air rescue missions. Data and information were collected prospectively.
Results: The Air Mercy Service in Cape Town Province responded to 30 requests for help. Twenty-five accidents were attributed to inability to detach the kite from the harness. Injuries occurred in five incidents and included fractures of the upper arm, ribs and ankle, and lacerations and contusions to the head and neck. Two patients suffered from hypothermia and one experienced severe exhaustion. All surfers were rescued successfully and there were no fatal accidents.
Discussion: The risk potential of this new sport is unclear. Dangerous situations can occur despite proper training and safety precautions due to unpredictable conditions and difficulties with equipment. Safety should be stressed. Surfers should sail with a fellow kiter and should wear a life vest. More efforts must be made to make this booming new water sport safer.
Biobanks include biological samples and attached databases. Human biobanks occur in research, technological development and medical activities. Population genomics
is highly dependent on the availability of large biobanks. Ethical issues must be
considered: protecting the rights of those people whose samples or data are in
biobanks (information, autonomy, confidentiality, protection of private life), assuring
the non-commercial use of human body elements and the optimal use of samples
and data. They balance other issues, such as protecting the rights of researchers
and companies, allowing long-term use of biobanks while detailed information on
future uses is not available. At the level of populations, the traditional form of
informed consent is challenged. Other dimensions relate to the rights of a group
as such, in addition to individual rights. Conditions of return of results and/or
benefit to a population need to be defined. With ‘large-scale biobanking’ a marked
trend in genomics, new societal dimensions appear, regarding communication, debate,
regulation, societal control and valorization of such large biobanks. Exploring how
genomics can help health sector biobanks to become more rationally constituted
and exploited is an interesting perspective. For example, evaluating how genomic
approaches can help in optimizing haematopoietic stem cell donor registries using
new markers and high-throughput techniques to increase immunogenetic variability
in such registries is a challenge currently being addressed. Ethical issues in such
contexts are important, as not only individual decisions or projects are concerned,
but also national policies in the international arena and organization of democratic
debate about science, medicine and society.