Transcriptional regulation is one of the most basic regulatory mechanisms in the cell. The accumulation of multiple metazoan genome sequences and the advent of high-throughput experimental techniques have motivated the development of a large number of bioinformatics methods for the detection of regulatory motifs. The regulatory process is extremely complex, and individual computational algorithms typically have very limited success in genome-scale studies. Here, we argue the importance of integrating multiple computational algorithms and present an infrastructure that integrates eight web services covering key areas of transcriptional regulation. We have adopted client-side integration technology and built a consistent input and output environment with a versatile visualization tool named SeqVISTA. The infrastructure will allow for easy integration of gene regulation analysis software that is scattered over the Internet. It will also enable bench biologists to perform a wide range of analyses using cutting-edge methods in a familiar environment, and bioinformatics researchers to focus on developing new algorithms without the need to invest substantial effort in complex pre- or post-processors. SeqVISTA is freely available to academic users and can be launched online at http://zlab.bu.edu/SeqVISTA/web.jnlp, provided that Java Web Start has been installed. In addition, a stand-alone version of the program can be downloaded from http://zlab.bu.edu/SeqVISTA and run locally.
Widespread adoption of high-throughput sequencing has greatly increased the scale and sophistication of computational infrastructure needed to perform genomic research. An alternative to building and maintaining local infrastructure is “cloud computing”, which, in principle, offers on demand access to flexible computational infrastructure. However, cloud computing resources are not yet suitable for immediate “as is” use by experimental biologists.
We present a cloud resource management system that makes it possible for individual researchers to compose and control an arbitrarily sized compute cluster on Amazon’s EC2 cloud infrastructure without any informatics requirements. Within this system, an entire suite of biological tools packaged by the NERC Bio-Linux team (http://nebc.nerc.ac.uk/tools/bio-linux) is available for immediate consumption. The provided solution makes it possible, using only a web browser, to create a completely configured compute cluster ready to perform analysis in less than five minutes. Moreover, we provide an automated method for building custom deployments of cloud resources. This approach promotes reproducibility of results and, if desired, allows individuals and labs to add or customize an otherwise available cloud system to better meet their needs.
The knowledge and effort required to deploy a compute cluster in the Amazon EC2 cloud are not trivial. The solution presented in this paper eliminates these barriers, making it possible for researchers to deploy exactly the amount of computing power they need, combined with a wealth of existing analysis software, to handle the ongoing data deluge.
Laboratory Information Management Systems (LIMS) are an increasingly important part of modern laboratory infrastructure. As typically very sophisticated software products, LIMS often require considerable resources to select, deploy and maintain. Larger organisations may have access to specialist IT support to assist with requirements elicitation and software customisation; smaller groups, however, will often have limited IT support to perform the kind of iterative development that can resolve the difficulties that biologists often have when specifying requirements. Translational medicine aims to accelerate the process of treatment discovery by bringing together multiple disciplines to discover new approaches to treating disease, or novel applications of existing treatments. The diverse set of disciplines and the complexity of the processing procedures involved, especially with the use of high-throughput technologies, make it difficult to customize a generic LIMS to provide a single system for managing sample-related data within a translational medicine research setting, especially where limited IT support is available.
We have designed and developed a LIMS, BonsaiLIMS, around a very simple data model that can be easily implemented using a variety of technologies, and can be easily extended as specific requirements dictate. A reference implementation using an Oracle 11g database and the Python framework Django is presented.
By focusing on a minimal feature set and a modular design we have been able to deploy the BonsaiLIMS system very quickly. The benefits to our institute have been the avoidance of the prolonged implementation timescales, budget overruns, scope creep, off-specifications and user fatigue issues that typify many enterprise software implementations. The transition away from using local, uncontrolled records in spreadsheet and paper formats to a centrally held, secured and backed-up database brings the immediate benefits of improved data visibility, audit and overall data quality. The open-source availability of this software allows others to rapidly implement a LIMS which in itself might sufficiently address user requirements. In situations where this software does not meet requirements, it can serve to elicit more accurate specifications from end-users for a more heavyweight LIMS by acting as a demonstrable prototype.
The enormous throughput and low cost of second-generation sequencing platforms now allow research and clinical geneticists to routinely perform single experiments that identify tens of thousands to millions of variant sites. Existing methods to annotate variant sites using information from publicly available databases via web browsers are too slow to be useful for the large sequencing datasets being routinely generated by geneticists. Because sequence annotation of variant sites is required before functional characterization can proceed, the lack of a high-throughput pipeline to efficiently annotate variant sites can act as a significant bottleneck in genetics research.
SeqAnt (Sequence Annotator) is an open source web service and software package that rapidly annotates DNA sequence variants and identifies recessive or compound heterozygous loci in human, mouse, fly, and worm genome sequencing experiments. Variants are characterized with respect to their functional type, frequency, and evolutionary conservation. Annotated variants can be viewed in a web browser, downloaded as a tab-delimited text file, or directly uploaded in BED format to the UCSC genome browser. To demonstrate the speed of SeqAnt, we annotated a series of publicly available datasets that ranged in size from 37 to 3,439,107 variant sites. The total time to completely annotate these data ranged from 0.17 seconds to 28 minutes 49.8 seconds.
SeqAnt is an open source web service and software package that overcomes a critical bottleneck facing research and clinical geneticists using second-generation sequencing platforms. SeqAnt will prove especially useful for those investigators who lack dedicated bioinformatics personnel or infrastructure in their laboratories.
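To put the reported timings in perspective, the following minimal Python sketch converts the two end points quoted above into average throughput. It assumes that the shortest time corresponds to the smallest dataset and the longest time to the largest. The function names are illustrative and are not part of SeqAnt. The second helper shows, for a single-nucleotide variant only, how a 1-based position maps onto the 0-based, half-open interval convention used by the BED format; the variant shown is made up.

```python
def variants_per_second(n_variants, seconds):
    """Average annotation throughput for one dataset."""
    return n_variants / seconds


def snv_to_bed(chrom, pos_1based, name):
    """Format a single-nucleotide variant as a BED line.

    BED intervals are 0-based and half-open, so a 1-based position p
    becomes the interval [p - 1, p).
    """
    return f"{chrom}\t{pos_1based - 1}\t{pos_1based}\t{name}"


# Timings quoted in the abstract: 37 sites in 0.17 s,
# 3,439,107 sites in 28 min 49.8 s.
small = variants_per_second(37, 0.17)
large = variants_per_second(3_439_107, 28 * 60 + 49.8)

print(f"smallest dataset: {small:.0f} variant sites per second")
print(f"largest dataset:  {large:.0f} variant sites per second")

# Hypothetical variant, for illustration only.
print(snv_to_bed("chr1", 12345, "example_snv"))
```

Under that assumption, the largest dataset works out to roughly 2,000 variant sites per second. The smaller figure for the 37-site dataset may reflect a fixed start-up cost, although the abstract does not say so.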
The advancement of the computational biology field hinges on progress in three fundamental directions: the development of new computational algorithms, the availability of informatics resource management infrastructures, and the capability of tools to interoperate and synergize. There is an explosion in algorithms and tools for computational biology, which makes it difficult for biologists to find, compare and integrate such resources. We describe a new infrastructure, iTools, for managing the query, traversal and comparison of diverse computational biology resources. Specifically, iTools stores information about three types of resources: data, software tools and web-services. The iTools design, implementation and resource meta-data content reflect the broad research, computational, applied and scientific expertise available at the seven National Centers for Biomedical Computing. iTools provides a system for classification, categorization and integration of different computational biology resources across space-and-time scales, biomedical problems, computational infrastructures and mathematical foundations. A large number of resources are already iTools-accessible to the community and this infrastructure is rapidly growing. iTools includes human and machine interfaces to its resource meta-data repository. Investigators or computer programs may utilize these interfaces to search, compare, expand, revise and mine meta-data descriptions of existing computational biology resources. We propose two ways to browse and display the iTools dynamic collection of resources. The first one is based on an ontology of computational biology resources, and the second one is derived from hyperbolic projections of manifolds or complex structures onto planar discs. iTools is an open source project in terms of both its source code development and its meta-data content. iTools employs a decentralized, portable, scalable and lightweight framework for long-term resource management.
We demonstrate several applications of iTools as a framework for integrated bioinformatics. iTools and the complete details about its specifications, usage and interfaces are available at the iTools web page http://iTools.ccb.ucla.edu.
“Core facilities” have become an integral part of modern biomedical research infrastructures and today require integrated management tools to help ensure their optimization for research. Web-based software presents great opportunities, but requires innovation inasmuch as the current generation of laboratory information management systems (LIMS) mostly consists of automatons overseeing predictable laboratory equipment processes. By contrast, core facility processes involve not just equipment and services, but also people: facility staff, users, administrators, PIs etc., each with their own exigencies and unpredictability. Here we present software developed during the last ten years in an academic-commercial partnership, which began as a community-driven effort. It was conceived from the outset as a core facility management tool to answer the specific needs of multi-facility research infrastructures. We used an innovative ethnographic approach whereby the software design extrapolates manifold use-cases using a framework of co-existing rules and policy matrices that can be constantly tuned by administrators. The system can vary the outcome of multiple processes, integral to multiple users and multiple facilities in parallel, based upon real-time context and diverse metadata that define, for example, a service, a piece of equipment, a preventative action, a training or even a metrology. Here we report on the performance and impact of the software in the single case history of the Institut Pasteur Paris, where the software was conceived. Our analyses of the facility evolution and the software's development during ten years reveal evangelization of the user community, driven by improvements that better answer the needs of the scientific community. Our results demonstrate a key role for the software in bolstering long-term downstream benefits including increased funding, higher scientific output, and quality assurances for services rendered.
Recent advances in high-throughput cDNA sequencing (RNA-seq) can reveal new genes and splice variants and quantify expression genome-wide in a single assay. The volume and complexity of data from RNA-seq experiments necessitate scalable, fast and mathematically principled analysis software. TopHat and Cufflinks are free, open-source software tools for gene discovery and comprehensive expression analysis of high-throughput mRNA sequencing (RNA-seq) data. Together, they allow biologists to identify new genes and new splice variants of known ones, as well as compare gene and transcript expression under two or more conditions. This protocol describes in detail how to use TopHat and Cufflinks to perform such analyses. It also covers several accessory tools and utilities that aid in managing data, including CummeRbund, a tool for visualizing RNA-seq analysis results. Although the procedure assumes basic informatics skills, these tools assume little to no background with RNA-seq analysis and are meant for novices and experts alike. The protocol begins with raw sequencing reads and produces a transcriptome assembly, lists of differentially expressed and regulated genes and transcripts, and publication-quality visualizations of analysis results. The protocol's execution time depends on the volume of transcriptome sequencing data and available computing resources but takes less than 1 d of computer time for typical experiments and ~1 h of hands-on time.
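The sequence of steps such a protocol walks through (spliced alignment, per-sample assembly, merging of assemblies, differential expression testing) can be sketched as a small Python helper that only builds the command lines and does not run them. This is a rough outline under stated assumptions, not the protocol itself: the file names, sample labels and output directory names are hypothetical, the flags are the commonly used ones for these tools and should be checked against the installed versions, and `assemblies.txt` is assumed to be a text file listing each per-sample `transcripts.gtf`.

```python
import shlex


def tuxedo_commands(samples, genome_index, genome_fasta, annotation_gtf, threads=8):
    """Build argv lists for a TopHat/Cufflinks differential expression run.

    samples maps a condition label to a list of (reads_1, reads_2) FASTQ pairs.
    Nothing is executed; the caller decides how to run the commands.
    """
    p = str(threads)
    commands = []
    bams_by_condition = {}
    for condition, read_pairs in samples.items():
        for i, (reads_1, reads_2) in enumerate(read_pairs, start=1):
            tag = f"{condition}_R{i}"
            # 1. Align the reads to the genome with TopHat.
            commands.append(["tophat", "-p", p, "-G", annotation_gtf,
                             "-o", f"{tag}_thout", genome_index, reads_1, reads_2])
            bam = f"{tag}_thout/accepted_hits.bam"
            # 2. Assemble transcripts for this sample with Cufflinks.
            commands.append(["cufflinks", "-p", p, "-o", f"{tag}_clout", bam])
            bams_by_condition.setdefault(condition, []).append(bam)
    # 3. Merge the per-sample assemblies listed in assemblies.txt.
    commands.append(["cuffmerge", "-g", annotation_gtf, "-s", genome_fasta,
                     "-p", p, "assemblies.txt"])
    # 4. Compare expression between conditions; replicates are comma-separated.
    commands.append(["cuffdiff", "-o", "diff_out", "-b", genome_fasta, "-p", p,
                     "-L", ",".join(bams_by_condition),
                     "-u", "merged_asm/merged.gtf"]
                    + [",".join(bams) for bams in bams_by_condition.values()])
    return commands


# Hypothetical two-condition experiment with one paired-end replicate each.
example = tuxedo_commands(
    {"C1": [("C1_R1_1.fq", "C1_R1_2.fq")], "C2": [("C2_R1_1.fq", "C2_R1_2.fq")]},
    genome_index="genome", genome_fasta="genome.fa", annotation_gtf="genes.gtf",
)
for argv in example:
    print(shlex.join(argv))
```

Returning argument lists rather than shell strings keeps the sketch easy to inspect and to hand to a job scheduler; the resulting Cuffdiff output directory is what a visualization tool such as CummeRbund would then read.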
High-throughput “omics” technologies bring new opportunities for biological and biomedical researchers to ask complex questions and gain new scientific insights. However, the voluminous, complex, and context-dependent data being maintained in heterogeneous and distributed environments, together with the lack of well-defined data standards and standardized nomenclature, pose a major challenge that requires advanced computational methods and bioinformatics infrastructures for integration, mining, visualization, and comparative analysis to facilitate data-driven hypothesis generation and biological knowledge discovery. In this paper, we present the challenges in high-throughput “omics” data integration and analysis, introduce a protein-centric approach for systems integration of large and heterogeneous high-throughput “omics” data including microarray, mass spectrometry, protein sequence, protein structure, and protein interaction data, and use a scientific case study to illustrate how one can use varied “omics” data from different laboratories to make useful connections that could lead to new biological knowledge.
Increasingly large amounts of DNA sequencing data are being generated within the Wellcome Trust Sanger Institute (WTSI). The traditional file system struggles to handle these increasing amounts of sequence data. A good data management system therefore needs to be implemented and integrated into the current WTSI infrastructure. Such a system enables good management of the IT infrastructure of the sequencing pipeline and allows biologists to track their data.
We have chosen a data grid system, iRODS (Rule-Oriented Data management systems), to act as the data management system for the WTSI. iRODS provides a rule-based system management approach which makes data replication much easier and provides extra data protection. Unlike the metadata provided by traditional file systems, the metadata system of iRODS is comprehensive and allows users to customize their own application level metadata. Users and IT experts in the WTSI can then query the metadata to find and track data.
The aim of this paper is to describe how we designed and used (from both system and user viewpoints) iRODS as a data management system. Details are given about the problems faced and the solutions found when iRODS was implemented. A simple use case describing how users within the WTSI use iRODS is also introduced.
iRODS has been implemented and works as the production system for the sequencing pipeline of the WTSI. Both biologists and IT experts can now track and manage data, which could not previously be achieved. This novel approach allows biologists to define their own metadata and query the genomic data using those metadata.
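The idea of user-defined metadata that can later be queried to find data can be illustrated with a small, self-contained Python sketch. This is a toy in-memory model and not the iRODS API: the class name, the attribute names and the file paths are all hypothetical, and a real deployment would store and query metadata through iRODS itself.

```python
from collections import defaultdict


class MetadataCatalog:
    """Toy in-memory catalogue of attribute-value metadata on data objects."""

    def __init__(self):
        # Maps an object path to the set of (attribute, value) pairs attached to it.
        self._avus = defaultdict(set)

    def add(self, path, attribute, value):
        """Attach one user-defined attribute-value pair to a data object."""
        self._avus[path].add((attribute, str(value)))

    def query(self, **criteria):
        """Return the sorted paths whose metadata match every attribute=value pair."""
        wanted = {(attribute, str(value)) for attribute, value in criteria.items()}
        return sorted(path for path, avus in self._avus.items() if wanted <= avus)


# Hypothetical sequencing files and attributes, for illustration only.
catalog = MetadataCatalog()
catalog.add("/seq/1234/1234_1.bam", "study", "example_study")
catalog.add("/seq/1234/1234_1.bam", "lane", 1)
catalog.add("/seq/1234/1234_2.bam", "study", "example_study")
catalog.add("/seq/1234/1234_2.bam", "lane", 2)
catalog.add("/seq/5678/5678_1.bam", "study", "other_study")
catalog.add("/seq/5678/5678_1.bam", "lane", 1)

print(catalog.query(study="example_study"))
print(catalog.query(study="example_study", lane=1))
```

The point of the sketch is that files are located by what they describe (study, lane) rather than by where they sit in a directory tree, which is the kind of tracking the abstract attributes to the metadata system.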
Whereas genomic data are universally machine-readable, data arising from imaging, multiplex biochemistry, flow cytometry and other cell- and tissue-based assays usually reside in loosely organized files of poorly documented provenance. This arises because the relational databases used in genomic research are difficult to adapt to rapidly evolving experimental designs, data formats and analytic algorithms. Here we describe an adaptive approach to managing experimental data based on semantically-typed data hypercubes (SDCubes) that combine Hierarchical Data Format 5 (HDF5) and Extensible Markup Language (XML) file types. We demonstrate the application of SDCube-based storage using ImageRail, a software package for high-throughput microscopy. Experimental design and its day-to-day evolution, not rigid standards, determine how ImageRail data are organized in SDCubes. We apply ImageRail to the collection and analysis of drug dose-response landscapes in human cell lines at the single-cell level.
The High-Performance Computing and Communications (HPCC) program is a multiagency federal effort to advance the state of computing and communications and to provide the technologic platform on which the National Information Infrastructure (NII) can be built. The HPCC program supports the development of high-speed computers, high-speed telecommunications, related software and algorithms, education and training, and information infrastructure technology and applications. The vision of the NII is to extend access to high-performance computing and communications to virtually every U.S. citizen so that the technology can be used to improve the civil infrastructure, lifelong learning, energy management, health care, etc. Development of the NII will require resolution of complex economic and social issues, including information privacy. Health-related applications supported under the HPCC program and NII initiatives include connection of health care institutions to the Internet; enhanced access to gene sequence data; the "Visible Human" Project; and test-bed projects in telemedicine, electronic patient records, shared informatics tool development, and image systems.
One of the major difficulties for many laboratories setting up proteomics programs has been obtaining and maintaining the computational infrastructure required for the analysis of the large flow of proteomics data. We describe a system that combines distributed cloud computing and open source software to allow laboratories to set up scalable virtual proteomics analysis clusters without the investment in computational hardware or software licensing fees. Additionally, the pricing structure of distributed computing providers, such as Amazon Web Services, allows laboratories or even individuals to have large-scale computational resources at their disposal at a very low cost per run. We provide detailed step-by-step instructions on how to implement the virtual proteomics analysis clusters, as well as a list of currently available preconfigured Amazon machine images containing the OMSSA and X!Tandem search algorithms and sequence databases, on the Medical College of Wisconsin Proteomics Center website (http://proteomics.mcw.edu/vipdac).
mass spectrometry; data analysis; search algorithms; software; cloud computing
In systems biology, and many other areas of research, there is a need for the interoperability of tools and data sources that were not originally designed to be integrated. Due to the interdisciplinary nature of systems biology, and its association with high throughput experimental platforms, there is an additional need to continually integrate new technologies. As scientists work in isolated groups, integration with other groups is rarely a consideration when building the required software tools.
We illustrate an approach, through the discussion of a purpose-built software architecture, which allows disparate groups to reuse tools and access data sources in a common manner. The architecture allows for: the rapid development of distributed applications; interoperability, so it can be used by a wide variety of developers and computational biologists; development using standard tools, so that it is easy to maintain and does not require a large development effort; extensibility, so that new technologies and data types can be incorporated; and non-intrusive development, insofar as researchers need not adhere to a pre-existing object model.
By using a relatively simple integration strategy, based upon a common identity system and dynamically discovered interoperable services, a light-weight software architecture can become the focal point through which scientists can both get access to and analyse the plethora of experimentally derived data.
Reliable access to basic services can improve a community's resilience to HIV/AIDS. Accordingly, work is being done to upgrade the physical infrastructure in affected areas, often employing a strategy of decentralised service provision. Spatial characteristics are one of the major determinants in implementing services, even in the smaller municipal areas, and good quality spatial information is needed to inform decision making processes. However, limited funds, technical infrastructure and human resource capacity result in little or no access to spatial information for crucial infrastructure development decisions at local level.
This research investigated whether it would be possible to develop a GIS for basic infrastructure planning and management at local level. Given the resource constraints of the local government context, particularly in small municipalities, it was decided that open source software should be used for the prototype system.
The design and development of a prototype system illustrated that it is possible to develop an open source GIS that can be used within the context of local information management. Usability tests show a high degree of usability for the system, which is important considering the heavy workload and high staff turnover that characterise local government in South Africa. Local infrastructure management stakeholders interviewed in a case study of a South African municipality see the potential for the use of GIS as a communication tool and are generally positive about the use of GIS for these purposes. They note security issues that may arise through the sharing of information, lack of skills and resource constraints as the major barriers to adoption.
The case study shows that spatial information is an identified need at local level. Open source GIS software can be used to develop a system to provide local-level stakeholders with spatial information. However, the suitability of the technology is only a part of the system – there are wider information and management issues which need to be addressed before the implementation of a local-level GIS for infrastructure management can be successful.
High-throughput automated sequencing has enabled an exponential growth rate of sequencing data. This requires increasing sequence quality and reliability in order to avoid database contamination with artefactual sequences. The arrival of pyrosequencing enhances this problem and necessitates customisable pre-processing algorithms.
SeqTrim has been implemented both as a Web and as a standalone command line application. Already-published and newly-designed algorithms have been included to identify sequence inserts, to remove low quality, vector, adaptor, low complexity and contaminant sequences, and to detect chimeric reads. The availability of several input and output formats allows its inclusion in sequence processing workflows. Due to its specific algorithms, SeqTrim outperforms other pre-processors implemented as Web services or standalone applications. It performs equally well with sequences from EST libraries, SSH libraries, genomic DNA libraries and pyrosequencing reads and does not lead to over-trimming.
SeqTrim is an efficient pipeline designed for pre-processing of any type of sequence read, including next-generation sequencing. It is easily configurable and provides a friendly interface that allows users to know what happened with sequences at every pre-processing stage, and to verify pre-processing of an individual sequence if desired. The recommended pipeline reveals more information about each sequence than previously described pre-processors and can discard more sequencing or experimental artefacts.
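One of the pre-processing stages mentioned above, removal of low-quality sequence, can be illustrated with a deliberately simple Python sketch. It is not SeqTrim's algorithm, which the abstract does not specify: it merely clips bases below a quality threshold from both ends of a read and rejects reads that end up too short. The function names, the Phred+33 quality encoding, and the thresholds are assumptions made for the example.

```python
def phred33(quality_string):
    """Decode a FASTQ quality string, assuming the Phred+33 encoding."""
    return [ord(ch) - 33 for ch in quality_string]


def trim_low_quality_ends(seq, quals, min_q=20):
    """Clip bases with quality below min_q from both ends of a read."""
    if len(seq) != len(quals):
        raise ValueError("sequence and quality lists must have the same length")
    left, right = 0, len(seq)
    while left < right and quals[left] < min_q:
        left += 1
    while right > left and quals[right - 1] < min_q:
        right -= 1
    return seq[left:right], quals[left:right]


def preprocess(seq, quality_string, min_q=20, min_len=4):
    """Return the trimmed read, or None if too little of it survives."""
    trimmed, _ = trim_low_quality_ends(seq, phred33(quality_string), min_q)
    return trimmed if len(trimmed) >= min_len else None


# Made-up read: '#' encodes quality 2 and 'I' encodes quality 40 in Phred+33.
print(preprocess("ACGTACGTAC", "##IIIIII##"))
print(preprocess("ACGTACGTAC", "####II####"))
```

A production pre-processor would add further stages (vector, adaptor, contaminant and low-complexity removal, chimera detection) and would report what was removed at each stage, as the abstract describes.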
The data generated during the course of a biological experiment or study can sometimes be massive, and its management becomes critical for the success of the investigation undertaken. The accumulation and analysis of such large datasets often becomes tedious for biologists and lab technicians. Most of the current phenotype data acquisition management systems do not cater to the specialized needs of large-scale data analysis. The successful application of genomic tools and strategies to introduce desired traits in plants requires extensive and precise phenotyping of plant populations or gene bank material, thus necessitating an efficient data acquisition system.
Here we describe newly developed software, "PHENOME", for high-throughput phenotyping, which allows researchers to accumulate, categorize, and manage large volumes of phenotypic data. In this study, a large number of individual tomato plants were phenotyped with the "PHENOME" application using a Personal Digital Assistant (PDA) with a built-in barcode scanner, in concert with a customized database designed for handling large populations.
The phenotyping of large populations of plants, both in the laboratory and in the field, is managed very efficiently using a PDA. The data are transferred to one or more specialized databases, where they can be further analyzed and catalogued. "PHENOME" aids the collection and analysis of data obtained in large-scale mutagenesis, assessment of quantitative trait loci (QTLs), raising of mapping populations, sampling of several individuals in one or more ecological niches, etc.
In order to carry out an accurate diagnosis, prognosis, and/or therapeutic assessment for a disease, high-throughput approaches to examine the whole genome and transcriptome are now a necessity for modern research. Furthermore, to more fully understand the underlying causes of disease, high-throughput genomics are required to examine global gene expression, regulation, and interactions. To meet these and the future needs of the research community at Dartmouth, the latest technologies in deep sequencing and microarrays are offered. In addition, we offer the following: specialized, expensive, high-end instrumentation; expert staffing; cost-effectiveness for individual labs; pricing and services competitive with outside sources; an on-site Norris Cotton Cancer Center facility; free experimental design consultations; competitive fee-for-service charges for all high-throughput approaches; and close proximity to Biostatistics and Bioinformatics shared resources.
Haplotypic sequences contain significantly more information than genotypes of genetic markers and are critical for studying disease association and genome evolution. Current methods for obtaining haplotypic sequences require the physical separation of alleles before sequencing, are time consuming and are not scalable for large surveys of genetic variation. We have developed a novel method for acquiring haplotypic sequences from long PCR products using simple, high-throughput techniques. This method applies modified shotgun sequencing protocols to sequence both alleles concurrently, with read-pair information allowing the two alleles to be separated during sequence assembly. Although the haplotypic sequences can be assembled manually from the resultant data using pre-existing sequence assembly software, we have devised a novel heuristic algorithm to automate assembly and remove human error. We validated the approach on two long PCR products amplified from the human genome and confirmed the accuracy of our sequences against full-length clones of the same alleles. This method presents a simple high-throughput means to obtain full haplotypic sequences potentially up to 20 kb in length and is suitable for surveying genetic variation even in poorly characterized genomes, as it requires no prior information on sequence variation.
Linking genotypic and phenotypic information is one of the greatest challenges of current genetics research. The definition of an Information Technology infrastructure to support this kind of study, and in particular studies aimed at the analysis of complex traits, which require the definition of multifaceted phenotypes and the integration of genotypic information to investigate the most prevalent diseases, is a paradigmatic goal of Biomedical Informatics. This paper describes the use of Information Technology methods and tools to develop a system for the management, inspection and integration of phenotypic and genotypic data.
We present the design and architecture of the Phenotype Miner, a software system able to flexibly manage phenotypic information, and its extended functionalities to retrieve genotype information from external repositories and to relate it to phenotypic data. For this purpose we developed a module to allow customized data upload by the user and a SOAP-based communications layer to retrieve data from existing biomedical knowledge management tools. In this paper we also demonstrate the system functionality by an example application of the system in which we analyze two related genomic datasets.
In this paper we show how a comprehensive, integrated and automated workbench for genotype and phenotype integration can facilitate and improve the hypothesis generation process underlying modern genetic studies.
To operate effectively, the public health system requires infrastructure and the capacity to act. Public health's ability to attract funding for infrastructure and capacity development would be enhanced if it were able to demonstrate what level of capacity is required to ensure a high-performing system. Australia's public health activities are undertaken within a complex organizational framework that involves three levels of government and a diverse range of other organizations. The question of appropriate levels of infrastructure and capacity is critical at each level. Comparatively little is known about infrastructure and capacity at the local level.
In-depth interviews were conducted with senior managers in two Australian states with different frameworks for health administration. They were asked to reflect on the critical components of infrastructure and capacity required at the local level. The interviews were analyzed to identify the major themes. Workshops with public health experts explored this data further. The information generated was used to develop a tool, designed to be used by groups of organizations within discrete geographical locations to assess local public health capacity.
Local actors in these two different systems pointed to similar areas for inclusion for the development of an instrument to map public health capacity at the local level. The tool asks respondents to consider resources, programs and the cultural environment within their organization. It also asks about the policy environment - recognizing that the broader environment within which organizations operate impacts on their capacity to act. Pilot testing of the tool pointed to some of the challenges involved in such an exercise, particularly if the tool were to be adopted as policy.
This research indicates that it is possible to develop a tool for the systematic assessment of public health capacity at the local level. Piloting the tool revealed some concerns amongst participants, particularly about how the tool would be used. However there was also recognition that the areas covered by the tool were those considered relevant.
We present GobyWeb, a web-based system that facilitates the management and analysis of high-throughput sequencing (HTS) projects. The software provides integrated support for a broad set of HTS analyses and offers a simple plugin extension mechanism. Analyses currently supported include quantification of gene expression for messenger and small RNA sequencing, estimation of DNA methylation (i.e., reduced bisulfite sequencing and whole genome methyl-seq), and the detection of pathogens in sequenced data. In contrast to previous analysis pipelines developed for analysis of HTS data, GobyWeb requires significantly less storage space, runs analyses efficiently on a parallel grid, and scales gracefully to process tens or hundreds of multi-gigabyte samples, yet can be used effectively by researchers who are comfortable using a web browser. We conducted performance evaluations of the software and found it to either outperform or have similar performance to analysis programs developed for specialized analyses of HTS data. We found that most biologists who took a one-hour GobyWeb training session were readily able to analyze RNA-Seq data with state-of-the-art analysis tools. GobyWeb can be obtained at http://gobyweb.campagnelab.org and is freely available for non-commercial use. GobyWeb plugins are distributed as source code and licensed under the open source LGPL3 license to facilitate code inspection, reuse and independent extensions; they are available at http://github.com/CampagneLaboratory/gobyweb2-plugins.
Recent advances in high-throughput technologies have dramatically increased biological data generation. However, many research groups lack computing facilities and specialists, an obstacle that remains to be addressed. Here, we present a Linux distribution, LXtoo, to provide a flexible computing platform for bioinformatics analysis.
Unlike most existing live Linux distributions for bioinformatics, which are limited to sequence analysis and protein structure prediction, LXtoo incorporates a comprehensive collection of bioinformatics software, including data mining tools for microarray and proteomics data, protein-protein interaction analysis, and computationally complex tasks such as molecular dynamics. Moreover, most of the programs have been configured and optimized for high performance computing.
LXtoo aims to provide a well-supported computing environment tailored for bioinformatics research, reducing duplication of effort in building computing infrastructure. LXtoo is distributed as a Live DVD and is freely available at http://bioinformatics.jnu.edu.cn/LXtoo.
Bioinformatics; Software; Linux; Operating system
Cloud computing is a concept wherein a computer grid is created using the Internet with the sole purpose of utilizing shared resources, such as computer software and hardware, on a pay-per-use model. Using Cloud computing, radiology users can efficiently manage multimodality imaging units by using the latest software and hardware without paying huge upfront costs. Cloud computing systems usually work on public, private, hybrid, or community models. Using the various components of a Cloud, such as applications, client, infrastructure, storage, services, and processing power, Cloud computing can help imaging units rapidly scale operations up and down and avoid huge spending on the maintenance of costly applications and storage. Cloud computing allows flexibility in imaging. It frees radiology from the confines of a hospital and creates a virtual mobile office. The downsides of Cloud computing involve security and privacy issues, which need to be addressed to ensure the success of Cloud computing in the future.
Cloud computing; PACS; radiology; RIS; teleradiology
Recent advances in sequencing technology have created unprecedented opportunities for biological research. However, the increasing throughput of these technologies has created many challenges for data management and analysis. As the demand for sophisticated analyses increases, the development time of software and algorithms is outpacing the speed of traditional publication. As technologies continue to be developed, methods change rapidly, making publications less relevant for users. The SEQanswers wiki (SEQwiki) is a wiki database that is actively edited and updated by the members of the SEQanswers community (http://SEQanswers.com/). The wiki provides an extensive catalogue of tools, technologies and tutorials for high-throughput sequencing (HTS), including information about HTS service providers. It has been implemented in MediaWiki with the Semantic MediaWiki and Semantic Forms extensions to collect structured data, providing powerful navigation and reporting features. Within 2 years, the community has created pages for over 500 tools, with approximately 400 literature references and 600 web links. This collaborative effort has made SEQwiki the most comprehensive database of HTS tools anywhere on the web. The wiki includes task-focused mini-reviews of commonly used tools, and a growing collection of more than 100 HTS service providers. SEQwiki is available at: http://wiki.SEQanswers.com/.
Motivation: Chromatin immunoprecipitation followed by high-throughput sequencing (ChIP-seq) is widely used in biological research. ChIP-seq experiments yield many ambiguous tags that can be mapped with equal probability to multiple genomic sites. Such ambiguous tags are typically eliminated from consideration, resulting in a potential loss of important biological information.
Results: We have developed a Gibbs sampling-based algorithm for the genomic mapping of ambiguous sequence tags. Our algorithm relies on the local genomic tag context to guide the mapping of ambiguous tags. The Gibbs sampling procedure we use simultaneously maps ambiguous tags and updates the probabilities used to infer correct tag map positions. We show that our algorithm is able to correctly map more ambiguous tags than existing mapping methods. Our approach is also able to uncover mapped genomic sites from highly repetitive sequences that cannot be detected based on unique tags alone, including transposable elements, segmental duplications and peri-centromeric regions. This mapping approach should prove useful for increasing biological knowledge of the too often neglected repetitive genomic regions.
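The general idea of using local tag context within a Gibbs sampler can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names (`window_count`, `gibbs_map_tags`), the window-based weighting, the pseudocount, and all parameter defaults are assumptions made for illustration. Each ambiguous tag is repeatedly removed from its current site and reassigned to one of its candidate sites with probability proportional to the number of tags already mapped nearby, so that assignments and local tag densities are updated together.

```python
import random
from collections import Counter


def window_count(counts, pos, half_window):
    """Number of tags mapped within +/- half_window of pos (hypothetical context measure)."""
    return sum(counts.get(p, 0) for p in range(pos - half_window, pos + half_window + 1))


def gibbs_map_tags(unique_counts, ambiguous_tags, half_window=2,
                   pseudocount=0.1, n_iter=200, burn_in=50, seed=0):
    """Assign each ambiguous tag to one candidate site by Gibbs sampling.

    unique_counts:  dict mapping genomic position -> number of uniquely mapped tags
    ambiguous_tags: list of lists; each inner list holds one tag's candidate positions
    Returns the most frequently sampled site for each ambiguous tag.
    """
    if n_iter <= burn_in:
        raise ValueError("n_iter must exceed burn_in")
    rng = random.Random(seed)
    counts = Counter(unique_counts)

    # Start from a random assignment of every ambiguous tag.
    assignment = []
    for sites in ambiguous_tags:
        site = rng.choice(sites)
        assignment.append(site)
        counts[site] += 1

    tallies = [Counter() for _ in ambiguous_tags]
    for iteration in range(n_iter):
        for i, sites in enumerate(ambiguous_tags):
            # Remove the tag so that it does not influence its own reassignment.
            counts[assignment[i]] -= 1
            # Weight each candidate site by its local tag density (assumed scheme).
            weights = [window_count(counts, s, half_window) + pseudocount for s in sites]
            new_site = rng.choices(sites, weights=weights, k=1)[0]
            assignment[i] = new_site
            counts[new_site] += 1
            if iteration >= burn_in:
                tallies[i][new_site] += 1

    return [tally.most_common(1)[0][0] for tally in tallies]


if __name__ == "__main__":
    # Toy data: a cluster of unique tags around positions 100-104, nothing elsewhere.
    unique = {p: 5 for p in range(100, 105)}
    tags = [[102, 500], [102, 500], [101, 900]]
    print(gibbs_map_tags(unique, tags))
```

In this toy example the ambiguous tags are drawn toward the candidate sites that lie inside the cluster of unique tags, since those sites have far higher local tag density than the isolated alternatives. A realistic implementation would also need to account for strand, fragment length and alignment quality, none of which are modeled here.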
Supplementary Information: Supplementary data are available at Bioinformatics online.