DNA methylation is an important epigenetic modification involved in many biological processes. Bisulfite treatment coupled with high-throughput sequencing provides an effective approach for studying genome-wide DNA methylation at base resolution. Libraries such as whole genome bisulfite sequencing (WGBS) and reduced representation bisulfite sequencing (RRBS) are widely used for generating DNA methylomes, demanding efficient and versatile tools for aligning bisulfite sequencing data.
We have developed BS-Seeker2, an updated version of BS Seeker, as a full pipeline for mapping bisulfite sequencing data and generating DNA methylomes. BS-Seeker2 improves mappability over existing aligners by using local alignment. It can also map reads from RRBS libraries with improved efficiency and accuracy by building special indexes. Moreover, BS-Seeker2 provides an additional function for filtering out reads with incomplete bisulfite conversion, which is useful in minimizing the overestimation of DNA methylation levels. We also defined the CGmap and ATCGmap file formats for full representation of DNA methylomes, produced by the BS-Seeker2 pipeline together with BAM and WIG files.
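To make the per-cytosine output concrete, the following Python sketch summarises genome-wide methylation levels per context from a tab-delimited, CGmap-like file. The column order assumed here (chromosome, strand nucleotide, position, context, sub-context, methylation level, methylated reads, coverage) is an illustration, not a specification of the CGmap format.

```python
"""Summarise per-context methylation from a CGmap-like file.

Assumes a tab-delimited layout with, per cytosine: chromosome, strand
nucleotide, position, context (CG/CHG/CHH), sub-context, methylation
level, methylated read count, total coverage. The column order is an
assumption for illustration, not a definition of the CGmap format.
"""
from collections import defaultdict

def methylation_by_context(path, min_coverage=4):
    meth = defaultdict(int)   # methylated read counts per context
    cov = defaultdict(int)    # total read counts per context
    with open(path) as handle:
        for line in handle:
            fields = line.rstrip("\n").split("\t")
            context = fields[3]
            m_reads, total = int(fields[6]), int(fields[7])
            if total < min_coverage:
                continue      # skip poorly covered cytosines
            meth[context] += m_reads
            cov[context] += total
    return {c: meth[c] / cov[c] for c in cov if cov[c] > 0}

if __name__ == "__main__":
    import sys
    for context, level in sorted(methylation_by_context(sys.argv[1]).items()):
        print(f"{context}\t{level:.4f}")
```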
Our performance evaluations show that BS-Seeker2 works efficiently and accurately on both WGBS and RRBS data. BS-Seeker2 is freely available at http://pellegrini.mcdb.ucla.edu/BS_Seeker2/ and the Galaxy server.
DNA methylation; Bisulfite sequencing aligner; WGBS; RRBS; BS Seeker; Bisulfite conversion failure; Galaxy toolshed
Multiple competing bioinformatics tools exist for next-generation sequencing data analysis. Many of these tools are available as R/Bioconductor modules, and it can be challenging for the bench biologist without any programming background to quickly analyse genomics data. Here, we present an application that is designed to be simple to use, while leveraging the power of R as the analysis engine behind the scenes.
Genome Informatics Data Explorer (Guide) is a desktop application designed for the bench biologist to analyse RNA-seq and microarray gene expression data. It requires a text file of summarised read counts or expression values as input data, and performs differential expression analyses at both the gene and pathway level. It uses well-established R/Bioconductor packages such as limma for its analyses, without requiring the user to have specific knowledge of the underlying R functions. Results are presented in figures or interactive tables which integrate useful data from multiple sources such as gene annotation and orthologue data. Advanced options include the ability to edit R commands to customise the analysis pipeline.
Guide is a desktop application designed to query gene expression data in a user-friendly way while automatically communicating with R. Its customisation options make it possible to use different bioinformatics tools available through R/Bioconductor for its analyses, while keeping the core usage simple. Guide is written in the cross-platform framework of Qt, and is freely available for use from http://guide.wehi.edu.au.
Data analysis; R; Gene expression; RNA-seq; Microarray; Differential expression; Software
The advent of next-generation high-throughput technologies has revolutionized whole genome sequencing, yet some experiments require sequencing only of targeted regions of the genome from a very large number of samples. These regions can be amplified by PCR and sequenced by next-generation methods using a multidimensional pooling strategy. However, there is at present no available generalized tool for the computational analysis of target-enriched NGS data from multidimensional pools.
Here we present InsertionMapper, a pipeline tool for the identification of targeted sequences from multidimensional high throughput sequencing data. InsertionMapper consists of four independently working modules: Data Preprocessing, Database Modeling, Dimension Deconvolution and Element Mapping. We illustrate InsertionMapper with an example from our project 'New reverse genetics resources for maize', which aims to sequence-index a collection of 15,000 independent insertion sites of the transposon Ds in maize. Identified sequences are validated by PCR assays. This pipeline tool is applicable to similar scenarios requiring analysis of the tremendous output of short reads produced in NGS sequencing experiments of targeted genome sequences.
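The deconvolution step can be illustrated with a toy example: if each sample sits in exactly one pool per dimension (row, column, plate), a flanking sequence recovered from exactly one pool in every dimension maps back to a single sample. The sketch below is conceptual and does not reproduce InsertionMapper's Dimension Deconvolution module; the pool names and layout are hypothetical.

```python
"""Toy deconvolution of a 3-D pooling design (rows, columns, plates).

A conceptual sketch of dimension deconvolution, not InsertionMapper's
implementation; the pool layout and sample names are hypothetical.
"""

def deconvolve(hits_by_dimension, layout):
    """hits_by_dimension: e.g. {'row': {'R3'}, 'col': {'C7'}, 'plate': {'P1'}}
    layout: {sample_id: {'row': ..., 'col': ..., 'plate': ...}}
    Returns sample ids whose coordinates match a hit in every dimension."""
    candidates = []
    for sample, coords in layout.items():
        if all(coords[dim] in pools for dim, pools in hits_by_dimension.items()):
            candidates.append(sample)
    return candidates

layout = {
    "mu001": {"row": "R3", "col": "C7", "plate": "P1"},
    "mu002": {"row": "R3", "col": "C2", "plate": "P1"},
}
print(deconvolve({"row": {"R3"}, "col": {"C7"}, "plate": {"P1"}}, layout))
# ['mu001'] -- a unique assignment; ambiguous hits would be flagged
# for follow-up PCR validation.
```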
InsertionMapper has proven efficacious for the identification of target-enriched sequences from multidimensional high throughput sequencing data. With adjustable parameters and experiment configurations, this tool can save biologists great computational effort when identifying sequences of interest within the huge output of modern DNA sequencers. InsertionMapper is freely accessible at https://sourceforge.net/p/insertionmapper and http://bo.csam.montclair.edu/du/insertionmapper.
Next-generation sequencing; Sequence identification; Target enrichment; Multidimensional pooling
The large-scale identification of physical protein-protein interactions (PPIs) is an important step toward understanding how biological networks evolve and generate emergent phenotypes. However, experimental identification of PPIs is a laborious and error-prone process, and current methods of PPI prediction tend to be highly conservative or require large amounts of functional data that may not be available for newly-sequenced organisms.
In this study we demonstrate a random-forest based technique, ENTS, for the computational prediction of protein-protein interactions based only on primary sequence data. Our approach is able to efficiently predict interactions on a whole-genome scale for any eukaryotic organism, using pairwise combinations of conserved domains and predicted subcellular localization of proteins as input features. We present the first predicted interactome for the forest tree Populus trichocarpa in addition to the predicted interactomes for Saccharomyces cerevisiae, Homo sapiens, Mus musculus, and Arabidopsis thaliana. Comparing our approach to other PPI predictors, we find that ENTS performs comparably to or better than a number of existing approaches, including several that utilize a variety of functional information for their predictions. We also find that the predicted interactions are biologically meaningful, as indicated by similarity in functional annotations and enrichment of co-expressed genes in public microarray datasets. Furthermore, we demonstrate some of the biological insights that can be gained from these predicted interaction networks. We show that the predicted interactions yield informative groupings of P. trichocarpa metabolic pathways, literature-supported associations among human disease states, and theory-supported insight into the evolutionary dynamics of duplicated genes in paleopolyploid plants.
We conclude that the ENTS classifier will be a valuable tool for the de novo annotation of genome sequences, providing initial clues about regulatory and metabolic network topology, and revealing relationships that are not immediately obvious from traditional homology-based annotations.
High-throughput omics technologies such as microarrays and next-generation sequencing (NGS) have become indispensable tools in biological research. Computational analysis and biological interpretation of omics data can pose significant challenges due to a number of factors, in particular the systems integration required to fully exploit and compare data from different studies and/or technology platforms. In transcriptomics, the identification of differentially expressed genes when studying effect(s) or contrast(s) of interest constitutes the starting point for further downstream computational analysis (e.g. gene over-representation/enrichment analysis, reverse engineering) leading to mechanistic insights. Therefore, it is important to systematically store the full list of genes with their associated statistical analysis results (differential expression, t-statistics, p-value) corresponding to one or more effect(s) or contrast(s) of interest (hereafter termed "contrast data") in a comparable manner and to extract gene sets in order to efficiently support downstream analyses and further leverage data on a long-term basis. Filling this gap would open new research perspectives for biologists to discover disease-related biomarkers and to support the understanding of molecular mechanisms underlying specific biological perturbation effects (e.g. disease, genetic, environmental, etc.).
To address these challenges, we developed Confero, a contrast data and gene set platform for downstream analysis and biological interpretation of omics data. The Confero software platform provides storage of contrast data in a simple and standard format, data transformation to enable cross-study and cross-platform data comparison, and automatic extraction and storage of gene sets to build new a priori knowledge which is leveraged by integrated and extensible downstream computational analysis tools. Gene Set Enrichment Analysis (GSEA) and Over-Representation Analysis (ORA) are currently integrated as analysis modules, together with additional tools to support biological interpretation. Confero is a standalone system that also integrates with Galaxy, an open-source workflow management and data integration system. To illustrate Confero platform functionality we walk through major aspects of the Confero workflow and results using the Bioconductor estrogen package dataset.
Confero provides a unique and flexible platform to support downstream computational analysis facilitating biological interpretation. The system has been designed in order to provide the researcher with a simple, innovative, and extensible solution to store and exploit analyzed data in a sustainable and reproducible manner thereby accelerating knowledge-driven research. Confero source code is freely available from http://sourceforge.net/projects/confero/.
Gene expression; Contrast data; Gene set; Gene set enrichment; Omics; Microarray; Next-generation sequencing; Reproducible research system; Knowledge acquisition
Funded by the National Institutes of Health (NIH), the Model Organism ENCyclopedia of DNA Elements (modENCODE) project aims to provide the biological research community with a comprehensive encyclopedia of functional genomic elements for the model organisms C. elegans (worm) and D. melanogaster (fly). With just under 10 terabytes of data collected and released to the public, one of the challenges faced by researchers is to extract biologically meaningful knowledge from this large data set. While the basic quality control, pre-processing, and analysis of the data has already been performed by members of the modENCODE consortium, many researchers will wish to reinterpret the data set using modifications and enhancements of the original protocols, or to combine modENCODE data with other data sets. Unfortunately this can be a time-consuming and logistically challenging proposition.
In recognition of this challenge, the modENCODE DCC has released uniform computing resources for analyzing modENCODE data on Galaxy (https://github.com/modENCODE-DCC/Galaxy), on the public Amazon Cloud (http://aws.amazon.com), and on the private Bionimbus Cloud for genomic research (http://www.bionimbus.org). In particular, we have released Galaxy workflows for interpreting ChIP-seq data which use the same quality control (QC) and peak calling standards adopted by the modENCODE and ENCODE communities. For convenience of use, we have created Amazon and Bionimbus Cloud machine images containing Galaxy along with all the modENCODE data, software and other dependencies.
Using these resources provides a framework for running consistent and reproducible analyses on modENCODE data, ultimately allowing researchers to spend more of their time analyzing modENCODE data and less time moving it around.
Technical improvements have decreased sequencing costs and, as a result, the size and number of genomic datasets have increased rapidly. Because of the lower cost, large amounts of sequence data are now being produced by small to midsize research groups. Crossbow is a software tool that can detect single nucleotide polymorphisms (SNPs) in whole-genome sequencing (WGS) data from a single subject; however, Crossbow has a number of limitations when applied to multiple subjects from large-scale WGS projects. The data storage and CPU resources that are required for large-scale whole genome sequencing data analyses are too large for many core facilities and individual laboratories to provide. To help meet these challenges, we have developed Rainbow, a cloud-based software package that can assist in the automation of large-scale WGS data analyses.
Here, we evaluated the performance of Rainbow by analyzing 44 different whole-genome-sequenced subjects. Rainbow has the capacity to process genomic data from more than 500 subjects in two weeks using cloud computing provided by the Amazon Web Service. This time includes the import and export of the data using the Amazon Import/Export service. The average cost of processing a single sample in the cloud was less than 120 US dollars. Compared with Crossbow, the main improvements incorporated into Rainbow include the ability: (1) to handle BAM as well as FASTQ input files; (2) to split large sequence files for better load balance downstream; (3) to log the running metrics in data processing and to monitor multiple Amazon Elastic Compute Cloud (EC2) instances; and (4) to merge SOAPsnp outputs for multiple individuals into a single file to facilitate downstream genome-wide association studies.
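Improvement (2), splitting large sequence files for better downstream load balance, can be sketched in a few lines of Python; the chunk size and file-naming scheme below are arbitrary choices for illustration, not Rainbow's defaults.

```python
"""Split a FASTQ file into fixed-size chunks for parallel alignment.

A minimal sketch of the load-balancing idea described above; chunk size
and naming scheme are arbitrary, not Rainbow's defaults.
"""
import itertools

def split_fastq(path, reads_per_chunk=1_000_000, prefix="chunk"):
    with open(path) as handle:
        chunk_index = 0
        while True:
            # each FASTQ record spans four lines
            record_lines = list(itertools.islice(handle, 4 * reads_per_chunk))
            if not record_lines:
                break
            out_name = f"{prefix}_{chunk_index:04d}.fastq"
            with open(out_name, "w") as out:
                out.writelines(record_lines)
            chunk_index += 1
    return chunk_index

if __name__ == "__main__":
    import sys
    n = split_fastq(sys.argv[1])
    print(f"wrote {n} chunks")
```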
Rainbow is a scalable, cost-effective, and open-source tool for large-scale WGS data analysis. For human WGS data sequenced by either the Illumina HiSeq 2000 or HiSeq 2500 platforms, Rainbow can be used straight out of the box. Rainbow is available for third-party implementation and use, and can be downloaded from http://s3.amazonaws.com/jnj_rainbow/index.html.
Cloud computing; Whole genome sequencing; Single nucleotide polymorphism; SNP; Next generation sequencing; Software
Visualization plays an essential role in genomics research by making it possible to observe correlations and trends in large datasets as well as communicate findings to others. Visual analysis, which combines visualization with analysis tools to enable seamless use of both approaches for scientific investigation, offers a powerful method for performing complex genomic analyses. However, there are numerous challenges that arise when creating rich, interactive Web-based visualizations/visual analysis applications for high-throughput genomics. These challenges include managing data flow from Web server to Web browser, integrating analysis tools and visualizations, and sharing visualizations with colleagues.
We have created a platform that simplifies the creation of Web-based visualization/visual analysis applications for high-throughput genomics. This platform provides components that make it simple to efficiently query very large datasets, draw common representations of genomic data, integrate with analysis tools, and share or publish fully interactive visualizations. Using this platform, we have created a Circos-style genome-wide viewer, a generic scatter plot for correlation analysis, an interactive phylogenetic tree, a scalable genome browser for next-generation sequencing data, and an application for systematically exploring tool parameter spaces to find good parameter values. All visualizations are interactive and fully customizable. The platform is integrated with the Galaxy (http://galaxyproject.org) genomics workbench, making it easy to integrate new visual applications into Galaxy.
Visualization and visual analysis play an important role in high-throughput genomics experiments, and approaches are needed to make it easier to create applications for these activities. Our framework provides a foundation for creating Web-based visualizations and integrating them into Galaxy. Finally, the visualizations we have created using the framework are useful tools for high-throughput genomics experiments.
Visualization; Visual analysis; Galaxy; Genome browser; Circos; Phylogenetic tree
Phenomena such as incomplete lineage sorting, horizontal gene transfer, gene duplication and subsequent sub- and neo-functionalisation can result in distinct local phylogenetic relationships that are discordant with species phylogeny. In order to assess the possible biological roles for these subdivisions, they must first be identified and characterised, preferably on a large scale and in an automated fashion.
We developed Saguaro, a combination of a Hidden Markov Model (HMM) and a Self Organising Map (SOM), to characterise local phylogenetic relationships among aligned sequences using cacti, matrices of pair-wise distance measures. While the HMM determines the genomic boundaries from aligned sequences, the SOM hypothesises new cacti in an unsupervised and iterative fashion based on the regions that were modelled least well by existing cacti. After testing the software on simulated data, we demonstrate the utility of Saguaro by testing two different data sets: (i) 181 Dengue virus strains, and (ii) 5 primate genomes. Saguaro identifies regions under lineage-specific constraint for the first set, and genomic segments that we attribute to incomplete lineage sorting in the second dataset. Intriguingly for the primate data, Saguaro also classified an additional ~3% of the genome as most incompatible with the expected species phylogeny. A substantial fraction of these regions was found to overlap genes associated with both the innate and adaptive immune systems.
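The local signal that Saguaro summarises, a matrix of pairwise differences per genomic window, can be illustrated with a short Python sketch; the HMM and SOM components themselves are not reproduced here, and the window size is arbitrary.

```python
"""Per-window pairwise distance matrices from aligned sequences.

Illustrates the kind of local signal (pairwise differences per window)
summarised by Saguaro's cacti; the HMM/SOM machinery is not shown.
"""
import numpy as np

def window_distance_matrices(alignment, window=100):
    """alignment: list of equal-length sequence strings."""
    n, length = len(alignment), len(alignment[0])
    for start in range(0, length, window):
        d = np.zeros((n, n))
        for i in range(n):
            for j in range(i + 1, n):
                diffs = sum(a != b
                            for a, b in zip(alignment[i][start:start + window],
                                            alignment[j][start:start + window]))
                d[i, j] = d[j, i] = diffs
        yield start, d

aln = ["ACGTACGTAA", "ACGTACGTAA", "ACGAACGTTA"]
for start, d in window_distance_matrices(aln, window=5):
    print(start, d.tolist())
```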
Saguaro detects distinct cacti describing local phylogenetic relationships without requiring any a priori hypotheses. We have successfully demonstrated Saguaro’s utility with two contrasting data sets, one containing many members with short sequences (Dengue viral strains: n = 181, genome size = 10,700 nt), and the other with few members but complex genomes (related primate species: n = 5, genome size = 3 Gb), suggesting that the software is applicable to a wide variety of experimental populations. Saguaro is written in C++, runs on the Linux operating system, and can be downloaded from http://saguarogw.sourceforge.net/.
With the emergence of next-generation sequencing, the availability of prokaryotic genome sequences is expanding rapidly. A total of 5,276 genomes have been released since 2008, yet only 1,692 of these are complete. The final phase of microbial genome sequencing, particularly gap closing, is frequently the rate-limiting step, either because of complex genomic structures that cause sequence bias even with high genomic coverage, or because of repeat sequences that may cause gaps in assembly.
We have developed a Cytoscape plugin to facilitate gap closing for high-throughput sequencing data from microbial genomes. This plugin is capable of interactively displaying the relationships among genomic contigs derived from various sequencing formats. The sequence contigs of plasmids and special repeats (IS elements, ribosomal RNAs, terminal repeats, etc.) can be displayed as well.
Displaying relationships between contigs using graphs in Cytoscape rather than tables provides a more straightforward visual representation. This will facilitate a faster and more precise determination of the linkages among contigs and greatly improve the efficiency of gap closing.
ContigScape; Repeat contig; Microbial; Visualization; Linkage; Gap closing
The complete sequences of chloroplast genomes provide a wealth of information regarding the evolutionary history of species. With the advance of next-generation sequencing technology, the number of completely sequenced chloroplast genomes is expected to increase exponentially; powerful computational tools for annotating these genome sequences are therefore urgently needed.
We have developed a web server, CPGAVAS. The server accepts a complete chloroplast genome sequence as input. First, it predicts protein-coding and rRNA genes based on the identification and mapping of the most similar, full-length protein, cDNA and rRNA sequences by integrating results from the Blastx, Blastn, protein2genome and est2genome programs. Second, tRNA genes and inverted repeats (IR) are identified using tRNAscan, ARAGORN and vmatch, respectively. Third, it calculates the summary statistics for the annotated genome. Fourth, it generates a circular map ready for publication. Fifth, it can create a Sequin file for GenBank submission. Last, it allows the extraction of protein and mRNA sequences for a given list of genes and species. The annotation results in GFF3 format can be edited using any compatible annotation editing tools. The edited annotations can then be uploaded to CPGAVAS for repeated updates and re-analyses. Using known chloroplast genome sequences as a test set, we show that CPGAVAS performs comparably to another application, DOGMA, while having several superior functionalities.
CPGAVAS allows the semi-automatic and complete annotation of a chloroplast genome sequence, and the visualization, editing and analysis of the annotation results. It will become an indispensable tool for researchers studying chloroplast genomes. The software is freely accessible from
Chloroplast genome; Annotation; Web server; CPGAVAS
Protein-coding regions in human genes harbor 85% of the mutations that are associated with disease-related traits. Compared with whole-genome sequencing of complex samples, exome sequencing serves as an alternative option because of its dramatically reduced cost. In fact, exome sequencing has been successfully applied to identify the cause of several Mendelian disorders, such as Miller syndrome and Schinzel-Giedion syndrome. However, there remain great challenges in handling the huge volume of data generated by exome sequencing and in identifying potential disease-related genetic variations.
In this study, Exome-assistant (http://22.214.171.124/exomeassistant), a convenient tool for submitting and annotating single nucleotide polymorphisms (SNPs) and insertion/deletion variations (InDels), was developed to rapidly detect candidate disease-related genetic variations from exome sequencing projects. Versatile filter criteria are provided by Exome-assistant to meet different users’ requirements. Exome-assistant consists of four modules: the single case module, the two cases module, the multiple cases module, and the reanalysis module. The two cases and multiple cases modules allow users to identify sample-specific and common variations. The multiple cases module also supports family-based studies and Mendelian filtering. The identified candidate disease-related genetic variations can be annotated according to their sample features.
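The kind of filtering these modules perform can be illustrated with a minimal sketch that keeps rare variants recurring across affected samples; the column names and thresholds below are hypothetical and do not reflect Exome-assistant's input format or filter options.

```python
"""Filter candidate variants by population frequency and recurrence.

A conceptual sketch of the filtering described above; the input is a
tab-delimited table with hypothetical columns (gene, sample, maf) and
is not Exome-assistant's file format.
"""
import csv
from collections import defaultdict

def candidate_genes(path, max_maf=0.01, min_affected=2):
    samples_per_gene = defaultdict(set)
    with open(path, newline="") as handle:
        for row in csv.DictReader(handle, delimiter="\t"):
            if float(row["maf"]) <= max_maf:       # rare in the population
                samples_per_gene[row["gene"]].add(row["sample"])
    # keep genes carrying rare variants in at least min_affected cases
    return sorted(g for g, s in samples_per_gene.items() if len(s) >= min_affected)
```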
In summary, by exploring exome sequencing data, Exome-assistant can provide researchers with detailed biological insights into genetic variation events and permits the identification of potential genetic causes of human diseases and related traits.
Next generation sequencing; Mendelian disease; Single nucleotide polymorphisms; Insertions and deletions; Variation filtering; Minor allele frequency
Batch effects are a type of variability that is not of primary interest but is ubiquitous in sizable genomic experiments. To minimize the impact of batch effects, an ideal experimental design should ensure the even distribution of biological groups and confounding factors across batches. However, due to practical complications, the final collection of samples in a genomics study may be unbalanced and incomplete, which, without appropriate attention to sample-to-batch allocation, could lead to drastic batch effects. Therefore, it is necessary to develop an effective and handy tool for assigning collected samples across batches in a way that minimizes the impact of batch effects.
We describe OSAT (Optimal Sample Assignment Tool), a Bioconductor package designed for automated sample-to-batch allocation in genomics experiments.
OSAT was developed to facilitate the allocation of collected samples to batches in genomics studies. By optimizing the even distribution of samples from groups of biological interest across batches, it reduces the confounding or correlation between batches and the biological variables of interest. It can also optimize the homogeneous distribution of confounding factors across batches. It handles challenging instances involving incomplete and unbalanced sample collections as well as ideally balanced designs.
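A toy illustration of the allocation objective, balancing the proportion of a biological group across batches by randomised swapping, is sketched below; OSAT's actual optimisation strategy and R interface are not reproduced here.

```python
"""Randomised search for a balanced sample-to-batch allocation.

A toy illustration of the objective described above (even spread of a
biological group across batches); not OSAT's algorithm or interface.
"""
import random
from collections import Counter

def imbalance(assignment, groups):
    # sum over batches of squared deviation from the overall case ratio
    per_batch, totals = Counter(), Counter()
    for sample, batch in assignment.items():
        totals[batch] += 1
        if groups[sample] == "case":
            per_batch[batch] += 1
    overall = sum(1 for g in groups.values() if g == "case") / len(groups)
    return sum((per_batch[b] / totals[b] - overall) ** 2 for b in totals)

def allocate(samples, groups, n_batches=4, iterations=5000, seed=0):
    rng = random.Random(seed)
    best = {s: i % n_batches for i, s in enumerate(samples)}  # round robin start
    best_score = imbalance(best, groups)
    for _ in range(iterations):
        trial = dict(best)
        a, b = rng.sample(samples, 2)
        trial[a], trial[b] = trial[b], trial[a]   # swap two samples' batches
        score = imbalance(trial, groups)
        if score < best_score:
            best, best_score = trial, score
    return best

samples = [f"s{i}" for i in range(12)]
groups = {s: ("case" if i < 5 else "control") for i, s in enumerate(samples)}
print(allocate(samples, groups, n_batches=3))
```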
Information about the composition of regulatory regions is of great value for designing experiments to functionally characterize gene expression. The multiplicity of available applications to predict transcription factor binding sites in a particular locus contrasts with the substantial computational expertise required to use them, which may constitute a barrier for the experimental community.
CBS (Conserved regulatory Binding Sites, http://compfly.bio.ub.es/CBS) is a public platform of evolutionarily conserved binding sites and enhancers predicted in multiple Drosophila genomes that is furnished with published chromatin signatures associated with transcriptionally active regions and other experimental sources of information. The rapid access to this novel body of knowledge through a user-friendly web interface enables non-expert users to identify the binding sequences available for any particular gene, transcription factor, or genome region.
The CBS platform is a powerful resource that provides tools for data mining individual sequences and groups of co-expressed genes with epigenomics information to conduct regulatory screenings in Drosophila.
Gene regulation; Genomics; Epigenomics; Comparative genomics; ChIP-seq
Cancer progression is associated with genomic instability and an accumulation of gains and losses of DNA. The growing variety of tools for measuring genomic copy numbers, including various types of array-CGH, SNP arrays and high-throughput sequencing, calls for a coherent framework offering unified and consistent handling of single- and multi-track segmentation problems. In addition, there is a demand for highly computationally efficient segmentation algorithms, due to the emergence of very high density scans of copy number.
A comprehensive Bioconductor package for copy number analysis is presented. The package offers a unified framework for single sample, multi-sample and multi-track segmentation and is based on statistically sound penalized least squares principles. Conditional on the number of breakpoints, the estimates are optimal in the least squares sense. A novel and computationally highly efficient algorithm is proposed that utilizes vector-based operations in R. Three case studies are presented.
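The penalized least-squares principle can be illustrated with a compact dynamic-programming sketch for a single track, minimising the within-segment sum of squares plus a fixed penalty per breakpoint. This is a conceptual example only, not the algorithm implemented in the copynumber package (which is written in R), and the penalty value is arbitrary.

```python
"""Penalised least-squares segmentation of a single copy-number track.

Exact dynamic programming over breakpoints: minimise the within-segment
sum of squares plus gamma per breakpoint. A conceptual sketch, not the
algorithm used by the copynumber package.
"""
import numpy as np

def segment(y, gamma=5.0):
    y = np.asarray(y, dtype=float)
    n = len(y)
    csum = np.concatenate(([0.0], np.cumsum(y)))
    csum2 = np.concatenate(([0.0], np.cumsum(y ** 2)))

    def sse(i, j):  # within-segment sum of squares for y[i:j]
        s, s2, m = csum[j] - csum[i], csum2[j] - csum2[i], j - i
        return s2 - s * s / m

    cost = np.full(n + 1, np.inf)
    cost[0] = -gamma                 # first segment incurs no penalty
    back = np.zeros(n + 1, dtype=int)
    for j in range(1, n + 1):
        for i in range(j):
            c = cost[i] + sse(i, j) + gamma
            if c < cost[j]:
                cost[j], back[j] = c, i
    # recover breakpoint positions (segment start indices)
    breaks, j = [], n
    while j > 0:
        breaks.append(back[j])
        j = back[j]
    return sorted(b for b in breaks if b > 0)

print(segment([0.1, 0.0, 0.2, 1.9, 2.1, 2.0, 0.1, -0.1], gamma=1.0))  # [3, 6]
```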
The R package copynumber is a software suite for segmentation of single- and multi-track copy number data using algorithms based on coherent least squares principles.
Copy number; aCGH; Segmentation; Allele-specific segmentation; Penalized regression; Least squares; Bioconductor
Next generation sequencing platforms are now well established in sequencing centres and some laboratories. Upcoming smaller-scale machines such as the 454 Junior from Roche or the MiSeq from Illumina will increase the number of laboratories hosting a sequencer. In such a context, it is important to provide these teams with an easily manageable environment to store and process the produced reads.
We describe a user-friendly information system able to manage large sets of sequencing data. It includes, on the one hand, a workflow environment already containing pipelines adapted to different input formats (sff, fasta, fastq and qseq), different sequencers (Roche 454, Illumina HiSeq) and various analyses (quality control, assembly, alignment, diversity studies, etc.) and, on the other hand, a secured web site giving access to the results. The connected user will be able to download raw and processed data and browse through the analysis result statistics. The provided workflows can easily be modified or extended and new ones can be added. Ergatis is used as a workflow building, running and monitoring system. The analyses can be run locally or in a cluster environment using Sun Grid Engine.
NG6 is a complete information system designed to answer the needs of a sequencing platform. It provides a user-friendly interface to process, store and download high-throughput sequencing data.
Interpreting in vivo sampled microarray data is often complicated by changes in the cell population demographics. To put gene expression into its proper biological context, it is necessary to distinguish differential gene transcription from artificial gene expression induced by changes in the cellular demographics.
CTen (cell type enrichment) is a web-based analytical tool which uses our highly expressed, cell specific (HECS) gene database to identify enriched cell types in heterogeneous microarray data. The web interface is designed for differential expression and gene clustering studies, and the enrichment results are presented as heatmaps or downloadable text files.
In this work, we use an independent, cell-specific gene expression data set to assess CTen's performance in accurately identifying the appropriate cell type and provide insight into the suggested level of enrichment to optimally minimize the number of false discoveries. We show that CTen, when applied to microarray data developed from infected lung tissue, can correctly identify the cell signatures of key lymphocytes in a highly heterogeneous environment and compare its performance to another popular bioinformatics tool. Furthermore, we discuss the strong implications cell type enrichment has in the design of effective microarray workflow strategies and show that, by combining CTen with gene expression clustering, we may be able to determine the relative changes in the number of key cell types.
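The underlying enrichment idea can be illustrated with a generic hypergeometric over-representation test of a differentially expressed gene list against a cell-specific signature; CTen's own scoring and the HECS signatures are not reproduced here.

```python
"""Over-representation of a cell-type signature in a gene list.

A generic hypergeometric test illustrating the enrichment idea above;
CTen's own scoring (and the HECS signatures) are not reproduced here.
"""
from scipy.stats import hypergeom

def enrichment_p(de_genes, signature_genes, background_size):
    de, sig = set(de_genes), set(signature_genes)
    overlap = len(de & sig)
    # P(X >= overlap) for X ~ Hypergeom(background, |signature|, |DE list|)
    return hypergeom.sf(overlap - 1, background_size, len(sig), len(de))

# toy example: 40 of 200 DE genes fall in a 500-gene signature,
# out of a 20,000-gene background
print(enrichment_p(range(200), range(160, 660), 20_000))
```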
CTen is available at http://www.influenza-x.org/~jshoemaker/cten/
Cell type enrichment; Microarray data; Deconvolution; Influenza; Systems immunology
Multi-locus sequence typing (MLST) has become the gold standard for population analyses of bacterial pathogens. This method focuses on the sequences of a small number of loci (usually seven) to divide the population and is simple, robust and facilitates comparison of results between laboratories and over time. Over the last decade, researchers and population health specialists have invested substantial effort in building up public MLST databases for nearly 100 different bacterial species, and these databases contain a wealth of important information linked to MLST sequence types such as time and place of isolation, host or niche, serotype and even clinical or drug resistance profiles. Recent advances in sequencing technology mean it is increasingly feasible to perform bacterial population analysis at the whole genome level. This offers massive gains in resolving power and genetic profiling compared to MLST, and will eventually replace MLST for bacterial typing and population analysis. However given the wealth of data currently available in MLST databases, it is crucial to maintain backwards compatibility with MLST schemes so that new genome analyses can be understood in their proper historical context.
We present a software tool, SRST, for quick and accurate retrieval of sequence types from short read sets, using inputs easily downloaded from public databases. SRST uses read mapping and an allele assignment score incorporating sequence coverage and variability, to determine the most likely allele at each MLST locus. Analysis of over 3,500 loci in more than 500 publicly accessible Illumina read sets showed SRST to be highly accurate at allele assignment. SRST output is compatible with common analysis tools such as eBURST, Clonal Frame or PhyloViz, allowing easy comparison between novel genome data and MLST data. Alignment, fastq and pileup files can also be generated for novel alleles.
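The allele-assignment and sequence-type lookup steps can be sketched as follows; the score shown (coverage breadth minus mismatch rate) and the profile-table layout are simplified illustrations, not SRST's actual scoring formula or the public MLST database format.

```python
"""Toy allele assignment and sequence-type lookup for MLST loci.

The score (coverage breadth minus mismatch rate) and the profile layout
are simplified illustrations, not SRST's scheme or the MLST format.
"""

def best_allele(summaries):
    """summaries: list of dicts with 'allele', 'fraction_covered' (0-1)
    and 'mismatch_rate' (0-1) for one locus."""
    return max(summaries,
               key=lambda s: s["fraction_covered"] - s["mismatch_rate"])["allele"]

def sequence_type(allele_calls, profiles):
    """allele_calls: {locus: allele number}; profiles maps a tuple of
    allele numbers (in a fixed locus order) to a sequence type label."""
    key = tuple(allele_calls[locus] for locus in sorted(allele_calls))
    return profiles.get(key, "novel allele combination")

calls = {
    "adk": best_allele([
        {"allele": 1, "fraction_covered": 0.98, "mismatch_rate": 0.02},
        {"allele": 3, "fraction_covered": 1.00, "mismatch_rate": 0.00},
    ]),
    "fumC": 4,
}
print(sequence_type(calls, {(3, 4): "ST-1"}))  # ST-1 (toy two-locus profile)
```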
SRST is a novel software tool for accurate assignment of sequence types using short read data. Several uses for the tool are demonstrated, including quality control for high-throughput sequencing projects, plasmid MLST and analysis of genomic data during outbreak investigation. SRST is open-source, requires Python, BWA and SamTools, and is available from http://srst.sourceforge.net.
MLST; Short read; Illumina; Sequence analysis; Plasmid; Chromosome; Microbiology; Bacteria; Population analysis; Outbreak
Large amounts of mammalian protein-protein interaction (PPI) data have been generated and are available for public use. From a systems biology perspective, protein/gene interactions encode the key mechanisms distinguishing disease from health, and such mechanisms can be uncovered through network analysis. An effective network analysis tool should integrate different content-specific PPI databases into a comprehensive network format, with a user-friendly platform to identify key functional modules/pathways and the underlying mechanisms of disease and toxicity.
atBioNet integrates seven publicly available PPI databases into a network-specific knowledge base. Knowledge expansion is achieved by expanding a user-supplied protein/gene list with interactions from its integrated PPI network. The statistically significant functional modules are determined by applying a fast network-clustering algorithm (SCAN: a Structural Clustering Algorithm for Networks). The functional modules can be visualized either separately or together in the context of the whole network. Integration of pathway information enables enrichment analysis and assessment of the biological function of modules. Three case studies are presented using publicly available disease gene signatures as a basis to discover new biomarkers for acute leukemia, systemic lupus erythematosus, and breast cancer. The results demonstrated that atBioNet can not only identify functional modules and pathways related to the studied diseases, but also that this information can be used to hypothesize novel biomarkers for future analysis.
atBioNet is a free web-based network analysis tool that provides a systematic insight into proteins/genes interactions through examining significant functional modules. The identified functional modules are useful for determining underlying mechanisms of disease and biomarker discovery. It can be accessed at: http://www.fda.gov/ScienceResearch/BioinformaticsTools/ucm285284.htm.
Protein-protein interaction; Network analysis; Functional module; Disease biomarker; KEGG pathway analysis; Visualization tool; Genomics
Measuring gene transcription using real-time reverse transcription polymerase chain reaction (RT-qPCR) technology is a mainstay of molecular biology. Technologies now exist to measure the abundance of many transcripts in parallel. The selection of the optimal reference gene for normalising these data is a recurring problem, and several algorithms have been developed to solve it. So far, no R package unites these methods with functions to read in and normalise the data using the chosen reference gene(s).
We have developed two R/Bioconductor packages, ReadqPCR and NormqPCR, intended for a user with some experience with high-throughput data analysis using R, who wishes to use R to analyse RT-qPCR data. We illustrate their potential use in a workflow analysing a generic RT-qPCR experiment, and apply this to a real dataset. Packages are available from http://www.bioconductor.org/packages/release/bioc/html/ReadqPCR.html and http://www.bioconductor.org/packages/release/bioc/html/NormqPCR.html
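The normalisation step these packages perform can be illustrated, in Python for consistency with the other sketches here, by the standard 2^-ΔΔCq calculation against a chosen reference gene; this does not reproduce the ReadqPCR/NormqPCR API, and the gene and sample names are hypothetical.

```python
"""Reference-gene normalisation of Cq values (2^-ddCq), in Python.

A conceptual sketch of the normalisation step described above; the
ReadqPCR/NormqPCR packages implement this in R/Bioconductor, and this
snippet does not reproduce their API.
"""

def delta_delta_cq(cq, reference, target, control_sample, test_sample):
    """cq: {sample: {gene: Cq value}}. Returns the fold change of
    `target` in `test_sample` relative to `control_sample`, normalised
    to the chosen `reference` gene."""
    d_control = cq[control_sample][target] - cq[control_sample][reference]
    d_test = cq[test_sample][target] - cq[test_sample][reference]
    return 2 ** -(d_test - d_control)

cq = {
    "untreated": {"GAPDH": 18.0, "MYC": 25.0},
    "treated":   {"GAPDH": 18.2, "MYC": 23.1},
}
print(delta_delta_cq(cq, "GAPDH", "MYC", "untreated", "treated"))  # ~4-fold up
```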
These packages increase the repertoire of RT-qPCR analysis tools available to the R user and allow them to (amongst other things) read their data into R, hold it in an ExpressionSet-compatible R object, choose appropriate reference genes, normalise the data and look for differential expression between samples.
Accurate prediction of DNA motifs that are targets of RNA polymerases, sigma factors and transcription factors (TFs) in prokaryotes is a difficult task, mainly due to as yet undiscovered features in DNA sequences or structures in promoter regions. Improved prediction and comparison algorithms are currently available for identifying transcription factor binding sites (TFBSs) and their accompanying TFs and regulon members.
We here extend the current databases of TFs, TFBSs and regulons with our knowledge on Lactococcus lactis and have developed a webserver for prediction, mining and visualization of prokaryote promoter elements and regulons via a novel concept. This new approach provides an all-in-one method of data mining for TFs, TFBSs, promoters, and regulons in any bacterial genome through a user-friendly webserver. We demonstrate the power of this method by mining WalRK regulons in Lactococci and Streptococci and, vice versa, by using L. lactis regulon data (CodY) to mine closely related species.
The PePPER webserver offers, besides the all-in-one analysis method, a toolbox for mining for regulons, promoters and TFBSs, and accommodates a new L. lactis regulon database in addition to already existing regulon data. Identification of putative regulons and full annotation of intergenic regions in any bacterial genome, on the basis of existing knowledge on a related organism, can now be performed by biologists for a wide range of regulons. On the basis of the PePPER output, biologists can design experiments to further verify the existence and extent of the proposed regulons. The PePPER webserver is freely accessible at http://pepper.molgenrug.nl.
Functional analyses of genomic data within the context of a priori biomolecular networks can give valuable mechanistic insights. However, such analyses are not a trivial task, owing to the complexity of biological networks and lack of computational methods for their effective integration with experimental data.
We developed a software application suite, NetWalker, as a one-stop platform featuring a number of novel holistic (i.e. assesses the whole data distribution without requiring data cutoffs) data integration and analysis methods for network-based comparative interpretations of genome-scale data. The central analysis components, NetWalk and FunWalk, are novel random walk-based network analysis methods that provide unique analysis capabilities to assess the entire data distributions together with network connectivity to prioritize molecular and functional networks, respectively, most highlighted in the supplied data. Extensive inter-operability between the analysis components and with external applications, including R, adds to the flexibility of data analyses. Here, we present a detailed computational analysis of our microarray gene expression data from MCF7 cells treated with lethal and sublethal doses of doxorubicin.
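The random-walk idea behind the analysis components can be illustrated with a small random-walk-with-restart sketch that propagates data-derived seed scores over network connectivity; this is a generic illustration, not the NetWalk/FunWalk implementation, and the toy network is hypothetical.

```python
"""Random walk with restart on a small molecular network.

Illustrates propagating node scores (e.g. expression changes) over
network connectivity; this is not the NetWalker algorithm itself, and
the toy network is hypothetical.
"""
import numpy as np

def random_walk_with_restart(adjacency, seed_scores, restart=0.3, tol=1e-8):
    A = np.asarray(adjacency, dtype=float)
    W = A / A.sum(axis=0, keepdims=True)     # column-normalised transitions
    p0 = np.asarray(seed_scores, dtype=float)
    p0 = p0 / p0.sum()
    p = p0.copy()
    while True:
        p_next = (1 - restart) * W @ p + restart * p0
        if np.abs(p_next - p).sum() < tol:
            return p_next                    # steady-state node priorities
        p = p_next

# toy 4-node network: node 0 carries the strongest data signal
adjacency = [[0, 1, 1, 0],
             [1, 0, 1, 0],
             [1, 1, 0, 1],
             [0, 0, 1, 0]]
print(random_walk_with_restart(adjacency, seed_scores=[1.0, 0.1, 0.1, 0.1]))
```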
NetWalker, a detailed step-by-step tutorial containing the analyses presented in this paper and a manual are available at the web site http://netwalkersuite.org.
Biological networks; NetWalker; NetWalk; Network analyses
Continued sequencing efforts coupled with advances in sequencing technology will lead to the completion of a vast number of small genomes. Whole-genome comparisons represent an important part of the analysis of any new genome sequence, as they can provide a better understanding of the biology and evolution of the source organism. Visualization of the results is important, as it allows information from a variety of sources to be integrated and interpreted. However, existing graphical comparison tools lack features needed for efficiently comparing a new genome to hundreds or thousands of existing sequences. Moreover, existing tools are limited in terms of the types of comparisons that can be performed, the extent to which the output can be customized, and the ease with which the entire process can be automated.
The CGView Comparison Tool (CCT) is a package for visually comparing bacterial, plasmid, chloroplast, or mitochondrial sequences of interest to existing genomes or sequence collections. The comparisons are conducted using BLAST, and the BLAST results are presented in the form of graphical maps that can also show sequence features, gene and protein names, COG (Clusters of Orthologous Groups of proteins) category assignments, and sequence composition characteristics. CCT can generate maps in a variety of sizes, including 400 Megapixel maps suitable for posters. Comparisons can be conducted within a particular species or genus, or all available genomes can be used. The entire map creation process, from downloading sequences to redrawing zoomed maps, can be completed easily using scripts included with the CCT. User-defined features or analysis results can be included on maps, and maps can be extensively customized. To simplify program setup, a CCT virtual machine that includes all dependencies preinstalled is available. Detailed tutorials illustrating the use of CCT are included with the CCT documentation.
CCT can be used to visually compare a reference sequence to thousands of existing genomes or sequence collections (next-generation sequencing reads for example) on a standard desktop computer. It provides analysis and visualization functionality not available in any existing circular genome visualization tool. By visually presenting sequence conservation information along with functional classifications and sequence composition characteristics, CCT can be a useful tool for identifying rapidly evolving or novel sequences, horizontally transferred sequences, or unusual functional properties in newly sequenced genomes. CCT is freely available for download at http://stothard.afns.ualberta.ca/downloads/CCT/.