The Illumina sequencing platform is widely used in genome research. Quality assessment and control of sequence reads are required before downstream analysis. However, software that provides efficient quality assessment together with versatile filtration methods is still lacking.
We have developed a toolkit named HTQC – abbreviation of High-Throughput Quality Control – for sequence reads quality control, which consists of six programs for reads quality assessment, reads filtration and generation of graphic reports.
The HTQC toolkit generates reads quality assessments faster than existing tools and provides guidance for read filtration; its filtration utilities allow users to choose among different strategies for removing low-quality reads.
The increasing volume of ChIP-chip and ChIP-seq data being generated creates a challenge for standard, integrative and reproducible bioinformatics data analysis platforms. We developed a web-based application called Cistrome, based on the Galaxy open source framework. In addition to the standard Galaxy functions, Cistrome has 29 ChIP-chip- and ChIP-seq-specific tools in three major categories, from preliminary peak calling and correlation analyses to downstream genome feature association, gene expression analyses, and motif discovery. Cistrome is available at http://cistrome.org/ap/.
The rapid development of next-generation sequencing (NGS) technology provides a novel avenue for genomic exploration and research. Single nucleotide variants (SNVs) inferred from NGS data are expected to reveal gene mutations in cancer. However, NGS yields lower sequence coverage and poorer SNV detection capability in the regulatory regions of the genome. Posterior-probability-based methods are efficient for SNV detection in high-coverage regions or sequencing data with high depth; for data with low sequencing depth, however, their efficiency remains poor and needs to be improved.
A new tool, SNVHMM, based on a discrete hidden Markov model (HMM), was developed to infer the genotype at each position of the genome. We incorporated the mapping quality of each read and the corresponding base quality into the emission probability of the HMM. The context information of the whole observation sequence, as well as its confidence, is fully utilized to infer the genotype at each position. More statistical power can therefore be gained over Bayes-based methods, which is very useful for SNV detection in data with low sequencing depth. Our model was verified on two datasets each of lobular breast tumor and myelodysplastic syndromes (MDS). Compared with a recently published SNV calling algorithm, SNVMix2, our model substantially outperformed SNVMix2 when the sequencing depth is low, and also outperformed it when SNVMix2 was well trained on large datasets.
SNVHMM can detect SNVs from NGS cancer data efficiently even when the sequencing depth is very low, and the training data size required for SNVHMM to work can be very small. SNVHMM incorporates the base quality and mapping quality of all observed bases and reads, and also lets users choose the confidence of the observations used for SNV prediction.
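The genotype-decoding idea behind SNVHMM, folding per-read base and mapping qualities into the HMM's emission probabilities and then decoding the most likely genotype path, can be sketched in miniature. This is a toy illustration under simplifying assumptions (three diploid genotype states, a naive quality-to-error conversion, plain Viterbi decoding), not SNVHMM's actual model:

```python
import math

# Toy genotype states: homozygous-reference (AA), heterozygous (AB),
# homozygous-alternate (BB); P_B is the chance of drawing the alt allele.
STATES = ["AA", "AB", "BB"]
P_B = {"AA": 0.0, "AB": 0.5, "BB": 1.0}

def emission_logprob(state, pileup):
    """Log-probability of a pileup of (allele, base_qual, map_qual) tuples.
    Each read is weighted by an error chance derived from its Phred base
    and mapping qualities (a simplifying assumption, not SNVHMM's model)."""
    logp = 0.0
    for allele, bq, mq in pileup:
        err = min(10 ** (-bq / 10) + 10 ** (-mq / 10), 0.5)
        p_alt = P_B[state] * (1 - err) + (1 - P_B[state]) * err
        logp += math.log(p_alt if allele == "B" else 1.0 - p_alt)
    return logp

def viterbi(pileups, stay=0.9):
    """Decode the most likely genotype at each position."""
    log_stay, log_move = math.log(stay), math.log((1 - stay) / 2)
    scores = [emission_logprob(s, pileups[0]) for s in STATES]
    back = []
    for pile in pileups[1:]:
        new, ptr = [], []
        for j, s in enumerate(STATES):
            best_i = max(range(3),
                         key=lambda i: scores[i] + (log_stay if i == j else log_move))
            ptr.append(best_i)
            new.append(scores[best_i] + (log_stay if best_i == j else log_move)
                       + emission_logprob(s, pile))
        scores = new
        back.append(ptr)
    # Trace back the best path.
    j = max(range(3), key=lambda i: scores[i])
    path = [j]
    for ptr in reversed(back):
        j = ptr[j]
        path.append(j)
    return [STATES[i] for i in reversed(path)]

# Three positions: clear reference, mixed evidence, clear alternate.
pileups = [
    [("A", 30, 40)] * 6,
    [("A", 30, 40)] * 3 + [("B", 30, 40)] * 3,
    [("B", 30, 40)] * 6,
]
genotypes = viterbi(pileups)  # -> ['AA', 'AB', 'BB']
```

Low-quality reads contribute flatter emission terms, so noisy evidence sways the decoded genotype less than confident evidence does.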
Gene fusions, which result from abnormal chromosome rearrangements, are a pathogenic factor in cancer development. The emerging RNA-Seq technology enables us to detect gene fusions and profile their features.
In this paper, we propose a novel fusion detection tool, FusionQ, based on paired-end RNA-Seq data. The tool can detect gene fusions, construct the structures of chimeric transcripts, and estimate their abundances. To confirm the read alignment on both sides of a fusion point, we employ a new approach, “residual sequence extension”, which extends the short segments of the reads by aggregating their overlapping reads. We also propose a list of filters to control the false-positive rate. In addition, we estimate fusion abundance using the Expectation-Maximization algorithm with sparse optimization, and further adopt it to improve the detection accuracy of the fusion transcripts. We performed simulation studies comparing FusionQ with two other state-of-the-art fusion detection tools; FusionQ exceeded both in sensitivity and specificity, especially in low-coverage fusion detection. Using paired-end RNA-Seq data from breast cancer cell lines, FusionQ detected both previously reported and new fusions, reported their structures, and provided their expression levels. Some highly expressed fusion genes detected by FusionQ are important biomarkers in breast cancer. On the cancer cell line data, FusionQ again showed better specificity and sensitivity than the other two tools.
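The abundance-estimation step can be illustrated with a bare-bones EM iteration for reads that map ambiguously to several transcripts. This is a simplified sketch: FusionQ's actual estimator adds sparse optimization and fusion-specific handling.

```python
# Toy EM for transcript abundance from ambiguously mapped reads
# (illustrative only; not FusionQ's full model).

def em_abundance(compat, n_transcripts, iters=200):
    """compat: list of sets; each read lists the transcripts it maps to."""
    theta = [1.0 / n_transcripts] * n_transcripts
    for _ in range(iters):
        counts = [0.0] * n_transcripts
        for read in compat:
            z = sum(theta[t] for t in read)
            for t in read:                      # E-step: fractional assignment
                counts[t] += theta[t] / z
        total = sum(counts)
        theta = [c / total for c in counts]     # M-step: renormalize
    return theta

# Reads: 6 map uniquely to transcript 0, 2 uniquely to transcript 1,
# and 4 ambiguously to both.
reads = [{0}] * 6 + [{1}] * 2 + [{0, 1}] * 4
theta = em_abundance(reads, 2)  # converges to [0.75, 0.25]
```

At the fixed point the ambiguous reads are split in proportion to the unique evidence, which is why transcript 0 ends up with three quarters of the mass here.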
FusionQ is a novel tool for fusion detection and quantification based on RNA-Seq data. It has both good specificity and sensitivity performance. FusionQ is free and available at http://www.wakehealth.edu/CTSB/Software/Software.htm.
Fusion detection; Chimeric transcript quantification; EM algorithm
The chloroplast is an essential organelle in plants that contains an independent genome. Chloroplast genomes have recently been widely used for plant phylogenetic inference, and the number of complete chloroplast genomes is increasing rapidly with the development of various genome sequencing projects. However, no comprehensive platform or tool had been developed for the comparative and phylogenetic analysis of chloroplast genomes. We therefore constructed a comprehensive platform for the comparative and phylogenetic analysis of complete chloroplast genomes, named the chloroplast genome analysis platform (CGAP).
CGAP is an interactive web-based platform designed for the comparative analysis of complete chloroplast genomes. It integrates genome collection, visualization, content comparison, phylogenetic analysis and annotation functions. CGAP implements four web services: creating publication-quality complete and regional genome maps, comparing genome features, constructing phylogenetic trees from complete genome sequences, and annotating draft chloroplast genomes submitted by users.
Both CGAP and its source code are available at http://www.herbbol.org:8000/chloroplast. CGAP will facilitate the collection, visualization, comparison and annotation of complete chloroplast genomes. Users can customize the comparative and phylogenetic analyses using their own unpublished chloroplast genomes.
Chloroplast genomes; Comparative and phylogenetic analysis; Web-based platform
Microarrays have been a popular tool for gene expression profiling at genome scale for over a decade due to their low cost, short turn-around time, excellent quantitative accuracy and ease of data generation. The Bioconductor package puma incorporates a suite of analysis methods for determining uncertainties from Affymetrix GeneChip data and propagating these uncertainties into downstream analysis. As isoform-level expression profiling has received increasing interest within genomics in recent years, exon microarray technology offers an important tool to quantify the expression level of the majority of exons and enables the measurement of isoform-level expression. However, puma did not include methods for the analysis of exon array data. Moreover, the current expression summarisation method for Affymetrix 3’ GeneChip data suffers from instability for genes with low expression. In the downstream analysis, the method for differential expression detection is computationally intensive, and the original expression clustering method does not consider the variance across replicated technical and biological measurements. It is therefore necessary to develop improved uncertainty propagation methods for gene and transcript expression analysis.
We extend the previously developed Bioconductor package puma with a new method especially designed for GeneChip Exon arrays and a set of improved downstream approaches. The improvements include: (i) a new gamma model for exon arrays which calculates isoform and gene expression measurements and a level of uncertainty associated with the estimates, using the multi-mappings between probes, isoforms and genes, (ii) a variant of the existing approach for the probe-level analysis of Affymetrix 3’ GeneChip data to produce more stable gene expression estimates, (iii) an improved method for detecting differential expression which is computationally more efficient than the existing approach in the package and (iv) an improved method for robust model-based clustering of gene expression, which takes technical and biological replicate information into consideration.
With the extensions and improvements, the puma package is now applicable to the analysis of both Affymetrix 3’ GeneChips and Exon arrays for gene and isoform expression estimation. It propagates the uncertainty of expression measurements into more efficient and comprehensive downstream analysis at both gene and isoform level. Downstream methods are also applicable to other expression quantification platforms, such as RNA-Seq, when uncertainty information is available from expression measurements. puma is available through Bioconductor and can be found at http://www.bioconductor.org.
The estimation of genetic ancestry in human populations has important applications in medical genetic studies. Genetic ancestry is used to control for population stratification in genetic association studies and to understand the genetic basis of ethnic differences in disease susceptibility. In this review, we present an overview of genetic ancestry estimation in human disease studies, followed by a review of popular software tools and methods used for this estimation.
Ancestry; Genetic; Polymorphism; Structure
The complete sequences of chloroplast genomes provide a wealth of information regarding the evolutionary history of species. With the advance of next-generation sequencing technology, the number of completely sequenced chloroplast genomes is expected to increase exponentially; powerful computational tools for annotating these genome sequences are therefore urgently needed.
We have developed a web server, CPGAVAS. The server accepts a complete chloroplast genome sequence as input. First, it predicts protein-coding and rRNA genes based on the identification and mapping of the most similar full-length protein, cDNA and rRNA sequences, integrating results from the Blastx, Blastn, protein2genome and est2genome programs. Second, tRNA genes and inverted repeats (IR) are identified using tRNAscan, ARAGORN and vmatch, respectively. Third, it calculates summary statistics for the annotated genome. Fourth, it generates a circular map ready for publication. Fifth, it can create a Sequin file for GenBank submission. Last, it allows the extraction of protein and mRNA sequences for a given list of genes and species. The annotation results in GFF3 format can be edited using any compatible annotation editing tool, and the edited annotations can be uploaded to CPGAVAS repeatedly for update and re-analysis. Using known chloroplast genome sequences as a test set, we show that CPGAVAS performs comparably to another application, DOGMA, while offering several superior functionalities.
CPGAVAS allows the semi-automatic and complete annotation of a chloroplast genome sequence, and the visualization, editing and analysis of the annotation results. It will become an indispensable tool for researchers studying chloroplast genomes. The software is freely accessible from
Chloroplast genome; Annotation; Web server; CPGAVAS
Protein-coding regions in human genes harbor 85% of the mutations associated with disease-related traits. Compared with whole-genome sequencing of complex samples, exome sequencing serves as an alternative option because of its dramatically reduced cost. In fact, exome sequencing has been successfully applied to identify the cause of several Mendelian disorders, such as Miller syndrome and Schinzel-Giedion syndrome. However, there remain great challenges in handling the huge volumes of data generated by exome sequencing and in identifying potential disease-related genetic variations.
In this study, Exome-assistant (http://188.8.131.52/exomeassistant), a convenient tool for submitting and annotating single nucleotide polymorphisms (SNPs) and insertion/deletion variations (InDels), was developed to rapidly detect candidate disease-related genetic variations from exome sequencing projects. Versatile filter criteria are provided by Exome-assistant to meet different users’ requirements. Exome-assistant consists of four modules: the single case module, the two cases module, the multiple cases module, and the reanalysis module. The two cases and multiple cases modules allow users to identify sample-specific and common variations. The multiple cases module also supports family-based studies and Mendelian filtering. The identified candidate disease-related genetic variations can be annotated according to their sample features.
In summary, by exploring exome sequencing data, Exome-assistant can provide researchers with detailed biological insights into genetic variation events and permits the identification of potential genetic causes of human diseases and related traits.
Next generation sequencing; Mendelian disease; Single nucleotide polymorphisms; Insertions and deletions; Variation filtering; Minor allele frequency
Batch effects are one type of variability that is not of primary interest but is ubiquitous in sizable genomic experiments. To minimize their impact, an ideal experimental design should ensure the even distribution of biological groups and confounding factors across batches. However, due to practical complications, the final collection of samples in a genomics study may be unbalanced and incomplete, which, without appropriate attention to sample-to-batch allocation, can lead to drastic batch effects. It is therefore necessary to develop an effective and handy tool to assign collected samples across batches so as to minimize the impact of batch effects.
We describe OSAT (Optimal Sample Assignment Tool), a Bioconductor package designed for automated sample-to-batch allocation in genomics experiments.
OSAT facilitates the allocation of collected samples to different batches in genomic studies. By optimizing the even distribution of samples from groups of biological interest across batches, it reduces confounding or correlation between batches and the biological variables of interest. It can also optimize the homogeneous distribution of confounding factors across batches. OSAT handles challenging instances involving incomplete and unbalanced sample collections, as well as ideally balanced designs.
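The optimization idea can be sketched with a simple randomized search that scores candidate assignments by how evenly each biological group spreads across batches. This is an illustrative stand-in only; OSAT's actual objective function and search procedure differ.

```python
import random
from collections import Counter

def imbalance(assignment, groups, n_batches):
    """Sum of squared deviations of each group's per-batch count
    from its expected (perfectly even) share."""
    score = 0.0
    for g in set(groups):
        idx = [i for i, gg in enumerate(groups) if gg == g]
        expected = len(idx) / n_batches
        per_batch = Counter(assignment[i] for i in idx)
        score += sum((per_batch.get(b, 0) - expected) ** 2
                     for b in range(n_batches))
    return score

def allocate(groups, n_batches, n_trials=2000, seed=1):
    """Randomized search for a sample-to-batch assignment that spreads
    each biological group as evenly as possible across batches."""
    rng = random.Random(seed)
    base = [i % n_batches for i in range(len(groups))]  # even batch sizes
    best, best_score = list(base), float("inf")
    for _ in range(n_trials):
        rng.shuffle(base)
        score = imbalance(base, groups, n_batches)
        if score < best_score:
            best, best_score = list(base), score
    return best, best_score

# 12 samples in two unbalanced groups, allocated to 3 batches of 4.
groups = ["case"] * 8 + ["control"] * 4
assignment, score = allocate(groups, 3)
```

With 8 cases and 4 controls over 3 batches, the best achievable split is cases (3, 3, 2) and controls (1, 1, 2), and the search reliably finds an assignment with that minimal imbalance.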
Large amounts of mammalian protein-protein interaction (PPI) data have been generated and are available for public use. From a systems biology perspective, protein/gene interactions encode the key mechanisms distinguishing disease from health, and such mechanisms can be uncovered through network analysis. An effective network analysis tool should integrate different content-specific PPI databases into a comprehensive network format, with a user-friendly platform to identify key functional modules/pathways and the underlying mechanisms of disease and toxicity.
atBioNet integrates seven publicly available PPI databases into a network-specific knowledge base. Knowledge expansion is achieved by expanding a user-supplied protein/gene list with interactions from the integrated PPI network. Statistically significant functional modules are determined by applying a fast network-clustering algorithm (SCAN: a Structural Clustering Algorithm for Networks). The functional modules can be visualized either separately or together in the context of the whole network. Integration of pathway information enables enrichment analysis and assessment of the biological function of modules. Three case studies are presented using publicly available disease gene signatures as a basis to discover new biomarkers for acute leukemia, systemic lupus erythematosus, and breast cancer. The results demonstrate that atBioNet can not only identify functional modules and pathways related to the studied diseases, but also suggest novel biomarkers for future analysis.
atBioNet is a free web-based network analysis tool that provides a systematic insight into proteins/genes interactions through examining significant functional modules. The identified functional modules are useful for determining underlying mechanisms of disease and biomarker discovery. It can be accessed at: http://www.fda.gov/ScienceResearch/BioinformaticsTools/ucm285284.htm.
Protein-protein interaction; Network analysis; Functional module; Disease biomarker; KEGG pathway analysis; Visualization tool; Genomics
Many molecules of interest are flexible and undergo significant shape deformation as part of their function, but most existing methods of molecular shape comparison treat them as rigid shapes, which may lead to incorrect measures of the shape similarity of flexible molecules. Moreover, effort on retrieval and navigation for flexible molecular shape comparison remains limited, although such capabilities would improve data retrieval by helping users locate the desired molecule conveniently.
To address this issue, we developed a web-based retrieval and navigation tool, 3DMolNavi, for flexible molecular shape comparison. The tool uses the histogram of the Inner Distance Shape Signature (IDSS) for fast retrieval of molecules similar to a query molecule, and applies dimensionality reduction to navigate the retrieved results in 2D and 3D spaces. We tested 3DMolNavi on the Database of Macromolecular Movements (MolMovDB) and CATH. Compared to other shape descriptors, it achieves good performance and retrieval results for different classes of flexible molecules.
The advantages of 3DMolNavi over other existing software are its integration of retrieval for flexible molecular shape comparison and its enhanced navigation for user interaction. 3DMolNavi can be accessed via https://engineering.purdue.edu/PRECISE/3dmolnavi/index.html.
Recently, with the maturation of bovine genome sequencing and high-throughput SNP genotyping technologies, a large number of significant SNPs associated with economically important traits can be identified by genome-wide association studies (GWAS). To confirm true associations found in GWAS, the common strategy is to sift out the most promising SNPs for follow-up replication studies. It is thus crucial to explore the functional significance of candidate SNPs in order to screen and select potentially functional ones. To systematically prioritize these statistically significant SNPs and facilitate follow-up replication studies, we developed a web-based bovine SNP annotation tool (Snat).
With Snat, various sources of genomic information are integrated and retrieved from several leading online databases, including SNP information from dbSNP, gene information from Entrez Gene, protein features from UniProt, linkage information from AnimalQTLdb, conserved elements from the UCSC Genome Browser Database and gene functions from Gene Ontology (GO), KEGG PATHWAY and Online Mendelian Inheritance in Animals (OMIA). Snat provides two applications, a CGI-based web utility and a command-line version, to access the integrated database, target any single-nucleotide locus of interest and perform multi-level functional annotation. To validate the practical significance of our study, SNPs on two commercial bovine SNP chips, the Affymetrix Bovine 10K chip and the Illumina 50K chip, have been annotated by Snat, and the corresponding outputs can be downloaded directly from the Snat website. Furthermore, a real dataset of 20 SNPs associated with milk yield identified in our recent GWAS was employed to demonstrate the practical utility of Snat.
To the best of our knowledge, Snat is one of the first tools focusing on SNP annotation for livestock. Snat provides researchers with a convenient and powerful platform to aid functional analysis and accurate evaluation of genes/variants related to SNPs, and facilitates follow-up replication studies in the post-GWAS era.
One of the most promising aspects of metabolomics is metabolic modeling and simulation. Central to such applications is automated high-throughput identification and quantification of metabolites. NMR spectroscopy is a reproducible, nondestructive, and nonselective method that has served as the foundation of metabolomics studies. However, the automated high-throughput identification and quantification of metabolites in NMR spectroscopy is limited by severe spectral overlap. Although numerous software programs have been developed for resolving overlapping resonances, as well as for identifying and quantifying metabolites, most of these programs are frequency-domain methods, considerably influenced by phase shifts and baseline distortions, and effective only in small-scale studies. Almost all these programs require multiple spectra for each application, and do not automatically identify and quantify metabolites in batches.
We created IQMNMR, an R package that integrates a relaxation algorithm, digital filter, and similarity search algorithm. It differs from existing software in several ways: it is a time-domain method; it uses not only frequency but also relaxation time constants to resolve overlapping resonances; it requires only one NMR spectrum per application; it is uninfluenced by phase shifts and baseline distortions; and, most important, it yields a batch of quantified metabolites.
IQMNMR provides a solution that can automatically identify and quantify metabolites by one-dimensional proton NMR spectroscopy. Its time-domain nature, stability against phase shifts and baseline distortions, requirement for only one NMR spectrum, and capability to output a batch of quantified metabolites are of considerable significance to metabolic modeling and simulation.
IQMNMR is available at http://cran.r-project.org/web/packages/IQMNMR/.
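The time-domain idea can be illustrated in miniature: treat the free-induction decay (FID) as a sum of exponentially damped complex sinusoids, and recover the component amplitudes (metabolite quantities) by linear least squares once the frequencies and relaxation constants are known. This is a hypothetical two-component sketch, not IQMNMR's relaxation algorithm:

```python
import cmath

def fid(amps, freqs, decays, n=256, dt=0.01):
    """Synthesize a noiseless FID: a sum of damped complex sinusoids."""
    return [sum(a * cmath.exp((-d + 2j * cmath.pi * f) * k * dt)
                for a, f, d in zip(amps, freqs, decays))
            for k in range(n)]

def fit_amplitudes(signal, freqs, decays, dt=0.01):
    """Solve the normal equations A^H A x = A^H y for the amplitudes
    (restricted to two components so a 2x2 Cramer solve suffices)."""
    n, m = len(signal), len(freqs)
    basis = [[cmath.exp((-decays[j] + 2j * cmath.pi * freqs[j]) * k * dt)
              for j in range(m)] for k in range(n)]
    ata = [[sum(basis[k][i].conjugate() * basis[k][j] for k in range(n))
            for j in range(m)] for i in range(m)]
    aty = [sum(basis[k][i].conjugate() * signal[k] for k in range(n))
           for i in range(m)]
    det = ata[0][0] * ata[1][1] - ata[0][1] * ata[1][0]
    x0 = (aty[0] * ata[1][1] - ata[0][1] * aty[1]) / det
    x1 = (ata[0][0] * aty[1] - aty[0] * ata[1][0]) / det
    return [x0, x1]

# Two hypothetical metabolite resonances with known frequency (Hz) and
# relaxation rate (1/s); amplitudes are what we want to quantify.
signal = fid([2.0, 0.5], freqs=[5.0, 12.0], decays=[3.0, 8.0])
amps = fit_amplitudes(signal, [5.0, 12.0], [3.0, 8.0])
```

Because the fit happens directly on the time-domain samples, no Fourier transform, phasing, or baseline correction is involved, which is the property the abstract emphasizes.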
Contemporary informatics and genomics research require efficient, flexible and robust management of large heterogeneous data, advanced computational tools, powerful visualization, reliable hardware infrastructure, interoperability of computational resources, and detailed data and analysis-protocol provenance. The Pipeline is a client-server distributed computational environment that facilitates the visual graphical construction, execution, monitoring, validation and dissemination of advanced data analysis protocols.
This paper reports on the applications of the LONI Pipeline environment to address two informatics challenges - graphical management of diverse genomics tools, and the interoperability of informatics software. Specifically, this manuscript presents the concrete details of deploying general informatics suites and individual software tools to new hardware infrastructures, the design, validation and execution of new visual analysis protocols via the Pipeline graphical interface, and integration of diverse informatics tools via the Pipeline eXtensible Markup Language syntax. We demonstrate each of these processes using several established informatics packages (e.g., miBLAST, EMBOSS, mrFAST, GWASS, MAQ, SAMtools, Bowtie) for basic local sequence alignment and search, molecular biology data analysis, and genome-wide association studies. These examples demonstrate the power of the Pipeline graphical workflow environment to enable integration of bioinformatics resources which provide a well-defined syntax for dynamic specification of the input/output parameters and the run-time execution controls.
The LONI Pipeline environment http://pipeline.loni.ucla.edu provides a flexible graphical infrastructure for efficient biomedical computing and distributed informatics research. The interactive Pipeline resource manager enables the utilization and interoperability of diverse types of informatics resources. The Pipeline client-server model provides computational power to a broad spectrum of informatics investigators - experienced developers and novice users, users with or without access to advanced computational resources (e.g., Grid, data), as well as basic and translational scientists. The open development, validation and dissemination of computational networks (pipeline workflows) facilitates the sharing of knowledge, tools, protocols and best practices, and enables the unbiased validation and replication of scientific findings by the entire community.
The construction of the Disease Ontology (DO) has helped promote the investigation of diseases and disease risk factors. DO enables researchers to analyse disease similarity using semantic similarity measures, and has expanded our understanding of the relationships between different diseases and helped classify them. At the same time, similarities between genes can be analysed through their associations with similar diseases. As a result, disease heterogeneity is better understood and insights into the molecular pathogenesis of similar diseases have been gained. However, bioinformatics tools that provide easy and straightforward ways to use DO to study disease and gene similarity simultaneously are still required.
We have developed an R-based software package (DOSim) to compute the similarity between diseases and to measure the similarity between human genes in terms of diseases. DOSim incorporates a DO-based enrichment analysis function that can be used to explore the disease features of an independent gene set. A multilayered enrichment analysis (GO and KEGG) annotation function that helps users explore the biological meaning implied in a newly detected gene module is also part of the DOSim package. We used the disease similarity application to demonstrate the relationships between 128 different DO cancer terms; hierarchical clustering of these 128 cancers showed modular characteristics. In another case study, we used the gene similarity application on 361 obesity-related genes. The results revealed the complex pathogenesis of obesity. In addition, the gene module detection and gene module multilayered annotation functions in DOSim, when applied to these 361 obesity-related genes, helped extend our understanding of the complex pathogenesis of obesity risk phenotypes and the heterogeneity of obesity-related diseases.
DOSim can be used to detect disease-driven gene modules and to annotate the modules for functions and pathways. The DOSim package can also be used to visualise the DO structure. DOSim reflects the modular characteristics of disease-related genes and promotes our understanding of the complex pathogenesis of diseases. DOSim is available on the Comprehensive R Archive Network (CRAN) or at http://bioinfo.hrbmu.edu.cn/dosim.
MicroRNAs (miRNAs) are a family of ~22 nt small RNAs that can regulate gene expression at the post-transcriptional level. Identification of these molecules and their targets can aid understanding of regulatory processes. Recently, high-throughput sequencing (HTS) has become a common identification method, but the technique has two major limitations. First, it has low efficiency, with typically fewer than 1 in 10,000 sequences representing miRNA reads; second, it preferentially targets highly expressed miRNAs. If sequences are available, computational methods can provide a screening step to assess the value of an HTS study and aid interpretation of results. However, current methods can only predict miRNAs in short fragments and have usually been trained on small datasets that do not always reflect the diversity of these molecules.
We have developed a software tool, miRPara, that predicts the most probable mature miRNA coding regions from genome-scale sequences in a species-specific manner. We classified sequences from miRBase into animal, plant and overall categories and used a support vector machine (SVM) to train three models based on an initial set of 77 parameters related to the physical properties of the pre-miRNA and its miRNAs. By applying parameter filtering we found that a subset of ~25 parameters produced higher prediction accuracy than the full set. Our software achieves an accuracy of up to 80% against experimentally verified mature miRNAs, making it one of the most accurate methods available.
miRPara is an effective tool for locating miRNA coding regions in genome sequences and can be used as a screening step prior to HTS experiments. It is available at http://www.whiov.ac.cn/bioinformatics/mirpara.
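Classifying candidate hairpins from a handful of physical-property features can be sketched with a minimal linear classifier. miRPara itself trains SVM models on its selected parameters; the perceptron below is only a self-contained stand-in, and the two features and their values are invented for illustration:

```python
# Minimal linear classifier (perceptron) standing in for the SVM step
# miRPara uses; the features here -- a normalized hairpin free-energy
# score and GC content -- are hypothetical toy values, not miRPara's
# real parameter set.

def train_perceptron(xs, ys, max_epochs=100):
    """Perceptron training; converges on linearly separable data."""
    w, b = [0.0] * len(xs[0]), 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for x, y in zip(xs, ys):
            if y * (sum(wj * xj for wj, xj in zip(w, x)) + b) <= 0:
                w = [wj + y * xj for wj, xj in zip(w, x)]  # update on mistake
                b += y
                mistakes += 1
        if mistakes == 0:
            break
    return w, b

def predict(w, b, x):
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b > 0 else -1

pos = [(0.8, 0.6), (0.9, 0.5), (0.7, 0.7)]   # real pre-miRNAs   -> +1
neg = [(0.2, 0.3), (0.1, 0.4), (0.3, 0.2)]   # pseudo-hairpins   -> -1
xs, ys = pos + neg, [1] * 3 + [-1] * 3
w, b = train_perceptron(xs, ys)
```

An SVM would additionally maximize the margin between the two classes, which matters once the real, noisy 25-parameter feature space is used.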
Next-generation sequencing technologies have led to the high-throughput production of sequence data (reads) at low cost. However, these reads are significantly shorter and more error-prone than conventional Sanger shotgun reads. This poses a challenge for the de novo assembly in terms of assembly quality and scalability for large-scale short read datasets.
We present DecGPU, the first parallel and distributed error correction algorithm for high-throughput short reads (HTSRs) using a hybrid combination of CUDA and MPI parallel programming models. DecGPU provides CPU-based and GPU-based versions, where the CPU-based version employs coarse-grained and fine-grained parallelism using the MPI and OpenMP parallel programming models, and the GPU-based version takes advantage of the CUDA and MPI parallel programming models and employs a hybrid CPU+GPU computing model to maximize the performance by overlapping the CPU and GPU computation. The distributed feature of our algorithm makes it feasible and flexible for the error correction of large-scale HTSR datasets. Using simulated and real datasets, our algorithm demonstrates superior performance, in terms of error correction quality and execution speed, to the existing error correction algorithms. Furthermore, when combined with Velvet and ABySS, the resulting DecGPU-Velvet and DecGPU-ABySS assemblers demonstrate the potential of our algorithm to improve de novo assembly quality for de-Bruijn-graph-based assemblers.
DecGPU is publicly available open-source software, written in CUDA C++ and MPI. The experimental results suggest that DecGPU is an effective and feasible error correction algorithm to tackle the flood of short reads produced by next-generation sequencing technologies.
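The core of k-mer-spectrum error correction, flagging bases whose covering k-mers are all rare and substituting the base with the strongest support from "solid" k-mers, can be sketched serially. DecGPU's contribution is doing this at scale with hybrid CUDA+MPI parallelism; this toy version is single-threaded and heavily simplified:

```python
from collections import Counter

def kmer_counts(reads, k):
    """Count every k-mer across all reads (the k-mer spectrum)."""
    c = Counter()
    for r in reads:
        for i in range(len(r) - k + 1):
            c[r[i:i + k]] += 1
    return c

def correct_read(read, counts, k, solid=2):
    """Flip any base whose covering k-mers are all weak (< solid) to the
    substitution with the strongest total k-mer support."""
    read = list(read)
    for i in range(len(read)):
        window = range(max(0, i - k + 1), min(i, len(read) - k) + 1)
        if any(counts["".join(read[s:s + k])] >= solid for s in window):
            continue                       # position already well supported
        best, best_score = read[i], -1
        for base in "ACGT":
            read[i] = base
            score = sum(counts["".join(read[s:s + k])] for s in window)
            if score > best_score:
                best, best_score = base, score
        read[i] = best
    return "".join(read)

# Five clean copies plus one read with a single G->C error.
reads = ["ACGTACGTACGT"] * 5 + ["ACGTACCTACGT"]
counts = kmer_counts(reads, 4)
fixed = correct_read("ACGTACCTACGT", counts, 4)  # error position repaired
```

The erroneous base breaks every 4-mer that covers it, so all its covering counts fall below the solid threshold, and restoring the consensus base maximizes the supporting counts.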
The Gene Ontology (GO) Consortium organizes genes into hierarchical categories based on biological process, molecular function and subcellular localization. Tools such as GoMiner can leverage GO to perform ontological analysis of microarray and proteomics studies, typically generating a list of significant functional categories. Two or more of the categories are often redundant, in the sense that identical or nearly identical sets of genes map to them. Such redundancy can inflate the number of reported significant categories by roughly three-fold, create the illusion of an overly long list of significant categories, and obscure the relevant biological interpretation.
We now introduce a new resource, RedundancyMiner, that de-replicates the redundant and nearly-redundant GO categories that had been determined by first running GoMiner. The main algorithm of RedundancyMiner, MultiClust, performs a novel form of cluster analysis in which a GO category might belong to several category clusters. Each category cluster follows a "complete linkage" paradigm. The metric is a similarity measure that captures the overlap in gene mapping between pairs of categories.
RedundancyMiner effectively eliminated redundancies from a set of GO categories. For illustration, we have applied it to the clarification of the results arising from two current studies: (1) assessment of the gene expression profiles obtained by laser capture microdissection (LCM) of serial cryosections of the retina at the site of final optic fissure closure in the mouse embryos at specific embryonic stages, and (2) analysis of a conceptual data set obtained by examining a list of genes deemed to be "kinetochore" genes.
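The pairwise-overlap metric and "complete linkage" clusters in which one GO category may belong to several clusters can be sketched as below. The Jaccard measure and greedy construction are illustrative assumptions rather than MultiClust's exact procedure:

```python
# Sketch of the redundancy idea: score the gene-mapping overlap between
# pairs of GO categories, then build complete-linkage clusters in which
# every pair of members exceeds a similarity threshold.  A category may
# join more than one cluster, mirroring MultiClust's multi-membership.

def jaccard(a, b):
    """Overlap in gene mapping between two categories."""
    return len(a & b) / len(a | b)

def complete_linkage_clusters(categories, threshold=0.6):
    names = sorted(categories)
    clusters = []
    for name in names:
        placed = False
        for cluster in clusters:
            # Complete linkage: must be similar to EVERY current member.
            if all(jaccard(categories[name], categories[m]) >= threshold
                   for m in cluster):
                cluster.append(name)
                placed = True        # keep going: may join several clusters
        if not placed:
            clusters.append([name])
    return clusters

go = {
    "GO:A": {"g1", "g2", "g3", "g4", "g5"},
    "GO:B": {"g1", "g2", "g3", "g4"},    # near-duplicate of GO:A
    "GO:C": {"g3", "g4", "g5", "g6"},    # distinct enough to seed a cluster
    "GO:D": {"g2", "g3", "g4", "g5"},    # similar to both clusters
}
clusters = complete_linkage_clusters(go)
# -> [['GO:A', 'GO:B', 'GO:D'], ['GO:C', 'GO:D']]
```

GO:D overlaps every member of both clusters above the threshold, so it appears in both, which is exactly the behaviour that distinguishes this clustering style from a hard partition.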
In the last decade, a large amount of microarray gene expression data has been accumulated in public repositories. Integrating and analyzing high-throughput gene expression data have become key activities for exploring gene functions, gene networks and biological pathways. Effectively utilizing these invaluable microarray data remains challenging due to a lack of powerful tools to integrate large-scale gene-expression information across diverse experiments and to search and visualize a large number of gene-expression data points.
Gene Expression Browser is a microarray data integration, management and processing system with web-based search and visualization functions. An innovative method has been developed that defines a treatment-over-control comparison for every microarray experiment, standardizing microarray data from different experiments and making them comparable. In the browser, data are pre-processed offline and the resulting data points are visualized online in a two-layer dynamic web display. Via Gene View, users can see all treatment-over-control comparisons that affect the expression of a selected gene; via Treatment-over-Control View, all genes that change in a selected comparison; and via Slide View, the expression profiles of a set of either comparisons or genes. In addition, the relationships between genes and treatment-over-control comparisons are computed from gene expression ratios and shown as co-responsive genes and co-regulating comparisons.
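The core computations just described, a per-gene treatment-over-control ratio and co-responsiveness between genes across comparisons, can be sketched as follows. Log2 ratios and Pearson correlation are plausible choices, but the browser's exact formulas are not specified here, and all names are illustrative.

```python
import math

def log2_ratio(treatment, control):
    """Per-gene log2(treatment / control) expression change."""
    return {g: math.log2(treatment[g] / control[g]) for g in treatment}

def pearson(xs, ys):
    """Plain Pearson correlation of two equal-length profiles."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def co_responsive(profiles, gene, min_r=0.9):
    """Genes whose ratio profiles across comparisons track the query gene."""
    ref = profiles[gene]
    return [g for g in profiles
            if g != gene and pearson(profiles[g], ref) >= min_r]
```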
Gene Expression Browser is composed of a set of software tools, including a data extraction tool, a microarray data-management system, a data-annotation tool, a microarray data-processing pipeline, and a data search & visualization tool. The browser is deployed as a free public web service (http://www.ExpressionBrowser.com) that integrates 301 ATH1 gene microarray experiments from public data repositories (viz. the Gene Expression Omnibus repository at the National Center for Biotechnology Information and Nottingham Arabidopsis Stock Center). The set of Gene Expression Browser software tools can be easily applied to the large-scale expression data generated by other platforms and in other species.
Most analyses of microarray data are based on point estimates of expression levels and ignore the uncertainty of such estimates. By determining uncertainties from Affymetrix GeneChip data and propagating them to downstream analyses, it has been shown that results of differential expression detection, principal component analysis and clustering can be improved. Previously, implementations of these uncertainty propagation methods were available only as separate packages written in different languages. They have also been computationally very costly and, in the case of differential expression detection, limited in the experimental designs to which they can be applied.
puma is a Bioconductor package incorporating a suite of analysis methods for use on Affymetrix GeneChip data. puma extends the differential expression detection methods of previous work from the 2-class case to the multi-factorial case. puma can be used to automatically create design and contrast matrices for typical experimental designs, which can be used both within the package itself and in other Bioconductor packages. The implementation of the differential expression detection methods has been parallelised, leading to significant decreases in processing time on a range of computer architectures. puma incorporates the first R implementation of an uncertainty propagation version of principal component analysis, as well as an implementation of a clustering method based on uncertainty propagation. All of these techniques are brought together in a single, easy-to-use package with clear, task-based documentation.
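puma's actual models work with full probe-level posterior distributions. As a toy illustration of why propagating per-measurement uncertainty helps downstream analyses, here is the basic inverse-variance weighting idea, under which noisier replicates contribute less to a combined estimate (a deliberate simplification, not puma's implementation):

```python
def combine_measurements(values, variances):
    """Inverse-variance weighted mean and its variance: replicates with
    larger uncertainty contribute less to the combined estimate."""
    weights = [1.0 / v for v in variances]
    total = sum(weights)
    mean = sum(w * x for w, x in zip(weights, values)) / total
    return mean, 1.0 / total
```

A point-estimate analysis would average a precise and a very noisy replicate equally; weighting by inverse variance lets the precise one dominate, which is the intuition carried through puma's probabilistic methods.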
For the first time, the puma package makes a suite of uncertainty propagation methods available to a general audience. These methods can be used to improve results from more traditional analyses of microarray data. puma also offers improvements in terms of scope and speed of execution over previously available methods. puma is recommended for anyone working with the Affymetrix GeneChip platform for gene expression analysis and can also be applied more generally.
Millions of single nucleotide polymorphisms have been identified as a result of the human genome project and the rapid advance of high throughput genotyping technology. Genetic association studies, such as recent genome-wide association studies (GWAS), have provided a springboard for exploring the contribution of inherited genetic variation and gene/environment interactions in relation to disease. Given the capacity of such studies to produce a plethora of information that may then be described in a number of publications, selecting possible disease susceptibility genes and identifying related modifiable risk factors is a major challenge. A Web-based application for finding evidence of such relationships is key to the development of follow-up studies and evidence for translational research.
We developed a Web-based application that selects and prioritizes potential disease-related genes by using a highly curated and updated literature database of genetic association studies. The application, called Gene Prospector, also provides a comprehensive set of links to additional data sources.
We compared Gene Prospector results for the query "Parkinson" with a list of 13 leading candidate genes (Top Results) from a curated, specialty database for genetic associations with Parkinson disease (PDGene). Nine of the thirteen leading candidate genes from PDGene were in the top 10th percentile of the ranked list from Gene Prospector. In fact, Gene Prospector included more published genetic association studies for the 13 leading candidate genes than PDGene did.
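The percentile comparison above is straightforward to reproduce for any ranked list; a hypothetical sketch, assuming a best-first ranking (function and gene names are ours):

```python
def in_top_percentile(ranked_genes, candidates, pct=10):
    """Count candidates falling in the top pct percent of a best-first list."""
    cutoff = max(1, int(len(ranked_genes) * pct / 100))
    top = set(ranked_genes[:cutoff])
    return sum(1 for g in candidates if g in top)
```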
Gene Prospector provides an online gateway for searching for evidence about human genes in relation to diseases, other phenotypes, and risk factors, and provides links to published literature and other online data sources. Gene Prospector can be accessed via .
A number of programs have been developed for simulating population genetic and genetic epidemiological data conforming to one of three main algorithmic approaches: 'forwards', 'backwards' and 'sideways'. This review aims to make the reader aware of the range of options currently available to them. While no one program emerges as the best choice in all circumstances, we nominate a set of those which currently appear most promising.
Population genetics; genetic epidemiology; simulation software
Population structure is a major cause of inconsistent results in population-based association studies (PBAS) of human diseases. Various statistical methods have been proposed to reduce its negative impact on PBAS. However, because structural information about real populations is lacking, it is difficult to evaluate the impact of population structure on PBAS in real populations.
We developed a genetic simulation platform, HAPSIMU, based on real haplotype data from the HapMap ENCODE project. The platform can simulate heterogeneous populations with various known and controllable structures under either the continuous migration model or the discrete model. Moreover, both qualitative and quantitative traits can be simulated using an additive genetic model with various genetic parameters designated by users.
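HAPSIMU draws on real HapMap haplotypes, which this sketch does not attempt to reproduce; it only illustrates the additive genetic model for a quantitative trait, with independent loci sampled under Hardy-Weinberg proportions (a simplification, and all names are ours):

```python
import random

def simulate_trait(genotypes, effects, env_sd=1.0, rng=None):
    """Additive model: phenotype = sum_i beta_i * g_i + environmental noise,
    where g_i is the minor-allele count (0, 1 or 2) at locus i."""
    rng = rng or random.Random()
    genetic = sum(b * g for b, g in zip(effects, genotypes))
    return genetic + rng.gauss(0.0, env_sd)

def simulate_population(freqs, effects, n, env_sd=1.0, seed=0):
    """Sample n individuals with independent loci (Hardy-Weinberg:
    each of the two alleles drawn independently at frequency p)."""
    rng = random.Random(seed)
    pop = []
    for _ in range(n):
        geno = [sum(rng.random() < p for _ in range(2)) for p in freqs]
        pop.append((geno, simulate_trait(geno, effects, env_sd, rng)))
    return pop
```

Setting the environmental standard deviation to zero recovers the purely genetic value, which is convenient for checking heritability calculations in a simulation study.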
HAPSIMU provides a common genetic simulation platform to evaluate the impact of population structure on PBAS, and compare the relative performance of various population structure identification and PBAS methods.