PSEUDOMARKER is a software package that performs joint linkage and linkage disequilibrium analysis between a marker and a putative disease locus. A key feature of PSEUDOMARKER is that it can combine case-control samples and pedigrees of varying structure into a single unified analysis. To do so, it maximizes the full likelihood of the data over the marker allele frequencies, or the allele frequencies conditional on disease status, and the recombination fraction.
The new version 2.0 uses the software package NOMAD to maximize likelihoods, generally reaching comparable or better optima with many fewer evaluations of the likelihood function.
Having been substantially modified to use modern optimization methods, PSEUDOMARKER version 2.0 is more robust and considerably faster than version 1.0. NOMAD may be useful in other bioinformatics problems where complex likelihood functions are optimized.
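NOMAD implements mesh-adaptive direct search (MADS) for black-box functions, which suits likelihoods like PSEUDOMARKER's that are expensive to evaluate and offer no derivatives. As a minimal sketch of the pattern, and not PSEUDOMARKER's actual code, the toy below maximizes an invented likelihood (binomial recombination counts plus multinomial marker allele counts) with a derivative-free scipy optimizer standing in for NOMAD; the data and the unconstrained reparameterization are assumptions for illustration.

```python
import numpy as np
from scipy.optimize import minimize

# Toy data (invented): 3 recombinants in 20 informative meioses,
# and marker allele counts of 30 and 70 observed in founders.
K_REC, N_MEIOSES = 3, 20
ALLELE_COUNTS = np.array([30.0, 70.0])

def neg_log_likelihood(x):
    # Unconstrained parameterization: a scaled sigmoid keeps the
    # recombination fraction theta in (0, 0.5); a softmax keeps the
    # allele frequencies on the probability simplex.
    theta = 0.5 / (1.0 + np.exp(-x[0]))
    freqs = np.exp(x[1:]) / np.exp(x[1:]).sum()
    ll = K_REC * np.log(theta) + (N_MEIOSES - K_REC) * np.log(1.0 - theta)
    ll += (ALLELE_COUNTS * np.log(freqs)).sum()
    return -ll

# Derivative-free search, standing in for NOMAD's MADS algorithm.
res = minimize(neg_log_likelihood, x0=np.zeros(3), method="Nelder-Mead")
print(0.5 / (1.0 + np.exp(-res.x[0])))  # theta-hat, approx. 3/20
```

The appeal of a direct-search method here is that each likelihood evaluation is costly, so an optimizer that reaches a good optimum in fewer evaluations, as reported for version 2.0, pays off directly.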
Recent increases in genomic studies of the developing human fetus and neonate have led to a need for widespread characterization of the functional roles of genes at different developmental stages. The Gene Ontology (GO), a valuable and widely-used resource for characterizing gene function, offers perhaps the most suitable functional annotation system for this purpose. However, due in part to the difficulty of studying molecular genetic effects in humans, even the current collection of comprehensive GO annotations for human genes and gene products often lacks adequate developmental context for scientists wishing to study gene function in the human fetus.
The Developmental FunctionaL Annotation at Tufts (DFLAT) project aims to improve the quality of analyses of fetal gene expression and regulation by curating human fetal gene functions using both manual and semi-automated GO procedures. Eligible annotations are then contributed to the GO database and included in GO releases of human data. DFLAT has produced a considerable body of functional annotation that we demonstrate provides valuable information about developmental genomics. A collection of gene sets (genes implicated in the same function or biological process), made by combining existing GO annotations with the 13,344 new DFLAT annotations, is available for use in novel analyses. Gene set analyses of expression in several data sets, including amniotic fluid RNA from fetuses with trisomies 21 and 18, umbilical cord blood, and blood from newborns with bronchopulmonary dysplasia, were conducted both with and without the DFLAT annotation.
Functional analysis of expression data using the DFLAT annotation increases the number of implicated gene sets, reflecting DFLAT's improved representation of current knowledge. Blinded literature review supports the validity of newly significant findings obtained with the DFLAT annotations. Newly implicated significant gene sets also suggest specific hypotheses for future research. Overall, the DFLAT project contributes new functional annotation and gene sets likely to enhance our ability to interpret genomic studies of human fetal and neonatal development.
Human development; Functional annotation; Databases; Gene function; Fetal; Neonatal; Gene set analysis
Mixtures of internationally traded organic substances can contain parts of species protected by the Convention on International Trade in Endangered Species of Wild Fauna and Flora (CITES). These mixtures often raise the suspicion of border control and customs offices, which can lead to confiscation, for example in the case of Traditional Chinese medicines (TCMs). High-throughput sequencing of DNA barcoding markers obtained from such samples provides insight into species constituents of mixtures, but manual cross-referencing of results against the CITES appendices is labor intensive. Matching DNA barcodes against NCBI GenBank using BLAST may yield misleading results both as false positives, due to incorrectly annotated sequences, and false negatives, due to spurious taxonomic re-assignment. Incongruence between the taxonomies of CITES and NCBI GenBank can result in erroneous estimates of illegal trade.
The HTS barcode checker pipeline is an application for automated processing of sets of 'next-generation' barcode sequences to determine whether these contain DNA barcodes obtained from species listed on the CITES appendices. This analytical pipeline builds upon and extends existing open-source applications for BLAST matching against the NCBI GenBank reference database and for taxonomic name reconciliation. In a single operation, reads are converted into taxonomic identifications matched with names on the CITES appendices. By including a blacklist and additional name databases, the HTS barcode checker pipeline prevents false positives and resolves taxonomic heterogeneity.
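The final cross-referencing step can be pictured as follows. This is a minimal sketch under assumed file layouts, not the pipeline's actual code: the tabular BLAST column order and the helper names (load_cites_names, flag_hits) are illustrative only.

```python
import csv

def load_cites_names(path):
    """Read a one-name-per-line dump of the CITES appendices."""
    with open(path) as fh:
        return {line.strip().lower() for line in fh if line.strip()}

def flag_hits(blast_tsv, cites_names, blacklist=frozenset()):
    """Yield (read_id, taxon) for BLAST hits matching CITES-listed taxa.

    Assumes tabular BLAST output with the read id in column 0, the
    subject's resolved taxon name in column 1, and the subject
    accession in column 2. The blacklist holds accessions known to
    be mis-annotated, suppressing false positives.
    """
    with open(blast_tsv) as fh:
        for row in csv.reader(fh, delimiter="\t"):
            read_id, taxon, accession = row[0], row[1], row[2]
            if accession in blacklist:
                continue
            if taxon.lower() in cites_names:
                yield read_id, taxon
```

In the real pipeline the taxon names additionally pass through taxonomic name reconciliation, so that synonymy between the NCBI and CITES taxonomies does not produce false negatives.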
The HTS barcode checker pipeline can detect and correctly identify DNA barcodes of CITES-protected species from reads obtained from TCM samples in just a few minutes. The pipeline facilitates and improves molecular monitoring of trade in endangered species, and can aid in safeguarding these species from extinction in the wild. The HTS barcode checker pipeline is available at https://github.com/naturalis/HTS-barcode-checker.
Biodiversity informatics; High-throughput sequencing; Taxonomy; Wildlife forensics
The advent of next-generation DNA sequencing platforms has revolutionized molecular microbial ecology by making the detailed analysis of complex communities over time and space a tractable research pursuit for small research groups. However, the ability to generate 10^5–10^8 reads with relative ease brings with it many downstream complications. Beyond the computational resources and skills needed to process and analyze data, it is difficult to compare datasets in an intuitive and interactive manner that leads to hypothesis generation and testing.
We developed the free web service VAMPS (Visualization and Analysis of Microbial Population Structures, http://vamps.mbl.edu) to address these challenges and to facilitate research by individuals or collaborating groups working on projects with large-scale sequencing data. Users can upload marker gene sequences and associated metadata; reads are quality filtered and assigned both to taxonomic structures and to taxonomy-independent clusters. A simple point-and-click interface allows users to select for analysis any combination of their own or their collaborators’ private data and data from public projects, filter these by their choice of taxonomic and/or abundance criteria, and then explore these data using a wide range of analytic methods and visualizations. Each result is extensively hyperlinked to other analysis and visualization options, promoting data exploration and leading to a greater understanding of data relationships.
VAMPS allows researchers using marker gene sequence data to analyze the diversity of microbial communities and the relationships between communities, to explore these analyses in an intuitive visual context, and to download data, results, and images for publication. VAMPS obviates the need for individual research groups to make the considerable investment in computational infrastructure and bioinformatic support otherwise necessary to process, analyze, and interpret massive amounts of next-generation sequence data. Any web-capable device can be used to upload, process, explore, and extract data and results from VAMPS. VAMPS encourages researchers to share sequence and metadata, and fosters collaboration between researchers of disparate biomes who recognize common patterns in shared data.
Microbiome; Microbial ecology; Microbial diversity; Data visualization; Website; Bacteria; SSU rRNA; Next-generation sequencing
The rapid advancements in genome sequencing are aiding our understanding of many biological systems. In the last five years, computational biologists and bioinformatics specialists have developed newer, better, and more efficient tools for the discovery, analysis, and interpretation of genomic variants from high-throughput sequencing data. The availability of reliable simulated datasets is essential, and is the first step towards testing any newly developed analytical tool for variant discovery. Although tools are currently available that can simulate variants, none can simulate all three major types of variation (single nucleotide polymorphisms, insertions and deletions, and copy number variations) while generating reads under a realistic error model. Therefore, an efficient simulator and read generator is needed that can simulate variants while taking the error rates of true biological samples into consideration.
We report SInC (Snp, Indel and Cnv), an open-source variant simulator and read generator capable of simulating all three common types of biological variants while taking into account the base quality score distribution of a widely used Illumina next-generation sequencing instrument. SInC can generate single- and paired-end reads with user-defined insert sizes, and does so more efficiently than other existing tools. Thanks to multi-threaded read generation, SInC has a low runtime footprint. SInC is currently optimized to work in a limited infrastructure setup and can efficiently exploit a commonly used quad-core desktop architecture to simulate short sequence reads with deep coverage for large genomes.
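To illustrate how an empirical per-cycle quality profile translates into simulated errors, the sketch below draws one single-end read and substitutes bases at the Phred-implied rate for each cycle. It is a toy under stated assumptions, not SInC's implementation, which additionally simulates SNPs, indels, CNVs, and paired-end reads.

```python
import random

BASES = "ACGT"

def simulate_read(genome, length, qual_profile, rng=None):
    """Draw one read and inject substitution errors per cycle.

    qual_profile[i] is the Phred score assumed for cycle i (e.g.
    sampled from an empirical Illumina distribution); the error
    probability at that cycle is 10**(-Q/10).
    """
    rng = rng or random.Random(0)
    start = rng.randrange(len(genome) - length)
    read = []
    for i, base in enumerate(genome[start:start + length]):
        if rng.random() < 10 ** (-qual_profile[i] / 10.0):
            base = rng.choice([b for b in BASES if b != base])
        read.append(base)
    return start, "".join(read)
```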
We have developed a user-friendly multi-variant simulator and read generator called SInC. SInC can be downloaded from
New technologies for analyzing biological samples, such as next-generation sequencing, are producing a growing amount of data together with quality scores. Moreover, software tools, e.g., for mapping sequence reads, calculating transcription factor binding probabilities, estimating regions enriched for epigenetic modifications, or determining single nucleotide polymorphisms, increase this amount of position-specific DNA-related data even further. Hence, requesting data becomes challenging and expensive and is often implemented using specialized hardware. In addition, picking specific data as fast as possible becomes increasingly important in many fields of science. The general problem of handling big data sets was addressed by the development of specialized databases like HBase, HyperTable, or Cassandra. However, these database solutions also require specialized or distributed hardware, leading to expensive investments. To the best of our knowledge, there is no database capable of (i) storing billions of position-specific DNA-related records, (ii) performing fast and resource-saving requests, and (iii) running on single standard computer hardware.
Here, we present DRUMS (Disk Repository with Update Management and Select option), satisfying demands (i)-(iii). It tackles the weaknesses of traditional databases while handling position-specific DNA-related data in an efficient manner. DRUMS is capable of storing up to billions of records. Moreover, it is optimized for treating related single lookups as range requests, which are needed constantly in bioinformatics computations. To validate the power of DRUMS, we compare it to the widely used MySQL database, considering two biological data sets on standard desktop hardware.
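The access pattern DRUMS optimizes can be illustrated in a few lines: with records kept sorted by position, a range request is two binary searches plus one sequential scan, instead of millions of point lookups. DRUMS's actual on-disk layout and update management are far more involved; this sketch only shows why range requests are cheap on sorted data.

```python
from bisect import bisect_left, bisect_right

def range_request(positions, records, start, end):
    """Return all records with start <= position <= end.

    positions: sorted list of genomic positions, aligned with records.
    """
    lo = bisect_left(positions, start)
    hi = bisect_right(positions, end)
    return records[lo:hi]
```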
DRUMS outperforms MySQL in writing and reading records by a factor of two up to a factor of 10,000. Furthermore, it can work with significantly larger data sets. Our work focuses on mid-sized data sets of up to several billion records without requiring cluster technology. Storing position-specific data is a general problem, and the concept we present here is a generalized approach; hence, it can easily be applied to other fields of bioinformatics.
Database; HERV; SNP; DNA related data; High throughput data
Several small open reading frames located within the 5′ untranslated regions of mRNAs have recently been shown to be translated. In humans, about 50% of mRNAs contain at least one upstream open reading frame representing a large resource of coding potential. We propose that some upstream open reading frames encode peptides that are functional and contribute to proteome complexity in humans and other organisms. We use the term uPEPs to describe peptides encoded by upstream open reading frames.
We have developed an online tool, termed uPEPperoni, to facilitate the identification of putative bioactive peptides. uPEPperoni detects conserved upstream open reading frames in eukaryotic transcripts by comparing query nucleotide sequences against mRNA sequences within the NCBI RefSeq database. The algorithm first locates the main coding sequence and then searches for open reading frames 5′ to the main start codon, which are subsequently analysed for conservation. uPEPperoni also determines the substitution frequency for both the upstream open reading frames and the main coding sequence. In addition, the uPEPperoni tool produces sequence identity heatmaps which allow rapid visual inspection of conserved regions in paired mRNAs.
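The core detection step can be sketched as below: find each ATG upstream of the main start codon and extend it in frame to the first stop. This is a simplification for illustration; uPEPperoni additionally aligns the candidates against RefSeq mRNAs and scores conservation and substitution frequencies.

```python
import re

STOPS = {"TAA", "TAG", "TGA"}

def find_uorfs(mrna, cds_start, min_codons=2):
    """Return (start, end) spans of candidate uORFs.

    mrna: transcript sequence 5'->3'; cds_start: 0-based offset of
    the main start codon. A uORF here is an upstream ATG followed
    by an in-frame stop (the uORF may extend into the CDS).
    """
    uorfs = []
    for m in re.finditer("ATG", mrna[:cds_start]):
        for i in range(m.start() + 3, len(mrna) - 2, 3):
            if mrna[i:i + 3] in STOPS:
                if (i - m.start()) // 3 >= min_codons:
                    uorfs.append((m.start(), i + 3))
                break
    return uorfs
```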
uPEPperoni features user-nominated settings, including nucleotide match/mismatch scores, gap penalties, Ka/Ks ratios, and output mode. The heatmap output shows levels of identity between any two sequences and allows easy recognition of conserved regions. Furthermore, this web tool allows comparison of the evolutionary pressures acting on the upstream open reading frame against other regions of the mRNA. Additionally, the heatmap web applet can be used to visualise the degree of conservation in any pair of sequences. uPEPperoni is freely available on an interactive web server at http://upep-scmb.biosci.uq.edu.au.
5′UTR; uORFs; mRNAs; Sequence conservation; Short peptides; Poly-cistronic; Homology heatmaps
The diagnosis and prognosis of several diseases can be accelerated through the use of different large-scale genome experiments. In this context, microarrays can generate expression data for a huge set of genes. However, to obtain solid statistical evidence from the resulting data, it is necessary to train and validate many classification techniques in order to find the best discriminative method. This is a time-consuming process that normally depends on intricate statistical tools.
geneCommittee is a web-based interactive tool for routinely evaluating the discriminative classification power of custom hypotheses in the form of biologically relevant gene sets. While the user can work with different gene set collections and several microarray data files to configure specific classification experiments, the tool is able to run several tests in parallel. Provided with a straightforward and intuitive interface, geneCommittee is able to render valuable information for diagnostic analyses and clinical management decisions by systematically evaluating custom hypotheses over different data sets using complementary classifiers, a key aspect in clinical research.
geneCommittee allows the enrichment of raw microarray data with gene functional annotations, producing integrated datasets that simplify the construction of better discriminative hypotheses, and it supports the creation of sets of complementary classifiers. The trained committees can then be used for clinical research and diagnosis. Full documentation, including common use cases and guided analysis workflows, is freely available at http://sing.ei.uvigo.es/GC/.
Many high-throughput sequencing (HTS) approaches, such as the Roche/454 platform, produce sequences in which the quality (as measured by Phred-like quality scores) decreases linearly across a sequence read. Quality trimming of these data is essential to enable confidence in the results of subsequent downstream analysis. Here, we have developed a novel, highly sensitive and accurate approach (QTrim) for the quality trimming of sequence reads generated using the Roche/454 sequencing platform (or any platform with long reads that outputs Phred-like quality scores).
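The abstract does not spell out QTrim's exact decision rule, so the sketch below shows only the generic shape of window-based quality trimming that such tools apply to reads whose Phred scores degrade toward the 3' end; the window size and threshold are illustrative defaults, not QTrim's.

```python
def quality_trim(quals, window=10, threshold=20):
    """Return the 0-based index at which to cut the read.

    Scans sliding windows of Phred scores from the 5' end and cuts
    where the window mean first drops below the threshold.
    """
    for i in range(len(quals) - window + 1):
        if sum(quals[i:i + window]) / window < threshold:
            return i
    return len(quals)

# Example: a 454-style read whose quality degrades steadily.
quals = [38] * 200 + [25, 22, 19, 16, 13, 10] * 10
print(quality_trim(quals))  # cuts at position 200
```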
The performance of QTrim was evaluated against all other available quality trimming approaches on both poor- and high-quality 454 sequence data. In all cases, QTrim performs at least as well as the best alternative approach (PRINSEQ), with these two methods significantly outperforming all other methods. Further analysis of the trimmed data revealed that the trimming approach implemented in QTrim keeps the prevalence of low-quality bases in the resulting trimmed data substantially lower than PRINSEQ or any of the other approaches tested.
QTrim is a novel, highly sensitive and accurate algorithm for the quality trimming of Roche/454 sequence reads. It is implemented both as an executable program that can be integrated with standalone sequence analysis pipelines and as a web-based application to enable individuals with little or no bioinformatics experience to quality trim their sequence data.
Quality trimming; Next-generation sequencing; High-throughput sequencing; Phred scores
Massively parallel DNA sequencing generates staggering amounts of data. Decreasing cost, increasing throughput, and improved annotation have expanded the diversity of genomics applications in research and clinical practice. This expanding scale creates analytical challenges: accommodating peak compute demand, coordinating secure access for multiple analysts, and sharing validated tools and results.
To address these challenges, we have developed the Mercury analysis pipeline and deployed it in local hardware and the Amazon Web Services cloud via the DNAnexus platform. Mercury is an automated, flexible, and extensible analysis workflow that provides accurate and reproducible genomic results at scales ranging from individuals to large cohorts.
By implementing Mercury on the DNAnexus platform and taking advantage of cloud computing, we have demonstrated a powerful combination of a robust and fully validated software pipeline and a scalable computational resource that, to date, we have applied to more than 10,000 whole-genome and whole-exome samples.
NGS data; Variant calling; Annotation; Clinical sequencing; Cloud computing
Next-generation sequencing (NGS) technologies have resulted in petabytes of scattered data, decentralized in archives, databases, and sometimes in isolated hard disks that are inaccessible for browsing and analysis. It is expected that curated secondary databases will help organize some of this big data, thereby allowing users to better navigate, search, and compute on it.
To address the above challenge, we have implemented an NGS biocuration workflow and are analyzing short read sequences and associated metadata from cancer patients to better understand the human variome. Curation of variation and other related information from control (normal tissue) and case (tumor) samples will provide comprehensive background information that can be used in genomic medicine research and application studies. Our approach includes a CloudBioLinux virtual machine used upstream of the High-performance Integrated Virtual Environment (HIVE), which encapsulates the Curated Short Read archive (CSR) and a proteome-wide variation effect analysis tool (SNVDis). As a proof of concept, we have curated and analyzed control and case breast cancer datasets from the NCI cancer genomics program, The Cancer Genome Atlas (TCGA). Our efforts include reviewing and recording in CSR the available clinical information on patients, mapping the reads to the reference genome, identifying non-synonymous single nucleotide variations (nsSNVs), and integrating the data with tools that allow analysis of the effect of nsSNVs on the human proteome. Furthermore, we have developed a novel phylogenetic analysis algorithm that uses SNV positions and can be used to classify the patient population. The workflow described here lays the foundation for analysis of short read sequence data to identify rare and novel SNVs that are not present in dbSNP, and therefore provides a more comprehensive understanding of the human variome. Variation results for single genes as well as the entire study are available from the CSR website (http://hive.biochemistry.gwu.edu/dna.cgi?cmd=csr).
Availability of thousands of sequenced samples from patients provides a rich repository of sequence information that can be utilized to identify individual level SNVs and their effect on the human proteome beyond what the dbSNP database provides.
SRA; TCGA; nsSNV; SNV; SNP; Next-gen; NGS; Phylogenetics; Cancer
The introduction of next-generation sequencing (NGS) technology has made it possible to detect genomic alterations within tumor cells on a large scale. However, most applications of NGS show the genetic content of mixtures of cells. Recently developed single cell sequencing technology can identify variation within a single cell. Characterization of multiple samples from a tumor using single cell sequencing can potentially provide information on the evolutionary history of that tumor. This may facilitate understanding how key mutations accumulate and evolve in lineages to form a heterogeneous tumor.
We provide a computational method to infer an evolutionary mutation tree based on single cell sequencing data. Our approach differs from traditional phylogenetic tree approaches in that our mutation tree directly describes temporal order relationships among mutation sites. Our method also accommodates sequencing errors. Furthermore, we provide a method for estimating the proportion of time from the earliest mutation event of the sample to the most recent common ancestor of the sample of cells. Finally, we discuss current limitations on modeling with single cell sequencing data and possible improvements under those limitations.
Inferring the temporal ordering of mutational sites using current single cell sequencing data is a challenge. Our proposed method may help elucidate relationships among key mutations and their role in tumor progression.
The interest of the scientific community in investigating the impact of rare variants on complex traits has stimulated the development of novel statistical methodologies for association studies. The fact that many of the recently proposed methods for association studies suffer from low power to identify a genetic association motivates the incorporation of prior knowledge into statistical tests.
In this article we propose a methodology for incorporating prior information into the region-based score test. Within our framework, prior information is used to partition variants within a region into several groups, after which asymptotically independent group statistics are constructed and combined into a global test statistic. Under the null hypothesis, our test statistic has fewer degrees of freedom than the region-based score statistic. Theoretical power comparison, population genetics simulations, and results from analysis of the GAW17 sequencing data set suggest that under some scenarios our method may perform as well as or outperform the score test and other competing methods.
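The combination step rests on a standard fact: a sum of K independent statistics, each asymptotically chi-square with one degree of freedom under the null (assuming, for illustration, a burden-style statistic per group), is chi-square with K degrees of freedom, typically far fewer than the number of variants in the region. A minimal sketch, assuming the per-group statistics have already been computed:

```python
from scipy.stats import chi2

def global_test(group_stats):
    """Combine independent per-group chi-square(1) statistics.

    Returns the global statistic and its p-value on K degrees of
    freedom, K being the number of prior-defined groups.
    """
    stat = sum(group_stats)
    return stat, chi2.sf(stat, df=len(group_stats))

stat, p = global_test([4.1, 0.7, 2.9])  # three illustrative groups
print(f"global stat = {stat:.1f}, p = {p:.3f}")
```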
An approach that uses prior information to improve the power of the region-based score test is proposed. Theoretical power comparison, population genetics simulations, and the results of GAW17 data analysis showed that for some scenarios the power of our method is on a level with, or higher than, that of the score test and other methods.
Prior information; Association analysis; Score test; Rare variants
The methylation of cytosines at CpG dinucleotides, which plays an important role in the regulation of gene expression, is one of the most studied epigenetic modifications. Thus far, DNA methylation has been detected mostly by experimental methods, which are not only prone to bench effects and artifacts but are also time-consuming, expensive, and hard to scale up to many samples. It is therefore useful to develop computational methods for predicting DNA methylation. Our previous studies highlighted correlations between the GC content of the third codon position (GC3), methylation, and gene expression. We thus designed a model to predict methylation in Oryza sativa based on genomic sequence features and gene expression data.
We first derive equations that describe the relationship between gene methylation levels, GC3, expression, length, and other gene compositional features. We next assess gene compositional features involving sixmers and their association with methylation levels and other gene-level properties. By applying our sixmer-based approach to rice gene expression data, we show that it can accurately predict methylation (Pearson's correlation coefficient r = 0.79) for the majority (79%) of the genes. Matlab code for our model is included.
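A minimal sketch of the shape of such a predictor, assuming sixmer frequencies plus expression level as features in a linear model (the published model is in Matlab and its exact terms follow the derived equations, not this simplification):

```python
from itertools import product
import numpy as np
from sklearn.linear_model import Ridge

SIXMERS = ["".join(p) for p in product("ACGT", repeat=6)]
INDEX = {k: i for i, k in enumerate(SIXMERS)}

def sixmer_features(seq):
    """Normalized 6-mer frequency vector of a gene sequence."""
    v = np.zeros(len(SIXMERS))
    for i in range(len(seq) - 5):
        kmer = seq[i:i + 6]
        if kmer in INDEX:          # skip windows containing N etc.
            v[INDEX[kmer]] += 1
    return v / max(1, len(seq) - 5)

def fit_methylation_model(seqs, expression, methylation):
    """Fit methylation ~ 6-mer composition + expression level."""
    X = np.array([sixmer_features(s) for s in seqs])
    X = np.column_stack([X, expression])
    return Ridge(alpha=1.0).fit(X, methylation)
```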
Gene expression variation can be used as a predictor of gene methylation levels.
DNA methylation; Gene expression; GC3; Prediction; Oryza sativa
Glioblastoma is the most aggressive primary central nervous system tumor and carries a very poor prognosis. Invasion precludes effective treatment and virtually assures tumor recurrence. In the current study, we applied analytical and bioinformatics approaches to identify a set of microRNAs (miRs) from several human glioblastoma cell lines that exhibit significant differential expression between migratory (edge) and migration-restricted (core) cell populations. The hypothesis of the study is that differential expression of miRs provides an epigenetic mechanism to drive cell migration and invasion.
Our data comprise gene expression values for a set of 805 human miRs collected from matched pairs of migratory (edge) and migration-restricted (core) cell populations from seven different glioblastoma cell lines. We identified 62 down-regulated and 2 up-regulated miRs that exhibit significant differential expression in the migratory cell population compared to matched migration-restricted cells. We then conducted target prediction and pathway enrichment analysis with these miRs to investigate potential associated gene and pathway targets. Several miRs in the list appear to directly target apoptosis-related genes. The analysis identifies a set of genes predicted by three different algorithms, further supporting the potential role of these miRs in glioblastoma.
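The edge-versus-core comparison can be pictured as a paired test per miR across the seven matched cell-line pairs, followed by multiple-testing correction; this is a generic sketch of such a screen, not the study's exact statistical pipeline.

```python
import numpy as np
from scipy.stats import ttest_rel
from statsmodels.stats.multitest import multipletests

def differential_mirs(edge, core, alpha=0.05):
    """edge, core: arrays of shape (n_mirs, n_cell_lines), matched.

    Returns indices of significant miRs and their BH-adjusted
    q-values.
    """
    pvals = np.array([ttest_rel(e, c).pvalue for e, c in zip(edge, core)])
    reject, qvals, _, _ = multipletests(pvals, alpha=alpha, method="fdr_bh")
    return np.flatnonzero(reject), qvals
```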
The results of this study identify a set of miRs whose expression is potentially decreased in invasive glioblastoma cells. Verification of these miRs and their associated target proteins provides new leads for further investigation into therapeutic interventions. The methodological approaches employed here could be applied to the study of other diseases to provide biomedical researchers and clinicians with increased opportunities for therapeutic interventions.
MicroRNA; miR; Glioblastoma; Cell migration; Gene expression; Target prediction; Pathway analysis
Structural study of retinal blood vessels provides an early indication of diseases such as diabetic retinopathy, glaucoma, and hypertensive retinopathy. These studies require accurate tracing of the retinal vessel tree structure from fundus images in an automated manner. However, existing work encounters great difficulties when dealing with the crossover issue commonly seen in vessel networks.
In this paper, we consider a novel graph-based approach to address this tracing-with-crossovers problem. After initial steps of segmentation and skeleton extraction, a graph representation of the vessel network is established, where each segment in the skeleton map becomes a node, and a direct contact between two adjacent segments is translated into an undirected edge between the two corresponding nodes. Segments in the skeleton map touching the optic disc area are considered root nodes, which determines the number of trees to be found in the vessel network: it is always equal to the number of root nodes. Based on this undirected graph representation, the tracing problem is then connected to the well-studied problem of transductive inference in machine learning, where the goal becomes that of properly propagating the tree labels from the known root nodes to the rest of the graph, such that the graph is partitioned into disjoint sub-graphs, or equivalently, each of the trees is traced and separated from the rest of the vessel network. This connection enables us to address the tracing problem by exploiting established developments in transductive inference. Empirical experiments on publicly available fundus image datasets demonstrate the applicability of our approach.
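The propagation step itself is compact. The sketch below is plain iterative label propagation with clamped seeds over the segment graph, a simple instance of the transductive schemes the paper draws on rather than its exact formulation: each unlabeled node repeatedly averages its neighbours' label mass until the assignment stabilizes.

```python
import numpy as np

def propagate_labels(adj, seeds, n_labels, iters=200):
    """adj: (n, n) symmetric adjacency matrix of the segment graph;
    seeds: {node_index: tree_label} for segments touching the
    optic disc. Returns a tree label per segment."""
    n = adj.shape[0]
    Y = np.zeros((n, n_labels))
    for node, lab in seeds.items():
        Y[node, lab] = 1.0
    deg = adj.sum(axis=1, keepdims=True)
    deg[deg == 0] = 1.0
    for _ in range(iters):
        Y = adj @ Y / deg                  # average neighbour labels
        for node, lab in seeds.items():    # clamp the root nodes
            Y[node] = 0.0
            Y[node, lab] = 1.0
    return Y.argmax(axis=1)
```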
We provide a novel and systematic approach to trace retinal vessel trees in the presence of crossovers by solving a transductive learning problem on induced undirected graphs.
Automated gene calling is still an error-prone process, particularly for the highly plastic genomes of fungal species. Improving gene models through quality control and manual curation is a time-consuming process that requires skilled biologists, and it is consequently performed only to a limited extent. The wealth of available fungal genomes has not yet been exploited by an automated method that applies quality control to gene models in order to obtain more accurate genome annotations.
We provide a novel method named alignment-based fungal gene prediction (ABFGP) that is particularly suitable for plastic genomes like those of fungi. It can assess gene models on a gene-by-gene basis, making use of informant gene loci. Its performance was benchmarked on 6,965 gene models confirmed by full-length unigenes from ten different fungi. 79.4% of all gene models were correctly predicted by ABFGP. It improves the output of ab initio gene prediction software thanks to higher sensitivity and precision for all gene model components. The applicability of the method was shown by revisiting the annotations of six different fungi, using gene loci from up to 29 fungal genomes as informants. Between 7,231 and 8,337 genes were assessed by ABFGP, and for each genome between 1,724 and 3,505 gene model revisions were proposed. The reliability of the proposed gene models is assessed by an a posteriori introspection procedure applied to each intron and exon in the multiple gene model alignment. The total number and type of proposed gene model revisions in the six fungal genomes correlate with the quality of the genome assembly and with the sequencing strategies used in the sequencing centre, highlighting different types of errors in different annotation pipelines. The ABFGP method is particularly successful in discovering sequence errors and/or disruptive mutations that cause truncated and erroneous gene models.
The ABFGP method is an accurate and fully automated quality control method for fungal gene catalogues that can easily be implemented into existing annotation pipelines. With the exponential release of new genomes, the ABFGP method will help decrease the number of gene models that require additional manual curation.
Gene model; Automated gene model curation; Sequence error; Truncated gene model; Pseudogene; Fungal genome; Cladosporium fulvum
Independent data sources can be used to augment post-marketing drug safety signal detection. The vast amount of publicly available biomedical literature contains rich side effect information for drugs at all clinical stages. In this study, we present a large-scale signal boosting approach that combines over 4 million records in the US Food and Drug Administration (FDA) Adverse Event Reporting System (FAERS) and over 21 million biomedical articles.
The datasets comprise 4,285,097 records from FAERS and 21,354,075 MEDLINE articles. We first extracted all drug-side effect (SE) pairs from FAERS. Our study implemented a total of seven signal ranking algorithms. We then compared these ranking algorithms before and after they were boosted with signals from MEDLINE sentences or abstracts. Finally, we manually curated all drug-cardiovascular (CV) pairs that appeared in both data sources and investigated whether our approach can detect true signals that have not yet been included in FDA drug labels. We extracted a total of 2,787,797 drug-SE pairs from FAERS with a low initial precision of 0.025. Combining signals from both FAERS and MEDLINE significantly improved the precision from 0.025 to 0.371 for top-ranked pairs, a 13.8-fold elevation in precision. We showed by manual curation that drug-SE pairs appearing in both data sources were highly enriched with true signals, many of which have not yet been included in FDA drug labels.
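The abstract does not name the seven ranking algorithms, but disproportionality scores such as the proportional reporting ratio (PRR) are standard in this setting; the boosting rule shown after it is purely illustrative of rescaling a FAERS score by literature support, not the paper's actual combination.

```python
def prr(a, b, c, d):
    """Proportional reporting ratio from a 2x2 contingency table.

    a: reports of the drug with the SE;   b: the drug without it;
    c: other drugs with the SE;           d: other drugs without it.
    """
    return (a / (a + b)) / (c / (c + d))

def boosted_score(faers_score, medline_mentions):
    """Illustrative boost: damped scaling by MEDLINE support."""
    return faers_score * (1.0 + medline_mentions) ** 0.5
```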
We have developed an efficient and effective drug safety signal ranking and strengthening approach. We demonstrate that combining information from FAERS and the biomedical literature on a large scale can contribute significantly to drug safety surveillance.
Information about genes, transcripts, and proteins is spread over a wide variety of databases. Different tools have been developed that use these databases to identify biological signals in gene lists from large-scale analyses; mostly, they search for enrichment of specific features. However, these tools do not allow an explorative walk through different views, nor modification of the gene lists as new hypotheses emerge.
To fill this niche, we have developed ISAAC, the InterSpecies Analysing Application using Containers. The central idea of this web-based tool is to enable the analysis of sets of genes, transcripts, and proteins under different biological viewpoints, and to interactively modify these sets at any point of the analysis. Detailed history and snapshot information allows each action to be traced. Furthermore, one can easily switch back to previous states and perform new analyses. Currently, sets can be viewed in the context of genomes, protein functions, protein interactions, pathways, regulation, diseases, and drugs. Additionally, users can switch between species, with an automatic, orthology-based translation of existing gene sets. As today's research is usually performed in larger teams and consortia, ISAAC provides group-based functionality: sets as well as analysis results can be exchanged between members of a group.
ISAAC fills the gap between primary databases and tools for the analysis of large gene lists. With its highly modular, Java EE-based design, the implementation of new modules is straightforward. Furthermore, ISAAC comes with an extensive web-based administration interface, including tools for the integration of third-party data, so a local installation is easily feasible. In summary, ISAAC is tailor-made for highly explorative, interactive analyses of gene, transcript, and protein sets in a collaborative environment.
Teamwork; Gene sets; Explorative analyses; Cross-species analyses
Different methods have been proposed for analyzing differentially expressed (DE) genes in microarray data. Methods based on statistical tests that incorporate expression level variability are used more commonly than those based on fold change (FC). However, FC-based results are more reproducible and biologically relevant.
We propose a new method based on fold change rank ordering statistics (FCROS). We exploit the variation in calculated FC levels by using combinatorial pairs of samples from the biological conditions in the dataset. A statistic is associated with the ranks of the FC values for each gene, and the resulting probability is used to identify the DE genes within an error level. The FCROS method is deterministic, has a low computational runtime, and sidesteps the multiple-testing problem that usually arises with microarray datasets.
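A compact sketch of the FCROS idea under simplifying assumptions (positive intensities, no tie handling): fold changes are computed for every (condition-1, condition-2) sample pair, genes are ranked within each pair, and each gene's rescaled mean rank is converted to a two-sided probability via a normal approximation.

```python
import numpy as np
from scipy.stats import norm, rankdata

def fcros(x1, x2):
    """x1: (n_genes, n1) condition-1 matrix; x2: (n_genes, n2).

    Returns two-sided p-values; extreme mean ranks indicate DE genes.
    """
    n_genes = x1.shape[0]
    ranks = [
        rankdata(x2[:, j] / x1[:, i])      # rank genes by pairwise FC
        for i in range(x1.shape[1])
        for j in range(x2.shape[1])
    ]
    r = np.mean(ranks, axis=0) / n_genes   # rescaled mean rank per gene
    z = (r - r.mean()) / r.std()           # normal approximation
    return 2.0 * norm.sf(np.abs(z))
```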
We compared the performance of FCROS with those of other methods using synthetic and real microarray datasets. We found that FCROS is well suited for DE gene identification from noisy datasets when compared with existing FC-based methods.
Differentially expressed genes; Fold change; Averages of ranks; Microarray
Given the estimate that 30% of our genes are controlled by microRNAs, it is essential that we understand the precise relationship between microRNAs and their targets. OncomiRs are microRNAs (miRNAs) that have frequently been shown to be deregulated in cancer. However, although several oncomiRs have been identified and characterized, there is as yet no comprehensive compilation of these data, which has left them underutilized by cancer biologists. There is therefore an unmet need for bioinformatic platforms to speed the identification of novel therapeutic targets.
We describe here OncomiRdbB, a comprehensive database of oncomiRs for human and mouse, mined from different existing databases, along with novel oncomiRs that we have validated in human breast cancer samples. The database also lists their respective predicted targets, identified using miRanda, along with their IDs, sequences, chromosome locations, and detailed descriptions. The database supports querying by search strings including microRNA name, sequence, accession number, target genes, and organism. The microRNA networks and their hubs, with their respective targets in the 3'UTRs, 5'UTRs, and exons of different pathway genes, were also deciphered using the 'R' algorithm.
OncomiRdbB is a comprehensive and integrated database of oncomiRs and their targets in breast cancer, with multiple query options that will help enhance both our understanding of the biology of breast cancer and the development of new and innovative microRNA-based diagnostic tools and targets of therapeutic significance. OncomiRdbB is freely available for download at http://tdb.ccmb.res.in/OncomiRdbB/index.htm.
MicroRNAs; Breast cancer; Targets; 3'UTR; miRanda; TLDA
Interpretation of the binding modes of protein–small ligand complexes from 3D structure data is essential for understanding selective ligand recognition by proteins. It is often performed by visual inspection and sometimes depends largely on a priori knowledge about typical interactions such as hydrogen bonds and π-π stacking. Because this can introduce biases stemming from scientists’ subjective perspectives, more objective viewpoints that consider a wide range of interactions are required.
In this paper, we present a web server for analyzing protein–small ligand interactions on the basis of patterns of atomic contacts, or “interaction patterns”, obtained from the statistical analyses of 3D structures of protein–ligand complexes in our previous study. The server guides visual inspection by providing information about the interaction pattern of each atomic contact in a 3D structure. Users can visually investigate which atomic contacts in user-specified 3D structures of protein–small ligand complexes are statistically overrepresented. The server consists of two main components: “Complex Analyzer” and “Pattern Viewer”. The former provides a 3D structure viewer with annotations of interacting amino acid residues, ligand atoms, and interacting pairs of these; in the annotations of interacting pairs, each contact’s assignment to an interaction pattern and the statistical preferences of the patterns are presented. The latter provides details of each interaction pattern: users can see visual representations of the probability density functions of interactions and a list of protein–ligand complexes showing similar interactions.
Users can interactively analyze protein–small ligand binding modes with statistically determined interaction patterns, rather than relying on their own a priori knowledge, by using our new web server GIANT, which is freely available at http://giant.hgc.jp/.
Molecular recognition; Ligand binding site; Protein–ligand interactions; Protein structure; Protein function; Pattern recognition; Database; Web-server
Gene set analysis (GSA) is useful for deducing the biological significance of gene lists using a priori defined gene sets such as gene ontology (GO) terms or pathways. Phenotypic annotation is sparse for human genes, but far more abundant for model organisms such as mouse, fly, and worm. Often, GSA needs to be done highly interactively, by combining or modifying gene lists or by inspecting gene-gene interactions in a molecular network.
We developed gsGator, a web-based platform for functional interpretation of gene sets with useful features such as cross-species GSA, simultaneous analysis of multiple gene sets, and a fully integrated network viewer for visualizing both GSA results and molecular networks. An extensive set of gene annotation information has been amassed, including GO terms and pathways, genomic annotations, protein-protein interactions, transcription factor-target (TF-target) relationships, miRNA targeting, and phenotype information for various model organisms. By combining the functionalities of Set Creator, Set Operator, and Network Navigator, users can perform highly flexible and interactive GSA: they can create a new gene list from any combination of existing gene sets (intersection, union, and difference) or interactively expand gene sets along molecular networks such as protein-protein interactions and TF-target relationships. We also demonstrate the utility of the interactive and cross-species GSA implemented in gsGator through several usage examples interpreting genome-wide association study (GWAS) results. gsGator is freely available at http://gsGator.ewha.ac.kr.
Interactive and cross-species GSA in gsGator greatly extends the scope and utility of GSA, leading to novel insights via conserved functional gene modules across different species.
Gene set analysis; Cross-species GSA; Interactive GSA; Gene function; Omics
The Kruskal-Wallis test is a popular non-parametric statistical test for identifying expression quantitative trait loci (eQTLs) from genome-wide data because it is robust against variations in the underlying genetic model and expression trait distribution. However, testing billions of marker-trait combinations one by one can become computationally prohibitive.
We developed kruX, an algorithm implemented in Matlab, Python, and R that uses matrix multiplications to calculate the Kruskal-Wallis test statistic for several million marker-trait combinations simultaneously. kruX is more than ten thousand times faster than computing associations one by one on a typical human dataset. We used kruX and a dataset of more than 500k SNPs and 20k expression traits measured in 102 human blood samples to compare the eQTLs detected by the Kruskal-Wallis test to those detected by the parametric ANOVA and linear model methods. We found that the Kruskal-Wallis test is more robust against data outliers and heterogeneous genotype group sizes and detects a higher proportion of non-linear associations, but is more conservative in calling additive linear associations.
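The matrix trick can be stated in a few lines of NumPy: with rank-transformed expression R (traits x samples) and a 0/1 genotype indicator matrix per genotype class, the rank sums for all marker-trait pairs at once are the products R @ I_g.T, from which the Kruskal-Wallis statistic H = 12/(N(N+1)) * sum_g R_g^2/n_g - 3(N+1) follows directly. This sketch omits kruX's tie correction and missing-data handling.

```python
import numpy as np
from scipy.stats import rankdata

def kruskal_wallis_matrix(expr, geno):
    """expr: (n_traits, N) expression; geno: (n_markers, N) in {0,1,2}.

    Returns the (n_traits, n_markers) matrix of KW statistics.
    """
    N = expr.shape[1]
    R = np.apply_along_axis(rankdata, 1, expr)   # rank each trait
    H = np.zeros((expr.shape[0], geno.shape[0]))
    for g in (0, 1, 2):
        I = (geno == g).astype(float)            # indicator per marker
        n = I.sum(axis=1)                        # group sizes
        S = R @ I.T                              # rank sums, all pairs
        with np.errstate(divide="ignore", invalid="ignore"):
            H += np.where(n > 0, S**2 / n, 0.0)
    return 12.0 / (N * (N + 1)) * H - 3.0 * (N + 1)
```

The statistic is then compared against a chi-square distribution with (number of non-empty genotype groups - 1) degrees of freedom.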
kruX enables the use of robust non-parametric methods for massive eQTL mapping without the need for a high-performance computing infrastructure and is freely available from
eQTL; Non-parametric methods; Matrix algebra; Software
New sequencing technologies make it possible to scan very long and dense genetic sequences, yielding datasets of genetic markers an order of magnitude larger than previously available. Such genetic sequences are characterized by common alleles interspersed with multiple rarer alleles. This situation has renewed interest in the identification of haplotypes carrying the rare risk alleles. However, large-scale exploration of the linkage-disequilibrium (LD) pattern to identify haplotype blocks is not easy to perform, because traditional algorithms have at least Θ(n²) time and memory complexity.
We derived three incremental optimizations of the widely used haplotype block recognition algorithm proposed by Gabriel et al. in 2002. Our most efficient solution, called MIG++, has only Θ(n) memory complexity and, on a genome-wide scale, omits >80% of the calculations, which makes it an order of magnitude faster than the original algorithm. Unlike existing software, MIG++ analyzes the LD between SNPs at any distance, avoiding restrictions on the maximal block length. The haplotype block partition of the entire HapMap II CEPH dataset was obtained in 457 hours. Replacing the standard likelihood-based D′ variance estimator with an approximate estimator further improved the runtime. While producing a coarser partition, the approximate method allowed us to obtain the full-genome haplotype block partition of the entire 1000 Genomes Project CEPH dataset in 44 hours, with no restrictions on allele frequency or long-range correlations. These experiments showed that LD-based haplotype blocks can span more than one million base pairs in both the HapMap II and 1000 Genomes datasets. An application to the North American Rheumatoid Arthritis Consortium (NARAC) dataset shows how MIG++ can support genome-wide haplotype association studies.
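The pairwise quantity underlying Gabriel-style block finding is |D′| between two biallelic markers; MIG++'s contribution lies in how rarely such pairwise terms (and the variance of D′) need to be evaluated, not in the formula itself, which for known haplotype frequencies reads:

```python
def d_prime(p_ab, p_a, p_b):
    """Normalized LD coefficient D' from haplotype frequency p_ab
    and allele frequencies p_a, p_b of two biallelic markers."""
    d = p_ab - p_a * p_b
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    return d / d_max if d_max > 0 else 0.0

print(d_prime(0.5, 0.5, 0.5))  # 1.0: the toy markers are in complete LD
```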
MIG++ enables LD-based haplotype block recognition on genetic sequences of any length and density. In the next-generation sequencing era, this can help identify haplotypes that carry rare variants of interest. The low computational requirements open the possibility of including the haplotype block structure in genome-wide association scans, downstream analyses, and visual interfaces of online genome browsers.