High-throughput biological experiments can produce a large amount of data showing little overlap with current knowledge. This may be a problem when evaluating alternative scoring mechanisms for such data according to a gold standard dataset because standard statistical tests may not be appropriate.
To address this problem we have implemented the QiSampler tool that uses a repetitive sampling strategy to evaluate several scoring schemes or experimental parameters for any type of high-throughput data given a gold standard. We provide two example applications of the tool: selection of the best scoring scheme for a high-throughput protein-protein interaction dataset by comparison to a dataset derived from the literature, and evaluation of functional enrichment in a set of tumour-related differentially expressed genes from a thyroid microarray dataset.
QiSampler is implemented as an open source R script and a web server, which can be accessed at http://cbdm.mdc-berlin.de/tools/sampler/.
Summary: The increasing availability of large network datasets along with the progresses in experimental high-throughput technologies have prompted the need for tools allowing easy integration of experimental data with data derived form network computational analysis. In order to enrich experimental data with network topological parameters, we have developed the Cytoscape plug-in CentiScaPe. The plug-in computes several network centrality parameters and allows the user to analyze existing relationships between experimental data provided by the users and node centrality values computed by the plug-in. CentiScaPe allows identifying network nodes that are relevant from both experimental and topological viewpoints. CentiScaPe also provides a Boolean logic-based tool that allows easy characterization of nodes whose topological relevance depends on more than one centrality. Finally, different graphic outputs and the included description of biological significance for each computed centrality facilitate the analysis by the end users not expert in graph theory, thus allowing easy node categorization and experimental prioritization.
Availability: CentiScaPe can be downloaded via the Cytoscape web site: http://chianti.ucsd.edu/cyto_web/plugins/index.php. Tutorial, centrality descriptions and example data are available at: http://profs.sci.univr.it/∼scardoni/centiscape/centiscapepage.php
Supplementary information: Supplementary data are available at Bioinformatics online.
Various strategies employed in genomic DNA cloning by subtractive hybridization have been examined by computer simulations, with the comparison between the predictions and the published results. The result shows that the efficiency of target sequence enrichment and the sensitivity to experimental conditions depend strongly on the enrichment strategy employed. The strategy selecting only tester/tester after hybridization can be very efficient to enrich targets. For successful target enrichment, however, the strategy requires a highly efficient subtraction method and proper hybridization conditions. The strategy also requires that the selected DNA be amplified by polymerase chain reaction (PCR) after each or each alternate subtraction. By contrast, the strategy selecting tester/tester plus single-stranded tester is less sensitive to various experimental factors, compared with the strategy selecting only tester/tester. However, it is not as efficient. With this strategy, the tester DNA selected may or may not be amplified by PCR before the next round. In the case of the strategy selecting single-stranded tester, the target DNA can be successfully enriched only when the selected DNA is directly used without PCR amplification in the next round. The strong features of existing methods can be combined to develop a protocol that is more efficient and more reliable.
The reliable recognition of eukaryotic RNA polymerase II core promoters, and the associated transcription start sites (TSSs) of genes, has been an ongoing challenge for computational biology. High throughput experimental methods such as tiling arrays or 5′ SAGE/EST sequencing have recently lead to much larger datasets of core promoters, and to the assessment that the well-known core promoter sequence elements such as the TATA box appear to be much less frequent than thought. Here, we address the co-occurrence of several previously identified core promoter sequence motifs in Drosophila melanogaster to determine frequently occurring core promoter modules. We then use this in a new strategy to model core promoters as a set of alternative submodels for different core promoter architectures reflecting these different motif modules. We show that this system improves greatly on computational promoter recognition and leads to highly accurate in silico TSS prediction. Our results indicate that at least for the case of the fruit fly, we are getting closer to an understanding of how the beginning of a gene is defined in a eukaryotic genome.
Although nearly half of the human genome is comprised of repetitive sequences, the expression profile of these elements remains largely uncharacterized. Recently developed high throughput sequencing technologies provide us with a powerful new set of tools to study repeat elements. Hence, we performed whole transcriptome sequencing to investigate the expression of repetitive elements in human frontal cortex using postmortem tissue obtained from the Stanley Medical Research Institute.
We found a significant amount of reads from the human frontal cortex originate from repeat elements. We also noticed that Alu elements were expressed at levels higher than expected by random or background transcription. In contrast, L1 elements were expressed at lower than expected amounts.
Repetitive elements are expressed abundantly in the human brain. This expression pattern appears to be element specific and can not be explained by random or background transcription. These results demonstrate that our knowledge about repetitive elements is far from complete. Further characterization is required to determine the mechanism, the control, and the effects of repeat element expression.
High-throughput studies of protein interactions may have produced, experimentally and computationally, the most comprehensive protein–protein interaction datasets in the completely sequenced genomes. It provides us an opportunity on a proteome scale, to discover the underlying protein interaction patterns. Here, we propose an approach to discovering motif pairs at interaction sites (often 3–8 residues) that are essential for understanding protein functions and helpful for the rational design of protein engineering and folding experiments. A gold standard positive (interacting) dataset and a gold standard negative (non-interacting) dataset were mined to infer the interacting motif pairs that are significantly overrepresented in the positive dataset compared to the negative dataset. Four negative datasets assembled by different strategies were evaluated and the one with the best performance was used as the gold standard negatives for further analysis. Meanwhile, to assess the efficiency of our method in detecting potential interacting motif pairs, other approaches developed previously were compared, and we found that our method achieved the highest prediction accuracy. In addition, many uncharacterized motif pairs of interest were found to be functional with experimental evidence in other species. This investigation demonstrates the important effects of a high-quality negative dataset on the performance of such statistical inference.
Motivation: The availability of genome-scale data has enabled an abundance of novel analysis techniques for investigating a variety of systems-level biological relationships. As thousands of such datasets become available, they provide an opportunity to study high-level associations between cellular pathways and processes. This also allows the exploration of shared functional enrichments between diverse biological datasets, and it serves to direct experimenters to areas of low data coverage or with high probability of new discoveries.
Results: We analyze the functional structure of Saccharomyces cerevisiae datasets from over 950 publications in the context of over 140 biological processes. This includes a coverage analysis of biological processes given current high-throughput data, a data-driven map of associations between processes, and a measure of similar functional activity between genome-scale datasets. This uncovers subtle gene expression similarities in three otherwise disparate microarray datasets due to a shared strain background. We also provide several means of predicting areas of yeast biology likely to benefit from additional high-throughput experimental screens.
Availability: Predictions are provided in supplementary tables; software and additional data are available from the authors by request.
Supplementary information: Supplementary data are available at Bioinformatics online.
Finding functional regulatory elements in DNA sequences is a very important problem in computational biology and providing a reliable algorithm for this task would be a major step towards understanding regulatory mechanisms on genome-wide scale. Major obstacles in this respect are that the fact that the amount of non-coding DNA is vast, and that the methods for predicting functional transcription factor binding sites tend to produce results with a high percentage of false positives. This makes the problem of finding regions significantly enriched in binding sites difficult.
We develop a novel method for predicting regulatory regions in DNA sequences, which is designed to exploit the evolutionary conservation of regulatory elements between species without assuming that the order of motifs is preserved across species. We have implemented our method and tested its predictive abilities on various datasets from different organisms.
We show that our approach enables us to find a majority of the known CRMs using only sequence information from different species together with currently publicly available motif data. Also, our method is robust enough to perform well in predicting CRMs, despite differences in tissue specificity and even across species, provided that the evolutionary distances between compared species do not change substantially. The complexity of the proposed algorithm is polynomial, and the observed running times show that it may be readily applied.
Chromatin immunoprecipitation followed by tag sequencing (ChIP-Seq) using high-throughput next-generation instrumentation is replacing ChIP-chip for mapping of sites of transcription-factor binding and chromatin modification. To develop a scoring approach for this new technique, we produce two deeply sequenced datasets for human RNA polymerase II and STAT1 with matching input-DNA controls. In these, we observe that signal peaks corresponding to sites of potential binding are strongly correlated with peaks in the control, likely revealing features of open chromatin. Based on these observations, we develop a two-pass approach for scoring ChIP-Seq relative to controls. The first pass identifies putative binding sites and compensates for genomic variation in the mappability of sequences. The second pass filters sites not significantly enriched compared to the normalized control, computing precise enrichments and significances. Using our scoring we investigate optimal experimental design – i.e. depth of sequencing and value of replicas (showing marginal information gain beyond two).
Motivation: The highly coordinated expression of thousands of genes in an organism is regulated by the concerted action of transcription factors, chromatin proteins and epigenetic mechanisms. High-throughput experimental data for genome wide in vivo protein–DNA interactions and epigenetic marks are becoming available from large projects, such as the model organism ENCyclopedia Of DNA Elements (modENCODE) and from individual labs. Dissemination and visualization of these datasets in an explorable form is an important challenge.
Results: To support research on Drosophila melanogaster transcription regulation and make the genome wide in vivo protein–DNA interactions data available to the scientific community as a whole, we have developed a system called Flynet. Currently, Flynet contains 101 datasets for 38 transcription factors and chromatin regulator proteins in different experimental conditions. These factors exhibit different types of binding profiles ranging from sharp localized peaks to broad binding regions. The protein–DNA interaction data in Flynet was obtained from the analysis of chromatin immunoprecipitation experiments on one color and two color genomic tiling arrays as well as chromatin immunoprecipitation followed by massively parallel sequencing. A web-based interface, integrated with an AJAX based genome browser, has been built for queries and presenting analysis results. Flynet also makes available the cis-regulatory modules reported in literature, known and de novo identified sequence motifs across the genome, and other resources to study gene regulation.
Availability: Flynet is available at https://www.cistrack.org/flynet/.
Supplementary information: Supplementary data are available at Bioinformatics online.
The high-throughput annotation of open reading frames (ORFs) required by modern genome sequencing projects necessitates computational protocols that sometimes annotate orthologous ORFs inconsistently. Such inconsistencies hinder comparative analyses by non-uniformly extending or truncating 5′ and/or 3′ sequence ends, causing ORFs that are in fact identical to artificially diverge. Whereas strategies exist to correct such inconsistencies during whole-genome annotation, equivalent software designed to correct subsets of these data without genome reannotation is lacking. We therefore developed ORFcor, which corrects annotation inconsistencies using consensus start and stop positions derived from sets of closely related orthologs. ORFcor corrects inconsistent ORF annotations in diverse test datasets with specificities and sensitivities approaching 100% when sufficiently related orthologs (e.g., from the same taxonomic family) are available for comparison. The ORFcor package is implemented in Perl, multithreaded to handle large datasets, includes related scripts to facilitate high-throughput phylogenomic analyses, and is freely available at www.currielab.wisc.edu/downloads.html.
Allopolyploidy (interspecific hybridisation and polyploidy) has played a significant role in the evolutionary history of angiosperms and can result in genomic, epigenetic and transcriptomic perturbations. We examine the immediate effects of allopolyploidy on repetitive DNA by comparing the genomes of synthetic and natural Nicotiana tabacum with diploid progenitors N. tomentosiformis (paternal progenitor) and N. sylvestris (maternal progenitor). Using next generation sequencing, a recently developed graph-based repeat identification pipeline, Southern blot and fluorescence in situ hybridisation (FISH) we characterise two highly repetitive DNA sequences (NicCL3 and NicCL7/30). Analysis of two independent high-throughput DNA sequencing datasets indicates NicCL3 forms 1.6–1.9% of the genome in N. tomentosiformis, sequences that occur in multiple, discontinuous tandem arrays scattered over several chromosomes. Abundance estimates, based on sequencing depth, indicate NicCL3 is almost absent in N. sylvestris and has been dramatically reduced in copy number in the allopolyploid N. tabacum. Surprisingly elimination of NicCL3 is repeated in some synthetic lines of N. tabacum in their forth generation. The retroelement NicCL7/30, which occurs interspersed with NicCL3, is also under-represented but to a much lesser degree, revealing targeted elimination of the latter. Analysis of paired-end sequencing data indicates the tandem component of NicCL3 has been preferentially removed in natural N. tabacum, increasing the proportion of the dispersed component. This occurs across multiple blocks of discontinuous repeats and based on the distribution of nucleotide similarity among NicCL3 units, was concurrent with rounds of sequence homogenisation.
Identification and annotation of all the functional elements in the genome, including genes and the regulatory sequences, is a fundamental challenge in genomics and computational biology. Since regulatory elements are frequently short and variable, their identification and discovery using computational algorithms is difficult. However, significant advances have been made in the computational methods for modeling and detection of DNA regulatory elements. The availability of complete genome sequence from multiple organisms, as well as mRNA profiling and high-throughput experimental methods for mapping protein-binding sites in DNA, have contributed to the development of methods that utilize these auxiliary data to inform the detection of transcriptional regulatory elements. Progress is also being made in the identification of cis-regulatory modules and higher order structures of the regulatory sequences, which is essential to the understanding of transcription regulation in the metazoan genomes. This article reviews the computational approaches for modeling and identification of genomic regulatory elements, with an emphasis on the recent developments, and current challenges.
The field of regulatory genomics today is characterized by the generation of high-throughput data sets that capture genome-wide transcription factor (TF) binding, histone modifications, or DNAseI hypersensitive regions across many cell types and conditions. In this context, a critical question is how to make optimal use of these publicly available datasets when studying transcriptional regulation. Here, we address this question in Drosophila melanogaster for which a large number of high-throughput regulatory datasets are available. We developed i-cisTarget (where the ‘i’ stands for integrative), for the first time enabling the discovery of different types of enriched ‘regulatory features’ in a set of co-regulated sequences in one analysis, being either TF motifs or ‘in vivo’ chromatin features, or combinations thereof. We have validated our approach on 15 co-expressed gene sets, 21 ChIP data sets, 628 curated gene sets and multiple individual case studies, and show that meaningful regulatory features can be confidently discovered; that bona fide enhancers can be identified, both by in vivo events and by TF motifs; and that combinations of in vivo events and TF motifs further increase the performance of enhancer prediction.
Random community genomes (metagenomes) are now commonly used to study microbes in different environments. Over the past few years, the major challenge associated with metagenomics shifted from generating to analyzing sequences. High-throughput, low-cost next-generation sequencing has provided access to metagenomics to a wide range of researchers.
A high-throughput pipeline has been constructed to provide high-performance computing to all researchers interested in using metagenomics. The pipeline produces automated functional assignments of sequences in the metagenome by comparing both protein and nucleotide databases. Phylogenetic and functional summaries of the metagenomes are generated, and tools for comparative metagenomics are incorporated into the standard views. User access is controlled to ensure data privacy, but the collaborative environment underpinning the service provides a framework for sharing datasets between multiple users. In the metagenomics RAST, all users retain full control of their data, and everything is available for download in a variety of formats.
The open-source metagenomics RAST service provides a new paradigm for the annotation and analysis of metagenomes. With built-in support for multiple data sources and a back end that houses abstract data types, the metagenomics RAST is stable, extensible, and freely available to all researchers. This service has removed one of the primary bottlenecks in metagenome sequence analysis – the availability of high-performance computing for annotating the data.
Despite the ever-increasing throughput and steadily decreasing cost of next
generation sequencing (NGS), whole genome sequencing of humans is still not a
viable option for the majority of genetics laboratories. This is particularly
true in the case of complex disease studies, where large sample sets are often
required to achieve adequate statistical power. To fully leverage the potential
of NGS technology on large sample sets, several methods have been developed to
selectively enrich for regions of interest. Enrichment reduces both monetary and
computational costs compared to whole genome sequencing, while allowing
researchers to take advantage of NGS throughput. Several targeted enrichment
approaches are currently available, including molecular inversion probe ligation
sequencing (MIPS), oligonucleotide hybridization based approaches, and PCR-based
strategies. To assess how these methods performed when used in conjunction with
the ABI SOLID3+, we investigated three enrichment techniques: Nimblegen
oligonucleotide hybridization array-based capture; Agilent SureSelect
oligonucleotide hybridization solution-based capture; and Raindance
Technologies' multiplexed PCR-based approach. Target regions were selected
from exons and evolutionarily conserved areas throughout the human genome. Probe
and primer pair design was carried out for all three methods using their
respective informatics pipelines. In all, approximately 0.8 Mb of target space
was identical for all 3 methods. SOLiD sequencing results were analyzed for
several metrics, including consistency of coverage depth across samples,
on-target versus off-target efficiency, allelic bias, and genotype concordance
with array-based genotyping data. Agilent SureSelect exhibited superior
on-target efficiency and correlation of read depths across samples. Nimblegen
performance was similar at read depths at 20× and below. Both Raindance
and Nimblegen SeqCap exhibited tighter distributions of read depth around the
mean, but both suffered from lower on-target efficiency in our experiments.
Raindance demonstrated the highest versatility in assay design.
Summary: Measuring the consequences of mutation in proteins is critical to understanding their function. These measurements are essential in such applications as protein engineering, drug development, protein design and genome sequence analysis. Recently, high-throughput sequencing has been coupled to assays of protein activity, enabling the analysis of large numbers of mutations in parallel. We present Enrich, a tool for analyzing such deep mutational scanning data. Enrich identifies all unique variants (mutants) of a protein in high-throughput sequencing datasets and can correct for sequencing errors using overlapping paired-end reads. Enrich uses the frequency of each variant before and after selection to calculate an enrichment ratio, which is used to estimate fitness. Enrich provides an interactive interface to guide users. It generates user-accessible output for downstream analyses as well as several visualizations of the effects of mutation on function, thereby allowing the user to rapidly quantify and comprehend sequence–function relationships.
Availability and Implementation: Enrich is implemented in Python and is available under a FreeBSD license at http://depts.washington.edu/sfields/software/enrich/. Enrich includes detailed documentation as well as a small example dataset.
Contact: firstname.lastname@example.org; email@example.com
Supplementary Information: Supplementary data is available at Bioinformatics online.
DNA methylation profiling reveals important differentially methylated regions (DMRs) of the genome that are altered during development or that are perturbed by disease. To date, few programs exist for regional analysis of enriched or whole-genome bisulfate conversion sequencing data, even though such data are increasingly common. Here, we describe an open-source, optimized method for determining empirically based DMRs (eDMR) from high-throughput sequence data that is applicable to enriched whole-genome methylation profiling datasets, as well as other globally enriched epigenetic modification data.
Here we show that our bimodal distribution model and weighted cost function for optimized regional methylation analysis provides accurate boundaries of regions harboring significant epigenetic modifications. Our algorithm takes the spatial distribution of CpGs into account for the enrichment assay, allowing for optimization of the definition of empirical regions for differential methylation. Combined with the dependent adjustment for regional p-value combination and DMR annotation, we provide a method that may be applied to a variety of datasets for rapid DMR analysis. Our method classifies both the directionality of DMRs and their genome-wide distribution, and we have observed that shows clinical relevance through correct stratification of two Acute Myeloid Leukemia (AML) tumor sub-types.
Our weighted optimization algorithm eDMR for calling DMRs extends an established DMR R pipeline (methylKit) and provides a needed resource in epigenomics. Our method enables an accurate and scalable way of finding DMRs in high-throughput methylation sequencing experiments. eDMR is available for download at http://code.google.com/p/edmr/.
Predicting protein complexes from protein-protein interaction data is becoming a fundamental problem in computational biology. The identification and characterization of protein complexes implicated are crucial to the understanding of the molecular events under normal and abnormal physiological conditions. On the other hand, large datasets of experimentally detected protein-protein interactions were determined using High-throughput experimental techniques. However, experimental data is usually liable to contain a large number of spurious interactions. Therefore, it is essential to validate these interactions before exploiting them to predict protein complexes.
In this paper, we propose a novel graph mining algorithm (PEWCC) to identify such protein complexes. Firstly, the algorithm assesses the reliability of the interaction data, then predicts protein complexes based on the concept of weighted clustering coefficient. To demonstrate the effectiveness of the proposed method, the performance of PEWCC was compared to several methods. PEWCC was able to detect more matched complexes than any of the state-of-the-art methods with higher quality scores.
The higher accuracy achieved by PEWCC in detecting protein complexes is a valid argument in favor of the proposed method. The datasets and programs are freely available at
High-throughput sequencing of eukaryotic genomes has revived interest in the structure and function of repetitive genomic sequences, previously referred to as junk DNA. Repetitive sequences, including transposable elements, are now believed to play a significant role in genomic differentiation and evolution. Some are also expressed as regulatory noncoding RNAs. Vast DNA databases exist for higher eukaryotes; however, with the exception of homologues of known repetitive-sequence-families and transposable elements, most repetitive elements still need to be annotated. Revolver and Superior, both discovered in the Triticeae, are novel classes of transposon-like genes and major components of large cereal genomes. Revolver was isolated from rye via genome subtraction of sequences common to rye and wheat. Superior was isolated from rye by cleavage with EcoO109I, the recognition sites of which consist of a 5′- PuGGNCCPy-3′ multi-sequence. Revolver is 2929–3041 bp long with an inverted repeat sequence on each end. The Superior family elements are 1292–1432 bp in length, with divergent 5′ regions, indicating the presence of considerable structural diversity. Revolver and Superior are transcriptionally active elements; Revolver harbors a single gene consisting of three exons and two introns, encoding a protein of 139 amino acid residues. Revolver variants range in size from 2665 bp to 4269 bp, with some variants lacking the 5′ region, indicating structural diversity around the first exon. Revolver and Superior are dispersed across all seven chromosomes of rye. Revolver has existed since the diploid progenitor of wheat, and has been amplified or lost in several species during the evolution of the Triticeae. This article reviews the recently discovered Revolver and Superior families of plant transposons, which do not share identity with any known autonomous transposable elements or repetitive elements from any living species.
Transposon; gene family; Revolver; Superior; structural diversity; quantitative change; evolution; plant kingdom.
High-throughput microarray technologies have generated and accumulated massive amounts of gene expression datasets that contain expression levels of thousands of genes under hundreds of different experimental conditions. The microarray datasets are usually presented in 2D matrices, where rows represent genes and columns represent experimental conditions. The analysis of such datasets can discover local structures composed by sets of genes that show coherent expression patterns under subsets of experimental conditions. It leads to the development of sophisticated algorithms capable of extracting novel and useful knowledge from a biomedical point of view. In the medical domain, these patterns are useful for understanding various diseases, and aid in more accurate diagnosis, prognosis, treatment planning, as well as drug discovery.
In this work we present the CMOPSOB (Crowding distance based Multi-objective Particle Swarm Optimization Biclustering), a novel clustering approach for microarray datasets to cluster genes and conditions highly related in sub-portions of the microarray data. The objective of biclustering is to find sub-matrices, i.e. maximal subgroups of genes and subgroups of conditions where the genes exhibit highly correlated activities over a subset of conditions. Since these objectives are mutually conflicting, they become suitable candidates for multi-objective modelling. Our approach CMOPSOB is based on a heuristic search technique, multi-objective particle swarm optimization, which simulates the movements of a flock of birds which aim to find food. In the meantime, the nearest neighbour search strategies based on crowding distance and ϵ-dominance can rapidly converge to the Pareto front and guarantee diversity of solutions. We compare the potential of this methodology with other biclustering algorithms by analyzing two common and public datasets of gene expression profiles. In all cases our method can find localized structures related to sets of genes that show consistent expression patterns across subsets of experimental conditions. The mined patterns present a significant biological relevance in terms of related biological processes, components and molecular functions in a species-independent manner.
The proposed CMOPSOB algorithm is successfully applied to biclustering of microarray dataset. It achieves a good diversity in the obtained Pareto front, and rapid convergence. Therefore, it is a useful tool to analyze large microarray datasets.
Next-generation sequencing technologies have given rise to the explosive increase in DNA sequencing throughput, and have promoted the recent development of de novo short read assemblers. However, existing assemblers require high execution times and a large amount of compute resources to assemble large genomes from quantities of short reads.
We present PASHA, a parallelized short read assembler using de Bruijn graphs, which takes advantage of hybrid computing architectures consisting of both shared-memory multi-core CPUs and distributed-memory compute clusters to gain efficiency and scalability. Evaluation using three small-scale real paired-end datasets shows that PASHA is able to produce more contiguous high-quality assemblies in shorter time compared to three leading assemblers: Velvet, ABySS and SOAPdenovo. PASHA's scalability for large genome datasets is demonstrated with human genome assembly. Compared to ABySS, PASHA achieves competitive assembly quality with faster execution speed on the same compute resources, yielding an NG50 contig size of 503 with the longest correct contig size of 18,252, and an NG50 scaffold size of 2,294. Moreover, the human assembly is completed in about 21 hours with only modest compute resources.
Developing parallel assemblers for large genomes has been garnering significant research efforts due to the explosive size growth of high-throughput short read datasets. By employing hybrid parallelism consisting of multi-threading on multi-core CPUs and message passing on compute clusters, PASHA is able to assemble the human genome with high quality and in reasonable time using modest compute resources.
Identifying similarities between patterns of differential gene expression provides an opportunity to identify similarities between the experimental and biological conditions that give rise to these gene expression alterations. The growing volume of gene expression data in open data repositories such as the NCBI Gene Expression Omnibus (GEO) presents an opportunity to identify these gene expression similarities on a large scale across a diverse collection of datasets. We have developed a fast, pattern-based computational approach, named openSESAME (Search of Expression Signatures Across Many Experiments), that identifies datasets enriched in samples that display coordinate differential expression of a query signature. Importantly, openSESAME performs this search without prior knowledge of the phenotypic or experimental groups in the datasets being searched. This allows openSESAME to identify perturbations of gene expression that are due to phenotypic attributes that may not have been described in the sample annotation included in the repository.
To demonstrate the utility of openSESAME, we used gene expression signatures of two biological perturbations to query a set of 75,164 human expression profiles that were generated using Affymetrix microarrays and deposited in GEO. The first query, using a signature of estradiol treatment, identified experiments in which estrogen signaling was perturbed and also identified differences in estrogen signaling between estrogen receptor-positive and -negative breast cancers. The second query, which used a signature of silencing of the transcription factor p63 (a key regulator of epidermal differentiation), identified datasets related to stratified squamous epithelia or epidermal diseases such as melanoma.
openSESAME is a tool for leveraging the growing body of publicly available microarray data to discover relationships between different biological states based on common patterns of differential gene expression. These relationships may serve to generate hypotheses about the causes and consequences of specific patterns of observed differential gene expression. To encourage others to explore the utility of this approach, we have made a website for performing openSESAME queries freely available at http://opensesame.bu.edu.
The horse genome is sequenced, allowing equine researchers to use high-throughput functional genomics platforms such as microarrays; next-generation sequencing for gene expression and proteomics. However, for researchers to derive value from these functional genomics datasets, they must be able to model this data in biologically relevant ways; to do so requires that the equine genome be more fully annotated. There are two interrelated types of genomic annotation: structural and functional. Structural annotation is delineating and demarcating the genomic elements (such as genes, promoters, and regulatory elements). Functional annotation is assigning function to structural elements. The Gene Ontology (GO) is the de facto standard for functional annotation, and is routinely used as a basis for modelling and hypothesis testing, large functional genomics datasets.
An Equine Whole Genome Oligonucleotide (EWGO) array with 21,351 elements was developed at Texas A&M University. This 70-mer oligoarray was designed using the approximately 7× assembled and annotated sequence of the equine genome to be one of the most comprehensive arrays available for expressed equine sequences. To assist researchers in determining the biological meaning of data derived from this array, we have structurally annotated it by mapping the elements to multiple database accessions, including UniProtKB, Entrez Gene, NRPD (Non-Redundant Protein Database) and UniGene. We next provided GO functional annotations for the gene transcripts represented on this array. Overall, we GO annotated 14,531 gene products (68.1% of the gene products represented on the EWGO array) with 57,912 annotations. GAQ (GO Annotation Quality) scores were calculated for this array both before and after we added GO annotation. The additional annotations improved the meanGAQ score 16-fold. This data is publicly available at AgBase http://www.agbase.msstate.edu/.
Providing additional information about the public databases which link to the gene products represented on the array allows users more flexibility when using gene expression modelling and hypothesis-testing computational tools. Moreover, since different databases provide different types of information, users have access to multiple data sources. In addition, our GO annotation underpins functional modelling for most gene expression analysis tools and enables equine researchers to model large lists of differentially expressed transcripts in biologically relevant ways.
HIV proteins target host hub proteins for transient binding interactions. The presence of viral proteins in the infected cell results in out-competition of host proteins in their interaction with hub proteins, drastically affecting cell physiology. Functional genomics and interactome datasets can be used to quantify the sequence hotspots on the HIV proteome mediating interactions with host hub proteins. In this study, we used the HIV and human interactome databases to identify HIV targeted host hub proteins and their host binding partners (H2). We developed a high throughput computational procedure utilizing motif discovery algorithms on sets of protein sequences, including sequences of HIV and H2 proteins. We identified as HIV sequence hotspots those linear motifs that are highly conserved on HIV sequences and at the same time have a statistically enriched presence on the sequences of H2 proteins. The HIV protein motifs discovered in this study are expressed by subsets of H2 host proteins potentially outcompeted by HIV proteins. A large subset of these motifs is involved in cleavage, nuclear localization, phosphorylation, and transcription factor binding events. Many such motifs are clustered on an HIV sequence in the form of hotspots. The sequential positions of these hotspots are consistent with the curated literature on phenotype altering residue mutations, as well as with existing binding site data. The hotspot map produced in this study is the first global portrayal of HIV motifs involved in altering the host protein network at highly connected hub nodes.