Recent studies have demonstrated that gene set analysis, which tests disease association with genetic variants in a group of functionally related genes, is a promising approach for analyzing and interpreting genome-wide association studies (GWAS) data. These approaches aim to increase power by combining association signals from multiple genes in the same gene set. In addition, gene set analysis can also shed more light on the biological processes underlying complex diseases. However, current approaches for gene set analysis are still in an early stage of development in that analysis results are often prone to sources of bias, including gene set size and gene length, linkage disequilibrium patterns and the presence of overlapping genes. In this paper, we provide an in-depth review of the gene set analysis procedures, along with parameter choices and the particular methodology challenges at each stage. In addition to providing a survey of recently developed tools, we also classify the analysis methods into larger categories and discuss their strengths and limitations. In the last section, we outline several important areas for improving the analytical strategies in gene set analysis.
Genome-wide association study; Gene set; Pathway; Gene-set enrichment analysis; Statistical significance; Complex disease
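As a concrete illustration of combining signals across a gene set, the sketch below computes an over-representation p-value. The abstract does not name a specific test, so the hypergeometric form, the function name `overrepresentation_pvalue`, and all argument names are assumptions made for illustration. The sketch does not correct for the biases noted above (gene set size, gene length, linkage disequilibrium, overlapping genes).

```python
from math import comb


def overrepresentation_pvalue(n_genes, n_in_set, n_selected, n_overlap):
    """Upper-tail hypergeometric p-value P(X >= n_overlap).

    n_genes:    total number of genes tested (hypothetical background size)
    n_in_set:   number of background genes that belong to the gene set
    n_selected: number of genes called associated with the trait
    n_overlap:  number of associated genes that fall in the gene set
    """
    if not (0 <= n_in_set <= n_genes and 0 <= n_selected <= n_genes):
        raise ValueError("set sizes must lie between 0 and n_genes")
    upper = min(n_in_set, n_selected)
    # Count the ways to draw i set members and (n_selected - i) non-members.
    numerator = sum(
        comb(n_in_set, i) * comb(n_genes - n_in_set, n_selected - i)
        for i in range(max(n_overlap, 0), upper + 1)
    )
    return numerator / comb(n_genes, n_selected)
```

For example, `overrepresentation_pvalue(10, 5, 5, 5)` asks how likely it is that all five selected genes fall in a five-gene set drawn from ten genes by chance alone.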
Gene-expression microarrays allow researchers to characterize biological phenomena in a high-throughput fashion but are subject to technological biases and inevitable variabilities that arise during sample collection and processing. Normalization techniques aim to correct such biases. Most existing methods require multiple samples to be processed in aggregate; consequently, each sample's output is influenced by other samples processed jointly. However, in personalized-medicine workflows, samples may arrive serially, so renormalizing all samples upon each new arrival would be impractical. We have developed Single Channel Array Normalization (SCAN), a single-sample technique that models the effects of probe-nucleotide composition on fluorescence intensity and corrects for such effects, dramatically increasing the signal-to-noise ratio within individual samples while decreasing variation across samples. In various benchmark comparisons, we show that SCAN performs as well as or better than competing methods yet has no dependence on external reference samples and can be applied to any single-channel microarray platform.
Method; normalization; microarray; linear model; mixture model; single-sample technique
The derivation of stably cultured cell lines has been critical to the advance of molecular biology. We profiled gene expression in the first two generally available cell lines derived from zebra finch. Using Illumina RNA-seq, we generated ~93 million reads and mapped the majority to the recently assembled zebra finch genome. Expression of most Ensembl-annotated genes was detected, but over half of the mapped reads aligned outside annotated genes. The male-derived G266 line expressed Z-linked genes at a higher level than did the female-derived ZFTMA line, indicating persistence in culture of the distinctive lack of avian sex chromosome dosage compensation. Although these cell lines were not derived from neural tissue, many neurobiologically relevant genes were expressed, albeit typically at lower levels than in a reference sample from auditory forebrain. These cell lines recapitulate fundamental songbird biology and will be useful for future studies of songbird gene regulation and function.
zebra finch; RNA-seq; song learning; gene expression; Illumina; dosage compensation; bird; sex chromosome
TrxG and PcG complexes play key roles in the epigenetic regulation of development through H3K4me3 and H3K27me3 modification at specific sites throughout the human genome, but how these sites are selected is poorly understood. We find that in pluripotent cells, clustered CpG-islands at genes predict occupancy of H3K4me3 and H3K27me3, and these “bivalent” chromatin domains precisely span the boundaries of CpG-island clusters. These relationships are specific to pluripotent stem cells and are not retained at H3K4me3 and H3K27me3 sites unique to differentiated cells. We show that putative transcripts from clustered CpG-islands predict stem-loop structures characteristic of those bound by PcG complexes, consistent with the possibility that RNA facilitates PcG recruitment or maintenance at these sites. These studies suggest that CpG-island structure plays a fundamental role in establishing developmentally important chromatin structures in the pluripotent genome, and a subordinate role in establishing TrxG/PcG chromatin structure at sites unique to differentiated cells.
Polycomb; trithorax; stem cell; bivalent; H3K4me3; H3K27me3; stem-loop
Sequencing data analysis remains limiting and problematic, especially for low complexity repeat sequences and transposon elements due to inherent sequencing errors and short sequence read lengths. We have developed a program, ReviSeq, which uses a hybrid method comprised of iterative remapping and local assembly upon a bacterial sequence backbone. Application of this method to six Brucella suis field isolates compared to the newly revised Brucella suis 1330 reference genome identified on average 13, 15, 19 and 9 more variants per sample than STAMPY/SAMtools, BWA/SAMtools, iCORN and BWA/PINDEL pipelines, and excluded on average 4, 2, 3 and 19 variants per sample, respectively. In total, using this iterative approach, we identified on average 87 variants including SNVs, short INDELs and long INDELs per strain when compared to the reference. Our program outperforms other methods especially for long INDEL calling.
The program is available at http://reviseq.sourceforge.net.
Brucella; sequence assembly; resequencing; variant calling; comparative genomics; iterative mapping
The indirect biological effects of ionizing radiation (IR) are thought to be mediated largely by reactive oxygen and nitrogen species (ROS and RNS). However, no data are available on how nitric oxide (NO) modulates the response of normal human cells to IR exposures at the level of the whole transcriptome. Here, we examined the effects of NO and ROS scavengers, carboxy-PTIO and DMSO, on changes in global gene expression in cultured normal human fibroblasts after exposures to gamma-rays, aiming to elucidate the involvement of ROS and RNS in transcriptional response to IR. We found that NO depletion dramatically affects the gene expression in normal human cells following irradiation with gamma-rays. We observed striking (more than seven-fold) reduction of the number of upregulated genes upon NO scavenging compared to reference irradiated cell cultures. NO scavenging in irradiated IMR-90 cells results in induction of p53 signaling, DNA damage and DNA repair pathways.
Nitric oxide; normal human fibroblast; ionizing radiation; DNA microarray
Gene expression is a dynamic process, and the factors that influence gene expression changes upon external stimulus are not clearly understood. We studied gene expression profiles in human umbilical vein endothelial cells (HUVEC) after Tumor Necrosis Factor (TNF) stimulation and found that: the promoters of fast-response up-regulated genes were enriched with several “active” chromatin markers like H3K27ac and H3K4me3, and were also preferentially bound by Pol II and c-Myc; the core-promoter regions of slow-response up-regulated genes were frequently occupied by nucleosomes; down-regulated genes were more intensively regulated by microRNAs. Moreover, Gene Ontology and motif analysis of the promoter regions revealed that gene clusters with different response behaviors had different functions and were regulated by different sets of transcription factors. Our observations suggest that the different gene expression patterns upon external stimulus are regulated by a combination of multi-layer regulators.
TNF; Gene expression profiles; Chromatin; Histone code; MicroRNAs
We report the construction of a 1.5 Mb resolution radiation hybrid map of the domestic cat genome. This new map includes novel microsatellite loci and markers derived from the 2X genome sequence that target previous gaps in the feline-human comparative map. Ninety-six percent of the 1793 cat markers we mapped have identifiable orthologues in the canine and human genome sequences. The updated autosomal and X chromosome comparative maps identify 152 cat-human and 134 cat-dog homologous synteny blocks. Comparative analysis shows the marked change in chromosomal evolution in the canid lineage relative to the felid lineage since divergence from their carnivoran ancestor. The canid lineage has a thirty-fold difference in the number of interchromosomal rearrangements relative to felids, while the felid lineage has primarily undergone intrachromosomal rearrangements. We have also refined the pseudoautosomal region and boundary in the cat and show that it is markedly longer than those of human or mouse. This improved RH comparative map provides a useful tool to facilitate positional cloning studies in the feline model.
domestic cat; radiation hybrid map; canine genome; genome evolution; synteny; chromosome rearrangement
Endogenous retroviral elements (EREs), a family of transposable elements, constitute a substantial fraction of mammalian genomes. It is expected that profiles of the ERE sequences and their genomic locations are unique for each individual. Comprehensive characterization of the EREs’ genomic locations and their biological properties is essential for understanding their roles in the pathophysiology of the host. In this study, we identified and mapped putative EREs (a total of 111 endogenous retroviruses [ERVs] and 488 solo long terminal repeats [sLTRs]) within the C57BL/6J mouse genome. The biological properties of individual ERE isolates (both ERVs and sLTRs) were then characterized in the following aspects: transcription potential, tropism trait, coding potential, recombination event, integration age, and primer binding site for replication. In addition, a suite of database management system programs was developed to organize and update the data acquired from current and future studies and to make the data accessible via internet.
mouse genome; database; murine leukemia virus; endogenous retrovirus; retroelement; solo long terminal repeat
Genes occupy ~3 % of the human and mouse genomes whereas repetitive elements (REs), whose biologic functions are largely uncharacterized, constitute greater than 50 %. A heterogeneous population of RE arrays (arrangement structures) is formed by combinations of various REs in mammalian genomes. In this study, REMiner-II was refined from the original REMiner for a more efficient identification and configuration of RE arrays from large queries (e.g., human chromosomes) using an unbiased self-alignment protocol. Chromosome-wide RE array profiles for the entire sets of human and mouse chromosomes were obtained using REMiner-II on a personal computer. REMiner-II provides 10 adjustable parameters and three data output modes to accommodate different experimental settings and/or goals. Examination of the human and mouse chromosome data using the REMiner-II viewer revealed species-specific libraries of complexly organized RE arrays. In conclusion, REMiner-II is an efficient tool for chromosome-wide identification and characterization of RE arrays from mammalian genomes.
mammalian genome; chromosome-wide; repetitive element; RE array; mining; REMiner-II
Genomic imprinting at the Delta-like 1 (Dlk1) - Maternally expressed gene 3 (Meg3) locus is regulated by the Meg3 differentially methylated region (DMR), but the mechanism by which this DMR acts is unknown. The goal of this study was to analyze the Meg3 DMR during imprinting establishment and maintenance for the presence of histone modifications and trans-acting DNA binding proteins using chromatin immunoprecipitation. In embryonic stem (ES) cells, where Meg3 is biallelically expressed, the DMR showed variable DNA methylation, with biallelic methylation at one region but paternal allele-specific methylation at another. All histone modifications detected at the Meg3 DMR of ES cells were biallelic. In embryonic day 12.5 (e12.5) embryos, where Meg3 is maternally expressed, the paternal Meg3 DMR was methylated, and activating histone modifications were specific to the maternal DMR. DNA-binding proteins that represent potential regulatory factors were identified in both ES cells and embryos.
Dlk1; Meg3; genomic imprinting; differentially methylated region; epigenetics; histone modifications; chromatin immunoprecipitation
Epigenetic changes refer to heritable changes that may modulate gene expression without affecting DNA sequence. DNA methylation is one such heritable epigenetic change, which is causally associated with the transcription regulation of many genes in the mammalian genome. Altered DNA methylation has been implicated in a wide variety of human diseases including cancer. Understanding the regulation of DNA methylation is likely to improve the ability to diagnose and treat these diseases. With the advent of high-throughput RNA interference (RNAi) screens, answering epigenetic questions on a genomic scale is now possible. Two recent genome-wide RNAi screens have addressed the regulation of DNA methylation in cancer, leading to the identification of the regulators of epigenetic silencing by oncogenic RAS and how epigenetic silencing of the tumor suppressor RASSF1A is maintained. These RNAi screens have much wider applications, since similar screens can now be adapted to identify the mechanism of silencing of any human disease-associated gene that is epigenetically regulated. In this review, we discuss two recent genome-wide RNAi screens for epigenetic regulators and explore potential applications in understanding DNA methylation and gene expression regulation in mammalian cells. We also discuss some of the key unanswered questions in the field of DNA methylation and suggest genome-wide RNAi screens designed to answer them.
DNA methylation; Epigenetics; Transcription; RNA interference; Imprinting
The profiling of small RNAs by high throughput sequencing (smRNA-Seq) has revealed the complexity of the RNA world. Here, we describe a computational scheme for dissecting the plant smRNAome by integrating smRNA-Seq datasets in Arabidopsis thaliana. Our analytical approach first defines ab initio the genomic loci that produce smRNAs as basic units, then utilizes principal component analysis (PCA) to predict novel miRNAs. Secondary structure prediction of candidates’ putative precursors discovered a group of long hairpin double-stranded RNAs (lh-dsRNAs) formed by inverted duplications of decayed coding genes. These gene remnants produce miRNA-like small RNAs that are predominantly 21 and 22 nt long, dependent on DCL1 but independent of RDR2 and DCL2/3/4, and associated with AGO1. Additionally, we found two classes of transcription start site associated- (TSSa-) RNAs, located at sense (+) and antisense (−) approximately 100-200 bp downstream of TSSs, that are differentially incorporated into AGO1 and AGO4, respectively.
High-throughput sequencing; small RNAs; Principal component analysis; TSS-associated RNAs
With the proliferation of high-throughput technologies, genome-level data analysis has become common in molecular biology. Bioinformaticians are developing extensive resources to annotate and mine biological features from high-throughput data. The underlying database management systems for most bioinformatics software are based on a relational model. Modern non-relational databases offer an alternative that has flexibility, scalability, and a non-rigid design schema. Moreover, with an accelerated development pace, non-relational databases like CouchDB can be ideal tools to construct bioinformatics utilities. We describe CouchDB by presenting three new bioinformatics resources: (a) geneSmash, which collates data from bioinformatics resources and provides automated gene-centric annotations, (b) drugBase, a database of drug-target interactions with a web interface powered by geneSmash, and (c) HapMap-CN, which provides a web interface to query copy number variations from three SNP-chip HapMap datasets. In addition to the web sites, all three systems can be accessed programmatically via web services.
NoSQL database; copy number variation; drug-target interaction; data integration
A wealth of genomic information is available in public and private databases. However, this information is underutilized for uncovering population specific and functionally relevant markers underlying complex human traits. Given the huge amount of SNP data available from the annotation of human genetic variation, data mining is a faster and cost effective approach for investigating the number of SNPs that are informative for ancestry. In this study, we present AncestrySNPminer, the first web-based bioinformatics tool specifically designed to retrieve Ancestry Informative Markers (AIMs) from genomic data sets and link these informative markers to genes and ontological annotation classes. The tool includes an automated and simple “scripting at the click of a button” functionality that enables researchers to perform various population genomics statistical analyses methods with user friendly querying and filtering of data sets across various populations through a single web interface. AncestrySNPminer can be freely accessed at https://research.cchmc.org/mershalab/AncestrySNPminer/login.php.
Ancestry; Ancestry informative markers; AIMs; Bioinformatics; AncestrySNPminer; Data mining; Admixture; Admixture mapping
Random forests (RF) is a popular tree-based ensemble machine learning tool that is highly data adaptive, applies to “large p, small n” problems, and is able to account for correlation as well as interactions among features. This makes RF particularly appealing for high-dimensional genomic data analysis. In this article, we systematically review the applications and recent progress of RF for genomic data, including prediction and classification, variable selection, pathway analysis, genetic association and epistasis detection, and unsupervised learning.
Random forests; Random survival forests; Classification; Prediction; Variable selection; Genomic data analysis
A heretofore-unrecognized multigene family encoding diverse immunoglobulin (Ig) domain-containing proteins (DICPs) was identified in the zebrafish genome. Twenty-nine distinct loci mapping to three chromosomal regions encode receptor-type structures possessing two classes of Ig ectodomains (D1 and D2). The sequence and number of Ig domains, transmembrane regions and signaling motifs vary among DICPs. Interindividual polymorphism and alternative RNA processing contribute to DICP diversity. Molecular models indicate that most D1 domains are of the variable (V) type; D2 domains are Ig-like. Sequence differences between D1 domains are concentrated in hypervariable regions on the front sheet strands of the Ig fold. Recombinant DICP Ig domains bind lipids, a property shared by mammalian CD300 and TREM family members. These findings suggest that novel multigene families encoding diversified immune receptors have arisen in different vertebrate lineages and effect parallel patterns of ligand recognition that potentially impact species-specific advantages.
zebrafish; innate immunity; lipid binding
We explore the utility of p-value weighting for enhancing the power to detect differential metabolites in a two-sample setting. Related gene expression information is used to assign an a priori importance level to each metabolite being tested. We map the gene expression to a metabolite through pathways and then gene expression information is summarized per-pathway using gene set enrichment tests. Through simulation we explore four styles of enrichment tests and four weight functions to convert the gene information into a meaningful p-value weight. We implement the p-value weighting on a prostate cancer metabolomics dataset. Gene expression on matched samples is used to construct the weights. Under certain regulatory conditions, the use of weighted p-values does not inflate the type I error above what we see for the unweighted tests except in high correlation situations. The power to detect differential metabolites is notably increased in situations with disjoint pathways and shows moderate improvement, relative to the proportion of enriched pathways, when pathway membership overlaps.
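The general idea of p-value weighting can be sketched as follows. This is a minimal illustration, not the authors' procedure: it assumes the weights are rescaled to average one, that each p-value is divided by its weight, and that a Benjamini-Hochberg step-up rule is then applied. The abstract specifies neither the weight functions nor the multiple-testing procedure, so the function name `weighted_bh` and these choices are assumptions.

```python
def weighted_bh(pvalues, weights, alpha=0.05):
    """Return sorted indices of hypotheses rejected by a weighted BH rule.

    pvalues: per-metabolite p-values
    weights: positive a priori importance weights (hypothetical input,
             e.g. derived from pathway-level enrichment of gene expression)
    """
    m = len(pvalues)
    if m == 0 or len(weights) != m:
        raise ValueError("pvalues and weights must be non-empty and equal length")
    if any(w <= 0 for w in weights):
        raise ValueError("weights must be positive")
    # Rescale so the weights average one, a common constraint in p-value weighting.
    mean_w = sum(weights) / m
    scaled = [w / mean_w for w in weights]
    # A large weight shrinks the p-value, making rejection easier.
    q = [min(1.0, p / w) for p, w in zip(pvalues, scaled)]
    order = sorted(range(m), key=lambda i: q[i])
    # Step-up rule: find the largest rank whose weighted p-value passes.
    k = 0
    for rank, i in enumerate(order, start=1):
        if q[i] <= alpha * rank / m:
            k = rank
    return sorted(order[:k])
```

With equal weights the function reduces to ordinary Benjamini-Hochberg. Up-weighting a test, as in `weighted_bh([0.01, 0.04, 0.03, 0.5], [1, 1, 3, 1])`, can bring a borderline p-value under the threshold at the cost of the down-weighted tests.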
Microarray technology, which is widely used to analyze global gene expression at the level of transcription, is expanding its application not only in medicine but also in studies of basic biology. This paper presents our analysis of microarray gene expression data in the CEPH Utah families, focusing on the effects of demographic characteristics such as age and sex on differential gene expression patterns. Our results show that the differential gene expression pattern between age groups is dominated by down-regulated transcriptional activities in the old subjects. Functional analysis of age-regulated genes identifies cell-cell signaling as an important functional category implicated in human aging. Sex-dependent gene expression is characterized by genes that may escape X-inactivation and, most interestingly, such a pattern is not affected by the aging process. Analysis of sibship correlation in gene expression revealed a large number of significant genes, suggesting the importance of a genetic mechanism in regulating transcriptional activities. In addition, we observe an interesting pattern of sibship correlation in gene expression that increases exponentially with the mean of gene expression, reflecting the enhanced genetic control over the functionally active genes.
Gene expression; Aging; X-inactivation; Intra-class correlation coefficient
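Sibship correlation of the kind summarized by an intra-class correlation coefficient can be sketched as below. The abstract does not state which estimator was used, so the one-way ANOVA form for equal-sized sibships and the function name `icc_oneway` are assumptions made for illustration.

```python
def icc_oneway(groups):
    """One-way ANOVA intra-class correlation for equal-sized groups.

    groups: list of sibships, each a list of expression values for one gene
            (hypothetical input layout; every sibship must have the same size).
    """
    n = len(groups)
    if n < 2:
        raise ValueError("need at least two sibships")
    k = len(groups[0])
    if k < 2 or any(len(g) != k for g in groups):
        raise ValueError("all sibships must share the same size (at least 2)")
    grand_mean = sum(sum(g) for g in groups) / (n * k)
    means = [sum(g) / k for g in groups]
    # Between-sibship and within-sibship mean squares.
    msb = k * sum((m - grand_mean) ** 2 for m in means) / (n - 1)
    msw = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g) / (n * (k - 1))
    denominator = msb + (k - 1) * msw
    if denominator == 0:
        raise ValueError("no variation in the data")
    return (msb - msw) / denominator
```

A value near 1 means siblings resemble each other far more than unrelated individuals do; values near or below 0 indicate no sibship resemblance for that gene.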
We introduce GenRev, a network-based software package developed to explore the functional relevance of genes generated as an intermediate result from numerous high-throughput technologies. GenRev searches for optimal intermediate nodes (genes) for the connection of input nodes via several algorithms, including the Klein-Ravi algorithm, the limited kWalks algorithm and a heuristic local search algorithm. Gene ranking and graph clustering analyses are integrated into the package. GenRev has the following features. (1) It provides users with great flexibility to define their own networks. (2) Users are allowed to define each gene’s importance in a subnetwork search by setting its score. (3) It is standalone and platform independent. (4) It provides an optimization in subnetwork search, which dramatically reduces the running time. GenRev is particularly designed for general use so that users have the flexibility to choose a reference network and define the score of genes. GenRev is freely available at http://bioinfo.mc.vanderbilt.edu/GenRev.html.
Gene ranking; Network; Subnetwork; Klein-Ravi algorithm; limited kWalks algorithm; Disease genes
Obesity affects over 500 million people worldwide and has far-reaching negative health effects. Given that high body mass index (BMI) and insulin resistance are associated with alterations in many regions of the brain and that physical activity can decrease obesity, we hypothesized that in Rhesus monkeys (Macaca mulatta) fed a high-fat diet and subsequently given reduced calories, BMI would be associated with a unique gene expression signature in motor regions of the brain implicated in neurodegenerative disorders. In the motor cortex with increased BMI we saw the upregulation of genes involved in apoptosis, altered gene expression in metabolic pathways, and the downregulation of pERK1/2, a protein involved in cellular survival. In the caudate nucleus with increased BMI we saw the upregulation of known obesity-related genes (the insulin receptor and the glucagon-like peptide-2 receptor), apoptosis-related genes, and altered expression of genes involved in various metabolic processes. These studies suggest that the effects of high BMI on the brain transcriptome persist despite two months of calorie restriction. We hypothesize that active lifestyles with low BMIs together create a brain homeostasis more conducive to brain resiliency and neuronal survival.
DNA microarray; rhesus monkey; BMI; motor cortex; caudate nucleus; gene expression; ERK pathway; brain
Identification of single nucleotide polymorphisms (SNPs) is a key element in sequence-based genetic analysis. Next generation sequencing offers a cost-effective basis to generate the necessary, large sequence data sets, and bioinformatic methods are being developed to process sequencing machine readouts. We were interested in detection of SNPs in a 350 kb region of an EMS-mutagenized Arabidopsis chromosome 3. The region was selectively analyzed using PCR-generated, overlapping fragments for Solexa sequencing. The ensuing reads provided a high coverage and were processed bioinformatically. In order to assess the SNP candidates obtained with a frequently used alignment program and SNP caller, we developed an additional method that allows the identification of high confidence SNP loci. The method can easily be applied to complete genome sequence data of sufficient coverage.
► We present a method to analyze high to medium coverage short read sequence data. ► The method depends on availability of a high quality reference sequence. ► The protocol can be applied to the output of established SNP caller programs. ► A scoring function or a graphic output identifies the best SNP candidates.
Next generation sequencing; Read alignment; SNP validation; SNP calling; Sub-genomic library
Two-gene classifiers have attracted broad interest for their simplicity and practicality. Most existing two-gene classification algorithms rely on exhaustive search, which makes them slow. In this study, we proposed two new two-gene classification algorithms that use a simple univariate gene selection strategy and construct simple classification rules based on optimal cut-points for the two selected genes. We detected the optimal cut-point with the information entropy principle. We applied the two-gene classification models to eleven cancer gene expression datasets and compared their classification performance to that of some established two-gene classification models like the top-scoring pairs model and the greedy pairs model, as well as standard methods including Diagonal Linear Discriminant Analysis, k-Nearest Neighbor, Support Vector Machine and Random Forest. These comparisons indicated that the performance of our two-gene classifiers was comparable to or better than that of the compared models.
Cancer; Classification; Gene Expression Profiling; Information Entropy; Computational Biology
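Choosing a cut-point by the information entropy principle can be sketched for a single gene as follows. This is an illustrative reading rather than the published algorithm: it assumes the cut-point is the midpoint between adjacent expression values that minimizes the size-weighted class entropy of the two resulting groups. The function names are hypothetical, and the rule that combines the two genes' cut-points is not shown because the abstract does not describe it.

```python
import math


def class_entropy(labels):
    """Shannon entropy (in bits) of the class labels in one group."""
    n = len(labels)
    if n == 0:
        return 0.0
    h = 0.0
    for c in set(labels):
        p = labels.count(c) / n
        h -= p * math.log2(p)
    return h


def best_cut_point(values, labels):
    """Return (cut, entropy) minimizing the weighted entropy after the split.

    values: expression values of one gene across samples
    labels: class label of each sample (e.g. 0 = normal, 1 = tumor)
    Returns (None, inf) when all values are identical.
    """
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best_cut, best_h = None, float("inf")
    for i in range(1, n):
        # Only split between distinct expression values.
        if pairs[i - 1][0] == pairs[i][0]:
            continue
        cut = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [lab for _, lab in pairs[:i]]
        right = [lab for _, lab in pairs[i:]]
        h = (len(left) * class_entropy(left) + len(right) * class_entropy(right)) / n
        if h < best_h:
            best_cut, best_h = cut, h
    return best_cut, best_h
```

Because each gene needs only one sorted pass over its values, this kind of univariate search avoids the exhaustive evaluation of all gene pairs.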
Tumor-initiating cells (TICs) are characterized by their ability to self-renew, differentiate and initiate tumor formation. MicroRNAs (miRNAs) are small noncoding RNAs that bind to mRNAs, resulting in regulation of gene expression and biological functions. The role of miRNAs and TICs in cancer progression led us to hypothesize that miRNAs may regulate genes involved in TIC maintenance. Using whole genome miRNA and mRNA expression profiling of TICs from primary prostate cancer cells, we identified a set of up-regulated miRNAs and a set of genes down-regulated in prostatospheres (PSs). Inhibition of these miRNAs results in a decrease of prostatosphere formation and an increase in target gene expression. This study uses genome-wide miRNA profiling to analyze expression in TICs. We connect aberrant miRNA expression and deregulated gene expression in TICs. These findings can contribute to a better understanding of the molecular mechanisms governing TIC development/maintenance and the role that miRNAs have in the fundamental biology of TICs.
Tumor-initiating cell; Prostatospheres; Prostate Cancer; Cancer Stem Cells; microRNA
Genome-wide characterization of the retinal transcriptome is central to understanding development, physiology and disorders of the visual system. Massively parallel, short-read sequencing of mRNA libraries was used to generate an extensive map of the transcriptome of the adult, murine neural retina. RNA-seq data strongly corroborates prior transcriptome studies by microarray and SAGE. However, several novel features of the retinal transcriptome were discovered. For example, retinal disease genes were discovered to be among the most highly expressed in the transcriptome. We also demonstrate other interesting features of the retinal transcriptome, for example, that the retina appears to employ a very specific and restricted set of synaptic vesicle genes, and also that there is persistence of expression of a majority of “neurodevelopmental” genes into adulthood. Retina transcriptome studies utilizing novel sequencing methods have been highly informative and these data may also serve as a resource for the community of researchers.
retina; transcriptome; RNA-seq; mouse; alternative splicing