We introduce GenRev, a network-based software package developed to explore the functional relevance of genes generated as an intermediate result from numerous high-throughput technologies. GenRev searches for optimal intermediate nodes (genes) for the connection of input nodes via several algorithms, including the Klein-Ravi algorithm, the limited kWalks algorithm and a heuristic local search algorithm. Gene ranking and graph clustering analyses are integrated into the package. GenRev has the following features. (1) It provides users with great flexibility to define their own networks. (2) Users are allowed to define each gene’s importance in a subnetwork search by setting its score. (3) It is standalone and platform independent. (4) It provides an optimization in subnetwork search, which dramatically reduces the running time. GenRev is particularly designed for general use so that users have the flexibility to choose a reference network and define the score of genes. GenRev is freely available at http://bioinfo.mc.vanderbilt.edu/GenRev.html.
Gene ranking; Network; Subnetwork; Klein-Ravi algorithm; limited kWalks algorithm; Disease genes
Obesity affects over 500 million people worldwide and has far-reaching negative health effects. Given that high body mass index (BMI) and insulin resistance are associated with alterations in many regions of the brain and that physical activity can decrease obesity, we hypothesized that, in rhesus monkeys (Macaca mulatta) fed a high-fat diet and subsequently given reduced calories, BMI would be associated with a unique gene expression signature in motor regions of the brain implicated in neurodegenerative disorders. In the motor cortex, increased BMI was associated with the upregulation of genes involved in apoptosis, altered gene expression in metabolic pathways, and the downregulation of pERK1/2, a protein involved in cellular survival. In the caudate nucleus, increased BMI was associated with the upregulation of known obesity-related genes (the insulin receptor and the glucagon-like peptide-2 receptor) and apoptosis-related genes, and with altered expression of genes involved in various metabolic processes. These studies suggest that the effects of high BMI on the brain transcriptome persist despite two months of calorie restriction. We hypothesize that active lifestyles with low BMIs together create a brain homeostasis more conducive to brain resiliency and neuronal survival.
DNA microarray; rhesus monkey; BMI; motor cortex; caudate nucleus; gene expression; ERK pathway; brain
Identification of single nucleotide polymorphisms (SNPs) is a key element in sequence-based genetic analysis. Next generation sequencing offers a cost-effective basis to generate the necessary, large sequence data sets, and bioinformatic methods are being developed to process sequencing machine readouts. We were interested in detection of SNPs in a 350 kb region of an EMS-mutagenized Arabidopsis chromosome 3. The region was selectively analyzed using PCR-generated, overlapping fragments for Solexa sequencing. The ensuing reads provided a high coverage and were processed bioinformatically. In order to assess the SNP candidates obtained with a frequently used alignment program and SNP caller, we developed an additional method that allows the identification of high confidence SNP loci. The method can easily be applied to complete genome sequence data of sufficient coverage.
► We present a method to analyze high to medium coverage short read sequence data. ► The method depends on availability of a high quality reference sequence. ► The protocol can be applied to the output of established SNP caller programs. ► A scoring function or a graphic output identifies the best SNP candidates.
Next generation sequencing; Read alignment; SNP validation; SNP calling; Sub-genomic library
Two-gene classifiers have attracted broad interest for their simplicity and practicality. Most existing two-gene classification algorithms rely on exhaustive search, which makes them slow. In this study, we propose two new two-gene classification algorithms that use a simple univariate gene selection strategy and construct simple classification rules based on optimal cut-points for the two selected genes. We detect the optimal cut-point with the information entropy principle. We applied the two-gene classification models to eleven cancer gene expression datasets and compared their classification performance to that of established two-gene classification models, such as the top-scoring pairs model and the greedy pairs model, as well as standard methods including Diagonal Linear Discriminant Analysis, k-Nearest Neighbor, Support Vector Machine and Random Forest. These comparisons indicated that the performance of our two-gene classifiers was comparable to or better than that of the compared models.
Cancer; Classification; Gene Expression Profiling; Information Entropy; Computational Biology
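The entropy-based cut-point search mentioned in the abstract above can be illustrated with a minimal sketch. The abstract does not give the exact procedure, so the function names (`entropy`, `best_cut_point`), the use of midpoints between sorted expression values as candidate cuts, and the weighted-entropy criterion are all assumptions chosen for illustration, not the authors' implementation.

```python
import math
from typing import List, Optional, Sequence, Tuple


def entropy(labels: List[int]) -> float:
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    h = 0.0
    for c in set(labels):
        p = labels.count(c) / n
        h -= p * math.log2(p)
    return h


def best_cut_point(
    values: Sequence[float], labels: Sequence[int]
) -> Tuple[Optional[float], float]:
    """Return (cut, weighted_entropy) for the split with the lowest class entropy.

    Candidate cuts are midpoints between consecutive distinct sorted values
    (an assumption; other candidate sets are possible).
    """
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best_cut: Optional[float] = None
    best_h = float("inf")
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no cut can separate identical expression values
        cut = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [lab for _, lab in pairs[:i]]
        right = [lab for _, lab in pairs[i:]]
        # Entropy of each side, weighted by the fraction of samples it holds.
        h = (len(left) * entropy(left) + len(right) * entropy(right)) / n
        if h < best_h:
            best_cut, best_h = cut, h
    return best_cut, best_h


# Hypothetical expression values for one gene in six samples (0 = normal, 1 = tumor).
cut, h = best_cut_point([1.0, 2.0, 3.0, 8.0, 9.0, 10.0], [0, 0, 0, 1, 1, 1])
```

A two-gene rule could then combine the cut-points found for each selected gene, for example by predicting one class only when both genes fall on the same side of their cuts; how the published algorithms combine them is not stated in the abstract.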
Tumor-initiating cells (TICs) are characterized by their ability to self-renew, differentiate and initiate tumor formation. microRNAs (miRNAs) are small noncoding RNAs that bind to mRNAs, resulting in regulation of gene expression and biological functions. The role of miRNAs and TICs in cancer progression led us to hypothesize that miRNAs may regulate genes involved in TIC maintenance. Using whole genome miRNA and mRNA expression profiling of TICs from primary prostate cancer cells, we identified a set of up-regulated miRNAs and a set of genes down-regulated in prostatospheres (PSs). Inhibition of these miRNAs results in a decrease of prostatosphere formation and an increase in target gene expression. This study uses genome-wide miRNA profiling to analyze expression in TICs. We connect aberrant miRNA expression and deregulated gene expression in TICs. These findings can contribute to a better understanding of the molecular mechanisms governing TIC development/maintenance and the role that miRNAs have in the fundamental biology of TICs.
Tumor-initiating cell; Prostatospheres; Prostate Cancer; Cancer Stem Cells; microRNA
Genome-wide characterization of the retinal transcriptome is central to understanding development, physiology and disorders of the visual system. Massively parallel, short-read sequencing of mRNA libraries was used to generate an extensive map of the transcriptome of the adult, murine neural retina. RNA-seq data strongly corroborates prior transcriptome studies by microarray and SAGE. However, several novel features of the retinal transcriptome were discovered. For example, retinal disease genes were discovered to be among the most highly expressed in the transcriptome. We also demonstrate other interesting features of the retinal transcriptome, for example, that the retina appears to employ a very specific and restricted set of synaptic vesicle genes, and also that there is persistence of expression of a majority of “neurodevelopmental” genes into adulthood. Retina transcriptome studies utilizing novel sequencing methods have been highly informative and these data may also serve as a resource for the community of researchers.
retina; transcriptome; RNA-seq; mouse; alternative splicing
Analyzing gene expression data at the gene set level greatly improves feature extraction and data interpretation. Currently, most efforts in gene set analysis are focused on differential expression analysis – finding gene sets whose genes show a first-order relationship with the clinical outcome. However, the regulation of the biological system is complex, and much of the change in gene expression dynamics does not manifest in the form of differential expression. At the gene set level, capturing the change in expression dynamics is difficult due to the complexity and heterogeneity of the gene sets. Here we report a systematic approach to detect gene sets that show differential coordination patterns with the rest of the transcriptome, as well as pairs of gene sets that are differentially coordinated with each other. We demonstrate that the method can identify biologically relevant gene sets, many of which do not show a first-order relationship with the clinical outcome.
gene set analysis; gene expression; microarray
Identifying gene regulatory elements and their target genes in human cells remains a significant challenge. Despite increasing evidence of physical interactions between distant regulatory elements and gene promoters in mammalian cells, many studies consider only promoter-proximal regulatory regions. We identify putative cis-regulatory modules (CRMs) in human skeletal muscle differentiation by combining myogenic TF binding data before and after differentiation with histone modification data in myoblasts. CRMs that are distant (>20 kb) from muscle gene promoters are common and are more likely than proximal promoter regions to show differentiation-specific changes in myogenic TF binding. We find that two of these distant CRMs, known to activate transcription in differentiating myoblasts, interact physically with gene promoters (PDLIM3 and ACTA1) during differentiation. Our results highlight the importance of considering distal CRMs in investigations of mammalian gene regulation and support the hypothesis that distant CRM-promoter looping contacts are a general mechanism of gene regulation.
cis-regulatory modules; DNA looping interactions; Transcriptional regulation; Transcription factors; DNA binding sites
Four custom Axiom genotyping arrays were designed for a genome-wide association (GWA) study of 100,000 participants from the Kaiser Permanente Research Program on Genes, Environment and Health. The array optimized for individuals of European race/ethnicity was previously described. Here we detail the development of three additional microarrays optimized for individuals of East Asian, African American, and Latino race/ethnicity. For these arrays, we decreased redundancy of high-performing SNPs to increase SNP capacity. The East Asian array was designed using greedy pairwise SNP selection. However, removing SNPs from the target set based on imputation coverage is more efficient than pairwise tagging. Therefore, we developed a novel hybrid SNP selection method for the African American and Latino arrays utilizing rounds of greedy pairwise SNP selection, followed by removal from the target set of SNPs covered by imputation. The arrays provide excellent genome-wide coverage and are valuable additions for large-scale GWA studies.
Microarray; Genome-wide association study; Coverage; Imputation; Single nucleotide polymorphism; Throughput
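Greedy pairwise SNP selection, as named in the abstract above, can be sketched as a set-cover loop: repeatedly pick the SNP that tags the most still-uncovered target SNPs. The sketch below is a generic illustration under stated assumptions, not the array designers' pipeline: the function name `greedy_pairwise_tags`, the nested-dictionary layout of pairwise r² values, and the 0.8 threshold are all hypothetical.

```python
from typing import Dict, List, Set


def greedy_pairwise_tags(
    r2: Dict[str, Dict[str, float]], threshold: float = 0.8
) -> List[str]:
    """Pick tag SNPs greedily until every SNP in `r2` is covered.

    r2[a][b] holds the pairwise r^2 between SNPs a and b (hypothetical layout).
    A SNP covers itself and every SNP it is paired with at r^2 >= threshold.
    """
    snps = sorted(r2)
    covers: Dict[str, Set[str]] = {
        s: {s} | {t for t, v in r2[s].items() if v >= threshold} for s in snps
    }
    uncovered: Set[str] = set(snps)
    tags: List[str] = []
    while uncovered:
        # max() returns the first maximal element, so ties break alphabetically.
        best = max(snps, key=lambda s: len(covers[s] & uncovered))
        tags.append(best)
        uncovered -= covers[best]
    return tags


# Toy linkage structure: A tags B and C; D and E are only weakly correlated.
toy_r2 = {
    "A": {"B": 0.90, "C": 0.85},
    "B": {"A": 0.90},
    "C": {"A": 0.85},
    "D": {"E": 0.50},
    "E": {"D": 0.50},
}
tags = greedy_pairwise_tags(toy_r2)
```

In the hybrid scheme the abstract describes, SNPs already covered by imputation would additionally be removed from the target set between rounds; that step is omitted here because the abstract does not specify how imputation coverage is computed.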
The nitric oxide (NO) prodrug JS-K has been shown to have anticancer activity. To profile the molecular events associated with the anticancer effects of JS-K, HL-60 leukemia cells were treated with JS-K and subjected to microarray and real-time RT-PCR analysis. JS-K induced concentration- and time-dependent gene expression changes in HL-60 cells corresponding to its cytolethal effects. The apoptotic genes (caspases, Bax, and TNF-α) were induced, and differentiation-related genes (CD14, CD11b, and vimentin) were increased. For acute phase protein genes, some were increased (p53, c-jun) while others were suppressed (c-myc, cyclin E). The expression of the anti-angiogenesis genes thrombospondin-1 and CD36 and of genes involved in tumor cell migration, such as tissue inhibitors of metalloproteinases, was also increased by JS-K. Confocal analysis confirmed key gene changes at the protein level. Thus, multiple molecular events are associated with the effects of JS-K in killing HL-60 cells, and these could be molecular targets for this novel anticancer NO prodrug.
JS-K; Nitric oxide donor; HL-60 cells; gene expression; confocal analysis
Although the rhesus macaque (Macaca mulatta) is commonly used for biomedical research and becoming a preferred model for translational medicine, quantification of genome-wide variation has been slow to follow the publication of the genome in 2007. Here we report the properties of 4040 single nucleotide polymorphisms discovered and validated in Chinese and Indian rhesus macaques from captive breeding colonies in the United States. Frequency-matched measures of linkage disequilibrium were much greater in the Indian sample. Although the majority of polymorphisms were shared between the two populations, rare alleles were over twice as common in the Chinese sample. Indian rhesus had higher rates of heterozygosity, as well as previously undetected substructure, potentially due to admixture from Burma in wild populations and demographic events post-captivity.
Macaca mulatta; nonhuman primate; SNP; Linkage Disequilibrium
Repetitive elements (REs) constitute a substantial portion of the genomes of human and other species; however, the RE profiles (type, density, and arrangement) within the individual genomes have not been fully characterized. In this study, we developed an RE analysis tool, called REMiner, for a chromosome-wide investigation into the occurrence of individual REs and arrangement of clusters of REs, and REMiner’s functional features were examined using the human chromosome Y. The algorithm implemented by REMiner focused on unbiased mining of REs in large chromosomes and data interface within a viewer. The data from the chromosome demonstrated that REMiner is an efficient tool in regard to its capacity for a large query size and the availability of a high-resolution viewer, featuring instant retrieval of alignment data and control of magnification and identity ratio. The chromosome-wide survey identified a diverse population of ordered RE arrangements, which may participate in the genome biology.
repetitive element; repetitive element arrangement; chromosome-wide interactive mining; genome; REMiner
We sequenced the genomes of ten unrelated individuals and identified heterozygous stop gain variants in protein-coding genes: we then sequenced their transcriptomes and assessed the expression levels of the stop gain alleles. An ANOVA showed statistically significant differences between their expression levels (p = 4×10⁻¹⁶). This difference was almost entirely accounted for by whether the stop gain variant had a second, non-protein-truncating function in or near an alternate transcript: stop gains without alternate functions were generally not found in the cDNA (p = 3×10⁻⁵). Additionally, stop gain variants in two intronless genes were not expressed, an unexpected outcome given previous studies. In this study, stop gain variants were either well expressed in all individuals or were never expressed. Our finding that stop gain variants were generally expressed only when they had an alternate function suggests that most naturally occurring stop gain variants in protein-coding genes are either not transcribed or have their transcripts destroyed.
Nonsense-mediated decay; whole-genome sequencing; RNA-Seq; premature termination codons
We performed a detailed genomic investigation of the chimpanzee locus syntenic to human chromosome 4q35.2, which is associated with facioscapulohumeral dystrophy. Two contigs of approximately 150 kb and 200 kb were derived from PTR chromosomes 4q35 and 3p12, respectively: both regions showed a very similar sequence organization, including D4Z4 and Beta satellite linked clusters. Starting from these findings, we derived a hypothetical evolutionary history of human 4q35, 10q26 and 3p12 chromosome regions focusing on the D4Z4–Beta satellite linked organization. The D4Z4 unit showed an open reading frame (DUX4) at both PTR 4q35 and 3p12 regions; furthermore, some subregions of the Beta satellite unit showed a high degree of conservation between chimpanzee and humans. In conclusion, this paper provides evidence that at the 4q subtelomere the linkage between D4Z4 and Beta satellite arrays is a feature that appeared late during evolution and is conserved between chimpanzee and humans.
► A detailed genomic analysis of the PTR locus syntenic to human chromosome 4q35.2. ► PTR 4q35 and 3p12 regions carried very similar D4Z4 and Beta satellite linked clusters. ► We derived a hypothetical evolutionary history of human 4q35, 10q26 and 3p12 regions. ► PTR and HSA subregions of the Beta satellite showed a high degree of conservation. ► 4q D4Z4–Beta satellite linked arrays appeared very late during evolution.
Primate evolution; Chimpanzee; Beta satellite; D4Z4; 4q35; FSHD
We used a RainDance Technologies (RDT) expanded content library to enrich the human X chromosome exome (2.5 Mb) from 26 male samples followed by Illumina sequencing. Our multiplex primer library covered 98.05% of the human X chromosome exome in a single tube with 11,845 different PCR amplicons. Illumina sequencing of 24 male samples showed coverage for 97% of the targeted sequences. Sequence from 2 HapMap samples confirmed missing data rates of 2–3% at sites successfully typed by the HapMap project, with an accuracy of at least ~99.5% as compared to reported HapMap genotypes. Our demonstration that a RDT expanded content library can efficiently enrich and enable the routine sequencing of the human X chromosome exome suggests a wide variety of potential research and clinical applications for this platform.
RNA-seq technologies are now replacing microarrays for profiling gene expression. Here we describe a robust RNA-seq strategy for multiplex analysis of RNA samples based on deep sequencing. First, an oligo-dT linked to an adaptor sequence is used to prime cDNA synthesis. Upon solid phase selection, second strand synthesis is initiated using a random primer linked to another adaptor sequence. Finally, the library is released from the beads and amplified using a bar-coded primer together with a common primer. This method, referred to as Multiplex Analysis of PolyA-linked Sequences (MAPS), preserves strand information, permits rapid identification of potentially new polyadenylation sites, and profiles gene expression in a highly cost effective manner. We have applied this technology to determine the transcriptome response to knockdown of the RNA binding protein TLS, and compared the result to current microarray technology, demonstrating the ability of MAPS to robustly detect regulated gene expression.
RNA-seq; multiplexing strategy; gene expression profiling; translocation-in-liposarcoma
Deep sequencing of the 16S rRNA gene provides a comprehensive view of bacterial communities in a particular environment and has expanded our ability to study the impact of the microflora on human health and disease. Current analysis methods rely on comparisons of the sequences generated with an expanding but limited set of annotated 16S rRNA sequences or on phylogenetic clustering of sequences based on arbitrary similarity cutoffs. We describe a novel approach to characterize bacterial composition using deep sequencing of the 16S rRNA gene. Our method defines operational taxonomic units (OTUs) based on phylogenetic tree reconstruction and dynamic clustering of sequences using solely sequencing data. These OTUs can be used to identify differences in bacterial abundance between environments. This approach can perform better than previous phylogenetic methods and will significantly improve our understanding of the role of the microflora in human diseases by providing a comprehensive analysis of the microbial composition of various bacterial communities.
microflora; massively parallel sequencing; 16S ribosomal RNA
High-throughput genotyping and sequencing techniques are rapidly and inexpensively providing large amounts of human genetic variation data. Single Nucleotide Polymorphisms (SNPs) are an important source of human genome variability and have been implicated in several human diseases, including cancer. Amino acid mutations resulting from non-synonymous SNPs in coding regions may generate protein functional changes that affect cell proliferation. In this study, we developed a machine learning approach to predict cancer-causing missense variants. We present a Support Vector Machine (SVM) classifier trained on a set of 3163 cancer-causing variants and an equal number of neutral polymorphisms. The method achieves 93% overall accuracy, a correlation coefficient of 0.86, and an area under the ROC curve of 0.98. When compared with previously developed algorithms such as SIFT and CHASM, our method achieves higher prediction accuracy and correlation coefficient in identifying cancer-causing variants.
Single Nucleotide Polymorphisms; Cancer-causing variants; Gene Ontology; Machine-learning; Support Vector Machine
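The accuracy and correlation coefficient reported in the abstract above can be related through a confusion matrix. The sketch below assumes that "correlation coefficient" means the Matthews correlation coefficient (MCC), which the abstract does not state explicitly; the function name `accuracy_and_mcc` and the counts used are illustrative and are not the paper's actual results.

```python
import math
from typing import Tuple


def accuracy_and_mcc(tp: int, tn: int, fp: int, fn: int) -> Tuple[float, float]:
    """Return (accuracy, Matthews correlation coefficient) from confusion counts."""
    total = tp + tn + fp + fn
    acc = (tp + tn) / total
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is undefined when any marginal is zero; return 0.0 by convention.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return acc, mcc


# Illustrative balanced test set of 100 positives and 100 negatives,
# with the same number of errors in each class (hypothetical counts).
acc, mcc = accuracy_and_mcc(tp=93, tn=93, fp=7, fn=7)
```

With balanced classes and equal error counts in both classes, MCC reduces to 2 × accuracy − 1, so an accuracy of 0.93 corresponds to an MCC of 0.86 in this toy case.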
Microarray profiling in breast cancer patients has identified genes correlated with prognosis whose functions are unknown. The purpose of this study was to develop an in vivo assay for functionally screening regulators of tumor progression using a mouse model. Transductant shRNA cell lines were made in the MDA-MB-231 breast cancer line. A pooled population of 25 transductants was injected into the mammary fat pads and tail veins of mice to evaluate tumor growth and experimental metastasis. The proportions of transductants were evaluated in the tumor and metastases using barcodes specific to each shRNA transductant. We characterized the homeobox 2 transcription factor as a negative regulator, decreasing tumor growth in MDA-MB-231, T47D, and MTLn3 mammary adenocarcinoma cell lines. Homeobox genes have been correlated with cancer patient prognosis and tumorigenesis. Here we use a novel in vivo shRNA screen to identify a new role for a homeobox gene in human mammary adenocarcinoma.
shRNA; Functional Screen; Breast Cancer; Homeobox 2
Single-nucleotide polymorphism (SNP) arrays have become a popular technology for disease-association studies, but they also have potential for studying the genetic differentiation of human populations. Application of the Affymetrix GeneChip Human Mapping 500K Array Set to a population of 102 individuals representing the major ethnic groups in the United States (African, Asian, European, and Hispanic) revealed patterns of gene diversity and genetic distance that reflected population history. We analyzed allelic frequencies at 388,654 autosomal SNP sites that showed some variation in our study population and had 10% or less missing values. In spite of the small size (23-31 individuals) of each subpopulation, there were no fixed differences at any site between any two subpopulations. As expected from the African origin of modern humans, greater gene diversity was seen in Africans than in either Asians or Europeans, and the genetic distance between Asian and European populations was significantly lower than that between either of these two populations and Africans. Principal components analysis applied to a correlation matrix among individuals was able to separate completely the major continental groups of humans (Africans, Asians, and Europeans), while Hispanics overlapped all three of these groups. Genes containing two or more markers with extraordinarily high genetic distance between subpopulations were identified as candidate genes for health differences between subpopulations. The results show that, even with modest sample sizes, genome-wide SNP genotyping technologies have great promise for capturing signatures of gene frequency difference between human subpopulations, with applications in areas as diverse as forensics and the study of ethnic health disparities.
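The notion of gene diversity used in the abstract above can be made concrete with a short sketch. It assumes gene diversity means expected heterozygosity, one minus the sum of squared allele frequencies at a site, averaged over sites; the abstract does not state which estimator was used, and the function names and allele frequencies below are hypothetical.

```python
from typing import Sequence


def gene_diversity(allele_freqs: Sequence[float]) -> float:
    """Expected heterozygosity at one site: 1 - sum of squared allele frequencies."""
    return 1.0 - sum(p * p for p in allele_freqs)


def mean_gene_diversity(sites: Sequence[Sequence[float]]) -> float:
    """Average per-site gene diversity across a set of SNP sites."""
    return sum(gene_diversity(freqs) for freqs in sites) / len(sites)


# Two hypothetical biallelic SNPs in one subpopulation.
sites = [[0.5, 0.5], [0.9, 0.1]]
diversity = mean_gene_diversity(sites)
```

For a biallelic SNP with allele frequency p this reduces to 2p(1 − p), which peaks at p = 0.5. With subpopulations of only 23 to 31 individuals, a small-sample correction would normally be applied as well; it is left out here to keep the sketch minimal.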
Bdellovibrio bacteriovorus is a bacterial parasite with an unusual lifestyle. It grows and reproduces in the periplasm of a host prey bacterium. The complete genome sequence of B. bacteriovorus has recently been reported. We have reanalyzed the transport proteins encoded within the B. bacteriovorus genome according to the current content of the transporter classification database (TCDB). A comprehensive analysis is given on the types and numbers of transport systems that B. bacteriovorus has. In this regard, the potential protein secretory capabilities of at least 4 types of inner membrane secretion systems and 5 types for outer membrane secretion are described. Surprisingly, B. bacteriovorus has a disproportionate percentage of cytoplasmic membrane channels and outer membrane porins. It has far more TonB/ExbBD-type systems and MotAB-type systems for energizing outer membrane transport and motility than does E. coli. Analysis of probable substrate specificities of its transporters provides clues to its metabolic preferences. Interesting examples of gene fusions and of potentially overlapping genes were also noted. Our analyses provide a comprehensive, detailed appreciation of the transport capabilities of B. bacteriovorus. They should serve as a guide for functional experimental analyses.
Bacterial parasitism; transport; genome analyses; vectorial metabolism; protein secretion
The success of genome-wide association studies has paralleled the development of efficient genotyping technologies. We describe the development of a next-generation microarray based on the new highly-efficient Affymetrix Axiom genotyping technology that we are using to genotype individuals of European ancestry from the Kaiser Permanente Research Program on Genes, Environment and Health (RPGEH). The array contains 674,517 SNPs, and provides excellent genome-wide as well as gene-based and candidate-SNP coverage. Coverage was calculated using an approach based on imputation and cross validation. Preliminary results for the first 80,301 saliva-derived DNA samples from the RPGEH demonstrate very high quality genotypes, with sample success rates above 94% and over 98% of successful samples having SNP call rates exceeding 98%. At steady state, we have produced 462 million genotypes per week for each Axiom system. The new array provides a valuable addition to the repertoire of tools for large scale genome-wide association studies.
Microarray; Genome-wide association study; Coverage; Throughput; Single nucleotide polymorphism
Non-Hodgkin lymphoma (NHL) is a hematological malignancy of the immune system and, as with autoimmune and inflammatory diseases (ADs), is influenced by genetic variation in the major histocompatibility complex (MHC). Persons with a history of specific ADs also have an increased risk of NHL. As the coexistence of ADs and NHL could be caused by factors common to both diseases, here we examined whether some of the associated genetic signals are shared. Overlapping risk loci for NHL subtypes and several ADs were explored using data from genome-wide association studies. Several common genomic regions and susceptibility loci were identified, suggesting a potential shared genetic background. Two independent MHC regions showed the main overlap, with several alleles in the human leukocyte antigen (HLA) Class II region exhibiting an opposite risk effect for follicular lymphoma and type I diabetes. These results support continued investigation to further elucidate the relationship between lymphoma and autoimmune diseases.
Non-Hodgkin lymphoma; Autoimmune diseases; Genome-wide Association Studies; Human Leukocyte Antigen
Here we report the use of a multi-genome DNA microarray to investigate the genome diversity of Bacillus cereus group members and elucidate the events associated with the emergence of B. anthracis, the causative agent of anthrax, a lethal zoonotic disease. We initially performed directed genome sequencing of seven diverse B. cereus strains to identify novel sequences encoded in those genomes. The novel genes identified, combined with those publicly available, allowed the design of a “species” DNA microarray. Comparative genomic hybridization analyses of 41 strains indicate that substantial heterogeneity exists with respect to the genes comprising functional role categories. While the acquisition of the plasmid-encoded pathogenicity island (pXO1) and capsule genes (pXO2) represents a crucial landmark dictating the emergence of B. anthracis, the evolution of this species and its close relatives was associated with an overall shift in the fraction of genes devoted to energy metabolism, cellular processes, transport, as well as virulence.
While there have been significant advances in understanding the genetic etiology of human hair loss over the previous decade, there remain a number of hereditary disorders for which a causative gene has yet to be identified. We studied a large, consanguineous Brazilian family that presented with sparse woolly hair at birth that progressed to severe hypotrichosis by the age of 5, in which 6 of the 14 offspring were affected. After exclusion of known candidate genes, a genome-wide scan was performed to identify the disease locus. Autozygosity mapping revealed a highly significant region of extended homozygosity (LOD score of 10.41) that contained a haplotype with a linkage LOD score of 3.28. Results of these two methods defined a 9 Mb region on chromosome 13q14.11-q14.2. The interval contains the P2RY5 gene, in which we recently identified pathogenic mutations in several families of Pakistani origin affected with autosomal recessive woolly and sparse hair. After the exclusion of several other candidate genes, we sequenced the P2RY5 gene and identified a homozygous mutation (C278Y) in all affected individuals in this family. Our findings show that mutations in P2RY5 display variable expressivity, underlying both hypotrichosis and woolly hair, and underscore the essential role of P2RY5 in the tissue integrity and the maintenance of the hair follicle.
P2RY5; G-protein coupled receptor; autosomal recessive hypotrichosis; autosomal recessive woolly hair; variable expressivity