The large diversity and volume of extracellular RNA (exRNA) data that will form the basis of the exRNA Atlas generated by the Extracellular RNA Communication Consortium pose a substantial data integration challenge. Here we present the strategy that is being implemented by the exRNA Data Management and Resource Repository, which employs metadata, biomedical ontologies and Linked Data technologies, such as the Resource Description Framework, to integrate a diverse set of exRNA profiles into an exRNA Atlas and enable integrative exRNA analysis. We focus on the following three specific data integration tasks: (a) selection of samples from a virtual biorepository for exRNA profiling and for inclusion in the exRNA Atlas; (b) retrieval of a data slice from the exRNA Atlas for integrative analysis; and (c) interpretation of exRNA analysis results in the context of pathways and networks. As exRNA profiling gains wide adoption in the research community, we anticipate that the strategies discussed here will increasingly be required to enable data reuse and to facilitate integrative analysis of exRNA data.
ERC Consortium; DMRR; exRNA; exRNA Atlas; exRNA Portal
Tissue-specific expression of lincRNAs suggests developmental and cell-type specific functions, yet tissue-specificity was established for only a small fraction of lincRNAs. Here, by analyzing 111 reference epigenomes from the NIH Roadmap Epigenomics project we determine tissue-specific epigenetic regulation for 3,753 (69% examined) lincRNAs, with 54% active in one of the fourteen cell/tissue clusters and an additional 15% in two or three clusters. A larger fraction of lincRNA TSSs are marked in a tissue-specific manner by H3K4me1 than by H3K4me3. The tissue-specific lincRNAs are strongly linked to tissue-specific pathways and undergo distinct chromatin state transitions during cellular differentiation. Polycomb-regulated lincRNAs reside in the bivalent state in embryonic stem cells and many of them undergo H3K27me3-mediated silencing at early stages of differentiation. The exquisitely tissue-specific epigenetic regulation of lincRNAs and the assignment of a majority of them to specific tissue types will inform future studies of this newly discovered class of genes.
The role of intermediate methylation states in DNA is unclear. Here, to comprehensively identify regions of intermediate methylation and their quantitative relationship with gene activity, we apply integrative and comparative epigenomics to 25 human primary cell and tissue samples. We report 18,452 intermediate methylation regions located near 36% of genes and enriched at enhancers, exons, and DNase I hypersensitivity sites. Intermediate methylation regions average 57% methylation, are predominantly allele-independent, and are conserved across individuals and between mouse and human, suggesting a conserved function. At enhancers, these regions have an intermediate level of active chromatin marks and their associated genes have intermediate transcriptional activity. Exonic intermediate methylation correlates with exon inclusion at the level between that of fully methylated and unmethylated exons, highlighting gene context-dependent functions. We conclude that intermediate DNA methylation is a conserved signature of gene regulation and exon usage.
The role of intermediate methylation states in DNA is unclear. Here, to comprehensively identify regions of intermediate methylation and their quantitative relationship with gene activity, we apply integrative and comparative epigenomics to 25 human primary cell and tissue samples. We report 18,452 intermediate methylation regions located near 36% of genes and enriched at enhancers, exons and DNase I hypersensitivity sites. Intermediate methylation regions average 57% methylation, are predominantly allele-independent and are conserved across individuals and between mouse and human, suggesting a conserved function. These regions have an intermediate level of active chromatin marks and their associated genes have intermediate transcriptional activity. Exonic intermediate methylation correlates with exon inclusion at a level between that of fully methylated and unmethylated exons, highlighting gene context-dependent functions. We conclude that intermediate DNA methylation is a conserved signature of gene regulation and exon usage.
Many loci in the mammalian genome are intermediately methylated. Here, by comprehensively identifying these loci and quantifying their relationship with gene activity, the authors show that intermediate methylation is an evolutionarily conserved epigenomic signature of gene regulation.
Tissue-specific expression of lincRNAs suggests developmental and cell-type-specific functions, yet tissue specificity was established for only a small fraction of lincRNAs. Here, by analysing 111 reference epigenomes from the NIH Roadmap Epigenomics project, we determine tissue-specific epigenetic regulation for 3,753 (69% examined) lincRNAs, with 54% active in one of the 14 cell/tissue clusters and an additional 15% in two or three clusters. A larger fraction of lincRNA TSSs is marked in a tissue-specific manner by H3K4me1 than by H3K4me3. The tissue-specific lincRNAs are strongly linked to tissue-specific pathways and undergo distinct chromatin state transitions during cellular differentiation. Polycomb-regulated lincRNAs reside in the bivalent state in embryonic stem cells and many of them undergo H3K27me3-mediated silencing at early stages of differentiation. The exquisitely tissue-specific epigenetic regulation of lincRNAs and the assignment of a majority of them to specific tissue types will inform future studies of this newly discovered class of genes.
Tissue-specific functions have been established for some lincRNAs. Here, by analysing 111 reference epigenomes from the NIH Roadmap Epigenomics project, the authors report tissue-specific epigenomic regulation of 3,753 lincRNAs and their strong connection with tissue-specific pathways.
Epigenomic analysis efforts have so far focused on the multiple layers of epigenomic information within individual cell types. With the rapidly increasing diversity of epigenomically mapped cell types, unprecedented opportunities for comparative analysis of epigenomes are opening up. One such opportunity is to map the bifurcating tree of cellular differentiation. Another is to understand the epigenomically mediated effects of mutations, environmental influences, and disease processes. Comparative analysis of epigenomes therefore has the potential to provide wide-ranging fresh insights into basic biology and human disease. The realization of this potential will critically depend on the availability of a cyberinfrastructure that will scale with the volume of data and diversity of applications, and on addressing a number of other computational challenges.
Interactions between the epigenome and structural genomic variation are potentially bi-directional. In one direction, structural variants may cause epigenomic changes in cis. In the other direction, specific local epigenomic states such as DNA hypomethylation associate with local genomic instability.
To study these interactions, we have developed several tools and exposed them to the scientific community using the Software-as-a-Service model via the Genboree Workbench. One key tool is Breakout, an algorithm for fast and accurate detection of structural variants from mate pair sequencing data.
By applying Breakout and other Genboree Workbench tools we map breakpoints in breast and prostate cancer cell lines and tumors, discriminate between polymorphic breakpoints of germline origin and those of somatic origin, and analyze both types of breakpoints in the context of the Human Epigenome Atlas, ENCODE databases, and other sources of epigenomic profiles. We confirm previous findings that genomic instability in the human germline associates with hypomethylation of DNA, binding sites of Suz12, a key member of the PRC2 Polycomb complex, and with PRC2-associated histone marks H3K27me3 and H3K9me3. Breakpoints in germline and in breast cancer associate with distal regulators of active gene transcription. Breast cancer cell lines and tumors show distinct patterns of structural mutability depending on their ER, PR, or HER2 status.
The patterns of association that we detected suggest that cell-type specific epigenomes may determine cell-type specific patterns of selective structural mutability of the genome.
Summary: Gene fusions are being discovered at an increasing rate using massively parallel sequencing technologies. Prioritization of cancer fusion drivers for validation cannot be performed using traditional single-gene-based methods because fusions involve portions of two partner genes. To address this problem, we propose a novel network analysis method called fusion centrality that is specifically tailored for prioritizing gene fusions. We first propose a domain-based fusion model built on the theory of exon/domain shuffling. The model leads to a hypothesis that a fusion is more likely to be an oncogenic driver if its partner genes act like hubs in a network because the fusion mutation can deregulate normal functions of many other genes and their pathways. The hypothesis is supported by the observation that for most known cancer fusion genes, at least one of the fusion partners appears to be a hub in a network, and for many fusions both partners appear to be hubs. Based on this model, we construct fusion centrality, a multi-gene-based network metric, and use it to score fusion drivers. We show that fusion centrality outperforms other single-gene-based methods. Specifically, the method successfully predicts most of 38 newly discovered fusions that had validated oncogenic importance. To the best of our knowledge, this is the first network-based approach for identifying fusion drivers.
Availability: Matlab code implementing the fusion centrality method is available upon request from the corresponding authors.
Supplementary data are available at Bioinformatics online.
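The hub hypothesis behind fusion centrality can be illustrated with a small sketch. The published metric is not specified in the abstract, so the scoring rule below (the sum of the two partners' normalized degree centralities) is only an assumption for illustration, and the gene names and toy network are hypothetical.

```python
from collections import defaultdict


def degree_centrality(edges):
    """Normalized degree centrality for every node of an undirected graph."""
    neighbors = defaultdict(set)
    for a, b in edges:
        if a != b:  # ignore self-loops
            neighbors[a].add(b)
            neighbors[b].add(a)
    n = len(neighbors)
    if n < 2:
        return {gene: 0.0 for gene in neighbors}
    return {gene: len(adj) / (n - 1) for gene, adj in neighbors.items()}


def fusion_score(centrality, partner_a, partner_b):
    """Illustrative score (an assumption, not the published metric):
    the sum of the two partner genes' centralities."""
    return centrality.get(partner_a, 0.0) + centrality.get(partner_b, 0.0)


# Hypothetical toy interaction network with two hub-like genes.
toy_edges = [
    ("HUB1", "G1"), ("HUB1", "G2"), ("HUB1", "G3"),
    ("HUB2", "G3"), ("HUB2", "G4"), ("G5", "G4"),
]
centrality = degree_centrality(toy_edges)
hub_fusion = fusion_score(centrality, "HUB1", "HUB2")
leaf_fusion = fusion_score(centrality, "G1", "G5")
```

Under this toy rule a fusion joining two hub-like genes scores higher than a fusion joining two peripheral genes, which is the qualitative behavior the hypothesis describes.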
Although our microbial community and genomes (the human microbiome) outnumber our genome by several orders of magnitude, to what extent the human host genetic complement informs the microbiota composition is not clear. The Human Microbiome Project (HMP) Consortium established a unique population-scale framework with which to characterize the relationship of microbial community structures with their human hosts. A wide variety of taxa and metabolic pathways have been shown to be differentially distributed by virtue of race/ethnicity in the HMP. Given that mtDNA haplogroups are maternally derived ancestral genomic markers and that mitochondria generate cellular ATP, characterizing the relationship between human mtDNA genomic variants and microbiome profiles is of potentially marked biologic and clinical interest.
We leveraged sequencing data from the HMP to investigate the association between microbiome community structures and host mtDNA variants. Fifteen haplogroups and 631 mtDNA nucleotide polymorphisms (mean sequencing depth of 280X on the mitochondrial genome) from 89 individuals participating in the HMP were accurately identified. Microbiome taxonomy profiles generated by 16S rRNA (V3-V5 region) sequencing and metabolic profiles generated by whole genome shotgun sequencing from various body sites were treated as traits to conduct association analysis between haplogroups and host clinical metadata through linear regression. The mtSNPs of individuals with European haplogroups were associated with microbiome profiles using PLINK quantitative trait associations with permutation and adjusted for multiple comparisons. We observe that among 139 stool and 59 vaginal posterior fornix samples, several haplogroups show significant association with specific microbiota (q-value < 0.05) as well as their aggregate community structure (Chi-square with Monte Carlo, p < 0.005), which confirms and expands previous research on the association of race and ethnicity with microbiome profile. Our results further indicate that mtDNA variations may render different microbiome profiles, possibly through an inflammatory response to different levels of reactive oxygen species activity.
These data provide initial evidence for the association between the host ancestral genome and the structure of its microbiome.
HMP; Mitochondrial DNA haplogroup; Association; Microbiome; mtDNA SNP
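The adjustment for multiple comparisons mentioned above can be sketched as follows. The abstract does not say which procedure produced the reported q-values, so the Benjamini-Hochberg method used here is an assumption chosen for illustration, and the p-values are made up.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values, in the input order."""
    m = len(pvalues)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Made-up p-values for four hypothetical taxon-haplogroup tests.
example_p = [0.01, 0.04, 0.03, 0.005]
example_q = benjamini_hochberg(example_p)
significant = [q < 0.05 for q in example_q]
```

Each adjusted value can then be compared against a threshold such as the 0.05 used in the abstract.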
Ovarian cancer is the fifth leading cause of cancer death in women. Almost 70% of ovarian cancer deaths are due to the high-grade serous subtype, which is typically detected only after it has metastasized. Characterization of high-grade serous cancer is further complicated by the significant heterogeneity and genome instability displayed by this cancer. Other than mutations in TP53, which are common to many cancers, highly recurrent recombinant events specific to this cancer have yet to be identified. Using high-throughput transcriptome sequencing of seven patient samples combined with experimental validation at DNA, RNA and protein levels, we identified a cancer-specific and inter-chromosomal fusion gene CDKN2D-WDFY2 that occurs at a frequency of 20% among sixty high-grade serous cancer samples but is absent in non-cancerous ovary and fallopian tube samples. This is the most frequent recombinant event identified so far in high-grade serous cancer, implying a major cellular lineage in this highly heterogeneous cancer. In addition, the same fusion transcript was also detected in OV-90, an established high-grade serous type cell line. The genomic breakpoint was identified in intron 1 of CDKN2D and intron 2 of WDFY2 in a patient tumor, providing direct evidence that this is a fusion gene. The parental gene, CDKN2D, is a cell-cycle modulator that is also involved in DNA repair, while WDFY2 is known to modulate AKT interactions with its substrates. Transfection of a cloned fusion construct led to loss of wildtype CDKN2D and wildtype WDFY2 protein expression, and a gain of a short WDFY2 protein isoform that is presumably under the control of the CDKN2D promoter. The expression of short WDFY2 protein in transfected cells appears to alter the PI3K/AKT pathway that is known to play a role in oncogenesis. CDKN2D-WDFY2 fusion could be an important molecular signature for understanding and classifying sub-lineages among heterogeneous high-grade serous ovarian carcinomas.
High-grade serous carcinoma (HG-SC) is the most common subtype of ovarian cancer observed in women. This subtype of ovarian cancer is typically detected at advanced stages due to lack of effective early screening tools. Recurrent cancer-specific gene fusions resulting from chromosomal translocations have the potential to serve as effective screening tools as well as therapeutic targets. Here we identified CDKN2D-WDFY2 as a cancer-specific fusion gene present in 20% of HG-SC tumors, by far the most frequent gene recombinant event found in this highly heterogeneous disease. We also presented evidence that the expression of this fusion may affect the PI3K/AKT pathway that is important for cancer progression. Thus CDKN2D-WDFY2 could very well represent a major cellular lineage important for detecting and classifying heterogeneous ovarian carcinomas, and could provide insight into the underlying mechanism of this deadly disease. This is critical, given that ovarian cancer kills 140,200 women worldwide each year, and few ovarian cancer-specific molecular alterations are currently available for targeting and screening.
Coupling bisulfite conversion with next-generation sequencing (Bisulfite-seq) enables genome-wide measurement of DNA methylation, but poses unique challenges for mapping. However, despite a proliferation of Bisulfite-seq mapping tools, no systematic comparison of their genomic coverage and quantitative accuracy has been reported. We sequenced bisulfite-converted DNA from two tissues from each of two healthy human adults and systematically compared five widely used Bisulfite-seq mapping algorithms: Bismark, BSMAP, Pash, BatMeth and BS Seeker. We evaluated their computational speed and genomic coverage and verified their percentage methylation estimates. With the exception of BatMeth, all mappers covered >70% of CpG sites genome-wide and yielded highly concordant estimates of percentage methylation (r² ≥ 0.95). Fourfold variation in mapping time was found between BSMAP (fastest) and Pash (slowest). In each library, 8–12% of genomic regions covered by Bismark and Pash were not covered by BSMAP. An experiment using simulated reads confirmed that Pash has an exceptional ability to uniquely map reads in genomic regions of structural variation. Independent verification by bisulfite pyrosequencing generally confirmed the percentage methylation estimates by the mappers. Of these algorithms, Bismark provides an attractive combination of processing speed, genomic coverage and quantitative accuracy, whereas Pash offers considerably higher genomic coverage.
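The two quantities compared across mappers, percentage methylation per CpG site and the squared correlation between mappers, can be sketched as below. This is a minimal illustration with hypothetical function names and made-up values, not the evaluation code used in the study.

```python
def percent_methylation(methylated_reads, total_reads):
    """Percentage of reads at a CpG site that report a methylated cytosine.

    Returns None when the site has no coverage.
    """
    if total_reads == 0:
        return None
    return 100.0 * methylated_reads / total_reads


def pearson_r_squared(xs, ys):
    """Squared Pearson correlation between two equally long sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant input")
    return (sxy * sxy) / (sxx * syy)


# Made-up per-site estimates from two hypothetical mappers.
mapper_a = [0.0, 50.0, 100.0, 75.0]
mapper_b = [5.0, 45.0, 95.0, 80.0]
concordance = pearson_r_squared(mapper_a, mapper_b)
```

In practice the comparison would be restricted to CpG sites covered by both mappers, since uncovered sites have no estimate.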
Autism is a neurodevelopmental disorder with increasing evidence of heterogeneous genetic etiology including de novo and inherited copy number variants (CNVs). We performed array comparative genomic hybridization using a custom Agilent 1 M oligonucleotide array intended to cover 197 332 unique exons in RefSeq genes; 98% were covered by at least one probe and 95% were covered by three or more probes with the focus on detecting relatively small CNVs that would implicate a single protein-coding gene. The study group included 99 trios from the Simons Simplex Collection. The analysis identified and validated 55 potentially pathogenic CNVs, categorized as de novo autosomal heterozygous, inherited homozygous autosomal, complex autosomal and hemizygous deletions on the X chromosome of probands. Twenty percent (11 of 55) of these CNV calls were rare when compared with the Database of Genomic Variants. Thirty-six percent (20 of 55) of the CNVs were also detected in the same samples in an independent analysis using the 1 M Illumina single-nucleotide polymorphism array. Findings of note included a common and sometimes homozygous 61 bp exonic deletion in SLC38A10, three CNVs found in lymphoblast-derived DNA but not present in whole-blood derived DNA and, most importantly, in a male proband, an exonic deletion of the TMLHE (trimethyllysine hydroxylase epsilon) gene, which encodes the first enzyme in the biosynthesis of carnitine. Data for CNVs present in lymphoblasts but absent in fresh blood DNA suggest that these represent clonal outgrowth of individual B cells with pre-existing somatic mutations rather than artifacts arising in cell culture. GEO accession number GSE23765 (http://www.ncbi.nlm.nih.gov/geo/, date last accessed on 30 August 2011). Genboree accession: http://genboree.org/java-bin/gbrowser.jsp?refSeqId=1868&entryPointId=chr17&from=53496072&to=53694382&isPublic=yes, date last accessed on 30 August 2011.
BACKGROUND AND AIMS
The intestinal microbiomes of healthy children and pediatric patients with irritable bowel syndrome (IBS) are not well defined. Studies in adults have indicated that the gastrointestinal microbiota could be involved in IBS.
We analyzed 71 samples from 22 children with IBS (pediatric Rome III criteria) and 22 healthy children, ages 7–12 years, by 16S rRNA gene sequencing, with an average of 54,287 reads/stool sample (average 454 read length = 503 bases). Data were analyzed using phylogenetic-based clustering (Unifrac), or an operational taxonomic unit (OTU) approach using a supervised machine learning tool (randomForest). Most samples were also hybridized to a microarray that can detect 8,741 bacterial taxa (16S rRNA PhyloChip).
Microbiomes associated with pediatric IBS were characterized by a significantly greater percentage of the class Gammaproteobacteria (0.07% vs 0.89% of total bacteria; P <.05); one prominent component of this group was Haemophilus parainfluenzae. Differences highlighted by 454 sequencing were confirmed by high-resolution PhyloChip analysis. Using supervised learning techniques, we were able to classify different subtypes of IBS with a success rate of 98.5%, using limited sets of discriminant bacterial species. A novel Ruminococcus-like microbe was associated with IBS, indicating the potential utility of microbe discovery for gastrointestinal disorders. A greater frequency of pain correlated with an increased abundance of several bacterial taxa from the genus Alistipes.
Using 16S metagenomics by PhyloChip DNA hybridization and deep 454 pyrosequencing, we associated specific microbiome signatures with pediatric IBS. These findings indicate the important association between gastrointestinal microbes and IBS in children; these approaches might be used in diagnosis of functional bowel disorders in pediatric patients.
Until recently, sequencing has primarily been carried out in large genome centers which have invested heavily in developing the computational infrastructure that enables genomic sequence analysis. The recent advancements in next generation sequencing (NGS) have led to a wide dissemination of sequencing technologies and data to highly diverse research groups. It is expected that clinical sequencing will become part of diagnostic routines shortly. However, limited accessibility to computational infrastructure and high quality bioinformatic tools, and the demand for personnel skilled in data analysis and interpretation, remain serious bottlenecks. Cloud computing and Software-as-a-Service (SaaS) technologies can help address these issues.
We successfully enabled the Atlas2 Cloud pipeline for personal genome analysis on two different cloud service platforms: a community cloud via the Genboree Workbench, and a commercial cloud via the Amazon Web Services using Software-as-a-Service model. We report a case study of personal genome analysis using our Atlas2 Genboree pipeline. We also outline a detailed cost structure for running Atlas2 Amazon on whole exome capture data, providing cost projections in terms of storage, compute and I/O when running Atlas2 Amazon on a large data set.
We find that providing a web interface and an optimized pipeline clearly facilitates usage of cloud computing for personal genome analysis, but for it to be routinely used for large scale projects there needs to be a paradigm shift in the way we develop tools, in standard operating procedures, and in funding mechanisms.
Microbial metagenomic analyses rely on an increasing number of publicly available tools. Installation, integration, and maintenance of the tools pose a significant burden on many researchers and create a barrier to adoption of microbiome analysis, particularly in translational settings.
To address this need we have integrated a rich collection of microbiome analysis tools into the Genboree Microbiome Toolset and exposed them to the scientific community using the Software-as-a-Service model via the Genboree Workbench. The Genboree Microbiome Toolset provides an interactive environment for users at all bioinformatic experience levels in which to conduct microbiome analysis. The Toolset drives hypothesis generation by providing a wide range of analyses including alpha diversity and beta diversity, phylogenetic profiling, supervised machine learning, and feature selection.
We validate the Toolset in two studies of the gut microbiota, one involving obese and lean twins, and the other involving children suffering from the irritable bowel syndrome.
By lowering the barrier to performing a comprehensive set of microbiome analyses, the Toolset empowers investigators to translate high-volume sequencing data into valuable biomedical discoveries.
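The alpha and beta diversity analyses listed among the Toolset's capabilities can be illustrated with a short sketch. The abstract does not name the specific indices, so the choices here (observed richness and the Shannon index for alpha diversity, Bray-Curtis dissimilarity for beta diversity) are assumptions, and the counts are made up.

```python
import math


def richness(counts):
    """Alpha diversity: number of taxa observed at least once."""
    return sum(1 for c in counts if c > 0)


def shannon_index(counts):
    """Alpha diversity: Shannon index (natural log) of a count vector."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)


def bray_curtis(counts_a, counts_b):
    """Beta diversity: Bray-Curtis dissimilarity between two samples."""
    if len(counts_a) != len(counts_b):
        raise ValueError("samples must list the same taxa in the same order")
    total = sum(counts_a) + sum(counts_b)
    if total == 0:
        return 0.0
    shared = sum(min(a, b) for a, b in zip(counts_a, counts_b))
    return 1.0 - 2.0 * shared / total


# Made-up OTU counts for two hypothetical samples.
sample_1 = [40, 30, 20, 10, 0]
sample_2 = [90, 5, 5, 0, 0]
alpha_1 = shannon_index(sample_1)
alpha_2 = shannon_index(sample_2)
beta_12 = bray_curtis(sample_1, sample_2)
```

A sample dominated by one taxon, like `sample_2`, has a lower Shannon index than a more even sample, and the pairwise dissimilarities form the matrix that ordination methods take as input.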
Cancer genomes frequently undergo genomic instability resulting in accumulation of chromosomal rearrangements. To date, one of the main challenges has been to confidently and accurately identify these rearrangements using short-read massively parallel sequencing. We were able to improve cancer rearrangement detection by combining two distinct massively parallel sequencing strategies: fosmid-sized (36 kilobases on average) and standard 5 kilobase mate pair libraries. We applied this strategy to map rearrangements in two breast cancer cell lines, MCF7 and HCC1954. We detect and validate a total of 91 somatic rearrangements in MCF7 and 25 in HCC1954, including genomic alterations corresponding to previously reported transcript aberrations in these two cell lines. Each of the genomes contains two types of breakpoints – clustered and dispersed. In both cell lines, the dispersed breakpoints show enrichment for low copy repeats, while the clustered breakpoints associate with high-copy number amplifications. Comparing the two genomes, we observe highly similar structural mutational spectra affecting different sets of genes, pointing to similar histories of genomic instability against the background of very different gene network perturbations.
fosmid ditag; massively parallel sequencing; gene fusion; copy number variation; genomic instability
Dynamic changes to the epigenome play a critical role in establishing and maintaining cellular phenotype during differentiation, but little is known about the normal methylomic differences that occur between functionally distinct areas of the brain. We characterized intra- and inter-individual methylomic variation across whole blood and multiple regions of the brain from multiple donors.
Distinct tissue-specific patterns of DNA methylation were identified, with a highly significant over-representation of tissue-specific differentially methylated regions (TS-DMRs) observed at intragenic CpG islands and low CG density promoters. A large proportion of TS-DMRs were located near genes that are differentially expressed across brain regions. TS-DMRs were significantly enriched near genes involved in functional pathways related to neurodevelopment and neuronal differentiation, including BDNF, BMP4, CACNA1A, CACA1AF, EOMES, NGFR, NUMBL, PCDH9, SLIT1, SLITRK1 and SHANK3. Although between-tissue variation in DNA methylation was found to greatly exceed between-individual differences within any one tissue, we found that some inter-individual variation was reflected across brain and blood, indicating that peripheral tissues may have some utility in epidemiological studies of complex neurobiological phenotypes.
This study reinforces the importance of DNA methylation in regulating cellular phenotype across tissues, and highlights genomic patterns of epigenetic variation across functionally distinct regions of the brain, providing a resource for the epigenetics and neuroscience research communities.
While current major national research efforts (i.e., the NIH Human Microbiome Project) will enable comprehensive metagenomic characterization of the adult human microbiota, how and when these diverse microbial communities take up residence in the host and during reproductive life are unexplored at a population level. Because microbial abundance and diversity might differ in pregnancy, we sought to generate comparative metagenomic signatures across gestational age strata. DNA was isolated from the vagina (introitus, posterior fornix, midvagina) and the V5V3 region of bacterial 16S rRNA genes was sequenced (454FLX Titanium platform). Sixty-eight samples from 24 healthy gravidae (18 to 40 confirmed weeks) were compared with 301 non-pregnant controls (60 subjects). Generated sequence data were quality filtered, taxonomically binned, normalized, and organized by phylogeny and into operational taxonomic units (OTU); principal coordinates analysis (PCoA) of the resultant beta diversity measures was used for visualization and analysis in association with sample clinical metadata. Altogether, 1.4 gigabytes of data containing >2.5 million reads (averaging 6,837 sequences/sample of 493 nt in length) were generated for computational analyses. Although gravidae were not excluded by virtue of a posterior fornix pH >4.5 at the time of screening, a unique vaginal microbiome signature encompassing several specific OTUs and higher-level clades was nevertheless observed and confirmed using a combination of phylogenetic, non-phylogenetic, supervised, and unsupervised approaches. Both overall diversity and richness were reduced in pregnancy, with dominance of Lactobacillus species (L. iners, crispatus, jensenii and johnsonii) and the orders Lactobacillales (and Lactobacillaceae family), Clostridiales, Bacteroidales, and Actinomycetales.
This intergroup comparison using rigorous standardized sampling protocols and analytical methodologies provides robust initial evidence that the vaginal microbial 16S rRNA gene catalogue uniquely differs in pregnancy, with variance of taxa across vaginal subsite and gestational age.
Fuelled by new sequencing technologies, epigenome mapping projects are revealing epigenomic variation at all levels of biological complexity, from species to cells. Comparisons of methylation profiles among species reveal evolutionary conservation of gene body methylation patterns, pointing to the fundamental role of epigenomes in gene regulation. At the human population level, epigenomic changes provide footprints of the effects of genomic variants within the vast non-protein coding fraction of the genome while comparisons of the epigenomes of parents and their offspring point to quantitative epigenomic parent-of-origin effects confounding classical Mendelian genetics. At the organismal level, comparisons of epigenomes from diverse cell types provide insights into cellular differentiation. Finally, comparisons of epigenomes from monozygotic twins help dissect genetic and environmental influences on human phenotypes and longitudinal comparisons reveal aging-associated epigenomic drift. The development of new bioinformatic frameworks for comparative epigenome analysis is putting epigenome maps within reach of researchers across a wide spectrum of biological disciplines.
The hotspots of structural polymorphisms and structural mutability in the human genome remain to be explained mechanistically. We examine associations of structural mutability with germline DNA methylation and with non-allelic homologous recombination (NAHR) mediated by low-copy repeats (LCRs). Combined evidence from four human sperm methylome maps, human genome evolution, structural polymorphisms in the human population, and previous genomic and disease studies consistently points to a strong association of germline hypomethylation and genomic instability. Specifically, methylation deserts, the ∼1% fraction of the human genome with the lowest methylation in the germline, show a tenfold enrichment for structural rearrangements that occurred in the human genome since the branching of chimpanzee and are highly enriched for fast-evolving loci that regulate tissue-specific gene expression. Analysis of copy number variants (CNVs) from 400 human samples identified using a custom-designed array comparative genomic hybridization (aCGH) chip, combined with publicly available structural variation data, indicates that association of structural mutability with germline hypomethylation is comparable in magnitude to the association of structural mutability with LCR–mediated NAHR. Moreover, rare CNVs occurring in the genomes of individuals diagnosed with schizophrenia, bipolar disorder, and developmental delay and de novo CNVs occurring in those diagnosed with autism are significantly more concentrated within hypomethylated regions. These findings suggest a new connection between the epigenome, selective mutability, evolution, and human disease.
The human genome contains many loci with high incidence of structural mutations, including insertions and deletions of chromosomal segments. This excessive mutability has accelerated evolution and contributed to human disease but has yet to be explained. Segments of DNA repeated in low-copy numbers (LCRs) have been previously implicated in promoting structural mutability in specific disease-associated loci. Lack of methylation (hypomethylation) of genomic DNA has been previously associated with high structural mutability in gibbons and in human cancer cells, but the association with structural mutability in the human germline has not been explored prior to this study. Our analyses confirm the role of LCRs in promoting structural mutability on the genome scale but also reveal a surprisingly strong association of genomic instability with hypomethylation. Specifically, evolutionary analyses reveal that methylation deserts, the ∼1% fraction of the human genome with the lowest methylation in human sperm, harbor a tenfold higher number of structural mutations than genome-wide average. Moreover, the structural mutations in individuals diagnosed with schizophrenia, bipolar disorder, developmental delay, and autism are significantly more concentrated within hypomethylated regions. Our findings suggest a new connection between methylation of genomic DNA, selective structural mutability, evolution, and human disease.
Gibbons (Hylobatidae) shared a common ancestor with the other hominoids only 15–18 million years ago. Nevertheless, gibbons show very distinctive features that include heavily rearranged chromosomes. Previous observations indicate that this phenomenon may be linked to the attenuated epigenetic repression of transposable elements (TEs) in gibbon species. Here we describe the massive expansion of a repeat in almost all the centromeres of the eastern hoolock gibbon (Hoolock leuconedys). We discovered that this repeat is a new composite TE originating from the combination of portions of three other elements (L1ME5, AluSz6, and SVA_A) and thus named it LAVA. We determined that this repeat is found in all the gibbons but does not occur in other hominoids. Detailed investigation of 46 different LAVA elements revealed that the majority of them have target site duplications (TSDs) and a poly-A tail, suggesting that they have been retrotransposing in the gibbon genome. Although we did not find a direct correlation between the emergence of LAVA elements and human–gibbon synteny breakpoints, this new composite transposable element is another mark of the great plasticity of the gibbon genome. Moreover, the centromeric expansion of LAVA insertions in the hoolock closely resembles the massive centromeric expansion of the KERV-1 retroelement reported for wallaby (marsupial) interspecific hybrids. The similarity between the two phenomena is consistent with the hypothesis that evolution of the gibbons is characterized by defects in epigenetic repression of TEs, perhaps triggered by interspecific hybridization.
gibbon; centromere; transposable element; SVA; hybrid
Whole exome capture sequencing allows researchers to cost-effectively sequence the coding regions of the genome. Although the exome capture sequencing methods have become routine and well established, there is currently a lack of tools specialized for variant calling in this type of data.
The Atlas2 Suite is an integrative variant analysis pipeline that uses statistical models trained on validated whole-exome capture sequencing data and is optimized for variant discovery on all three of the widely used next-generation sequencing platforms (SOLiD, Illumina, and Roche 454). The suite employs logistic regression models in conjunction with user-adjustable cutoffs to accurately separate true SNPs and INDELs from sequencing and mapping errors with high sensitivity (96.7%).
We have implemented the Atlas2 Suite and applied it to 92 whole exome samples from the 1000 Genomes Project. The Atlas2 Suite is available for download at http://sourceforge.net/projects/atlas2/. In addition to a command line version, the suite has been integrated into the Genboree Workbench, allowing biomedical scientists with minimal informatics expertise to remotely call, view, and further analyze variants through a simple web interface. The existing genomic databases displayed via the Genboree browser also streamline the process from variant discovery to functional genomics analysis, resulting in an off-the-shelf toolkit for the broader community.
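The general idea of combining a logistic regression model with a user-adjustable cutoff can be sketched as follows. This is an illustrative sketch only and does not reproduce the Atlas2 models: the feature names, the coefficients in `COEFFS`, and the default cutoff are all hypothetical assumptions, not values taken from the suite.

```python
import math

# Hypothetical coefficients for illustration; NOT the trained Atlas2 model.
COEFFS = {
    "intercept": -4.0,
    "base_quality": 0.1,           # mean base quality at the variant site
    "variant_read_fraction": 6.0,  # fraction of reads supporting the variant
}


def variant_probability(features, coeffs=COEFFS):
    """Logistic model: p = 1 / (1 + exp(-(b0 + sum(b_i * x_i))))."""
    z = coeffs["intercept"] + sum(
        coeffs[name] * value for name, value in features.items()
    )
    return 1.0 / (1.0 + math.exp(-z))


def call_variant(features, cutoff=0.5):
    """Report a candidate as a variant when its probability meets the cutoff."""
    return variant_probability(features) >= cutoff


candidate = {"base_quality": 40.0, "variant_read_fraction": 0.5}
print(variant_probability(candidate))
print(call_variant(candidate))               # default cutoff
print(call_variant(candidate, cutoff=0.99))  # stricter, user-chosen cutoff
```

Raising the cutoff trades sensitivity for specificity, which is why exposing it to the user is a reasonable design choice for data sets with different error profiles.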
Small non-coding RNAs, such as microRNAs (miRNAs), are involved in diverse biological processes including organ development and tissue differentiation. Global disruption of miRNA biogenesis in Dicer knockout mice disrupts early embryogenesis and primordial germ cell formation. However, the role of miRNAs in early folliculogenesis is poorly understood. To identify the full set of small RNAs expressed in the newborn (NB) ovary, we extracted the small RNA fraction from mouse NB ovary tissues and subjected it to massively parallel sequencing using the Genome Analyzer from Illumina. Sequencing produced 4 655 992 reads of 33 bp each, representing a total of 154 Mbp of sequence data. The Pash alignment algorithm mapped 50.13% of the reads to the mouse genome. Sequence reads were clustered based on overlapping mapping coordinates and intersected with known miRNAs, small nucleolar RNAs (snoRNAs), piwi-interacting RNA (piRNA) clusters and repetitive genomic regions; 25.2% of the reads mapped to known miRNAs, 25.5% to genomic repeats, 3.5% to piRNAs and 0.18% to snoRNAs. The sequenced small RNAs included 398 known miRNA species and 118 isomiR sequences that are not in the miRBase database. The let-7 family was the most abundantly expressed miRNA family, and the mmu-mir-672, mmu-mir-322, mmu-mir-503 and mmu-mir-465 families were the most abundant X-linked miRNAs detected. The X-linked mmu-mir-503, mmu-mir-672 and mmu-mir-465 families showed preferential expression in testes and ovaries. We also identified four novel miRNAs that are preferentially expressed in gonads. Gonad-selective miRNAs may play important roles in ovarian development, folliculogenesis and female fertility.
miRNA; ovary; oocyte; microRNA; ncRNA
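The step of clustering reads by overlapping mapping coordinates and intersecting the clusters with annotated features can be illustrated with a short interval-merging sketch. It is a simplified illustration under stated assumptions, not the procedure used in the study: the function names `cluster_reads` and `annotate`, the half-open coordinate convention, and the example coordinates and labels are all hypothetical.

```python
def cluster_reads(reads):
    """Merge mapped reads with overlapping coordinates into clusters.

    reads: iterable of (chrom, start, end) tuples, half-open coordinates.
    Returns a list of (chrom, start, end, read_count) tuples.
    """
    clusters = []
    for chrom, start, end in sorted(reads):
        last = clusters[-1] if clusters else None
        if last and last[0] == chrom and start < last[2]:
            # Read overlaps the current cluster: extend it and count the read.
            clusters[-1] = (chrom, last[1], max(last[2], end), last[3] + 1)
        else:
            clusters.append((chrom, start, end, 1))
    return clusters


def annotate(clusters, annotations):
    """Attach the labels of all annotations that overlap each cluster.

    annotations: list of (chrom, start, end, label) tuples.
    """
    annotated = []
    for chrom, start, end, count in clusters:
        labels = sorted(
            {
                label
                for a_chrom, a_start, a_end, label in annotations
                if a_chrom == chrom and a_start < end and start < a_end
            }
        )
        annotated.append((chrom, start, end, count, labels))
    return annotated


# Hypothetical 33 bp reads and annotations, for illustration only.
reads = [("chr1", 100, 133), ("chr1", 120, 153), ("chr1", 500, 533), ("chrX", 100, 133)]
features = [("chr1", 90, 160, "miRNA"), ("chrX", 50, 120, "repeat")]
for row in annotate(cluster_reads(reads), features):
    print(row)
```

The linear scan over annotations keeps the sketch short; for genome-scale data an interval tree or a sorted sweep would be the more practical choice.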