Structural variation is variation in the structure of DNA regions that affects DNA sequence length and/or orientation. It generally includes deletions, insertions, copy-number gains, inversions, and transposable elements. Traditionally, the identification of structural variation in genomes has been challenging. However, with recent advances in high-throughput DNA sequencing and paired-end mapping (PEM) methods, the ability to identify structural variants and their association with human diseases has improved considerably. In this review, we describe our current knowledge of structural variation in the mouse, one of the prime model systems for studying human diseases and mammalian biology. We further present the evolutionary implications of structural variation on transposable elements. We conclude with future directions on the study of structural variation in mouse genomes that will increase our understanding of the molecular architecture and functional consequences of structural variation.
array comparative genome hybridization (aCGH); next-generation sequencing (NGS); structural variation (SV); paired-end mapping (PEM); inbred strains of mice; Heterogeneous Stock (HS); Sanger Mouse Genomes Project
Carnitine is a key molecule in energy metabolism that helps transport activated fatty acids into the mitochondria. Its homeostasis is achieved through oral intake, renal reabsorption and de novo biosynthesis. Unlike dietary intake and renal reabsorption, the importance of the de novo biosynthesis pathway in carnitine homeostasis remains unclear, because animal models are lacking and only a single patient defective in this pathway has been described.
Using array comparative genomic hybridization, we identified a 42-month-old girl homozygous for a 221 kb interstitial deletion at 11p14.2 that overlaps the genes encoding Fibin and butyrobetaine-gamma 2-oxoglutarate dioxygenase 1 (BBOX1), an enzyme essential for the de novo biosynthesis of carnitine. She presented with microcephaly, speech delay, growth retardation and minor facial anomalies. The levels of almost all evaluated metabolites were normal. Her serum level of free carnitine was at the lower limit of the reference range, while her acylcarnitine to free carnitine ratio was normal.
We present an individual with completely defective de novo carnitine biosynthesis. This condition results in a mildly decreased free carnitine level, but not in the clinical manifestations characteristic of carnitine deficiency disorders, suggesting that dietary carnitine intake and renal reabsorption are sufficient for carnitine homeostasis. Our results also demonstrate that haploinsufficiency of BBOX1 and/or Fibin is not associated with Primrose syndrome, as previously suggested.
Carnitine; BBOX1; Fibin; CNV; Primrose syndrome
Copy number variants (CNVs) influence the expression of genes that map not only within the rearrangement, but also to its flanks. To assess the possible mechanism(s) underlying this “neighboring effect”, we compared intrachromosomal interactions and histone modifications in cell lines of patients affected by genomic disorders and control individuals. Using chromosome conformation capture (4C-seq), we observed that a set of genes flanking the Williams-Beuren Syndrome critical region (WBSCR) were often looping together. The newly identified interacting genes include AUTS2, mutations of which are associated with autism and intellectual disabilities. Deletion of the WBSCR disrupts the expression of this group of flanking genes, as well as long-range interactions between them and the rearranged interval. We also pinpointed concomitant changes in histone modifications between samples.
We conclude that large genomic rearrangements can lead to chromatin conformation changes that extend far away from the structural variant, thereby possibly modulating expression globally and modifying the phenotype.
GEO Series accession numbers: GSE33784, GSE33867.
The study of transcription using genomic tiling arrays has led to the identification of numerous additional exons. One example is the MECP2 gene on the X chromosome; using 5’RACE and RT-PCR in human tissues and cell lines, we have found more than 70 novel exons (RACEfrags) connecting to at least one annotated exon. We sequenced all MECP2-connected exons and flanking sequences in 3 groups: 46 patients with Rett syndrome and without mutations in the currently annotated exons of the MECP2 and CDKL5 genes; 32 patients with Rett syndrome and identified mutations in the MECP2 gene; and 100 control individuals from the same geoethnic group. Approximately 13 kb were sequenced per sample (2.4 Mb of DNA resequenced in total). A total of 75 individuals had novel rare variants (mostly private variants), but no statistically significant difference was found among the 3 groups. These results suggest that variants in the newly discovered exons may not contribute to Rett syndrome. Interestingly, however, there are about twice as many variants in the novel exons as in the flanking sequences (44 vs. 21 for approximately 1.3 Mb sequenced for each class of sequences, p = 0.0025). Thus the evolutionary forces that shape these novel exons may be different from those of neighboring sequences.
MECP2; Rett syndrome; RACEfrags; SNP; rare variants; positive selection
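The variant counts reported above (44 in novel exons vs. 21 in flanking sequences, for roughly equal amounts of sequence) can be sanity-checked with an exact binomial test. The abstract does not state which test produced p = 0.0025, so the following is only a sketch. It assumes that, if both sequence classes had the same variant rate and the same sequenced length, each variant would fall in either class with probability 0.5. The function name is illustrative.

```python
from math import comb

def binomial_two_sided_p(k: int, n: int) -> float:
    """Exact two-sided binomial p-value for k successes in n trials at p = 0.5.

    With p = 0.5 the distribution is symmetric, so the two-sided p-value is
    twice the smaller tail probability (capped at 1).
    """
    upper = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    lower = sum(comb(n, i) for i in range(0, k + 1)) / 2 ** n
    return min(1.0, 2 * min(upper, lower))

novel_exon_variants = 44   # count taken from the abstract
flanking_variants = 21     # count taken from the abstract
total = novel_exon_variants + flanking_variants

p_value = binomial_two_sided_p(novel_exon_variants, total)
print(f"{novel_exon_variants}/{total} variants in novel exons, two-sided p = {p_value:.4f}")
```

The exact value depends on the test chosen (one-sided or two-sided, exact or chi-square), so it need not match the published p-value.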
Eukaryotic cells make many types of primary and processed RNAs that are found either in specific sub-cellular compartments or throughout the cells. A complete catalogue of these RNAs is not yet available and their characteristic sub-cellular localizations are also poorly understood. Since RNA represents the direct output of the genetic information encoded by genomes and a significant proportion of a cell’s regulatory capabilities are focused on its synthesis, processing, transport, modifications and translation, the generation of such a catalogue is crucial for understanding genome function. Here we report evidence that three quarters of the human genome is capable of being transcribed, as well as observations about the range and levels of expression, localization, processing fates, regulatory regions and modifications of almost all currently annotated and thousands of previously unannotated RNAs. These observations taken together prompt a redefinition of the concept of a gene.
It is currently unclear whether tissue changes surrounding multifocal epithelial tumors are a cause or consequence of cancer. Here, we provide evidence that loss of mesenchymal Notch/CSL signaling causes tissue alterations, including stromal atrophy and inflammation, which precede and are potent triggers for epithelial tumors. Mice carrying a mesenchymal-specific deletion of CSL/RBP-Jκ, a key Notch effector, exhibit spontaneous multifocal keratinocyte tumors that develop after dermal atrophy and inflammation. CSL-deficient dermal fibroblasts promote increased tumor cell proliferation through up-regulation of c-Jun and c-Fos expression and consequently higher levels of diffusible growth factors, inflammatory cytokines, and matrix remodeling enzymes. In human skin samples, stromal fields adjacent to cutaneous squamous cell carcinomas and multifocal premalignant actinic keratosis lesions exhibit decreased Notch/CSL signaling and associated molecular changes. Importantly, these changes in gene expression are also induced by UVA, a known environmental cause of cutaneous field cancerization and skin cancer.
epithelial-mesenchymal interactions; epithelial cancer; Cancer Associated Fibroblasts; in situ carcinoma; actinic keratosis; Notch; AP-1
The recurrent ∼600 kb 16p11.2 BP4-BP5 deletion is among the most frequent known genetic aetiologies of autism spectrum disorder (ASD) and related neurodevelopmental disorders.
To define the medical, neuropsychological, and behavioural phenotypes in carriers of this deletion.
We collected clinical data on 285 deletion carriers and performed detailed evaluations on 72 carriers and 68 intrafamilial non-carrier controls.
When compared to intrafamilial controls, full scale intelligence quotient (FSIQ) is two standard deviations lower in carriers, and there is no difference between carriers referred for neurodevelopmental disorders and carriers identified through cascade family testing. Verbal IQ (mean 74) is lower than non-verbal IQ (mean 83) and a majority of carriers require speech therapy. Over 80% of individuals exhibit psychiatric disorders including ASD, which is present in 15% of the paediatric carriers. Increase in head circumference (HC) during infancy is similar to the HC and brain growth patterns observed in idiopathic ASD. Obesity, a major comorbidity present in 50% of the carriers by the age of 7 years, does not correlate with FSIQ or any behavioural trait. Seizures are present in 24% of carriers and occur independently of other symptoms. Malformations are infrequently found, confirming only a few of the previously reported associations.
The 16p11.2 deletion affects FSIQ, behaviour and body mass index in a quantitative and independent manner, possibly through direct influences on neural circuitry. Although non-specific, these features are clinically significant and reproducible. Lastly, this study demonstrates the necessity of studying large patient cohorts ascertained through multiple methods to characterise the clinical consequences of rare variants involved in common diseases.
Clinical genetics; Obesity; Psychiatry; Complex traits
Transposable elements, as major components of most eukaryotic organisms' genomes, define their structural organization and plasticity. They supply host genomes with functional elements; for example, binding sites of the pleiotropic master transcription factor p53 were identified in LINE1, Alu and LTR repeats in the human genome. Similarly, in this report we reveal the role of the zebrafish (Danio rerio) EnSpmN6_DR non-autonomous DNA transposon in shaping the repertoire of p53 target genes. The multiple copies of EnSpmN6_DR and their embedded p53 responsive elements in several instances drive p53-dependent transcriptional modulation of adjacent genes, whose human orthologs have frequently been annotated as p53 targets. These transposons predominantly define a set of target genes whose human orthologs contribute to neuronal morphogenesis, axonogenesis, synaptic transmission and the regulation of programmed cell death. Consistent with these biological functions, the orthologs of the EnSpmN6_DR-colonized loci are enriched for genes expressed in the amygdala, the hippocampus and the brain cortex. Our data pinpoint a remarkable example of convergent evolution: the exaptation of lineage-specific transposons to shape p53-regulated neuronal morphogenesis-related pathways in both a hominid and a teleost fish.
Pseudogenes have long been considered as nonfunctional genomic sequences. However, recent evidence suggests that many of them might have some form of biological activity, and the possibility of functionality has increased interest in their accurate annotation and integration with functional genomics data.
As part of the GENCODE annotation of the human genome, we present the first genome-wide pseudogene assignment for protein-coding genes, based on both large-scale manual annotation and in silico pipelines. A key aspect of this coupled approach is that it allows us to identify pseudogenes in an unbiased fashion as well as untangle complex events through manual evaluation. We integrate the pseudogene annotations with the extensive ENCODE functional genomics information. In particular, we determine the expression level, transcription-factor and RNA polymerase II binding, and chromatin marks associated with each pseudogene. Based on their distribution, we develop simple statistical models for each type of activity, which we validate with large-scale RT-PCR-Seq experiments. Finally, we compare our pseudogenes with conservation and variation data from primate alignments and the 1000 Genomes project, producing lists of pseudogenes potentially under selection.
At one extreme, some pseudogenes possess conventional characteristics of functionality; these may represent genes that have recently died. On the other hand, we find interesting patterns of partial activity, which may suggest that dead genes are being resurrected as functioning non-coding RNAs. The activity data of each pseudogene are stored in an associated resource, psiDR, which will be useful for the initial identification of potentially functional pseudogenes.
In this study we report that, in response to proteasome inhibition, the E3 ubiquitin ligase TRIM50 localizes to the aggresome and promotes the recruitment and aggregation of polyubiquitinated proteins there. Using Hdac6-deficient mouse embryo fibroblasts (MEFs), we show that this localization is mediated by histone deacetylase 6 (HDAC6). Trim50-deficient MEFs, in turn, allowed us to show that the TRIM50 ubiquitin ligase regulates the clearance of polyubiquitinated proteins localized to the aggresome. Finally, we demonstrate that TRIM50 colocalizes with, interacts with and increases the level of p62, a multifunctional adaptor protein implicated in various cellular processes including the autophagic clearance of polyubiquitinated protein aggregates. We speculate that when proteasome activity is impaired, TRIM50 fails to drive its substrates to proteasome-mediated degradation and instead promotes their storage in the aggresome for subsequent clearance.
The classic organization of a gene structure has followed the Jacob and Monod bacterial gene model proposed more than 50 years ago. Since then, empirical determinations of the complexity of the transcriptomes found from yeast to human have blurred the definition and physical boundaries of genes. Using multiple analysis approaches, we have characterized individual gene boundaries mapping on human chromosomes 21 and 22. Analyses of the locations of the 5′ and 3′ transcriptional termini of 492 protein-coding genes revealed that for 85% of these genes the boundaries extend beyond the current annotated termini, most often connecting with exons of transcripts from other well-annotated genes. The biological and evolutionary importance of these chimeric transcripts is underscored by (1) the non-random interconnections of genes involved, (2) the greater phylogenetic depth of the genes involved in many chimeric interactions, (3) the coordination of the expression of connected genes and (4) the close in vivo and three-dimensional proximity of the genomic regions being transcribed and contributing to parts of the chimeric RNAs. The non-random nature of the connection of the genes involved suggests that chimeric transcripts should not be studied in isolation, but together, as an RNA network.
Kabuki syndrome (Niikawa-Kuroki syndrome) is a rare, multiple congenital anomalies/mental retardation syndrome characterized by a peculiar face, short stature, skeletal, visceral and dermatoglyphic abnormalities, cardiac anomalies, and immunological defects. Recently mutations in the histone methyl transferase MLL2 gene have been identified as its underlying cause.
Genomic DNAs were extracted from 62 index patients clinically diagnosed as affected by Kabuki syndrome. Sanger sequencing was performed to analyze the whole coding region of the MLL2 gene including intron-exon junctions. The putative causal and possible functional effect of each nucleotide variant identified was estimated by in silico prediction tools.
We identified 45 patients with MLL2 nucleotide variants. Thirty-eight of the 42 variants had not been described before. Consistent with previous reports, the majority are nonsense or frameshift mutations predicted to generate a truncated polypeptide. We also identified 3 indels, 7 missense variants and 3 splice-site variants.
This study emphasizes the relevance of mutational screening of the MLL2 gene among patients diagnosed with Kabuki syndrome. The identification of a large spectrum of MLL2 mutations may offer the opportunity to improve current knowledge of the clinical basis of this multiple congenital anomalies/mental retardation syndrome, design functional studies to understand the molecular mechanisms underlying this disease, establish genotype-phenotype correlations and improve clinical management.
Alternative splicing (AS) has the potential to greatly expand the functional repertoire of mammalian transcriptomes. However, few variant transcripts have been characterized functionally, making it difficult to assess the contribution of AS to the generation of phenotypic complexity and to study the evolution of splicing patterns. We have compared the AS of 309 protein-coding genes in the human ENCODE pilot regions against their mouse orthologs in unprecedented detail, utilizing traditional transcriptomic and RNAseq data. The conservation status of every transcript has been investigated, and each functionally categorized as coding (separated into coding sequence [CDS] or nonsense-mediated decay [NMD] linked) or noncoding. In total, 36.7% of human and 19.3% of mouse coding transcripts are species specific, and we observe a 3.6-fold excess of human NMD transcripts compared with mouse; in contrast to previous studies, the majority of species-specific AS is unlinked to transposable elements. We observe one conserved CDS variant and one conserved NMD variant per 2.3 and 11.4 genes, respectively. Subsequently, we identify and characterize equivalent AS patterns for 22.9% of these CDS or NMD-linked events in nonmammalian vertebrate genomes, and our data indicate that functional NMD-linked AS is more widespread and ancient than previously thought. Furthermore, although we observe an association between conserved AS and elevated sequence conservation, as previously reported, we emphasize that 30% of conserved AS exons display sequence conservation below the average score for constitutive exons. In conclusion, we demonstrate the value of detailed comparative annotation in generating a comprehensive set of AS transcripts, increasing our understanding of AS evolution in vertebrates. Our data support a model whereby the acquisition of functional AS has occurred throughout vertebrate evolution and is considered alongside amino acid change as a key mechanism in gene evolution.
alternative splicing; nonsense-mediated decay; vertebrate evolution; RBM39
The genetic dissection of the phenotypes associated with Williams-Beuren Syndrome (WBS) is advancing thanks to the study of individuals carrying typical or atypical structural rearrangements, as well as in vitro and animal studies. However, little is known about the global dysregulations caused by the WBS deletion. We profiled the transcriptomes of skin fibroblasts from WBS patients and compared them to matched controls. We identified 868 differentially expressed genes that were significantly enriched in extracellular matrix genes, major histocompatibility complex (MHC) genes, as well as genes in which the products localize to the postsynaptic membrane. We then used public expression datasets from human fibroblasts to establish transcription modules, sets of genes coexpressed in this cell type. We identified those sets in which the average gene expression was altered in WBS samples. Dysregulated modules are often interconnected and share multiple common genes, suggesting that intricate regulatory networks connected by a few central genes are disturbed in WBS. This modular approach increases the power to identify pathways dysregulated in WBS patients, thus providing a testable set of additional candidates for genes and their interactions that modulate the WBS phenotypes.
A fundamental question in current biomedical research is to establish a link between genomic variation and phenotypic differences, which encompasses both the seemingly neutral diversity, as well as the pathological variation that causes or predisposes to disease. Once the primary genetic cause(s) of a disease or phenotype has been identified, we need to understand the biochemical consequences of such variants that eventually lead to increased disease risk. Such phenotypic effects of genetic differences are supposedly brought about by changes in expression levels, either of the genes affected by the genetic change or indirectly through position effects. Thus, transcriptome analyses seem appropriate proxies to study the consequences of structural variation, such as the 7q11.23 deletion present in individuals with Williams-Beuren syndrome (WBS). Here, we present an approach that takes experimental data into account instead of relying solely on functional annotation, following the rationale that coherently regulated genes are likely to play a role in the same biological process. While our algorithm can be applied to expression data from any source, our study provides a resource for the identification of additional candidate genes and pathways to explain the WBS phenotype, as well as a basis for uncovering novel functional interactions between sets of genes.
The manuscript describes the “digital transcriptome atlas” of the developing mouse embryo, a powerful resource to determine co-expression of genes, to identify cell populations and lineages and to identify functional associations between genes relevant to development and disease.
Ascertaining when and where genes are expressed is of crucial importance to understanding or predicting the physiological role of genes and proteins and how they interact to form the complex networks that underlie organ development and function. It is, therefore, crucial to determine on a genome-wide level, the spatio-temporal gene expression profiles at cellular resolution. This information is provided by colorimetric RNA in situ hybridization that can elucidate expression of genes in their native context and does so at cellular resolution. We generated what is to our knowledge the first genome-wide transcriptome atlas by RNA in situ hybridization of an entire mammalian organism, the developing mouse at embryonic day 14.5. This digital transcriptome atlas, the Eurexpress atlas (http://www.eurexpress.org), consists of a searchable database of annotated images that can be interactively viewed. We generated anatomy-based expression profiles for over 18,000 coding genes and over 400 microRNAs. We identified 1,002 tissue-specific genes that are a source of novel tissue-specific markers for 37 different anatomical structures. The quality and the resolution of the data revealed novel molecular domains for several developing structures, such as the telencephalon, a novel organization for the hypothalamus, and insight on the Wnt network involved in renal epithelial differentiation during kidney development. The digital transcriptome atlas is a powerful resource to determine co-expression of genes, to identify cell populations and lineages, and to identify functional associations between genes relevant to development and disease.
In situ hybridization (ISH) can be used to visualize gene expression in cells and tissues in their native context. High-throughput ISH using nonradioactive RNA probes allowed the Eurexpress consortium to generate a comprehensive, interactive, and freely accessible digital gene expression atlas, the Eurexpress transcriptome atlas (http://www.eurexpress.org), of the E14.5 mouse embryo. Expression data for over 15,000 genes were annotated for hundreds of anatomical structures, thus allowing us to systematically identify tissue-specific and tissue-overlapping gene networks. We illustrate the value of the Eurexpress atlas by finding novel regional subdivisions in the developing brain. We also use the transcriptome atlas to allocate specific components of the complex Wnt signaling pathway to kidney development, and we identify regionally expressed genes in liver that may be markers of hematopoietic stem cell differentiation.
Williams–Beuren syndrome (WBS; OMIM no. 194050) is a multisystemic neurodevelopmental disorder caused by a hemizygous deletion of 1.55 Mb on chromosome 7q11.23 spanning 28 genes. Haploinsufficiency of the ELN gene was shown to be responsible for supravalvular aortic stenosis and generalized arteriopathy, whereas the LIMK1, CLIP2, GTF2IRD1 and GTF2I genes were suggested to be linked to the specific cognitive profile and craniofacial features. These insights into genotype–phenotype correlations came from the molecular and clinical analysis of patients with atypical deletions and of mouse models. Here we report a patient showing a mild WBS physical phenotype and normal IQ, who carries a shorter, 1 Mb atypical deletion. This rearrangement does not include the GTF2IRD1 and GTF2I genes and includes the BAZ1B gene only partially. Our results are consistent with the hypothesis that hemizygosity of the GTF2IRD1 and GTF2I genes might be involved in the facial dysmorphisms and in the specific motor and cognitive deficits observed in WBS patients.
7q11.23; microdeletion; Williams–Beuren syndrome; mental retardation; haploinsufficiency
The characterization of mice with different numbers of copies of the same genomic segment shows that structural changes influence the phenotypic outcome independently of gene dosage.
A large fraction of genome variation between individuals consists of submicroscopic copy number variation of genomic DNA segments. We assessed the relative contribution of structural changes and gene dosage alterations to phenotypic outcomes with mouse models of Smith-Magenis and Potocki-Lupski syndromes. We phenotyped mice with 1n (Deletion/+), 2n (+/+), 3n (Duplication/+), and balanced 2n compound heterozygous (Deletion/Duplication) copies of the same region. Parallel to the observations made in humans, such variation in gene copy number was sufficient to generate phenotypic consequences: in a number of cases diametrically opposing phenotypes were associated with gain versus loss of gene content. Surprisingly, some neurobehavioral traits were not rescued by restoration of the normal gene copy number. Transcriptome profiling showed that transcriptional changes have a highly significant propensity to map to the engineered interval in the five assessed tissues. A statistically significant overrepresentation of the genes mapping to the entire length of the engineered chromosome was also found in the top-ranked differentially expressed genes in the mice containing rearranged chromosomes, regardless of the nature of the rearrangement, an observation robust across different cell lineages of the central nervous system. Our data indicate that a structural change at a given position of the human genome may affect not only locus and adjacent gene expression but also “genome regulation.” Furthermore, structural change can cause the same perturbation in particular pathways regardless of gene dosage. Thus, the presence of a genomic structural change, as well as gene dosage imbalance, contributes to the ultimate phenotype.
Mammalian genomes contain many forms of genetic variation. For example, some genome segments were shown to vary in their number of copies between individuals of the same species, i.e. there is a range of copy numbers in the normal population instead of the usual two copies (one per chromosome). These genetic differences play an important role in determining the phenotype (the observable characteristics) of each individual. We do not know, however, whether such influences are brought about solely through changes in the number of copies of the genomic segments (and of the genes that map within) or whether the structural modification of the genome per se also plays a role in the outcome. We use mouse models with different numbers of copies of the same genomic region to show that rearrangements of the genetic material can affect the phenotype independently of the dosage of the rearranged region.
A review of the main computational pipelines used to generate the human reference protein-coding gene sets.
The vast majority of the biology of a newly sequenced genome is inferred from the set of encoded proteins. Predicting this set is therefore invariably the first step after the completion of the genome DNA sequence. Here we review the main computational pipelines used to generate the human reference protein-coding gene sets.
RACE (Rapid Amplification of cDNA Ends) is a widely used approach for transcript identification. Random clone selection from the RACE mixture, however, is an ineffective sampling strategy if the dynamic range of transcript abundances is large. Here, we describe a strategy that uses array hybridization to improve sampling efficiency of human transcripts. The products of the RACE reaction are hybridized onto tiling arrays, and the exons detected are used to delineate a series of RT-PCR reactions, through which the original RACE mixture is segregated into simpler RT-PCR reactions. These are independently cloned, and randomly selected clones are sequenced. This approach is superior to direct cloning and sequencing of RACE products: it specifically targets novel transcripts, and often results in overall normalization of transcript abundances. We show theoretically and experimentally that this strategy leads indeed to efficient sampling of novel transcripts, and we investigate multiplexing it by pooling RACE reactions from multiple interrogated loci prior to hybridization.
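The sampling argument above can be illustrated with a small calculation. If clones are picked independently and transcript i makes up a fraction p_i of the mixture, the expected number of distinct transcripts seen after n clones is the sum over i of 1 − (1 − p_i)^n. This is a simplified sketch of the intuition, not the theoretical treatment given in the study; the abundances below are invented for illustration and the function name is hypothetical.

```python
def expected_distinct(abundances, n_clones):
    """Expected number of distinct transcripts observed among n_clones random picks.

    Each transcript with relative abundance p is missed by all n picks with
    probability (1 - p) ** n, so it is seen at least once with 1 - (1 - p) ** n.
    """
    total = sum(abundances)
    return sum(1 - (1 - a / total) ** n_clones for a in abundances)

# Hypothetical mixtures of 10 transcripts (illustrative values only).
skewed = [1000, 100, 10, 1, 1, 1, 1, 1, 1, 1]   # large dynamic range
normalized = [1] * 10                            # equalised abundances

for n in (10, 50, 200):
    print(n,
          round(expected_distinct(skewed, n), 2),
          round(expected_distinct(normalized, n), 2))
```

Under these assumptions, the skewed mixture yields far fewer distinct transcripts for the same number of sequenced clones, which is why normalizing abundances before cloning improves sampling efficiency.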
Williams–Beuren syndrome (WBS) is a neurodevelopmental and multisystemic disease that results from hemizygosity of approximately 25 genes mapping to chromosomal region 7q11.23. We report here the preliminary description of eight novel genes mapping within the WBS critical region and/or its syntenic mouse region. Three of these genes, TRIM50, TRIM73 and TRIM74, belong to the TRIpartite motif gene family, members of which were shown to be associated with several human genetic diseases. We describe the preliminary functional characterization of these genes and show that Trim50 encodes an E3 ubiquitin ligase, raising the interesting hypothesis that the ubiquitin-mediated proteasome pathway might be involved in the WBS phenotype.
Williams-Beuren syndrome; tripartite motif protein; ubiquitin ligase; contiguous gene syndrome
The fraction of experimentally active conserved non-coding sequences within any given cell type is low, so classical assays are unlikely to expose their potential.
Conserved non-coding sequences in the human genome are approximately tenfold more abundant than known genes, and have been hypothesized to mark the locations of cis-regulatory elements. However, the global contribution of conserved non-coding sequences to the transcriptional regulation of human genes is currently unknown. Deeply conserved elements shared between humans and teleost fish predominantly flank genes active during morphogenesis and are enriched for positive transcriptional regulatory elements. However, such deeply conserved elements account for <1% of the conserved non-coding sequences in the human genome, which are predominantly mammalian.
We explored the regulatory potential of a large sample of these 'common' conserved non-coding sequences using a variety of classic assays, including chromatin remodeling, and enhancer/repressor and promoter activity. When tested across diverse human model cell types, we find that the fraction of experimentally active conserved non-coding sequences within any given cell type is low (approximately 5%), and that this proportion increases only modestly when considered collectively across cell types.
The results suggest that classic assays of cis-regulatory potential are unlikely to expose the functional potential of the substantial majority of mammalian conserved non-coding sequences in the human genome.
A report of the annual meeting of the European Society of Human Genetics, Amsterdam, 6-9 May 2006.
We present the results of EGASP, a community experiment to assess the state-of-the-art in genome annotation within the ENCODE regions, which span 1% of the human genome sequence. The experiment had two major goals: the assessment of the accuracy of computational methods to predict protein coding genes; and the overall assessment of the completeness of the current human genome annotations as represented in the ENCODE regions. For the computational prediction assessment, eighteen groups contributed gene predictions. We evaluated these submissions against each other based on a 'reference set' of annotations generated as part of the GENCODE project. These annotations were not available to the prediction groups prior to the submission deadline, so that their predictions were blind and an external advisory committee could perform a fair assessment.
The best methods had at least one gene transcript correctly predicted for close to 70% of the annotated genes. Nevertheless, multiple-transcript accuracy, which takes alternative splicing into account, reached only approximately 40% to 50%. At the coding nucleotide level, the best programs reached an accuracy of 90% in both sensitivity and specificity. Programs relying on mRNA and protein sequences were the most accurate in reproducing the manually curated annotations. Experimental validation shows that only a very small percentage (3.2%) of the selected 221 computationally predicted exons outside of the existing annotation could be verified.
This is the first such experiment in human DNA, and we have followed the standards established in a similar experiment, GASP1, in Drosophila melanogaster. We believe the results presented here contribute to the value of ongoing large-scale annotation projects and should guide further experimental methods when being scaled up to the entire human genome sequence.
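The nucleotide-level sensitivity and specificity reported above can be made concrete with a short sketch. In the gene-prediction literature, sensitivity is usually the fraction of annotated coding nucleotides that are predicted, and specificity the fraction of predicted coding nucleotides that are annotated (what other fields call precision). The code assumes this convention. The coordinates are invented, and the function names are illustrative rather than those of any EGASP evaluation software.

```python
def covered_positions(intervals):
    """Expand half-open (start, end) intervals into a set of nucleotide positions."""
    positions = set()
    for start, end in intervals:
        positions.update(range(start, end))
    return positions

def nucleotide_accuracy(annotated, predicted):
    """Return (sensitivity, specificity) at the nucleotide level.

    sensitivity = TP / (TP + FN): annotated coding bases that were predicted
    specificity = TP / (TP + FP): predicted coding bases that are annotated
    """
    truth = covered_positions(annotated)
    pred = covered_positions(predicted)
    true_positives = len(truth & pred)
    sensitivity = true_positives / len(truth) if truth else 0.0
    specificity = true_positives / len(pred) if pred else 0.0
    return sensitivity, specificity

# Hypothetical coding exons on one sequence (illustrative coordinates only).
annotated_cds = [(100, 200), (300, 400)]
predicted_cds = [(110, 200), (300, 420)]

sn, sp = nucleotide_accuracy(annotated_cds, predicted_cds)
print(f"sensitivity = {sn:.2f}, specificity = {sp:.2f}")
```

Expanding intervals into position sets keeps the sketch short; for whole-genome annotations an interval-overlap computation would be more economical.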
The GENCODE consortium was formed to identify and map all protein-coding genes within the ENCODE regions. This was achieved by a combination of initial manual annotation by the HAVANA team, experimental validation by the GENCODE consortium and a refinement of the annotation based on these experimental results.
The GENCODE gene features are divided into eight different categories of which only the first two (known and novel coding sequence) are confidently predicted to be protein-coding genes. 5' rapid amplification of cDNA ends (RACE) and RT-PCR were used to experimentally verify the initial annotation. Of the 420 coding loci tested, 229 RACE products have been sequenced. They supported 5' extensions of 30 loci and new splice variants in 50 loci. In addition, 46 loci without evidence for a coding sequence were validated, consisting of 31 novel and 15 putative transcripts. We assessed the comprehensiveness of the GENCODE annotation by attempting to validate all the predicted exon boundaries outside the GENCODE annotation. Out of 1,215 tested in a subset of the ENCODE regions, 14 novel exon pairs were validated, only two of them in intergenic regions.
In total, 487 loci, of which 434 are coding, have been annotated as part of the GENCODE reference set available from the UCSC browser. Comparison of the GENCODE annotation with RefSeq and ENSEMBL shows that only 40% of GENCODE exons are contained within the two sets, which reflects the high number of annotated alternative splice forms with unique exons. Over 50% of coding loci have been experimentally verified by 5' RACE for EGASP, and the GENCODE collaboration is continuing to refine its annotation of the 1% of the human genome covered by the ENCODE regions with the aid of experimental validation.