Determining the full complement of protein-coding genes is a key goal of genome annotation. The most powerful approach for confirming protein-coding potential is the detection of cellular protein expression through peptide mass spectrometry (MS) experiments. Here, we mapped peptides detected in seven large-scale proteomics studies to almost 60% of the protein-coding genes in the GENCODE annotation of the human genome. We found a strong relationship between detection in proteomics experiments and both gene family age and cross-species conservation. Most of the genes for which we detected peptides were highly conserved. We found peptides for >96% of genes that evolved before bilateria. At the opposite end of the scale, we identified almost no peptides for genes that have appeared since primates, for genes that did not have any protein-like features or for genes with poor cross-species conservation. These results motivated us to describe a set of 2001 potential non-coding genes based on features such as weak conservation, a lack of protein features, or ambiguous annotations from major databases, all of which correlated with low peptide detection across the seven experiments. We identified peptides for just 3% of these genes. We show that many of these genes behave more like non-coding genes than protein-coding genes and suggest that most are unlikely to code for proteins under normal circumstances. We believe that their inclusion in the human protein-coding gene catalogue should be revised as part of the ongoing human genome annotation effort.
The Vertebrate Genome Annotation (VEGA) database (http://vega.sanger.ac.uk), initially designed as a community resource for browsing manual annotation of the human genome project, now contains five reference genomes (human, mouse, zebrafish, pig and rat). Its introduction pages have been redesigned to enable the user to easily navigate between whole genomes and smaller multi-species haplotypic regions of interest such as the major histocompatibility complex. The VEGA browser is unique in that annotation is updated via the Human And Vertebrate Analysis aNd Annotation (HAVANA) update track every 2 weeks, allowing single gene updates to be made publicly available to the research community quickly. The user can now access different haplotypic subregions more easily, such as those from the non-obese diabetic mouse, and display them in a more intuitive way using the comparative tools. We also highlight how the user can browse manually annotated updated patches from the Genome Reference Consortium (GRC).
The Consensus Coding Sequence (CCDS) project (http://www.ncbi.nlm.nih.gov/CCDS/) is a collaborative effort to maintain a dataset of protein-coding regions that are identically annotated on the human and mouse reference genome assemblies by the National Center for Biotechnology Information (NCBI) and Ensembl genome annotation pipelines. Identical annotations that pass quality assurance tests are tracked with a stable identifier (CCDS ID). Members of the collaboration, who are from NCBI, the Wellcome Trust Sanger Institute and the University of California Santa Cruz, provide coordinated and continuous review of the dataset to ensure high-quality CCDS representations. We describe here the current status and recent growth in the CCDS dataset, as well as recent changes to the CCDS web and FTP sites. These changes include more explicit reporting about the NCBI and Ensembl annotation releases being compared, new search and display options, the addition of biologically descriptive information and our approach to representing genes for which support evidence is incomplete. We also present a summary of recent and future curation targets.
RNA sequencing has opened new avenues for the study of transcriptome composition. Significant evidence has accumulated showing that the human transcriptome contains in excess of a hundred thousand different transcripts. However, it is still not clear to what extent this diversity prevails when considering the relative abundances of different transcripts from the same gene.
Here we show that, in a given condition, most protein coding genes have one major transcript expressed at a significantly higher level than the others, that in human tissues the major transcripts contribute almost 85 percent to the total mRNA from protein coding loci, and that often the same major transcript is expressed in many tissues. We detect a high degree of overlap between the set of major transcripts and a recently published set of alternatively spliced transcripts that are predicted, from proteomic data, to be translated. Thus, we hypothesize that although some minor transcripts may play a functional role, the major ones are likely to be the main contributors to the proteome. However, we still detect a non-negligible fraction of protein coding genes for which the major transcript does not code for a protein.
Overall, our findings suggest that the transcriptome from protein coding loci is dominated by one transcript per gene and that not all the transcripts that contribute to transcriptome diversity are equally likely to contribute to protein diversity. This observation can help to prioritize candidate targets in proteomics research and to predict the functional impact of the detected changes in variation studies.
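The idea of a major transcript and its share of total expression can be illustrated with a minimal Python sketch. It is not the authors' method: the function names, the input layout (a list of gene, transcript, abundance tuples) and the toy values are assumptions made for illustration, and the "major" transcript is taken here simply as the most abundant one per gene, with no significance test.

```python
from collections import defaultdict


def major_transcripts(expression):
    """Return {gene: (major_transcript, share_of_gene_expression)}.

    `expression` is a list of (gene_id, transcript_id, abundance) tuples
    (a hypothetical layout). Genes with zero total abundance are skipped.
    """
    by_gene = defaultdict(list)
    for gene, transcript, abundance in expression:
        by_gene[gene].append((transcript, abundance))

    result = {}
    for gene, items in by_gene.items():
        total = sum(abundance for _, abundance in items)
        if total <= 0:
            continue
        # The most abundant transcript is treated as the major one.
        transcript, abundance = max(items, key=lambda item: item[1])
        result[gene] = (transcript, abundance / total)
    return result


def major_share(expression):
    """Fraction of all abundance that comes from major transcripts."""
    by_transcript = {(g, t): a for g, t, a in expression}
    majors = major_transcripts(expression)
    major_total = sum(by_transcript[(g, t)] for g, (t, _) in majors.items())
    grand_total = sum(a for _, _, a in expression)
    return major_total / grand_total


# Toy values, invented for illustration only.
toy = [
    ("geneA", "A1", 90.0),
    ("geneA", "A2", 10.0),
    ("geneB", "B1", 30.0),
    ("geneB", "B2", 20.0),
]
print(major_transcripts(toy))
print(major_share(toy))
```

On real data the abundances would come from a quantification tool, and a threshold on the ratio between the first and second transcript could be added to decide whether a gene has a clearly dominant transcript.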
splicing; transcriptome; gene expression; RNA-seq
Pseudogenes have long been considered as nonfunctional genomic sequences. However, recent evidence suggests that many of them might have some form of biological activity, and the possibility of functionality has increased interest in their accurate annotation and integration with functional genomics data.
As part of the GENCODE annotation of the human genome, we present the first genome-wide pseudogene assignment for protein-coding genes, based on both large-scale manual annotation and in silico pipelines. A key aspect of this coupled approach is that it allows us to identify pseudogenes in an unbiased fashion as well as untangle complex events through manual evaluation. We integrate the pseudogene annotations with the extensive ENCODE functional genomics information. In particular, we determine the expression level, transcription-factor and RNA polymerase II binding, and chromatin marks associated with each pseudogene. Based on their distribution, we develop simple statistical models for each type of activity, which we validate with large-scale RT-PCR-Seq experiments. Finally, we compare our pseudogenes with conservation and variation data from primate alignments and the 1000 Genomes project, producing lists of pseudogenes potentially under selection.
At one extreme, some pseudogenes possess conventional characteristics of functionality; these may represent genes that have recently died. On the other hand, we find interesting patterns of partial activity, which may suggest that dead genes are being resurrected as functioning non-coding RNAs. The activity data of each pseudogene are stored in an associated resource, psiDR, which will be useful for the initial identification of potentially functional pseudogenes.
Genome sequencing studies indicate that all humans carry many genetic variants predicted to cause loss of function (LoF) of protein-coding genes, suggesting unexpected redundancy in the human genome. Here we apply stringent filters to 2,951 putative LoF variants obtained from 185 human genomes to determine their true prevalence and properties. We estimate that human genomes typically contain ~100 genuine LoF variants with ~20 genes completely inactivated. We identify rare and likely deleterious LoF alleles, including 26 known and 21 predicted severe disease-causing variants, as well as common LoF variants in non-essential genes. We describe functional and evolutionary differences between LoF-tolerant and recessive disease genes, and a method for using these differences to prioritize candidate genes found in clinical sequencing studies.
Advances in high-throughput mass spectrometry are making proteomics an increasingly important tool in genome annotation projects. Peptides detected in mass spectrometry experiments can be used to validate gene models and verify the translation of putative coding sequences (CDSs). Here, we have identified peptides that cover 35% of the genes annotated by the GENCODE consortium for the human genome as part of a comprehensive analysis of experimental spectra from two large publicly available mass spectrometry databases. We detected the translation to protein of “novel” and “putative” protein-coding transcripts as well as transcripts annotated as pseudogenes and nonsense-mediated decay targets.
We provide a detailed overview of the population of alternatively spliced protein isoforms that are detectable by peptide identification methods. We found that 150 genes expressed multiple alternative protein isoforms. This constitutes the largest set of reliably confirmed alternatively spliced proteins yet discovered. Three groups of genes were highly overrepresented. We detected alternative isoforms for 10 of the 25 possible heterogeneous nuclear ribonucleoproteins, proteins with a key role in the splicing process. Alternative isoforms generated from interchangeable homologous exons and from short indels were also significantly enriched, both in human experiments and in parallel analyses of mouse and Drosophila proteomics experiments. Our results show that a surprisingly high proportion (almost 25%) of the detected alternative isoforms are only subtly different from their constitutive counterparts.
Many of the alternative splicing events that give rise to these alternative isoforms are conserved in mouse. It was striking that very few of these conserved splicing events broke Pfam functional domains or would damage globular protein structures. This evidence of a strong bias toward subtle differences in CDS and likely conserved cellular function and structure is remarkable and strongly suggests that the translation of alternative transcripts may be subject to selective constraints.
alternative splicing; shotgun proteomics; genome annotation; heterogeneous nuclear ribonucleoproteins; NAGNAG splicing; mutually exclusive exons
While alternative splicing (AS) can potentially expand the functional repertoire of vertebrate genomes, relatively few AS transcripts have been experimentally characterized. We describe our detailed manual annotation of vertebrate genomes, which is generating a publicly available geneset rich in AS. In order to achieve this we have adopted a highly sensitive approach to annotating gene models supported by correctly mapped, canonically spliced transcriptional evidence combined with a highly cautious approach to adding unsupported extensions to models and making decisions on their functional potential. We use information about the predicted functional potential and structural properties of every AS transcript annotated at a protein-coding or non-coding locus to place each into one of eleven subclasses. We describe the incorporation of new sequencing and proteomics technologies into our annotation pipelines, which are used to identify and validate AS. Combining all data sources has led to the production of a rich geneset containing an average of 6.3 AS transcripts for every human multi-exon protein-coding gene. The datasets produced have proved very useful in providing context to studies investigating the functional potential of genes and the effect that variation may have on gene structure and function.
The Consensus Coding Sequence (CCDS) collaboration involves curators at multiple centers with a goal of producing a conservative set of high quality, protein-coding region annotations for the human and mouse reference genome assemblies. The CCDS data set reflects a ‘gold standard’ definition of best supported protein annotations, and corresponding genes, which pass a standard series of quality assurance checks and are supported by manual curation. This data set supports use of genome annotation information by human and mouse researchers for effective experimental design, analysis and interpretation. The CCDS project consists of analysis of automated whole-genome annotation builds to identify identical CDS annotations, quality assurance testing and manual curation support. Identical CDS annotations are tracked with a CCDS identifier (ID) and any future change to the annotated CDS structure must be agreed upon by the collaborating members. CCDS curation guidelines were developed to address some aspects of curation in order to improve initial annotation consistency and to reduce time spent in discussing proposed annotation updates. Here, we present the current status of the CCDS database and details on our procedures to track and coordinate our efforts. We also present the relevant background and reasoning behind the curation standards that we have developed for CCDS database treatment of transcripts that are nonsense-mediated decay (NMD) candidates, for transcripts containing upstream open reading frames, for identifying the most likely translation start codons and for the annotation of readthrough transcripts. Examples are provided to illustrate the application of these guidelines.
Database URL: http://www.ncbi.nlm.nih.gov/CCDS/CcdsBrowse.cgi
The classic organization of a gene structure has followed the Jacob and Monod bacterial gene model proposed more than 50 years ago. Since then, empirical determinations of the complexity of the transcriptomes found in yeast to human have blurred the definition and physical boundaries of genes. Using multiple analysis approaches we have characterized individual gene boundaries mapping on human chromosomes 21 and 22. Analyses of the locations of the 5′ and 3′ transcriptional termini of 492 protein coding genes revealed that for 85% of these genes the boundaries extend beyond the current annotated termini, most often connecting with exons of transcripts from other well annotated genes. The biological and evolutionary importance of these chimeric transcripts is underscored by (1) the non-random interconnections of genes involved, (2) the greater phylogenetic depth of the genes involved in many chimeric interactions, (3) the coordination of the expression of connected genes and (4) the close in vivo and three dimensional proximity of the genomic regions being transcribed and contributing to parts of the chimeric RNAs. The non-random nature of the connection of the genes involved suggests that chimeric transcripts should not be studied in isolation, but together, as an RNA network.
Alternative splicing (AS) has the potential to greatly expand the functional repertoire of mammalian transcriptomes. However, few variant transcripts have been characterized functionally, making it difficult to assess the contribution of AS to the generation of phenotypic complexity and to study the evolution of splicing patterns. We have compared the AS of 309 protein-coding genes in the human ENCODE pilot regions against their mouse orthologs in unprecedented detail, utilizing traditional transcriptomic and RNAseq data. The conservation status of every transcript has been investigated, and each functionally categorized as coding (separated into coding sequence [CDS] or nonsense-mediated decay [NMD] linked) or noncoding. In total, 36.7% of human and 19.3% of mouse coding transcripts are species specific, and we observe a 3.6 times excess of human NMD transcripts compared with mouse; in contrast to previous studies, the majority of species-specific AS is unlinked to transposable elements. We observe one conserved CDS variant and one conserved NMD variant per 2.3 and 11.4 genes, respectively. Subsequently, we identify and characterize equivalent AS patterns for 22.9% of these CDS or NMD-linked events in nonmammalian vertebrate genomes, and our data indicate that functional NMD-linked AS is more widespread and ancient than previously thought. Furthermore, although we observe an association between conserved AS and elevated sequence conservation, as previously reported, we emphasize that 30% of conserved AS exons display sequence conservation below the average score for constitutive exons. In conclusion, we demonstrate the value of detailed comparative annotation in generating a comprehensive set of AS transcripts, increasing our understanding of AS evolution in vertebrates. Our data support a model whereby the acquisition of functional AS has occurred throughout vertebrate evolution and is considered alongside amino acid change as a key mechanism in gene evolution.
alternative splicing; nonsense-mediated decay; vertebrate evolution; RBM39
Protein domains are protein regions that are shared among different proteins and are frequently functionally and structurally independent from the rest of the protein. Novel domain combinations have a major role in evolutionary innovation. However, the relative contributions of the different molecular mechanisms that underlie domain gains in animals are still unknown. By using animal gene phylogenies we were able to identify a set of high confidence domain gain events and by looking at their coding DNA investigate the causative mechanisms.
Here we show that the major mechanism for gains of new domains in metazoan proteins is likely to be gene fusion through joining of exons from adjacent genes, possibly mediated by non-allelic homologous recombination. Retroposition and insertion of exons into ancestral introns through intronic recombination are, in contrast to previous expectations, only minor contributors to domain gains and have accounted for less than 1% and 10% of high confidence domain gain events, respectively. Additionally, exonization of previously non-coding regions appears to be an important mechanism for addition of disordered segments to proteins. We observe that gene duplication has preceded domain gain in at least 80% of the gain events.
The interplay of gene duplication and domain gain demonstrates an important mechanism for fast neofunctionalization of genes.
Novel human pseudogenes that were previously functional are identified and their ages are estimated. Loss of function occurred at a uniform rate.
Unitary pseudogenes are a class of unprocessed pseudogenes without functioning counterparts in the genome. They constitute only a small fraction of annotated pseudogenes in the human genome. However, as they represent distinct functional losses over time, they shed light on the unique features of humans in primate evolution.
We have developed a pipeline to detect human unitary pseudogenes through analyzing the global inventory of orthologs between the human genome and its mammalian relatives. We focus on gene losses along the human lineage after the divergence from rodents about 75 million years ago. In total, we identify 76 unitary pseudogenes, including previously annotated ones, and many novel ones. By comparing each of these to its functioning ortholog in other mammals, we can approximately date the creation of each unitary pseudogene (that is, the gene 'death date') and show that for our group of 76, the functional genes appear to be disabled at a fairly uniform rate throughout primate evolution - not all at once, correlated, for instance, with the 'Alu burst'. Furthermore, we identify 11 unitary pseudogenes that are polymorphic - that is, they have both nonfunctional and functional alleles currently segregating in the human population. Comparing them with their orthologs in other primates, we find that two of them are in fact pseudogenes in non-human primates, suggesting that they represent cases of a gene being resurrected in the human lineage.
This analysis of unitary pseudogenes provides insights into the evolutionary constraints faced by different organisms and the timescales of functional gene loss in humans.
Alternative splicing has the potential to generate a wide range of protein isoforms. For many computational applications and for experimental research, it is important to be able to concentrate on the isoform that retains the core biological function. For many genes this is far from clear.
We have combined five methods into a pipeline that allows us to detect the principal variant for a gene. Most of the methods were based on conservation between species, at the level of both gene and protein. The five methods used were the conservation of exonic structure, the detection of non-neutral evolution, the conservation of functional residues, the existence of a known protein structure and the abundance of vertebrate orthologues. The pipeline was able to determine a principal isoform for 83% of a set of well-annotated genes with multiple variants.
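The abstract does not say how the five methods are combined, so the following Python sketch only illustrates one possible scheme: a simple majority vote. The function name, the method labels, the isoform identifiers and the tie rule (no call when the top two isoforms have equal support) are all assumptions made for illustration, not the published pipeline.

```python
from collections import Counter


def principal_isoform(method_calls):
    """Pick the isoform supported by the most methods, or None.

    `method_calls` maps a method name to the set of isoform ids that
    the method supports (hypothetical layout). None is returned when no
    method supports any isoform or when the top isoforms are tied.
    """
    votes = Counter()
    for supported in method_calls.values():
        # Each method contributes at most one vote per isoform.
        votes.update(set(supported))

    if not votes:
        return None

    ranked = votes.most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None  # ambiguous: no single best-supported isoform
    return ranked[0][0]


# Toy calls for one gene; labels and isoform ids are invented.
calls = {
    "exonic_structure": {"iso1"},
    "non_neutral_evolution": {"iso1", "iso2"},
    "functional_residues": {"iso1"},
    "protein_structure": {"iso2"},
    "vertebrate_orthologues": set(),
}
print(principal_isoform(calls))
```

A vote that can return no call matches the abstract's observation that a principal isoform could be determined for only a proportion of genes, although the real decision rule may weight the methods differently.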
RACE (Rapid Amplification of cDNA Ends) is a widely used approach for transcript identification. Random clone selection from the RACE mixture, however, is an ineffective sampling strategy if the dynamic range of transcript abundances is large. Here, we describe a strategy that uses array hybridization to improve sampling efficiency of human transcripts. The products of the RACE reaction are hybridized onto tiling arrays, and the exons detected are used to delineate a series of RT-PCR reactions, through which the original RACE mixture is segregated into simpler RT-PCR reactions. These are independently cloned, and randomly selected clones are sequenced. This approach is superior to direct cloning and sequencing of RACE products: it specifically targets novel transcripts, and often results in overall normalization of transcript abundances. We show theoretically and experimentally that this strategy leads indeed to efficient sampling of novel transcripts, and we investigate multiplexing it by pooling RACE reactions from multiple interrogated loci prior to hybridization.
The human X chromosome has a unique biology that was shaped by its evolution as the sex chromosome shared by males and females. We have determined 99.3% of the euchromatic sequence of the X chromosome. Our analysis illustrates the autosomal origin of the mammalian sex chromosomes, the stepwise process that led to the progressive loss of recombination between X and Y, and the extent of subsequent degradation of the Y chromosome. LINE1 repeat elements cover one-third of the X chromosome, with a distribution that is consistent with their proposed role as way stations in the process of X-chromosome inactivation. We found 1,098 genes in the sequence, of which 99 encode proteins expressed in testis and in various tumour types. A disproportionately high number of mendelian diseases are documented for the X chromosome. Of this number, 168 have been explained by mutations in 113 X-linked genes, which in many cases were characterized with the aid of the DNA sequence.
An analysis of ribosomal protein pseudogenes in the four mammalian genomes reveals no correlation between number of pseudogenes and mRNA abundance.
The availability of genome sequences of numerous organisms allows comparative study of pseudogenes in syntenic regions. Conservation of pseudogenes suggests that they might have a functional role in some instances.
We report the first large-scale comparative analysis of ribosomal protein pseudogenes in four mammalian genomes (human, chimpanzee, mouse and rat). To this end, we have assigned these pseudogenes in the four organisms using an automated pipeline and make the results available online. Each organism has a large number of ribosomal protein pseudogenes (approximately 1,400 to 2,800). The majority of them are processed (generated by retrotransposition). However, we do not see a correlation between the number of pseudogenes associated with a ribosomal protein gene and its mRNA abundance. Analysis of pseudogenes in syntenic regions between species shows that most are conserved between human and chimpanzee, but very few are conserved between primates and rodents. Interestingly, syntenic pseudogenes have a lower rate of nucleotide substitution than their surrounding intergenic DNA. Moreover, evidence from expressed sequence tags indicates that two pseudogenes conserved between human and mouse are transcribed. Detailed analysis shows that one of them, the pseudogene of RPS27, is likely to be a protein-coding gene. This is significant as previous reports indicated there are exactly 80 ribosomal protein genes encoded by the human genome.
Our analysis indicates that processed ribosomal protein pseudogenes abound in mammalian genomes, but few of these are conserved between primates and rodents. This highlights the large amount of recent retrotranspositional activity in mammals and a relatively larger amount of it in the rodent lineage.
This study analyzes the predictions of a number of promoter predictors on the ENCODE regions of the human genome as part of the ENCODE Genome Annotation Assessment Project (EGASP). The systems analyzed operate on various principles and we assessed the effectiveness of different conceptual strategies used to correlate produced promoter predictions with the manually annotated 5' gene ends.
The predictions were assessed relative to the manual HAVANA annotation of the 5' gene ends. These 5' gene ends were used as the estimated reference transcription start sites. With the maximum allowed distance for predictions of 1,000 nucleotides from the reference transcription start sites, the sensitivity of predictors was in the range 32% to 56%, while the positive predictive value was in the range 79% to 93%. The average distance mismatch of predictions from the reference transcription start sites was in the range 259 to 305 nucleotides. At the same time, using transcription start site estimates from DBTSS and H-Invitational databases as promoter predictions, we obtained a sensitivity of 58%, a positive predictive value of 92%, and an average distance from the annotated transcription start sites of 117 nucleotides. In this experiment, the best performing promoter predictors were those that combined promoter prediction with gene prediction. The main reason for this is the reduced promoter search space that resulted in smaller numbers of false positive predictions.
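The sensitivity, positive predictive value and distance measures reported above can be sketched in Python as follows. The abstract does not state the exact counting rules, so this sketch makes several assumptions: positions lie on a single sequence and strand, a reference transcription start site counts as detected if any prediction falls within `max_dist`, a prediction counts as a true positive if it falls within `max_dist` of any reference site, and the mean distance is taken over true-positive predictions only. The function names and toy coordinates are hypothetical.

```python
from bisect import bisect_left


def nearest_distance(sorted_positions, x):
    """Distance from x to the closest value in a non-empty sorted list."""
    i = bisect_left(sorted_positions, x)
    # Only the neighbours on either side of the insertion point can be closest.
    candidates = sorted_positions[max(i - 1, 0):i + 1]
    return min(abs(x - p) for p in candidates)


def evaluate_predictions(predictions, reference, max_dist=1000):
    """Return (sensitivity, ppv, mean_distance_of_true_positives)."""
    if not predictions or not reference:
        raise ValueError("predictions and reference must be non-empty")

    pred_sorted = sorted(predictions)
    ref_sorted = sorted(reference)

    pred_dists = [nearest_distance(ref_sorted, p) for p in pred_sorted]
    tp_dists = [d for d in pred_dists if d <= max_dist]

    detected = sum(
        1 for r in ref_sorted
        if nearest_distance(pred_sorted, r) <= max_dist
    )

    sensitivity = detected / len(ref_sorted)
    ppv = len(tp_dists) / len(pred_sorted)
    mean_dist = sum(tp_dists) / len(tp_dists) if tp_dists else None
    return sensitivity, ppv, mean_dist


# Toy coordinates, invented for illustration only.
reference_tss = [1000, 5000, 20000, 40000]
predicted_tss = [1200, 5500, 30000]
print(evaluate_predictions(predicted_tss, reference_tss))
```

Restricting the predictor to regions near predicted genes would reduce the number of predictions far from any reference site, which raises the positive predictive value in this calculation.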
The main finding, now supported by comprehensive data, is that the accuracy of human promoter predictors for high-throughput annotation purposes can be significantly improved if promoter prediction is combined with gene prediction. Based on the lessons learned in this experiment, we propose a framework for the preparation of the next similar promoter prediction assessment.
The GENCODE consortium was formed to identify and map all protein-coding genes within the ENCODE regions. This was achieved by a combination of initial manual annotation by the HAVANA team, experimental validation by the GENCODE consortium and a refinement of the annotation based on these experimental results.
The GENCODE gene features are divided into eight different categories of which only the first two (known and novel coding sequence) are confidently predicted to be protein-coding genes. 5' rapid amplification of cDNA ends (RACE) and RT-PCR were used to experimentally verify the initial annotation. Of the 420 coding loci tested, 229 RACE products have been sequenced. They supported 5' extensions of 30 loci and new splice variants in 50 loci. In addition, 46 loci without evidence for a coding sequence were validated, consisting of 31 novel and 15 putative transcripts. We assessed the comprehensiveness of the GENCODE annotation by attempting to validate all the predicted exon boundaries outside the GENCODE annotation. Out of 1,215 tested in a subset of the ENCODE regions, 14 novel exon pairs were validated, only two of them in intergenic regions.
In total, 487 loci, of which 434 are coding, have been annotated as part of the GENCODE reference set available from the UCSC browser. Comparison of the GENCODE annotation with RefSeq and ENSEMBL shows that only 40% of GENCODE exons are contained within the two sets, which reflects the high number of annotated alternative splice forms with unique exons. Over 50% of coding loci have been experimentally verified by 5' RACE for EGASP, and the GENCODE collaboration is continuing to refine its annotation of 1% of the human genome with the aid of experimental validation.
Defensins are important components of innate immunity to combat bacterial and viral infections, and can even elicit antitumor responses. Clusters of defensin (DEF) genes are located in a 2 Mb range of human chromosome 8p23.1. This DEF locus, however, represents one of the regions in the euchromatic part of the final human genome sequence that contain segmental duplications and recalcitrant gaps, indicating high structural dynamics.
We find that inter- and intraindividual genetic variations within this locus prevent a correct automatic assembly of the human reference genome (NCBI Build 34), which currently even contains misassemblies. Manual clone-by-clone alignment and gene annotation as well as repeat and SNP/haplotype analyses result in an alternative alignment that significantly improves the DEF locus representation. Our assembly better reflects the experimentally verified variability of DEF gene and DEF cluster copy numbers. It contains an additional DEF cluster which we propose to reside between two already known clusters. Furthermore, manual annotation revealed a novel DEF gene and several pseudogenes, expanding the hitherto known DEF repertoire. Analyses of BAC and working draft sequences of the chimpanzee indicate that its DEF region is as complex as in humans and that DEF genes and a cluster are multiplied. Comparative analysis of human and chimpanzee DEF genes identified differences affecting the protein structure. Whether this might contribute to differences in disease susceptibility between man and ape remains to be resolved. For the determination of individual DEF gene repertoires we provide a molecular approach based on DEF haplotypes.
Complexity and variability seem to be essential genomic features of the human DEF locus at 8p23.1 and provide an ongoing challenge for the best possible representation in the human reference sequence. Dissection of paralogous sequence variations, duplicon SNPs and multisite variations as well as haplotypes by sequencing-based methods is the way forward for future studies of interindividual DEF locus variability and its disease association.