The manual annotation is based on protein data from the UniProt database, which is aligned to the individual bacterial artificial chromosome (BAC) clones that make up the reference genome sequence using BLAST [51
]. Gene models are manually extrapolated from the alignments by annotators using the ZMAP annotation interface and the otterlace annotation system [52
]. Alignments were navigated using the Blixem alignment viewer [53
]. Visual inspection of the dot-plot output from the Dotter tool [53
] is used to resolve any alignment with the genomic sequence that is unclear in, or absent from, Blixem. A model is defined as a pseudogene if it possesses one or more of the following characteristics unless there is evidence (transcriptional, functional, publication) showing that the locus represents a protein-coding gene with structural/functional divergence from its parent (paralog): (1) a premature stop codon relative to parent CDS - can be introduced by nonsense or frame-shift mutation; (2) a frame-shift in a functional domain - even where the length of the resulting CDS is similar to that of the parent CDS; (3) a truncation of the 5' or 3' end of the CDS relative to the parent CDS; (4) a deletion of an internal portion of the CDS relative to the parent CDS. Processed pseudogene loci lacking disabling mutations are annotated as 'pseudogene' when they lack locus-specific transcriptional evidence
PseudoPipe identifies pseudogenes by searching for homology to all known protein sequences in the genome (defined in Ensembl) using a six-frame translational BLAST, followed by removal of redundancies and merging of the overlapping and continuous BLAST hits. Functional paralogs (parents) of the resulting pseudogenes are determined by sequence similarity, and the disablements in pseudogenes are identified through alignment to the parent genes. A non-redundant set of 18,046 pseudogenes was obtained using the human reference genome (GRch37, ENSEMBL gene release 60). Pseudogenes are categorized into different classes as processed, duplicated or ambiguous based on their genomic structures. While duplicated pseudogenes have intron-exon like structures, processed pseudogenes contain only continuous exon sequences with no introns and have traces of polyadenine tails at the 3' end. Ambiguous pseudogenes indicate processed pseudogenes with decayed sequences.
RetroFinder is unique among pseudogene prediction methods for using mRNA alignments to identify retrogenes, including processed pseudogenes [37
]. Human mRNA and RefSeq sequences are aligned using the Lastz [54
] alignment program (based on Blastz [55
]), which is very sensitive, allowing alignment down to the level of 65% identity, whereas BLAT [56
] works better for sequences where identity is greater than 95%. If one of these transcripts aligns more than once, and one of the alignments is to a known gene locus, then the additional alignments are scored on a number of features indicative of retrotransposition: multiple contiguous exons with the parent gene introns removed; negatively scored introns that are distinguished from repeat insertions (SVA elements, long interspersed nucleotide elements (LINEs), short interspersed nucleotide elements (SINEs), Alu elements); lack of conserved splice sites; break in synteny with mouse and dog genomes using the syntenic net alignments [57
] from the UCSC Genome Browser [58
]; polyadenine tail insertion.
Parents based on immunoglobulin and zinc finger genes are filtered out since these large gene families cause false positives. The score threshold is set at 550 based on training with VEGA [59
] processed pseudogenes. Note that for human, VEGA genes are included in the manually annotated genes of GENCODE. Further details of the method can be found in [37
Consensus of manual and automated annotation
To obtain a consensus set of pseudogenes, we verified each pseudogene locus from manual annotation against those predicted by either of the two automated pipelines (PseudoPipe and RetroFinder), using a 50 bp overlap criterion. A pseudogene passing these overlapping tests is classified as: a 'level 1' pseudogene if it passes tests of manual annotation against both automated pipelines; or a '2-way consensus' pseudogene if it only passes the test between the two automated pipelines.
As a quality control exercise to determine completeness of pseudogene annotation in chromosomes that have been manually annotated, 2-way consensus pseudogenes are re-checked to establish their validity and added to the manually annotated pseudogene set as appropriate.
We estimated the total number of pseudogenes in the genome using the knowledge from PseudoPipe and manual annotation. Using manual annotation from the chromosomes that were completely annotated as a gold standard, we estimated the number of false positives and false negatives in PseudoPipe predictions. We used this information to extrapolate to the entire human genome to obtain an estimate of the number of pseudogenes in the reference genome.
Chromosomes 1 to 11, 20, 21, 22, X, Y and the p arm of 12 are fully annotated in GENCODE v7. On these chromosomes, there are 9,776 and 12,501 pseudogenes predicted by manual inspection and by PseudoPipe, respectively. PseudoPipe assigned 18,046 pseudogenes in the entire genome. Based on this, the number of manually identified pseudogenes in the genome will be (9,776 × 18,046)/12,501 ≈ 14,112.
Alternatively, we used a simple linear extrapolation to correlate the number of pseudogenes with the size of chromosomes on which the pseudogenes are annotated. With this method, the number of nucleotides from the fully annotated regions is 2,383,814,825, while the total number of nucleotides in the genome is 3,092,688,347. Therefore, the predicted number of pseudogenes for the entire human genome is (9,776 × 3,092,688,347)/2,383,814,825 ≈ 12,683.
Identification of the parents of pseudogenes and sequence similarity to the parent
We derived parents of pseudogenes from the correspondence between pseudogenes and query sequences used by different pipelines (that is, UniProt proteins for manual annotation and Ensembl peptides for PseudoPipe), together with the sequence alignments of pseudogenes against the whole human genome. The procedure was carried out using the following steps: first, use correspondence between parents and pseudogenes derived by the manual annotation; second, one-to-one sequence alignment between pseudogenes and coding regions in the human genome by BLAT (sequence similarity > 90%); third, use parent gene information provided by PseudoPipe.
When the parent identity for a pseudogene is inconsistent across different data resources, we assign the parent based on the highest ranked data in the following order: manual annotation, BLAT alignment, and automated curation.
Parents of 9,368 pseudogenes were unambiguously identified, while it is difficult to uniquely identify the parent genes for 1,848 pseudogenes. The two most significant factors that confound our ability to confidently identify a pseudogene parent are the degree of degradation of the pseudogene and the number of closely related paralogs to the true parent gene. Therefore, for gene families with many closely related members, even a relatively small number of mutations can render accurate identification of the true parent difficult; while for more degraded pseudogenes from large families with common functional domains (for example, zinc fingers), the number and similarity of the potential parents make prediction impossible.
To calculate the sequence identity between pseudogenes and their parents, each pseudogene sequence was extended by 2 kb at its 3' end for a higher coverage of 3' UTR of its parent and then aligned to its parent sequence. Only exons of parent and pseudogene sequences were used. The alignment was carried out using ClustalW2, with default parameters. To adapt to the large size of 3' UTR and much smaller size of small RNA targets in that region, a sliding window of 100 bp was used for sequence identity for a more accurate local identity. The window with the highest sequence identity was taken as representative of the 3' UTR and used in the following tests.
Pseudogene transcription evidence from RNA-Seq data
The pseudogenes in GENCODE v7 were tested for transcription evidence using the following workflow. First, we extracted the genomic coordinates of the processed and duplicated pseudogenes from GENCODE v7 (gene_type = 'pseudogene' AND transcript_type = 'processed_pseudogene' OR transcript_type = 'unprocessed_pseudogene'). From this step we obtained 8,107 processed and 1,860 duplicated pseudogenes. Second, we obtained the underlying genomic sequence for each pseudogene by concatenating the sequences of their pseudoexons. Third, we aligned each pseudogene sequence to the human reference genome using BLAT [56
] (with default parameters) to find all similar regions in the genome. Fourth, we assigned each pseudogene alignment to one of four categories: pseudogenes with no similar regions in the genome (presumably these pseudogenes are more ancient and have accumulated many mutations, and therefore they have a low sequence similarity compared to the parent gene); pseudogenes giving rise to one alignment pair (most likely the parent gene); pseudogenes with two to five alignments; pseudogenes giving rise to more than five sequence alignments.
For the 9,967 pseudogenes analyzed, we obtained the following counts: 3,198 pseudogenes with zero alignments, 1,907 pseudogenes with one alignment, 2,150 pseudogenes with two to five alignments and 2,712 pseudogenes with more than five alignments.
In order to check for evidence of pseudogene transcription, we examined the expression pattern of each pseudogene and its similar regions using the Illumina Human BodyMap RNA-Seq data set consisting of 16 tissues. First, we aligned the reads for each tissue to the human genome reference sequence in conjunction with a splice junction library using Bowtie [60
] and RSEQtools [61
]. There was no preference given for a genome match over other matches. Second, we generated a signal track of the mapped reads for each tissue. Third, for a given pseudogene and its similar regions in the human genome, we extracted the signal track of mapped reads from all 16 tissues as shown in Figure .
After a number of filtering steps we obtained a list of potentially transcribed pseudogenes. For example, the set of 3,198 pseudogenes with no similar regions in the genome was reduced to 344 pseudogenes by requiring that each pseudogene is covered by at least two reads across half of its length in at least one tissue.
Transcribed pseudogenes subject to experimental validation
Out of the 469 pseudogenes subjected to experimental validation, 94 pseudogenes were randomly selected from the manual pipeline output (pipeline 1 in section 'Pseudogene Transcription Identified by Sequence of Computational Pipelines'); 271 pseudogenes were selected at random from the PseudoSeq pipeline output (pipeline 3 in the same section as above), and 97 pseudogenes were selected at random from the TotalRNA pipeline output (pipeline 2 in the same section as above). The remaining seven pseudogenes (containing seven loci to be validated), were manually chosen by examining the expression patterns of pseudogenes and their parents using BodyMap data and PseudoSeq (Figure ). At the time of writing, the remainder of transcribed pseudogenes are undergoing experimental validation and the results will be constantly updated in the psiDR.
Multiple sequence alignment, pseudogene preservation and polymorphisms in the human population
Genomic sequence alignments of 16 species, including primates, mammals, and vertebrates, were extracted from the original 46-way vertebrate sequence alignments obtained from the UCSC genome browser. Genomes from all the species were aligned using BlastZ with a synteny filter followed by the MultiZ method. Assembled sequences for the 2X mammal data are excluded from the current study due to their low quality and possible false positive alignment to pseudogenes from the high-quality assemblies.
Genomic variation data consisting of SNPs, indels, and structural variations were from 60 individuals in the CEU population (Utah residents with ancestry from northern and western Europe) from the 1000 Genomes project pilot data release [47
Chimp orthologs to human pseudogenes were derived from whole genome sequence alignments. Only pseudoexons were used in the ortholog identification and the following analyses. The divergence is calculated as the ratio of mutated nucleotides in the chimp genome to the length of human pseudogenes. We assume the occurrence of substitution follows a Poisson distribution and the background substitution rate (null hypothesis mean) was set at 1.5%. The P
-value for pseudogene conservation was derived as the probability of that pseudogene having equal or fewer nucleotide mutations than it really has under the null hypothesis. We adjusted P
-values for multiple hypotheses testing using the Benjamini and Hochberg approach [62
]. All the pseudogenes were ranked by their P
-values from the most significant to the least significant. Pseudogenes with P
-values less than (False discovery rate × Rank/COUNT) were taken as significant, where false discovery rate is set to 0.05 and COUNT is the total number of pseudogenes tested. Conserved pseudogenes from mouse orthologs were calculated in the same manner, except the background substitution rate was set to 5%.
Chromatin segmentation using segway
Segway segmentation labels the genome using 25 different markers. Half of them are indicative of genomic activity (for example, transcription factor activity, gene body, enhancers), while the other half are repressive (for example, CTCF). We calculated the frequency of each marker in the pseudogenes and parent genes in a genome-wide fashion. All the frequencies were normalized with respect to the total segment distribution across the entire genome. Two different trends were observed globally for the parent genes: (a) TSS mark frequency is at least one order of magnitude larger than the frequency of the repressive marks; and (b) the frequency of the GE, GM and GS marks is, on average, five times larger than the frequency of the repressive marks. The segment distribution of the parent genes indicated enrichment in TSS, GS, e/GM (enhancer/gene body middle) and GE marks and was considered as a standard indicator for active chromatin.
Transcription factor binding sites in the upstream regions
TFBSs were studied using data from ENCODE ChIP-Seq experiments. In this study, we used the transcription factor occupancy data from the ENCODE 2011 January data freeze. The binding peaks of all the transcription factors were called by PeakSeq, with optimal settings to reduce the false negative results due to weak/poor biological replicates. A pseudogene was considered to have a TFBS if the majority of a peak for that transcription factor is located within the genomic region 2 kb upstream of the pseudogene.
ENCODE tier 1 and tier 2 cell lines (Gm12878, K562, Helas3, H1-hesc and Hepg2) with ChIP-Seq data for at least 40 transcription factors were included in this analysis. To avoid confusion with the transcription factor binding signals from neighboring genomic loci, 693 pseudogenes whose 5' ends are less than 4 kb away from the TSS of protein-coding genes were excluded. In the end, this study focused on 10,523 pseudogenes, where 876 are transcribed pseudogenes.
One confounding factor in the analysis is the different number of transcription factors studied in each cell line. However, we argue that the numbers here reflect the true tendency of TFBSs for pseudogenes since fairly comprehensive lists of transcription factors have been studied (74, 114, 53, 40 and 61 transcription factors in Gm12878, K562, Helas3, H1-hesc and Hepg2, respectively) and the results are consistent across all the different cell lines.