Identifying proteins for examining gene space
Our strategy for examining gene space is to determine how well one can map complete proteins in unfinished genomes. Ideal proteins would be easily identifiable, present in all eukaryotes, and single copy. We had previously developed the CEGMA mapping protocol (10
) using data from the KOGs database (11
). Utilizing complete protein catalogs from six model organisms (H. sapiens
, D. melanogaster
, C. elegans
, A. thaliana, S. cerevisiae
and S. pombe
), we produced a set of 458 CEGs whose entire coding sequence can be reliably mapped in higher eukaryotes (by which we mean plants, animals and fungi). To find genes that tend to be single copy, we selected those genes that are present as a single copy in the KOGs database in at least four of the six species. Each KOG cluster contains one or more genes that are reciprocal best matches among these genomes. Some clusters contain genes that tend to duplicate while other clusters tend to exist in single copies. For example, KOG0157 (Cytochrome P450 CYP4/CYP19/CYP26 subfamilies), contains 87 genes in A. thaliana
, 14 in D. melanogaster
, 33 in H. sapiens
, 24 in C. elegans
, 2 in S. cerevisiae
and 1 in S. pombe
. However, KOG0261 (RNA polymerase III, large subunit) has only one orthologous protein in each species. After enriching for single-copy genes, the dataset was reduced to 248 proteins. Approximately 90% of these genes are single-copy in D. melanogaster, C. elegans, S. cerevisiae and S. pombe
. In H. sapiens
and A. thaliana
, multi-gene families were significantly reduced and 50% are single-copy genes in the set of 248 CEGs ().
Reducing the number of orthologs in the original set of 458 CEGs
The CEGMA mapping protocol was run using the reduced set of 248 CEGs against the six model organism genomes from which the set of core genes were originally identified, and as expected, CEGMA correctly identified all of the CEGs in these species (data not shown). The sole exception to this was the failure to predict one core gene in C. elegans. On inspection, we found that the gene in question (WormBase gene ID: WBGene00007698) is interrupted by the presence of another gene in one of its introns. CEGMA correctly predicts the first half of the gene upstream of the gene-containing-intron but not the rest of the gene. This suggests that the CEGMA may have problems with other genes whose structure is interrupted by large insertions of other genes. Non-canonical splice sites, selenocysteine codons, regulated frame-shifting, and other rare features may also confound CEGMA, but since most core genes do not exhibit these features, their effect should be minimal.
Mapping core genes in genomes of varying coverage
To determine the properties of the gene space in genome assemblies with varying levels of sequence coverage, we mapped the 248 core genes against multiple assemblies of C. briggsae
, T. gondii
and H. sapiens
. For C. briggsae
we randomly sampled reads at 2×, 4×, 6×, 8× and 10× coverage from the original 12× project (15
) and re-assembled them with an updated version of the Phusion assembler. For T. gondii
, we used the six assemblies (0.7×, 1×, 2×, 4×, 6× and 10×) that were available online (the latter two assemblies are scaffold-based, the rest are contig-based). Finally, for H. sapiens
, we created simulated draft genomes based on the known distribution of contig and scaffold sizes for several non-human WGS assemblies (see Methods section). These draft genomes are not true assemblies because they consist of sampled finished sequence rather than contigs built up from shotgun sequencing reads. As a result, the simulated human genomes do not mimic all the properties present in WGS assemblies. We find that the number of genes found by CEGMA in the simulated drafts is consistent with the original non-human genomes (Supplementary Table S1
). Therefore, while the simulated draft human genomes may not faithfully mirror WGS contigs, the content of the gene-space is similar.
In general, as the sequence coverage increases, the N50 length of contigs and scaffolds also increases, as does the number of mapped CEGs (). An exception to this is the 12 × C. briggsae
assembly, whose scaffold N50 length is shorter than the 8× and 10× reassemblies, and approximately equal to the 6× reassembly. We believe that this is because of improvements made to the Phusion assembler since the published 12× assembly was generated. The other exceptions are the 4.2× and 6.6× simulated human genomes, whose scaffold lengths are exceptionally long. This is because these simulated drafts were based on chimpanzee genome assemblies, which used a reference genome (human) to aid their construction (19
Assembly statistics and results of mapping 248 CEGs in C. briggsae, H. sapiens and T. gondii
Do mapped CEGs faithfully represent all genes?
Eukaryotic genomes can contain many thousands of genes, though these genes might not all be present in the sequence of an incomplete genome assembly. To determine if the proportion of mapped CEGs corresponds to the proportion of all genes that can be found, we mapped the latest, complete gene catalogs of C. briggsae, H. sapiens
and T. gondii
against genome assemblies at various levels of sequence coverage. It is important to choose a suitable cut-off for determining whether a gene is present or not, i.e. what percentage of each gene annotation needs to be present in an individual contig or scaffold. Choosing a cut-off value that is too low makes it possible to find nearly all of the genes in most of the assemblies, because even in very low-coverage assemblies, most genes are present as fragments (Supplementary Figure S1
). For instance, even in the low-coverage human 1.9× assembly, we can still find fragments of 18 528 of the 23 713 genes.
By using a cut-off of 65% (see Methods section) we find that there is a good overall correlation between the number of mapped CEGs and the total number of genes (), though there are interesting differences among the three species (Supplementary Figure S2
). The results for C. briggsae
show the closest agreement between the proportions of core genes and all genes that are mapped. The average discrepancy between the percentages of mapped CEGs and all genes is 0.96%. This suggests that the proportion of CEGs that can be mapped in the C. briggsae
genome, regardless of the level of sequence coverage, is a good approximation for the proportion of all genes that should be present. For H. sapiens
, the average discrepancy between the ability to map the two data sets is higher at 6.55%. The proportion of mapped CEGs in the human assemblies always provides an overestimation of the proportion of all genes that are present. This contrasts with T. gondii
, where we appear to underestimate the proportion of all genes that are present.
The overestimation in H. sapiens
is mainly caused by the high paralogy of human CEGs (32.3% of the 248 core genes had more than one orthologous protein). The other factor biasing the calculation was the length of the primary transcripts. Since the genes encoding CEGs tend to be slightly shorter than ‘normal’ genes (Supplementary Table S2
), they are more likely than a normal gene to be contained completely within a short scaffold. For instance, while 19% of transcripts from all human genes are longer than 50 000 bp only 9% of human CEG transcripts exceed that length. Because of the large number of pseudogenes in the human genome, we also suspected that some pseudogenes may contribute to the overestimation, and in some of the lower-coverage simulated draft genomes a small number of pseudogenes are incorrectly classified as CEGs (Supplementary Table S3
). However, in the full human genome sequence, no pseudogenes are classified as CEGs.
The underestimation of the number of real genes in the T. gondii
assemblies is apparent at all levels of sequence coverage. Since T. gondii
is an outgroup species, ~1900 million years diverged from the higher eukaryotes used to build the CEGs (20
), we sought to determine if evolutionary divergence was the source of error. We partitioned the set of 248 CEGs into four groups based on their overall degree of sequence conservation (Supplementary Table S4
). Group 1 contains the most divergent CEG proteins, and Group 4 contains the most highly conserved. All of the Group 4 proteins could be mapped in the 10 × T. gondii
genome, compared to about two thirds of the Group 1 proteins (Supplementary Table S5 and Figure 1
). For highly diverged genomes, it is therefore necessary to use a smaller set of only the most highly conserved CEGs in order to evaluate the completeness of the gene space.
Investigating gene space in 25 species
shows the results of mapping the 248 CEGs into 25 species from diverse phylogenetic groups (results from additional species are included in the supplemental spreadsheet file ‘other_genomes.xls’). In most genomes (18 of 25), we were able to map >90% of the full-length CEGs. Some of the genomes with the fewest mapped proteins were guinea pig (Cavia porcellus), elephant (L. africana) and domestic cat (F. catus) with 46.0, 46.0 and 58.1% of CEGs mapped, respectively. These values are slightly inflated due to high levels of paralogy and low sequence coverage in these species, but given that these genomes were sequenced at only ~2× coverage, it is somewhat surprising that their gene space is represented as well as it is. Some of the missing CEGs are present as fragmentary matches, and this is more pronounced in low coverage genomes (the above figures for mapped CEGs in the guinea pig, elephant, and cat rise to 68.1%, 68.5% and 75.8% when partial matches are also included).
Results of mapping CEGs against the genomes of various eukaryotes
In platypus (O. anatinus), we mapped only 74.6% of the CEGs despite the genome having 6× coverage. The N50 scaffold length of the genome is high (531 kb) but the average contig size is only 7232 bp with 28% of the genome sequence in fragments smaller than 20 000 bp. Since the average platypus transcript is 22 000 bp, many genes are not fully contained in a scaffold sequence and this leads to fragmentation of the gene space and a high number of partial predictions (10%).
The chicken (G. gallus) genome assembly is derived from 6.6× sequence coverage, yet we were only able to map 83.9% of the CEGs. The chicken sequencing consortium produced a comprehensive EST collection, so we mapped the ESTs against the 36 missing CEGs to determine if the CEGs were missing from the genome assembly or from the organism. Of the 36 missing genes, 29 have at least one matching EST, indicating that the missing genes are not missing from the organism. When the ESTs of these missing genes are aligned back to the genome, 45% could not be mapped to the genome sequence at all. Of the 55% of ESTs that did match the genome, about half matched unanchored sequences that are not integrated into the main genome assembly, and the remaining half had only partial matches to the main genome sequence (on average, matches occurred across just 40% of the length of the EST sequence).
Another species with a low fraction of mapped CEGs was T. spiralis. Because of errors in the initial estimate of genome size, the sequence coverage for this species may be as high as 30×. This is partially supported by contigs and scaffold sequences having much higher N50 sizes than the other nematodes in this study (data not shown). However, we still fail to map 15 CEGs in this species, although at least five of these missing genes are present as fragments. For seven cases we find the candidate locus but the coding sequence contains frame-shifts. Given that there are few paralogs and the genome is compact, it is unlikely that these are pseudogenes. Whether the frame-shifts are intrinsic properties of the genome, sequencing errors, or assembly artifacts is currently unknown.
Other species with lower than expected numbers of mapped CEGs include P. falciparum (75.0%), G. lamblia (46.4%), and to a lesser extent, Chlamydomonas reinhardtii (93.1%). These divergent genomes follow a similar pattern as T. gondii where highly conserved proteins are mapped more frequently than poorly conserved ones (). Consequently, the fraction of mapped CEGs is an underestimate of the completeness of the gene space. Considering only the most highly conserved Group 4 proteins, the genomes of P. falciparum and C. reinhardtii appear mostly complete with 98.4% and 96.9% of the CEGs represented, but G. lamblia has only 67.7%. Since G. lamblia is an outgroup to the higher eukaryotes used to build the CEGs, the low fraction of mapped CEGs may not be an accurate reflection of the state of the gene space.
Mapping results for six selected species in four subsets of core genes. Group 1 represents the least conserved of all CEGs and Group 4 the most conserved.
For the 19 species with available gene annotations, we find that that there is a good overlap between the proteins predicted by the CEGMA mapping protocol and those provided by the relevant genome consortia. However, in most cases (14 of 19) CEGMA predicts some core genes that are not present in the current annotations, and for a few species many CEGs appear missing from the annotations (e.g. CEGMA finds 228 CEGs in the honey bee genome, though only 173 appear in the available annotations). In a few cases (4 of 19) the consortium annotations include CEGs that are not mapped by the CEGMA protocol. This is more pronounced in the two protozoan species P. falciparum and G. lamblia and their evolutionary divergence would again be most likely to account for this.
We define the ‘paralogy index’ as the proportion of mapped CEGs with paralogs. This figure partly depends on genome coverage since CEGMA has difficulties sorting orthologs from paralogs in incomplete genomes. The species with the highest paralogy indexes tend to be higher plants and vertebrates (). Plants are represented by rice (Oryza sativa
) and the poplar tree (P. trichocarpa
) and these have the highest (P. trichocarpa
, 71.3%) and third-highest (O. sativa
, 51.6%) proportion of paralogs in the 25 species studied. The paralogy index for most mammals is in the 30–40% range. Within the vertebrates, the species with the lowest fraction of paralogs (13.0%) is the chicken (G. gallus
). This result is supported by data suggesting that the chicken genome has undergone extensive gene loss (21
Among the Caenorhabditids
, C. brenneri
is an outlier. The C. elegans, C. briggsae
and C. remanei
genome assemblies roughly match their expected sizes (3
), but the C. brenneri
assembly is nearly 200 Mb though its expected size is 150 Mb (http://genome.wustl.edu
). The paralogy index of C. brenneri
is also much higher (53.5%) than the others (5.7%, 8.1% and 15.5%). One possible reason that C. brenneri
is an outlier is that its genome assembly contains some heterozygosity, and this results in creating artificial paralogs and inflating the genome size.