Two hundred and twenty-five is a surprisingly small number for the complete gene content of approximately 1% of the human genome. It is significantly less than 1% of the 50,000-100,000 genes previously estimated in total for the human genome (see also [4]), and it is significantly fewer than the 545 genes identified on chromosome 22 in approximately the same amount of DNA [5]. Previous data from the mapping of expressed sequence tags (ESTs) and genes, and efforts at cDNA selection, have consistently suggested that chromosome 21 was relatively gene-poor overall, and extremely so in some regions [6,7]. It could also be predicted that chromosome 21 would have fewer genes than chromosome 22. Approximately half of chromosome 21 is a large dark band when stained with Giemsa, and such bands are known to be gene-poor, while chromosome 22 is composed almost entirely of gene-rich R bands [8,9]. In addition, trisomy 21 is compatible with life, while trisomy 22 is not [1]. Chromosome 21, therefore, was expected to be relatively gene-poor. Its extreme paucity of genes, however, justifies further consideration. In particular, are there consistent errors or weaknesses in gene-finding techniques that could have missed a significant proportion of genes? To see where errors may have accumulated, it is worth reviewing the gene identification methods.
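The size of the shortfall can be made concrete with a short calculation. The following is only an illustrative sketch: it assumes that genes would be spread uniformly across the genome, which the banding argument above suggests is not the case, and the function name is hypothetical.

```python
def expected_genes(total_genes: int, genome_percent: int) -> float:
    """Genes expected in a region under a uniform gene density (an assumption)."""
    return total_genes * genome_percent / 100


observed_chr21 = 225        # genes reported for 21q
observed_chr22 = 545        # genes reported for chromosome 22
genome_percent = 1          # each chromosome is roughly 1% of the genome

# Range of total gene counts estimated before the sequence was available.
for total in (50_000, 100_000):
    expected = expected_genes(total, genome_percent)
    share = observed_chr21 / expected
    print(f"total={total}: expected={expected:.0f}, observed/expected={share:.3f}")

# Direct comparison with chromosome 22 in about the same amount of DNA.
print(f"chr21/chr22 = {observed_chr21 / observed_chr22:.2f}")
```

Under the uniform-density assumption, the observed count falls well below either end of the expected range, which is the discrepancy the rest of the argument sets out to explain.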
Genes were identified on the basis of the following types of data: identities or similarities to known proteins; identities to spliced ESTs; and patterns of consistent coding-exon prediction. First, protein matches identified genes that were identical or similar to known genes, and also found pseudogenes. With some minor corrections, all 107 genes associated with complete cDNAs that had been mapped previously to chromosome 21 (listed by Swiss-PROT [10] in March 2000) were found. In addition, within 21q, 52 protein matches were classed as pseudogenes on the basis of a lack of introns and, most importantly, on the presence of multiple in-frame stop codons. Given the inability of transcripts from these genes to produce a complete protein, it is unlikely that any pseudogenes were incorrectly classified. Second, for EST matches, only those that showed evidence of splicing were used, that is, those that were non-contiguous with genomic sequence, showed consensus splice sites, and represented essentially perfect matches (>95% identity) to the genomic sequence. This eliminates many of the artifacts common to cDNA libraries. A survey of the EST database [11] for fifty of the known chromosome 21 genes found that forty-three were present as spliced ESTs, six were present only as unspliced ESTs (five of these were intronless genes), and one was not present in dbEST.
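The two filters described above can be sketched as simple predicates. This is a minimal illustration and not the pipeline actually used for the analysis: the `EstAlignment` record, the function names, and the reading of "multiple" as two or more in-frame stop codons are all assumptions made for the example.

```python
from dataclasses import dataclass

STOP_CODONS = {"TAA", "TAG", "TGA"}


@dataclass
class EstAlignment:
    """Hypothetical summary of one EST aligned to genomic sequence."""
    aligned_blocks: int           # separate genomic blocks the EST aligns to
    consensus_splice_sites: bool  # block boundaries show consensus splice sites
    percent_identity: float       # identity to the genomic sequence


def is_spliced_est(est: EstAlignment, min_identity: float = 95.0) -> bool:
    """Non-contiguous on the genome, consensus splice sites, >95% identity."""
    return (
        est.aligned_blocks >= 2
        and est.consensus_splice_sites
        and est.percent_identity > min_identity
    )


def count_inframe_stops(coding_seq: str) -> int:
    """Count stop codons in frame 0, ignoring a terminal stop codon."""
    seq = coding_seq.upper()
    codons = [seq[i:i + 3] for i in range(0, len(seq) - len(seq) % 3, 3)]
    if codons and codons[-1] in STOP_CODONS:
        codons = codons[:-1]
    return sum(codon in STOP_CODONS for codon in codons)


def looks_like_pseudogene(has_introns: bool, coding_seq: str,
                          min_stops: int = 2) -> bool:
    """Intronless protein match with multiple in-frame stops (assumed >= 2)."""
    return (not has_introns) and count_inframe_stops(coding_seq) >= min_stops
```

For example, `is_spliced_est(EstAlignment(3, True, 98.5))` passes the filter, whereas a single-block EST does not, however high its identity.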
Finally, the criterion of consistent exon prediction required that two of the three coding-exon prediction programs (Grail, Genscan and MZEF) agreed on the location of an exon, and that a minimum of three consistent exons were found within <60 kb, with introns <30 kb. It is noteworthy that the coding regions of intronless genes were well predicted, but only as single exons. Such exons tend to be very large, greater than a kilobase (kb) in length, in contrast to typical coding exons, which average 100-150 base pairs (bp). Once exceptions were made to include large single-coding-exon genes, all but one of the 107 known genes could be identified by exon prediction under these criteria. This included very large genes, such as DSCAM, which spans >800 kb, and GRIK1, which spans >400 kb, both of which were well predicted through at least some of their coding regions.
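The consistent-exon criterion can be sketched as follows. This is one possible reading rather than the procedure actually used: "agreement" between two programs is taken to mean overlapping predicted intervals, and the pattern test looks for three consecutive consensus exons spanning <60 kb with each intervening gap <30 kb. The function names and the input layout (a mapping from program name to a list of predicted intervals) are hypothetical.

```python
from itertools import combinations
from typing import Dict, List, Tuple

Interval = Tuple[int, int]  # (start, end) in genomic coordinates


def overlaps(a: Interval, b: Interval) -> bool:
    return a[0] <= b[1] and b[0] <= a[1]


def consensus_exons(predictions: Dict[str, List[Interval]]) -> List[Interval]:
    """Exons predicted by at least two programs (overlap counts as agreement)."""
    supported = set()
    for (_, xs), (_, ys) in combinations(predictions.items(), 2):
        for x in xs:
            for y in ys:
                if overlaps(x, y):
                    supported.add((min(x[0], y[0]), max(x[1], y[1])))
    # Merge intervals that overlap after pairing, so each exon appears once.
    merged: List[Interval] = []
    for start, end in sorted(supported):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def has_gene_pattern(exons: List[Interval], min_exons: int = 3,
                     max_span: int = 60_000, max_intron: int = 30_000) -> bool:
    """True if some run of min_exons exons fits the span and intron limits."""
    exons = sorted(exons)
    for i in range(len(exons) - min_exons + 1):
        window = exons[i:i + min_exons]
        span = window[-1][1] - window[0][0]
        introns = [b[0] - a[1] for a, b in zip(window, window[1:])]
        if span < max_span and all(gap < max_intron for gap in introns):
            return True
    return False


# Made-up coordinates, for illustration only.
example = {
    "Grail": [(1000, 1150), (20000, 20120), (45000, 45100)],
    "Genscan": [(1010, 1150), (20000, 20130)],
    "MZEF": [(44990, 45100), (90000, 90100)],
}
print(has_gene_pattern(consensus_exons(example)))
```

A gene whose introns all exceed the limit would yield consensus exons but no scored pattern under this sketch, which is the class of false negative considered later in the text.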
The important conclusion here is that each of the 107 genes previously known to map to chromosome 21 would have been identified, in the absence of protein similarities, by the criteria of EST matches plus exon prediction. These criteria do not, in most cases, define a complete gene structure, but they do successfully indicate the presence of a gene. Thus, unless novel genes have very different characteristics, it is reasonable to expect a similarly high level of success in their identification. Using these criteria, a further 118 genes were identified, giving the total of 225.
What is likely to have been missed? First, there are gaps in the sequence of 21q. They are few (three) and small (<50 kb each), however, and therefore cannot harbor large numbers of genes. Second, genes that would not be identified would have to possess the following features: no similarity to any known protein; consistently very large introns (>30-60 kb), so that patterns of predicted exons would not be scored; and long intronless 3' untranslated regions (UTRs) or restricted and/or low expression levels, so that no spliced EST is present in dbEST. It is certainly possible that some genes with such characteristics exist; that they represent a significant proportion of chromosome 21 genes is unlikely, however. The distal one third of 21q is the most gene-rich (and GC-rich), but intergenic distances here are not large enough to accommodate additional genes with uniformly large introns. So, unless coding exons in these genes are for some reason not recognized, such genes would be scored on the basis of patterns of predicted exons. The proximal two thirds, in contrast, is uniformly AT-rich and does have large segments lacking gene features; indeed, there is one segment of approximately 7 Mb that harbors only seven genes. Here there is room for numerous genes that have large introns and restricted expression. One argument against this possibility is biological: an individual who is monosomic for this region has only mild phenotypic abnormalities [12]. A second argument is a general scarcity of any consistent exon prediction in the region, regardless of 'intron' size. If there are many coding exons within this region, they must also be largely unrecognized by prediction programs. Together, these data suggest that the total of 225 genes is likely to be reliable: false negatives should be few.
But what about the possibility of false positives? Genes with complete protein or cDNA sequences identical or highly similar to known genes (these are the class 1 and class 2 genes in [3]) are unambiguous. Gene models (classes 3 and 4), however, are still open to further investigation and interpretation. For example, some investigators will choose to disregard a specific match to a protein domain if the similarity is weak. How many exons to include in a model, and whether an EST should be included, will also sometimes be debatable. Thus, details in the gene catalog of 21q should be considered provisional. Investigators should review the basis for specific gene predictions of interest (available at [13]).