The recent completion of the DNA sequence of human chromosome 21 has provided the first look at the 225 genes that are candidates for involvement in Down syndrome (trisomy 21). A broad functional classification of these genes, their expression data and evolutionary conservation, and comparison with the gene content of the major mouse models of Down syndrome, suggest how the chromosome sequence may help in understanding the complex Down syndrome phenotype.
Down syndrome (DS), affecting one in 700 live births, is the most common genetic cause of mental retardation. The phenotype of DS is complex and variable in severity among individuals; it includes mental retardation and cognitive deficits, heart defects, hypotonia, motor dysfunction, immune system deficiencies, an increased risk of leukemia, and development of the pathology of Alzheimer's disease. Most commonly, DS is due to the presence of an extra copy of a complete chromosome 21 and it is assumed that the DS phenotypic features are a direct consequence of the overexpression of some number of genes contained within 21q (21p is largely made up of ribosomal RNA genes and other repeat sequences). Recently, the essentially complete sequence of 21q - 33.5 Mb - was finished, and 225 genes were identified by the application of a variety of experimental and computer-based approaches. The availability of this massive amount of new data has immediate importance to DS research. This review discusses the following issues: the reliability of gene identification; what is known or can be inferred about the biological function of the 225 identified genes; expression patterns of the novel genes; evolutionary conservation of, in particular, those genes lacking functional associations; inferences about the gene content of the major mouse models of DS and therefore the causes of the phenotypic differences among them; and reasonable next steps towards the goal of understanding the gene-phenotype relationships in DS. Throughout the following discussions, references to numbers and kinds of genes and additional analyses of 21q gene content are based on the data presented in .
Two hundred and twenty-five is a surprisingly small number for the complete gene content of approximately 1% of the human genome. It is significantly less than 1% of the 50,000-100,000 genes previously estimated in total for the human genome (see also ) and it is significantly less than the 545 genes identified on chromosome 22 in approximately the same amount of DNA. Previous data from the mapping of expressed sequence tags (ESTs) and genes, and efforts at cDNA selection, have consistently suggested that chromosome 21 was relatively gene-poor overall, and extremely so in some regions [6,7]. It could also be predicted that chromosome 21 would have fewer genes than chromosome 22. Approximately half of chromosome 21 is a large dark band when stained with Giemsa, and such bands are known to be gene-poor, while chromosome 22 is composed almost entirely of gene-rich R bands [8,9]. In addition, trisomy 21 is compatible with life, while trisomy 22 is not. Chromosome 21, therefore, was expected to be relatively gene-poor. Its extreme paucity of genes, however, justifies further consideration. In particular, are there consistent errors or weaknesses in gene-finding techniques that could have missed a significant proportion of genes? To see where errors may have accumulated, it is worth reviewing the gene identification methods.
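The size of the shortfall can be made concrete with a few lines of arithmetic. The Python sketch below uses only figures quoted in this review (225 genes, 33.5 Mb of 21q, 545 genes on chromosome 22, and the 50,000-100,000 genome-wide estimate); the function name and the returned field names are illustrative assumptions.

```python
def gene_count_comparison():
    """Compare the chromosome 21 gene count with naive expectations."""
    genes_chr21 = 225
    genes_chr22 = 545
    length_21q_mb = 33.5                 # sequenced length of 21q, in Mb
    genome_estimates = (50_000, 100_000)  # earlier whole-genome gene estimates

    # Chromosome 21 is roughly 1% of the genome, so a uniform gene
    # distribution would predict about 1% of the genome-wide total.
    expected = tuple(n // 100 for n in genome_estimates)

    return {
        "expected_range": expected,
        "observed": genes_chr21,
        "chr22_to_chr21_ratio": genes_chr22 / genes_chr21,
        "genes_per_mb_21q": genes_chr21 / length_21q_mb,
    }


summary = gene_count_comparison()
print(summary)
```

On these figures the observed count is less than half of even the lower naive expectation, and chromosome 22 carries well over twice as many genes.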
Genes were identified on the basis of the following types of data: identities or similarities to known proteins; identities to spliced ESTs; and patterns of consistent coding-exon prediction. First, protein matches identified genes that were identical or similar to known genes, and also found pseudogenes. With some minor corrections, all 107 genes associated with complete cDNAs that had been mapped previously to chromosome 21 (listed by Swiss-PROT in March 2000) were found. In addition, within 21q, 52 protein matches were classed as pseudogenes on the basis of a lack of introns and, most importantly, on the presence of multiple in-frame stop codons. Given the inability of transcripts from these genes to produce a complete protein, it is unlikely that any pseudogenes were incorrectly classified. Second, for EST matches, only those that showed evidence of splicing were used - that is, those that were non-contiguous with genomic sequence, showed consensus splice sites, and represented essentially perfect matches (>95% identity) to the genomic sequence. This eliminates many of the artifacts common to cDNA libraries. A survey of the EST database for fifty of the known chromosome 21 genes found that forty-three were present as spliced ESTs, six were present only as unspliced ESTs (five of these were intronless genes), and one was not present in dbEST.
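The pseudogene criterion (no introns plus multiple in-frame stop codons) is simple enough to illustrate in code. The following Python sketch is not the procedure used for the chromosome 21 annotation; the function names, the choice of reading frame 0, and the threshold of two stops standing in for "multiple" are assumptions made for the example.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def count_inframe_stops(cds, frame=0):
    """Count stop codons in one reading frame, ignoring the final codon.

    The last complete codon is skipped because a normal coding
    sequence is expected to end with a stop.
    """
    cds = cds.upper()
    codons = [cds[i:i + 3] for i in range(frame, len(cds) - 2, 3)]
    return sum(codon in STOP_CODONS for codon in codons[:-1])


def looks_like_pseudogene(cds, n_introns, min_stops=2):
    """Apply the two criteria: intronless and multiple in-frame stops.

    min_stops=2 is an assumed reading of "multiple".
    """
    return n_introns == 0 and count_inframe_stops(cds) >= min_stops


disrupted = "ATGTAAGGGTGACCCTAA"   # two premature stops before the final TAA
intact = "ATGGGGCCCTAA"            # only the terminal stop
print(looks_like_pseudogene(disrupted, n_introns=0))
print(looks_like_pseudogene(intact, n_introns=0))
```

A real analysis would score the frame implied by the protein alignment rather than a fixed frame, which is why the frame is left as a parameter here.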
Finally, the criterion of consistent exon prediction required that two of the three coding-exon prediction programs (Grail, Genscan and MZEF) agreed on the location of an exon, and that a minimum of three consistent exons were found within <60 kb, with introns <30 kb. It is noteworthy that the coding regions of intronless genes were well predicted but only as single exons. Such exons tend to be very large - greater than a kilobase (kb) in length - in contrast to typical coding exons that average 100-150 base pairs (bp). By these criteria, and after making an exception to include large single-coding-exon genes, all but one of the 107 known genes could be identified by exon prediction. This included very large genes, such as DSCAM, which spans >800 kb, and GRIK1, which spans >400 kb, both of which were well predicted through at least some of their coding regions.
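The consistent-exon criterion can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the annotation pipeline itself: the input layout (program name mapped to a list of exon intervals), the use of interval overlap as the test for "agreement", the reading of the distance limits as applying to runs of neighbouring exons, and all function names are hypothetical.

```python
from itertools import combinations

MAX_SPAN = 60_000    # three consistent exons must lie within < 60 kb
MAX_INTRON = 30_000  # gaps between neighbouring exons must be < 30 kb
MIN_EXONS = 3


def overlaps(a, b):
    """True if two (start, end) intervals share at least one base."""
    return a[0] <= b[1] and b[0] <= a[1]


def consistent_exons(predictions):
    """Return merged exon intervals supported by at least two programs."""
    hits = []
    for prog_a, prog_b in combinations(sorted(predictions), 2):
        for exon_a in predictions[prog_a]:
            for exon_b in predictions[prog_b]:
                if overlaps(exon_a, exon_b):
                    hits.append((max(exon_a[0], exon_b[0]),
                                 min(exon_a[1], exon_b[1])))
    merged = []
    for start, end in sorted(hits):
        if merged and start <= merged[-1][1]:
            # Same exon reported by a different pair of programs.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def indicates_gene(exons):
    """True if some run of MIN_EXONS neighbouring exons meets both limits."""
    exons = sorted(exons)
    for i in range(len(exons) - MIN_EXONS + 1):
        window = exons[i:i + MIN_EXONS]
        span = window[-1][1] - window[0][0]
        introns = [nxt[0] - cur[1] for cur, nxt in zip(window, window[1:])]
        if span < MAX_SPAN and all(gap < MAX_INTRON for gap in introns):
            return True
    return False


# Hypothetical predictions; coordinates are arbitrary.
predictions = {
    "grail": [(100, 250), (5000, 5120), (9000, 9150)],
    "genscan": [(110, 250), (5000, 5100)],
    "mzef": [(9010, 9150), (200000, 200100)],
}
supported = consistent_exons(predictions)
print(supported, indicates_gene(supported))
```

In this example three exons are each supported by two programs and fall within about 9 kb, so the region would be scored as containing a gene, while the isolated MZEF prediction at 200 kb is discarded.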
The important conclusion here is that each of the 107 genes previously known to map to chromosome 21 would have been identified, in the absence of protein similarities, by the criteria of EST matches plus exon prediction. These criteria do not, in most cases, define a complete gene structure, but they do successfully indicate the presence of a gene. Thus, unless novel genes have very different characteristics, it is reasonable to expect a similarly high level of success in their identification. Using these criteria a further 118 genes were identified.
What is likely to have been missed? First, there are gaps in the sequence of 21q. They are few (three) and small (<50 kb each), however, and therefore cannot harbor large numbers of genes. Second, genes that would not be identified would have to possess the following features: no similarity to any known protein; consistently very large introns (>30-60 kb), so that patterns of predicted exons would not be scored; and long intronless 3' untranslated regions (UTRs) or restricted and/or low expression levels, so that no spliced EST is present in dbEST. It is certainly possible that some number of genes with such characteristics exist; that they represent a significant proportion of chromosome 21 genes is unlikely, however. The distal one third of 21q is the most gene-rich (and GC-rich); but intergenic distances here are not large enough to accommodate additional genes with uniformly large introns. So, unless coding exons in these genes are for some reason not recognized, such genes would be scored on the basis of patterns of predicted exons. The proximal two thirds, in contrast, is uniformly AT-rich and does have large segments lacking gene features; indeed, there is one segment of approximately 7 Mb that harbors only seven genes. Here there is room for numerous genes that have large introns and restricted expression. One argument against this is a biological one: an individual who is monosomic for this region has only mild phenotypic abnormalities. A second argument is a general scarcity of any consistent exon prediction in the region, regardless of 'intron' size. If there are many coding exons within this region, they must also be largely unrecognized by prediction programs. Together, these data suggest that the total of 225 genes is likely to be reliable: false negatives should be few.
But what about the possibility of false positives? Genes with complete protein or cDNA sequences identical or highly similar to known genes (these are the class 1 and class 2 genes in ) are unambiguous. Gene models (classes 3 and 4), however, are still open to further investigation and interpretation. For example, some investigators will choose to disregard a specific match to a protein domain if the similarity is weak. How many exons to include in a model, and whether an EST should be included will also sometimes be debatable. Thus, details in the gene catalog of 21q should be considered provisional. Investigators should review the basis for specific gene predictions of interest (available at ).
DS can be considered as a contiguous gene syndrome, with almost the entirety of 21q as the relevant region. The segment of 21q22.2 that is referred to as the Down syndrome chromosomal region (DSCR) was defined to contain genes relevant to aspects of the DS phenotype on the basis of the phenotypes of several cases of partial trisomy 21 [14,15]. Data using a larger number of partial trisomy cases showed that only the most centromeric region of 21q could be excluded from containing relevant genes, in particular for mental retardation. It is assumed that overexpression of chromosome 21 genes, as a result of their presence in an extra copy, causes the DS phenotype. Are all chromosome 21 genes overexpressed? Can overexpression of some genes be tolerated with no phenotypic effect? How many genes are overexpressed and relevant? Currently, there are no answers to these questions. It is, however, worth considering what is known about the function of chromosome 21 genes.
Table 1 lists the 122 genes for which some functional association can be inferred. Functional inferences are based on partial or complete similarities of the chromosome 21 genes or gene models to proteins or protein domains for which experimental data have demonstrated a specific function. For example, ZNF295 is a gene model with an open reading frame that contains zinc finger domains. Some zinc finger proteins have been shown to be transcription factors, so ZNF295 is classed as such. In general, genes are classified as broadly as possible. For example, ITGB2 is classed only as a cell adhesion molecule, although, because it has been studied essentially only in lymphocytes, it is regarded as an immune system gene. Future studies may well reveal functions other than those that have been observed, so it is as well to speculate about the functions of genes as broadly as possible.
Every biologist will bring their own expertise to bear in deciding which of the genes in Table 1 are of greatest potential relevance to the DS phenotype. Transcription factors are attractive candidates because imbalance of one component of a transcription factor complex may alter the effectiveness of the activation or repression of transcription of target genes. Genes within the ubiquitin pathway may alter rates of target protein degradation. Cell adhesion was long ago postulated, with intriguing preliminary data, to play a role in altering rates and extents of cell migration during development. Overexpression of one potassium channel gene has been shown to dysregulate expression of other channel genes, affecting neuronal network excitability. If mental retardation and cognitive deficits are the primary focus of study, almost any of the categories in Table 1 could be relevant, such is the extent of our current understanding of the complex developmental processes leading to these conditions.
Only about half the 225 chromosome 21 genes have any functional association, and some of these are particularly weak - for example, the presence of a transmembrane domain is not very definitive. In some cases, the lack of protein or functional domain data may be due to the lack of complete coding sequence information. While awaiting the generation of complete cDNA sequences (which may be laborious to obtain), and even for further analysis of complete cDNAs lacking functional associations, expression patterns may help in prioritizing genes for further study. Of the novel genes with incomplete cDNAs, thirty-eight are represented by ESTs from Soares or CGAP cDNA libraries. Of these, only seven would be classed as ubiquitous in expression - that is, present in dbEST with more than 30 entries from numerous tissues. Twenty-six ESTs are each associated with fewer than five dbEST entries. Five of these ESTs are seen only in testes/prostate and three are seen only in fetal sources. While there are features of dbEST construction that can produce artifactual pictures of expression patterns, these data suggest that the novel genes within 21q may be largely of limited expression. In some cases at least, this is consistent with the failure to identify these genes previously.
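The dbEST-based expression classes used above can be written as a small rule. In the Python sketch below, the thresholds of more than 30 entries and fewer than five entries come from the text; the cut-off of five tissues standing in for "numerous tissues", the labels "rare" and "intermediate", and the function name are assumptions made for illustration.

```python
def expression_class(n_entries, n_tissues, min_tissues=5):
    """Classify a gene by its dbEST representation.

    n_entries: number of dbEST entries matching the gene
    n_tissues: number of distinct tissue sources among those entries
    min_tissues: assumed cut-off for "numerous tissues"
    """
    if n_entries > 30 and n_tissues >= min_tissues:
        return "ubiquitous"
    if n_entries < 5:
        return "rare"
    return "intermediate"


# Hypothetical counts for three gene models.
for entries, tissues in [(45, 12), (3, 1), (12, 3)]:
    print(entries, tissues, expression_class(entries, tissues))
```

Because dbEST sampling is uneven across libraries, such a rule can only rank genes roughly; it is a prioritization aid rather than a measurement of expression.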
For relevance to mental retardation and cognitive deficits, genes with brain-specific expression, such as PCP4, are of interest. Equally interesting are examples of brain-specific alternative processing, as is seen with Intersectin and DSCAM [22,23]. In an analysis of a number of novel gene models, alternative processing, some of it brain-specific, was observed in the majority of cases. It is unlikely that even most known genes have been examined thoroughly for instances of multiple transcripts. Because these may alter protein sequences and therefore function, their role in DS may be relevant.
Model organisms will provide the basis for functional studies of the known and novel chromosome 21 genes. The genomes of Saccharomyces cerevisiae, Caenorhabditis elegans and Drosophila [25,26,27] have been completely sequenced, and thus the complete set of proteins of each of these organisms is known. Annotation of the Drosophila genome identified approximately 13,500 genes. Comparison of the translations of all annotated chromosome 21 genes with the Drosophila set identified 23 chromosome 21 gene products with similarity to a Drosophila protein over the complete length. Many of these similarities involve basic biochemical/biological functions and include such proteins as SOD1 (superoxide dismutase), GART (a purine biosynthesis enzyme), CBS (cystathionine beta-synthetase), and those involved in RNA splicing and the ubiquitin pathway. A further set of 31 genes showed excellent informative matches but only over a domain or subregion of the human protein. Previously known homologs include MNB (minibrain) and SIM2 (single-minded). Perhaps most interesting in both sets are those genes for which there is little or no functional data. Table 2 lists some of the known and novel chromosome 21 genes with partial and complete similarities in Drosophila. Among the novel genes, identities at the amino-acid level range as high as 64% (c21orf19) and over as many as 1,600 residues (c21orf5). Additional details remain to be resolved; for example, in several cases the lengths of the human and Drosophila proteins are significantly different. Correcting these differences, if it is necessary, may strengthen the similarity data. In addition, defining complete cDNAs may reveal new homologies not discernible with partial gene models. Determining the phenotypes of mutants in the Drosophila genes is likely to shed light on the function of the homologous human genes.
Regions of human chromosome 21 are conserved within segments of three mouse chromosomes. The centromere-proximal region of chromosome 21 through the MX genes is homologous with the telomeric region of mouse chromosome 16 (Figure 1). The next approximately 2 Mb segment of chromosome 21 is homologous with the centromere-proximal region of mouse chromosome 17, and the telomeric 2 Mb of chromosome 21 is homologous with an internal segment of mouse chromosome 10. On the basis of current data, the order of chromosome 21 homologues in the mouse chromosome 16 and 10 segments appears to be completely conserved, although the boundaries of these regions are still approximate [28,29]. For example, the most centromere-proximal gene on chromosome 21 verified to map to mouse chromosome 16 is STCH. There are seven genes proximal to this that should be mapped in mouse. Similarly, although it is known that Mx maps to mouse chromosome 16 and Tff3, Cbs and Crya map to mouse chromosome 17, there are 11 genes between and among these that are of unknown map location in mouse. Lastly, PDXK is the most proximal chromosome 21 gene mapped to mouse chromosome 10. Genes in this region are relatively small, however, and additional chromosome 21 genes may be located on mouse chromosome 10 between Pdxk and the adjacent region homologous with human chromosome 19. Defining the endpoints of these homologous regions is critical for evaluating gene-phenotype correlations within existing mouse models and for designing new ones.
Currently, the best mouse models of DS are the mouse chromosome 16 segmental trisomies, Ts65Dn and Ts1Cje. Ts65Dn is trisomic for the region spanning an undefined distance proximal to App through Mx to presumably the telomere of chromosome 16. The phenotype of Ts65Dn includes working memory impairment and long-term memory deficits; delayed development and lower body weight; motor dysfunction; decreased responsiveness to pain; hyperactivity; and decreased ability to inhibit behavior (reviewed in [30,31]; see also [32,33]). Particularly interesting are observations of age-related loss of cholinergic neurons, decreased numbers of asymmetric synapses in the temporal cortex, abnormalities in neuron number in hippocampal regions, and deficiencies of beta-noradrenergic transmission within the hippocampus and cerebral cortex [34,35,36,37,38]. Some of these deficits have been observed in DS; others suggest new avenues of investigation. Knowing which genes cannot be responsible for the phenotype can be helpful. Table 3a lists the 32 genes found centromeric to the Alzheimer's-associated gene APP on chromosome 21. On the basis of current comparative mapping data, most of these may be present in only two copies in Ts65Dn and therefore would not contribute to its phenotypic features. The Ts1Cje mouse is a more recent model, and is trisomic for the region of mouse chromosome 16 from Sod1 through Mx (and again presumably to the telomere). While it has not yet been studied as thoroughly as Ts65Dn, there are phenotypic differences between the two mice. In contrast to Ts65Dn, Ts1Cje shows hypoactivity, no loss of cholinergic neurons, and no deficits in the visible platform part of the water maze tests (which tests only memory and not the ability to make spatial correlations). Table 3b lists 27 genes that are expected to be trisomic in the Ts65Dn but only disomic in the Ts1Cje, based on the genetic map.
It is tempting to conclude that these genes must account for the phenotypic differences, but it must be kept in mind that the two mouse strains have been produced on different genetic backgrounds, which may have phenotypic consequences.
Segmental trisomies for the regions of chromosome 21 homologous with mouse chromosomes 17 and 10 do not exist. If Mx is the most telomeric gene on mouse chromosome 16 and Pdxk is the most centromeric on mouse chromosome 10, there are 33 genes within the approximately 2.2 Mb of the mouse chromosome 17 region (Table 4) and 50 genes within the approximately 2.9 Mb of the mouse chromosome 10 region. Together with the maximum of 32 genes centromeric to APP that are not trisomic in Ts65Dn, this means that approximately half of the chromosome 21 homologous genes are not trisomic in this model. The phenotypic consequences of these genes must be assessed in some fashion, because the Ts65Dn lacks some features of DS. Constructing single-gene transgenic mice expressing each of these and then combining each with the Ts65Dn by breeding would be laborious and probably of limited success. An alternative is to generate additional segmental trisomies using the Cre-lox system.
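The estimate that about half of the chromosome 21 homologues are not trisomic in Ts65Dn follows directly from the gene counts given in this section. The short Python check below uses only those counts; the dictionary keys and variable names are illustrative assumptions.

```python
TOTAL_CHR21_GENES = 225

# Genes whose mouse homologues are expected to be disomic in Ts65Dn.
not_trisomic = {
    "centromeric_to_APP": 32,    # maximum; mouse chromosome 16, proximal to App
    "mouse_chr17_region": 33,
    "mouse_chr10_region": 50,
}

total_not_trisomic = sum(not_trisomic.values())
fraction = total_not_trisomic / TOTAL_CHR21_GENES

print(total_not_trisomic, f"{fraction:.0%}")
```

The sum uses the maximum of 32 genes centromeric to APP, so the true fraction may be somewhat lower if some of those genes prove to be trisomic in Ts65Dn.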
Analysis of the complete sequence of chromosome 21 has provided the first look at all candidate DS genes. The next steps require verifying and refining the predicted, incomplete gene models, defining new models as necessary, and isolating complete cDNAs for each gene. With complete coding sequences, protein sequences can be examined for motifs, domains, and biochemical characteristics that may suggest function. The most challenging problem will then be determining the functions of these genes and the other 'known' genes. While it is tempting to focus on genes whose protein characteristics suggest a hypothesis for relevance to some aspect of DS, the more than 100 genes distributed throughout the chromosome that have no functional association are too large a dataset to ignore. For these and other genes on 21q, detailed expression analysis may be informative. Demonstration that a gene shows increased expression in the trisomic state by northern blot or RT-PCR analysis, followed by RNA tissue in situ hybridization to define specific cell types, brain regions and developmental stages of expression, may help in selecting genes of greater or lesser interest.
The most direct assessment of function will require mutation or overexpression of individual genes or sets of genes. For these experiments, the 'complete' protein databases for S. cerevisiae, C. elegans and Drosophila will provide homologous genes that can be analyzed in more tractable systems. The increasing complexity of the zebrafish EST database will add another model organism system of increasing utility. Issues remain with all model organisms, however, of verifying correct gene structures, identifying orthologous genes versus merely homologous genes, and interpreting mutation and knockout data in one system versus overexpression in another. The ultimate model organism, of course, will remain the mouse. Multiple genes can be 'added' to the Ts65Dn using transgenics carrying bacterial artificial chromosomes (BACs), to look for enhanced DS-relevant phenotypes. The human sequence will be useful here in ensuring that clones are extensive enough to contain appropriate regulatory regions. Single-gene knockouts can also be 'subtracted' from the Ts65Dn mouse model, to search for amelioration of phenotype. With good biological intuition and luck, it may not be necessary to understand all of the genes within chromosome 21 before promising candidates are identified and the design of potential therapeutics can begin.
This work was supported by the Boettcher Foundation and Grant No. HD17449 from the National Institutes of Health.