We have established that human genome sequences encoding a novel protein domain, DUF1220, show a dramatically elevated copy number in the human lineage (>200 copies in humans vs. 1 in mouse/rat) and may be important to human evolutionary adaptation. Copy-number variations (CNVs) in the 1q21.1 region, where most DUF1220 sequences map, have now been implicated in numerous diseases associated with cognitive dysfunction, including autism, autism spectrum disorder, mental retardation, schizophrenia, microcephaly, and macrocephaly.
Although the data are only correlative at this point, we report here that these disease-related 1q21.1 CNVs either encompass or are directly flanked by DUF1220 sequences and exhibit a dosage-related correlation with human brain size. Microcephaly-producing 1q21.1 CNVs are deletions, whereas macrocephaly-producing 1q21.1 CNVs are duplications. Similarly, 1q21.1 deletions and smaller brain size are linked with schizophrenia, whereas 1q21.1 duplications and larger brain size are associated with autism. Interestingly, these two diseases are thought to be phenotypic opposites. These data suggest a model which proposes that (1) DUF1220 domain copy number may be involved in influencing human brain size and (2) the evolutionary advantage of rapidly increasing DUF1220 copy number in the human lineage has resulted in favoring retention of the high genomic instability of the 1q21.1 region, which, in turn, has precipitated a spectrum of recurrent human brain and developmental disorders.
At a fundamental level, evolution has been characterized as a change in the allele frequency of a gene. More precisely, it is an alteration in the frequency of a genome sequence and may or may not involve a gene. It has been proposed that the primary types of genome alterations that underlie evolutionary change are single-nucleotide substitution, chromosomal rearrangement, and gene duplication. In 1970, Ohno (1970) put forth the argument that gene duplication is a primary mechanism of evolutionary change, due to the relaxation of selection and increase in variation afforded by its built-in redundancy. This view has also been expressed by W.H. Li (1997): “There is now ample evidence that gene duplication is the most important mechanism for generating new genes and new biochemical processes that have facilitated the evolution of complex organisms from primitive ones.” More recently, a similar view has been expressed, albeit more succinctly, by E.E. Eichler (2001): “Exceptional duplicated regions underlie exceptional biology.”
Given the importance of gene duplication to evolutionary change, we initiated a collaboration with Jonathan Pollack at Stanford to generate the first genome-wide and first gene-based application of array comparative genomic hybridization (arrayCGH) across human and nonhuman primate lineages (Fortna et al. 2004). The approach surveyed humans and four great ape species (bonobo, chimpanzee, gorilla, and orangutan) using multiple individuals from each. The arrays were generated from >41,000 human cDNA clones (~24,000 genes), from which full inserts were amplified by polymerase chain reaction (PCR) and spotted. Each arrayCGH experiment represented a pairwise comparison of a reference DNA sample (always human) labeled with a green fluorescent dye and a test DNA sample (either human or a sample from one of the four great apes) labeled with a red fluorescent dye. Comparison of resulting array signals indicated that at least 1004 genes could be identified that gave changes in hybridization intensity consistent with lineage-specific gains or losses in gene copy number (Fig. 1). Several sets of control experiments were used to verify that the lineage-specific signals resulted from copy-number changes rather than from sequence divergence.
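The comparison logic described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the analysis pipeline of Fortna et al. (2004): the function names, the 0.5 log2 threshold, and the requirement that all individuals of a species agree are assumptions made for the example. Because the reference DNA is always human, a human lineage-specific gain appears as a consistent negative log2 ratio in every nonhuman test species.

```python
import math


def log2_ratio(test_signal, reference_signal):
    """log2(test / reference); positive values suggest a gain in the test sample."""
    return math.log2(test_signal / reference_signal)


def lineage_specific_call(ratios_by_species, threshold=0.5):
    """Return (species, 'gain' | 'loss') for a candidate lineage-specific change.

    ratios_by_species maps a species name to the log2 ratios of its
    individuals, each measured against a human reference. The threshold
    is a hypothetical cutoff chosen for illustration only.
    """
    calls = {}
    for species, ratios in ratios_by_species.items():
        if all(r >= threshold for r in ratios):
            calls[species] = "gain"
        elif all(r <= -threshold for r in ratios):
            calls[species] = "loss"
        elif all(abs(r) < threshold for r in ratios):
            calls[species] = "none"
        else:
            return None  # individuals of one species disagree

    changed = {s: c for s, c in calls.items() if c != "none"}
    if "human" in changed:
        # A human-vs-human signal points to polymorphism, not a lineage event.
        return None
    if len(changed) == 1:
        return next(iter(changed.items()))

    nonhuman = [s for s in calls if s != "human"]
    same_direction = len(set(changed.values())) == 1
    if len(nonhuman) > 1 and len(changed) == len(nonhuman) and same_direction:
        # Every nonhuman species shifts the same way relative to the human
        # reference, so the change is attributed to the human lineage.
        direction = next(iter(changed.values()))
        return ("human", "gain" if direction == "loss" else "loss")
    return None
```

A call of this kind only flags candidates; as noted above, control experiments are still needed to separate copy-number change from sequence divergence.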
Analysis of resulting arrayCGH data not only identified numerous candidate genes for a wide range of lineage-specific traits, but, by parsimony, also allowed one to determine when, in recent primate evolution, each of these events was likely to have occurred. In this manner, a genome-wide evolutionary history of lineage-specific gene-copy-number gain (duplication) and loss could be reconstructed covering the past 16 million years, extending from the last common ancestor (LCA) of the human and great ape lineages to the present (Fortna et al. 2004).
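The parsimony step can be illustrated with a minimal sketch, assuming the simplest case in which one event explains the data. If the set of species showing a change matches exactly one clade of the accepted species tree, a single event on the branch leading to that clade accounts for it. The nested-tuple tree encoding and the function names below are assumptions made for this example, not the method used in the original study.

```python
def leaves(node):
    """Return the set of species names under a tree node."""
    if isinstance(node, str):
        return frozenset([node])
    return frozenset().union(*(leaves(child) for child in node))


def clades(node):
    """Yield every clade (subtree) of the tree, including single species."""
    yield node
    if not isinstance(node, str):
        for child in node:
            yield from clades(child)


def place_single_event(tree, changed_species):
    """Return the clade whose ancestral branch explains the change with one event.

    Returns None when the changed species do not form a clade, in which
    case more than one event (for example, independent gains) is needed.
    """
    target = frozenset(changed_species)
    for clade in clades(tree):
        if leaves(clade) == target:
            return clade
    return None


# Human and great ape relationships, written as nested tuples.
GREAT_APE_TREE = ((("human", ("chimpanzee", "bonobo")), "gorilla"), "orangutan")
```

With this tree, a change seen only in chimpanzee and bonobo is placed on the branch leading to the Pan clade, whereas a change seen in gorilla, chimpanzee, and bonobo but not human does not correspond to a single clade and therefore requires more than one event.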
In addition, the data revealed that gene-copy-number changes showed a strong positional bias with respect to genome location and, in many cases, corresponded to regions known to be evolutionarily active. Among the most prominent of these were 1q21.1 (where the majority of DUF1220 domain sequences map; Fig. 2), the pericentromeric region of chromosome 9, and the fusion site on chromosome 2, which is the site at which two ancestral ape chromosomes fused to generate human Chr 2. Whereas 1q21.1 and 9p13/9q13 showed the highest concentration of genes giving human lineage-specific duplications, the fusion region on chromosome 2 contained a number of virtually contiguous genes that exhibited dramatic copy-number changes specific for several different lineages, all within a narrow (400 kb) genomic interval (Fortna et al. 2004).
There was also a pronounced enrichment of genes that showed increases in copy number specifically in the African Great Apes, compared to humans and orangutans. This finding has been reinforced by a recent report (Marques-Bonet et al. 2009) indicating that a burst of gene duplication occurred in the African Great Ape lineages. Interestingly, at least some of these African Great Ape–specific gene-copy-number increases may represent independent expansions within the African Great Ape clade; e.g., the gene was duplicated in the gorilla independent of its duplication in the Pan (chimp/bonobo) lineages (Fortna et al. 2004; Marques-Bonet et al. 2009).
Because of the highly informative results that we obtained from a comparison of the human and great ape genomes, we extended the application of arrayCGH to five evolutionarily more distant primate lineages (Dumas et al. 2007). By using human cDNA arrays and retaining humans as the reference DNA in all arrayCGH experiments, we were able to interrelate data from all 10 primate species tested: human, bonobo, chimp, gorilla, orangutan, gibbon, macaque, baboon, marmoset, and lemur. This series of experiments further extended the period of primate evolutionary history that we could survey. The most distant comparison was between lemur and human, the LCA of which is estimated to have occurred ~60 million years ago. Although we were aware that sequence divergence may contribute to arrayCGH signals, and that this contribution would become more pronounced as more distantly related species were compared, these data indicated that valid copy-number changes could be detected even among the most distant primate species. For example, many genes gave strong red signals in the lemur-to-human comparison (indicative of copy-number gains in lemur relative to humans). This indicated that the method could reliably detect copy-number gains between these highly divergent species, and that the underlying lemur-specific copy-number expansions must be dramatic enough to override any contribution that sequence divergence was making to the arrayCGH signals.
This study identified 4159 genes that showed lineage-specific copy-number changes among these 10 primate species, including many that were specific to multiple lineages (Fig. 3). Among these were genes that were (1) increased in human and ape lineages relative to monkey lineages, (2) increased only in Old World monkeys relative to the other primates tested, and (3) increased in African Great Apes relative to the other species tested. If one applies parsimony to the results, the copy-number changes can be positioned at specific points in primate evolutionary time.
The number of genes showing lineage-specific copy-number changes for each lineage showed a general correlation with the age of the lineage (the age of the LCA with humans). Although the gibbon was an exception, showing an increased number of lineage-specific changes relative to its evolutionary age, this was consistent with the unusually high degree of chromosomal rearrangement that has been found within this species. In agreement with the overall trend, application of a tree-building program to this extended arrayCGH data set recapitulated the established evolutionary relationship of the species to a very high approximation (Fig. 4) (Dumas et al. 2007).
Among the genes identified in the human and great ape arrayCGH study were 134 that exhibited hybridization signals consistent with human lineage-specific increases in copy number. To characterize these more completely, the cDNA inserts of each clone were sequenced and used as BLAT queries against available primate genome sequences (Popesco et al. 2006). The most striking finding was that one gene, MGC8902, encoded by cDNA IMAGE:843276, gave more than 49 hits in the human genome but only 10 and 4 in chimp and macaque genomes, respectively. On the basis of BLAT hit span size, all human hits were shown to contain both exonic and intronic sequences and thus were not the products of retrotransposition. Further investigation of the IMAGE:843276 cDNA insert sequence indicated that it encoded six closely spaced copies of a predicted PFAM protein domain of unknown function (DUF1220). Follow-up analysis of DUF1220 domains indicated that they are ~65 amino acids in length and composed of a two-exon doublet (Fig. 5).
Sequences encoding DUF1220 domains continued to show a human lineage-specific increase in copy number when the five additional primate lineages were compared. PFAM has divided DUF1220 sequences into 11 seed domains, only one of which (O75042) is found outside primates, and there only as a single copy. All of the remaining 10 seed domains are primate specific, and, of these, Q8IX162 gave the highest number of BLAT hits in the human genome. Remarkably, 37 of the 90 hits produced by Q8IX162 in the human genome gave a perfect sequence match. In humans, DUF1220 domains are found in 20–34 genes (also called the NBPF family; Vandepoele et al. 2005; Popesco et al. 2006), in which they are present in 1 to more than 50 copies. Analysis of predicted DUF1220-encoding proteins indicates that the genes appear to be virtually devoid of other domains or functional signatures (Fig. 6), suggesting that the primary focus of selection was on increasing the number of DUF1220 domain copies in the genome, rather than integration of DUF1220 copies into known genes.
When all 11 DUF1220 seed domains are used for BLAT analysis and redundant hits are removed, we estimate that the human genome encodes 212 DUF1220 domains, whereas chimp, macaque, and mouse genomes encode 34, 30, and 1, respectively (Popesco et al. 2006). It should be noted that, with the exception of humans, these estimates are based on draft genome sequences, and these numbers are likely to vary when more recent genome assemblies are surveyed.
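One plausible way to remove redundant hits when several seed domains are used as queries is to merge hits that overlap at the same genomic location and count the merged intervals. The sketch below assumes this overlap-merging criterion; the exact redundancy rule used by Popesco et al. (2006) is not given here, and the function name and coordinates are hypothetical.

```python
def merge_redundant_hits(hits):
    """Collapse overlapping hits into non-redundant genomic intervals.

    hits is an iterable of (chromosome, start, end) tuples, possibly
    produced by different seed-domain queries. Two hits on the same
    chromosome are treated as redundant when their coordinates overlap
    (an assumed criterion, chosen for illustration).
    """
    merged = []
    for chrom, start, end in sorted(hits):
        if merged and merged[-1][0] == chrom and start <= merged[-1][2]:
            # Overlaps the previous interval: extend it instead of adding a new one.
            prev_chrom, prev_start, prev_end = merged[-1]
            merged[-1] = (prev_chrom, prev_start, max(prev_end, end))
        else:
            merged.append((chrom, start, end))
    return merged


# Hypothetical hits from two seed queries; the first two overlap.
example_hits = [
    ("chr1", 100, 300),
    ("chr1", 250, 400),
    ("chr1", 1000, 1200),
    ("chr2", 100, 300),
]
domain_estimate = len(merge_redundant_hits(example_hits))
```

The length of the merged list then serves as the copy-number estimate for that genome assembly.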
DUF1220 domains are thought to have undergone recent positive selection based on elevated Ka/Ks ratios (Popesco et al. 2006). Western analysis with antibodies directed against a human DUF1220 peptide shows DUF1220-positive signals in several human tissues, including the brain. Immunocytochemistry of postmortem brain indicates that DUF1220 proteins are expressed exclusively in neurons, and they seem to be enriched in cell bodies and dendrites (Popesco et al. 2006). Although DUF1220 sequences are found at three cytogenetic locations on chromosome 1, the great majority map to 1q21.1, which, along with the pericentromeric region of chromosome 9, is one of the two genomic regions that show the highest concentration of genes that exhibit human lineage-specific increases in copy number (Fig. 7).
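For readers unfamiliar with the Ka/Ks ratio mentioned above, the following is a minimal sketch of how such a ratio can be computed from pairwise counts. It uses proportions of differing sites with a Jukes-Cantor correction for multiple substitutions; which estimator was actually used in Popesco et al. (2006) is not stated here, so the function names and the choice of correction are assumptions. A ratio above 1 is conventionally read as a sign of positive selection.

```python
import math


def jukes_cantor(p):
    """Jukes-Cantor corrected distance for a proportion p of differing sites.

    Only defined for p < 0.75; larger values raise a math domain error.
    """
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


def ka_ks(nonsyn_diffs, nonsyn_sites, syn_diffs, syn_sites):
    """Ratio of nonsynonymous (Ka) to synonymous (Ks) substitution rates.

    Inputs are counts of observed differences and of available sites in
    each class. Raises ZeroDivisionError if no synonymous differences
    are observed.
    """
    ka = jukes_cantor(nonsyn_diffs / nonsyn_sites)
    ks = jukes_cantor(syn_diffs / syn_sites)
    return ka / ks
```

In this sketch, a pair of sequences with proportionally more nonsynonymous than synonymous change yields a ratio above 1, and the reverse yields a ratio below 1.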
During the last few years, CNVs in the 1q21.1 region have been implicated in an increasing number of human diseases, including idiopathic mental retardation (de Vries et al. 2005; Sharp et al. 2006), autism/autism spectrum disorder (Sharp et al. 2006; Autism Genome Project Consortium 2007; Mefford et al. 2008), congenital heart disease (Christiansen et al. 2004), schizophrenia (ISC 2008; Stefansson et al. 2008; Walsh et al. 2008; Need et al. 2009), microcephaly/macrocephaly (Brunetti-Pierri et al. 2008; Mefford et al. 2008), and neuroblastoma (Vandepoele et al. 2005; Diskin et al. 2009). These results have provided a number of provocative hints as to the possible role DUF1220 amplification may have in human evolution and human disease. For example, the evolutionary gene-copy-number studies that we have reported show that DUF1220 copy-number increases roughly parallel the increase in relative brain size, and/or neocortex expansion, that has occurred over recent primate and especially human evolution (Popesco et al. 2006; Dumas et al. 2007). Given this observation, it is intriguing that two reports that studied 1q21.1 CNVs and brain size found that deletions were associated with microcephaly, and duplications were associated with macrocephaly (Brunetti-Pierri et al. 2008; Mefford et al. 2008). Although no mention was made in these reports that DUF1220 copy number might underlie or be related to these phenotypes, we note here that these CNVs either encompassed or flanked DUF1220 domains (which, because of their highly duplicated nature, were not directly interrogated in these studies), raising the prospect that DUF1220 domain copy number (i.e., DUF1220 dosage) may be causally related to the observed differences in human brain size.
We point out here that a similar argument can be made regarding the 1q21.1 CNVs that have been reported to be involved in autism and schizophrenia. These diseases, as well as a number of pairs of other disorders, have been labeled genomic “sister disorders” by virtue of their tendency to exhibit diametrically opposite phenotypes and be caused by duplications versus deletions of the same genomic sequences (Crespi et al. 2009). Among the opposing phenotypes that distinguish autism and schizophrenia are a larger and smaller brain size, respectively (Crespi and Badcock 2008). It is therefore intriguing that the 1q21.1 CNVs that have been reported to underlie autism and schizophrenia tend to be duplications and deletions, respectively. In addition, these 1q21.1 CNVs, as with those underlying microcephaly and macrocephaly, encompass or are flanked by DUF1220 domains and span the same genomic interval. Taken together, these observations support the involvement of DUF1220 domains in the brain size differences that are known to exist between autistic and schizophrenic populations and between microcephalic and macrocephalic populations, and they suggest that a mechanistic link exists among 1q21.1 instability, the evolutionarily rapid DUF1220 copy-number amplification that has occurred in the human lineage, and the high prevalence of these diseases in human populations.
During the past several years, a number of microcephaly disease genes have been identified and these provide another link between DUF1220 and brain size. Specifically, it has been shown that of the half-dozen or so genes identified that cause microcephaly, a majority appear to encode proteins that are associated with the centrosome and control of the cell cycle (Bond and Woods 2006). Cell cycle control and the timing of when and where cells switch from symmetric to asymmetric cell division have been postulated to be key factors in changes in neuron number and brain size that have characterized mammalian and primate brain evolution (Rakic 1995). Given these observations, it is noteworthy that the ancestral DUF1220 domain (found once in mouse/rat and other nonprimate mammals) is found on myomegalin (PDE4DIP), a centrosomal protein whose gene is a homolog of CDK5RAP2, a non-DUF1220 microcephaly disease gene that also encodes a centrosomal protein (Bond et al. 2005; Dumas et al. 2007).
Recently, it has been reported that DUF1220 domains are one of only a small number of core duplicons that exist in the human genome (Jiang et al. 2007). These highly duplicated sequences have been the focal point for large-copy-number expansions in the human lineage, and in many cases, each core duplicon has been instrumental in recent chromosome-specific copy-number expansions. Typically, core duplicons are flanked by a mosaic of other duplicated sequences that appear to have been disseminated by being carried along during core duplicon transpositions. The DUF1220 domain duplicon appears to be responsible for much of the duplicated sequences found in the pericentromeric region of chromosome 1 (Jiang et al. 2007). Core duplicon sequences are often interspersed and separated by single- or low-copy-number genes and often promote nonallelic homologous recombination (NAHR) events. Such a duplicon-rich genome architecture can often lead to NAHR between distantly spaced highly similar sequences (e.g., those of a core duplicon) that simultaneously produce duplications or deletions of the intervening sequences (segmental aneusomy). These intervening sequences often contain dosage-sensitive genes that are carried along with core duplicon duplications and deletions, resulting in disease-causing CNVs. The fact that DUF1220 sequences are associated with the numerous disease-related 1q21.1 CNVs mentioned above is fully consistent with such predictions of core duplicon behavior.
On the basis of these observations, we propose a model that links DUF1220 sequences to both human disease and human brain evolution. The model proposes the following specific testable hypotheses: (1) Increasing DUF1220 copy number may be related to increases in brain size; this is based both on a general correlation of DUF1220 copy number with brain size between species (Fortna et al. 2004; Popesco et al. 2006; Dumas et al. 2007), as well as within the human population (Brunetti-Pierri et al. 2008; Mefford et al. 2008), (2) 1q21.1 instability promoted the rapid evolutionary copy-number increase of DUF1220 sequences, and (3) the evolutionary advantage that increased DUF1220 copy number conferred favored retention of high 1q21.1 instability, which, in turn, has resulted in many recurrent deleterious duplications (macrocephaly, autism) and deletions (microcephaly, schizophrenia) of dosage-sensitive 1q21.1 genes.
It is also noteworthy that the extreme instability of the 1q21.1 region increases the chances that somatic or de novo germ-line CNVs will occur, an observation that provides a potential explanation for the frequent examples of non-Mendelian inheritance that have been found to be associated with schizophrenia and autism. Finally, this mechanistic linking of 1q21.1 genomic instability with an evolutionarily favored process (i.e., DUF1220 copy-number increase) provides a means of reconciling the central paradox associated with schizophrenia and autism. Namely, why do diseases that are clearly maladaptive and have a significant genetic component persist at unusually high frequency throughout human populations?
The 1q21.1 region is highly complex and shows a disproportionately high number of segmental duplications and sequence gaps (18), making precise assessment of its sequence content and organization difficult. As a result, caution should be exercised when relying on the current human and nonhuman primate genome sequences for 1q21.1. Before the complete evolutionary ontogeny of DUF1220 sequences can be reconstructed, and before direct comprehensive comparisons of the 1q21.1 region can be made among different individuals and among different species, a much more complete genome assembly of the region will be required. One potentially useful strategy will be the use of a haploid genomic resource: a human hydatidiform mole bacterial artificial chromosome (BAC) library that has been constructed with this type of application in mind. Because of the repeat-rich nature of 1q21.1 and other genomic regions, the development of longer read sequencing technology would also provide a valuable tool for accurate genome finishing (Eid et al. 2009). Currently available sequencing platforms have relatively short read lengths, a limitation that prevents accurate assembly of many of these repeat-rich regions. Such regions are increasingly being implicated in human disease, yet, because of this sequencing limitation, they remain largely unexamined. Finishing 1q21.1 will provide a framework both for correctly annotating the DUF1220 domain content and organization in the human genome, and for precisely defining breakpoints and content of the many disease-related CNVs that are being identified in this evolutionarily important genomic region.
Presently available data suggest that the striking increase in DUF1220 domain copy number in recent evolutionary time has occurred because there is a clear adaptive advantage to greater numbers of DUF1220 domains and that this advantage and selection process has persisted throughout primate, and especially human, evolution. Indeed, as we have previously reported, sequences encoding DUF1220 domains are virtually all primate specific, show signs of strong positive selection, and are increasingly amplified generally as a function of a species’ evolutionary proximity to humans, where the greatest number of copies (212) is found (Popesco et al. 2006). On the basis of the genome organization of DUF1220 sequences, the large number of human copies is not likely to have arisen by a single event, but rather by a series of small and large incremental increases, each of which conferred an adaptive advantage. This series of increases (involving multiple rounds of both gene and domain duplications) is likely to have been promoted by the high degree of genomic instability associated with the human 1q21.1 region. Such instability would have more frequently produced duplications and deletions in the region, with the DUF1220 duplications conferring a selective advantage, which in turn would result in retention of the instability in those who had more DUF1220 copies. This process could be viewed as a recurring cycle in which the increased 1q21.1 instability facilitated DUF1220 copy-number increases in certain individuals, followed by selection of individuals who exhibited the increased DUF1220 copy number, resulting in retention of the 1q21.1 instability in these individuals. In this manner, high 1q21.1 instability may have been selected for and retained in the human lineage because it more rapidly produced increases in DUF1220 copy number.
The 1q21.1 instability is a product of the genome architecture of the region and, because it works by promoting rearrangements in sequence organization and copy number, many deleterious events can be expected to occur along with the beneficial DUF1220 duplications. In this regard, the large number of recent reports implicating 1q21.1 CNVs in multiple human diseases should not be unexpected and should be viewed as a natural outcome of the genomic instability that is being favored because of the adaptive value that more DUF1220 copies may be conferring. In summary, the high number of 1q21.1 CNVs that are disease-causing may be the price that our species paid, and continues to pay, for the adaptive benefit of large numbers of DUF1220 domains.
We thank the individuals who have contributed to the data presented here, including J. Pollack, A. Fortna, M. Popesco, E. MacLaren, Y. Kim, M. O’Bleness, J. Keeney, J. Hopkins, A. Karimpour-Fard, M. Cox, R. Berry, L. Meltesen, L. McGavran, G. Wyckoff, and L. Jorde. J.M.S. is supported by National Institute of Mental Health grant R01 MH81203, NIAAA grant 2 R01 AA11853, and a Butcher Foundation grant.