We searched for all recent segmental duplications that were larger than 5 kb and showed greater than 90% sequence identity in both the February 2002 (numerical results for the February 2002 assembly are presented at our web site [
14]) and the February 2003 mouse genome sequence assemblies [
15]. Our method was based on pairwise (mega-) BLAST2 [
16] sequence comparisons between entire chromosome sequences. From our analysis of the February 2003 assembly, a total of 33.6 Mb (1.2%) of the genome sequence (2,695 Mb) was found to be involved in recent segmental duplications (Table ) and 8.9 Mb of this sequence was unmapped data (found in the unmapped chromosome sequence). On the basis of the 20 mapped chromosomes, more than 712 distinct intrachromosomal segmental duplications, comprising 19.9 Mb of sequence (Figure ), and 475 distinct interchromosomal duplications, comprising 7.1 Mb of sequence, were identified. We also found that 57% of the duplications were in tandem, which we defined as two related intrachromosomal duplicons located within 200 kb of one another.
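The classification used above (interchromosomal, intrachromosomal, and tandem) can be illustrated with a minimal sketch. The `Duplicon` type, the function name, and the choice to measure the 200 kb distance from edge to edge are assumptions made for illustration; the text only states that tandem duplicons lie within 200 kb of one another.

```python
from dataclasses import dataclass

TANDEM_MAX_GAP = 200_000  # 200 kb threshold stated in the text


@dataclass(frozen=True)
class Duplicon:
    """Hypothetical record for one copy of a duplication."""
    chrom: str
    start: int  # 0-based, inclusive
    end: int    # exclusive


def classify_pair(a: Duplicon, b: Duplicon) -> str:
    """Label a pair of related duplicons.

    Assumption: distance is the gap between the nearest edges of the
    two intervals (0 if they touch or overlap).
    """
    if a.chrom != b.chrom:
        return "interchromosomal"
    gap = max(max(a.start, b.start) - min(a.end, b.end), 0)
    return "tandem" if gap <= TANDEM_MAX_GAP else "intrachromosomal"
```

Under this sketch, tandem pairs are a subset of the intrachromosomal pairs, which matches the definition given in the text.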
| Table 1. Recent segmental duplication in the mouse genome |
Duplications can be found in all chromosomes analyzed, with chromosomes 6, 7, 17, and X having the highest, and chromosome 18 the lowest, duplicated content (Table , Figure ). Substantial amounts (8.9 Mb) of the duplicated content are found in the unmapped chromosome (ChrUn) sequence, suggesting that the correct chromosomal assignment of these segments remains a major assembly challenge. It is possible that small subsets of these duplications are due to chimeric reads and other sequencing artifacts and thus should not be part of the finished genome sequence. On the other hand, some of these unmapped duplicated sequences may represent true duplications that have been excluded from the assembly. One example of this occurs with a member of the mouse Bcl2 family of apoptosis regulators, Bcl2a1. The Bcl2a1 locus contains four highly similar genes (> 97% identical at the nucleotide level) that have been mapped together on chromosome 9 of the C57BL/6 and 129SV genomes [
17,
18]. Currently, the Bcl2a1 genes are not assembled on the mapped chromosome and are found in three distinct unmapped contigs. In the human genome only one copy of
BCL2A1 is found, although a recent, independent 8.5 kb tandem duplication containing the last exon of
BCL2A1 has occurred, forming a novel
BCL2A1-related transcript (AF249277). An example of a region that has changed between assemblies is the
Amy2 locus.
Amy2 is known to vary in copy number between inbred strains of mice [
19]. In the February 2002 assembly, only one copy of the
Amy2 gene resided on chromosome 3, with a second copy found on a large 10 kb unmapped contig. Partial high-identity matches (> 95%) to four distinct unmapped contigs were also found (note that these partial copies were not detected in our analysis because they are less than 5 kb long). In the February 2003 assembly, six
Amy2 genes exist, which is close to the five
Amy2-like genes that were detected in the genome of strain A/J mice using quantitative densitometry of Southern blots [
20]. It is, however, important to note that a gap, not bridged by a clone, still exists between the
Amy2 locus and the
Amy1 gene, and so the copy number in the C57BL/6J genome assembly may still vary.
We analyzed the distribution of segmental duplication content by sorting the duplications into six different sequence-similarity categories: 90-92%, 92-94%, 94-96%, 96-98%, 98-99.5%, and 99.5-100%, for both the February 2002 and 2003 assembly builds (Table ). The amount of duplication content appears to be unevenly distributed across these categories, with a distinct rise in the 94-96% category. This might suggest recent duplicative events in the mouse genome have not occurred at a steady rate. However, it is unclear at this point how these results were affected by the draft status of the genome assemblies. Between the 2002 and the 2003 assembly builds we found that the amount of duplication content is nearly the same within each percent category except for the 99.5-100% category, which contained 4.8 Mb of sequence in 2002 and 18.5 Mb in 2003 (Table ). Furthermore, we determined that the majority (88%) of the duplicated sequence in the 99.5-100% category occurred intrachromosomally, within 200 kb of each other. Using the assembly component tables (provided by UCSC [
21]), which contain information about the underlying makeup of the February 2003 genome assembly (shotgun-assembled scaffolds and BAC sequences), we found that 215/216 (99.5%) of these duplications involved a BAC sequence. Hence, we suspect that the large increase in near-identical duplications could be the result of sequence misassignment errors arising from the inherent difficulty of merging finished BAC sequence with shotgun sequence contigs.
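The binning of duplicated sequence into the six sequence-similarity categories can be sketched as follows. This is only an illustration: the function names are hypothetical, and assigning a boundary value (for example exactly 92%) to the upper category is an assumption, since the text does not say how boundaries were handled.

```python
from bisect import bisect_right

# Lower edges and labels of the six categories named in the text.
BIN_EDGES = [90.0, 92.0, 94.0, 96.0, 98.0, 99.5]
BIN_LABELS = ["90-92%", "92-94%", "94-96%", "96-98%", "98-99.5%", "99.5-100%"]


def identity_bin(pct_identity: float) -> str:
    """Return the category label for a percent-identity value.

    Assumption: a value on a boundary falls into the higher category.
    """
    if not 90.0 <= pct_identity <= 100.0:
        raise ValueError("percent identity must lie between 90 and 100")
    return BIN_LABELS[bisect_right(BIN_EDGES, pct_identity) - 1]


def bases_per_bin(duplications):
    """Sum duplicated bases per category.

    `duplications` is an iterable of (percent_identity, length_bp) pairs.
    """
    totals = {label: 0 for label in BIN_LABELS}
    for pct_identity, length_bp in duplications:
        totals[identity_bin(pct_identity)] += length_bp
    return totals
```

Running `bases_per_bin` separately on the duplications from each assembly would give per-category totals that can be compared between builds, as done in the paragraph above.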
| Table 2. Comparison between genome assemblies |
We previously observed that the human genome sequence assembled by Celera's WGS method [
22] showed poor quality in regions with near-identical segmental duplications [
23]. To assess the finishing status of duplicated regions in the WGS mouse genome assembly (February 2002 MGSCv3 assembly), we calculated the amount of unfinished sequence (regions with gaps or Ns) within the immediate neighborhood (20 kb) of each duplicon (the unmapped chromosome sequence was excluded from this analysis). We observed substantially higher amounts of unfinished sequence (number of Ns) in these regions. Whereas 8.0% of the assembly consists of Ns, regions harboring duplications contain an average of 12.2%. This average rises to 16.6% for duplications with more than 98% sequence identity (statistics can be obtained from our website [
14]). This suggests that the WGS assembler had difficulty assembling regions containing recent sequence duplication and that these regions are good candidates for finishing using clone resources.
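The measurement of unfinished sequence around a duplicon can be sketched as a simple count of N characters. The function names are hypothetical, and the window definition (the duplicon itself plus 20 kb on each side, clipped at the chromosome ends) is an assumption; the text specifies only a 20 kb neighborhood.

```python
FLANK = 20_000  # 20 kb neighborhood stated in the text


def n_fraction(seq: str) -> float:
    """Fraction of positions in `seq` that are N (case-insensitive)."""
    if not seq:
        return 0.0
    return seq.upper().count("N") / len(seq)


def window_n_fraction(chrom_seq: str, start: int, end: int,
                      flank: int = FLANK) -> float:
    """N fraction in a duplicon and its flanks.

    `start` and `end` are 0-based, half-open coordinates of the duplicon.
    Assumption: the window spans [start - flank, end + flank), clipped
    to the chromosome.
    """
    window = chrom_seq[max(0, start - flank):end + flank]
    return n_fraction(window)
```

Averaging `window_n_fraction` over all duplicons and comparing the result with `n_fraction` of the whole assembly would correspond to the comparison reported above.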
Using the NCBI Refseq and Ensembl mouse gene annotation, we identified 675 genes that mapped to duplicated regions of the mouse genome (a full list of genes can be obtained from our website [
14]); 414 of these genes were found to be fully contained within a segmental duplication, thus representing the best candidates for whole-gene duplication. While it is likely that some of these duplicate copies have become pseudogenes, others may have evolved specialized functions [
3]. Moreover, we sought to use the identified gene sequences, which were expressed sequence tags (ESTs) and/or cDNAs, as experimentally derived resources to help validate the genomic duplication content presented in this study. We aligned duplicated gene sequences to each genomic region using UCSC BLAT [
17] and determined their percent identity matches. Unambiguous gene-to-genomic identity matches were established for all 128 gene pairs we examined. Each gene sequence was mapped to its respective genomic region with at least 99.1% identity (examples are shown in Table ; a full table is available at [
14]). We also examined the identified duplicated genes using their InterPro protein-domain classification present in 608 Ensembl genes to see whether specific kinds of genes or protein domains have been preferentially duplicated. We found that genes containing protein domains related to signal transduction (rhodopsin-like G-protein-coupled receptor superfamily), olfaction (olfactory receptors, vomeronasal receptors), immunity (immunoglobulin/MHC, serine protease), and drug metabolism (cytochrome P450) are significantly enriched (by at least threefold) (Table ).
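The distinction drawn above between genes that map to a duplicated region and genes that are fully contained within one can be expressed as an interval test. This is a minimal sketch; the tuple layout `(chrom, start, end)` with half-open coordinates and the function names are assumptions made for illustration.

```python
def overlaps(gene, dup) -> bool:
    """True if the gene shares at least one base with the duplication."""
    return gene[0] == dup[0] and gene[1] < dup[2] and dup[1] < gene[2]


def contained(gene, dup) -> bool:
    """True if the gene lies entirely inside the duplication."""
    return gene[0] == dup[0] and dup[1] <= gene[1] and gene[2] <= dup[2]


def classify_gene(gene, duplications) -> str:
    """Return 'fully_contained', 'partial', or 'none' for one gene.

    `gene` and each duplication are (chrom, start, end) tuples with
    0-based, half-open coordinates (an assumed convention).
    """
    if any(contained(gene, d) for d in duplications):
        return "fully_contained"
    if any(overlaps(gene, d) for d in duplications):
        return "partial"
    return "none"
```

Genes labeled `fully_contained` would correspond to the candidates for whole-gene duplication, and those labeled `partial` to partial gene duplications.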
| Table 3. Examples of recent mouse gene duplications |
| Table 4. Protein domain enrichment found in recently duplicated mouse genes* |
From this list of genes, we performed a detailed analysis of
Mater, a maternal-effect gene of potential medical importance.
Mater encodes an autoantigen in a mouse model for human autoimmune premature ovarian failure [
24]. Knockout studies have shown that it is essential for early embryonic development in mice [
25].
Mater encodes a protein of 1,111 amino acids from a 3.5 kb transcript that spans 57 kb on mouse chromosome 7. A 42 kb segmental duplication involves two duplicons (DUP1, where
Mater is located; DUP2, where a novel
Mater2 is located) that are situated about 5 Mb apart and in an inverted orientation (Figure ). DUP1 and DUP2 are on average 91.1% identical over the entire 42 kb genomic region, with a 96.6% average in the exonic regions. Furthermore, we identified an intron-less
Mater pseudogene (
MaterP), which shares 87% DNA sequence identity with
Mater, at a location 10 Mb proximal to
Mater (Figure ; see Additional data files for a detailed comparative genomic analysis of the
Mater locus). The mapping locations of these duplications have been confirmed by fluorescence
in situ hybridization (FISH) (Figure ). Thus,
Mater serves as one example of a gene that has been knocked out in mice but for which there is a second, highly similar transcript whose biological role is not yet known.
In addition, we were interested in determining whether any of the 675 genes have undergone recent (≥ 90% sequence identity over ≥ 5 kb) and independent duplication in the human genome. Some of these genes could be recently evolving via the 'birth and death model of evolution' which has been used to describe the evolution of the major histocompatibility complex (MHC) and immunoglobulin multigene families [
26]. This model describes genes that are repeatedly created through duplication, with some genes becoming fixed while others are rendered nonfunctional by deleterious mutations [
26].
We examined the 675 duplicated mouse genes using best reciprocal BLAST hits to identify their putative human orthologs. We subsequently analyzed regions containing these putative orthologs for recent sequence duplication in the human genome. Sixteen of the 675 genes were found to be involved in recent, independent gene duplication in mouse and human (see Table ). Some of these regions containing whole-gene duplications are part of multigene families known to be evolving via duplication and are found in tandem duplicated arrays in both species (that is, the
Amy2,
H2-Q1,
Gsta1, and
Olfr54 genes). An interesting example of a recent and apparently independent whole-gene duplication that occurred in mouse and human involves
Bmp8a and a second intronic transcript
Oxct2. Of the partial gene duplications, the recent duplication within the
Tnxb gene and its human ortholog
TNXB (found at the MHC III locus of mouse chromosome 17 and human 6p21) is particularly intriguing. In humans, this locus consists of a tandem array of genes (
RP,
C4,
CYP21, and
TNXB (RCCX)), which, through gene duplication, can exist as mono-, di- and tri-modular forms in the Caucasian population [
27]. Recent studies have also shown the presence of a deletion haplotype in one individual, leading to a fusion of the
TNXA/
TNXB gene on one chromosome and a duplication of
CYP21 on the other chromosome [
28]. Furthermore, complex haplotypes of the complement genes (
C4A and
C4B) residing in the RCCX module have been characterized and postulated to have a role in individual susceptibility to infection and autoimmune disease [
29]. A closer inspection of the genomic region surrounding this recent duplication in the mouse reveals that the C57BL/6J duplication encompasses homologous genes (
Tnxb,
Slp (a
C4 paralog),
Cyp21a1, and
C4). As in humans, this orthologous region of the mouse genome has been shown to undergo multiple recombination events, giving rise to a variety of haplotypes [
30]. Overall, many of the genes that have recently experienced duplications in the mouse and human genomes are of biomedical and evolutionary interest. The complexity and polymorphic nature of these recent duplications underscore the need for, and the difficulty of, performing the detailed structural and functional analyses that will help discern their true genomic organization, evolutionary history, and biological implications.
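The best-reciprocal-BLAST-hit step used at the start of this analysis to identify putative human orthologs can be sketched as follows. The input layout (query, subject, score), the function names, and the rule that the first hit wins a tie are assumptions made for illustration; the identifiers in any example input would be placeholders, not results from this study.

```python
def best_hits(hits):
    """Map each query to its highest-scoring subject.

    `hits` is an iterable of (query, subject, score) tuples. On a tie,
    the first hit encountered is kept (an assumed tie-breaking rule).
    """
    best = {}
    for query, subject, score in hits:
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(mouse_vs_human, human_vs_mouse):
    """Return mouse-to-human pairs that are each other's best hit."""
    forward = best_hits(mouse_vs_human)
    reverse = best_hits(human_vs_mouse)
    return {m: h for m, h in forward.items() if reverse.get(h) == m}
```

A mouse gene is paired with a human gene only when each is the top-scoring hit of the other, which is what makes the pair a putative ortholog rather than a paralog match.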
| Table 5. Genes that have undergone recent duplication in both the mouse and human genome* |