Over the past decade, genomic diversity studies have focused mainly on single nucleotide polymorphisms (SNPs) and short tandem repeats (STRs), in humans as well as in other organisms. More recently, the International HapMap Project directed efforts toward understanding haplotype structure in human populations and helped shape our perception of genomic diversity based on SNPs [1]. At the same time, Olson and Varki [5] argued that understanding our own genome would not be complete without an evolutionary perspective, which requires knowledge of the genomes of our closest primate relatives [5]. Previous direct comparisons of human and chimpanzee sequences have used either a single chromosome [7] or the entire genomes [8]. The resulting extensive sequence datasets opened the possibility of carefully examining alternative sources of genomic variation, such as insertions and deletions (indels).
Since humans diverged from a common ancestor with chimpanzees approximately 5 million years ago [16], understanding genome differences between these two lineages is critically important for defining our own species. Several studies that compared the chimpanzee and human genomes have located and characterized sequence differences, including single-base-pair indels, monomeric and multi-base-pair extensions (repeats), indels with random DNA sequences, and transposon insertions [5]; more are currently under way. Initially, the sequence difference between humans and chimpanzees was estimated at 1% [7], and this number was later refined to 1.2% [8]. Several studies pointed out that the number of differences is much higher when indels are included in the comparison [19], and that the total divergence may be as high as 6.5% [19]. Removing repeats and low-complexity DNA reduces this estimate to 2.4% [19], which is still double the original estimates.
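To illustrate why including indels raises a divergence estimate, the following minimal sketch computes two figures from a gapped pairwise alignment: a substitution-only divergence over ungapped columns, and a total divergence that also counts every gapped base as a difference. The function name `divergence` and the toy sequences are illustrative assumptions; this is not the procedure used in the cited studies, which may count indel events or bases differently.

```python
def divergence(seq_a, seq_b):
    """Return (substitution_only, substitutions_plus_indel_bases).

    seq_a and seq_b are the two rows of one pairwise alignment,
    with '-' marking a gap. Hypothetical helper for illustration only.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")

    substitutions = 0  # mismatches in columns without gaps
    ungapped = 0       # columns where both sequences have a base
    gapped = 0         # columns where exactly one sequence has a gap

    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a == "-" and b == "-":
            continue  # column carries no information
        if a == "-" or b == "-":
            gapped += 1
        else:
            ungapped += 1
            if a != b:
                substitutions += 1

    if ungapped == 0:
        raise ValueError("alignment has no ungapped columns")

    substitution_only = substitutions / ungapped
    with_indels = (substitutions + gapped) / (ungapped + gapped)
    return substitution_only, with_indels


# Toy alignment (made-up sequences): one substitution and one 2-bp indel.
human = "ACGTACGTAC"
chimp = "ACGT--GTAT"
sub_only, total = divergence(human, chimp)
print(f"substitutions only: {sub_only:.3f}, with indels: {total:.3f}")
```

Counting each gapped base as one difference is only one possible convention; counting each indel as a single event would give a lower total.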
Indels, fragments that are missing from one sequence in comparisons between individuals or closely related species, are plentiful across genomes [9]. Only a small fraction of indels occur within coding sequences, but these may play a key role in primate evolution [19]. While most indels have no adaptive value, some are known to alter important functions, and many are known to be involved in disease phenotypes [11]. Although the human genome might contain as many as 1.6–2.5 million indel polymorphisms, efforts directed toward discovering this type of genomic variant remain far less intensive than those devoted to SNP discovery [10]. Many indel polymorphisms therefore remain to be discovered and classified in comparative genomic studies.
As the number of described indels grows, several mechanisms have been proposed to explain their origin. The origin of an individual indel seems to depend on its size and sequence context, as well as other factors [15]. For example, it appears that many recent short insertions in the human genome originated as tandem duplications, while smaller indels (<5 bp) were generated either by unequal crossing over or by replication slippage [23]. In addition, non-homologous end joining initiated by a double-strand break [15] has been proposed as a main mechanism of indel generation for a wide range of sizes [24].
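As a rough illustration of how a tandem-duplication origin might be recognized, the sketch below tests whether an inserted sequence is an exact copy of the sequence immediately adjacent to it. The function name `is_tandem_duplication` and the example sequences are hypothetical, and the check is deliberately simplistic: it ignores mismatches between the copies and the alternative placements of a gap that alignment programs can produce.

```python
def is_tandem_duplication(left_flank, insert, right_flank):
    """Return True if `insert` exactly copies the sequence directly next to it.

    left_flank / right_flank: sequence immediately upstream / downstream
    of the insertion point. Hypothetical helper for illustration only.
    """
    insert = insert.upper()
    if not insert:
        return False
    return (left_flank.upper().endswith(insert)
            or right_flank.upper().startswith(insert))


# Made-up example: the inserted ACCA repeats the last four upstream bases.
print(is_tandem_duplication("TTGACCA", "ACCA", "GGT"))
# Made-up counter-example: GGGG matches neither flank.
print(is_tandem_duplication("TTGACCA", "GGGG", "TTT"))
```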
In this study, we report on indels of 10 bp or larger that cannot be explained simply by short tandem repeats or other repetitive DNA. We obtained a set of indels by comparing homologous sections of human and chimpanzee chromosome 22 (following the orthologous numbering nomenclature [8]), and characterized them relative to their local, chromosomal, genomic, and gene-specific sequence contexts. We divided the data into three groups (referred to as "core classes") according to the presence, sequence identity, and relative locations of additional copies of the indel sequence in the neighboring (± 5 kb) region. We also considered the observed indels in their genomic context and further divided the data into three groups (referred to as "genome classes") based on the presence of copies locally on chromosome 22 or elsewhere in the human genome. The presence of indels in coding sequences of genes and in other genomic elements was compared with the random expectation using a tenfold chromosome-wide resampling approach. Finally, we examined predicted transcripts for impacts on peptide sequence to confirm genes in which an insertion or deletion can change the amino acid sequence or the alternative splice products in ways that differentiate the two species. Indels that affected genes by altering coding sequences and splice sites were further characterized. Gene impacts were assessed both computationally, by examining the resulting amino acid sequence, and by direct sequencing, with comparisons among five human populations and five closely related primate species.
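The comparison against random expectation can be sketched as a simple resampling procedure. The version below makes several assumptions that the text does not state: "tenfold" is read as ten independent rounds, each indel keeps its observed length and is placed uniformly at random along the chromosome, and features (for example coding exons) are sorted, non-overlapping, half-open intervals. The names `count_overlaps` and `resample_expectation`, and all coordinates in the example, are hypothetical.

```python
import random
from bisect import bisect_right


def count_overlaps(indels, features):
    """Count indels (start, end) that overlap at least one feature.

    `features` must be sorted, non-overlapping, half-open (start, end) intervals.
    """
    starts = [s for s, _ in features]
    hits = 0
    for start, end in indels:
        # Index of the last feature that begins at or before the indel start.
        i = bisect_right(starts, start) - 1
        if i >= 0 and features[i][1] > start:
            hits += 1  # indel start lies inside feature i
        elif i + 1 < len(features) and features[i + 1][0] < end:
            hits += 1  # indel extends into the next feature
    return hits


def resample_expectation(indel_lengths, chrom_length, features, rounds=10, seed=0):
    """Place indels at random `rounds` times; return (mean count, per-round counts)."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    counts = []
    for _ in range(rounds):
        placed = []
        for length in indel_lengths:
            start = rng.randrange(chrom_length - length + 1)
            placed.append((start, start + length))
        counts.append(count_overlaps(placed, features))
    return sum(counts) / rounds, counts


# Made-up coordinates: three observed indels and two coding intervals.
observed = [(5, 15), (20, 25), (30, 40)]
coding = [(10, 20), (35, 50)]
observed_hits = count_overlaps(observed, coding)
expected_hits, per_round = resample_expectation(
    [end - start for start, end in observed], chrom_length=1000, features=coding
)
print(observed_hits, expected_hits)
```

An observed count well below the resampled mean would suggest that indels are depleted in the feature class; a real analysis would also need to account for alignment gaps and unsequenced regions when choosing random positions.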