Sequence variants that affect phenotypes among individuals were long believed to consist mostly of small differences, such as single nucleotide polymorphisms (SNPs) [1-4]. However, recent studies comparing two or more genomes within a species have also commonly observed gene presence/absence (P/A) variation. Since Grant et al. found P/A polymorphisms in the RPM1 gene in Arabidopsis [5], an increasing number of P/A genes have been reported among disease resistance genes in this model species [6,7] and in other land plants [8-10]. This phenomenon has also been described in the human genome [11-13], the telomeric region in Drosophila [14] and bacterial genomes [15], which suggests that P/A polymorphisms have unique roles in species differentiation. Additionally, several human diseases have been associated with gene insertions or deletions [16,17], and in plants there is evidence that P/A genes are involved in gene expression [18] and in noncollinearity in heterosis [19]. These examples indicate the importance of P/A genes in the evolutionary history of various species.
A P/A gene is commonly defined as a gene that is present in some individuals but absent in others within a species at a particular locus, although definitions differ in the literature [6,10]. Under the narrow definition, a P/A gene is one that exists in one individual but not in another on a genome-wide scale. For example, it was reported in maize that 20% of genome segments (~10,000 genes or gene fragments) are not shared between inbred lines B73 and Mo17 [8]. Yu et al. found that 2.2% and 3.3% of rice indica and japonica genes, respectively, are unique to the subspecies [20], while Ding et al. found that 5.2% of genes show P/A polymorphisms between Nipponbare and 93-11 [10]. However, a gene that is localized to a genomic position and denoted as a P/A gene at that locus may still have a paralog at a different locus. Using a broad definition, an additional 4.7% of genes were classified as P/A genes among rice genomes [10]. Our study also uses the broad definition of a P/A gene: a gene found at a particular locus in only some of the genomes compared.
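As a toy illustration of the two definitions above (not the detection pipeline used in this study), the following Python sketch classifies genes under the narrow and the broad definition. The input layout, a mapping from accession name to the set of (gene family, locus) copies that the accession carries, and the function names `is_pa_narrow` and `is_pa_broad` are assumptions made for this example only.

```python
from typing import Dict, Set, Tuple

# Hypothetical input layout: accession name -> set of (gene_family, locus) copies.
Copies = Dict[str, Set[Tuple[str, str]]]


def is_pa_broad(copies: Copies, family: str, locus: str) -> bool:
    """Broad (locus-level) definition: the copy at this locus is present
    in some accessions but not in all, regardless of paralogs elsewhere."""
    present = [(family, locus) in genes for genes in copies.values()]
    return any(present) and not all(present)


def is_pa_narrow(copies: Copies, family: str) -> bool:
    """Narrow (genome-wide) definition: at least one accession carries the
    gene somewhere in its genome and at least one carries no copy at all."""
    present = [any(f == family for f, _ in genes) for genes in copies.values()]
    return any(present) and not all(present)


# Made-up toy data: accession "A" has two copies of family "R1",
# accession "B" retains only the paralog on chr5 and lacks "G2" entirely.
toy: Copies = {
    "A": {("R1", "chr3:100"), ("R1", "chr5:200"), ("G2", "chr1:50")},
    "B": {("R1", "chr5:200")},
}
```

In this toy data set, the `R1` copy at `chr3:100` is a P/A gene only under the broad definition, because accession `B` still carries a paralog at `chr5:200`, whereas `G2` is a P/A gene under both definitions.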
Most land plants have evolved through whole genome duplication and subsequent gene loss [21]. Such extensive rearrangement events can result in a high proportion of P/A genes in plants. Transposable elements (TEs) are dominant factors inducing intraspecies diversity in maize [8]. Large duplications can be another source of genetic variation [22]. In Arabidopsis, unequal and illegitimate recombination also play an important role in triggering large-scale indels [23]. The Arabidopsis genome is extremely redundant owing to segmental duplications and tandem arrays [24]. These features provide ample opportunity for unequal crossing over to generate P/A genes. Balancing selection is thought to be one of the mechanisms maintaining P/A polymorphisms, at least for some disease resistance P/A genes [6,7,25]. However, despite the large number of P/A polymorphisms detected, the mechanisms by which P/A genes are generated and maintained are complicated and remain unclear.
Although P/A polymorphisms have been reported in several species [6,23], there is still no clear estimate of the number, proportion and variation pattern of P/A genes in any particular species, because such studies require a large number of fully sequenced individual genomes. Recently, 80 re-sequenced Arabidopsis genomes were released [26,27], providing a unique opportunity to systematically study the characteristics of P/A genes. By analyzing these data, we identified a remarkable number of P/A genes and estimated their number and frequency distribution across worldwide Arabidopsis accessions. We also used this information to investigate how P/A gene patterns vary among accessions and to describe their preferred locations on chromosomes. Finally, we analyzed the relationship between diversity and frequency of P/A genes to explore the natural selection pressure and evolutionary forces acting on P/A genes in Arabidopsis populations, as well as the mechanism of P/A gene generation.