SNP number and distribution
We analysed the nucleotide sequences of 109 OR genes (102 genes and seven pseudogenes, as defined in the genome sequence [23]) selected from the entire OR repertoire of 872 genes and 222 pseudogenes [7]. These OR genes were selected to be representative of a large number of families (the five class I families and 15 of the 18 class II families), subfamilies and clusters (33 of 54) located on 20 chromosomes (Additional file 1). They were also selected as representative of genomic regions very rich in OR genes, as for cluster @40–44 on canine chromosome 18 (CFA18), or with a lower density of OR genes, as for cluster @3 on CFA15. We also studied five isolated OR genes. We determined the nucleotide sequences of PCR fragments amplified from DNA purified from a cohort of 48 dogs of six breeds: German Shepherd Dog (GSD), Belgian Malinois (BM), Labrador Retriever (LR), English Springer Spaniel (ESS), Greyhound (Grey) and Pekingese (Pek). We also analysed a subset of 27 OR genes in eight Boxers (Box).
Visual inspection of all sequencing traces obtained with the cohort of 48 dogs led to the identification of 710 SNP, corresponding to 549 transitions and 161 transversions. We also observed 17 short insertions/deletions (indels, 1 to 3 nt) and five longer indels of 6 to 74 nucleotides. As the occurrence of each indel probably corresponded to a single mutational event, these 732 mutations (SNP + indels) were combined for further analysis. Figure shows the distribution of SNP within the 109 OR genes. It shows that all but four of the OR genes are polymorphic, with one to 22 SNP per OR gene.
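The counts given above can be checked with a short sketch. The classification rule (purine-to-purine or pyrimidine-to-pyrimidine changes are transitions, all others are transversions) is the standard definition; the function name is illustrative and is not taken from the study.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def classify_substitution(ref: str, alt: str) -> str:
    """Return 'transition' or 'transversion' for a single-nucleotide change."""
    ref, alt = ref.upper(), alt.upper()
    valid = PURINES | PYRIMIDINES
    if ref == alt or ref not in valid or alt not in valid:
        raise ValueError("expected two different bases among A, C, G, T")
    same_class = {ref, alt} <= PURINES or {ref, alt} <= PYRIMIDINES
    return "transition" if same_class else "transversion"


# Counts reported in the text.
transitions, transversions = 549, 161
short_indels, long_indels = 17, 5

total_snp = transitions + transversions                    # 710
total_mutations = total_snp + short_indels + long_indels   # 732
ts_tv_ratio = transitions / transversions                  # about 3.41
```

The transition/transversion ratio of about 3.4 is derived here from the reported counts; it is not a figure quoted in the text.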
Distribution profile of the 732 SNP + indels.
When analysed at the breed level, the total number of SNP differed significantly between breeds (χ² test, P < 10⁻³), whereas their distribution did not (Wilcoxon-Mann-Whitney test) (Figure ). However, the numbers of OR genes without SNP differed markedly between breeds (χ² test, P < 0.05), with 24 and 21 OR genes with no SNP for German Shepherd Dog and Greyhound, respectively, 14 for Labrador Retriever and only 10 for each of the three other breeds. The set of OR genes with no SNP was either breed-specific or shared by only a few breeds, in different combinations (Table ).
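As an illustration of the comparison of OR genes without SNP, the sketch below computes a Pearson χ² statistic on a 6 × 2 table (genes without SNP versus genes with SNP, out of 109 per breed). The table layout is an assumption, since the text does not state how the test was set up; the critical value 11.070 is the standard upper 5% point of the χ² distribution with 5 degrees of freedom.

```python
def chi_square_statistic(table):
    """Pearson chi-squared statistic for an r x c table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (observed - expected) ** 2 / expected
    return statistic


N_GENES = 109  # OR genes analysed in each breed
genes_without_snp = {"GSD": 24, "Grey": 21, "LR": 14, "BM": 10, "ESS": 10, "Pek": 10}

# Assumed layout: one row per breed, columns = (no SNP, at least one SNP).
table = [[n, N_GENES - n] for n in genes_without_snp.values()]
statistic = chi_square_statistic(table)
degrees_of_freedom = (len(table) - 1) * (2 - 1)

CRITICAL_5_PERCENT_DF5 = 11.070  # upper 5% point, chi-squared with 5 df
significant_at_5_percent = statistic > CRITICAL_5_PERCENT_DF5
```

Under this assumed layout the statistic exceeds the 5% critical value, which is in line with the reported P < 0.05.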
Distribution of SNP within the 6 breeds.
OR genes with no SNP in one or several breeds.
At the whole-population level, most OR genes tended to be either weakly (such as CfOR2171 and CfOR08C09 with 0 or one SNP per breed) or highly (such as CfOR0007 with 18 or 19 SNP and CfOR0034 with 14 to 22 SNP depending on breed) polymorphic (see additional file 2). However, there were several notable exceptions, with some OR genes weakly polymorphic or not polymorphic in some breeds but markedly more polymorphic in others. This was the case for CfOR0527 (no SNP in Pekingese but seven or eight SNP in each of the other five breeds), CfOR0390 (six SNP in Greyhound, one SNP in Pekingese and none in the other breeds) and CfOR08A02 (10 SNP in Pekingese, six SNP in Belgian Malinois and no SNP in the other breeds; Table ).
We investigated the possible correlation between OR gene polymorphism and the organization of these OR genes into clusters of different sizes, by ranking the 109 OR genes according to SNP content. We selected the 22 OR genes with no more than two SNP and the 27 OR genes with 10 or more SNP and compared the sizes of the clusters harbouring these OR genes. As shown in Figures and , the least polymorphic OR genes were preferentially localised in small clusters (median cluster size 4.5 OR genes) and the highly polymorphic OR genes in large clusters (median cluster size 240 OR genes). A Mann-Whitney test showed this relationship to be significant (P < 10⁻³). In addition, the 109 OR genes were ranked according to cluster size and we selected the 20 OR genes located in clusters containing five or fewer OR genes and the 18 OR genes present in the largest cluster (containing 243 OR genes). Again, OR genes in small clusters tended to be less polymorphic than OR genes in large clusters (median SNP numbers of 2 and 8 for the smallest and largest clusters, respectively; Mann-Whitney test, P < 10⁻³) (Figures and ). Interestingly, the OR genes with the highest number of SNP tended to have paralogous genes with higher sequence homology (> 90%) than OR genes devoid of SNP or harbouring a small number of SNP.
Figure 3 Boxplot of cluster sizes (1, 2) and boxplot of SNP contents (3, 4). Boxplot 1 shows the cluster sizes of the 22 least polymorphic OR genes (≤ 2 SNP). This boxplot should be compared with boxplot 2, showing the cluster sizes of the 27 OR genes ...
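The comparison of cluster sizes between weakly and highly polymorphic OR genes rests on medians and a Mann-Whitney test. A minimal sketch of the U statistic is given below; the two samples are hypothetical cluster sizes chosen for illustration and are not the study data, and no P value is computed.

```python
from statistics import median


def mann_whitney_u(sample_a, sample_b):
    """U statistic of sample_a: each pair with a > b scores 1, each tie 0.5."""
    u = 0.0
    for a in sample_a:
        for b in sample_b:
            if a > b:
                u += 1.0
            elif a == b:
                u += 0.5
    return u


# Hypothetical cluster sizes (number of OR genes in the cluster of each gene).
weakly_polymorphic = [1, 2, 4, 5, 8, 30]
highly_polymorphic = [40, 120, 240, 243, 243, 243]

u_weak = mann_whitney_u(weakly_polymorphic, highly_polymorphic)
u_high = mann_whitney_u(highly_polymorphic, weakly_polymorphic)

median_weak = median(weakly_polymorphic)
median_high = median(highly_polymorphic)
```

A U value of zero for one sample means that every value in it is smaller than every value in the other sample, the most extreme separation possible; the two U values always sum to the product of the sample sizes.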
SNP minor allele frequency (MAF) ranged from 1% to 50% (see additional file 3). However, MAF within breeds may differ considerably from MAF across breeds, with some alleles absent in all but one breed, in which they could be the major allele (see, for example, SNP 78 and 189 in gene CfOR16H04 and SNP 530 in gene CfOR0135). Other examples are provided by SNP 294, 518 and 295 (of CfOR0297, CfOR5413 and CfOR10F04, respectively), for which the minor alleles at the whole-population level are the major alleles in one breed (Table ).
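The contrast between allele frequency within a breed and across the whole cohort can be illustrated with a small sketch. The alleles below are hypothetical; the only assumption taken from the study design is that each breed contributes 16 chromosomes (8 dogs).

```python
def allele_frequency(alleles, allele):
    """Frequency of `allele` among a list of observed alleles."""
    return alleles.count(allele) / len(alleles)


# Hypothetical SNP: allele T is found in one breed only, where it predominates.
alleles_by_breed = {
    "breed_1": ["T"] * 12 + ["C"] * 4,
    "breed_2": ["C"] * 16,
    "breed_3": ["C"] * 16,
    "breed_4": ["C"] * 16,
    "breed_5": ["C"] * 16,
    "breed_6": ["C"] * 16,
}

all_alleles = [a for alleles in alleles_by_breed.values() for a in alleles]

population_frequency_t = allele_frequency(all_alleles, "T")            # 12/96
breed_frequency_t = allele_frequency(alleles_by_breed["breed_1"], "T") # 12/16
```

In this toy case T is the minor allele of the cohort (12.5%) yet the major allele of the one breed carrying it (75%), which is the pattern described for the SNP listed above.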
Overrepresentation of minor alleles in specific breeds.
We found that 193 of the 732 SNP (26.4%) identified in this study were restricted to a single breed and that their breed distribution differed significantly (χ² test, P < 10⁻³), with 10 private SNP for German Shepherd Dog, 26 for Belgian Malinois, 47 for English Springer Spaniel, 18 for Greyhound, 8 for Labrador Retriever and 84 for Pekingese. Conversely, 199 SNP (27.2%) were common to all breeds, whereas 79 were common to two breeds and 50 were common to three breeds (Tables , and ).
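The classification of SNP as private or shared can be sketched as a simple tally over the set of breeds in which each SNP is observed. The toy SNP identifiers below are hypothetical; the per-breed private counts and the totals are those reported in the text.

```python
from collections import Counter


def sharing_profile(snp_to_breeds):
    """Count SNP by the number of breeds in which they are observed."""
    return Counter(len(breeds) for breeds in snp_to_breeds.values())


def private_counts(snp_to_breeds):
    """Count private SNP (observed in exactly one breed) per breed."""
    return Counter(
        next(iter(breeds))
        for breeds in snp_to_breeds.values()
        if len(breeds) == 1
    )


# Hypothetical input: SNP identifier -> breeds carrying both alleles.
toy = {
    "snp_a": {"Pek"},
    "snp_b": {"Pek"},
    "snp_c": {"GSD", "BM"},
    "snp_d": {"GSD", "BM", "LR", "ESS", "Grey", "Pek"},
}
toy_profile = sharing_profile(toy)
toy_private = private_counts(toy)

# Counts reported in the text.
reported_private = {"GSD": 10, "BM": 26, "ESS": 47, "Grey": 18, "LR": 8, "Pek": 84}
total_private = sum(reported_private.values())   # 193
private_percent = 100 * total_private / 732      # about 26.4
shared_by_all_percent = 100 * 199 / 732          # about 27.2
```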
SNP distribution within breeds.
Number of SNP shared by different pairs of breeds.
Number of SNP shared by different trios of breeds.
Assuming, as is most likely, that each SNP appeared once in the evolutionary history of the dog, it follows that the 199 SNP common to all breeds probably arose before the separation of the six breeds and that most of the private SNP arose following breed separation. Based on the same rationale, it could be hypothesised that SNP common to two or three breeds arose before the separation of these breeds. Although the number of SNP in common differed significantly between pairs of breeds (χ² test, P < 10⁻³), the use of HCLUST [31] to construct dendrograms did not result in any clusters matching breed history. This is probably because the number of SNP common to pairs of breeds with a MAF > 10% was too small.
Nucleotide polymorphism level reflects the number of differences between two sequences. It can be represented by N, the mean distance, expressed in nucleotides, between two SNP. OR genes are generally highly polymorphic, but the distribution of SNP is far from even (Figure ). CfOR0034, in which 22 SNP were detected, was the most polymorphic OR gene studied, with an N of 98 for the whole population, ranging from 89 for Pekingese to 293 for German Shepherd Dog (see additional file 2). At the other extreme, CfOR08C09 and CfOR0525 were the least polymorphic genes after the four genes with no SNP (CfOR16F03, CfOR0317, CfOR0166 and CfOR0154). CfOR08C09 has one SNP, detected only once, in one Pekingese. This would give a theoretical N value of 7920 for Pekingese and 47520 for the whole population. Another example is provided by CfOR0525, for which we found two SNP. Each of these two SNP was detected only once, in two different Belgian Malinois, and one of these two SNP was detected in three English Springer Spaniels and two Labrador Retrievers (data not shown). This gives N values of 3780, 2908 and 4050, respectively, for these three breeds (see additional file 2).
Figure 4 Variability in OR gene polymorphism level. Cumulative number of OR genes (y axis) plotted against N values (x axis). The graph shows that more than 50% of OR genes are highly polymorphic, with an N value even smaller than that for anonymous sequences ...
Calculation, at the whole-population level, of N for the 109 OR genes gave a mean value of 577. Comparison at the breed level indicated that the English Springer Spaniel was the most polymorphic breed, with an N value of 594, whereas the German Shepherd Dog was the least polymorphic breed, with an N value of 926 (χ² test, P < 10⁻³) (Table ).
Mean N values for OR genes and other sequences.
Only 27 OR genes were analysed in Boxer, and we obtained an N value of 1728. We therefore wondered whether the large differences in N values between the other six breeds and Boxer were due to the 27 OR genes selected for study in Boxer or whether they reflected a truly lower level of polymorphism in Boxer. However, the N values for these same 27 OR genes calculated for each of the six breeds were not statistically different (Mann-Whitney test) from those calculated for the whole set of 109 OR genes (Table ). This last finding ruled out the possibility of a bias due to the sampling of this subset of OR genes and indicated that the level of polymorphism really was lower for Boxer OR genes. This finding is relevant to the choice of the Boxer Tasha DNA sample (less polymorphic than the other DNA samples tested) for determination of the dog genome sequence [23].
We compared the level of OR gene polymorphism with that of non-coding regions and coding regions devoid of OR, by sequencing a series of exons, introns (only regions close to splice sites) and intergenic sequences with no known coding function. We obtained N values of 8631 for exons, 1992 for introns and 732 for anonymous intergenic sequences (Table ). These values are consistent with previous reports [23]. A comparison of these values indicates that the coding regions of OR genes are more polymorphic than most exon sequences and more polymorphic than the non-coding DNA (χ² test, P < 10⁻³).
In a similar study, Sutter et al. sequenced five non-coding regions of the dog genome in a cohort of 95 dogs of five breeds and detected 201 SNP and 19 indels. These results, indicating a lower level of genetic diversity than that observed in OR genes, confirm the high level of genetic diversity of the OR coding exons. The isolated OR genes and genes belonging to small clusters analysed in this study were overrepresented among the 109 OR genes relative to their presence in the whole repertoire. As these OR genes tended to be less polymorphic than the OR genes from large clusters, their presence increases the value of N, and the actual difference between OR genes and intergenic sequences should thus be even greater.
Ka/Ks and protein sequence polymorphism
We noted that 152 of the 732 SNP identified within the 109 OR genes led to pseudoalleles (alleles with an interrupted coding frame). Theoretical translation of intact OR genes showed that 307 of the remaining 580 SNP were silent mutations. Of the 273 missense mutations (47% of the total), 130 would result in the incorporation of an amino acid of a different chemical group (Table ).
Distribution of the 580 SNP (307 silent and 273 missense) between the extracellular (EC), transmembrane (TM) and intracellular (IC) domains.
Calculation of the Ka/Ks ratio, where Ka is the number of non-synonymous substitutions (missense mutations) per non-synonymous site and Ks is the number of synonymous substitutions (silent mutations) per synonymous site between two closely related species, is the traditional method of assessing the strength of selection affecting proteins during evolution. In a recent study, it was shown that the A/S ratio calculated from the SNP content of the human genome is equivalent to the Ka/Ks ratio for the assessment of selective pressure [32].
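The definition above translates directly into a ratio of two per-site rates. The sketch below is a minimal illustration with no correction for multiple substitutions at the same site; the site and substitution counts in the example are hypothetical and are not taken from the study.

```python
def ka_ks(nonsyn_changes, nonsyn_sites, syn_changes, syn_sites):
    """Ka/Ks from raw counts: (non-synonymous changes per non-synonymous site)
    divided by (synonymous changes per synonymous site)."""
    if syn_changes == 0:
        raise ValueError("Ks is zero; the ratio is undefined")
    ka = nonsyn_changes / nonsyn_sites
    ks = syn_changes / syn_sites
    return ka / ks


# Hypothetical gene: 700 non-synonymous sites, 300 synonymous sites,
# 7 missense and 8 silent SNP.
example_ratio = ka_ks(nonsyn_changes=7, nonsyn_sites=700,
                      syn_changes=8, syn_sites=300)
```

A ratio well below 1, as in this toy case, indicates that non-synonymous changes are retained less often per site than synonymous ones, that is, purifying selection; values approaching 1 indicate relaxed constraint.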
Using the SNP detected in this study, a Ka/Ks value of 0.37 was obtained for the 95 OR genes analysed here (109 minus pseudogenes and non-polymorphic genes). Similar values were obtained at the breed level (from 0.31 for Labrador Retriever to 0.37 for Pekingese). A Ka/Ks value of 0.098 has been reported for a large set (n = 13,816) of canine genes [23]. Comparison of these two values (0.37 and 0.098) indicates weaker selective constraint on OR genes than on most other genes, resulting in greater diversification, as already observed for a small subset of human and chimpanzee OR genes and for the gene encoding the human bitter taste receptor [33]. As isolated OR genes tended to be less polymorphic than OR genes within large clusters, we wondered whether the Ka/Ks ratio might differ with cluster size. A Pearson correlation test on the 95 OR genes analysed (all OR genes minus the pseudogenes and genes devoid of SNP) gave a coefficient of -0.05, indicating that this was not the case. Similarly, the Ka/Ks values of the 11 OR genes within small clusters (≤ 5 OR genes) and those of the 15 OR genes present in the largest cluster (243 OR genes) were not significantly different (Student's t-test, P = 0.78).
We also analysed the distribution of SNP within codon positions and found that 161, 130 and 289 of the 580 SNP were located at the first, second and third codon positions, respectively. This distribution, with 50% of mutations affecting one of the first two positions, at which nearly all mutations induce an amino-acid change, and 50% affecting the third position, at which half of all mutations induce an amino-acid change, is consistent with many mutations (75%) randomly affecting the DNA sequence being retained and not counter-selected.
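The assignment of a SNP to a codon position, and the shares quoted above, can be reproduced with the following sketch. It assumes that SNP coordinates are 1-based and counted from the first nucleotide of the start codon; this numbering convention is an assumption and should be checked against the additional files.

```python
def codon_position(orf_position: int) -> int:
    """Position within the codon (1, 2 or 3) for a 1-based ORF coordinate."""
    if orf_position < 1:
        raise ValueError("ORF coordinates are 1-based")
    return (orf_position - 1) % 3 + 1


# SNP counts per codon position, as reported in the text.
snp_by_codon_position = {1: 161, 2: 130, 3: 289}

total = sum(snp_by_codon_position.values())  # 580
share_first_two = (snp_by_codon_position[1] + snp_by_codon_position[2]) / total
share_third = snp_by_codon_position[3] / total
```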
SNP were found throughout the OR gene sequences, resulting in amino-acid substitutions evenly distributed along the length of corresponding proteins, in the transmembrane, inner and outer parts of the receptors (Table ).
However, if we take into account the respective sizes of the various domains, the number of missense mutations is significantly larger in intracellular (IC) than in extracellular (EC) and transmembrane (TM) domains (χ² test, P < 10⁻³), whereas the number of silent mutations does not appear to differ significantly between domains (χ² test, P > 0.7). These results were obtained both for the whole data set and when OR belonging to small clusters (≤ 5 OR genes) and OR belonging to the large cluster (243 OR genes) were considered independently. This indicates the existence of stronger selective pressure to maintain the structural conformation of the parts of the OR related to ligand binding (TM3, TM5 and EC3 [9]) than to maintain the structure of the part of the protein involved in signal transduction and processing. This finding, which conflicts with those of Buck and Axel [1], should be interpreted taking into account the fact that we compared the sequences of the same gene in different breeds, whereas Buck and Axel [1] compared paralogous OR genes from a single rat and thus compared OR with different binding properties. It would thus be of interest to determine whether the amino-acid changes within IC domains affect the efficiency of the transduction pathway and, in turn, odorant sensing properties. The distributions of missense and silent mutations for the 136 SNP present in only one breed (private SNP) and for the 168 SNP shared by all six breeds indicate a significant bias, with missense mutations more frequent among private SNP (χ² test, P < 10⁻²), suggesting selection pressure related to breeding practices.
We used the CORP program to determine the effects, if any, of the 273 missense mutations [35]. Of the 83 OR genes with missense mutation(s), 44 conserved the same ΨL value, whereas changes < 0.3 were observed for 20 OR and changes > 0.3 for 19 OR. Variations of this type were also associated with higher or lower functionality as defined by the CORP program. As concerns a putative decrease in functionality, only 14 of the 273 SNP leading to an amino-acid change affect the 22 most highly conserved positions [9]. In addition, five missense mutations involved the arginine of the MAYDRY conserved motif.
Mammalian OR repertoires contain a large number of pseudogenes: up to 60% for the human repertoire and around 20% for the rodent and dog OR repertoires [4]. These pseudogenes are not retrogenes and have resulted from nonsense mutations or short indels occurring during evolution. Of the 109 OR genes analysed in this study, seven were strictly pseudogenes, 86 were intact in all breeds and 16 OR genes had both intact and interrupted ORF (pseudoallele). In each breed, a subset of 10 to 13 of these 16 OR was identified as having one or more pseudoalleles (Table ). The frequency of SNP closing the frame varies across breeds (Table ). For example, CfOR08G02 has an SNP 360 (360 indicates the nucleotide position) that closes the frame. It is present in all six breeds, but at very different frequencies: 0.812 in German Shepherd Dog, 0.375 in Belgian Malinois, 0.125 in English Springer Spaniel, 0.188 in Greyhound, 0.438 in Labrador Retriever and 0.062 in Pekingese. Other examples, such as SNP 362 of CfOR14A11 or SNP 1 of CfOR12F06, are provided in Table . More extreme distributions exist, with SNP closing the frame in one or more breeds, but not all, such as SNP 84 of CfOR0821 or SNP 49 of CfOR0401, which close the frame only in Pekingese and English Springer Spaniel, respectively. Genotype analysis (data not shown) indicates that the distribution within breeds is not homogeneous, with dogs having zero, one or two alleles with an interrupted ORF. These results indicate that the status of a gene as active or inactive (pseudogene) does not necessarily apply to the whole dog population, depending instead upon breed or even the individual dog. These observations suggest that pseudogene formation is still an active process, as previously reported [18], related to the acceptance of a large proportion of mutational events and to the probable continuing diversification of the OR repertoire, the risk attached to deleterious mutations being counter-balanced by the highly combinatory nature of the OR repertoire [37], partly accounted for by gene redundancy.
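The pseudoallele frequencies quoted for SNP 360 of CfOR08G02 can be read as allele counts. The sketch below assumes 16 chromosomes per breed (48 dogs in six breeds, that is, 8 diploid dogs each); the study does not state the per-breed sample size explicitly in this section, so the implied counts are an inference and not reported data.

```python
CHROMOSOMES_PER_BREED = 16  # assumption: 8 dogs per breed x 2 alleles

# Pseudoallele frequencies reported for SNP 360 of CfOR08G02.
reported_paf = {
    "GSD": 0.812, "BM": 0.375, "ESS": 0.125,
    "Grey": 0.188, "LR": 0.438, "Pek": 0.062,
}


def implied_allele_count(frequency, n_chromosomes=CHROMOSOMES_PER_BREED):
    """Nearest whole number of chromosomes carrying the allele."""
    return round(frequency * n_chromosomes)


implied_counts = {breed: implied_allele_count(f) for breed, f in reported_paf.items()}

# Largest gap between a reported frequency and count / 16.
max_rounding_gap = max(
    abs(implied_counts[breed] / CHROMOSOMES_PER_BREED - f)
    for breed, f in reported_paf.items()
)
```

Every reported frequency lies within rounding distance of a multiple of 1/16, which is consistent with the assumed sample size and gives a range from 1 interrupted allele in Pekingese to 13 in German Shepherd Dog.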
Pseudoallele frequency (PAF).
Haplotype structures and distribution
We used the Fast Phase algorithm [27] to identify a total of 809 haplotype structures for all OR genes with more than two SNP (see additional file 4). We found that the mean number of haplotypes per gene and per breed varied between 2.83 for German Shepherd Dog and 3.73 for English Springer Spaniel. Not surprisingly, the number of haplotypes per gene increased with the number of SNP. However, this relationship is not simple and many exceptions were noted. We plotted the haplotype/SNP number ratio against the number of SNP (Figure ). We calculated the Manhattan distances between the points and generated four groups of OR genes by agglomerative hierarchical clustering, with the two extreme groups having 11 and 5 OR genes. As examples of these two extreme groups, CfOR12A07 has 4 SNP and 11 haplotypes and DOPRH07 has 21 SNP and 4 haplotypes (see additional file 4).
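The two example genes can be placed on the plot described above with a few lines of code. The sketch assumes that each point has coordinates (number of SNP, haplotype/SNP ratio) and that distances are computed on unscaled values; neither detail is stated in the text, and the clustering step itself is not reproduced here.

```python
def haplotype_snp_ratio(n_haplotypes, n_snp):
    """Number of haplotypes per SNP for one OR gene."""
    return n_haplotypes / n_snp


def manhattan(p, q):
    """Manhattan (city-block) distance between two points."""
    return sum(abs(a - b) for a, b in zip(p, q))


# (SNP, haplotypes) for the two example genes named in the text.
genes = {"CfOR12A07": (4, 11), "DOPRH07": (21, 4)}

# Assumed coordinates: x = number of SNP, y = haplotype/SNP ratio.
points = {
    name: (n_snp, haplotype_snp_ratio(n_hap, n_snp))
    for name, (n_snp, n_hap) in genes.items()
}

distance = manhattan(points["CfOR12A07"], points["DOPRH07"])
```

With ratios of 2.75 and about 0.19, the two genes sit at opposite ends of the plot, which is why they fall into the two extreme groups.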
Figure 5 Relationship between SNP and haplotype number. Distances between points were calculated with R software (maximum distances) and used to cluster OR genes. With k = 4, a group of 5 OR genes (in light blue) with a large number of SNP but a small number ...
The existence of the two extreme groups (Figure ) suggests two different evolutionary processes. However, comparisons of gene status (family, subfamily, CFA position, cluster position for OR genes belonging to these two extreme groups) identified no specific feature.
As pointed out above, most of the SNP common to all six breeds had different MAF. Not surprisingly, this leads to very different haplotype patterns in different breeds, with some breed-specific haplotypes, such as the GCAGAGGTAAT haplotype (CfOR5413), which was found in 11 of the 16 Pekingese haplotypes but was absent from the other breeds (see additional file 4).
In total, we identified 332 breed-specific haplotypes (41%). Many (205) were found only once, but some (38) accounted for 25% or more of the 16 possibilities per OR gene per breed and might even be the most frequent haplotype in the breed concerned (Table ). The combination of a small number of haplotypes may result, for each breed, in a haplotype signature. This signature could be used to certify that a given animal does or does not belong to a specific breed, based on the analyses of limited numbers of OR genes. For example, the haplotype structure of CfOR0050 and CfOR16H04, deduced from the analysis of 11 SNP, would be sufficient to identify a dog as a German Shepherd Dog.
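The identification of breed-specific haplotypes can be sketched as follows. The GCAGAGGTAAT haplotype and its count of 11 of 16 in Pekingese come from the text; the second haplotype string and the German Shepherd Dog data are hypothetical and serve only to make the example runnable.

```python
from collections import Counter, defaultdict


def breed_specific_haplotypes(haplotypes_by_breed):
    """Map each haplotype seen in exactly one breed to (breed, count)."""
    breeds_by_haplotype = defaultdict(dict)
    for breed, haplotypes in haplotypes_by_breed.items():
        for haplotype, count in Counter(haplotypes).items():
            breeds_by_haplotype[haplotype][breed] = count
    return {
        haplotype: next(iter(counts.items()))
        for haplotype, counts in breeds_by_haplotype.items()
        if len(counts) == 1
    }


# 16 phased haplotypes per breed; "ACAGAGGTAAC" is a made-up shared haplotype.
toy = {
    "Pek": ["GCAGAGGTAAT"] * 11 + ["ACAGAGGTAAC"] * 5,
    "GSD": ["ACAGAGGTAAC"] * 16,
}

specific = breed_specific_haplotypes(toy)

# Breed-specific haplotypes accounting for at least 25% of the 16 chromosomes.
frequent = {h: bc for h, bc in specific.items() if bc[1] / 16 >= 0.25}

breed_specific_percent = 100 * 332 / 809  # share reported in the text, about 41
```

A breed signature of the kind proposed above would combine several such frequent, breed-specific haplotypes across a small number of OR genes.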
Number of breed-specific haplotypes and number of times represented.
Linkage disequilibrium (LD)
Linkage disequilibrium indicates an association between two polymorphic markers, for which pairs of alleles are inherited together. Previous studies have shown that dogs display higher levels of LD than humans. However, LD has also been shown to be heterogeneous, with alternating long and short genomic regions of LD [23]. This pattern of alternating long and short LD regions, which differs between breeds, has been attributed to the history of the dog population, which has been characterised by two bottlenecks and expansion periods [23]. We investigated the evolution of the OR gene repertoire by calculating LD both within and between OR genes.
LD within OR genes
All pairs of SNP (MAF > 0.05) within each OR were used to calculate the mean r² per breed, which ranged from 0.52 for Pekingese to 0.70 for German Shepherd Dog, with a mean of 0.33 for the whole population (Table ). These values indicate (1) that the extent of LD for OR genes is one tenth the mean extent of LD previously reported [23], and (2) that the lower r² value (0.33) obtained for the whole population than for individual breeds is consistent with greater homogeneity within breeds. This low LD value indicates that SNP alleles within individual OR genes are not inherited as a block and suggests an ongoing gene conversion process potentially generating many OR genes with higher levels of polymorphism than the bulk DNA [39].
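For reference, r² between two biallelic SNP can be computed from phased haplotypes with the standard formula r² = D² / (pA(1 − pA) pB(1 − pB)), where D = pAB − pA pB. The sketch below is a generic illustration on hypothetical haplotypes; it is not the software used in the study.

```python
def r_squared(haplotypes, i, j):
    """r^2 between biallelic sites i and j of a list of phased haplotypes."""
    n = len(haplotypes)
    # Use the alleles of the first haplotype as the reference alleles.
    ref_i, ref_j = haplotypes[0][i], haplotypes[0][j]
    p_i = sum(h[i] == ref_i for h in haplotypes) / n
    p_j = sum(h[j] == ref_j for h in haplotypes) / n
    p_ij = sum(h[i] == ref_i and h[j] == ref_j for h in haplotypes) / n
    d = p_ij - p_i * p_j
    denominator = p_i * (1 - p_i) * p_j * (1 - p_j)
    if denominator == 0:
        raise ValueError("both sites must be polymorphic")
    return d * d / denominator


# Hypothetical two-SNP haplotypes (16 chromosomes each).
fully_linked = ["AC"] * 8 + ["GT"] * 8
independent = ["AC", "AT", "GC", "GT"] * 4

r2_linked = r_squared(fully_linked, 0, 1)
r2_independent = r_squared(independent, 0, 1)
```

An r² of 1 means that the allele at one site fully predicts the allele at the other; pooling breeds with different allele frequencies tends to lower r², as observed for the whole population.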
LD within OR clusters
A number of the sequenced OR genes corresponded to several clusters between 104 kb and 182 kb in size (see the cluster descriptions in additional file 5). We first retrieved SNP with a MAF > 0.2 and calculated D' values for each pair of SNP. The percentage of SNP pairs with a D' value > 0.8 varied from 38 to 66% for the five different clusters analysed within the whole population (Table ). Contrasting results were obtained for analyses within breeds. For example, Belgian Malinois and Greyhound, in cluster 03, were weakly polymorphic and no LD value was calculated, whereas, for German Shepherd Dog and Labrador Retriever, 100% of SNP pairs had a D' value > 0.8 and, in Pekingese, only 58% of SNP pairs had a D' value > 0.8. These results indicate that the constraints imposed on OR cluster evolution are not identically distributed in the different breeds. The LD value calculated per breed was also higher than that calculated for the whole cohort (Table ). This result contrasts with the findings of Sutter et al., showing that the LD value calculated at the whole-population level for regions devoid of OR genes was similar to that obtained for individual breeds. However, our result is consistent with that reported by Menashe et al. for the analysis of a human OR cluster in different populations.
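The percentage of SNP pairs with D' > 0.8 can be obtained as sketched below, using the standard definition D' = |D| / Dmax, where Dmax = min(pA(1 − pB), (1 − pA)pB) when D ≥ 0 and min(pA pB, (1 − pA)(1 − pB)) otherwise. The haplotypes are hypothetical, and the MAF filter applied in the study is omitted for brevity.

```python
from itertools import combinations


def d_prime(haplotypes, i, j):
    """|D'| between biallelic sites i and j of a list of phased haplotypes."""
    n = len(haplotypes)
    ref_i, ref_j = haplotypes[0][i], haplotypes[0][j]
    p_i = sum(h[i] == ref_i for h in haplotypes) / n
    p_j = sum(h[j] == ref_j for h in haplotypes) / n
    p_ij = sum(h[i] == ref_i and h[j] == ref_j for h in haplotypes) / n
    d = p_ij - p_i * p_j
    if d >= 0:
        d_max = min(p_i * (1 - p_j), (1 - p_i) * p_j)
    else:
        d_max = min(p_i * p_j, (1 - p_i) * (1 - p_j))
    if d_max == 0:
        raise ValueError("both sites must be polymorphic")
    return abs(d) / d_max


def share_of_pairs_above(haplotypes, sites, threshold=0.8):
    """Fraction of site pairs whose |D'| exceeds the threshold."""
    pairs = list(combinations(sites, 2))
    high = sum(d_prime(haplotypes, i, j) > threshold for i, j in pairs)
    return high / len(pairs)


# Hypothetical three-SNP haplotypes: sites 0 and 1 are fully linked,
# site 2 is independent of both.
toy = ["ACG"] * 6 + ["GTG"] * 6 + ["ACT"] * 2 + ["GTT"] * 2

share = share_of_pairs_above(toy, sites=[0, 1, 2])
```

Applied per breed and per cluster, this kind of tally yields percentages comparable to those summarised above, although the exact software and settings used in the study are not specified here.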
Percentage of SNP pairs with a D' value > 0.8.