Here, using a custom tiling oligonucleotide array CGH approach, we report the first CNV survey of the pig genome in twelve unrelated healthy boars that are founders of a large pig family. It should be stressed that only four chromosomes, not the whole genome, were screened. Both gains and losses of different lengths were discovered on parts of chromosomes 4, 7, 14, and 17. Owing to the tiling design of the array, we were able to identify 37 frequently occurring loci of copy number variation.
Natural large-scale divergence in genome size between animals of the same breed amounted to at least 164.1 kb, showing that a substantial portion of the pig genome may vary in copy number. In comparison with CNV studies in the “finished” human and mouse genomes [17], [22], our study found an order of magnitude less genome size divergence. This is not surprising, since the pig assembly is currently only in draft form and covers less sequence.
With a detection sensitivity ranging from about 2 kb (median probe spacing × 5 probes) to 248.471 kb (the length of the largest contig in the Sscrofa 6 assembly), at least 0.18% of the mapped pig chromosomes are tolerant to copy number variation.
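The arithmetic behind these figures can be sketched as follows. This is a minimal illustration, not the pipeline used in this study: the function names are hypothetical, the median probe spacing of 400 bp is only what "about 2 kb" over 5 probes implies, and the screened length of 91 Mb is a placeholder chosen for the example, since the exact mapped length is not stated here. The sketch also assumes that the percentage is simply variable sequence divided by screened sequence.

```python
def min_detectable_size(median_spacing_bp, min_probes):
    """Smallest segment that can be called, given a minimum number of
    consecutive probes required to support a call."""
    return median_spacing_bp * min_probes


def variable_fraction(variable_bp, screened_bp):
    """Fraction of the screened sequence that lies in copy number
    variable regions."""
    if screened_bp <= 0:
        raise ValueError("screened_bp must be positive")
    return variable_bp / screened_bp


# Assumed values (see the text above): 400 bp spacing, 5 probes per call.
lower_bound_bp = min_detectable_size(400, 5)

# 164.1 kb of variable sequence over a hypothetical 91 Mb screened.
fraction = variable_fraction(164_100, 91_000_000)
print(f"lower bound: {lower_bound_bp} bp, variable fraction: {fraction:.2%}")
```

Under these assumptions the lower bound is 2 kb and the variable fraction is close to the reported 0.18%; other spacing or coverage values would shift both numbers accordingly.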
Regarding functional sequence content, twelve pig UniGene sequences and one RefSeq gene were found to be putatively under the influence of the CNVRs. The RefSeq gene is related to sensory perception and belongs to a large, rapidly evolving gene family in which many genes are overlapped by CNVs in other mammalian genomes [16]–[17], [19], [21]–[22]. This may indicate that the gene family is conserved by natural selection in mammalian species or, alternatively, that selective pressure on copy number variants is relatively relaxed for these genes.
To confirm the CNVRs found with the array approach, RT-PCR was carried out on a subset of CNVRs, of which 50% were validated. Although this validation rate seems low, it should be noted that RT-PCR is not trivial on a highly error-prone preliminary genome assembly. Many factors could account for the discrepancy, as explained thoroughly elsewhere [63]: (1) the breakpoints of the copy number variable regions may be estimated incorrectly, so that primers are designed upstream or downstream of the true CNVR boundaries; (2) CNVRs have a lower probe density than usual because some regions surrounding the NimbleGen probes have a high repeat content, which may disturb the PCR reaction; (3) the animals may carry SNPs and small indels in the CNVRs relative to the reference genome assembly, which may compromise the RT-PCR reaction but not the CGH hybridization, or at least not as seriously [62], since the RT-PCR primers are shorter and thus less robust than the CGH probes. The source of the disagreement between RT-PCR and array CGH awaits further research.
Further validation was done using 7k SNPs ascertained in-house for the pigs queried in this study by mapping their surrounding sequences to the Sus scrofa 6 assembly (unpublished data). We tested whether SNPs in close proximity to the CNVRs validated by RT-PCR were informative about the presence of copy number variants in those regions (see Methods). For three of the CNVRs, the nearby SNP alleles clustered into two groups: animals with the CNVR carried one set of alleles, while the others carried a different set. Probably because of the low SNP density, the SNPs were uninformative regarding the status of the other putative copy number variable regions.
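The kind of grouping described here can be illustrated with a small check on a single SNP: do animals carrying the CNV and animals without it show non-overlapping genotypes? The sketch below is an assumption-laden illustration, not the procedure described in Methods; the function name, the animal identifiers, and the genotype encoding are all hypothetical.

```python
def genotypes_separate(genotypes, carriers):
    """Return True if CNV carriers and non-carriers share no genotype
    at this SNP (and both groups are non-empty).

    genotypes: dict mapping animal id -> genotype string, e.g. "AA"
    carriers:  set of animal ids called as carrying the CNV
    """
    carrier_gt = {g for animal, g in genotypes.items() if animal in carriers}
    other_gt = {g for animal, g in genotypes.items() if animal not in carriers}
    return bool(carrier_gt) and bool(other_gt) and carrier_gt.isdisjoint(other_gt)


# Hypothetical genotypes for four boars at one SNP near a CNVR.
snp = {"boar1": "AA", "boar2": "AA", "boar3": "GG", "boar4": "AG"}
print(genotypes_separate(snp, carriers={"boar1", "boar2"}))  # carriers all "AA"
print(genotypes_separate(snp, carriers={"boar1", "boar3"}))  # "AA" in both groups
```

With only twelve animals, such a check is descriptive rather than a statistical test, and a SNP far from the CNVR or in weak linkage with it would be uninformative.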
Since our analytical pipeline for measuring the pig CNV landscape was developed to minimize the detection of somatic CNVs and false positives, and since the preliminary pig assembly contains large amounts of unfinished sequence and incorrectly mapped regions, our results clearly underestimate the total number of CNVs in the sequences covered. For example, when copy number variants are allowed to be called in only one animal, the CNVR estimate increases from 37 to 165 (unpublished data).
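The effect of the per-animal threshold on the CNVR count can be sketched as follows. This is not the calling pipeline used in this study; it is a generic interval-merging illustration in which overlapping calls from different animals are collapsed into one region and regions supported by too few animals are dropped. The `Call` record, the function name, and the default threshold of two animals are assumptions made for the example.

```python
from collections import namedtuple

# Hypothetical record for one CNV call in one animal.
Call = namedtuple("Call", "animal chrom start end")


def merge_calls(calls, min_animals=2):
    """Merge overlapping calls on the same chromosome into regions and
    keep only regions supported by at least `min_animals` animals."""
    regions = []
    current = None  # [chrom, start, end, set of animals]
    for c in sorted(calls, key=lambda c: (c.chrom, c.start, c.end)):
        if current and c.chrom == current[0] and c.start <= current[2]:
            # Overlaps the open region: extend it and record the animal.
            current[2] = max(current[2], c.end)
            current[3].add(c.animal)
        else:
            if current:
                regions.append(current)
            current = [c.chrom, c.start, c.end, {c.animal}]
    if current:
        regions.append(current)
    return [
        (chrom, start, end, sorted(animals))
        for chrom, start, end, animals in regions
        if len(animals) >= min_animals
    ]


calls = [
    Call("A", "4", 100, 200),
    Call("B", "4", 150, 300),
    Call("A", "7", 10, 20),  # seen in a single animal only
]
print(len(merge_calls(calls, min_animals=2)))  # stricter threshold
print(len(merge_calls(calls, min_animals=1)))  # singletons allowed
```

Lowering `min_animals` to 1 keeps regions seen in a single animal, which raises the count at the cost of admitting more potential false positives and somatic events.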
It is also important to note that the sequences within a contig might be incorrectly assembled. Consequently, a CNVR detected at a certain position and in a certain orientation within a contig might in reality have a different position and orientation within that contig. This could affect the performance of the calling algorithms. Future pig genome assemblies will shed light on this matter.
Under the hypothesis that hundreds or perhaps thousands of CNVs exist in the pig genome, this study is an early step toward a more complete understanding of copy number variation within the pig species, and more studies are needed to fully understand the extent and functional roles of CNVs. Integrating previously gathered QTL and SNP data (unpublished) for the pig families with the CNV data reported here and with a more comprehensive genome-wide CNV study in our group should provide a framework for genetic association studies that will hopefully unravel the biological relevance of genetic variation and its effect on economically important traits.