|Home | About | Journals | Submit | Contact Us | Français|
Edited by Dr Osamu Ohara
The diversity and population genetics of copy number variation (CNV) in domesticated animals are not well understood. In this study, we analysed 75 genomes of major taurine and indicine cattle breeds (including Angus, Brahman, Gir, Holstein, Jersey, Limousin, Nelore, and Romagnola), sequenced to 11-fold coverage to identify 1,853 non-redundant CNV regions. Supported by high validation rates in array comparative genomic hybridization (CGH) and qPCR experiments, these CNV regions accounted for 3.1% (87.5 Mb) of the cattle reference genome, representing a significant increase over previous estimates of the area of the genome that is copy number variable (~2%). Further population genetics and evolutionary genomics analyses based on these CNVs revealed the population structures of the cattle taurine and indicine breeds and uncovered potential diversely selected CNVs near important functional genes, including AOX1, ASZ1, GAT, GLYAT, and KRTAP9-1. Additionally, 121 CNV gene regions were found to be either breed specific or differentially variable across breeds, such as RICTOR in dairy breeds and PNPLA3 in beef breeds. In contrast, clusters of the PRP and PAG genes were found to be duplicated in all sequenced animals, suggesting that subfunctionalization, neofunctionalization, or overdominance play roles in diversifying those fertility-related genes. These CNV results provide a new glimpse into the diverse selection histories of cattle breeds and a basis for correlating structural variation with complex traits in the future.
Copy number variations (CNVs) are deletions and insertions of genomic sequence between two individuals of a species.1–3 Substantial progress has been made in understanding the impacts of CNVs on both normal phenotypic variability and disease susceptibility in human.4 The majority of previous studies of CNV in domesticated animals have been based on array comparative genomic hybridization (CGH) experiments or single-nucleotide polymorphism (SNP) arrays.5–13 As in human,14 sequence-based approaches are becoming popular for the study of CNVs in domesticated animals.15–17
The CNV distribution within and among species seems to be shaped by mutation, selection, and demographic history.18,19 However, unlike SNPs and microsatellites, the population genetics of CNV is largely unknown.1,20–23 Several studies have explored the evolution and adaptation aspects of CNVs in human,24,25 and other species.26–29 Only a few cases of recent positive selection were found near AMY1, APOBEC, MAPT, MIF, and UGT2B17 within human populations.30–33 While most of simple deletions and simple duplications (75–90%) display extensive linkage disequilibrium (LD) with SNPs,1,34,35 the properties of CNVs not tagged by SNPs remain unexplored in cattle. This is mainly due to the difficulty of genotyping CNV and to the limited sample size in the published CNV studies.
As one of the most important farm animals, cattle are used for a variety of purposes including dairy, beef, leather, and labour. The majority of the global cattle population can be classified into one of two subspecies: humpless (taurine, Bos taurus taurus) and humped (indicine or zebu, Bos taurus indicus) cattle with dramatic phenotypic differences between them.36,37 Earlier studies indicated that these two subspecies diverged from the last common ancestor between 0.6 and 2 million yrs ago.38,39 There appears to have been two separate domestication events, with taurine cattle likely being domesticated in the Fertile Crescent ~8,000–10,000 yrs ago and indicine cattle in the Indus Valley ~6,000–8,000 yrs ago.40,41 A third independent domestication was proposed in Africa;42,43 however, a recent study did not support this hypothesis.44 Since the early 1800s, breed development has primarily been based on phenotypic selection on coat colour and polled phenotypes. More recently, the adoption of effective genetic selection programs and widespread use of artificial insemination resulted in bottlenecks followed by breed expansion. During the last 50 yrs, animal breeding based on quantitative genetics has resulted in remarkable progress in improving milk and meat production traits.45,46 Therefore, selective (natural and human-imposed) and non-selective forces (demographic events and introgression) have driven changes within the cattle genome. Their combined effects have created exceptional phenotypic diversity and genetic adaptation to local environments across the globe within the modern cattle breeds. For example, indicine cattle are better adapted to warm climates and demonstrate superior resistance to tick infestation than do taurine breeds.47 Likewise, beef and dairy cattle breeds display distinct patterns in selected metabolic pathways related to muscling, marbling, and milk composition traits. Although cattle genome evolution and demographic history have been explored from multiple aspects, the diversity and population genetic properties of CNV in cattle are still unexplored.
In this study, we compare the diversity and population-genetic properties of CNVs in ~70 cattle individuals sequenced to medium coverage (mean ~11.8×). The data set includes multiple individuals from eight representative cattle breeds, representing both the major taurine and indicine breeds used for both dairy and beef purposes. It provides unprecedented genome-wide resolution to interrogate CNV and a unique opportunity to fully explore the population-genetic properties and evolutionary contributions of multicopy genes related to breed-specific traits.
We used a previously described segmentation algorithm to call CNVs.16,48 Detailed sample selection and CNV calling methods can be found in Supplementary Material online. Association between CNVs and SDs was tested by Spearman's rank correlation using 100 kb windows as previously described.49 Additional genomic features were obtained from public databases. Determination of the overlap between CNVRs and genomic features was performed as previously described.10
Inbreeding is a common feature in livestock due to selective mating and widespread use of artificial insemination. We filtered our samples based on known pedigrees constraining Wright's coefficient of relationship (r) to <0.25 to identify 69 unrelated individuals for the population-genetic analyses.
During the CNV discovery phase, a total of 1,148,528 windows of 1 kb were identified across the whole genome. Population-specific CNVs were estimated using the statistic VST developed by Redon et al.21. VST is calculated by considering (VT − VS)/VT, where VT is the variance in normalized copy numbers among all unrelated individuals and VS is the average variance within each population, weighted for population size.
We next selected the top 1% diverse 1 kb windows (n = 80) from the distinct CNV regions to perform CNV genotyping using the partitioning around medoids (PAM) function in R.50 A PAM procedure was used to cluster copy number ratios into discrete CN genotypes.21 Similarly, we partitioned the copy numbers of each 1 kb window into three clusters, representing the low, mid, and high ranges and then coded them using the 0, 1, and 2 matrix for SNP genotyping. Population clustering was then performed using STRUCTURE v2.3.3,51,52 assuming three ancestral populations (k = 3). This analysis between taurine and indicine cattle was initially run for values of the number of clusters (k) between 2 and 8. Each analysis was performed using 100,000 replicates and 100,000 burn-in cycles under admixture and correlated allele frequencies models. Reynolds' genetic distances among breeds were calculated using PHYLIP 3.69. To provide statistical support for the resulting clades, 10,000 bootstrap simulations were performed. The phylogenetic trees were visualized in FigTree 1.3.1 (http://tree.bio.ed.ac.uk/software/figtree/). Multidimensional scaling (MDS) analysis was conducted in PLINK 1.07 based on either CNV genotypes generated in this study or SNP genotypes generated using the same bovine HapMap populations as described previously.53
NimbleGen array CGH log2 ratios were calculated for each probe on a custom 2.1 million probe array for all animals. The reference animal, in all cases, was the Hereford cow used to generate the cattle reference assembly, L1 Dominette. All log2 ratio values that spanned the GAT, GLYAT, and KRTAP9-1 genes were averaged across the gene's length for each animal individually. VST values contrasting taurine and indicine populations were calculated for these log2 ratios as previously described.21
Heatmaps were generated using the estimated CN windows for each animal as described previously.16 The gplots (v 2.14.2) R package (http://cran.r-project.org/web/packages/gplots/index.html) was used to graph the CN values and generate heatmap representation of all lineage-specific gene duplications, deletions, and expansions identified in cattle breeds.
Gene content of cattle CNVRs was assessed as previously described.10 We performed DAVID analysis to test whether the terms were under- or overrepresented in CNVRs after Bonferroni correction.10 We identified the lineage-specific or lineage-differential gene families using a heuristic approach with our 75 analysed individual animals. We divided the animals into breed, subspecies, and purpose groups as listed in Table Table11 and used a weighted search algorithm to highlight CNVRs with a high tendency to exist solely within a specific group. The weighted search was accomplished as follows: for each CNVR we calculated a sum score that represented a hypothesis that the CNVR was unique to a specific breed/subspecies based on the animals that shared the CNVR. For each breed/subspecies/purpose group (G), we counted the number of animals (A) from G that shared the CNVR and imposed a penalty (P) for each animal that was not a member of the current G. The sets of G that were tested consisted of dairy, beef, Angus, Holstein, Limousin, Jersey, Romagnola, Nelore, Gir, Brahman, Taurus, and Indicus (membership was not mutually exclusive within the groups). The summed weight of A–P was calculated for each G, and if it exceeded a threshold of 3, it was selected as a putative subspecies/breed-specific or differential CNVR. We also employed a statistical method (VST) to identify copy number variable genes within our data set. Based on the gene CNs from each animal, we identified gene families that were stratified by subspecies differences using the statistic VST, as described above.
To explore the diversity of haplotypes and evolutionary relationships across populations, we retrieved the high-density SNP array data for these eight breeds generated by the Illumina BovineHD SNP Consortium as described previously.53 Haplotypes and their frequencies were estimated separately for each breed using PHASE 2.1.54,55 To obtain reliable results, we employed an iterative scheme to perform inference with 10,000 iterations and 10,000 burn-ins, also we increased the number of iterations of the final run of the algorithm using option -X100, for details see http://stephenslab.uchicago.edu/instruct2.1.pdf. Haplotype networks were constructed near functional genes such as GAT/GLYAT, ASZ1, AOX1, and FZD3. Phylogenetic relationships among the identified haplotypes were inferred through a median-joining network analysis using Network 4.6.12 (http://www.fluxus-engineering.com/).
All array CGH data have been submitted to the gene expression omnibus (http://www.ncbi.nlm.nih.gov/geo/) under the accession number GSE62990. Raw data and population genetic and evolutionary analysis results are available upon request for research purposes. Raw reads were deposited under the SRA Bioproject PRJNA277147 in SRA (http://www.ncbi.nlm.nih.gov/sra/).
After carefully excluding samples with low coverage in our initial survey, we focused on the remaining 75 individuals in our final data set (see Table Table11 and Supplementary Table S1 for sample information and sequence coverage). We identified CNVs using a sliding window approach based on the previously published MrsFAST-WSSD method.16,48 We discovered comparable average numbers of CNVs per individual across taurine (626.7) and indicine (591.2) cattle, suggesting our results based on the taurine reference assembly (UMD3.1) were not particularly biased against the indicine cattle. A full list of CNV calls (47,511) is presented in Supplementary Fig. S1 and Table S2. After merged across samples, these CNVs yielded 1,853 CNV regions (CNVR), which represent 87.5 Mb (3.1%) of the cattle genome (Supplementary Fig. S1 and Table S3). We then calculated absolute copy number values for 1 kb windows across the genome for each sequenced individual (see Materials and methods). As anticipated, the average normalized genome-wide copy number was 2.15 ± 0.1 for all copy number windows. We successfully performed 75 quantitative PCR (qPCR, Supplementary Table S4) and 25 array CGH experiments (Supplementary Fig. S2) to assess the false-positive discovery rate for our data set as previously described.3,53 Detailed experimental validations can be found in Supplementary Material online. In summary, CNV calls made with sequence data were strongly correlated with array CGH data (r2 = 0.761) and had an estimated 12% false-positive rate and a 19% false-negative rate based upon qPCR and array CGH, respectively.
We found that a large proportion of identified CNVRs (43.3%; 49.5 Mb) overlapped with the segmental duplication (SD) regions. We estimated pair-wise Spearman's rank correlations (Supplementary Table S5) of 0.084 and 0.098 for indicine and taurine CNVs and SD regions (both ), which were similar to the previously reported human results.49 A strong correlation of CNVs and SDs in cattle confirms that their formation mechanisms are mainly due to non-allelic homologous recombination (NAHR).10,56 In the following analyses, we mainly focused on the characterization of the high-confidence CNVs from autosomes.
To investigate the population genetics of CNVs, we first identified the frequencies of CNVRs in our data set (Supplementary Fig. S3). The average CNVR had a frequency of 29.3% (22 animals out of 75 total); however, the CNVR frequency tended towards a parabolic distribution with 835 CNVRs having a frequency ≤5% and 189 CNVRs having a frequency≥95% in our data set (Supplementary Fig. S3). As expected, rare events were often observed in only one subspecies/breed, whereas common CNVs (frequency >5%) were usually shared across subspecies/breeds.
To explore the population differentiation of CNVs between taurine and indicine cattle, we applied statistical measures of population differentiation using VST21 to our dataset in three ways: (i) estimation of VST for genome-wide 1 kb CN windows; (ii) clustering of the top 1% of VST values for 1 kb CN windows; and (iii) estimation of VST using the average CN for annotated genes. Estimates of VST for all genome-wide CN windows and all CNVRs revealed a number of outliers with levels of population differentiation suggestive of population-specific selective pressures (Fig. (Fig.1).1). Among these outliers were CNVs near AOX1, GAT/GLYAT, ASZ1, KRTAP9-1, and MCM4 (Fig. (Fig.11A).
We next selected the top 1% diverse 1 kb windows (n = 80)—identified by our VST calculations—from the distinct outlier CNVRs to perform CNV genotyping. The PAM algorithm is the most common implementation of k-medoid clustering, which is related to the k-means algorithm and the medoid shift algorithm.50 Previously, a PAM algorithm was used to cluster copy number derived from array CGH log2 ratios into discrete CN genotypes.21 Similarly, we partitioned the copy numbers of each 1 kb window into three discrete value clusters, representing low, mid, and high ranges and then coded them as 0, 1, and 2 values within a matrix for subsequent genotyping (Supplementary Table S6).
Using CNV genotype calls from the top 1% diverse 1 kb windows within CNVRs, we performed population clustering (Fig. (Fig.2).2). The proximity of an individual to each apex of the triangle indicates the proportion of that individual's genome that is estimated to have ancestry from each of the three inferred ancestral populations. The clustering of most indicine cattle (BRM, GIR, NEL) in the right bottom apex reveals the clear discrimination between indicine and taurine cattle. In contrast, the taurine cattle were scattered along the opposing side with the exception of ROM near the centre. This distribution of ROM individuals agreed with previous results based on SNP genotypes,44 confirming that ROM has both taurine and indicine ancestries. It is also noted that ANG individuals were clustered together in the upper apex, while the other taurine cattle (HOL, LMS, JER) were dispersed around the left bottom corner, suggesting a distinction between different taurine breeds. It is possible that these two clusters differentiate between continental European breeds of cattle from UK breeds or beef breeds from dairy breeds; however, it is also possible that our selected CNV markers may be subject to cryptic founder effects within our ANG individuals. We suspect that the addition of African cattle breeds to this dataset will better resolve the taurine cluster by providing a third distinct lineage. Still, we note that the clustering results were structurally similar to our results obtained with high-density SNP data derived from the same bovine HapMap samples (Supplementary Fig. S4) and other recently published results.44 Based on the genotypes for these 80 loci (i.e. the top 1 kb windows) and using a neighbor-joining algorithm, we obtained a phylogenetic tree that generally agrees with the known cattle breed history (Supplementary Fig. S5). We also performed a MDS analysis based on CNV genotypes and compared it with the plot based on SNPs. Our plot confidently separated the indicine from the taurine cattle (Supplementary Fig. S6); however, the separation and clustering of the taurine cattle using CNVs were not superior to those based on SNPs, suggesting that CNV genotyping still has room for improvement.
Out of these 80 loci, 62 can be reliably assessed for their variable patterns and 54 of these loci, in turn, (87.10%, 54/62) are located in or near tandem duplications (Supplementary Table S6). This estimate was consistent with our initial genome-wide results that 90% of SD in cattle are tandem duplications in contrast to human and other primates, which show a preponderance of interspersed duplications.56 This led us to speculate that while it is challenging to systematically genotype cattle duplication CNV events as shown by Genome STRiP results in human,57 our relatively high cattle CNV genotyping accuracy is likely due to the vast majority of cattle CNV being tandem duplicates. Large tandem repeats or duplication CNVs in cattle could behave similarly like human tandem macrosatellites and multicopy genes. For these tandem duplications, it is likely that we made reasonable approximations of CNV genotyping calls by simply clustering the normalized copy numbers, as shown traditionally for macrosatellites and microsatellites.58–61 Additionally, the tandem distribution pattern could contribute to the high LD at CNV loci as suggested previously,62 thus the majority of CNV genotype calls could better represent local alleles. Combining these two factors, it is not surprising that our CNV-based results generally agree with SNP-based results. Of course, this hypothesis certainly warrants more investigation using larger sample sizes and other mammals like mouse and dog to further validate and improve CNV genotyping approaches.
To provide an evolutionary perspective to our analyses, we also created heatmaps using the CN values for regions within selected gene loci (Supplementary Fig. S7). These analyses of lineage-specific or lineage-differential CNVs separate subspecies/breeds into groupings that are consistent with the generally accepted cattle history.44
We evaluated genes overlapped by cattle CNVs (Supplementary Table S7) and selected genes with known functions (Table (Table2).2). We observed an enrichment of CNVs intersecting with genes (P < 0.0001; Spearman's rank sum correlation), consistent with reduced evolutionary constraints acting on functionally redundant gene categories. We next used DAVID to identify basic biological functional categories for 361 genes overlapped by our identified CNVRs.63 Like other mammals (human, mouse, and dog), statistically significant overrepresentations were observed for multiple categories including chromosome maintenance, immunity and cytoskeleton components (Supplementary Table S8). We then studied how variable genes were distributed across subspecies/breeds using either a heuristic approach based on CNV presence/absence or gene CN per individual (VST).
We first identified lineage specific, copy number variable genes (CNV genes) using a heuristic approach (see Materials and methods). Dairy cattle-specific CNVs tended to be present at low frequencies in our 28 dairy cattle, and they manifested as small copy number changes of affected genes. Several of these dairy-specific CNVRs were found to intersect genes related to cellular growth and development pathways, including RICTOR (rapamycin-insensitive companion of mTOR)64 and TMED2.65 We also identified several lipid metabolism genes that overlapped CNVRs exclusive to beef cattle (over 40 samples). Within our Angus data set, we discovered six animals that had a predicted heterozygous duplication of PNPLA3 (the Patatin-like phospholipase domain-containing protein 3). This gene is expressed in adipose tissue and liver, and is associated with the de novo synthesis of fatty acids.66 In indicine beef cattle (Nelore, Brahman, and Gir), we also detected a duplication in a predicted Ensembl gene (ENSBTAT00000043749) containing functional domains related to lipid metabolism.
We identified copy number variable genes among the different subspecies/breeds using VST statistics. We defined highly stratified genes as genes having VST values >0.2. CN plots for these stratified genes showed clear differences in the average CN value for taurine and indicine animals (Fig. (Fig.3A).3A). Based on VST values, ZNF280B, FBXO16, KRTAP9-1, MCM4, SERPINB4, CA1, FZD3, GLAYAT, GAT, MANBA, and DENR were the most stratified genes. To provide orthogonal experimental support for the sequence-based VST results, we also retrieved log2 ratios from array CGH data for the same animals. Representative results are shown in Supplementary Fig. S8 for the GAT, GLYAT, and KRTAP9-1 genes, further confirming the sequence-based VST results.
CN stratified genes tended to be immune system related, which is expected given the different environmental challenges in the history of evolution of taurine and indicine cattle. One of these stratified CNVs represents a significantly higher duplication of the KRTAP9-1 gene in taurine cattle, which is a paralog of KRTAP9-2 that was previously reported to likely be involved in indicine tick resistance (Fig. (Fig.3B).3B). A duplication of the MCM4 gene (Fig. (Fig.3C)3C) was found in indicine cattle compared with taurine cattle. We also confirmed several other gene families appearing to be copy number variable, including lysozyme, defensin, and unique long binding protein (ULBP) families and the major histocompatibility complex (MHC).
Discoveries of high-frequency gene duplications suggest that the affected gene families are currently expanding in a ‘gene family birth and death’ model as described by Nei and Rooney.67 Such multicopy genes, if present in a sufficiently large proportion of the population, can be thought of as signs of diversifying selection or selection by overdominance.68 One example of this can be found in the Olfactory Receptor (OR) gene family, which has several member genes that detect odorant molecules through combinatorial binding across other paralogous family members.69 Therefore, the duplication and subsequent mutation of OR gene members allow for a greater range of odorant detection for a host organism. Indeed, out of 134 annotated OR genes, we have identified 31 (23.1%) individual genes that have predicted duplications in our data set.
We have detected several additional gene families that appear to be subject to a high degree of duplication in our data set, and these families likely represent classes of genes that are in the processes of subfunctionalization and neofunctionalization in cattle. They include a cluster of prolactin-related protein family (PRP) genes that appears to be duplicated in 96% (74/75) animals. It was previously discovered using the BovineSNP50 array70; however, we refined the event from 2.4 down to 0.7 megabases in size. Another locus containing several pregnancy-associated glycoprotein (PAG) family members was found to be duplicated in all animals within CNVR 1717.
It is important to note that CNVs in some loci may have different alleles. Earlier results also suggest that the diversity of a subset of multicopy genes like human OR genes may have been maintained by balancing selection, in the form of overdominance.68 For example, a 660 kb deletion with antagonistic effects on fertility and milk production was recently found at high frequency in Nordic Red cattle, providing evidence for balancing selection of CNVs in livestock.71 To investigate the potential effect of overdominance on selection and evolution of multicopy genes, we further investigated haplotype evolution pattern using the BovineHD SNP array. We obtained 11 haplotypes within the 50.3 kb haploblock region near the GLYAT/GAT locus (Fig. (Fig.4A).4A). The most common haplotype, H1 (with frequency of 70.06%), was mainly found in taurine cattle (HOL, ANG, JER, and LMS) and only minor portions were found in indicine cattle (ROM, BRM, GIR, and NEL) (Fig. (Fig.4A).4A). H2 (with frequency of 10.56%) included a large proportion of taurine cattle (ANG, LMS, and ROM). We also observed two haplotypes exclusive to indicine cattle with a combined frequency of 10.54%: H3 and H4. Altogether, this pattern indicated that separate haplotypes were clustered only for the indicine cattle (BRM, GIR, and NEL), while other common haplotypes were identified for taurine cattle (HOL, ANG, LMS, JER, and ROM). BRM and ROM were the only exceptions, as they were often associated with both taurine and indicine cattle. This was not unexpected, as it mirrors the complex ancestral backgrounds of these two breeds, since BRM cattle are a known indicine breed with taurine influence and ROM share distinctive genetic ancestry with indicine cattle.44 We also found similar results for other copy number variable genes, such as ASZ1 (Fig. (Fig.4B),4B), AOX1, and FZD3 (Supplementary Figs S9 and S10). These haplotype network analyses suggest that for a subset of multicopy genes: (i) common overlapping allelic haplotypes were often present within the taurine cattle, while separate distinct haplotypes were present in the indicine cattle, suggesting different evolutionary history for these two cattle subspecies; and (ii) there was high allelic diversity near multicopy genes maintained by balancing selection, in the form of overdominance, suggesting that they have been under different selection pressures in these two cattle subspecies.
Different versions of cattle reference genome assemblies (Btau_4.0 and UMD3.1) have different RefSeq gene annotations, particularly in CN variable regions of the cattle genome. For example, the CATHL4 gene, for which we previously reported copy number change,16 was located on chrUn of UMD3.1. Since our RD method uses a window approach that relies on large genomic segments, this prevented us from assessing the copy number status of CATHL4 on chrUn. However, we did find CATHL1 was copy number variable in this study. Similarly, we were unable to assess the copy number of KRTAP9-2; however, we detected copy number changes for one of its paralogs, KRTAP9-1, in this study.
Similar to SNP and microsatellite, CNV distribution within and among populations seems to be shaped by mutation, recombination, gene conversion, selection, and demographic history.18,19,28,72 However, CNV genetic markers may not be currently compatible with current population analyses, because CNVs violate the classical population genetics assumptions based on the infinite allele model and the infinite site model for SNP. Compared with SNPs, limitations of CNVs as markers were observed in this study probably due to their distinct mutation mechanisms, high mutation rates, heterogeneities among loci, and uncertainties related to allele calling. Similar observations were also reported for microsatellite—primarily due to similar limitations of detection and variability.73,74 For example, although NAHR is believed to be responsible for most of large duplication CNVs in cattle, inference, and predictions on the forces influencing populations require the modelling of the mutational process generating CNV. We currently lack such a model for the large duplications present in cattle. This is also compounded by the fact that homoplasy caused by recurrent events is expected to occur relatively often for CNV compared with SNP because of their high mutation rates.
Despite the aforementioned limitations, this study represents one of the first attempts to genotype CNVs within large, diverse cattle populations using sequence data. Although beyond the scope of this study, a comparison with human-centric CNV genotyping methods using cattle sequence data will provide a useful contrast in approaches. Our results provide a new glimpse into the diversity of selective pressures during cattle speciation. We confirmed that cattle are strikingly diverse, despite relatively low estimated current population sizes for several taurine cattle as shown previously.37 Our population-genetic analyses based on CNVs reveal the population structures of these taurine and indicine cattle and uncover hundreds of CNVs showing elevated population differentiation near important functional genes. We highlighted several subspecies specific or differential CNV gene overlaps that are likely subject to subfunctionalization and neofunctionalization. We also identified key regions of the cattle genome that are subject to variation and reported several potential genes affecting productive traits. These discoveries provide a basis for future efforts to genotype and track large CNVs in cattle. More sequencing data from the 1000 Bull Genomes Project75 or the analysis of additional outlier groups (e.g. African cattle breeds) will help to validate and refine the link between genomic copy number in these regions or different alleles with production and health traits.
G.E.L. and D.M.B. conceived and designed the experiments. D.M.B., L.X., J.L.H., J.B.C., and J.S. performed in silico prediction and computational analyses. D.M.B. and L.X. performed aCGH, qPCR confirmation. D.J.N., S.G.S., J.F.G., C.P.V.T., T.S.S., R.D.S., J.F.T., and H.A.L. collected samples and generated sequence data. G.E.L. and D.M.B. wrote the paper.
T.S.S. is an employee of Acceligen Inc. of Animal Agriculture Subsidiary of Recombinetics, Inc. All other authors declare no potential conflict of interest.
This work was supported in part by Agriculture and Food Research Initiative (AFRI) competitive grant No. 2011-67015-30183 from the USDA National Institute of Food and Agriculture (NIFA) Animal Genome Program. J.F.T. and C.P.V.T. were supported by AFRI competitive grant No. 2009-65205-05635 from the USDA NIFA. J.F.T. and R.D.S. were further supported by AFRI competitive grants No. 2011-68004-30367, 2011-68004-30214 and 2013-68004-20364 from the USDA NIFA Animal Genome Program. Funding to pay the Open Access publication charges for this article was provided by the USDA National Institute of Food and Agriculture (NIFA) Animal Genome Program.
We thank members of the Illumina BovineHD SNP Consortium for sharing their samples and data. We would thank the Council on Dairy Cattle Breeding (Reynoldsburg, OH) for access to animal phenotypic information, and we thank the Cooperative Dairy DNA Repository (Beltsville, MD) for access to semen stocks used to generate Illumina DNA sequence libraries. We also thank T. Brown, R. Anderson, and A. Dimtchev for technical assistance. Mention of trade names or commercial products in this article is solely for the purpose of providing specific information and does not imply recommendation or endorsement by the US Department of Agriculture.