Milk and mammary gene sets
Two proteome studies of bovine milk [11] were used to derive a milk protein gene set of 197 unique genes (see 'Collection of the milk protein set' in Materials and methods). Using 94,136 bovine mammary ESTs, mammary gene sets were created to represent the following developmental stages or conditions: virgin, 3,889 genes; pregnancy, 1,383 genes; lactation, 3,111 genes; involution, 867 genes; and mastitis, 840 genes (see 'Collection of the mammary gene sets' in Materials and methods). In total, 6,469 genes are constituents of one or more of these mammary gene sets, suggesting that one-quarter of all predicted genes are expressed in the mammary gland at some point during the lactation cycle. Genes from the milk protein and mammary gene sets are present on all 29 bovine autosomes and on the X chromosome (Figure 2).
Figure 2 Distribution of milk and mammary genes across all bovine chromosomes. In this chromosome map, each of the 30 bovine chromosomes is illustrated by a pair of columns, with genomic locations of milk and mammary genes in the first column, and milk-trait QTL ...
The milk protein gene set is the most extensive curation to date of genes that give rise to milk proteins, the functions of which have not yet been comprehensively studied. To gain insight into the possible molecular functions of milk proteins, the milk protein gene set was analyzed for enriched molecular function Gene Ontology (GO) terms (see Materials and methods). Four significant, minimally redundant molecular function GO terms were identified: 'GTPase activity,' 'GTP binding,' 'pattern recognition receptor activity,' and 'calcium ion binding.' More than 30 milk proteins that were previously isolated in the milk fat globule membrane [11] were associated with 'GTPase activity' or 'GTP binding'. GTPases are known to be involved in numerous secretory processes, and for this reason, it seems likely that these proteins have a role in assembly and secretion of the milk fat globule and possibly other milk components. The 'pattern recognition receptor activity' GO term was enriched due to the presence in milk of the cell surface and immune recognition components CD14 [GenBank:NM_174008], TLR2 [GenBank:NM_174197], TLR4 [GenBank:NM_174198], and DMBT1 [GenBank:S78981]. These proteins are involved in the activation of the innate immune system when they associate with cells. Further, the soluble forms of CD14 and TLR2, which can act as decoy receptors for microbial pathogens, could potentially modulate local inflammation following bacterial colonization in the neonate gut [13]. Enrichment of the GO term 'calcium ion binding' was expected as many milk proteins are known to bind calcium, a mineral required in abundance by the growing neonate.
Milk is traditionally thought of as a food that provides the neonate with nutrients and some immune protection, such as that provided by immunoglobulins. Prior research also suggests that various milk proteins are resistant to digestion by gastric proteases at physiological pH [15] and that intact or partially intact milk proteins may either express their functions in the neonatal intestinal tract or may be absorbed and act on other organs [16]. To understand what signaling might be possible if milk proteins remain partially or wholly undigested, the milk protein gene set was interrogated for enriched pathway annotations (see 'Pathway analysis' in Materials and methods). The milk protein gene set contains elements of two marginally significant pathways that lead to activation of PPARalpha and LXR, two nuclear receptors involved in sensing nutrients and modifying metabolic responses at the level of gene transcription. Milk proteins that are associated with the LXR/RXR activation pathway include the cell surface or secreted molecules CD14 [GenBank:NM_174008], CD36 [GenBank:NM_174010], TLR4 [GenBank:NM_174198], and MSR1 [GenBank:NM_001113240], the apolipoproteins APOA1 [GenBank:NM_174242] and APOE [GenBank:NM_173991], and the lipid synthesis enzymes ACACA [GenBank:NM_174224] and FASN [GenBank:NM_001012669]. Those associated with the PPARalpha/RXRalpha activation pathway include the cell surface molecule CD36 [GenBank:NM_174010], the endoplasmic reticulum protein disulphide isomerase PDIA3 [GenBank:NM_174333], the apolipoprotein APOA1 [GenBank:NM_174242], the transcription factor STAT5B [GenBank:NM_174617], the heat shock protein HSP90AA1 [GenBank:NM_001012670], the regulator of adenylate cyclase GNAS [GenBank:NM_181021], and two enzymes involved in lipid synthesis, GPD2 [GenBank:NM_001100296] and FASN [GenBank:NM_001012669]. It is likely that the products of these genes, which are well known to be active at metabolic control points in many organs, are active in the mammary gland and then enter the milk via cytoplasmic crescents in the milk fat globules. Keenan and Patton [17] noted that cytoplasmic sampling, as part of milk fat globule formation, is present in all species examined to date, including humans, and that such evolutionary persistence suggests possible benefits for mother or offspring. Further research will be needed to determine whether these proteins are present in milk at sufficient quantities to have a physiological effect in the neonate.
All mammary gene sets were interrogated for enrichment of GO terms or pathway annotations, but the results did not further our knowledge of mammary biology. Consistent with our previous study [18], current GO term annotations were incomplete or generally out of context when applied to the mammary gland. Although bovine EST data indicate that more than 3,000 genes are expressed in the lactating mammary gland, a mere 22 genes are currently annotated with the GO term 'lactation.'
Bovine milk production QTL
Milk trait QTL delineate genomic regions that harbor genes or cis-acting elements that are responsible for the milk trait phenotype. The dairy industry has invested enormous resources into the identification of these QTL for milk production traits in cattle, particularly milk yield, protein yield, fat yield, protein percentage, and fat percentage. Reviewing the literature, 238 milk trait QTL were identified for these five traits in 59 references (Additional data files 8-9). Of the 238 QTL, 63 were reported with flanking markers having a median interval size of approximately 17 million base pairs. Following a previously established method [19], the 175 remaining QTL that were reported with only a single peak marker were assigned this median interval size. Some QTL were reported for more than one milk trait; thus, these QTL span only 168 unique genome locations. These milk trait QTL span all 29 autosomes (Figure 2), with the highest densities of QTL occurring on chromosomes 27, 6, 20, and 14 (Additional data file 10). Possible differences in genetic architecture are most obvious between fat and protein percentage traits, where fat percentage QTL are present on fewer chromosomes with lower QTL density and protein percentage QTL are present on all but two chromosomes, most with higher QTL density (Additional data file 10). Fat percentage may be controlled by relatively fewer genes each with larger effects, whereas protein percentage may be controlled by far more genes each with smaller effects.
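The interval assignment described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the procedure of reference [19]: the record layout, the function name `assign_qtl_intervals`, the toy coordinates, and in particular the choice to center the median-sized interval on the peak marker (clipped at position 0) are all hypothetical.

```python
from statistics import median

def assign_qtl_intervals(qtls):
    """Give every QTL a (start, end) interval in base pairs.

    QTL with flanking markers keep their reported interval. QTL with only
    a peak marker receive the median size of the flanked intervals,
    centered on the peak (centering is an assumption of this sketch).
    """
    flanked = [q for q in qtls if "start" in q and "end" in q]
    median_size = int(median(q["end"] - q["start"] for q in flanked))
    half = median_size // 2
    intervals = []
    for q in qtls:
        if "start" in q and "end" in q:
            start, end = q["start"], q["end"]
        else:
            start, end = max(0, q["peak"] - half), q["peak"] + half
        intervals.append((q["trait"], q["chrom"], start, end))
    return median_size, intervals

# Toy records (hypothetical coordinates, not data from the study).
toy_qtls = [
    {"trait": "milk yield", "chrom": "6", "start": 30_000_000, "end": 40_000_000},
    {"trait": "fat percentage", "chrom": "14", "start": 0, "end": 17_000_000},
    {"trait": "protein yield", "chrom": "20", "start": 20_000_000, "end": 40_000_000},
    {"trait": "protein percentage", "chrom": "6", "peak": 50_000_000},
    {"trait": "fat yield", "chrom": "27", "peak": 5_000_000},
]
toy_median, toy_intervals = assign_qtl_intervals(toy_qtls)
```

In the toy data the three flanked intervals are 10, 17, and 20 Mb wide, so the two peak-only QTL each receive a 17 Mb interval; the one near the chromosome start is clipped at 0.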
The milk trait QTL provide a very coarse map of genomic areas of interest that cover nearly half of the bovine assembly. Milk yield QTL overlap with 19.5% of the genomic assembly, fat yield QTL with 15.4%, protein yield QTL with 21.1%, fat percentage QTL with 12.3%, and protein percentage QTL with 33.6% of the genome assembly. The densities of genes within these QTL are very similar for each milk trait, with between 9.1 and 10.1 genes per million base pairs. Meanwhile, there are 8.4 genes per million base pairs in regions that do not overlap with any milk trait QTL. Given the gene density and number of QTL associated with each trait, each individual QTL is expected to contain between 105 and 127 genes.
To identify candidate genes within milk trait QTL, the lactation mammary gene set was intersected with the milk trait QTL. Between 12.5% and 13.7% of the genes within milk trait QTL are expressed during lactation. In other words, within a single milk trait QTL, between 13.9 and 17.1 genes are expected to be expressed during lactation. Thus, although the set of milk trait QTL reduces the search space for milk trait effectors by less than one order of magnitude, the use of expression data can contribute considerably towards the identification of candidate genes. Genes within milk trait QTL that are expressed in the mammary gland during lactation are listed in Additional data files 11-16. Milk trait effectors are likely to be near these candidate genes.
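As an illustration of the intersection step, the following sketch keeps genes that overlap at least one QTL interval and are members of a lactation gene set, and reports the lactation-expressed fraction. The tuple layouts, function names, and toy coordinates are assumptions made for this example only; the study's actual implementation is not described in this section.

```python
def genes_in_qtl(genes, qtls):
    """Return names of genes whose span overlaps at least one QTL.

    genes: iterable of (name, chrom, start, end)
    qtls:  iterable of (chrom, start, end)
    """
    by_chrom = {}
    for chrom, start, end in qtls:
        by_chrom.setdefault(chrom, []).append((start, end))
    hits = []
    for name, chrom, g_start, g_end in genes:
        # Two closed intervals overlap when each starts before the other ends.
        if any(g_start <= q_end and q_start <= g_end
               for q_start, q_end in by_chrom.get(chrom, [])):
            hits.append(name)
    return hits

def lactation_candidates(genes, qtls, lactation_set):
    """Genes inside QTL that are also expressed during lactation,
    plus the fraction of QTL genes they represent."""
    in_qtl = genes_in_qtl(genes, qtls)
    candidates = [g for g in in_qtl if g in lactation_set]
    fraction = len(candidates) / len(in_qtl) if in_qtl else 0.0
    return candidates, fraction

# Toy example with hypothetical gene names and coordinates.
toy_genes = [
    ("A", "6", 100, 200),
    ("B", "6", 5000, 5100),
    ("C", "14", 150, 250),
    ("D", "6", 190, 300),
]
toy_qtls = [("6", 150, 400)]
toy_candidates, toy_fraction = lactation_candidates(toy_genes, toy_qtls, {"A", "C"})
```

Here genes A and D fall within the single QTL on chromosome 6, and only A is in the lactation set, so half of the QTL genes are candidates.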
Genome organization of milk and mammary genes
Studies of eukaryotic genomes have demonstrated that genes with coordinated expression or shared ancestry appear in clusters across the genome [20]. Given that the clustering of the casein milk protein genes is essential to their coordinated transcription in the lactating mammary gland [9], the arrival of the bovine genome sequence provides the opportunity to discover other gene clusters relevant to milk, lactation, or mammary biology. A genome-wide search was conducted for genomic intervals of 500 kb and greater that are statistically enriched with genes from the milk protein and mammary gene sets (see 'Genomic localization analysis' in Materials and methods). Among these gene sets, 190 non-overlapping statistically significant clusters were identified: four unique clusters in the milk protein gene set and 54, 60, 30, and 19 unique clusters in the pregnancy, lactation, involution, and mastitis mammary gene sets, respectively. Spreadsheets of all significant gene clusters are available in Additional data files 17 and 18.
The four significant milk protein gene clusters comprised the immunoglobulin genes, casein genes, fibrinogen genes, and genes that encode milk fat globule proteins. Because it is known that immunoglobulins, casein genes, and fibrinogen genes are each clustered in mammalian genomes [9], this is a good verification of methodology. The cluster of genes that encode milk fat globule proteins contains FASN [GenBank:NM_001012669], ARHGDIA [GenBank:NM_176650], and P4HB [GenBank:NM_174135]. However, P4HB has only been isolated in mastitic milk [11]. By manual inspection, we found that these genes also cluster in the human, mouse, and other mammalian genomes. Based on EST data, other genes in this genomic region are expressed at various times in the mammary gland. Aside from these four clusters, there does not appear to be a preponderance of putative regulatory modules among genes in the milk protein gene set. Whereas only 6.6% of the milk protein genes were within a milk protein-specific cluster, 27.9% were within one of the mammary gene set clusters. Therefore, it is likely that milk protein genes are regulated along with other mammary genes independent of the function or cellular localization of the proteins they encode.
Next, we examined whether genes were clustered according to developmental stage, but found there were no gross differences in gene clustering using this parameter. Between 24% and 30% of the genes from each mammary gene set - virgin, pregnancy, lactation, and involution - were within one of the other mammary set clusters. Likewise, 28% of the genes from the mastitis mammary gene set fell within a mammary cluster. Thus, mammary genes are not differentially clustered by developmental stage or condition.
Genes may be clustered due to shared evolution, as duplicated genes are often co-localized in the genome. In our study, a significant cluster required a minimum of three genes that were not paralogs. When the paralog requirement was removed, only seven additional unique clusters of triplets or greater were identified. Significant clusters with more than one paralog appear to be confined to the major histocompatibility complex region on bovine chromosome 23. These data suggest that recent duplication is not a common driver of clustered mammary genes in the bovine genome.
In summary, the milk protein genes generally do not form clusters with each other but do appear to form clusters with other mammary genes. Milk protein genes may be regulated along with other lactation genes without regard to the final destination of the gene product. As mammary genes are generally clustered neither by developmental stage nor due to recent duplication, it appears that the need for co-expression in the mammary gland is the common denominator for co-localization rather than co-functionality or shared ancestry. This organization in clusters of co-expressed mammary genes might be constrained by unidentified distal cis-acting elements [20], chromatin conformation [23], or coordinately expressed micro-RNAs [24].
Milk and mammary gene copy number trends in mammals
Gene copy number contributes to genetic diversity both between and within species. Here, copy numbers of bovine milk protein genes were determined in the bovine, human, mouse, rat, dog, opossum, and platypus genomes using orthologs generated for all bovine consensus gene models (see 'Orthology delineation' in Materials and methods). Genes from the milk protein gene set that were uniquely duplicated in B. taurus and those that were missing copies in one or more of the placental mammals were manually curated (see 'Curation of milk protein orthologs' in Materials and methods). K-means clustering of these curated milk protein gene orthologs followed by seriation within each cluster yielded the heatmap shown in Figure 3. Three major trends were identified: single copy of the gene across Mammalia; gene not found in platypus; and duplication after platypus.
Figure 3 Heatmap of milk protein gene copy numbers across mammals. Milk protein genes were clustered by copy number using the K-means algorithm followed by seriation within each cluster. Major trends, which convey the consensus profile of the cluster, are delineated ...
The absence of a milk or mammary gene in platypus or duplication after platypus (Figure 3) may be due to the expansion of gene families in the common therian ancestor. However, some of these genes may not be truly missing in the platypus genome, but may be undetectable by our methods due to incomplete or incorrect assembly of the platypus genome, lower sequence identity, or the inherent bias created by defining milk and mammary genes in the bovine genome. The identification of platypus orthologs of other genes in the bovine genome would also be affected by these biases; therefore, we next compared milk and mammary gene copy number trends to those genome-wide.
For each major trend shown in Figure 3, rates of occurrence among the uncurated orthologs of the milk protein and lactation mammary gene sets were compared with the orthologs of all bovine consensus gene models using a hypergeometric distribution to determine statistical significance. More bovine milk protein orthologs were found in all six studied mammalian genomes than would be expected given the rate at which other bovine orthologs were found in these genomes (P < 0.0001). Genes expressed during bovine lactation were also more likely than other genes to have orthologs in all of the mammalian genomes (P < 0.0001). In other words, milk and mammary genes are more likely than other genes to be found in all mammals. This result might be explained in part by an increased power to detect more conserved genes (see 'Conservation of milk and mammary genes in mammals' below). There were also statistically fewer lactation genes missing in the platypus (P < 0.005) and opossum genomes (P < 2.2 × 10⁻²⁰); however, the number of milk protein genes missing in these genomes did not differ from the genome-wide rate. Finally, more milk protein and lactation genes were duplicated after platypus compared with the whole genome (P < 0.001 and P < 0.03, respectively). Together, these data support the essentiality of milk and mammary genes in Mammalia as well as suggest the possibility for expanded functionality in marsupials and placental mammals.
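A hypergeometric comparison of this kind can be sketched with the Python standard library alone. The function below computes the upper-tail probability of seeing at least k genes with a given trend in a gene set of size N, when n of the M genes genome-wide show that trend. The function name and all example counts are hypothetical and are not values from this study; whether the authors used a one-sided upper tail is also an assumption.

```python
from math import comb

def hypergeom_upper_tail(k, M, n, N):
    """P(X >= k) for X ~ Hypergeometric(M, n, N).

    M: genes genome-wide (population size)
    n: genes genome-wide that show the trend (successes in population)
    N: genes in the set of interest (number drawn)
    k: genes in the set of interest that show the trend
    """
    total = comb(M, N)
    upper = min(n, N)
    # math.comb returns 0 when the lower index exceeds the upper one,
    # so impossible terms drop out of the sum automatically.
    tail = sum(comb(n, i) * comb(M - n, N - i) for i in range(k, upper + 1))
    return tail / total

# Hypothetical counts: 20,000 genes, 12,000 with orthologs in every genome,
# and a 150-gene set in which 120 have orthologs in every genome.
p_example = hypergeom_upper_tail(120, 20_000, 12_000, 150)
```

For a depletion test (for example, fewer genes missing than expected), the lower tail P(X <= k) would be used instead, which equals 1 minus the upper tail evaluated at k + 1.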
Milk protein gene copy number variation may potentially contribute to the diversity of milk composition. Ortholog analysis indicated that the gene for beta-lactoglobulin (LGB), one of the most abundant proteins in milk, is duplicated in the dog and bovine genomes (Figure 3). In the bovine genome, this gene is located at the position of a previously predicted pseudogene [25]. It has similarity to LGB-II genes in the horse and cat [26]. The similarity of this second gene to LGB-II in the horse, cat, and dog suggests that the LGB duplication existed in the common ancestor of the laurasiatherians (Figure 1). Using two different primer pairs, we were unable to identify the LGB-II transcript in bovine mammary tissue samples using RT-PCR (see Additional data file 22 for details). It is likely that the duplicated LGB gene is not expressed in the bovine mammary gland and that the presence of this duplication does not influence the concentration of LGB in bovine milk.
LGB is apparently not present in human or mouse milk [30], although LGB-like proteins have been isolated from the milk of other primates [31]. A human protein, progestagen-associated endometrial protein (PAEP), has significant homology to the bovine and equine LGB-II-like genes [29]. Although PAEP expression has been detected in the epithelial cells of human breast tissue [37], neither its presence nor that of an apparent LGB-like pseudogene [GenBank:AH011480] that flanks the PAEP gene [GenBank:NM_001018049] has been verified in human milk. We found that the LGB-like and PAEP genes are flanked by GLT6D1 [GenBank:NM_182974] and OBP2A [GenBank:NM_014582] in both the human and bovine genomes. This observation, combined with the fact that the baboon has both a PAEP gene [38] and an LGB gene [33], suggests that the primate genes arose by duplication of an ancestral gene before the Laurasiatheria and Euarchontoglires diverged. We were unable to find this region in the rodent or rabbit genomes, and an evolutionary break point is present in mouse and rat in this region [39], suggesting that these genes may have been lost after the split between primates and glires. Although the presence of LGB in laurasiatherian milk and its absence in rodent milks has an obvious genetic basis, we cannot yet explain the absence of LGB in human milk.
Some immune components of milk are uniquely duplicated in certain species or clades. For example, SAA3 [GenBank:NM_181016], which is duplicated in the bovine and dog genomes (Figure 3), is thought to be involved in mucin induction in the gut [40], and a human analog, SAA1, functions as an opsonin for Gram-negative bacteria [42]. The Cathelicidin gene family is greatly expanded in the bovine, opossum, and platypus genomes, with 10, 8, and 12 copies, respectively [43], but some of the opossum and platypus orthologs were not found in our automated analysis due to their high heterogeneity. Expansions in this gene family may reflect increased exposure to bacteria at epithelial surfaces in these species. Our results show that the CD36 gene [GenBank:NM_174010], which encodes a scavenger receptor, has duplications in the B. taurus and rat genomes. Beta-2-microglobulin [GenBank:NM_173893] has a second copy in the bovine genome and may also have a duplicate in the platypus genome. This gene encodes one of two chains in the IgG transporter FcRn, which transfers IgG molecules across epithelial cells [46]. Other variations in milk protein gene copy number (Figure 3) potentially give rise to diversity in milk protein composition.
Milk protein gene loss does not appear to be a common occurrence. Of the bovine milk protein genes with an ortholog identified in the platypus genome (Figure 3), all but ten genes were found in all of the other studied mammalian genomes. However, because the bovine milk proteome is used as the reference, the loss of some milk protein genes in placental mammals relative to the monotreme and marsupial mammals may have been missed in our analysis. For example, whey acidic protein has been identified in the milk of many mammals such as mouse, rat, opossum, and platypus, but it is absent in bovine milk due to a frameshift mutation in the whey acidic protein gene [47]. A full proteomic analysis of the milk samples from extant monotremes and marsupials will be needed to identify gene loss in placental mammals.
Our analysis of milk protein gene copy numbers has several other limitations. First, the mammalian orthologs of bovine consensus gene models derived on a genome-wide basis (see 'Orthology delineation' in Materials and methods) may be inaccurate for genes in which the bovine gene model is incorrect or may be incomplete when orthologs are too divergent to be detected by this method. Although we attempted to overcome these limitations by manually curating milk protein gene orthologs, the analysis is only as good as the available genome sequences, and some duplications and deletions may have been missed due to errors and gaps in the genome assemblies. Directed sequencing will be needed to confirm specific results. However, we can generally conclude that there is considerable copy number variation of milk protein genes that may contribute to the taxonomic diversity of milk composition.
Taxonomic relationships of the milk protein genes
To understand the relationships of the milk proteins between mammalian taxa, a consensus tree of those milk proteins with single copy orthologs in the human, mouse, rat, dog, bovine, opossum, and platypus genomes was constructed using a super-alignment of the concatenated sequences (see 'Consensus tree construction' in Materials and methods). An unrooted radial tree depicting the relationships of the milk protein sequences (Figure 4) differs from the accepted phylogeny (Figure 1). Rodent milk proteins are more divergent from human milk proteins than are dog and bovine milk proteins despite the fact that the rodent and human common ancestor is more recent. To further test the relationships of human milk proteins with those of other taxa, pairwise percent identity (PID) was calculated between the human protein and its putative ortholog for the set of single copy orthologs present in all seven taxa. Average pairwise PIDs for the milk protein gene set confirm that human milk proteins are closest to dog, followed by bovine, then the rodents, then opossum and platypus (Figure 5). This observation is not unique to milk proteins as it is also true on a genome-wide basis [43]. It has been proposed that rodent proteins are more divergent from human than are bovine proteins because rodents have a faster reproductive rate and are, therefore, evolving more quickly [43]. Although rodent milk proteins may appear more distant from human milk proteins than are bovine milk proteins, whether these differences have functional importance is a matter for future scientific inquiry.
Figure 4 Relationships between the milk protein sequences of mammalian taxa. This milk protein consensus tree, which is incongruous with the accepted phylogeny shown in Figure 1, was derived from a super-alignment of milk protein amino acid sequences for those ...
Figure 5 Pairwise percent identity of human milk proteins with milk proteins of other species. Bars depict the average amino acid (AA) pairwise percent identity between human milk proteins and those of the species named on the x-axis. Note that human milk proteins ...
Conservation of milk and mammary genes in mammals
To determine whether milk and lactation-related genes are more or less conserved across mammals than other genes, average PIDs of the 21 pairwise comparisons of the seven taxa were computed on a genome-wide basis for all bovine consensus gene models and genes from the milk protein and mammary gene sets with single copy orthologs in these taxa (Figure 6). The distribution of the average pairwise PIDs of the milk protein gene set did not significantly differ from the whole genome distribution, nor did the means of the two distributions significantly differ (see 'Statistical analysis of PID distributions' in Materials and methods). However, when the sample size was increased by individually assessing pairwise PIDs between human and each of the other six taxa, requiring in each case that orthologs be single copies only in bovine and the two taxa being compared, milk protein sequences were statistically more conserved between human and other mammals than the products of other genes in the genome (see Additional data file 20 for details). The human-bovine distribution differs most dramatically from that of the whole genome, as a full quarter of the 137 milk protein genes with single copies in these two genomes are very highly conserved, with a pairwise PID of 97.5% or greater.
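To make the quantity concrete, the sketch below computes a pairwise PID from two aligned sequences and averages it over all unordered pairs of taxa, which gives the 21 comparisons for seven taxa. It assumes PID is the percentage of identical residues over alignment columns in which neither sequence has a gap; the exact definition used in the study is not stated in this section, and the function names and toy sequences are hypothetical.

```python
from itertools import combinations

def percent_identity(seq_a, seq_b, gap="-"):
    """Percent identity over aligned columns where neither sequence has a gap."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    columns = [(a, b) for a, b in zip(seq_a, seq_b) if a != gap and b != gap]
    if not columns:
        return 0.0
    matches = sum(a == b for a, b in columns)
    return 100.0 * matches / len(columns)

def average_pairwise_pid(aligned):
    """Mean PID over every unordered pair of taxa in one protein alignment.

    aligned: dict mapping taxon name to its aligned sequence.
    """
    pairs = list(combinations(sorted(aligned), 2))
    pids = [percent_identity(aligned[s], aligned[t]) for s, t in pairs]
    return sum(pids) / len(pids)

# Toy alignment of one short hypothetical protein in three taxa.
toy_alignment = {
    "human": "MKV-L",
    "bovine": "MKI-L",
    "mouse": "MKVAL",
}
toy_average = average_pairwise_pid(toy_alignment)
n_pairs_seven_taxa = len(list(combinations(range(7), 2)))
```

In the toy alignment, human versus bovine and bovine versus mouse each share 3 of 4 ungapped columns (75%), and human versus mouse share all 4 (100%), so the average is about 83.3%.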
Figure 6 Average pairwise percent identities of milk and mammary genes across mammals. The distribution of average pairwise PID of amino acid sequences across the seven taxa - human, mouse, rat, bovine, dog, opossum, and platypus - is plotted for those ...
Of the average pairwise PID distributions of the mammary gene sets in Figure 6, all are significantly different from the genome-wide distribution. The means of their distributions also differ from the genome-wide mean. As a group, mammary genes of every developmental stage and condition appear to be more conserved across Mammalia, on average, than other genes in the genome.
To discover which milk proteins are most conserved in mammals, the average pairwise PIDs among the seven mammalian taxa were computed for all genes from the milk protein gene set with single copy orthologs in the manually curated set (see 'Curation of milk protein orthologs' in Materials and methods). The top 25 most conserved milk proteins across all seven mammals are listed in Table . These proteins have greater than 95% identity across mammals, some more than 99%, despite the fact that they have not shared a common ancestor for more than 160 million years. Based on the amino acid length and conservation, we can predict that these milk proteins have a small size with functions that depend on strictly conserved structure.
Highly conserved milk proteins
Nearly all of the highly conserved milk proteins (Table ) are found in the milk fat globule membrane proteome. GO analysis of these proteins yields four enriched terms: 'GTPase activity,' 'GTP binding,' 'small GTPase mediated signal transduction,' and 'intracellular protein transport.' Twelve of the proteins listed in Table are annotated with one or more of these GO terms. GTPases are known to be involved in the exocytotic pathway by which proteins are trafficked from the Golgi compartment to the plasma membrane. Further, GBB1 [GenBank:NM_175777], RAB11B [GenBank:NM_001035391], RAP1B [GenBank:NM_175824], YWHAB [GenBank:NM_174794], and RAB18 [GenBank:NM_001075499] listed in Table have previously been isolated in Golgi fractions from the mammary glands of pregnant and lactating rats [48]. An additional four milk proteins, SAR1A [GenBank:NM_001034521], SAR1B [GenBank:NM_001035315], RAB3A [GenBank:NM_174446], and RAB3C [GenBank:NM_001046606], are annotated with the GO term 'secretory pathway.' The finding that so many of these secretion-related proteins are associated with the milk fat globule membrane suggests they may also be involved in the highly specialized process by which the milk fat globule is secreted or that the exocytotic and lipid secretion pathways intersect at some point during the secretion process. Because the conserved proteins listed in Table are related to the generic molecular function of secretion, it seems highly likely that they facilitate the secretion of milk lipid.
Conservation of mammary genes relative to other genes in the genome suggests hypotheses about the evolution of milk production. First, conservation of mammary genes involved in all developmental stages supports the hypothesis that, at the genetic level, the basic biological transformation of the virgin gland through pregnancy, lactation, and involution is conserved among all mammals, and occurred by co-opting existing structures and developmental pathways. Second, many of the most highly conserved proteins found in milk are constituents of the milk fat globule membrane and are known to be part of the secretory process. High conservation of these genes between platypus, opossum, and the placental mammals indicates that molecular mechanisms of secretion were already in place 160 million years ago.
Divergent milk protein genes in mammals
Because the technique for ortholog detection relies on a minimum threshold of conservation, orthologs of many of the more divergent proteins could not be found in the platypus or opossum genomes. Therefore, to determine which proteins in milk are most divergent in mammals, average PIDs were computed across only the five placental mammals. The 25 most divergent milk proteins across placental mammals are presented in Table . These milk proteins are primarily secreted or cell-surface proteins with structures that are apparently not constrained by function relative to other proteins in milk. Four GO terms associated with these proteins are enriched: 'pattern binding,' 'response to other organism,' 'inflammatory response,' and 'extracellular space.'
Highly divergent milk proteins
The greatest inter-species divergence among milk protein sequences occurs with those proteins that are most abundant in milk (caseins, alpha-lactalbumin (LALBA)), those most abundant in plasma (fetuin, albumin), and with those contributing to immunity. The casein proteins are the most divergent of the milk proteins, with an average pairwise PID of only 44-55% across placental mammals. Nutritionally, the caseins provide the suckling neonate with a source of amino acids and with highly bioavailable calcium. Additionally, peptides derived from partially digested caseins have potential anti-microbial, immune-modulating, and other bioactive properties. The fact that the caseins are the most divergent of the milk proteins suggests that the nutritional and immunological functions of these proteins do not particularly constrain their amino acid sequence and structure.
The sequence divergence of LALBA is surprising given its essentiality to the synthesis of lactose, the primary source of digestible carbohydrate. LALBA encodes a protein that forms the regulatory subunit of the lactose synthase heterodimer. However, additional functions of LALBA have emerged. When human LALBA is partially unfolded and bound to oleic acid, it functions as an apoptotic factor that kills tumor cells and immature cells, but not healthy differentiated cells [49]. Thus, it is possible that this variant of LALBA protects the gut of the human neonate. Furthermore, the apoptotic capabilities of LALBA appear to be utilized in the regulation of involution of the mammary gland. A recent study suggests that Cape fur seals escape apoptosis and involution of the mammary gland during long foraging trips because they lack the LALBA protein [50]. While lactose synthesis may be a common essential function, it appears that it does not overly constrain the sequence divergence of LALBA. The sequence divergence of LALBA may rather be related to the potential of this protein to modulate species-specific strategies related to immune function and the regulation of the mammary gland.
The most divergent immune-related proteins in milk are products of the following genes: mucin 1, immunoglobulin IgM, polymeric-immunoglobulin receptor, peptidoglycan recognition protein, Toll-like receptor 2, Toll-like receptor 4, macrophage scavenger receptor types I and II, and chitinase-like protein 1. In milk, CD14 and TLR2 are present in soluble forms and may neutralize pathogens by binding to them as decoy receptors [13]. MUC1 prevents the binding of pathogenic bacteria to epithelial cells in vitro (R.L. Tellam, personal communication). Our finding that the most divergent milk protein genes are those that confer immunity presumably reflects a flexibility to confront a wide variety of pathogen challenges.
Evolution of milk and mammary genes along the bovine lineage
To investigate the selective constraints on the evolution of bovine milk and mammary genes, the ratio of the rate of non-synonymous substitutions per non-synonymous site (dN) to the rate of synonymous substitutions per synonymous site (dS) was estimated for proteins in each gene set using bovine genes and their putative orthologs in the human and mouse genomes (see 'Evolutionary analysis along the bovine lineage' in Materials and methods for details). The average dN/dS ratio of the genes from the milk protein and mammary gene sets (Table ) was significantly below the genome average (Mann-Whitney U test, P < 0.05), indicating that milk and mammary genes are subject to more stringent selective constraint than other genes in the bovine genome.
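The comparison described above can be sketched as a two-sided Mann-Whitney U test on per-gene dN/dS values. This is a minimal standard-library version that uses the normal approximation without tie or continuity corrections, so its P values will differ slightly from those of a statistics package. The function names and the example dN/dS values are hypothetical and are not data from this study.

```python
import math


def average_ranks(values):
    """Return 1-based ranks, assigning tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def mann_whitney_u(sample_a, sample_b):
    """Return (U for sample_a, z score, two-sided P) by normal approximation."""
    n_a, n_b = len(sample_a), len(sample_b)
    ranks = average_ranks(list(sample_a) + list(sample_b))
    rank_sum_a = sum(ranks[:n_a])
    u_a = rank_sum_a - n_a * (n_a + 1) / 2
    mean_u = n_a * n_b / 2
    sd_u = math.sqrt(n_a * n_b * (n_a + n_b + 1) / 12)
    z = (u_a - mean_u) / sd_u
    p_two_sided = math.erfc(abs(z) / math.sqrt(2))
    return u_a, z, p_two_sided


# Hypothetical dN/dS values for a gene set and a genome background.
gene_set_dnds = [0.05, 0.08, 0.10, 0.12, 0.15, 0.09]
background_dnds = [0.11, 0.20, 0.25, 0.18, 0.30, 0.22, 0.16]
u_stat, z_score, p_value = mann_whitney_u(gene_set_dnds, background_dnds)
```

A negative z score here means the first sample tends to have lower ranks, which in this setting would correspond to lower dN/dS and therefore stronger purifying selection.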
Milk and mammary gene average dN/dS
Given the taxonomic diversity of milk composition, we expected that the processes of lactation would be under stronger selective pressure than the genes that give rise to proteins in milk. However, the average dN/dS of the milk protein gene set was similar to that of the lactation mammary gene set (Table ). This result suggests that species-specific variation in milk composition is primarily due to mechanisms other than milk and mammary protein sequence variation.
Next, milk and mammary genes were evaluated for positive selection. A gene is inferred to be subject to positive selection when dN/dS is significantly greater than 1. Of the 6,530 genes from the milk protein and mammary gene sets, only two bovine genes with dN/dS > 1 were significant under the likelihood ratio test (see 'Evolutionary analysis along the bovine lineage' in Materials and methods): ADP-ribosyltransferase 4 (ART4) and prenylcysteine oxidase 1 (PCYOX1). The ART4 gene product, which has previously been reported to be subject to positive selection in cattle [51], is an erythrocyte protein that carries antigens for the Dombrock blood group. PCYOX1 encodes a protein that degrades a variety of prenylcysteines. Using RT-PCR to determine PCYOX1 mRNA levels in alveolar mammary tissue from virgin, prepartum, lactating, involuting and dried-off cows (Additional data file 22), we found that PCYOX1 is not differentially expressed in these tissues. The accelerated evolution of these genes may be unrelated to mammary biology.
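To make the significance criterion concrete, the following sketch shows how a likelihood ratio test is commonly evaluated for two nested codon models. It assumes a null model in which dN/dS on the bovine branch is fixed at 1 and an alternative in which it is estimated freely, so the models differ by one parameter and the statistic is compared with a chi-square distribution with one degree of freedom. That model pair, the function name, and the log-likelihood values are assumptions for illustration; the exact test used in this study is described in Materials and methods.

```python
import math


def likelihood_ratio_test(lnl_null, lnl_alt):
    """Return (statistic, P value) for nested models differing by one parameter.

    lnl_null and lnl_alt are maximized log-likelihoods of the null and
    alternative models. The P value is the upper tail of a chi-square
    distribution with 1 degree of freedom.
    """
    # Clamp at zero: small negative values can arise from optimizer noise.
    statistic = max(0.0, 2.0 * (lnl_alt - lnl_null))
    # For 1 d.f., P(X > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(statistic / 2.0))
    return statistic, p_value


# Hypothetical log-likelihoods for a single gene (not values from this study).
stat, p = likelihood_ratio_test(lnl_null=-2345.67, lnl_alt=-2341.02)
gene_is_significant = p < 0.05
```

When thousands of genes are screened in this way, a correction for multiple testing would normally be applied before calling any single gene significant.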
Two abundant milk protein genes, beta-casein (CSN2) and kappa-casein (CSN3), were among those with dN/dS > 1, but they were not statistically significant under the likelihood ratio test (see 'Evolutionary analysis along the bovine lineage' in Materials and methods). The requirement that the entire gene show statistical evidence of positive selection may be too stringent. Evidence of positive selection within the family Bovidae has previously been detected in a 34-codon region of CSN3. Further site-specific evolutionary analysis of the casein genes may be warranted.
Despite the domestication of cattle for milk production, breeding regimes have not caused the apparent accelerated evolution of even a single milk protein or member of the lactation mammary gene set. Furthermore, milk and mammary genes are undergoing stronger purifying selection than other genes in the bovine genome. It has previously been theorized that the evolution of the mammary gland has been subject to forces that maximize the survival of the mother-child pair [53]. Because all components in the milk are produced at the expense of the mother, it can be argued that few superfluous components would survive evolution. Our findings are consistent with this hypothesis. Genes encoding milk components and other genes expressed in the mammary gland were found to be under significant negative selection compared to the whole genome, highlighting the essentiality of milk in mammalian evolution.