This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
Rodent and primate pregnancy-specific glycoprotein (PSG) gene families have expanded independently from a common ancestor and are expressed virtually exclusively in placental trophoblasts. However, within each species, it is unknown whether multiple paralogs have been selected for diversification of function, or for increased dosage of monofunctional PSG. We analysed the evolution of the mouse PSG sequences, and compared them to rat, human and baboon PSGs to attempt to understand the evolution of this complex gene family.
Phylogenetic tree analyses indicate that the primate N domains and the rodent N1 domains exhibit a higher degree of conservation than that observed in a comparison of the mouse N1 and N2 domains, or mouse N1 and N3 domains. Compared to human and baboon PSG N domain exons, mouse and rat PSG N domain exons have undergone less sequence homogenisation. The high non-synonymous substitution rates observed in the CFG face of the mouse N1 domain, within a context of overall conservation, suggest divergence of function of mouse PSGs. The rat PSG family appears to have undergone less expansion than the mouse, and exhibits lower divergence rates and increased sequence homogenisation in the CFG face of the N1 domain. In contrast to most primate PSG N domains, rodent PSG N1 domains do not contain an RGD tri-peptide motif, but do contain RGD-like sequences, which are not conserved in rodent N2 and N3 domains.
Relative conservation of primate N domains and rodent N1 domains suggests that, despite independent gene family expansions and structural diversification, mouse and human PSGs retain conserved functions. Human PSG gene family expansion and homogenisation suggest that evolution occurred in a concerted manner that maintains similar functions of PSGs, whilst increasing gene dosage of the family as a whole. In the mouse, gene family expansion, coupled with local diversification of the CFG face, suggests selection both for increased gene dosage and diversification of function. Partial conservation of RGD and RGD-like tri-peptides in primate and rodent N and N1 domains, respectively, supports a role for these motifs in PSG function.
In tandemly repeated gene families, in which all members share a common function, there is a tendency for concerted evolution that is characterised by homogenisation of gene sequences. Classical examples include the histone and ribosomal RNA genes. In such cases the expansion of gene families is driven by selection for high expression. Concerted evolution is generally maintained by unequal crossover, intergenic gene conversion or other illegitimate recombination mechanisms [1,2]. Conversely, there are multigene families whose members encode diverse functions, e.g. genes encoding immunoglobulin (Ig), T cell receptor (TCR) and major histocompatibility complex (MHC) proteins. Such diversity occurs when there is less homogenisation than mutation, due to the evolution of specific programmed mutational mechanisms. In addition, more complex modes exist; for example, the immunoglobulin heavy-chain variable-region (VH) genes encode proteins with identical functions, but exhibit little concerted evolution. Instead, their evolution is governed by divergence and a birth-and-death process of gene duplication and dysfunctioning mutations.
Similar to other families of highly expressed trophoblast-specific genes such as the pregnancy-associated glycoproteins (PAG), the pregnancy-specific glycoproteins, which are the most abundant foetal proteins in the maternal bloodstream during human late pregnancy, are encoded by multiple tandemly arrayed genes [6,7]. The PSG family of glycoproteins, with the related CEA-related cell adhesion molecule (CEACAM) proteins, are part of the immunoglobulin superfamily. The Ig domain structure of the human and mouse PSGs differs, as follows: Human PSGs contain one V-like Ig domain (N), C2-like Ig domains (A and B) and relatively hydrophilic tails (C), with domain arrangements classified as type I (N-A1-A2-B2-C), type IIa (N-A1-B2-C), type IIb (N-A2-B2-C), type III (N-B2-C) and type IV (A1-B2-C). In contrast, mouse PSGs typically have three or more N domains followed by a single A domain [7,10]. The common ancestor of rodent and primate PSGs and CEACAMs was probably similar to CEACAM1, which is the only CEA family member with an identical gene structure in the human, rat and mouse that encodes all types of extracellular domains present in CEACAM and PSG proteins. The time of initial gene duplication is estimated at 90 Myr, approximately the time of rodent-primate divergence. The independent expansion of human and mouse PSG gene families occurred through further gene duplication and exon shuffling events [7,12,13].
The independent expansion of PSG gene families in rodents and primates indicates convergent evolution, implying that PSG function is conserved. These events can be interpreted in the context of evolutionary theories of parent-offspring and inter-sibling conflicts that promote transcriptional 'arms races' leading to high expression of trophoblast-specific genes that influence maternal investment in offspring [14,15]. In one scenario, duplicated PSG genes are selected because they increase effective PSG dosage, thereby enhancing an effect on maternal investment in offspring. In this context, it is noteworthy that human PSG N domains contain putative integrin-binding 'RGD' motifs that are proposed to mediate cell interactions with the extracellular matrix [16,17] and immune cells. Such PSG-mediated functions could potentially influence trophoblast invasion or maternal immune cell function. However, not all human, and none of the mouse, PSGs contain an RGD motif, suggesting that, if human RGD motifs are functionally significant, there has been diversification of function of some human, and all mouse, PSGs, relative to a putative RGD-containing ancestor. In the context of parent-offspring conflict, such divergence might reflect co-evolution of PSGs and their receptors, similar to the co-evolution of ligand/receptor pairs observed in host-pathogen interactions [19,20].
In this study, we sought to analyse PSG evolution to determine the extent and patterns of rodent and primate PSG sequence divergence by analysing intraspecific and interspecies DNA substitution rates in PSG coding regions. We also sought evidence in support of functionality of RGD and RGD-like tri-peptide motifs in PSG amino-terminal effector domains.
With the exception of mouse PSG24, PSG30 and PSG31 and human PSG2 and PSG5, all PSGs for which full length sequences are available have a structure based on four Ig-like domains and a leader sequence that is cleaved during post-translational processing. The only type of domain found in all rodent and primate PSGs is the N domain located at the amino terminus. Indeed, this domain is shared by all members of the extended CEA family, suggesting that it may contain important functional motifs. We sought to test this hypothesis with respect to PSG function, by analysing both full-length PSG sequences and selected domains of possible functional importance. Alignments of full-length 4-domain human and mouse PSG protein sequences were generated with ClustalX, followed by pairwise comparisons of all mouse sequences with all human sequences. Mean Dayhoff PAM250 log scores were calculated for each alignment position and grouped by domain. The scores within each of the four domains were then visualised using box and whisker plots (which show the median value, upper and lower quartiles plus range) (Fig. 1). The N domains exhibited significantly higher scores (p < 0.001) than the other three domains, with positive scores indicating conservation. There was no evidence of interspecies conservation of the other domains, which is unsurprising given the known lack of orthology between human A1 / mouse N2, human A2 / mouse N3, and human B2 / mouse A domain pairs.
Rat N1 domain exon sequences were identified in NCBI and Ensembl databases. Three novel rat PSG genes were identified and named PSG41, PSG42 and PSG43 in keeping with accepted nomenclature. We also identified a novel PSG40 splice variant with alternative leader and N1 domain exons, situated between the N1 and N2 domain exons of the published PSG40 sequence (NM_021677). Both BLAST and pattern matching methods retrieved the same rat PSG genes from different databases; therefore we considered our search to be exhaustive. All rat PSG genes were found to reside on contig NW_047556 and this was used for the prediction of remaining exons for each PSG gene based on BLAST generated alignments with mouse Psg gene sequences (Table 1). The CDS sequences of the novel predicted rat PSG genes and PSG40 splice variant are listed in additional file 1. We used our predicted sequences in preference to the publicly available sequences in our analyses.
Following the preliminary identification of amino-terminal N domain conservation, we planned to use an evolutionary tree building approach to further examine inter-domain relationships in rodent and primate PSGs. However, using split decomposition analysis, McLenachan et al., in their study of a subset of human PSGs, concluded that it is not possible to accurately determine branch points in an evolutionary tree of human PSGs. Split decomposition analysis identifies contradictory relationships within alignment data; for example, there may be a pattern grouping PSGX and PSGY together, and another pattern grouping PSGY and PSGZ together. This information is normally approximated when drawing evolutionary trees; however, split decomposition is a non-approximation method that permits the building of trees with support indicated for relationships based on all patterns in the data. Such analysis can therefore predict to a limited extent the occurrence of sequence homogenisation, e.g. by gene conversion or positive selection.
We performed split decomposition analysis on nucleotide sequences using the SplitsTree4 program on the individual domain exons of mouse Psg genes (Fig. 2). For a more complete analysis of N1 domains we also performed the analysis using rat N1 domain exons, all known human N1 domain exons and all known baboon N1 domain exons (Fig. 3). We detected no conflicting signals for mouse Psg N1 domain exons (Fig. 2A), in contrast to the human N domain exons (Fig. 3B). However, our results for human N1 domains (Fig. 3B) differ from those obtained by McLenachan et al. because we observed only two contradictions: i. regarding the relationship of PSG4 and PSG9 to each other, and to their nearest neighbours PSG3 and the common ancestor of PSG6 and PSG10; and ii. regarding the relationship of PSG2 to PSG1 and PSG11. This discrepancy is probably due to our inclusion of four extra PSG N1 domain sequences, and the fact that the PSG11 sequence (GenBank: M69025) used by McLenachan et al. has been updated.
Analysis of the mouse N2 domains indicates numerous contradictions in the alignments of the Psg24, Psg29, Psg30, Psg31 and Psg32 group (Fig. 2B). In contrast, the N3 domains exhibit no discernible conflicts (Fig. 2C). The A domain only showed contradiction within the Psg24, Psg29, Psg30, Psg31 and Psg32 group (Fig. 2D). Examination of the rat PSG N1 domain exon alignments demonstrated minor contradictions between the common ancestor of PSG36, PSG37 and PSG39 and that of PSG38 and PSG41 (Fig. 3A). In contrast to all the other PSG N1 domains thus compared, the baboon PSGs demonstrate considerable conflicting signals as demonstrated by the 'spider's web' appearance of the SplitsTree graph (Fig. 3C).
Few examples of orthologous relationships between PSG sequences have been identified. In order to compare the relationship between rodent and primate amino-terminal N domain exon coding sequences, an NJ tree was produced (Fig. 4). The tree was generated from ClustalX alignments of nucleotide sequences, with bootstrapping 1000 times to test the reliability of branches. The human and baboon N sequences formed one distinct cluster, the mouse and rat N1 sequences formed a second, the mouse N2 domains formed a third and the mouse N3 domains formed a fourth. Of particular interest was the split between the ancestral N-type domain and the common ancestor of the N2 and N3 domains. The confidence of this split was 93% and demonstrates that the mouse N1 domains are more closely related to primate N domains than to the mouse N2 and N3 domains. A similar comparison of the entire set of mouse and human PSG domains confirmed that the interspecific N domain clustering is unique because the human PSG A1 and A2 domains segregated into distinct branches (sharing a common ancestor with the mouse A domains) and the B2 domains cluster on a distinct branch (Fig. 5).
Mouse and rat PSG gene coding sequences were analysed using an NJ plot which highlighted four putative orthologous relationships, as follows: rat PSG36 and mouse Psg24; rat PSG40 and mouse Psg29; rat PSG42 and mouse Psg32; rat PSG38 and mouse Psg16 (Fig. 6). There is also distinct branching of rat PSG43 with mouse Psg30 and Psg31. The orthologous relationship is also supported for PSG36 and Psg24 because both contain five N domains.
The crystal structure of mouse CEACAM1 (soluble murine sCEACAM1a [1,4]) has been resolved. Comparison of the mouse PSG N1 domains identifies the predicted β-sheet-forming CFG β-strands as the most variable regions of the N domains (Fig. 7A). The CFG face of CEACAM N domains has been shown to interact with pathogens and mammalian proteins (Fig. 7B). Within Box 1 and Box 2, there is considerable variation between mouse N1 domains, which is illustrated quantitatively using Dayhoff charts (Figs. 8 – 10). Positive Dayhoff scores and generally low standard deviations indicate good conservation of mouse PSG N1 domains (Fig. 8), and even stronger conservation of human PSG N domains (Fig. 9). The latter may be explained by homogenisation of human PSG gene sequences. Dayhoff score analysis using comparisons of all mouse N1 domain versus all human N domain ClustalX aligned sequences gives an indication, at the amino acid level, of the general pattern of evolution of these domains since the rodent / primate divergence (Fig. 10). Again, the majority of residues exhibit good conservation, and relatively little variability is observed between pair-wise comparisons, particularly with regard to residues that are involved in protein folding. The reduction in size of Box 2 in Fig. 8 and Fig. 10 is explained by deletions of mouse DNA sequences, requiring exclusion of the corresponding amino acids from the analysis.
To gain further insight into mouse Psg N domain exon evolution, the N1, N2 and N3 domain exons of mouse Psg genes (mN1, mN2 and mN3, respectively), the N1 domain exons of rat PSG genes (rN1) and the N domain exons of human PSG genes (hN) were analysed in the following comparisons: mN1 vs mN2; mN1 vs mN3; mN2 vs mN3; mN1 vs rN1; mN1 vs hN. Synonymous (ds) and non-synonymous (dn) substitutions per synonymous and non-synonymous site, respectively, were determined in each case for all combinations of PSG gene pairwise comparisons, and box and whisker plots were generated from the data (Fig. 11). The majority of data points derived from individual comparisons lie under the 45° line of equivalence where dn = ds, and most variation in the comparisons lies within the values of ds (Fig. 11A). When the data are presented as box and whisker plots, the values are indicative of conservation, with median values ranging from 0.48 – 0.70 (Fig. 11B). The higher values for median dn/ds in the mN1 vs rN1 comparison appear to be the result of a tighter ds distribution as observed in Fig. 11A, with values not exceeding one substitution per synonymous site in any pairwise comparison.
In view of the sequence variations in the CFG face, which are visible in alignments (Fig. 7A), against a background of overall conservation, as estimated from dn/ds analysis, we sought to determine whether the dn/ds values were higher in the CFG face than the ABED face of the N1 domain. Nucleotide sequence alignments were generated using all mouse Psg N1 domain exons (based on protein alignments), and the nucleotides present in the three sections comprising the CFG face (Boxes 1, 2 & 3; Fig. 7A) were separated from those comprising the ABED face. The two new sets of data were analysed individually to determine mean dn and ds values from pairwise comparisons of all sequences within each dataset (Fig. 12). A plot of dn vs ds for the ABED face of the mouse N1, N2 and N3 domains (Fig. 12A) demonstrates a distribution of pairwise-alignment data points which overwhelmingly lie below the line of equivalence. However, a similar plot generated from analysis of the CFG face has data points distributed approximately equally on both sides of the line of equivalence (Fig. 12B). This is due predominantly to a higher number of non-synonymous substitutions. The values of dn/ds obtained for the CFG face in the N1, N2 and N3 domains of the mouse and the N1 domain of the rat are all significantly greater than the values obtained for the ABED face (p < 0.0001, Fig. 12C). The dn/ds values obtained for the mouse N1, N2 and N3 domain CFG faces equal or exceed 1.0, with the highest median value of 1.1 observed in the N1 domain. The rat N1 domains are more conserved, with dn/ds values derived from both the CFG and ABED faces under 1.0 on average.
Within Box 3 of the CFG face (Fig. 7A) there is evidence of conservation of putative integrin-interacting RGD-like motifs in the mouse N1 domain, which may have functional significance. To investigate this possibility further, a survey of all mouse, rat, baboon and human PSG RGD, and related, motifs was compiled (Fig. 13). Extant primate and rodent PSG RGD-like motifs are linked in sequence space by an RGD motif encoded by the sequence CGA GGA GAT which, incidentally, is not observed in any of the extant PSG coding sequences. The most commonly observed motif, RGD, is encoded by CGA GGT GAT, and the majority of variants are closely related to this sequence. In rodents, RGE and HGE are the most commonly observed motifs. However, the NGK motif, which is not an RGD-like motif as we have defined it, is well represented, and is separated in sequence space from HGE by a transition and a transversion.
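The notion of distance in sequence space used above can be illustrated with a minimal Python sketch that counts transitions (purine to purine, or pyrimidine to pyrimidine) and transversions (purine to pyrimidine, or the reverse) between two aligned coding sequences. The function name `classify_substitutions` is illustrative only and is not part of the analysis software used in this study; the two RGD-encoding sequences are those quoted in the text.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def classify_substitutions(seq_a, seq_b):
    """Count (transitions, transversions) between two equal-length DNA strings.

    Spaces between codons are ignored. This helper is a hypothetical
    illustration, not the authors' program.
    """
    a = seq_a.replace(" ", "").upper()
    b = seq_b.replace(" ", "").upper()
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    transitions = transversions = 0
    for x, y in zip(a, b):
        if x == y:
            continue
        # A change within one chemical class is a transition.
        if {x, y} <= PURINES or {x, y} <= PYRIMIDINES:
            transitions += 1
        else:
            transversions += 1
    return transitions, transversions


# The linking RGD sequence versus the most commonly observed RGD sequence.
linking_vs_common = classify_substitutions("CGA GGA GAT", "CGA GGT GAT")
```

Under these definitions the two RGD-encoding sequences differ at a single third-codon position (A versus T), which is a transversion that leaves the encoded glycine unchanged.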
Of the seventeen aligned mouse PSG N1 domain exon sequences, 53% possess a tri-peptide at the site of the RGD-like motif belonging to the RGD-like 5-1-4 tri-group (as defined in the Methods section). For comparative purposes, tri-groups were determined for tri-peptide motifs at fifty random positions within the alignment. The number of most commonly represented tri-groups at each position was expressed as a percentage of the number of aligned sequences, and the mean and standard deviation were determined to indicate the mean maximal tri-group representation for the 50 random alignment positions. The control value obtained was 67.6 ± 22.9%; the value of 53% of 5-1-4 tri-groups at the RGD site therefore lies within the control range, albeit 14.6% below the mean value. However, a more revealing statistic is derived from aligning the mouse N1 domains with the mouse N2 and N3 domains (see additional file 2), compared to aligning the mouse N1 domains with the human N domains. In the former comparison (mouse N1 vs N2 and N3 domains) the most commonly represented tri-group is 4-2-5, with 27% representation. This tri-group is not RGD-like and its representation is lower than the mean maximal tri-group representation of 49.8 ± 22.7% determined for fifty random alignment positions. However, when the mouse N1 domain is aligned with the human N domain, the most commonly represented tri-group is the RGD-like 5-1-4 group which has 59% representation, comparable to the mean maximal tri-group representation of 60.7 ± 20.4%.
We recently collated the full-length coding sequences of the entire mouse Psg gene family. In the present study we aimed to identify evolutionary signals embedded in Psg gene and PSG protein sequences to determine whether PSG protein function has diverged between the rodent and primate lineages, and to attempt to understand the reasons for the independent expansions of rodent and primate PSG gene families.
Mouse and human PSG protein amino-terminal N domains exhibit different patterns of evolution. McLenachan et al. analysed the evolution of a subset of human PSGs using split decomposition analysis and found, in individual comparisons of N, A1, B2 and C domain exons, strong contradictions in alignments, which they suggested were due to gene conversion and/or positive selection. Our similar analysis of an expanded set of human PSG sequences revealed a detectable, but less marked, degree of homogenisation. Analysis of mouse N and A domain exons showed that, in general, there is less evidence of purifying selection compared to the human, although there are examples of gene conversions as described previously for the closely related Psg21 and Psg23 genes. Detailed analysis of alignments using plots of Dayhoff scores confirmed the difference between mouse and human N domain evolution.
Using dn/ds analysis for interspecies comparisons, we found that the PSG protein amino-terminal N and N1 domains are relatively conserved, consistent with conservation of function in rodents and primates. However, inspection of mouse PSG N1 domain alignments, and scrutiny of corresponding Dayhoff scores, revealed regions of apparently poor conservation. These regions correspond to the CFG face within the N1 domain of CEACAM1. In the CEACAM family, the CFG face interacts with pathogens and mammalian proteins. Comparisons of dn/ds values obtained from the CFG and ABED faces of mouse N1, N2 and N3 domains confirmed that the CFG face has evolved more rapidly than the ABED face in all three domains. The greatest effect was observed in the N1 domain exon with a doubling of the dn/ds ratio in the CFG face compared with the ABED face. The dn/ds ratio of 1.1 suggests weak positive selection on the CFG face of the N1 domain. The increase in the dn/ds ratio appears to be mainly due to an increase in the dn value, indicative of diversification. The high dn/ds values for the CFG face in the N2 and N3 domains, which are not known to interact with ligands, could be due to a low contribution of these sequences to the structural integrity of the IgV-like domain.
Interestingly, the rat N1 domain CFG face does not appear to have evolved as rapidly as the mouse N1 domain, with a dn/ds ratio of 0.9. This observation, combined with the relatively smaller number of PSG genes identified in the rat (eight to date, compared to seventeen in the mouse) and the higher level of gene homogenisation implied by split decomposition analysis, suggests that the rat PSG gene family has not expanded or diversified as extensively as the mouse. However, we cannot exclude the possibility that further rat PSG genes may yet be identified because there may be under-representation in the WGS database. Notwithstanding this possibility, there has clearly been ongoing turnover of the PSG gene family in all of the lineages analysed, as there are no known human orthologues of rat and mouse PSGs, and only four potential orthologous relationships between known rat and mouse PSGs.
These findings suggest partial conservation of PSG N domain function across rodent and primate lineages. However, the relaxed constraint on the CFG face of mouse PSGs suggests diversification of binding partners or modification of existing ligand-binding kinetics, analogous to the CEACAMs. This observation receives experimental support from the recent observation that treatment of mouse macrophages in vitro with recombinant mouse PSG17N, or human PSG1 or PSG11, induces cytokine expression; however, only in the case of mouse PSG17N does this depend on CD9 receptor expression. Divergence of PSG function is also suggested by differences in the level and developmental timing of expression of different mouse PSGs [7,12], expansion of N domain number in PSG24, PSG30 and PSG31, and loss of secretory signals in PSG32 and in the brain-specific splice variant of PSG16.
As noted above, the only PSG receptor identified to date is the integrin-associated tetraspanin, CD9, which binds the N1 domain of mouse PSG17 but not, apparently, human PSGs. However, a peptide containing the RGD motif from the human PSG9 N domain binds to a receptor on a promonocytic cell line, suggesting that some human PSGs may effect their functions through an integrin-type receptor. In this context, the high frequency of the RGD motif on an exposed loop in primate PSG N domains (seven of ten in human and five of fifteen in baboon) may be significant. Rodent PSG N1 domains do not have an RGD motif, but have a high frequency of the RGD-like motifs RGE, HGE and HAE on the CFG face. Under the null hypothesis that these motifs are unlikely to underpin structural integrity of the N1 domain and are therefore free of constraint, our analysis reveals evidence of unexpected conservation of RGD-like motifs in the N1 domain, which have been lost in the N2 and N3 domains. Given the high transition and transversion rates in the N1 domain and the fact that the mouse N1, N2 and N3 domains share a common ancestor after the divergence of the rodent / primate lineages, the conservation of RGD-like motifs exclusively in the N1 domain may have functional significance. We note that the RGE motif in the context of the POEM protein induced apoptosis of MC3T3-E1 cells in vitro. We speculate that certain RGE or RGE-like motifs may elicit weak cell attachment, followed by apoptosis – a combination of properties, reminiscent of snake venom disintegrins [30,31], that could have important functional implications in the context of the extensive tissue remodelling that occurs during placentation.
In summary, our data are consistent with experimental evidence indicating functional convergence of rodent and primate PSGs, in spite of the independent expansions of the gene families in the two lineages. In the context of parent-offspring conflict, the homogenisation of human PSG sequences is consistent with the theory that placental hormones encoded by multigene families are monofunctional and selected for high expression, possibly due to coevolution with physiologically conflicting maternal mechanisms. However, the evidence for positive selection on the CFG face of the N1 domain implies divergent evolution of rodent PSGs. Allied to the evidence for functionality of putative integrin-interacting RGD-like motifs in rodents, a scenario can be envisaged whereby the different RGD-like motifs observed in human and baboon PSGs also suggest some degree of functional divergence in these species.
Our analysis provides evidence for conservation of rodent and primate PSG amino-terminal N domains, with ongoing independent expansion of the gene families in the two lineages. There has been some diversification of the CFG face of mouse N1 domains, a region that includes putative integrin-interacting RGD-like motifs. Our analysis provides reassurance that the mouse Psg gene family is a suitable model system for the analysis of human PSG gene function.
Perl programs were written to perform most general sequence manipulations and iterative tasks, and were executed under ActivePerl v5.8.3 on a Windows 2000 (Microsoft) platform.
BLAST searches of the NCBI and Ensembl RGSC3.1 rat genome databases were performed using coding sequences from known rat PSGs (PSG36-PSG40) and mouse PSGs. Additionally, a search pattern was developed and used to interrogate the Rattus_norvegicus.RGSC3.1.nov.dna_rm.contig.fa.gz archive obtained from the Ensembl FTP resource. The search pattern was derived manually from alignments of amino acid sequences from the N domain exon of all known mouse and rat PSGs (mouse PSG16-PSG32 and rat PSG36-PSG40) generated using the ClustalX 1.81 Windows interface. In PROSITE format the search pattern used was S-x-R-E-x(5)-G-x(3)-[IL]-x(3)-T-x(2)-D-x(3)-Y-x(17,18)-L-x-V. Analysis was performed essentially as described, with the program modified to search for the selected pattern in peptides of fifty amino acids or greater derived from genomic DNA sequences translated in all six reading frames. ClustalX alignments were produced using the complete open reading frames returned by the program combined with the N1 domains of rat PSG36-PSG40. The alignments were trimmed to include only N1 domain exon sequence and a Neighbour-Joining tree was generated using MEGA version 2.1 software to aid the identification of the new sequences.
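The pattern search described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the Perl program used in this study: the function names (`prosite_to_regex`, `six_frame_peptides`, `find_pattern`) are hypothetical, the standard genetic code is assumed, the pattern is written with a hyphen between x(2) and D, and a "peptide" is taken to be any stretch between stop codons in one of the six reading frames.

```python
import re
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (assumption).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")

PATTERN = "S-x-R-E-x(5)-G-x(3)-[IL]-x(3)-T-x(2)-D-x(3)-Y-x(17,18)-L-x-V"


def prosite_to_regex(pattern):
    """Convert a simple PROSITE pattern (x, residues, [..], (n), (m,n)) to a regex."""
    parts = []
    for element in pattern.split("-"):
        m = re.fullmatch(r"(x|[A-Z]|\[[A-Z]+\])(?:\((\d+)(?:,(\d+))?\))?", element)
        if m is None:
            raise ValueError(f"unsupported PROSITE element: {element!r}")
        core, low, high = m.groups()
        core = "." if core == "x" else core
        if low is None:
            parts.append(core)
        elif high is None:
            parts.append(f"{core}{{{low}}}")
        else:
            parts.append(f"{core}{{{low},{high}}}")
    return "".join(parts)


def reverse_complement(dna):
    return dna.upper().translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons only; unknown codons become 'X'."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))


def six_frame_peptides(dna, min_length=50):
    """Yield stop-free peptides of at least min_length from all six frames."""
    dna = dna.upper()
    for strand in (dna, reverse_complement(dna)):
        for frame in range(3):
            for peptide in translate(strand[frame:]).split("*"):
                if len(peptide) >= min_length:
                    yield peptide


def find_pattern(dna, pattern=PATTERN, min_length=50):
    """Return the six-frame peptides that contain the PROSITE pattern."""
    regex = re.compile(prosite_to_regex(pattern))
    return [p for p in six_frame_peptides(dna, min_length) if regex.search(p)]
```

In use, each contig sequence read from the genome archive would be passed to `find_pattern`, and the returned peptides would then be aligned with known N1 domains for inspection.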
Mouse PSG sequences were obtained from McLellan et al., rat PSG sequences were obtained as described above, human PSG sequences were obtained by name searches at the NCBI Entrez (nucleotide or protein options) database and baboon N1 domain sequences were obtained as described. To generate protein alignments for examination by eye, a Web based ClustalW utility was used; otherwise protein sequences were aligned with ClustalX using the default parameters. Nucleotide alignments were generated based on ClustalX protein alignments, such that where a single dash was placed in the amino acid alignment, three dashes were placed in the equivalent codon position in the nucleotide alignment. The nucleotide alignments were then analysed using SplitsTree version 4b software, and NJ trees were generated from the data (with bootstrapping 1000 times to test the reliability of branches). Individual domains of the mouse PSGs were also analysed by the split decomposition method using the same software. During NJ or SplitsTree tree-building, the Jukes-Cantor correction for multiple hits was applied and positions with gaps were ignored.
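The conversion from a protein alignment to a codon-based nucleotide alignment can be sketched as follows. This is a minimal Python illustration, assuming that each unaligned coding sequence starts at the first codon of its aligned protein; the function names `thread_codons` and `drop_gapped_codons` are hypothetical and do not refer to the original Perl programs.

```python
def thread_codons(aligned_protein, cds):
    """Place '---' in the coding sequence wherever the aligned protein has '-'."""
    residues = sum(1 for c in aligned_protein if c != "-")
    if len(cds) < 3 * residues:
        raise ValueError("coding sequence is shorter than the aligned protein")
    out, pos = [], 0
    for c in aligned_protein:
        if c == "-":
            out.append("---")          # one amino acid gap = three nucleotide gaps
        else:
            out.append(cds[pos:pos + 3])
            pos += 3
    return "".join(out)


def drop_gapped_codons(aligned_cds):
    """Remove every codon column in which any sequence contains a gap."""
    length = len(aligned_cds[0])
    keep = [i for i in range(0, length, 3)
            if all("-" not in s[i:i + 3] for s in aligned_cds)]
    return ["".join(s[i:i + 3] for i in keep) for s in aligned_cds]
```

For example, threading the coding sequence ATGAAA onto the aligned protein M-K gives ATG---AAA, and removing gapped codon columns from that sequence together with ATGCCCAAA leaves ATGAAA for both.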
Multiple alignments of either one set (e.g. all mouse PSG N1 domain exons only) or two sets (e.g. all mouse PSG N1 and N2 domain exons) of amino acid sequences were produced using ClustalX. A Perl program was written to perform the subsequent analysis. At each position of the alignment, the Dayhoff PAM250 log score was determined for pairwise comparisons of each sequence in the set against all the others in the set in one-set analyses, or of all set 1 sequences against all set 2 sequences in two-set analyses. The mean and standard deviation of scores obtained for the pairwise comparisons at each site were determined to give an indication of the general level of conservation and variability at the site. Sites where gaps were present in any of the sequences were not analysed. Where full-length mouse and human PSG amino acid sequences were compared, the scores were split into five groups at domain junctions and a box and whisker plot produced.
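The per-site scoring described above can be sketched in Python as follows. This is an illustration under assumptions rather than the original Perl program: the substitution scores are supplied by the caller as a function (a PAM250 lookup would be used in practice, whereas the toy function shown here is not PAM250), the population standard deviation is used because the text does not specify which form was applied, and the names `column_scores` and `toy_score` are hypothetical.

```python
from itertools import combinations, product
from statistics import mean, pstdev


def column_scores(set1, set2=None, *, score):
    """Return one (mean, sd) tuple per alignment column, or None if gapped.

    One-set analysis: every sequence in set1 against all the others.
    Two-set analysis: every set1 sequence against every set2 sequence.
    A one-set analysis needs at least two sequences.
    """
    all_seqs = list(set1) + list(set2 or [])
    length = len(all_seqs[0])
    if any(len(s) != length for s in all_seqs):
        raise ValueError("aligned sequences must have equal length")
    results = []
    for i in range(length):
        if any(s[i] == "-" for s in all_seqs):
            results.append(None)       # sites with gaps are not analysed
            continue
        if set2 is None:
            pairs = combinations([s[i] for s in set1], 2)
        else:
            pairs = product([s[i] for s in set1], [s[i] for s in set2])
        values = [score(a, b) for a, b in pairs]
        results.append((mean(values), pstdev(values)))
    return results


def toy_score(a, b):
    """Placeholder scores for demonstration only (NOT the PAM250 matrix)."""
    return 2 if a == b else -1


one_set = column_scores(["AC-", "AG-", "AC-"], score=toy_score)
two_set = column_scores(["AC"], ["AG", "AC"], score=toy_score)
```

Passing the scoring function as a parameter keeps the sketch independent of any particular matrix file format.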
ClustalX was used to produce multiple alignments of either one set of amino acid sequences (e.g. all mouse PSG N1 domain exons only) or two sets combined (e.g. all mouse PSG N1 and N2 domain exons). These alignments were used to inform the alignment of corresponding nucleotide sequences as described above. Values of ds and dn were determined for pairwise comparisons of each sequence in a set against all the others in the set for one-set analysis, or of all set 1 sequences against all set 2 sequences for two-set analysis. The analysis was performed according to the method of Yang and Nielsen using the 'YN00' program in the PAML3.14 software package. Before each pairwise comparison was executed, pairs of aligned sequences were extracted from the alignment file, placed in a Phylip format file and gapped positions were removed. Plots of dn vs ds, and box and whisker plots of dn/ds were produced in order to visualise the data. Where statistical significance was evaluated, the Mann-Whitney test was applied.
A Perl program was written to analyse ClustalX alignments of mouse and human PSG N domain exons. These alignments were inspected and modified where necessary. For a tri-peptide at a given position within an alignment, a tri-group code was generated based on amino acid properties of the residues in the motif, where group 1 contains G, A, S, T; group 2: V, L, I, M; group 3: F, Y, W; group 4: D, N, E, Q; group 5: H, K, R; group 6: P; group 7: C. For example, an RGD tri-peptide motif is represented by tri-group code 5-1-4 as arginine is in group 5, glycine is in group 1, and aspartate is in group 4. Conversely, tri-group 5-1-4 is 'RGD-like' in terms of the biochemical properties of the constituent amino acids. The number of sequences in the alignment containing each group code at a given position was determined. The most highly represented group code in the alignment at that position was used in the analysis. The program was designed to compare a user-selected tri-peptide motif position with fifty randomly selected tri-peptide motif positions.
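The tri-group coding can be sketched in Python as follows, using the seven residue groups defined above. The function names `tri_group` and `max_representation` are hypothetical, and the handling of gapped tri-peptides (skipped, but still counted in the denominator) is an assumption, since the original Perl program is not reproduced here.

```python
from collections import Counter

# Residue groups as defined in the text.
GROUPS = {1: "GAST", 2: "VLIM", 3: "FYW", 4: "DNEQ", 5: "HKR", 6: "P", 7: "C"}
RESIDUE_TO_GROUP = {aa: g for g, members in GROUPS.items() for aa in members}


def tri_group(tripeptide):
    """Return the tri-group code of a tri-peptide, e.g. 'RGD' -> '5-1-4'."""
    if len(tripeptide) != 3:
        raise ValueError("a tri-peptide must have exactly three residues")
    return "-".join(str(RESIDUE_TO_GROUP[aa]) for aa in tripeptide.upper())


def max_representation(alignment, start):
    """Most common tri-group at a position and its percentage of all sequences."""
    codes = Counter()
    for seq in alignment:
        tri = seq[start:start + 3].upper()
        # Assumption: tri-peptides containing gaps or unknown residues are skipped.
        if len(tri) == 3 and all(aa in RESIDUE_TO_GROUP for aa in tri):
            codes[tri_group(tri)] += 1
    if not codes:
        return None, 0.0
    code, count = codes.most_common(1)[0]
    return code, 100.0 * count / len(alignment)
```

With this coding, RGD, RGE, HGE and HAE all map to 5-1-4, whereas NGK maps to 4-1-5, in agreement with the classification of NGK as not RGD-like. Repeating `max_representation` at fifty randomly chosen start positions and taking the mean and standard deviation of the percentages would give a control value of the kind reported in the Results.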
A. McLellan performed data collection and analysis and co-wrote the manuscript. W. Zimmermann and T. Moore co-conceived the project and co-wrote the manuscript.
An ASCII text file containing the CDS sequences of novel predicted rat PSG41, PSG42 and PSG43 and a novel splice variant of PSG40.
A rich text format file containing the Clustal W amino acid sequence multialignment of PSG N1, N2 and N3 domains. The RGD-like motif is boxed for comparison between domains.
We thank two anonymous referees for helpful comments. This work was supported by the Irish Higher Education Authority Program for Research in Third Level Institutions funded under the National Development Plan, and an Irish Health Research Board / Wellcome Trust 'New Blood' Research Fellowship to T. Moore.