Sets of imprinted genes selected for analysis
The Imprinted Gene Catalogue (IGC) [1] reports imprinted genes in various species, including human and mouse. We gathered information from the IGC on 62 genes for which solid experimental data on allele-specific expression were available [see Additional file 1] (see Methods for the exclusion criteria). Of these 62 genes, only 30 had been analysed in both human and mouse; 26 of them are imprinted in both species. For one of the 30 genes imprinting was confirmed only in human, and for 3 genes only in mouse. Thus, 87% of the genes analysed in both species showed conservation of imprinting. For a further 23 imprinted genes the imprinting status had been analysed only in mouse, and for a further 9 genes only in human.
Tissue-specific expression patterns of imprinted genes
Using a publicly available gene expression dataset derived from microarray hybridisations, we asked whether imprinted genes form a subset of genes with a characteristic expression pattern in human and mouse.
The raw microarray data were preprocessed and normalized as described in the Methods section. We confined the analysis to genes that were present on the respective expression arrays (GNF1M for mouse and HG-U133A for human) and that had a confirmed imprinting status [see Additional file 1] in at least one species. In human, 29 imprinted genes met these criteria (of 35 genes with a confirmed imprinting status), and in mouse 43 (of 52 with a confirmed imprinting status). This list also includes genes reported to be imprinted in some tissues but not in others. Because information on tissue-specific imprinting is available for only some imprinted genes, tissue-specific imprinting cannot be taken into account in an unsupervised genome-wide approach. The array data do not distinguish expression of the two parental alleles, so our analysis could address only the overall expression level. In the analysis of human gene expression we included 21 postnatal tissues and placental tissue. For the mouse, data sets for oocyte, fertilised egg and five embryonic stages were included. Unfortunately, such data were not available for human.
We first analysed the expression profiles of human and mouse separately. The aim was to identify tissues whose expression profiles of imprinted genes differ considerably from those of the other samples. We performed biclustering (Euclidean distance and average linkage). The resulting biclustered expression matrices from the separate human and mouse expression analyses are shown in figure and .
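The clustering step can be illustrated with a minimal Python sketch of agglomerative clustering with Euclidean distance and average linkage. This is not the code used in the study: the function names, tissue labels and expression values are hypothetical, and only one dimension (tissues) is clustered here. Applying the same function to the transposed matrix would give the gene tree, which is how a biclustered matrix is obtained.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two expression profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def average_linkage(profiles):
    """Merge clusters bottom-up; return a list of (cluster_a, cluster_b, distance)."""
    clusters = [(name,) for name in profiles]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Average linkage: mean of all pairwise member distances.
                d = sum(euclidean(profiles[a], profiles[b])
                        for a in clusters[i] for b in clusters[j])
                d /= len(clusters[i]) * len(clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges

# Hypothetical relative expression of three genes in four tissues.
toy = {
    "placenta": [9.0, 8.5, 1.0],
    "pancreas": [2.0, 2.5, 7.0],
    "liver":    [2.1, 2.4, 1.2],
    "kidney":   [2.0, 2.6, 1.0],
}
merges = average_linkage(toy)
for left, right, dist in merges:
    print(left, right, round(dist, 2))
```

The last merge in the list corresponds to the 1st split of the tree, so a tissue that joins last (here the toy placenta profile) forms the most clearly separated branch.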
Figure 1 Relative normalized expression levels of imprinted genes in human and mouse. The figure summarizes the average (of normalized expression levels) of gene specific probes (annotated with the same gene name after normalization) of repeated array experiments.
For human, placenta forms a clearly separate branch (1st split) in the clustered tree of tissues (Figure ). At the 2nd split pancreas branches off (strongly influenced by insulin), followed at the 3rd split by a branch formed by pituitary, ovary and adrenal gland. In the clustered tree of imprinted genes, GNAS separates at the 1st split, followed by DLK1 at the 2nd split. The expression of both genes differs significantly from that of the other genes in all tissues (p value < 10^-12 for all tissues). While GNAS is highly expressed in all tissues, DLK1 is significantly over-expressed in placenta, adrenal gland, ovary and pituitary, and this high expression is relevant for the clustering of these tissues (see above). At the 3rd split, the remaining imprinted genes branch into two large clusters (Figure ). The first cluster, consisting of INS, KCNQ1, SLC22A18, NDN, NNAT, SNRPN, PEG10, CDKN1C, MEST, ATP10A, GRB10, IGF2, SGCE, PEG3 and ZIM2, comprises genes expressed at a largely median level. Among these, INS stands out because it is markedly over-expressed only in pancreas, the major insulin-producing organ of the body. IGF2, PEG10 and CDKN1C are strongly over-expressed in placenta. The second cluster consists of PHLDA2, PLAGL1, PPP1R9A, DIRAS3, L3MBTL, WT1, HYMAI, DLX5, MKRN3, TP73, UBE3A and MAGEL2; all of these genes are under-expressed relative to the tissue expression median. PHLDA2 and PLAGL1 are clearly downregulated in almost all tissues but strongly over-expressed in placenta. The genes that contributed most to the specific expression patterns of placenta and pancreas were those that were strongly up- or downregulated compared to their expression in other tissues. The placental expression pattern was dominated by DLK1, PHLDA2, CDKN1C, MEST, PEG10 and IGF2. For pancreas, INS, DLK1, SNRPN, MEST, KCNQ1 and HYMAI were the most prominent genes.
Finally, for the 3rd split, which produced a cluster consisting of adrenal gland, ovary and pituitary, we applied random forest analysis to determine which genes contributed most to the formation of that cluster. These were DLK1 (mean squared error, MSE: 5.36%), PPP1R9A (4.61%), HYMAI (3.45%), PEG3 (2.3%), MEST (2.13%), ATP10A (2.09%) and ZIM2 (1.99%). Applying a random forest analysis to the same tissues in mouse identified Dlk1 (4.84%), Sgce (4.79%), Kcnq1 (2.11%), Phlda2 (1.91%), Gtl2 (1.76%), Inpp5f (1.41%) and Usp29 (1.41%) as major contributors.
In mouse tissues the clustering shows some similarities to human but also clearly distinct features. Pituitary and brain tissues branch off together at the 1st split. This branching is predominantly caused by seven genes, which we identified by random forest analysis: Gtl2 (7.42%), Rasgrf1 (6.79%), Nap1l5 (6.57%), Impact (6.06%), Inpp5f (4.78%), Mirg (4.27%) and Rian (4.08%). Applying a random forest analysis to the same brain tissues in human yields the following genes: PEG3 (7.44%), PEG10 (7.07%), ZIM2 (6.34%), SNRPN (5.49%), NNAT (5.12%), SLC22A18 (3.44%) and PPP1R9A (2.81%). At the 2nd split embryonic tissues separate from the adult ones (Figure ). The highest-scoring genes for this cluster in a random forest analysis were Igf2 (11.03%), H19 (10.50%), Grb10 (7.96%), Cdkn1c (7.34%), Slc38a4 (4.79%), Plagl1 (2.96%) and Peg3 (2.63%). Most notably, Igf2, H19 and Cdkn1c, which dominate this branch, lie in the Beckwith-Wiedemann syndrome (BWS) region and, together with Phlda2, are all highly expressed in embryonic tissues (Figure ). When these four genes are omitted from the clustering analysis, the specific branching and clustering of embryonic stages is lost (data not shown). Among the postnatal tissues, adrenal gland splits off in the 3rd branch (as in 2nd split in the human).
Figure 2 Boxplots of relative expression levels. As described in figure 1, normalized expression levels of imprinted genes in human (a) and mouse (b) are shown in boxplots. The x-axis displays the different tissues, and the y-axis indicates relative normalized ...
The clustering of imprinted genes in the mouse shows that a series of genes play a role in the branching into embryonic and brain-specific clusters at the 1st/2nd split. Aside from the predominant expression profile of H19 (1st split; H19 is not represented on the human array), the remaining genes split into two clusters (2nd split). One is a group characterised by moderate to high expression, which splits into two clusters (3rd split); of these, the cluster of Gtl2, Inpp5f, Nap1l5, Ndn, Nnat, Dlk1, Peg3, Grb10 and Plagl1 shows high expression in brain tissues. The second group is characterised by moderate to low expression and falls into two clusters according to the 2nd split. Of these, the cluster of Slc38a4, Peg12 and Zim1 shows mainly low expression throughout the tissues.
Additional clusterings (data not shown) were generated using Euclidean or Manhattan distance, each with either complete or average linkage. For mouse, the structures of the resulting trees were very similar to the ones shown in figure . The human clustering was less stable (particularly with Manhattan distance). However, placenta always separated in the first splits and, in most analyses, pancreas separated at the 2nd split, whereas adrenal gland and pituitary as well as amygdala and hypothalamus separated at the 2nd or 3rd.
Imprinted genes do not show prominent overexpression in distinct tissues
We next analysed whether imprinted genes on average are more strongly expressed in certain tissues compared to the non-imprinted genes present on the arrays (Figure and ). For the analysis we sampled groups of non-imprinted genes (same number of genes as the examined imprinted gene group) 1000 times and compared their relative expression levels to the average expression of imprinted genes. These analyses were performed separately for each tissue. For human, the median expression levels of imprinted and non-imprinted genes were not significantly different (after multiple testing adjustment, i.e. Hochberg adjustment, p values ~ 0.64). In mouse, hypothalamus showed a slightly increased median expression compared to other tissues (p = 0.05).
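The resampling comparison and the Hochberg adjustment described above can be sketched in Python as follows. This is a sketch under stated assumptions rather than the procedure used in the study: the text does not specify how the p value was derived from the 1000 sampled gene groups, so a one-sided empirical p value on the median is assumed here, and the function names `empirical_p` and `hochberg` as well as all input values are hypothetical.

```python
import random
import statistics

def empirical_p(imprinted_values, background_values, n_draws=1000, seed=0):
    """One-sided empirical p value for the median of the imprinted set (assumed test)."""
    rng = random.Random(seed)
    observed = statistics.median(imprinted_values)
    k = len(imprinted_values)
    hits = 0
    for _ in range(n_draws):
        # Draw a non-imprinted gene group of the same size as the imprinted group.
        sample = rng.sample(background_values, k)
        if statistics.median(sample) >= observed:
            hits += 1
    # Add-one correction so that the estimate is never exactly zero.
    return (hits + 1) / (n_draws + 1)

def hochberg(pvalues):
    """Hochberg step-up adjusted p values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i], reverse=True)
    adjusted = [0.0] * m
    running_min = 1.0
    for rank, idx in enumerate(order, start=1):  # rank 1 is the largest p value
        running_min = min(running_min, rank * pvalues[idx])
        adjusted[idx] = running_min
    return adjusted

# Hypothetical per-tissue p values, one per tissue.
raw = [0.01, 0.04, 0.03]
print(hochberg(raw))
```

In this setting `empirical_p` would be called once per tissue, and the resulting list of per-tissue p values would then be passed to `hochberg`.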
We also compared the distribution of expression levels of imprinted and non-imprinted genes across tissues. Testing included either all non-imprinted genes on the array or randomly sampled sets. In both cases we observed similar distributions of standard deviations of expression levels across tissues for imprinted and non-imprinted (background) genes on the array [see Additional file 2]. When tested against randomly sampled gene sets, the distributions of human and mouse standard deviations did not differ significantly from the genomic background. Thus, the overall variability across tissues in the relative expression of individual imprinted genes is not remarkably high, with a few exceptions such as DLK1 in human and H19 in mouse.
In summary, imprinted genes show a median expression across tissues similar to that of non-imprinted genes. Except for a slight tendency in mouse hypothalamus, imprinted genes do not show a particular tissue-specific enrichment compared to the genome-wide average in either adult tissues or mouse embryonic tissues. In addition, imprinted genes did not show reduced or increased variability in tissue-specific expression levels. This suggests that, on average, imprinted genes tend neither to be expressed at almost constant levels in all tissues (like housekeeping genes) nor to be expressed in only very few tissues.
We next tested whether any two tissues differ significantly. After adjusting for multiple testing, no tissue pair reaches significance in human. In mouse, 60 tissue pairs out of 406 show a p value < 0.01. By chance we would expect 4 pairs with a p value of less than 0.01. Thus we observe an approximately 15-fold increase. Furthermore, pairwise comparison of embryonic tissues (fertilized egg, embryonic stages 6.5 to 10.5) with adult tissues resulted in 36 pairs with p value < 0.01 (out of 126). Hypothalamus is the tissue with the highest median expression level and shows significant over-expression in comparison to 14 tissues (out of 28 pairs). The detailed matrix is given in an additional table [see Additional file 3].
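The expected number of chance findings quoted above follows from simple arithmetic, which the short Python check below reproduces. The number of mouse samples (29) is not stated in the text; it is inferred here from the 406 reported pairs and from the 28 pairs involving hypothalamus.

```python
from math import comb

n_tissues = 29                      # inferred: comb(29, 2) equals the 406 pairs reported
n_pairs = comb(n_tissues, 2)        # number of unordered tissue pairs
alpha = 0.01                        # significance threshold used in the text
expected_by_chance = n_pairs * alpha
observed = 60                       # pairs with p value < 0.01 reported for mouse
fold = observed / expected_by_chance

print(n_pairs, round(expected_by_chance, 2), round(fold, 1))
```

With these numbers about 4 pairs are expected by chance, so 60 observed pairs correspond to a roughly 15-fold excess.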
The biclustered expression matrices (Figure and ) illustrate that several imprinted genes show conspicuous expression behaviour across tissues. H19 is such an outlier gene: in mouse it is highly expressed at all embryonic stages and in skeletal muscle. Others, such as GNAS/Gnas, are strongly expressed in many tissues of one species but not the other, pointing towards more general expression differences between human and mouse at this locus. Finally, in some tissues individual outliers show extensive differences in relative expression between the two species. An example is Cdkn1c, which is highly expressed in adrenal gland in the mouse but only moderately in human (Figure ), although the general correlation between CDKN1C and Cdkn1c across all tissues is rather high (see below).
Overall, we observe that in pairwise comparisons imprinted genes are more highly expressed in mouse embryo than in adult tissues, especially bone marrow, heart, lung, lymph node, pancreas, prostate, salivary gland, testis, thymus and thyroid. The highest expression levels are observed for genes in the BWS region, namely H19, Cdkn1c, Phlda2 and Igf2. These genes dominate the biclustering of embryonic and placental samples in figure whereas other genes behave rather inconspicuously at embryonic stages.
We also calculated the pairwise Pearson correlation coefficients for genes within particular imprinted regions, i.e. regions that contained at least three verified imprinted genes annotated to the same chromosomal band. The analysis shows that the expression profiles of imprinted genes from a common cluster/chromosomal band are no more similar to each other than those of genes that reside in different regions (data not shown).
The biclustered expression matrices (Figures and ) already indicated that maternally and paternally imprinted genes, respectively, do not cluster together according to their tissue-specific expression profiles. Calculation of Pearson correlations supports this notion, showing that the parental origin of expression has no influence on tissue-specific expression profiles (data not shown). Still, overall, paternally expressed genes tend to be more highly expressed than maternally expressed genes (for human p = 0.02, for mouse p = 0.04, t-test).
Orthologous imprinted genes in man and mouse exhibit relaxed correlation of tissue specific expression
We next investigated whether tissue-specific expression is correlated for orthologous imprinted genes in human and mouse (Figure ). Overall, orthologous gene pairs showed higher correlation than non-orthologous pairs (diagonal entries: median = 0.561, standard deviation (std) = 0.376; off-diagonal entries: median = 0.093, std = 0.344). This difference is significant (Kolmogorov-Smirnov test, p = 0.006; Wilcoxon test, p = 0.0006). The genes with the highest correlation values were INS/Ins2 (Pearson correlation (pc) = 0.989), CDKN1C (pc = 0.861), IGF2/Igf2 (pc = 0.810), PEG3 (pc = 0.774) and DLK1 (pc = 0.798). No correlation, i.e. a Pearson correlation of approximately zero, was observed for PPP1R9A and MKRN3. In general, the correlation across tissues for imprinted orthologous gene pairs (pc = 0.56) is higher than the gene-wise correlation of non-imprinted orthologs (median pc = 0.20; 1000 sampled sets of 19 genes). The values for gene pairs derived from the genomic background are in agreement with published results derived using slightly different methods [19
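The comparison of diagonal and off-diagonal entries can be sketched in Python as follows. The sketch assumes that each gene has one relative expression value per matched tissue; the expression values are invented for illustration, the function names are hypothetical, and the significance tests mentioned in the text are not reproduced.

```python
import math
import statistics

def pearson(x, y):
    """Pearson correlation coefficient of two equally long profiles."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def split_correlations(human, mouse, orthologs):
    """Correlate every human gene with every mouse gene across matched tissues.

    Diagonal entries are orthologous pairs, off-diagonal entries are all others.
    """
    diagonal, off_diagonal = [], []
    for i, (h_gene, _) in enumerate(orthologs):
        for j, (_, m_gene) in enumerate(orthologs):
            r = pearson(human[h_gene], mouse[m_gene])
            (diagonal if i == j else off_diagonal).append(r)
    return diagonal, off_diagonal

# Hypothetical relative expression in four matched tissues.
human = {"INS": [1, 1, 9, 1], "IGF2": [8, 2, 1, 3], "MKRN3": [2, 3, 2, 3]}
mouse = {"Ins2": [1, 2, 10, 1], "Igf2": [9, 3, 2, 2], "Mkrn3": [2, 2, 3, 3]}
orthologs = [("INS", "Ins2"), ("IGF2", "Igf2"), ("MKRN3", "Mkrn3")]

diag, off = split_correlations(human, mouse, orthologs)
print(statistics.median(diag), statistics.median(off))
```

The two resulting lists correspond to the diagonal and off-diagonal entries summarised above, and a two-sample test could then be applied to them.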
Figure 3 Correlation of orthologous gene expression in human and mouse. (a) Pearson correlation of normalized tissue-specific expression levels of orthologous imprinted genes in human and mouse. The figure shows Pearson correlation coefficients as 2 dimensional ...
In addition, we analysed whether analogous tissues show similar expression profiles of orthologous imprinted genes in human and mouse. From a set of 8980 available orthologous genes we randomly sampled 1000 gene sets of the same size as the imprinted gene set. For a given tissue we determined Pearson correlation coefficients comparing the relative expression profiles in human and mouse for each set of sampled genes. Thus, for each of the 22 tissues we derived 1000 Pearson correlation coefficients for the sampled gene sets and one for the imprinted gene set.
For 18 of 22 tissues, the correlation of expression of imprinted genes lies within the 25% to 75% interquartile range (IQR) of randomly sampled orthologous genes (Figure and [see Additional file 4]). In trachea, imprinted genes correlated slightly worse (median 0.388 for the random gene sets and 0.142 for imprinted genes). In adrenal gland, pancreas and pituitary, the correlation was stronger than in the sets of random genes [see Additional file 4]. Although the correlation coefficient of placenta was between the 1st and 3rd quartiles (placenta differs from other tissues in its expression patterns of imprinted genes), it shows a clear tendency towards higher correlation in imprinted genes than in randomly sampled sets. In summary, the correlation values of human and mouse orthologous imprinted genes are not very different from those of randomly sampled non-imprinted orthologous genes (Figure ). In a few endocrine tissues we observe a strong expression correlation for orthologous imprinted genes. This finding is in line with previous individual expression reports on a few candidates in human and mouse [19
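The per-tissue comparison against sampled gene sets can be sketched in Python as follows. The sketch makes several assumptions that are not taken from the text: the sampled sets are drawn from non-imprinted orthologs only, the data are simulated, and the helper names are hypothetical.

```python
import math
import random
import statistics

def pearson(x, y):
    """Pearson correlation coefficient of two equally long vectors."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def compare_to_background(human_t, mouse_t, imprinted, n_draws=1000, seed=0):
    """For one tissue, return the imprinted-set correlation and the background quartiles."""
    rng = random.Random(seed)
    background = [g for g in human_t if g not in imprinted]
    observed = pearson([human_t[g] for g in imprinted],
                       [mouse_t[g] for g in imprinted])
    sampled = []
    for _ in range(n_draws):
        genes = rng.sample(background, len(imprinted))
        sampled.append(pearson([human_t[g] for g in genes],
                               [mouse_t[g] for g in genes]))
    q1, _, q3 = statistics.quantiles(sampled, n=4)
    return observed, q1, q3

# Simulated relative expression of 200 orthologs in a single tissue.
rng = random.Random(1)
genes = [f"g{i}" for i in range(200)]
human_t = {g: rng.gauss(0, 1) for g in genes}
mouse_t = {g: 0.5 * human_t[g] + rng.gauss(0, 1) for g in genes}
imprinted = genes[:19]  # stand-in for the 19 orthologous imprinted genes

observed, q1, q3 = compare_to_background(human_t, mouse_t, imprinted, n_draws=200)
inside_iqr = q1 <= observed <= q3
print(round(observed, 3), round(q1, 3), round(q3, 3), inside_iqr)
```

Repeating this for every tissue and counting how often `inside_iqr` is true gives a summary of the kind reported above (18 of 22 tissues).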
A few genes dominate expression profiles in distinct tissues
Using correspondence analysis, we next examined the relative contribution of individual genes to tissue-specific expression profiles in human and mouse. Briefly, correspondence analysis is applied to a two-way table (in our case the relative expression values of 19 genes in 22 tissues each in human and mouse) to uncover correspondence between rows (here genes) and columns (here tissues). The method was originally described by Benzécri [21]. A high-dimensional space of data points (genes, tissues) is constructed such that the dimensions are ordered by the variance they explain. The high-dimensional data are then rotated and projected onto a two-dimensional plane so that the maximal variance is displayed along the first and second axes. In our application, this allows associations between genes and tissues to be studied in a two-dimensional display (Figure ). Objects (genes, tissues) with similar correlations cluster together, resulting in small angles, whereas dissimilar objects are separated from each other (large angles, e.g. different quadrants). Furthermore, the longer the vector, the higher the information content. In figure , each point, i.e. tissue (red triangle) or gene (black dot), marks the direction and distance of a vector originating from the centroid. Vectors that lie in the same quadrant and are separated by a small angle reflect a relative association. While correspondence analysis allows us to visualize associations in complex matrices, it should be noted that there is no threshold for deciding whether an association is strong or weak; the vectors describe relative associations only, i.e. stronger or weaker than another.
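To make the notion of explained inertia concrete, the following Python sketch computes the share of total inertia captured by the first axis of a correspondence analysis. It is a simplified illustration, not the implementation used in the study: it assumes a table of non-negative values with non-zero row and column sums, it uses power iteration instead of a full singular value decomposition, it does not compute the plotted coordinates, and the table values and function names are hypothetical.

```python
import math

def standardized_residuals(table):
    """Standardized residuals (P_ij - r_i * c_j) / sqrt(r_i * c_j) of a two-way table."""
    total = sum(map(sum, table))
    row_mass = [sum(row) / total for row in table]
    col_mass = [sum(col) / total for col in zip(*table)]
    return [[(x / total - r * c) / math.sqrt(r * c)
             for x, c in zip(row, col_mass)]
            for row, r in zip(table, row_mass)]

def first_axis_share(table, iterations=500):
    """Fraction of total inertia explained by the first principal axis."""
    s = standardized_residuals(table)
    total_inertia = sum(x * x for row in s for x in row)
    n_rows, n_cols = len(s), len(s[0])
    v = [1.0 / (j + 1) for j in range(n_cols)]  # arbitrary start vector
    for _ in range(iterations):
        # One power-iteration step on S^T S.
        u = [sum(a * b for a, b in zip(row, v)) for row in s]
        w = [sum(s[i][j] * u[i] for i in range(n_rows)) for j in range(n_cols)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    u = [sum(a * b for a, b in zip(row, v)) for row in s]
    leading = sum(x * x for x in u)  # squared leading singular value of S
    return leading / total_inertia

# Hypothetical table: rows are genes, columns are tissues.
table = [[20, 5, 3, 2],
         [4, 18, 6, 2],
         [3, 4, 15, 8]]
share = first_axis_share(table)
print(f"{share:.1%}")
```

The percentages of explained inertia reported for the individual components are quantities of this kind, one per axis.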
Figure 4 Correspondence analysis of combined matrices of relative gene expression of orthologous imprinted genes. Human and mouse imprinted gene expression for all tissues was analysed using correspondence analysis. In this figure, the first and second components ...
Examples of such associations are mouse pituitary and human placenta, which are associated with DLK1/Dlk1 and PEG3/Peg3 in the second quadrant. Further examples are (1) mouse placenta and PLAGL1/Plagl1, (2) mouse pituitary, human placenta and PEG3/Peg3, and (3) human adrenal gland, pituitary, ovary and NNAT/Nnat, NDN/Ndn.
Overall, however, the correspondence analysis (Figure ) revealed the highest variance between human and mouse tissues, with human placenta being the only exception (separated by the first component, which accounts for 41.89%). Thus, almost all human and mouse tissues were clearly separated (the explained inertia of the first 5 components is 41.89%, 19.59%, 10.74%, 9.55% and 4.02%, respectively). A strong association was seen between GNAS and all human tissues except for pancreas, adrenal gland, ovary, pituitary and placenta. INS/Ins2 showed a strong association with mouse pancreas but less so with human pancreas, while SLC22A18 showed a stronger association with human pancreas. Dlk1 was associated with mouse pituitary (Figure ).
Distinct sets of transcription factor binding sites correlate to tissue-specific expression patterns of imprinted genes
Next, we asked whether tissue-specific expression of imprinted genes is regulated by a well-defined set of transcription factors. To address a possible connection between tissue-specific imprinted gene expression and predicted transcription factor binding sites (TFBS), we again applied a correspondence analysis. We computed a matrix of predicted TFBSs (based on TRANSFAC patterns) in the upstream regions of the respective genes (scored by p values). By multiplying this score matrix by the matrix of relative expression strength of the respective genes, we derived a score matrix that relates human tissues to TFBS enrichment based on the studied set of imprinted genes. For visualization we performed a correspondence analysis (Figure ).
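The matrix product described above can be sketched in Python as follows. The text does not state how the p values enter the score matrix, so the transformation used here (minus log10 of the p value) is an assumption, and the transcription factor names, gene names and all values are placeholders.

```python
import math

def tfbs_tissue_scores(tfbs_pvalues, expression):
    """Multiply a TFBS-by-gene score matrix by a gene-by-tissue expression matrix.

    tfbs_pvalues: dict TFBS -> dict gene -> p value of the predicted site
    expression:   dict gene -> dict tissue -> relative expression
    Scores are -log10(p), which is an assumption made for this sketch.
    """
    tissues = sorted({t for per_tissue in expression.values() for t in per_tissue})
    scores = {}
    for tf, per_gene in tfbs_pvalues.items():
        scores[tf] = {
            tissue: sum(-math.log10(p) * expression[gene][tissue]
                        for gene, p in per_gene.items())
            for tissue in tissues
        }
    return scores

# Placeholder inputs: two binding-site patterns, two genes, two tissues.
tfbs_pvalues = {
    "TF_A": {"GENE1": 1e-4, "GENE2": 0.1},
    "TF_B": {"GENE1": 0.1, "GENE2": 1e-3},
}
expression = {
    "GENE1": {"placenta": 2.0, "liver": 0.5},
    "GENE2": {"placenta": 0.5, "liver": 2.0},
}
scores = tfbs_tissue_scores(tfbs_pvalues, expression)
for tf, per_tissue in scores.items():
    print(tf, per_tissue)
```

In this toy example the pattern with a strong predicted site in the placenta-expressed gene receives the higher placenta score, which is the kind of tissue-to-TFBS relation that the correspondence analysis then visualizes.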
Figure 5 Correspondence analysis of relative expression and transcription factor binding sites of human orthologous imprinted genes. (a) Human orthologous imprinted gene expression for all tissues as well as p-values for the 107 most prevalent TFBSs were analysed ...
The analysis reveals that imprinted genes expressed in human placenta are more frequently associated with binding sites for XPF1, NFKappaB50, ER, MTF1, SF1, GLI, TEF, DR4, MEF3 and CP2. XPF1, TEF, MTF1, SF1, GLI, MEF3 and CP2 were also present on the expression array, and we found XPF1, MTF1, SF1, GLI and CP2 to be upregulated (defined as higher-than-median expression plus 1 SD) in human placenta (Figure ). We further extracted human placenta-expressed genes from the expression dataset and explored whether binding sites for the same group of transcription factors were enriched. Indeed, all TFBSs except those for XPF1 and MEF3 showed significant enrichment: NFKappaB50 had a p-value of 4.3 × 10^-44, TEF 7.3 × 10^-69, DR4 4.1 × 10^-7, ER 0.002, MTF1 9.9 × 10^-57, SF1 1.2 × 10^-11, GLI 5.9 × 10^-20 and CP2 4.4 × 10^-41, while XPF1 and MEF3 had a p-value of 0.4. After adjustment for the elevated GC content of the upstream regions of imprinted genes, binding sites for NFKappaB50, TEF, MTF1, SF1 and CP2 still showed clear enrichment, while binding sites for DR4, ER and GLI showed no enrichment. In summary, NFKappaB50, TEF, MTF1, SF1 and CP2 displayed placenta-specific TFBS enrichment. Binding sites for XPF1 were significantly enriched in the upstream regions of imprinted genes but not in those of placenta-expressed genes, with or without GC content adjustment.
In mouse, the same TFBSs were significantly enriched as in the human set, namely for NFKappaB50, TEF, MTF1, SF1 and CP2 (even after adjustment for elevated GC content). In addition, DR4 and GLI binding sites also showed significant enrichment. Overall, the results were comparable between human and mouse imprinted genes.
Prominent examples of tissue-specific TFBS associations in imprinted genes can also be observed for adrenal gland, pituitary and ovary (2nd quadrant in Figure , and Figure ). As in the clustering of human imprinted gene expression (Figure ), the multiplied dataset consisting of TFBS enrichment and expression displays a very pronounced cluster of pituitary, adrenal gland and ovary. The strongest associations with these tissues, as can be read directly from figure , are the HEN1, LXRR4, MRF2, CEBP, RP58, HEB and XVENT1 binding sites. While XVENT1 is not represented on the array, HEN1, MRF2 and CEBP show a very pronounced upregulation in ovary and pituitary (at least two-fold upregulation compared to the cellular background). In adrenal gland, HEN1 and CEBP show the same upregulation, while MRF2 is at least 1.4-fold upregulated.