Microarray and RNA-sequencing datasets
We used two previously published gene expression datasets of postmortem primate liver. One was based on Affymetrix Human Genome U133plus2 GeneChip® arrays measured in six humans, five chimpanzees and in five orangutans
[20],
[21], available at the ArrayExpress Archive (
http://www.ebi.ac.uk/arrayexpress/) with accession numbers E-AFMX-11 and E-TABM-84. A second dataset was based on RNA-sequencing on the Illumina platform, which contained data from 12 humans, 12 chimpanzees and 12 rhesus macaques
[22] and was downloaded from NCBI Gene Expression Omnibus (GEO) public data repository
[56] (
http://www.ncbi.nlm.nih.gov/projects/geo/query/acc.cgi?acc=GSE17274).
We also used a mouse liver gene expression dataset, where expression was measured using Mouse Genome 430 2.0 GeneChip® arrays in 24 mice fed two human diets, one chimpanzee diet, and one regular mouse food diet, six mice for each diet
[19], downloaded from GEO (
http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE6297).
Preprocessing gene expression data
For Affymetrix microarray data analysis, we summarized expression levels per Ensembl
[57] gene (version 54) using custom CDF files
[58] (available at
http://brainarray.mbni.med.umich.edu/Brainarray/Database/CustomCDF/genomic_curated_CDF.asp). Expression levels were calculated using the “rma” (robust multichip average) method in the “affy” package
[59], which is part of the R Bioconductor software
[60]. Microarray probes that did not match human (hg18), chimpanzee (pantro2) and orangutan (ponabe2) genomes perfectly were identified using BLAT
[61] and discarded. The extracted expression levels were log transformed and quantile-normalized. Detection
p-values (probability of the expression signal representing background) were calculated using the “mas5” method in the same package. In further analysis we included (1) genes with a nominal detection
p-value <0.05 among at least half of the samples, and (2) genes showing above-detection expression unevenly among sample groups (indicating differential expression), as determined using the “dMFNCHypergeo” method in the R “BiasedUrn” package
[62] at
p<0.1.
For the RNA-sequencing dataset, preprocessed read counts for 20,689 Ensembl genes were directly downloaded from NCBI GEO with accession GSE17274. 7,544 genes that had no read count in more than half of all samples, or which had no differential expression test p-value according to the “DESeq” R package (see below) due to 0 variance, were removed from further analyses, resulting in 13,145 genes. Read counts were log transformed and quantile-normalized.
Choice of statistical tests
We used parametric tests for testing differential expression (t-test or ANOVA) or comparing expression profiles between pairs of genes (Pearson correlation test). When comparing distributions of variables that are by definition not normally distributed (e.g. correlation coefficients or dN/dS ratios) we used non-parametric tests (Spearman correlation and Wilcoxon signed-rank tests).
Testing for differential gene expression in microarray datasets
We used ANOVA to test for each gene's differential expression among groups. If data were generated in different batches, two-way ANOVA was used with experimental batch (the day of hybridization) included as an additional factor. To remove the batch effect, for each gene, we subtracted each batch's mean from expression profiles of samples within that batch. If a factor (e.g. species) had more than two levels, testing differential expression between each pair of levels was accomplished using the Tukey HSD
post hoc test (“TukeyHSD” function from the “stats” R package
[60]).
Testing for transcriptome-wide human vs. chimpanzee differences
We first determined differential expression in each primate dataset separately, and then combined the results. To identify differential expression in the RNA-sequencing dataset, we first determined a
p-value cut-off based on a permutation approach to ensure FDR <10%. Specifically, (1) species identities of samples were randomized, (2) a differential expression test (using the “nbinomTest” method in the “DESeq” R package
[63]) was applied to all genes using the randomized species identities. This routine was repeated for 1,000 times, and the exact p-value cutoff was chosen to ensure that the median number of significant genes across the 1,000 permutations is 10% of the actual number of significant genes. Using this criterion, we identified 4,551 out of the 13,145 genes as differentially expressed between human and chimpanzee at
p<0.024.
For the Affymetrix U133plus2 dataset, using a similar procedure we identified t-test p<0.037 as the cutoff at FDR <10%. At this cutoff, 969 genes were differentially expressed between humans and chimpanzees out of 4,531 expressed genes.
The two datasets were combined using 4,161 genes that were commonly expressed in both, with each gene's expression profile standardized (mean extracted and then divided by the standard deviation, resulting in a z-score) first across human and chimpanzee samples in each dataset and subsequently across all samples from both datasets. 428 genes that showed significant differential expression (at FDR <10%) in both datasets in the same direction (e.g. higher in human in both datasets) were considered as significantly differentially expressed between human and chimpanzee. In addition, we defined an effect size-based set of genes showing species effects (see below).
Testing for transcriptome-wide diet effects in mouse
Following the original mouse diet study, we combined the data from the two human diets – the cafeteria and fast food diets – due to their similar effects on mouse liver gene expression levels, and compared them directly to the chimpanzee diet. Using the permutation-based approach described above, 1,316 out of 6,147 expressed genes showed differential expression between mice fed human and chimpanzee diets at ANOVA p<0.073 with FDR<10%.
Effect size calculation
The following formula was adopted to calculate effect size:
d
=

(M
1–M
2)/SD
p, where M
1 and M
2 are the means of the two groups and SD
p is the pooled standard deviation, calculated as (((N
1−1)×SD
12+((N
2−1)×SD
22))/(N
1+N
2−2))
0.5, where N
1 and N
2 are the sample sizes and SD
1 and SD
2 are the standard deviations of the two groups, respectively. In the mouse and primate datasets, the first group was mice fed human diets or humans, the second mice fed chimpanzee diets or chimpanzees, respectively.
Rational for using effect size as cutoff
In addition to ANOVA, we also used effect size to define differentially expressed gene sets, for two reasons: First, using effect size allows straightforward comparisons of datasets with different sample sizes. Second, the number of genes reaching nominal significance cutoffs for differential expression in both the primate and mouse datasets was limited (N

=

57), too small to allow testing for common regulatory factors. We reasoned that this narrow overlap might partly be caused by weak statistical power to detect differential expression in either dataset (i.e. high false negative rates). One approach to overcome this limitation is to search for regulatory effects over a larger set of genes showing weaker differential expression signals. We thus chose a more relaxed cutoff to determine species or diet effects based on effect size (|effect size|>0.8). The cutoff 0.8 has been proposed as a general cutoff for modest effects
[64]. Note that in the combined primate dataset, this cutoff roughly corresponds to t-test
p-value <0.025 (permutation based FDR

=

4.4%).
Gene Ontology analysis
We used the Gene Ontology (GO)
[49] and the Fisher's exact test for functional analysis. Annotations from the biological process (BP) ontological domain were used. Ensembl genes with GO annotation downloaded from Ensembl (version 64) were assigned to GO categories based on Ensembl GO annotation and the Gene Ontology directed acyclic graph (DAG), accessed through the “GO. db” R package
[65] (this latter step is necessary to assign genes to ancestral GO categories, which are not included in the Ensembl table). The numbers of tested genes and those of their relevant background genes with annotations are shown in
Table S2. Genes expressed in a dataset but that did not show a specific effect, were chosen as background. Only GO categories containing a minimal number of genes with GO annotation were tested (see
Table S2). To correct for multiple testing, we randomly re-sampled the same number of genes as in the tested set from the relevant background genes with GO annotation for 1,000 times. The FDR was defined as the ratio of the expected (median) relative frequency of significantly enriched categories among the 1,000 permutations, to that observed, at a certain
p-value cutoff. The global significance of the tests across all GO categories was defined as the relative frequency of permutations with at least as many enriched categories as that observed, passing a
p-value cutoff. When reporting significance, we use a
p-value cutoff (chosen from 0.05, 0.01, 0.005, and 0.001) at which FDR <10% and the global
p-value <0.05.
Predicting target genes of transcription factors
We borrowed the procedure from
[27] to predict target genes of each transcription factor (TF). Briefly, the “MATCH” algorithm from the TRANSFAC database (version 7.1)
[26] was used to predict TF binding sites (TFBS) on each gene's putative promoter region; genes with at least one conserved predicted binding site of one TF were considered that TF's targets. Specifically, the promoter was defined as the region within 2,000 base pairs both upstream and downstream of the focal gene's TSS (as annotated by Ensembl version 54
[57]). To find TFBS conserved among vertebrates, we required that ≥80% of nucleotides of the focal TFBS have 17-way vertebrate PhastCons scores and an average score ≥0.6. PhastCons scores were obtained from the UCSC Genome Browser 17-way Vertebrate Conserved Element Table
[66].
Identifying candidate transcription factors regulating expression differences
We used the same procedure for identifying candidate TFs regulating observed differential gene expression between groups, in both the primate and mouse diet datasets. Briefly, we first narrowed the search space to TFs and predicted target genes showing differential expression, and then tested each TF for non-random (more positive or more negative) correlations with its targets, compared to non-target genes (genes that are targets of other TFs). This was considered indication of a regulatory effect of the TF on its targets. Specifically, we calculated Pearson correlations between each TF and its predicted target genes that showed at least minimal species or diet effects (|effect size|>0.8). These correlations were then compared to that between the same TF and non-target genes whose |effect size|>0.8, using a two-sided Wilcoxon test (given that correlation coefficients are not normally distributed). A p-value <0.01 was used as cutoff. When a TF was associated with more than one TFBS motif (8 cases in the primate dataset and 9 in the mouse primate diet dataset), we tested target gene sets for each motif separately. To estimate how many TFs would pass the cutoff randomly, TF-target relations were permuted 1,000 times, the above-procedure applied each time, and the number of TFs passing the cutoff recorded. The global significance was defined as the relative frequency of permutations with the same number or more TFs passing the cutoff as that observed.
Testing for excess of common candidate TFs in the mouse and the primate datasets
When choosing TFs that were detected in both the combined primate dataset and the mouse dataset (N

=

36) as background, we performed a one-sided Fisher's exact test for the overlap between candidate TFs from each dataset that showed consistent changes in the primate and mouse diet datasets: e.g. up-regulation in humans or under a human diet in mice (note that
EGR1 was the only TF showing consistent change).
Testing for correlation of TF-target correlations between mice and primate datasets
The predicted TF-target relationships were permuted for genes that showed at least minimal species or diet effects (|effect size|>0.8) in both the mouse and primate datasets. For each TF that was identified as candidate regulator in at least one dataset, we calculated its correlation with its tested target genes in both datasets. Next, the correlation of these correlations (CoC) between two datasets was calculated using Spearman correlation (given that correlation coefficients are not normally distributed). This procedure was applied 1,000 times, and the relative frequency of random cases in which the number of TFs whose CoCs were no less significant than observed (i.e. for EGR1) was used as measure of significance. The procedure was repeated also for YY1 and NFIC, the two TFs that showed a diet effect and regulatory effects and diet in the mouse dataset, and species differences in the primate dataset.
The high-calorie mouse diet dataset
A mouse liver gene expression dataset based on Agilent-012694 Whole Mouse Genome G4122A (Feature Number version) containing liver samples from mice fed normal or high-calorie diets (5 individuals per group) was downloaded from NCBI GEO with accession number GSE6089. The data analyses followed the same procedures described above.
Non-random occurrence of conserved EGR1 binding sites
To test whether the occurrence of conserved EGR1 response motifs (V$KROX_Q6) in the 23 common targets' (genes identified as targets in both the mouse diet and primate datasets) promoter regions may reflect the nucleotide composition of these promoters, we performed a randomization test while controlling for overall G/C or dinucleotide content and conservation. Specifically, using the uShuffle software
[67], we randomized the proximal promoter sequences (±2,000 bp from the TSS) while keeping the numbers of all possible 16 dinucleotides fixed. We further permuted PhastCons scores per nucleotide, while keeping the distribution of PhastCons scores among the 4 nucleotide types fixed. We thus generated one thousand batches of random sequence, for each of the 23 genes' promoter regions. The sequences were used as input in the TRANSFAC “MATCH” algorithm
[26]. In each batch, we asked whether each of the 23 genes would be predicted as EGR1 target, i.e. whether it contained at least one binding site fulfilling the same criteria as in the original analysis. The maximum number of genes containing at least one predicted EGR1 binding site among 1,000 random batches was 4 (
Figure S11). The random expectation was calculated as the median of this distribution.
Overlap between binding sites and DNase I hypersensitive sites
Processed DNase I footprint data from human lymphoblast cell lines as well as 14 cell lines generated by the ENCODE project
[68] were obtained from
http://centipede.uchicago.edu/SimpleMulti/
[35] and used conforming to the ENCODE Consortium Data Release Policy
[69]. DNase I sites cover ~6% of the 23 common EGR1 targets' promoters (genes identified as targets in both the mouse diet and primate datasets). In 22 of the 23 promoters, we found a minimum 1-nt overlap between a DNase I site and the conserved EGR1 binding site. To calculate the random expectation we used the following procedure: (1) For each of the 23 genes, a 14-nt long DNA stretch (the length of the EGR1 response motif) was randomly chosen from the gene's proximal promoter sequence and its G/C content was calculated. (2) This procedure was repeated until 1,000 14-nt sequences of comparable G/C content were chosen. We required that the G/C content of the sequences to be at least as high as that of the originally identified 27 binding sites (79%). We thereby controlled the occurrence of high G/C content within DNase I hypersensitive sites. (3) The number of random sequences with a minimum of 1-nt overlap with DNase I sites was calculated. The
p-value was defined as the relative frequency among the 1,000 randomizations in which the number of genes with a binding site-DNase I site overlap was equal to or larger than that observed (i.e. 22 genes). The random expectation (median number of genes with overlap among the 1,000 randomizations) was calculated as 11/23.
Correlation between EGR1 and the 23 common targets in human liver
Liver transcriptome data from a large human sample (N

=

60, each with two replicates) was downloaded from NCBI GEO with accession number GSE28893
[36]. The data was quantile-normalized. In this dataset 13,942 genes were expressed, including 20 of the 23 common targets. A one-sided Wilcoxon test was used to test if these 20 genes were more strongly correlated with EGR1 than (i) other expressed genes predicted as EGR1 target based on sequence predictions, (ii) all other expressed genes with TF annotation.
Human vs. chimpanzee divergence rate at putative promoter regions and protein coding DNA sequence
We used human-chimpanzee promoter divergence rates (Kp), normalized by an estimate of the substitution rate of a genomic region (Ki), as calculated by
[20]. This measure was used to estimate sequence divergence for promoter regions for human and chimpanzee. Divergence at protein coding regions was defined by non-synonymous divergence normalized by synonymous divergence (dN/dS), and was downloaded from the Ensembl database (version 60).
Mammalian conservation at promoter and 3′ UTRs
We used PhastCons scores to estimate sequence conservation as previously reported
[27]. Briefly, using the PhastCons 18-way Placental Mammal Conservation Track (a subset of the 28-way Placental Track) from the UCSC Genome Browser, for each Ensembl human gene, we calculated mean sequence conservation for proximal promoter (±2,000 bp from the TSS) and 3′ UTR.
Calculating liver-specificity in gene expression
A dataset including 79 human tissues and 61 mouse tissues
[39] was used to calculate each gene's expression level in liver relative to that in other tissues. Specifically, for each species, for each gene, the liver expression level was scaled as the distance to the mean in units of standard deviation across all tissues, i.e. a z-score. We compared liver-specificity among gene sets using this z-score.
Testing liver-specificity differences among gene sets
To estimate the significance of the difference between human and mouse in liver-specificity among genes differentially expressed only in the mouse dataset (the foreground genes), we first had to account for overall differences in liver-specificity between the two species, which could arise because of technical or sampling reasons. To achieve this, we normalized the liver-specificity measure using a background gene set that should not show difference in liver-specificity: genes that were differentially expressed neither in mice nor in primates. We shifted the human and mouse measures so that the background genes had the same median liver-specificity in both species. We then tested for a higher median of these foreground genes' liver-specificities in mouse than in human using a one-sided Wilcoxon test (
p
=

0.0077).