Properties of individual probes
We developed an algorithm to rapidly map short sequence tags to complete genomes (Renaud and Wolfsberg, unpublished) and used it to determine how many times each probe in the Human Genome U133Plus2 microarray (Affymetrix) had an exact match in the human and chimpanzee genomes. The bulk of the probes (86%) in the U133Plus2 microarray have exactly one match in the human genome (Table , Fig. ). This is in contrast to 67% of the probes matching one time in the chimpanzee genome. Overall, 64.7% of individual probes showed one match in both the human and chimpanzee genomes.
Classification of probes in the Affymetrix U133Plus2 microarray
Classification of the 604,258 probes within the Affymetrix U133Plus2 microarray. The pie chart depicts the relative percentage of probes comprising each of the nine probe categories described in the Methods section.
We assigned probes into nine categories based upon the number of matches (0, 1, or Multiple) with the human (H) and chimpanzee (C) genomes (Methods). These included: 0H_0C, 0H_1C, 0H_MC, 1H_0C, 1H_1C, 1H_MC, MH_0C, MH_1C, and MH_MC. The 1H_1C and MH_MC categories are among the easiest to justify. The former involves conserved single gene sequences in both species while the latter at least in part reflects multi-copy gene families in humans and chimpanzees. The 1H_0C category reflects consequences of fixed sequence differences between the human and chimpanzee genomes. Below, we will discuss possible reasons for the remaining six probe categories, with the understanding that sequence errors or polymorphisms could influence each case.
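The nine-category scheme can be illustrated with a short sketch. The function names below are hypothetical and are not taken from our mapping software; the sketch only assumes that an exact-match count per genome is already available for each probe.

```python
def match_class(n_matches: int) -> str:
    """Collapse an exact-match count into '0', '1', or 'M' (multiple)."""
    if n_matches < 0:
        raise ValueError("match count cannot be negative")
    if n_matches == 0:
        return "0"
    if n_matches == 1:
        return "1"
    return "M"


def probe_category(human_matches: int, chimp_matches: int) -> str:
    """Build a category label such as '1H_1C' or 'MH_0C'."""
    return f"{match_class(human_matches)}H_{match_class(chimp_matches)}C"


# Enumerate all nine labels; a count of 2 stands in for "multiple".
ALL_CATEGORIES = [probe_category(h, c) for h in (0, 1, 2) for c in (0, 1, 2)]
```

For example, a probe with three exact matches in the human genome and none in the chimpanzee genome would be labeled MH_0C.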
Affymetrix currently employs a system in which the annotation of each probe set is classified into one of five categories based on the evidence available for the probe set interrogating the intended gene of interest. The five categories are designated A, B, C, E, and R, with A indicating the most direct evidence for a probe/transcript relationship. Of the 358 probe sets for which 11 probes are 0H_0C (no match to the chimpanzee or human genome), 182 (~51%) were assigned an A-level annotation. In contrast, ~81% of the 4,648 probe sets for which 11 probes are 1H_1C (match once in both species) were assigned an A-level annotation. In addition to these sequence quality issues, probes designed to overlap splice junctions absent in human genomic sequences will also fall into the 0H_0C category.
Possible explanations for the 1H_MC, MH_1C, and MH_0C probe categories include assembly errors of multi-copy genomic regions, duplication or loss of genetic material in either lineage, or mutations in duplicated segments. The classification of probes in the two remaining categories (0H_MC and 0H_1C) was unexpected since these require one or more matches in the chimpanzee and none in the human genome. This could arise due to the positioning of probe sequences across splice junctions found in human cDNA sequences. Such probes would not match the human genome; however, they could match processed pseudogene sequences present in the chimpanzee genome.
Since our downstream cross-species (i.e. human versus chimpanzee) gene expression analyses would focus on data derived from the 1H_1C probes, we next evaluated the percentage of 1H_1C probes that were located in orthologous regions of the chimpanzee and human genomes. We determined whether regions are orthologous by using the liftOver tool provided by the UCSC Genome Bioinformatics Group (http://genome.ucsc.edu/cgi-bin/hgLiftOver). We started with the coordinates of sequences that had a single hit in the human genome (1H) and used the liftOver tool to map them to the chimpanzee genome. We then compared the liftOver coordinates to the coordinates that we had obtained by aligning the sequences to the chimpanzee genome. If a liftOver coordinate was within 100 nt of our coordinate, we counted the chimpanzee hit as occurring in an orthologous region. Of the 390,967 sequences that have a single hit to both genomes (1H_1C), 388,044 (99.3%) hit orthologous regions in the chimpanzee and human genomes.
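The coordinate comparison described above can be sketched as follows. This is a minimal illustration, not our actual pipeline: the `Hit` type and `is_orthologous` function are hypothetical, liftOver itself is not invoked (its output is assumed to be available already), strand is ignored, and treating "within 100 nt" as an inclusive bound is an assumption.

```python
from typing import NamedTuple, Optional


class Hit(NamedTuple):
    """A probe location in the chimpanzee genome (hypothetical record type)."""
    chrom: str
    start: int


def is_orthologous(lifted: Optional[Hit], aligned: Hit, tolerance: int = 100) -> bool:
    """Compare the liftOver-derived location with the directly aligned one.

    `lifted` is None when the human coordinate could not be mapped at all.
    The inclusive tolerance is an assumption about the 100-nt rule.
    """
    if lifted is None:
        return False
    same_chrom = lifted.chrom == aligned.chrom
    return same_chrom and abs(lifted.start - aligned.start) <= tolerance
```

A probe whose lifted and aligned positions fall on different chromosomes, or more than 100 nt apart, would be counted among the sequences that do not map to orthologous regions by liftOver.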
We then explored in further detail the 2,923 1H_1C sequences that did not map to orthologous regions by liftOver. A total of 2,488 of the 2,923 sequences were either in an intron or an exon of a human gene, or within 5 kb upstream or downstream. Likewise, 2,021 of the 2,488 sequences were also in an intron or an exon of a chimpanzee gene, or within 5 kb upstream or downstream. Of these 2,021 cases, 1,660 involve genes predicted to be human/chimpanzee orthologs.
Taking this additional information into consideration, we conclude that 389,704 (99.7%) of the 1H_1C probes map to orthologous regions in the chimpanzee and human genomes. This is especially impressive since the chimpanzee genome assembly used is of lower quality than the human, which would result in some probes being falsely identified as not mapping to orthologous regions in both genomes. Overall, these observations strongly support the use of 1H_1C probes for the analysis of human and chimpanzee gene expression profiles.
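The totals reported in this section can be checked with a few lines of arithmetic. The variable names are illustrative only; the counts are those given in the text.

```python
total_1h1c = 390_967            # probes with one match in each genome
orthologous_liftover = 388_044  # orthologous according to liftOver
in_ortholog_genes = 1_660       # liftOver failures located in predicted orthologs

# Probes that liftOver did not place in orthologous regions.
not_orthologous_liftover = total_1h1c - orthologous_liftover

# Final count after adding the probes located in predicted orthologous genes.
orthologous_final = orthologous_liftover + in_ortholog_genes

pct_liftover = round(100 * orthologous_liftover / total_1h1c, 1)
pct_final = round(100 * orthologous_final / total_1h1c, 1)
```

The difference reproduces the 2,923 sequences examined further, and the two percentages round to 99.3% and 99.7%.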
General properties of probe sets
A total of 91.4% of probe sets (49,957 total) in the Human Genome U133Plus2 microarray had at least one probe removed in the initial masking process (i.e. contained at least one probe not in the 1H_1C category). In addition, 3,674 probe sets were completely eliminated from the most basic masking analysis (mask0, Table ). These included 48 probe sets in the AFFX control category, which by design are not expected to match the human or chimpanzee genomes.
Classification of probe sets based on number of 1H_1C probes
Next, we considered how many of the nine probe categories were represented in a given probe set. For each specified category, we determined the number of probe sets containing six or more probes (Fig. ). These are designated as being 'category-enriched' probe sets. Interestingly, we noted deficits in the number of annotated genes in certain category-enriched probe sets. At the time of analysis, 37% of the 54,675 probe sets were annotated with a unique NCBI Entrez GeneID. A total of 2,371 (51%) of the 4,648 1H_1C_11 probe sets were annotated with unique Entrez GeneIDs. However, only 70 (8.1%) of the 0H_0C category-enriched probe sets (N = 862) were annotated. Likewise, only 289 (19.7%) of the MH_MC category-enriched probe sets (N = 1,464) were annotated. Strikingly, no Entrez GeneID was provided for any of the 0H_1C category-enriched probe sets (N = 35).
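The 'category-enriched' designation can be expressed as a simple count over the category labels of a probe set. The sketch below is illustrative; the function name is hypothetical. In a probe set of 11 probes, at most one category can reach six probes, but the function returns a set so that it also covers larger probe sets.

```python
from collections import Counter
from typing import Iterable, Set


def enriched_categories(probe_categories: Iterable[str], min_probes: int = 6) -> Set[str]:
    """Return the categories represented by at least `min_probes` probes."""
    counts = Counter(probe_categories)
    return {category for category, n in counts.items() if n >= min_probes}


# A probe set with seven 1H_1C probes and four 0H_0C probes is 1H_1C-enriched.
example = enriched_categories(["1H_1C"] * 7 + ["0H_0C"] * 4)
```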
Figure 2 Classification of the 54,675 probe sets within the Affymetrix U133Plus2 microarray. The composition of probe sets with respect to probe categories is depicted. The height of each bar represents the number of probe sets (Y-axis) that contain at least one ...
Effect of probe number on estimates of intra- and interspecies expression variation
Next, we sought to explore broad effects of masking on expression estimates between species. Since we are focusing on 1H_1C probes for expression estimates in both species, a major question concerns the effects of reducing the number of probes in a given probe set on gene expression scores. For this analysis, we applied mask1 to the entire gene expression set (five tissues for all humans and chimpanzees) and calculated the median interquartile range (IQR) of expression scores for all probe sets as a function of the number of remaining probes (Fig. , red circles). We considered probe sets composed of odd and even numbers of remaining probes separately since the RMA median polishing algorithm calculates expression scores from such probe sets slightly differently [35]. This arises from differences in the formulas for determining the median in data sets consisting of odd and even numbers of observations. We propose that the effects of these differences may be enhanced by the small number of probes in each probe set.
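The odd/even distinction and the median IQR summary can be illustrated with Python's standard `statistics` module. This sketch does not implement RMA or median polish; it only shows how the median is computed for odd and even numbers of observations and how a median IQR per probe count could be tabulated. The function names, the toy values, and the choice of the "inclusive" quantile method are assumptions.

```python
import statistics
from collections import defaultdict

# Odd number of observations: the middle value is returned.
odd_median = statistics.median([7.0, 7.5, 8.0])
# Even number of observations: the two middle values are averaged.
even_median = statistics.median([7.0, 7.5, 8.0, 9.0])


def iqr(values):
    """Interquartile range using the 'inclusive' quantile method (an assumption)."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return q3 - q1


def median_iqr_by_probe_count(probe_sets):
    """probe_sets: iterable of (probes_remaining, expression scores across samples)."""
    grouped = defaultdict(list)
    for probes_remaining, scores in probe_sets:
        grouped[probes_remaining].append(iqr(scores))
    return {n: statistics.median(iqrs) for n, iqrs in sorted(grouped.items())}
```

With only a handful of probes per probe set, the averaging step used for even counts can shift the summary value noticeably relative to the odd case, which is the motivation for treating the two groups separately.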
Figure 3 Effects of probe number on the variation of gene expression scores. Median interquartile ranges (IQRs) of expression scores (Y-axes) for the indicated tissue from all humans and chimpanzees are plotted in red relative to the number of 1H_1C probes remaining ...
As would be expected, the median IQR decreased with an increasing number of probes. However, the relationship was not linear, with a faster decrease occurring when there were five or fewer probes than for probe sets with at least six probes measured. Adjusting for whether the number of probes was odd or even, the change in slope is statistically significant (adjusted P < 0.05 for all tissue types). This can be observed from the trend lines (red), generated using a breakpoint at either five probes for the even numbers of probe sets or six for the odd numbers. This reduction in slope supports a requirement of at least six probes in order to have improved stability of the gene expression measure. Similar results were obtained for the corresponding intraspecies IQR comparisons (Additional Files 1, red circles and lines). The only intraspecies comparison that did not achieve statistical significance at the 0.05 level was for the human testes (adjusted P for change in slope = 0.10).
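The idea of a change in slope at a breakpoint can be sketched with two ordinary least-squares fits, one on each side of the breakpoint. This is a simplified stand-in and not the model used for the reported P values: it does not force the two segments to join, does not adjust for odd versus even probe numbers, and performs no significance test. The function names and the synthetic data are hypothetical.

```python
def slope(points):
    """Ordinary least-squares slope for a list of (x, y) pairs."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    return sxy / sxx


def slopes_around_breakpoint(points, breakpoint):
    """Fit separate lines at or below and at or above the breakpoint."""
    low = [(x, y) for x, y in points if x <= breakpoint]
    high = [(x, y) for x, y in points if x >= breakpoint]
    return slope(low), slope(high)


# Synthetic median IQRs: steep decline up to five probes, shallow decline after.
synthetic = [(x, 10.0 - x) for x in range(1, 6)]
synthetic += [(x, 5.0 - 0.25 * (x - 5)) for x in range(6, 12)]
low_slope, high_slope = slopes_around_breakpoint(synthetic, breakpoint=5)
```

In this toy data set the slope below the breakpoint is four times steeper than the slope above it, mirroring the qualitative pattern described for the median IQRs.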
Afterwards, we sought to determine if random probe masking could lead to the observed relationships between median IQR and probes remaining in probe sets. To address this question, we generated a total of nine hundred simulated masked data sets wherein we removed N probes (N = 1–9) from each of the 4,648 1H_1C_11 probe sets. Overall, this entailed generating one hundred simulated data sets for each N probes removed. We recalculated median interspecies IQRs (Fig. , black lines) and median intraspecies IQRs (Additional Files 1, black lines) for each of these simulated masked data sets. For all tissues, we observed that the median interspecies and intraspecies IQRs derived from these simulated masked data always increase with decreasing numbers of probes within probe sets.
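The random masking design (nine values of N, one hundred replicates each) can be sketched as a generator of probe subsets. This is an illustration under stated assumptions, not the code used for the analysis: the function name, the use of probe indices 0 to 10, the fixed seed, and the choice to draw an independent random subset for every probe set are all hypothetical.

```python
import random
from typing import Dict, Iterator, List, Sequence, Tuple


def simulate_random_masks(
    probe_set_ids: Sequence[str],
    probes_per_set: int = 11,
    max_removed: int = 9,
    replicates: int = 100,
    seed: int = 0,
) -> Iterator[Tuple[int, int, Dict[str, List[int]]]]:
    """Yield (n_removed, replicate, kept probe indices per probe set)."""
    rng = random.Random(seed)
    for n_removed in range(1, max_removed + 1):
        for replicate in range(replicates):
            # Sample the probes to keep, without replacement, for each probe set.
            kept = {
                ps_id: sorted(rng.sample(range(probes_per_set), probes_per_set - n_removed))
                for ps_id in probe_set_ids
            }
            yield n_removed, replicate, kept
```

Iterating over the generator produces 9 x 100 = 900 simulated masks; expression scores would then be recomputed from the retained probes of each mask before the IQRs are recalculated.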
A comparison of lines fit separately to the simulated data and the real masked interspecies data (combining all human and chimpanzee data) found that the estimated slopes differ for three of the five tissues studied (kidney, liver, and testis) (F-test on 2 df, P < 0.05). In contrast, the observed relationships between median IQRs and probe numbers in the actual mask1 brain and heart expression data showed a striking resemblance to the relationships found in the simulated masked brain expression data (Fig. ).
We sought to address the possibility that factors, such as the rates of evolution, have a substantial influence on the patterns of expression variation observed in different tissues as a function of 1H_1C probe number. For example, it has been demonstrated in the initial analysis of the current data set that patterns of differences in gene expression and gene sequences are similar in humans and chimpanzees [13]. As a first step to approach this issue, we calculated dN/dS ratios for RefSeq transcripts corresponding to approximately 20,000 probe sets in both the human and the chimpanzee lineages (see Additional File 3 and Methods). We chose to analyze dN/dS ratios since they provide a commonly used means of measuring rates of evolution, taking nonsynonymous (dN) and synonymous (dS) substitutions per site into consideration. In bulk, we found that the nucleotide substitution rates of RefSeq transcripts expressed in a given tissue do not significantly vary in relation to the number of 1H_1C probes within the corresponding probe set (Additional File 4). Based on our metrics, the bulk rates of evolution of expressed genes do not explain the discussed relationships between median IQR and probes remaining in probe sets for the five tissues. However, it should be noted that these analyses are limited by the quality of current genomic sequence information for mammals such as the common chimpanzee. Thus, this issue could be revisited with improved drafts and annotations of these genomes.
Effect of probe number on inferences of cross-species differential gene expression
Thereafter, we focused on quantifying the effect(s) that probe number has on inferences of differential expression between humans and chimpanzees. For each tissue, we identified 1H_1C_11 probe sets that showed differential expression across species (see Methods for details). Then, we compared the list of differentially expressed genes in a given simulated data set to the list of differentially expressed genes originally observed in the same tissue. This allowed us to calculate the median and range for each of the following: (i) overlap, (ii) gain, and (iii) loss of inferences of differential expression in the simulated data sets relative to that generated from the original 1H_1C_11 probe sets in each tissue.
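The overlap, gain, and loss measures amount to set comparisons between the original list of differentially expressed probe sets and the list obtained from each simulated data set. The sketch below uses hypothetical function names and toy identifiers; the differential expression test itself (see Methods) is not reproduced.

```python
import statistics
from typing import Dict, Iterable, Set, Tuple


def compare_calls(original: Set[str], simulated: Set[str]) -> Dict[str, int]:
    """Count shared, newly gained, and lost differential expression calls."""
    return {
        "overlap": len(original & simulated),
        "gain": len(simulated - original),
        "loss": len(original - simulated),
    }


def summarize(original: Set[str], simulated_sets: Iterable[Set[str]]) -> Dict[str, Tuple[float, int, int]]:
    """Return (median, minimum, maximum) of each measure across simulations."""
    results = [compare_calls(original, s) for s in simulated_sets]
    summary = {}
    for key in ("overlap", "gain", "loss"):
        values = [r[key] for r in results]
        summary[key] = (statistics.median(values), min(values), max(values))
    return summary


toy_summary = summarize({"a", "b", "c"}, [{"a", "b"}, {"a", "d"}, {"a", "b", "c", "e"}])
```

Here a gained inference is a probe set called differentially expressed only in the simulated data, and a lost inference is one called only in the original 1H_1C_11 data.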
Overall, we observed linear increases in both the gained and lost inferences of differential expression in relationship to decreasing numbers of probes sampled within a probe set (Fig. ). While this was consistent for all five tissues, probe sets with odd and even numbers of remaining probes behaved slightly differently. Probe sets with even numbers of remaining probes showed more comparable increases in false negative and positive rates with decreasing probe number in all tissues (Fig. , and ), except heart (Fig. ). Probe sets with odd numbers of remaining probes showed steeper increases in lost relative to gained inferences with decreasing probe number (Fig. , and ).
Figure 4 Effects of probe number on the inferences of differential gene expression. The median number of gained (black) and lost (red) inferences of differential gene expression (Y-axes) in simulated data sets subjected to random probe masking relative to actual ...
To illustrate the effects of masking, we compared the inferences of differential gene expression using mask5 (requiring at least six 1H_1C probes in a probe set) and unmasked data (Additional File 5). For each of the five tissues, the application of the mask drastically reduces inferences of higher expression in humans relative to chimpanzees, as shown by comparing panels A, C, E, G, and I with B, D, F, H, and J. This is consistent with earlier comparative analyses of human and non-human primate transcriptomes which demonstrated that masking was essential to remove false inferences of differential gene expression caused by mismatches between arrayed human probes and non-human primate transcripts (reviewed in ref. [7