|Home | About | Journals | Submit | Contact Us | Français|
Since the introduction of microarrays in 1995, researchers world-wide have used both commercial and custom-designed microarrays for understanding differential expression of transcribed genes. Public databases such as ArrayExpress and the Gene Expression Omnibus (GEO) have made millions of samples readily available. One main drawback to microarray data analysis involves the selection of probes to represent a specific transcript of interest, particularly in light of the fact that transcript-specific knowledge (notably alternative splicing) is dynamic in nature.
We therefore developed a framework for reannotating and reassigning probe groups for Affymetrix® GeneChip® technology based on functional regions of interest. This framework addresses three issues of Affymetrix® GeneChip® data analyses: removing nonspecific probes, updating probe target mapping based on the latest genome knowledge and grouping probes into gene, transcript and region-based (UTR, individual exon, CDS) probe sets. Updated gene and transcript probe sets provide more specific analysis results based on current genomic and transcriptomic knowledge. The framework selects unique probes, aligns them to gene annotations and generates a custom Chip Description File (CDF). The analysis reveals only 87% of the Affymetrix® GeneChip® HG-U133 Plus 2 probes uniquely align to the current hg38 human assembly without mismatches. We also tested new mappings on the publicly available data series using rat and human data from GSE48611 and GSE72551 obtained from GEO, and illustrate that functional grouping allows for the subtle detection of regions of interest likely to have phenotypical consequences.
Through reanalysis of the publicly available data series GSE48611 and GSE72551, we profiled the contribution of UTR and CDS regions to the gene expression levels globally. The comparison between region and gene based results indicated that the detected expressed genes by gene-based and region-based CDFs show high consistency and regions based results allows us to detection of changes in transcript formation.
A DNA microarray (DNA chip or biochip) is a technology used to identify and measure the expression level of specific mRNA molecules in order to ascertain transcriptional profiles in response to differing conditions. The most commonly used microarray is the Affymetrix® GeneChip® family of arrays. Each GeneChip® consists of a silicon chip with fixed locations called cells, spots or features . Each spot contains millions of identical 25 base oligonucleotides (probes) which are selected to be complementary to various transcript regions of a gene . In order to determine transcript expression, which directly infers gene expression, groups of 11-20 probes matching the same gene/transcript are arranged in a probe set. Given a particular Affymetrix® GeneChip® platform, the design of the probes is fixed based on earlier genome assemblies and annotation available at that time. Since the design of the first Affymetrix® GeneChip®, rapid progress has been made in genome sequencing resulting in more accurate databases of annotated coding and non-coding genes.
The significant differences between old and new genome assemblies and annotations make it necessary to update probe-gene targeting according to current knowledge to get more accurate interpretations from experimental results. Affymetrix® does attempt to provide compatibility between genomic changes by updating links between probe sets and their corresponding genes/transcripts via NetAffx™ . Table 1 shows release dates of source databases used by Affymetrix® for both the incorporated version and the most recently available version. In all cases, there is at least a two year difference between the incorporated and most recent release dates which can lead to inconsistent interpretation.
In addition, updating links between probe sets and their corresponding genes/transcripts does not provide a solution for problems caused by individual probes such as single nucleotide polymorphisms (SNPs) [4, 5], probes that target genes other than the designated gene of a probe set, and probes that no longer align to a genomic location. For example in the Affymetrix® GeneChip® HG-133 Plus 2 array, a total of 40,680 probes out of 603,158 (excluding quality control probes) do not have a perfect match to the most recent human genome assembly (hg38).
Even though the design of the probes is fixed, the methods with which the resulting experiments can be analyzed are dynamic in nature due to the ability to annotate and arrange probes into uniquely defined groupings. This is particularly important since there are publicly available repositories of microarray datasets, such as NCBI’s Gene Expression Omnibus (GEO)  which contains 1,802,922 different samples as of 5/18/2016 that can be reanalyzed computationally based on current knowledge without the need for new biological experiments. As a case in point, each of the four most commonly used species have samples that have been analyzed using the original CDFs (Table 2).
Several research groups have reassigned probes into new probe sets by creating their own custom Chip Description Files (CDF) [7–13], which are specially formatted files used to store the layout information for an Affymetrix® GeneChip® array. Given a CDF, the intensity values of probes located in the CEL file can be extracted and summarized as a defined probe set to detect the expression level of genes or transcripts.
These approaches have a similar workflow of mapping probes but differ in terms of the groupings of probe sets, including: data source used, the selected target level (gene or transcript), whether to create probe sets from scratch or redesign the existing groups and sharing probes between probe sets.
In terms of annotations used, most approaches have mapped the probe sequences to the transcripts obtained from one or more databases such as GenBank, NCBI RefSeq and Ensembl. Unlike other approaches, Harbig et al.  mapped to the target sequences of probes obtained from Affymetrix® rather than the actual mRNA sequences themselves, where the target sequence is an exemplar region of a specific transcript ≤600 bases in length. After mapping, they grouped probes to unique transcripts or genes based on the mapping results. Some approaches update the original probe set groups by removing select probes and changing the link between probe set and gene/transcript. The most comprehensive study for probe annotation remapping was achieved by Dai et al. (brainarray CDFs) . Rather than focusing on one reference database or combining multiple sources to create one custom CDF, they mapped probes to different annotation databases and created a specific custom CDF for each database.
Although the inherent effects of using dated probe gene mapping designs to analyze microarray data sets might seem obvious, the overwhelming majority of experimental results have only been analyzed using the original CDFs designed by Affymetrix®. For example as of May 2016, GEO has 120,920 samples which were analyzed via the original Affymetrix® CDFs for the HG-U133 Plus 2 array (Table 2). On the other hand only 6403 samples were analyzed using custom CDFs, mostly produced by brainarray (Table 3). Given that fewer than 5% of all samples in GEO have been analyzed by alternative CDFs, an opportunity exists to reanalyze existing datasets according to updated transcript knowledge or functional regions of interest.
While microarrays have been successfully utilized for understanding differential expression at the gene or probe set level, less attention has been given to the potential analysis at the individual exon, alternative transcript, and untranslated region (UTR) level. While the selection bias of probes on the 3′ ends of genes for earlier iterations of Affymetrix® GeneChip® designs presents limitations on the completeness of transcript information, more recent designs allow for a more complete coverage of exons and exon junctions. However, information concerning individual exons can still be extracted from earlier GeneChip® designs, particularly in the 3′ UTR regions that have been shown to play important roles in cancer [15–17], development [18–22], and localization in the nervous system [23–27]. In fact, over 40% of genes have been shown to generate multiple mRNAs with variable 3′ UTR lengths . These 3′ UTRs harbor binding sites for molecules including microRNAs (miRNAs) and RNA-binding proteins. Thus, mRNA isoforms with lengthened 3′ UTRs have increased numbers of sites for these cis-interacting factors. The diversity of 3′ UTRs is predominantly regulated by alternative polyadenylation (APA), which employs alternative mRNA cleavage sites that lie progressively distal to the stop codon. APA-driven mRNA diversity is required for normal physiology, and misregulation of this process is associated with diverse disease states . We therefore have developed a framework for analysis of Affymetrix® GeneChip® data by regrouping probes into probe sets based on Ensembl annotations at the gene, transcript, individual exon, and UTR levels in order to detect changes in gene expression that may occur within specific regions of the transcript.
We developed an Affymetrix® GeneChip® probe remapping protocol at the level of genes, transcripts, untranslated regions (UTRs), coding sequences (CDS) and individual exons based on the latest genome (hg38, mm10, rn6) and Ensembl annotations (ENS-85) for human, mouse, and rat. The protocol takes annotations in a General/Gene Transfer Format (GTF)  file, generates a custom CDF where probes are grouped into probe sets based on region (UTR, CDS, individual exon), transcript or gene level. Here, we define individual exons as coding exons within protein coding genes, or all exons within structural RNAs (such as miRNA and lncRNA). In effect, the individual exons refer to all non-UTR portions of exons. Figure 1 shows the flow chart of annotation and grouping of probes based on the region of a gene. It is composed of three main steps: mapping probes to the genome, annotation of probes, and assignment of probes to probe sets based on annotations.
PM probe sequences, which can be obtained from the Affymetrix® Netaffx™ web site, are aligned to the indexed genome using Bowtie version 1.0.1  with the parameters -v 0 and –m 1, requiring that probes align to a single genomic location with 100% identity, thereby reducing cross-hybridization effects. Note that Bowtie version 1 is best at aligning shorter sequences (25-50 bp) as found with microarray probes while the most recent versions of Bowtie are optimized for long sequence reads (>50 bp). Mismatch (MM) probes are not considered in the mapping step, although they could theoretically map uniquely to genomic regions. Rather, the MM probes are set aside and are included with their corresponding PM probe during the final CDF construction step once the PM probes have been assigned to a probe set. During this analysis, only probes perfectly matching to a region are considered. Therefore, probes crossing splice junctions will be discarded.
Probes are annotated based on the overlap between probes and genomic intervals by the following steps.
Since GTF files obtained from Ensembl were used, Ensembl gene ids were employed to distinguish different genes and Ensembl transcript ids were used to distinguish different transcripts. When the generated CDF was based on regions of genes, the region was suffixed to the Ensembl gene id. Table 4 shows example probe set names taken from custom CDFs for the Affymetrix® GeneChip® HG-133 Plus 2.
We applied our framework to the three most widely used GeneChips®: HG-U133 Plus 2, Rat 230 2.0 and Mouse 430 2.0 (summarized in Tables 5 and and6).6). We also examined the effect of probe reannotation over the differentially expressed genes. Three types of CDFs were created for every selected organism. Our results discussed here are restricted to the analysis of the HG-U133 Plus 2 and Rat Genome 230 2.0 GeneChip® for brevity. After CDF creation, we reanalyzed the publicly available data series GSE48611  and GSE72551  from GEO via our custom CDFs.
Using the bowtie parameters as discussed in the methods section, we were able to identify probes that uniquely map with 100% identity for each of the respective genomes. As a result, 87% PM probes of the HG-U133 Plus 2, 84% PM probes of the Rat 230 2.0 and 86% PM probes of the Mouse 430 2.0 were uniquely mapped to the genome and were used in the subsequent steps (Table 7).
To annotate probes, we mapped uniquely aligned probes to gene regions using the most recent Ensembl genome and GTF file for each respective organism. We used the specific regions based on the custom CDF type (gene, transcript or region-based). Consequently we produced three types of custom CDFs (Tables 5 and and66).
The human gene based CDF has 22,651 custom designed probe sets composed from 414,701 probes and 62 original control probe sets. 442,025 annotations were identified between genes and the probes. 27,324 annotations were filtered after shared probes were removed. In order to validate our probe set annotations, we compared the original CDF probe sets with the custom CDF. A total of 21,585 annotated genes were shared between the two CDFs, with 3068 unique to the original CDF, and 1066 unique to our custom CDF. In order to determine why some genes were not covered in our CDF, we examined those unique to the original CDF. First we obtained the probe sets which represent these genes in the original CDF, yielding 2781 probe sets. We retrieved both the PM and MM probe sequences for each of these. We observed that for 667 probe sets, every probe was removed during probe mapping to the genome due to either non-unique mappings or mapping rates less than 100%. 30,150 probes from the remaining 2114 probe sets were not used in our CDF since they either did not map to the genome or they were MM probes. 14,028 probes were used in our newly constructed probe sets which target different genes than the original assignment by Affymetrix® and 2656 probes were not aligned to gene structures and not annotated. As a result, the differences between the original CDF and our method occurs because of probes removed during genome alignment, probes that no longer map to gene structure or probes that map to gene structures different from the original annotation.
For the rat 230 2.0 GeneChip®, the restriction of three probes per probeset yields 12,534 uniquely identified Ensembl genes at the gene level. We determined that for this specific GeneChip®, reorganization of the Affymetrix® probes into mRNA region-specific probesets provides 4024 unique Ensembl gene identifiers with probesets in both the 3′ UTR and CDS. Using this subset of probesets, differential expression of the CDS can then be compared to the 3′ UTR.
We reanalyzed the publicly available data series GSE72551 and GSE48611. Both of these studies involve the nervous system, where differences in 3′ UTRs are likely to have phenotypic effects on transcript localization. The GSE72551 data series examines gene expression changes associated with collateral sprouting and includes 5 naïve controls, 7 replicates at day 7 post-surgery and 7 replicates at day 14 post-surgery. The GSE48611 data series examines Down syndrome gene expression monitoring. This data set includes mRNA samples from the isogenic trisomy of chromosome 21 (Ts21) and control pluripotent stem cells (iPSCs) (DS1, DS4, and DS2U) between passages 24 and 48 and from day 30 neurons. Three biological replicates were present for each condition. Prior to analysis, we removed probe sets with two or fewer probes from the custom CDFs in order to achieve more accurate results for target expression levels. Robust Multiarray Averaging (RMA) normalization  was used for preprocessing. A p-value cutoff of 0.05 was used as the threshold for all experiments.
In the GSE72551 data series, differentially expressed genes (DEGs) were determined for two pairwise comparisons: naïve vs. both 7 and 14 days using region and gene based custom CDFs. We also reanalyzed the data using the brainarray Ensembl CDF version 20. Figure 3 shows a Venn diagram representing the number of differentially expressed genes using region, gene and brainarray custom CDFs for both cases.
Further examination of the 7 day versus naïve ENSEMBL genes found to be differentially expressed in either the gene-based or region-based CDF shows high concordance, with 975 ENSEMBL genes determined to be differentially expressed using both CDFs (Fig. 3a). Examination of the p-values shows a significant correlation between both the gene and the 3′ UTR region (r=0.439; p=1.480E-58) as well as between the gene and the exon region (r=0.101; p=0.001). The higher correlation with the 3′ UTR region is to be expected, due to a higher abundance of probes designed in these regions.
One hundred sixty genes are found to be differentially expressed using the gene-based approach only. Three genes are omitted completely from the region-based CDF. Further examination of the remaining 157 genes measured using both the gene-based and region-based methods shows that 122 of these (78%) have a gene-based p-value >0.03, and 80 (50%) have a gene-based p-value >0.04, indicating the detected differences are just below the cutoff level. Analysis of the region-based p-values show that 120 of these (77%) have a region-based p-value <0.10, and 146 (94%) have a region-based p-value <0.20, putting these genes just above the significance threshold.
An additional 423 genes are found to be differentially expressed using the region-based approach only, with 203 from the 3′ UTR only, 10 from the 5′ UTR only, 206 from the exon only, and 4 from both the 3′ UTR and exon. Unlike the DEGs uniquely found in the gene-based approach, those genes found to be differentially expressed in the region-based approach typically have a much higher p-value in gene-based analysis, with only 31% having a p-value between 0.05 and 0.10. This supports our reasoning that separating into functional regions allows detection of subtle changes in transcript formation that may have a larger functional impact of those transcripts which has been further validated by experimental work showing differential expression of the 3′ UTR of the CAMKIV gene plays a role in localization .
In order to determine why some genes were only detected by the brainarray CDF, we examined the probe sequences of those genes that are brainarray specific. Of these, 39 were excluded from our CDFs since they aligned to multiple locations in the rn6 genome. An additional ten of these probes did not match to known Ensembl gene structures and were thus removed. Eighteen of these probes were excluded because the probe set contained fewer than three probes. An additional 40 of the brainarray probes were used in our CDFs, but with annotations differing from brainarray due to changes in annotation information.
In the GSE48611 data series, DEGs were determined for two pairwise comparisons: isogenic Ts21 vs. control iPSCs for both DS1 and DS4. We reanalyzed the data using region, gene and the original Affymetrix® supplied CDF obtained from the Affymetrix® Netaffx™ web site. For DS1, our gene-based CDF identified an additional 194 DEGs not found using the original CDF and 616 DEGs identified by both methods. For DS2, our gene base CDF identified an additional 331 DEGs found using our method only and 337 DEGs identified by both methods (Table 8).
One of the limitations of microarray technologies is the design of probes based on available sequence and annotation data at the time of design. Based on our analysis, the percentage of uniquely mapping probes varies from 84% (rat) to 87% (human), indicating that changing knowledge about the genome itself plays a role in probe utilization. In terms of annotation, the rat genome is known to have more incomplete information when compared to mouse and human, which is reflected in the fact that only 47% of the rat probes lie in region-based locales (exons and UTRs) compared to 65% for mouse, and 69% for human. Since this can potentially lead to a small number of probes in each annotated region (and thus increased false positive rates), we have further required at least three probes be present in each probe set for our analysis. Both unrestricted (1 or more) and restricted (3 or more) probe groupings are available as CDFs.
To further illustrate the importance of region-based CDFs, using the subset of 4024 genes with probesets in both the CDS and 3′ UTR regions, we were able to identify 203 differential expression events at the 3′ UTR level that do not show differential expression within the CDS. In addition, these events are not detected using the standard Affymetrix® CDF. Further analysis of these 203 genes yields some genes of particular interest. For instance, the 3′ UTR of GRIK4 (Glutamate Ionotropic Receptor Kainate Type Subunit 4) was up-regulated (p-value 0.0450) while the CDS was not significantly regulated (FC=1.07; p-value 0.4525), suggesting the 3′ UTR of this gene was lengthened (Fig. 4). GRIK4 regulates kainite-receptor signaling and neuroplasticity  and its missregulation is associated with neurological diseases including Alzheimer’s , bipolar disorder , and others. Interestingly, a deletion variant specific to the 3′ UTR of GRIK4 is protective of bipolar disorder . Alongside our observation, this suggests that regulation of this plasticity-associated gene occurs though its 3′ UTR. We also observed that the 3′ UTR of VEGFA (vascular endothelial growth factor-A) was downregulated (−1.17 FC; p=0.0102) and expression of its CDS was unchanged (1.01 FC; p=0.8334) (Fig. 5). The 3’UTR of VEGFA, a potent neuromodulator, undergoes a well-described binary switch to regulate its expression . Our observations suggest the VEGFA 3′ UTR undergoes an additional layer of regulation by shortening during collateral sprouting.
As our analysis with the GSE48611 and GSE72551 datasets show, reanalysis of publicly available datasets using updated annotations can yield additional information when compared to the use of the original CDFs. In our case, the region-based CDFs allow for a better understanding of 3′ UTR dynamics through the reanalysis of publicly available data. While current high-throughput sequencing technologies may allow for a more complete picture, this custom CDF approach will allow for deeper insight with only minimal computational cost, taking advantage of the high volume of publicly available GeneChip® data.
We proposed a framework for reannotating and reassigning probe groups for Affymetrix® GeneChip® technology based on functional regions of interest. Our work differs from others in that we annotated probes in UTR and exon levels in addition to gene and transcript (isoform) levels. We illustrated how this framework affects the detection of differentially expressed genes, particularly when focusing on functional regions of interest. Removing probes that no longer align to the genome without mismatches or align to multiple locations can help to reduce false-positive differential expression, as can removal of probes in regions overlapping multiple genes.
The main motivation of our work was profiling the contribution of UTR and exon regions to the gene expression levels globally. Our results indicate that features differentially expressed in either the gene-based or region-based CDF show high concordance and separating out into functional regions allows for the detection of subtle changes in transcript formation.
The authors wish to thank members of the Kentucky Biomedical Research Infrastructure Network Bioinformatics Core, the University of Louisville Bioinformatics Journal Club, and members of the University of Louisville Bioinformatics and Biomedical Computing Laboratory for helpful insight, project review, and suggestions.
Publication charges provided by National Institutes of Health grant P20GM103436. Research support provided by NIH grants P20GM103436 and R01NS094741. The contents of this manuscript are solely the responsibility of the authors and do not represent the official views of NIH.
The datasets supporting the conclusions of this article are available in the figshare repository, 10.6084/m9.figshare.3840144. Individual custom CDFs can also be accessed at: http://bioinformatics.louisville.edu/RegionCDFDesc.html.
This article has been published as part of BMC Genomics Volume 18 Supplement 10, 2017: Selected articles from the 6th IEEE International Conference on Computational Advances in Bio and Medical Sciences (ICCABS): genomics. The full contents of the supplement are available online at https://bmcgenomics.biomedcentral.com/articles/supplements/volume-18-supplement-10.
ES was responsible for code preparation, development of the project, and manuscript preparation. ECR and JCP developed the overall project goals. ECR supervised the overall project, provided the necessary lab space and computational resources for project completion, and led development of the manuscript. JCP and BJH provided test data, analyzed results, and reviewed the manuscript. KW performed testing of UTR analysis of microarrays and reviewed the manuscript. All authors have read and approved the final manuscript.
The authors declare that they have no competing interests.
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Ernur Saka, Email: firstname.lastname@example.org.
Benjamin J. Harrison, Email: ude.enu@2nosirrahb.
Kirk West, Email: ude.smau@tsewk.
Jeffrey C. Petruska, Email: email@example.com.
Eric C. Rouchka, Email: firstname.lastname@example.org.