Regulated mRNAs during differentiation of rat neural stem cells were analyzed using the AB1700 microarray platform. This microarray, while technically advanced, suffers from the difficulty of integrating hybridization results into public databases for systems-level analysis. This is particularly true for the rat array, since many of the probes were designed for transcripts based on predicted human and mouse homologs. Using several strategies, we increased the public annotation of the 27,531 probes from 43% to over 65%. To increase the dynamic range of annotation, probes were mapped to numerous public keys from several data sources. Consensus annotation from multiple sources was determined for well-scoring alignments, and a confidence-based ranking system was established for probes with less agreement across multiple data sources. Previous attempts at genomic interpretation using the Celera annotation model resulted in poor overlap with expected genomic sequences. Since the public keys are more precisely mapped to the genome, we could now analyze the relationships between predicted transcription-factor binding sites and expression clusters. Results collected from a differentiation time course of two neural stem cell clones were clustered using a model-based algorithm. Transcription-factor binding sites (TFBS) were predicted from upstream regions of mapped transcripts using position weight matrices from either JASPAR or TRANSFAC, and the resulting scores were used to discriminate between observed expression clusters. A classification and regression tree analysis was conducted using cluster numbers as gene identifiers and TFBS scores as predictors, pruning back to obtain a tree with the lowest gene class prediction error rate. Results identify several transcription factors, the presence or absence of which is sufficient to describe clusters of mRNAs that change over time, those that are static, and clusters describing cell line differences.
Public annotation of the AB1700 rat genome array will be valuable for integrating results into future systems-level analyses.
DNA microarrays have become an invaluable tool for researchers seeking to analyze large numbers of mRNAs in a given biological system. The parallel analysis of thousands of transcripts allows one to identify specific, regulated genes or to study large sets of genes over many treatment conditions. The benefit and utility of DNA microarrays is, however, completely dependent on the quality of the annotation information associated with each nucleic acid probe. The association of biological information with specific probes is required for context-specific regulation studies, exploration of biological pathways/gene ontologies, and transcription factor regulation analysis.
When attempting to interpret data from the Applied Biosystems 1700 Rat Genome Survey array, we found that many of the probe annotations did not link with public sources. This problem seemed to be caused by two issues. First, the array manufacturer had designed the array using the Celera Discovery System1 (CDS) genome model of rat as well as hidden Markov model (HMM)-based homologies to mouse and human annotations.2 The CDS genome was originally provided to end users as a subscription service, but was discontinued nearly two years ago. Probe annotations linked to the Celera genome and its transcripts have never been linked to public databases. Second, the rat genome, while completed by a public consortium almost two years ago, has lagged behind other mammalian models in the detail of its annotations.
In its initial launch in 2001, the CDS contained the current assemblies as well as associated annotations for the privately sequenced human and mouse genomes. In 2004, a complete assembly of the rat genome was added, along with a set of predicted rat genes, mostly based on homology to mouse and human sequences. This information was unavailable to the public, and data could be accessed and mined only through the subscription-based CDS. In May of 2005, Celera Genomics closed the Discovery system and vowed to make all data publicly available through the National Center for Biotechnology Information (NCBI) GenBank database. This process was recently completed; however, the Celera gene IDs, referenced directly by AB1700 probes, were dropped as a result of the union of the two datasets, effectively limiting the useful information that can be readily obtained from any given microarray experiment. However, NCBI has assembled both the public and released Celera contigs into full genome sequences. NCBI also provides a curated reference sequence (RefSeq) transcript database that is updated regularly. Using these public sources, combined with other publicly available rat genome resources, we wished to annotate the array to allow more advanced analysis of our results.
Other groups have considered that commercial microarray probes should be frequently re-annotated based on updated genome models and UniGene clustering.3 We reasoned that a simple BLAST alignment would provide annotations as complete as possible using documentable public database identifiers. Furthermore, we considered that different databases should be treated with an ordered value system to give higher confidence to the best quality biological information. By sequentially searching databases in order of confidence, we propose that the biological information assigned to each probe would be most useful to researchers.
To link existing probes to the best quality public annotations, we aligned each probe sequence from the AB1700 Rat Genome Survey array to all current, publicly available rat sequences from multiple repositories.
Two cell clones were isolated by our colleague, Dr. Hedong Li, from v-myc-transformed E15 rat cortical cells.4 Both expressed nestin, a marker of neural stem cells, and one (L2.3) also expressed a marker for radial glial cells (BLBP [FABP7]) and was selected based on radial-glial-like morphology. Upon withdrawal of bFGF, one clone (L2.2) differentiates into neuron-like (TuJ1+) cells almost exclusively, eventually becoming electrically active GABA-ergic cells (Li et al., in preparation), and the other clone (L2.3) differentiates into a mixed neural phenotype, expressing neuronal (TuJ1), astrocytic (GFAP), and oligodendroglial (GalC) markers in individual cells. We prepared replicate cultures (n = 3) of cells at 0, 1, or 3 d following bFGF withdrawal for each of the two cell clones.
RNA was prepared using the mirVana miRNA isolation kit (Ambion/Applied Biosystems), which produces both low-molecular-weight (LMW) RNA for microRNA analysis and high-molecular-weight (HMW) RNA for mRNA analysis. Two micrograms of HMW RNA was labeled using the Chemiluminescent RT Labeling Kit (Applied Biosystems) and hybridized to AB1700 Rat Genome Survey Arrays following the manufacturer’s protocols.
Data extracted from the scanned arrays were processed using R/BioConductor scripts provided by Applied Biosystems. Raw data were quantile normalized and a linear model was fit to the data, estimating both cell line and differentiation effects; 1,337 probes exhibiting significant differences in either effect were selected, with an accepted false discovery rate of 5%. Significant probes were segregated into nine distinct clusters using a novel, model-based clustering method.3
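The quantile normalization step mentioned above can be illustrated with a minimal Python sketch. This is not the R/BioConductor script supplied by Applied Biosystems; the function name `quantile_normalize` and the toy values are assumptions made for illustration, and ties are broken by position for simplicity.

```python
def quantile_normalize(columns):
    """Quantile-normalize equal-length columns (one list of signals per array).

    Each value is replaced by the mean, across arrays, of the values that
    share its rank. Ties are broken by position (a simplification).
    """
    n = len(columns[0])
    sorted_cols = [sorted(col) for col in columns]
    # Mean of the k-th smallest value across all arrays.
    rank_means = [sum(col[k] for col in sorted_cols) / len(columns)
                  for k in range(n)]
    normalized = []
    for col in columns:
        # Probe indices ordered from smallest to largest signal.
        order = sorted(range(n), key=lambda i: col[i])
        out = [0.0] * n
        for rank, i in enumerate(order):
            out[i] = rank_means[rank]
        normalized.append(out)
    return normalized


# Toy data (hypothetical): two arrays, three probes each.
arrays = [[5.0, 2.0, 3.0], [4.0, 1.0, 6.0]]
normalized = quantile_normalize(arrays)
```

After normalization every array has the same distribution of values, so remaining differences between arrays reflect rank changes of individual probes.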
Sequences for 26,857 probes from the AB1700 Rat Genome Survey Array were provided by Applied Biosystems (Foster City, CA). All available curated rat transcripts were obtained from the following sources: NCBI RefSeq (http://www.ncbi.nlm.nih.gov/projects/RefSeq/), NCBI Entrez Nucleotide DB (http://www.ncbi.nlm.nih.gov/entrez) (including dbEST, GenBank, and various other non-RefSeq rat transcripts), and Ensembl release 42 (http://www.ensembl.org/Rattus_norvegicus/). Rat genomic sequences were obtained from both the reference genome assembly4 and the Celera assembly1 from the public repository at NCBI. All sequences and associated annotation were stored locally on our LAMP model bioinformatics server. Each ~60-mer probe sequence was aligned to all available transcripts across all public sources using NCBI BLAST5 running on a dedicated Linux server. Alignments were retained using an expect-value (E-value) cutoff of 1.0 × 10⁻⁶.
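As an illustration of the E-value cutoff, the following Python sketch filters BLAST hits stored in the standard 12-column tab-delimited format. It assumes the alignments were exported in that format, which the text does not state; the function names and the probe and accession identifiers in the example are hypothetical.

```python
import csv
import io

# Column order of the standard 12-column BLAST tabular output.
COLUMNS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
           "qstart", "qend", "sstart", "send", "evalue", "bitscore"]


def filter_hits(handle, max_evalue=1.0e-6):
    """Yield hits (as dicts) whose E-value is at or below max_evalue."""
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        hit = dict(zip(COLUMNS, row))
        hit["pident"] = float(hit["pident"])
        hit["length"] = int(hit["length"])
        hit["evalue"] = float(hit["evalue"])
        if hit["evalue"] <= max_evalue:
            yield hit


def is_perfect_full_length(hit, probe_length):
    """True when the alignment spans the whole probe at 100% identity."""
    return hit["pident"] == 100.0 and hit["length"] == probe_length


# Made-up rows for illustration only.
example = io.StringIO(
    "probe_1\tNM_000001\t100.00\t60\t0\t0\t1\t60\t101\t160\t2e-27\t111\n"
    "probe_1\tXM_000002\t95.00\t40\t2\t0\t1\t40\t5\t44\t3e-05\t50.1\n"
    "probe_2\tNM_000003\t98.33\t60\t1\t0\t1\t60\t7\t66\t1e-24\t102\n"
)
kept = list(filter_hits(example))
```

The helper `is_perfect_full_length` corresponds to the stricter criterion applied later when grading annotations, where only alignments with 100% identity over the entire probe are considered.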
To interpret clusters using potential regulatory sequences, available 1-Kb regions upstream of the 1337 significant probes were searched for putative transcription-factor binding sites (TFBS) using the Match Algorithm, and associated vertebrate position weight matrices (PWM) included in release 10.4 of the Transfac database (Biobase Corporation, Beverly, MA).
Cluster labels were treated as class labels for the purpose of detecting discriminating combinations of predicted TFBS. We explored different scoring mechanisms for the presence of a motif, and the best results were obtained using a measure that incorporates the number of hits as well as the score of a top hit. That is, the top 10 scores for each motif were recorded, and a total score was obtained as max(score)*range(10 scores). This measure is large if the top score is large and/or there are many moderate (multiple) hits in the promoter for this motif. Finally, to find the discriminating TFBS, we fit a classification tree to the data using the CART software in R.6 The CART method selects a sequence of TFBS that optimally separates the gene classes. The first split in the tree is thus the single TFBS (TF1) that best separates the gene classes. The next two splits find the two TFBS that further improve the gene class separation: TF2(1+) best separates the gene classes with TF1 present, and TF2(1–) best separates the gene classes with TF1 absent. The tree is grown until adding more splits results in no further improvement. To protect against over-fitting, we used 10-fold cross-validation, repeatedly splitting the data at random into a training set (90%) and a test set (10%). We then pruned back the tree to obtain the tree with the lowest gene class prediction error rate on the test set.
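The idea of the first split can be sketched in a few lines of Python. This is not the R CART software used in the study. The sketch assumes Gini impurity as the splitting criterion, which the text does not specify, and it omits tree growth, cross-validation, and pruning. The motif names and scores are toy values.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (0 means a pure group)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def best_first_split(scores, labels):
    """Return (motif, threshold, weighted_impurity) for the best single split.

    scores: dict mapping motif name -> per-gene scores (same order as labels)
    labels: cluster label of each gene
    """
    n = len(labels)
    best = None
    for motif, values in scores.items():
        distinct = sorted(set(values))
        # Candidate thresholds: midpoints between consecutive distinct scores.
        for lo, hi in zip(distinct, distinct[1:]):
            t = (lo + hi) / 2.0
            left = [lab for v, lab in zip(values, labels) if v <= t]
            right = [lab for v, lab in zip(values, labels) if v > t]
            impurity = (len(left) * gini(left) + len(right) * gini(right)) / n
            if best is None or impurity < best[2]:
                best = (motif, t, impurity)
    return best


# Toy data (hypothetical): six genes in two clusters, two motifs.
labels = [1, 1, 1, 2, 2, 2]
scores = {
    "motif_X": [0.0, 0.0, 0.0, 4.0, 6.0, 4.0],  # separates the clusters
    "motif_Y": [1.0, 3.0, 1.0, 3.0, 1.0, 3.0],  # uninformative
}
motif, threshold, impurity = best_first_split(scores, labels)
```

In this toy case the search selects `motif_X`, whose scores separate the two clusters completely. A full tree would repeat the same search within each resulting group of genes.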
Prior to our analysis, AB provided public database annotations for many of the 26,857 probes on the array, but left 9978 with no public annotation. In September 2006, AB released an updated annotation, which decreased the number of probes without public annotation to 6042. To validate and extend these annotations, we obtained the sequences for all probes and aligned them by BLAST against the rat sequences in the major public databases. These results are summarized in Table 1.
To determine annotations with the highest confidence, we selected only BLAST results aligning probes to target sequences with 100% identity across the entire length of the probe. Among those perfect matches, we next classified matches according to a confidence scale. Preference was given to the RefSeq dataset because all sequences designated as a reference sequence must be hand curated, and are not an immediate result of computational gene predictions. Additional confidence was granted to UniGene cluster members because this gene classification is a widely accepted and informative gene identifier. Probes matching one or more curated RefSeq transcripts that could be directly mapped to a single UniGene cluster were given the highest grade, “A.” This class includes 14,098 probes and can be considered to be the probes with the best annotation, since the curated RefSeq database should ideally represent individual, known transcription units on the genome. Two other classes of RefSeq alignments were identified. First, 274 probes with perfect alignment to RefSeq sequences, but mapping to more than one distinct UniGene cluster, were given a “B” grade. In several cases, the multiple UniGene clusters were a result of redundancy in the gene clustering at NCBI. In many cases, the multiple-cluster IDs are effectively the same gene, and the hand curation of the UniGene clusters has not caught up with the automated annotation. Regardless, the confidence in these identifications is diminished due to the multiple matches, thus warranting the lesser grade. The second class includes alignments to RefSeq entries that are not found in any UniGene cluster. These may represent RefSeq records created for UniGene clusters that are retired in the current build, or records that have not yet been assigned a UniGene cluster ID; 1306 probes were given a “C” grade annotation if they demonstrated perfect alignment to one or more RefSeq transcripts that could not be mapped to a UniGene cluster ID.
The next category of database matches considered comprised probes aligning with UniGene cluster members that are not RefSeq records. The grade “D” was reserved for the 3535 probes with perfect matches to any non-RefSeq public transcript at NCBI (dbEST, GenBank, etc.) mapping to a single UniGene cluster, followed by “E”-grade probes (357), which matched multiple UniGene IDs. Probes aligning to a non-RefSeq sequence that could not be mapped to a UniGene ID (63) were given the grade “F.”
Probes were also aligned to predicted rat transcripts from the Ensembl dataset. Ensembl transcript predictions matched 12,688 of all probes, adding 354 new annotations not found in UniGene cluster members. These probes that could not be mapped to any curated sequence at NCBI were given the annotation grade “G.”
Probes were additionally aligned directly to the rat reference genome, and those 5806 probes that could not be aligned to public transcript records were given the lowest annotation grade “H”; 20,915 probes aligned to single genomic loci, and this provided annotation for 5733 probes that were not previously identified. Probes with multiple genomic matches (652) allowed identification of 73 additional probes.
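The grading scheme described above (grades “A” through “H”) amounts to an ordered series of checks, which the following Python sketch makes explicit. It is a reading of the prose, not the code used in the study; the function name, argument layout, and all identifiers in the example are hypothetical.

```python
def annotation_grade(refseq_hits, ncbi_hits, ensembl_hits, genome_hits,
                     unigene_of):
    """Return a grade 'A'..'H', or None when the probe has no perfect match.

    refseq_hits, ncbi_hits: accessions of perfectly matched RefSeq and
        non-RefSeq NCBI transcripts
    ensembl_hits, genome_hits: perfectly matched Ensembl transcripts and
        genomic loci
    unigene_of: dict mapping accession -> UniGene cluster ID
        (accessions without a cluster are simply absent)
    """
    def by_clusters(hits, single, multiple, unmapped):
        clusters = {unigene_of[h] for h in hits if h in unigene_of}
        if len(clusters) == 1:
            return single
        return multiple if clusters else unmapped

    if refseq_hits:                      # curated RefSeq transcripts first
        return by_clusters(refseq_hits, "A", "B", "C")
    if ncbi_hits:                        # other NCBI transcripts next
        return by_clusters(ncbi_hits, "D", "E", "F")
    if ensembl_hits:                     # Ensembl predictions only
        return "G"
    if genome_hits:                      # genomic alignment only
        return "H"
    return None


# Hypothetical accessions and cluster IDs for illustration.
unigene_of = {"NM_1": "Rn.1", "NM_2": "Rn.2", "BC_1": "Rn.1"}
grades = [
    annotation_grade(["NM_1"], [], [], [], unigene_of),
    annotation_grade(["NM_1", "NM_2"], [], [], [], unigene_of),
    annotation_grade(["NM_9"], [], [], [], unigene_of),
    annotation_grade([], ["BC_1"], [], [], unigene_of),
    annotation_grade([], [], ["ENS_1"], [], unigene_of),
    annotation_grade([], [], [], ["chr1:100"], unigene_of),
    annotation_grade([], [], [], [], unigene_of),
]
```

Because the checks run in order of confidence, a probe receives the grade of the best evidence available, regardless of how many lower-ranked matches it also has.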
This strategy extracted annotations that were prioritized by quality. Only 1127 probes could not be aligned with any existing public rat sequences used in this procedure. It is important to note that despite the lower score, all probes having an annotation grade represent perfect alignments to one or more curated, publicly annotated transcripts. The annotation grade does not reflect confidence in the quality of the probe sequence itself, but rather the amount and confidence of public information associated with a given probe. Indeed 36 of our significantly regulated probes could not be mapped to any public transcript or genomic sequence, and yet demonstrate detectable and altered expression in our neural differentiation study.
To determine whether our extended annotation provided useful new biological interpretations, we assembled the major gene ontology categories represented among the group of probes with new public annotation contained within our significantly regulated probe list. We were able to readily obtain gene ontology categories for 1127 of our 1337 significant probes, since they were directly mapped to a single UniGene cluster. Only 431 of these probes were originally mapped by Applied Biosystems to UniGene clusters. Subtracting these probes to emphasize the information content obtained with the updated annotation, we present the top 25 represented Gene Ontology categories (Figure 1). This summarizes the contribution of newly annotated probes to the understanding of the entire significant probe list in context.
Our purpose in extending the public annotations of the AB1700 rat array was to examine our gene expression data for potential transcriptional control sequences explaining mRNA regulation. Results were obtained from two cell clones (neurogenic and multipotential) at three time points of differentiation (0, 1, or 3 d following bFGF withdrawal), run in triplicate. Data were extracted, normalized, and selected for significant changes by cell clone or by time after bFGF withdrawal (see Methods), and the 1337 significant probes were clustered into models representing expected expression patterns (Figure 2). For example, one cluster target was a cell-type difference that did not change over time (clusters 1, 3, and 7). Another cluster target represented mRNAs demonstrating dynamic changes during differentiation, regardless of a cell-line effect (e.g., clusters 2, 5, 8, and 9). Figure 2 depicts the centers of each cluster.
To determine whether transcription factor binding sites (TFBSs) in predicted promoter regions could explain differences between clusters, we collected 1-Kb sequences from each of 1127 annotated sequences. These sequences were searched for TFBSs using position weight matrices from either JASPAR or TRANSFAC databases. A classification and regression tree (CART) analysis was conducted using cluster numbers as gene identifiers and TFBS scores as predictors, pruning back to obtain a tree with the lowest gene class prediction error rate. The resulting tree (Figure 3) depicts the result of this analysis, listing each TFBS that distinguishes two sets of clusters based on their presence or absence in the 1-Kb upstream regions. Several TFBSs split off a single cluster without separating the clusters into clear subcategories. However, Mef2 splits clusters into two groups—identified as different by cell type or different by time. The primary position of Mef2 in this model suggests that it may play a fundamental role in distinguishing neurons from other cell types produced by NSC. This hypothesis is currently being tested.
Microarray annotations provide the link between observed mRNA levels and biological interpretations. Most commercial and custom microarrays provide full information on the sequences of probes used as well as the database record from which the probe was designed, usually a GenBank accession number. At the time of its design, the use of the Celera Discovery Systems rat genome was considered to be an advantage since the CDS builds were more complete than public records. However, by the time the rat AB1700 array came into public use, the public rat build was released and CDS was discontinued, leaving a designed array with limited annotation.
We completed a BLAST search for all probes on the array using a hierarchical strategy based on relative confidence in various databases as well as the biological information associated with each database. Highest value was given to probes that matched a single RefSeq as well as a single UniGene cluster, since RefSeq is curated by NCBI staff and since all known transcripts arising from a transcription unit ideally should be clustered together using UniGene. We next proceeded to annotate probes having multiple matches to one or more transcript databases. Finally, probes having only genome matches were included. Annotation data, both new and original public annotation, are summarized in Table 2. Complete annotation data as well as the categories of the best matches are found in Supplemental Table 1 (http://cord.rutgers.edu/appendix/jbt/Supplemental_Table_1.xls). These data will be released through the Probes database at NCBI.
Using these new annotations, we searched for putative promoter elements that could explain different clusters of mRNAs regulated during NSC differentiation. Using the Match algorithm (BioBase), we scanned 1-Kb regions upstream of UniGene clusters obtained from the updated annotation of probes from our significance list. A CART analysis of potential transcription factors yielded several important clues regarding the previously ignored transcription factor family Mef2. First, the result demonstrated an enrichment of binding sites for this family of gene regulators that would not have been initially suspected. There is evidence to suggest that members of this gene family, previously studied in muscle differentiation, may in fact play a role in neural differentiation as well. Secondly, the CART analysis suggests that the presence or absence of Mef2 binding sites separates the significant genes into two groups—those dynamically regulated during neural differentiation, and those that are unchanging in our time course. Again, this result hints at the potential role of Mef2 family members in regulating and/or participating in neural cell differentiation.
Originally identified in differentiating myocytes,7 the Mef2 family of genes comprises a group of DNA-binding transcription factors belonging to the minichromosome maintenance 1-agamous-deficiens-serum response factor (MADS) family. Members of this family contain the highly conserved N-terminal MADS domain, which mediates dimerization and binding activity to the A/T-rich consensus sequence CTA(A/T)₄TAG/A. As with many other MADS-containing genes, Mef2 proteins interact with a wide range of transcription factors and various other modifying proteins. The wide array of binding partners creates a diverse population of genes that are likely to be affected by Mef2 activity downstream.
Various Mef2 isoforms have also been found in the nervous system. Mef2 is expressed in the neurons of C. elegans and in Kenyon cells of D. melanogaster8, 9; however, its function has not been investigated in neural development. In mammals, Mef2 family members are expressed in neural crest cells as early as E8.5 and in the brain by E12.5.10 All four isoforms (A–D) are present in developing cortex and olfactory bulb, although their expression patterns do not necessarily overlap. Mef2A has additionally been identified in hippocampus, thalamus, and the internal granular layer of the cerebellum, where it is associated with granule neuron differentiation markers such as gamma aminobutyric acid (GABA).11 Mef2B follows a similar expression pattern, but is found only in the cortex, olfactory bulb, and dentate gyrus after development.12 Mef2D can be located throughout the developing nervous system and on through adulthood.12 Expression of Mef2C in the developing brain is slightly different. This isoform is expressed exclusively in select cortical neurons present in the external granular layer (II), the internal granular layer (IV), and the fusiform layer (VI).13–15 This expression pattern is seen both during development and in adult animals,13–15 and strongly associates Mef2C with the presence of interneurons. These expression patterns are consistent with a role for Mef2 family members in neural development.
While only recently shown to be expressed in the developing cortex, Mef2 gene activity has already been associated with key neural pathways. Members of this gene family promote neuron survival in both Ca²⁺-dependent16, 17 and neurotrophin-mediated18, 19 mechanisms, act as a switch controlling post-synaptic dendrite differentiation,19 and suppress neuronal apoptosis in NMDA-induced excitotoxicity.20 Additional roles for Mef2 genes in regulating the acquisition of the neuronal phenotype are beginning to emerge as well. The temporo-spatial expression of the Mef2 genes, the bioinformatic identification of a pivotal role for Mef2 binding sites, and previous associations of Mef2 family members in neuronal pathways support the hypothesis that this family plays a crucial role in the induction of neurogenesis in the developing brain.
Linking AB1700 rat genome survey array probes to public databases should allow the use of these probes in clustering, gene ontology classification, pathway analysis and other bioinformatic analyses. We have demonstrated one use of the updated annotation in facilitating a higher-level analysis through identification of transcription start sites and associated upstream regions. We invite researchers using the AB1700 Rat Genome Survey arrays to download the updated annotations from our website (http://cord.rutgers.edu/appendix/jbt/Supplemental_Table_1.xls).
We thank Robert Genuario of Applied Biosystems for help with obtaining probe sequences and Drs. Qi Wang and Andrew I. Brooks for hybridization and image analysis of the AB1700 arrays. Supported by grants from the New Jersey Commission on Spinal Cord Research, the New Jersey Commission on Science & Technology, NIH, and Invitrogen, Inc. JLD is a graduate fellow of the New Jersey Commission on Spinal Cord Research.