We aim to determine the biological relevance of genes identified through microarray-mediated transcriptional profiling of Xenopus sensory organs and brain. Difficulties with genetic data analysis arise because of limitations in probe set annotation and the lack of a universal gene nomenclature. To overcome these impediments, we used sequence based and semantic linking methods in combination with computational approaches to augment probe set annotation on a commercially available microarray. Our curation efforts enabled linkage of probe sets and expression data to public databases, increased the biological significance of our microarray data, and assisted with the tentative identification of unidentified probe sets.
Xenopus is a well established model organism for cellular and genetic investigations of complex biological processes during development (Nieuwkoop and Faber, 1967; Pollet et al., 2005; Wullimann et al., 2005). Microarrays increasingly are implemented as powerful tools for large scale assessment of gene expression patterns during essential features of Xenopus life such as embryonic tissue specification, organogenesis, neural induction, and hormonal signalling (Altmann et al., 2001; Munoz-Sanjuan et al., 2002; Baldessari et al., 2005; Das et al., 2006). In our laboratory, Xenopus is implemented as a model for identification of genes that are implicated in inner ear function and sensorineural organogenesis (Varela-Ramirez et al., 1998; Serrano et al., 2001; Quick and Serrano, 2007). To this end we are using the Affymetrix GeneChip® Xenopus laevis Genome Array for transcriptional profiling of inner ear organs as a method for large scale identification of genes specific to the inner ear, especially those involved in inner ear development (Powers et al., 2007). Our comparative approach analyses Xenopus inner ear transcriptional profiles together with those of brain and other Xenopus organs. These experiments are undertaken with the long term goal of uncovering genes essential for the process of mechanotransduction and the maintenance and regeneration of the mechanosensory hair cell phenotype.
During our analysis of Xenopus microarray expression data, we observed that the vendor supplied annotation information for the Xenopus laevis GeneChip® probe set IDs (Xl-PSIDs) was a key limiting factor in our ability to interpret results. In part, this is because genetics research with the tetraploid Xenopus species, X. laevis, can be inherently difficult. The X. laevis genome has not been sequenced and the majority of X. laevis gene annotations arise from ESTs and cDNA libraries. Consequently, gene annotations for X. laevis may not be as comprehensive as those for X. tropicalis, a diploid sister species that is viewed as a superior alternative for molecular genetics experiments due to its smaller genome and shorter generation time (Amaya, 2005). The X. tropicalis genome has been sequenced and is available through the University of California, Santa Cruz (UCSC) genome browser (Kent et al., 2002).
Thus, a central theme that emerged in our genetic analysis is that variations in gene nomenclature and functional descriptions can cause ambiguities when we attempt to extract biological significance from Xl-PSID annotations. What is the remedy for this problem, which is a challenge shared by many investigators? This issue is especially problematic for researchers who work with organisms other than human or mouse, or with organisms having an unsequenced or poorly annotated genome. Some laboratories have solved this problem by developing methods to match genes to multiple gene identifiers and integrate this information into a queryable database (Glasner et al., 2003). For example, Dai et al. (2004) developed the GeneView system to provide more extensive gene annotation information for microarray chips than their vendors supply.
In this paper we summarise our efforts to enhance the annotations of the Xl-PSIDs using curation procedures that linked them to more informative gene identifiers such as the HGNC symbols (HUGO Gene Nomenclature Committee) and UniProt (Universal Protein Resource) IDs (Table 1). HGNC symbols are derived from the official gene name of a human protein approved by the HUGO Gene Nomenclature Committee (http://www.genenames.org/). As in the case of HGNC symbols, a consortium oversees the curation of protein sequences with accurate and functional annotations for UniProt IDs (The UniProt Consortium, 2007). Thus, linkage of Xl-PSIDs to these public databases increases the value added from the enhanced annotation because a large scientific community regularly oversees and updates the gene identifiers, thereby minimising the impact of the lack of standardisation of gene nomenclature. We developed three computational strategies that relied either on sequence similarity (Figure 1(A)) or semantic similarity (Figure 1(B)). Our results demonstrate how manual and automated annotation identification techniques can be used to enrich the available information that can be mined from a transcriptional profiling experiment in a manner that is cost-effective and easily achievable for a small laboratory operation.
The 15,611 PSIDs on the Affymetrix GeneChip® Xenopus laevis Genome Array comprise 15,503 probe sets representing 14,400 transcripts (16 probe pairs per transcript) that are a combination of mRNAs and ESTs with a bias towards 3′ UTRs. Within this total, 135 PSIDs are Affymetrix controls. For the purpose of this research paper we will only discuss the 15,476 PSIDs that begin with the ‘Xl.’ label because the transcripts used to design these probe sets can be identified in the UniGene clusters by their X. laevis UniGene ID. Annotation information for Xl-PSIDs is provided online at The NetAffx Analysis Center (http://www.affymetrix.com/analysis/index.affx) or in the vendor supplied annotation file, Xenopus_laevis.na25.annot.csv (http://www.affymetrix.com/support/technical/byproduct.affx?product=xenopus). The annotation provided for the Xl-PSIDs includes information about gene identifiers such as a one line gene title, a gene symbol, the archival UniGene cluster, the UniGene ID, the UniGene cluster type, the Entrez gene ID, the RefSeq protein ID, the RefSeq transcript ID, Gene Ontology (GO) terms with GO ID number, the Swissprot ID, and the GenBank Accession number.
After inspecting the annotation file from Affymetrix, we determined that the one line ‘Gene Title’ column contained information that could be used to group 98% of Xl-PSIDs into nine different categories (Table 1): Category X, those without a designated ‘Gene Title’; Category O, so named for ‘other’, contains one line descriptions consisting of gene names or anything else that does not fall into the other eight categories; Category C has designations with cDNA IMAGE numbers (integrated molecular analysis of genomes and their expression, Miller et al., 1997) or numbered cDNA clones; Category H contains hypothetical proteins (of which 20 have additional modifiers such as a putative gene name for the hypothetical protein); Category M contains the MGCXXXX protein numbers from the mammalian gene collection (http://mgc.nci.nih.gov/); and four categories are designated as Transcribed Locus (TL), with TLS, TLM, and TLW abbreviated for transcribed loci that are strongly (>90%), moderately (70–90%) or weakly (<70%) similar to another protein in an aligned region (percentages of similarity are defined by UniGene (http://www.ncbi.nlm.nih.gov/UniGene/FAQ.shtml)). As can be seen in Figure 2, most of the Xl-PSIDs were allocated into the following categories: O (other, 30%), TL (transcribed locus, 26%), and H (hypothetical proteins, 21%). The quality of annotation information provided for the nine different categories varies, with the encircled groups (TL, M, H, and C) in Figure 2 having the most ambiguous gene descriptions. These groupings were used to analyse the outcomes of our enhancement of the Xl-PSID annotations before and after linkage to HGNC symbols or UniProt IDs.
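The grouping of one line gene titles into the nine categories can be illustrated with a short Python sketch. It is not the procedure we used: the text patterns, the ‘---’ placeholder for a missing title, and the order in which the rules are tested (for instance, whether a title such as ‘hypothetical protein MGC…’ belongs to H or M) are assumptions made for illustration, and the example titles are invented.

```python
import re
from collections import Counter


def classify_gene_title(title):
    """Assign a one-line gene title to one of the nine category codes.

    All patterns and their precedence are illustrative assumptions.
    """
    text = title.strip()
    low = text.lower()
    if text in ("", "---"):  # assumed placeholder for a missing title
        return "X"
    if low.startswith("transcribed locus"):
        for word, code in (("strongly", "TLS"),
                           ("moderately", "TLM"),
                           ("weakly", "TLW")):
            if word + " similar" in low:
                return code
        return "TL"  # transcribed locus with no stated similarity
    if "hypothetical protein" in low:
        return "H"
    if re.search(r"\bmgc\d+\b", low):
        return "M"
    if re.search(r"\bcdna clone\b|\bimage:\s*\d+", low):
        return "C"
    return "O"  # anything not captured by the other eight categories


def category_counts(titles):
    """Count how many titles fall into each category."""
    return Counter(classify_gene_title(t) for t in titles)


# Invented titles, used only to exercise the rules above.
EXAMPLE_TITLES = [
    "---",
    "Transcribed locus",
    "Transcribed locus, weakly similar to NP_000000.1 example protein",
    "hypothetical protein LOC000000",
    "MGC12345 protein",
    "cDNA clone IMAGE:0000000",
    "potassium channel subunit",
]

if __name__ == "__main__":
    for code, n in sorted(category_counts(EXAMPLE_TITLES).items()):
        print(code, n)
```

Applied to the ‘Gene Title’ column of an annotation file, a tally of this kind would yield category proportions of the sort summarised in Figure 2, provided the rules are first checked against the titles that actually occur in the file.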
Interesting gene expression patterns for inner ear function are not lacking in the literature. Due to the importance of ion channels for the process of mechanotransduction, as well as their role in hereditary disorders of hearing and balance (Gabashvili et al., 2007; Serrano et al., 2001), we are especially interested in enhancing Xl-PSIDs with annotations for this gene family. However, making connections between experimental data found in the literature and microarray data can be challenging because of variable nomenclature as well as new understanding of gene function for different species. The type of gene identifiers referenced in the literature greatly depends on when the paper was published. If the nomenclature has changed since publication, there can be difficulties in establishing connections to Xl-PSIDs. Additionally, the name of proteins in publications can vary for similar genes between species.
Through literature searches we identified a publication by Gabashvili et al. (2007) that contained a list of 262 genes for Ion Channel Activity (ICA), which are expressed in inner ear tissue. The ICA gene list was compiled based on data gleaned from cDNA libraries, microarray analysis, and RT-PCR experiments. Because the Gabashvili et al. (2007) reference already used HGNC symbols as the gene identifiers of choice, we did not need to search for these. Note, however, that if HGNC symbols are not known, GenBank accession numbers, gene names or symbols can be used to find the HGNC symbols in the UCSC genome browser (Kent et al., 2002). This process of identifying target genes was the first step in the curation process outlined in Figure 1(A1).
We began the process of identifying putative ICA Xl-PSIDs by making a text file of all HGNC symbols from the list in the paper and using this to collect Homo sapiens protein sequences with BioMart (http://www.ensembl.org/biomart/martview/4ad4b6a9d10e301741f0d1e2755fe0f0), a mining tool that can be used to retrieve information from the Ensembl database (Figure 1(A1)). When the Ensembl database (release 49) was filtered with the list of 262 ICA HGNC symbols, a total of 241 protein sequences were recovered from the H. sapiens gene dataset (NCBI36). We used the TBLASTN program (Altschul et al., 1990) in a local search with the H. sapiens ICA proteins as the protein query, and the 15,476 consensus nucleotide sequences for Xl-PSIDs that begin with the ‘Xl.’ label (http://www.affymetrix.com/support/technical/byproduct.affx?product=xenopus) as the nucleotide database. Results from the TBLASTN search were evaluated by the returned e-value and a list was compiled of the best Xl-PSID matches to ICA HGNC symbols (e-values less than 10⁻¹⁴). This labour-intensive method limits the number of Xl-PSIDs that can be processed at once, but provides the highest quality annotation for Xl-PSIDs.
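The e-value screening step can be sketched in Python as follows. The sketch assumes that the search results were saved in the standard 12-column tab-delimited BLAST table (query ID, subject ID, percent identity, alignment length, mismatches, gap openings, query start, query end, subject start, subject end, e-value, bit score); the output format is an assumption, the identifiers and scores in the example are invented, and the function name is hypothetical.

```python
import csv
import io

E_VALUE_CUTOFF = 1e-14  # threshold quoted in the text


def best_hits(tabular_text, cutoff=E_VALUE_CUTOFF):
    """Return {query_id: (subject_id, evalue)} for the lowest e-value hit
    of each query, keeping only hits with e-value below the cutoff."""
    best = {}
    for row in csv.reader(io.StringIO(tabular_text), delimiter="\t"):
        if len(row) < 12 or row[0].startswith("#"):
            continue  # skip blank, comment, or malformed lines
        query, subject, evalue = row[0], row[1], float(row[10])
        if evalue >= cutoff:
            continue
        if query not in best or evalue < best[query][1]:
            best[query] = (subject, evalue)
    return best


# Invented hits: protein queries against probe set consensus sequences.
_ROWS = [
    ["HYPO_PROT_1", "Xl.0001.1.S1_at", "85.0", "200", "30", "0",
     "1", "200", "10", "609", "2e-30", "250"],
    ["HYPO_PROT_1", "Xl.0002.1.A1_at", "60.0", "150", "60", "0",
     "1", "150", "5", "454", "1e-16", "120"],
    ["HYPO_PROT_2", "Xl.0003.1.S1_at", "40.0", "80", "48", "0",
     "1", "80", "1", "240", "1e-05", "45"],
]
SAMPLE = "\n".join("\t".join(r) for r in _ROWS)

if __name__ == "__main__":
    for query, (subject, evalue) in best_hits(SAMPLE).items():
        print(query, subject, evalue)
```

In this example the second query is dropped because its only hit fails the cutoff, which mirrors how some of the 241 protein queries would yield no acceptable Xl-PSID match. The sketch covers only the numerical screen, not the manual evaluation of the hits.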
This approach (Figure 1(A2)), like the manual annotation identification, used sequence similarity to link Xl-PSIDs to UniProt IDs by mining the UniProt database. For this automated strategy, the BLASTX program (Altschul et al., 1990) was used in a large scale batch search. All known H. sapiens (human), Mus musculus (mouse), Caenorhabditis elegans (worm) and Drosophila melanogaster (fly) protein sequences were collected from UniProt (The UniProt Consortium, 2007), then compared against nucleotide queries consisting of the 15,476 consensus sequences for the Xl-PSIDs that begin with the ‘Xl.’ label. With this method the annotation of an entire chip can be readily enhanced.
To address the limitations of vendor-supplied annotation data, a local instance of an annotation database was created that incorporated internal expression data and publicly available gene annotation information (Figure 1(B)). The vendor-supplied annotation for Xl-PSIDs is in a format that is not readily searchable in a high-throughput fashion. In an attempt to overcome this limitation, a MySQL® database, XenEnhance, was created to store the Xl-PSID annotation data. We opted to store data in a relational database rather than in another format (such as a flat file) because a relational database can accommodate expansion of Xenopus resource annotation efforts in the future beyond microarray curation. XenEnhance permits linkage of Xl-PSIDs with the UniGene cluster IDs. This approach relies on the fact that Xl-PSIDs are derived from UniGene cluster IDs; therefore, their linkage to UniGene cluster IDs is straightforward. Although the vendor provides the UniGene cluster IDs as part of the Xl-PSID annotation, vendor UniGene cluster ID updates are provided less frequently than those on the UniGene public database which are updated monthly (http://www.ncbi.nlm.nih.gov/UniGene/FAQ.shtml).
Specifically, XenEnhance consists of seven database tables (Figure 3) created using data retrieved in flat file format from four data sources: the X. laevis GeneChip® annotation file, UniGene, RefSeq, and HGNC symbol data. Two tables were created from the UniGene data. The table ‘ug_clusters_82’ stores descriptive information for each UniGene cluster in the release, while the table ‘sim_proteins_ug’ stores descriptive information for the human proteins that are similar to the UniGene clusters in ‘ug_clusters_82’. One table, ‘hgnc_051908’, was populated with HGNC data that was downloaded on May 19, 2008. This table contains descriptive information for each current official gene symbol. Another table, ‘hs_prot2rna_29’, was created using data from the RefSeq file, release29.accession2geneid (http://www.ncbi.nlm.nih.gov/RefSeq/). This table contains human RefSeq protein accession, the corresponding human RefSeq RNA accession, and the human Entrez gene ID for the corresponding gene. Two additional tables were created using data from the vendor-supplied annotation file for the X. laevis GeneChip®. The Xl-PSID links the data in these two tables. One of these tables, ‘xl_probeset_031808’, contains descriptive information whereas the second table, ‘xl_probe_031808’, contains the individual probe ID and the sequences that comprise a probe set. Finally, the linking table, ‘ug_82_hgnc_051908’, was created to maintain the association between a X. laevis UniGene cluster and HGNC official gene symbols (see Figure 4(A)–(C)).
To format the data from the various sources into a form that could be inserted into XenEnhance, text parsers and table population software were written in the Java™ language (SDK version 1.5.0_13) to extract the required data from text files and load data into the tables.
The vendor-supplied annotation file is parsed in such a way as to create a separate row in the corresponding database table for each Xl-PSID (a separate table was created for the individual probes, which are linked to the Xl-PSIDs). The UniGene Xl.data file was parsed to extract cluster ID, cluster description, the gene represented by the cluster, and the gene ID. The parsed data were loaded into the table ‘ug_clusters_82’. The presence of the gene ID permits direct linking to NCBI’s Entrez Gene without the need to create a local instance of that data set. As mentioned previously, the Xl-PSIDs are derived from UniGene cluster IDs and the two can therefore be readily linked without the creation of a distinct linking table.
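The parsing step can be illustrated with the following Python sketch; our own parsers were written in Java, so this is an illustration rather than the software we used. It assumes the UniGene flat-file layout in which each line starts with a field tag (ID, TITLE, GENE, GENE_ID, PROTSIM), each record ends with a line containing ‘//’, and each PROTSIM line carries ‘KEY=value’ pairs separated by semicolons. The record shown is invented and the function name is hypothetical.

```python
def parse_unigene_records(text):
    """Yield one dict per cluster record from UniGene-style flat-file text.

    Only the fields needed for linkage are kept; PROTSIM lines are
    collected as a list of {key: value} dicts under 'protsim'.
    """
    wanted = ("ID", "TITLE", "GENE", "GENE_ID")
    record = {"protsim": []}
    for raw in text.splitlines():
        line = raw.rstrip()
        if line == "//":  # assumed end-of-record marker
            if "ID" in record:
                yield record
            record = {"protsim": []}
            continue
        if not line:
            continue
        key, _, value = line.partition(" ")
        value = value.strip()
        if key == "PROTSIM":
            pairs = (p.strip().split("=", 1)
                     for p in value.split(";") if "=" in p)
            record["protsim"].append(dict(pairs))
        elif key in wanted:
            record[key] = value
    if "ID" in record:  # tolerate a final record with no terminator
        yield record


# Invented record in the assumed layout.
SAMPLE = """\
ID          Xl.0000
TITLE       Hypothetical example cluster
GENE        examplegene
GENE_ID     000000
PROTSIM     ORG=9606; PROTGI=0000000; PROTID=NP_000000.1; PCT=75.00; ALN=200
//
"""

if __name__ == "__main__":
    for rec in parse_unigene_records(SAMPLE):
        print(rec["ID"], rec["GENE_ID"], [p["PROTID"] for p in rec["protsim"]])
```

Each yielded dictionary corresponds to one row of a cluster table, and each entry of its ‘protsim’ list to one row of a similar-protein table keyed by the cluster ID.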
To make the link between the Xl-PSIDs and the HGNC official gene symbols, it was necessary to combine data from other sources to that obtained from UniGene. The first step toward this linkage relied upon the UniGene file, Xl.data. Using the contents of this file, we were able to link X. laevis UniGene clusters with proteins from different species, including human. When a UniGene cluster has a human protein with which it shares a degree of similarity, the protein accession ID is used to locate the corresponding human Entrez gene ID from the RefSeq file, release29.accession2geneid. In addition to Entrez gene ID, the release29.accession2geneid file provides RefSeq RNA accessions (when available), the taxon ID, and the RefSeq protein accessions. Using the Entrez gene ID, the HGNC database table could be searched to locate the corresponding official gene symbol. The Entrez gene ID is used as the direct link between Xl-PSIDs and HGNC official gene symbols. It is also possible to use the RefSeq protein accession to link to the HGNC data, which would be useful in cases where the HGNC symbol could not be linked to Xl-PSID by means of Entrez gene ID. Because we were using a more recent release of UniGene than was used when the chip was created, it was necessary for us to use current cluster assignments of the Xl-PSIDs to make the linkage to HGNC symbols.
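The chain of links described above (Xl-PSID to UniGene cluster, cluster to similar human protein, protein accession to Entrez gene ID, Entrez gene ID to HGNC symbol) can be expressed as a sequence of relational joins. The Python sketch below uses the standard-library sqlite3 module as a stand-in for MySQL so that it is self-contained. The table names are shortened from those in the text, and all column names and data rows are hypothetical.

```python
import sqlite3

SCHEMA = """
CREATE TABLE xl_probeset     (psid TEXT PRIMARY KEY, ug_cluster_id TEXT);
CREATE TABLE sim_proteins_ug (ug_cluster_id TEXT, protein_acc TEXT);
CREATE TABLE hs_prot2rna     (protein_acc TEXT PRIMARY KEY,
                              rna_acc TEXT, entrez_gene_id INTEGER);
CREATE TABLE hgnc            (entrez_gene_id INTEGER PRIMARY KEY, symbol TEXT);
"""

LINK_QUERY = """
SELECT p.psid, h.symbol
FROM xl_probeset p
JOIN sim_proteins_ug s ON s.ug_cluster_id  = p.ug_cluster_id
JOIN hs_prot2rna     r ON r.protein_acc    = s.protein_acc
JOIN hgnc            h ON h.entrez_gene_id = r.entrez_gene_id
ORDER BY p.psid
"""


def build_example_db():
    """Create an in-memory database holding a few invented rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany("INSERT INTO xl_probeset VALUES (?, ?)", [
        ("Xl.0001.1.S1_at", "Xl.0001"),
        ("Xl.0002.1.A1_at", "Xl.0002"),  # cluster with no human protein
    ])
    conn.execute("INSERT INTO sim_proteins_ug VALUES (?, ?)",
                 ("Xl.0001", "NP_000000.1"))
    conn.execute("INSERT INTO hs_prot2rna VALUES (?, ?, ?)",
                 ("NP_000000.1", "NM_000000.1", 999999))
    conn.execute("INSERT INTO hgnc VALUES (?, ?)", (999999, "EXAMPLE1"))
    conn.commit()
    return conn


def link_psids_to_hgnc(conn):
    """Return (psid, hgnc_symbol) pairs reachable through the join chain."""
    return conn.execute(LINK_QUERY).fetchall()


conn = build_example_db()

if __name__ == "__main__":
    for psid, symbol in link_psids_to_hgnc(conn):
        print(psid, symbol)
```

Because the joins are inner joins, a probe set whose cluster has no similar human protein, or whose protein has no Entrez gene ID or HGNC entry, simply drops out of the result. That behaviour is consistent with only a fraction of the Xl-PSIDs being linked by the semantic approach.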
The HGNC symbols for the 241 ICA genes that were manually identified were matched to 134 Xl-PSIDs (Table 1A1). Seventy-three percent of the HGNC symbols were linked to Xl-PSIDs in the O category, showing that even this labour intensive technique can enhance one line gene titles. The majority of enhanced annotations were identified using the large scale sequence similarity based approach (Table 1A2). UniProt IDs from H. sapiens and M. musculus had the largest counts for matches, linking 7,259 (47%) and 6,952 (45%) of the Xl-PSIDs, respectively, to a new annotation (Table 1A2). Using the semantic approach, we were able to associate 4,031 (26%) of the 15,476 Xl-PSIDs that were derived from UniGene clusters to HGNC symbols based on various degrees of similarity between X. laevis and human proteins.
All three procedures had the highest matches to the other (O) and hypothetical protein (H) categories. Over half of the matches obtained with the sequence similarity approach enhanced annotations within the O category (Table 1A); one third of the matches obtained with the semantic based approach enhanced annotations within the O and the H categories (Table 1B).
Figure 5(A) illustrates that together, the two large scale approaches (sequence and semantic based) matched a majority of the Xl-PSIDs to an additional gene identifier such as a UniProt ID or HGNC symbol (9,108 matches, Table 2A). More than half of all the Xl-PSIDs were enhanced with an additional annotation, with over half of these matching 4–6 annotations. Within the enhanced group (Figure 5(B)), over a quarter of Xl-PSIDs were matched to five new annotations.
When we compared the category distribution of the enhanced (Figure 6(A)) and non-enhanced (Figure 6(B)) Xl-PSIDs, we observed that the majority of the enhanced Xl-PSIDs were in the O (other, 45%) and H (hypothetical proteins, 27%) categories (Figure 6(A)). The third largest group with enhanced annotations, the M category (MGCXXXX proteins from the mammalian gene collection) included 10% of the Xl-PSIDs with enhanced annotations (Figure 6(A)). Our curation strategies identified 929 additional annotations within the M category (74%, Table 2). The TL category remained elusive, with 3,733 (94.7%) of the Xl-PSIDs having only Affymetrix annotations (Table 2) after curation.
We have attacked the problem of enhancing microarray chip annotation with multiple approaches that were either sequence similarity based, or relied on semantic relationships between publicly available annotation data from X. laevis and other species. HGNC and UniProt databases were selected as purveyors of target enhancement annotations because they are collaboratively maintained by the research community and they are frequently updated (http://www.genenames.org/; The UniProt Consortium, 2007). Using the semantic approach, we increased our flexibility in mining the microarray data by creating a relational database that linked the vendor supplied annotation data for Xl-PSIDs with publicly available annotation for X. laevis and other species. In so doing, we were able to enhance the existing annotation for 60% of the Xl-PSIDs and to associate gene function properties with Xl-PSIDs that lacked this information in the annotation provided by the microarray vendor (Figures 5 and 6). In the future, we intend to extend our curation efforts by augmenting Xl-PSID annotations with information from online public resources dedicated to Xenopus genetics. For example, websites such as Xenbase (http://www.xenbase.org) and the Xenopus Gene Collection (XGC) (http://xgc.nci.nih.gov/) are repositories for Xenopus gene annotation data that are especially useful because they provide lists of full length clones. XGC comprises 10,291 full open reading frame clones and 9,138 non-redundant genes and Xenbase is currently in the process of linking human homologs, along with other species, to Xenopus genes. In addition, software tools developed as a result of our annotation efforts will be made freely available to the Xenopus community.
Ideally, we would prefer to work with universal gene symbols. However, presently there is no universally accepted consensus gene nomenclature for all species and there is variability in the degree of completion of publicly available annotation data for various species. Consequently, multifaceted approaches for annotation such as those presented here will continue to provide value for researchers by enhancing vendor-supplied microarray annotation and tentatively identifying previously unidentified Xl-PSIDs.
The authors express their gratitude to Dr. Charles Whittaker for insights and assistance with the mapping of X. laevis Xl-PSIDs to sequenced genomes. This work was supported in part by NIH (GM008136; DC003292; P50GM068762) awards to E.E.S. and an NSF IGERT Fellowship to S.M.V. (DGE-0504304).
TuShun R. Powers is a Postdoctoral Researcher in the Department of Biology at New Mexico State University. She received her PhD in Molecular Biology with a special interest in ruminant microbiology from New Mexico State University in 2005. She is intrigued by interdisciplinary research approaches and has interests in microbiology, molecular biology, bioinformatics, and functional genomics.
Selene M. Virk is a PhD student in the Department of Biology at New Mexico State University, where she received her Bachelor and Masters of Science degrees. Her current research interests include developing software tools to facilitate functional gene annotation, nervous system development, mechanotransduction, and modelling of complex biological systems. She has ten years' experience as a Software Engineer in the biotech industry.
Elba E. Serrano is a Regent’s Professor of Biology at New Mexico State University and a member of the MIT Systems Biology Cell Decision Processes Center. She received her undergraduate degree in physics with distinction from the University of Rochester and her PhD in Biological Sciences from Stanford University with an emphasis in neuroscience and biophysics. Her research interests include neural regeneration, mechanotransduction, and disorders of hearing and balance. She has a special interest in promoting interdisciplinary education that bridges the life and quantitative/physical sciences, and in programmes that encourage students to pursue advanced degrees in Science, Technology, Engineering, and Mathematics (STEM) disciplines.