MicroRNAs are small (19–22 nt) RNAs that play crucial roles in many cellular processes via targeting mRNAs for translational repression or cleavage thus regulating gene expression (1). microRNAs, through their compatible 5′-seed sequences, exert regulatory functions primarily on the 3′-untranslated regions (UTRs) of targeted mRNAs (2–4). microRNAs that target the same mRNAs may share common motifs due to duplication events and/or common evolutionary ancestry (5–7). Previous studies using genome search and target prediction algorithms have provided lists of common genomic regulatory nucleotide motifs, some of which are also shared by microRNA sequences (8). However, whether microRNAs that are similar in sequence exhibit similarities also in function and/or expression is not yet well understood and warrants further study.
Based on large-scale studies, microRNAs have been annotated for their specificity for particular tissues, developmental stages and/or pathologies such as cancer (9–12). For example, Bargaje et al. (13) compiled and normalized multiple data sets from different sources to determine the tissue-specific and tissue-invariant consensus expression profiles. Others have surveyed microRNA expression profiles in large numbers of normal and cancerous tissues to decipher microRNA networks and conserved expression clusters in disease (14). There also is evidence suggesting that expression patterns of microRNAs are conserved at the species level (5). Nevertheless, the conserved associations between microRNA sequence motifs and expression profiles across taxa still remain relatively unexplored (15–17). Similarly, there is a growing need for tools developed for multivariate comparison of expression patterns between different data sets (18).
In recent years, several databases and analysis tools also have been published that feature high-throughput analysis results of microRNA sequence or expression. Among these, miRBase functions as a central repository for microRNA genomics for a variety of organisms and thus serves the community with up-to-date microRNA sequence, chromosome location and transcript information (19). mSigDB, using motif lists from Xie et al. (8), provides microRNA target gene lists that could be tested for enrichment with Gene Ontology (GO) functional terms, KEGG signaling pathways or other gene lists (20). Similarly, a manually curated database, called Mir2DiseaseBase, can be used for extracting associations between diseases and microRNAs (21). Most recently, mirBridge has been developed to predict microRNA function and link microRNAs with cellular pathways using network algorithms (22). Among the expression analysis focused databases, miRGator is a comprehensive repository and analysis tool for microRNA expression, target and ontology data providing a graphical transcriptional evaluation of selected microRNA types for mice or humans (23). microRNA.org is another source of microRNA expression and functional data for understanding microRNA expression regulation through target prediction and examination of tissue transcript abundance (24). Accordingly, using a database approach has been fruitful in allowing the users to interactively query and make associations among large-scale data sets and thus is highly suited for exploring the association between sequence motifs and expression profiles of microRNAs and meta-analysis of microRNA expression data sets.
In the present study, we have developed the microRNA Expression and Sequence Analysis Database, mESAdb, to provide a series of interactive analysis tools for testing the association of microRNA sequence characteristics with target gene function, human diseases and microRNA expression patterns using multivariate analyses. mESAdb is also a meta-analysis tool for comparative analysis of function and expression for microRNA lists across different taxa, including human, mouse and zebrafish. Complementing existing databases, mESAdb takes advantage of the available sequence information to search for microRNAs with common motifs (e.g. dinucleotide frequencies or conserved seed sequences) after which these microRNA sets can be analyzed for determination of the extent of coordinate expression and target gene enrichment using terms from GO, KEGG and HUGE Navigator databases (25–27). mESAdb is compatible and periodically updated in an automated way with data from related large external repositories. It also allows upload and analysis of user-specified data sets, and makes extensive use of existing and customized R packages (28). Overall, we believe that mESAdb, by specifically addressing the need for comparative and multivariate analysis of microRNA sequence and expression profiles, is likely to significantly enhance our understanding of the role of microRNAs in biological processes.
mESAdb enables access and retrieval of microRNAs with specified motifs to associate and analyze them functionally as well as based on expression profiles (Figure 1). An initial version of this work was presented in abstract form in BioSysBio 2007: Systems Biology, Bioinformatics, Synthetic Biology (29).
Data used in mESAdb are obtained periodically from multiple sources and processed for integration into the underlying MySQL database using a series of routines which download, parse and integrate these data from relevant sources (Ensembl, miRBase, microCosm, HUGE, KEGG and GO) either directly or through the Biomart integration service (Figures 1 and 2) (19,25–27,30).
Mature microRNA IDs and sequences were downloaded from miRBase Release 15 (31); each microRNA was associated with a species-specific sequence and stored in a table. microRNA microarray experiment data sets for human, mouse and zebrafish, primarily focusing on expression from different tissues and developmental stages, were stored separately as default data sets (14,32–38) (Table 1; Supplementary Data). Tables containing the normalized expression values were associated with sequence data linked with the corresponding miRBase names for these microRNAs (Figure 2). Where available, the probe sequences printed on microarrays that match exactly with the species-specific reverse complementary sequences in miRBase were included, resulting in increased stringency; thus the number of microRNAs from each microarray study incorporated into mESAdb might be smaller than that reported in the original study. Expression data were logarithmically transformed where necessary, and quantile normalized (39). To link sequence and expression properties with functional information, the predicted human targets were retrieved from MicroCosm Targets (Figure 2) (19). These targets were further processed in the R environment (Version 2.11.1) (www.bioconductor.org); transcript IDs were matched with Ensembl Gene IDs (Ensembl Release 59) using the package biomaRt (30). Only a single Ensembl ID was retrieved for each target gene with multiple transcript entries. Species-specific microRNAs were paired with target gene IDs associated with ontology terms and these matched pairs were stored in mESAdb’s underlying DBMS (Figure 2; MySQL). KEGG and Gene Ontology terms associated with microRNA targets were extracted and matched with the corresponding microRNA IDs (25,26). The disease terms associated with microRNA targets were obtained from the phenopedia view of HUGE Navigator, an integrated knowledge base of genetic associations and human genome epidemiology (27).
These terms were parsed and matched with microRNA targets and stored in the MySQL tables underlying mESAdb. Target and associated terms are updated periodically (Figure 2).
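The log transformation and quantile normalization applied to the expression tables can be illustrated with a minimal sketch. It is not the code used by mESAdb, which relies on R; the function names `log2_transform` and `quantile_normalize`, the choice of base 2 and the handling of ties (broken by row order rather than averaged) are assumptions made for illustration only.

```python
import math


def log2_transform(matrix):
    """Log2-transform a matrix of strictly positive expression values.

    The base is an assumption; the text only says 'logarithmically transformed'.
    """
    return [[math.log2(value) for value in row] for row in matrix]


def quantile_normalize(matrix):
    """Quantile normalize the columns of a matrix.

    Rows are microRNAs and columns are samples (e.g. tissues). After
    normalization every column has the same set of values: the mean of the
    original columns at each rank. Ties are broken by row order (a
    simplification; some implementations average tied ranks).
    """
    n_rows = len(matrix)
    n_cols = len(matrix[0])
    # Sort each column independently.
    sorted_cols = [sorted(row[j] for row in matrix) for j in range(n_cols)]
    # Mean across columns of the values sharing the same rank.
    rank_means = [
        sum(col[i] for col in sorted_cols) / n_cols for i in range(n_rows)
    ]
    result = [[0.0] * n_cols for _ in range(n_rows)]
    for j in range(n_cols):
        # Row indices of column j ordered from lowest to highest value.
        order = sorted(range(n_rows), key=lambda i: matrix[i][j])
        for rank, i in enumerate(order):
            result[i][j] = rank_means[rank]
    return result


if __name__ == "__main__":
    # Toy data: three microRNAs measured in two samples.
    toy = [[1.0, 10.0], [2.0, 30.0], [3.0, 20.0]]
    for row in quantile_normalize(toy):
        print(row)
```

After normalization, the two toy columns contain the same three values, assigned according to each microRNA's rank within its own column.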
mESAdb incorporates a tool for the upload of user-specified expression data sets provided as comma-separated files (Figure 3). The user is free to add, view and remove expression data sets having expression data for arbitrary numbers of microRNAs against an arbitrary number of expression classes (e.g. tissues, developmental stages, disease states). The format for the input file is straightforward: a comma-delimited file with the first row giving the names of the classes, the subsequent lines of the file each beginning with the name of the microRNA (e.g. the miRBase ID) and, optionally, the probe sequence, followed by the measured expression for each of the classes given in the header line. The uploaded file is preprocessed line by line; for each line, if the reverse complement of the probe sequence given for that line contains the mature sequence for the corresponding microRNA as given in the latest miRBase, the line is verified. For lines that cannot be thus verified, the system searches for a match in the miRBase sequences for the relevant species. If a match is found, the microRNA is renamed to its miRBase standard name; if not, the line is discarded. Subsequently, the lines for duplicate microRNAs are averaged. The upload module generates a downloadable text file listing the actions performed while parsing and processing the .csv file. mESAdb uses the miRBase nomenclature for cross-taxa comparisons performed in the ‘expression–expression’ module, where microRNAs with the same name from two different species are matched. Most microRNAs carrying the same name exhibit high sequence similarity across species while ~5% are relatively divergent. The Data Upload utility also warns users about such cases. The module also provides utilities to log transform, center and scale or quantile normalize the expression data upon verification by miRBase (Figure 3).
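The line-by-line verification described above can be sketched as follows. This is an illustrative approximation rather than the mESAdb upload module: the function name `preprocess_upload`, the `mature` dictionary standing in for miRBase, the log messages and the assumption that every data row carries a probe field (with the header listing only the class names) are all hypothetical.

```python
import csv
import io

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def to_dna(seq):
    """Upper-case a sequence and write it in the DNA alphabet (U -> T)."""
    return seq.upper().replace("U", "T")


def reverse_complement(seq):
    """Reverse complement of a nucleotide sequence, returned as DNA."""
    return to_dna(seq).translate(_COMPLEMENT)[::-1]


def preprocess_upload(csv_text, mature):
    """Verify, rename, discard and average the rows of an uploaded file.

    csv_text: header row with class names, then rows of
              name, probe sequence, one expression value per class.
    mature:   dict of standard name -> mature sequence (stand-in for miRBase).
    Returns (class names, {name: averaged values}, list of log messages).
    """
    reader = csv.reader(io.StringIO(csv_text))
    classes = [c.strip() for c in next(reader)]
    grouped, log = {}, []
    for row in reader:
        if len(row) < 3:
            continue  # blank or malformed line
        name, probe = row[0].strip(), row[1].strip()
        values = [float(v) for v in row[2:]]
        target = reverse_complement(probe)
        known = mature.get(name)
        if known is not None and to_dna(known) in target:
            log.append(f"{name}: verified")
        else:
            # Fall back to searching every known mature sequence.
            match = next(
                (n for n, s in mature.items() if to_dna(s) in target), None
            )
            if match is None:
                log.append(f"{name}: discarded")
                continue
            log.append(f"{name}: renamed to {match}")
            name = match
        grouped.setdefault(name, []).append(values)
    # Average the rows that ended up under the same microRNA name.
    averaged = {
        n: [sum(col) / len(col) for col in zip(*rows)]
        for n, rows in grouped.items()
    }
    return classes, averaged, log
```

The returned log plays the role of the downloadable text file that lists the actions taken for each line.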
A user-uploaded data set is tied to the specific user account that creates it and may be retrieved from another source or may be the product of the user’s own research. We protect the privacy of proprietary data by keeping uploaded sets visible only to the account that owns them, and no data are retained once a user removes a data set. An example data set is provided in the current version of mESAdb (40) (i.e. GSE2564NORMAL_seq.csv; mESAdb supporting material; http://konulab.fen.bilkent.edu.tr/mirna/supplementary_files.php). To prepare it, we downloaded the GSE2564 expression series matrix from GEO (41). This data set includes normal tissues from stomach (n = 5), colon (n = 5), pancreas (n = 1), liver (n = 3), kidney (n = 3), bladder (n = 2), prostate (n = 8), uterus (n = 9), lung (n = 4), breast (n = 3) and brain (n = 2), together with cancer samples for different tissues. For the example used herein, only the expression data on the normal tissues were obtained and linked with microRNA IDs and probe sequences in the GPL1986 description file, and a comma-separated file was formed for upload. An account of the processing of the microRNAs in the .csv file was generated by mESAdb. The data set, called GSE2564_normal, could be uploaded using the ‘Manage Datasets’ facility of mESAdb (Figure 3) and compared with the existing data sets listed in Table 1.
mESAdb uses a hybrid of PHP and R as a computational environment. The basic operations and the web interface elements are coded in PHP whereas more significant statistical analyses are performed in R (Figure 2). The web interface has been made as responsive and user-friendly as possible with the addition of dynamic elements created with JavaScript and the jQuery UI (http://jqueryui.com/) library. The communication between the PHP and R environments is performed using the common underlying MySQL database and Unix pipes. Briefly, a PHP script creates a child R process to which command-line arguments are passed. The R process uses these arguments to retrieve the relevant information from the MySQL database and subsequently prepares the output (e.g. graphics such as the bar plot and correspondence plots), which it passes on to the calling PHP script to display on the page. If the output is mostly textual (e.g. tabular data), it is passed on the output stream of the R program. If it is a larger result such as an image, the R program saves it under a predetermined filename in a temporary location, from which the PHP script retrieves it once the child R process has finished. This two-way communication between the PHP code and its R child processes has been implemented as a simple but effective API, which allows new R scripts to be easily integrated into the mESAdb tool as needed. This enables mESAdb to build on well-designed and verified analysis packages such as MADE4 (28) available for the R environment and use them to leverage its analysis tasks.
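The parent and child process pattern described above can be sketched in a language-neutral way. In the following example Python stands in for both the PHP caller and the R child, and the MySQL lookup is omitted; the names `run_child` and `CHILD_SOURCE`, the two modes and the temporary filename are hypothetical and only illustrate the split between small results returned on the output stream and larger results returned through a file.

```python
import os
import subprocess
import sys
import tempfile

# Source of the child program. In mESAdb this role is played by an R script;
# here a small Python program is used so that the sketch is self-contained.
CHILD_SOURCE = r'''
import sys
mode, out_path, *names = sys.argv[1:]
if mode == "table":
    # Small textual result: write it to the output stream.
    for n in names:
        print(f"{n}\t{len(n)}")
else:
    # Larger result (e.g. an image): save it under the agreed filename.
    with open(out_path, "w") as fh:
        fh.write("PLOT:" + ",".join(names))
'''


def run_child(mode, names):
    """Spawn a child process, pass arguments, and collect its result."""
    out_path = os.path.join(
        tempfile.gettempdir(), f"child_result_{os.getpid()}.out"
    )
    proc = subprocess.run(
        [sys.executable, "-c", CHILD_SOURCE, mode, out_path, *names],
        capture_output=True,
        text=True,
        check=True,  # raise if the child exits with an error
    )
    if mode == "table":
        return proc.stdout
    try:
        # The child has finished, so the file is complete and can be read.
        with open(out_path) as fh:
            return fh.read()
    finally:
        if os.path.exists(out_path):
            os.remove(out_path)


if __name__ == "__main__":
    print(run_child("table", ["miR-21", "let-7a"]))
    print(run_child("plot", ["miR-21", "let-7a"]))
```

Keeping the interface to command-line arguments, an output stream and an agreed filename is what makes it straightforward to add new analysis scripts without changing the caller.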
mESAdb has a motif selection tool with a pulldown menu in which users can select from different options to group retrieved microRNAs with a given motif, i.e. dinucleotide motifs or motifs up to 6 nt long using the IUPAC code (42). It is also possible to upload user-specified microRNA lists. The ‘Motif-expression’ module integrates the motif selection tool with default microarray data sets found in mESAdb as well as those uploaded by the user (Figures 1 and 3; Table 1). Accordingly, mESAdb provides a platform for visualization of microRNA expression in humans, mouse and zebrafish. Once a microRNA list is selected, expression of this set of microRNAs can be investigated using three different analysis options: ‘expression analysis’, ‘correspondence analysis’ and ‘co-inertia analysis’.
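Selecting microRNAs by an IUPAC-coded motif can be illustrated with a short sketch that expands the motif into a regular expression. The function names `motif_to_regex` and `select_by_motif` and the toy sequences are hypothetical and are not taken from mESAdb; only the IUPAC nucleotide codes themselves are standard.

```python
import re

# Standard IUPAC nucleotide codes, expressed in the DNA alphabet.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T", "U": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}


def motif_to_regex(motif):
    """Compile an IUPAC motif (e.g. 'TAAYAC') into a regular expression."""
    return re.compile("".join(IUPAC[ch] for ch in motif.upper()))


def select_by_motif(sequences, motif):
    """Return the sorted names of sequences containing the motif anywhere."""
    pattern = motif_to_regex(motif)
    return sorted(
        name
        for name, seq in sequences.items()
        if pattern.search(seq.upper().replace("U", "T"))
    )


if __name__ == "__main__":
    # Toy sequences for illustration only.
    toy = {
        "toy-181": "AACATTCA",
        "toy-200a": "TAACACTG",
        "toy-200b": "TAATACTG",
    }
    print(select_by_motif(toy, "TAAYAC"))  # Y matches C or T
```

Restricting the search to a fixed position, such as the seed region, would only require anchoring the pattern or slicing the sequence before matching.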
The ‘expression analysis’ option enables the user to compare, using bar plots, the mean expression of the selected microRNAs with that of the remaining microRNAs across the studied expression classes, i.e. tissues or developmental stages. Expression data (Table 1) for the selected microRNAs and those for the unselected microRNAs are extracted from the quantile normalized log transformed expression tables that have been generated by the mESAdb Data Upload facility. The class-specific (e.g. tissue-specific) mean values for the selected and unselected microRNAs are then plotted separately for each column of the data set (e.g. each tissue) using a bar plot. Bars are color coded by the value of the ϕ-coefficient to assess the association of the selected microRNAs with the tissue in consideration (also called the Yule-ϕ; Supplementary Data) (43). A dynamic hover feature has been implemented so that the user can see exact information about each column by hovering the mouse pointer over it in the bar plot. Expression data sets are accessible in the html format, and the χ2 and P-values for the ϕ-coefficient are also generated. Help boxes are made available for data plots and analysis tools.
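For a 2 × 2 contingency table, the ϕ-coefficient and its associated χ2 statistic and P-value can be computed as in the sketch below. How mESAdb constructs the table is described in its Supplementary Data and is not reproduced here; the example assumes, purely for illustration, counts of selected and unselected microRNAs falling above or below some expression threshold in a given tissue, and the function names are hypothetical.

```python
import math


def phi_coefficient(a, b, c, d):
    """Phi coefficient of the 2 x 2 table [[a, b], [c, d]].

    Assumed layout (illustrative only):
        a = selected, above threshold     b = selected, below threshold
        c = unselected, above threshold   d = unselected, below threshold
    """
    denom = math.sqrt((a + b) * (c + d) * (a + c) * (b + d))
    if denom == 0:
        return 0.0  # a zero margin carries no association information
    return (a * d - b * c) / denom


def phi_test(a, b, c, d):
    """Return (phi, chi-square, P-value) for the 2 x 2 table.

    Uses the identity chi2 = n * phi**2 and the chi-square survival
    function for one degree of freedom, erfc(sqrt(chi2 / 2)).
    """
    n = a + b + c + d
    phi = phi_coefficient(a, b, c, d)
    chi2 = n * phi * phi
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return phi, chi2, p_value


if __name__ == "__main__":
    print(phi_test(8, 2, 3, 7))
```

A positive ϕ indicates that the selected microRNAs are over-represented among the highly expressed ones in that class, a negative ϕ the opposite.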
mESAdb performs multivariate analysis of expression using the R package MADE4 customized for visualization and analysis in mESAdb (28). ‘Correspondence analysis’ of the selected set of microRNAs produces three graphical outputs, allowing for visualization of the expression patterns across classes (e.g. tissues), or microRNAs, or both the classes and microRNAs. ‘Co-inertia analysis’ (28) of the selected set of microRNAs helps visualize the similarities between microRNA expression and occurrence of common 6-mer MEME motifs (44) found among the microRNA sequences housed in mESAdb (45). Users can link from a motif back to the ‘expression analysis’ module explained above to visualize, as bar plots per expression class (e.g. tissue), the expression data of the group of microRNAs used in the co-inertia plot that contain the specified motif. MEME motif outputs that we generated for the human, mouse and zebrafish microRNAs can be accessed from the supporting material (http://konulab.fen.bilkent.edu.tr/mirna/supplementary_files.php) found at the mESAdb site.
This module provides a tool for meta-analysis of microRNA expression data sets. Selected sets of microRNAs can be investigated with regard to the data sets listed in Table 1 in a pair-wise fashion; other user-defined data sets can be uploaded and analyzed as well (Figure 1). ‘Expression–expression module’ outputs coinertia graphics for (i) classes (e.g. tissues) and (ii) microRNAs, and also a heatmap of both data sets using customized MADE4 (28) and heatplus (http://bioconductor.org/packages/2.6/bioc/vignettes/Heatplus/inst/doc/Heatplus.pdf) packages in R (www.bioconductor.org). The output has been customized for better visualization; and the degree of association, indicated by the RV coefficient (28,46) between two different microarray data sets also is provided. A high RV score suggests better correlation among data sets. For the microRNA oriented coinertia graph, several utilities are provided in order to facilitate the visualization of potentially high numbers of datapoints. It is possible to visualize the microRNA datapoints with or without labels on the coinertia graph. The coinertia tool also provides an automatic clustering of the microRNAs based on the similarity of their expressions in both data sets using k-means clustering (47); the default clustering displayed is the clustering with the maximum silhouette coefficient (48). Since k-means clustering is not deterministic, for each k-value the module performs 20 runs of the algorithm and the best clustering for each k is selected using highest silhouette. The clustering with the overall best silhouette is displayed by default. The user can manually set a cluster number between 2 and 10 clusters (i.e. 2 ≤ k ≤ 10) if desired. These clusters can further be investigated to visualize the expression profiles for the given data sets using expression bar plots of in-cluster and out-of-cluster microRNAs, by clicking on the cluster centroids.
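The cluster selection procedure (repeated k-means runs for each k, with the clustering of highest silhouette retained) can be sketched in plain Python. This is an illustrative reimplementation under simplifying assumptions, not the R code used by mESAdb: the function names, the random initialization from data points, the fixed seed and the convention that singleton clusters contribute a silhouette of 0 are choices made for this example.

```python
import math
import random


def kmeans(points, k, rng, max_iter=100):
    """One run of Lloyd's k-means; returns a cluster label per point."""
    centroids = rng.sample(points, k)  # random initialization from the data
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable
        labels = new_labels
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster empties
                centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return labels


def silhouette(points, labels):
    """Mean silhouette coefficient of a clustering."""
    if len(set(labels)) < 2:
        return -1.0
    total = 0.0
    for i, p in enumerate(points):
        by_cluster = {}
        for j, q in enumerate(points):
            if i != j:
                by_cluster.setdefault(labels[j], []).append(math.dist(p, q))
        own = by_cluster.get(labels[i])
        if not own:
            continue  # singleton cluster contributes 0
        a = sum(own) / len(own)
        b = min(sum(d) / len(d) for c, d in by_cluster.items() if c != labels[i])
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(points)


def best_clustering(points, k_values=range(2, 11), runs=20, seed=0):
    """Best of `runs` k-means runs per k, plus the overall best k."""
    rng = random.Random(seed)
    per_k = {}
    for k in k_values:
        if k >= len(points):
            break
        scored = [
            (silhouette(points, labels), labels)
            for labels in (kmeans(points, k, rng) for _ in range(runs))
        ]
        per_k[k] = max(scored, key=lambda item: item[0])
    best_k = max(per_k, key=lambda k: per_k[k][0])
    return best_k, per_k


if __name__ == "__main__":
    pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
    k, results = best_clustering(pts)
    print(k, results[k])
```

Keeping the best clustering for every k, rather than only the overall winner, mirrors the option of letting the user override the default with a manually chosen number of clusters.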
This function may be useful for functional analysis of, for example, a set of differentially expressed microRNAs (Figures 1 and 2). In the present study, information from HUGE Navigator, in addition to GO and KEGG databases can be associated with the selected microRNAs (25–27). For any selected subset of microRNAs, mESAdb then can be used to retrieve the mappings of the selected functional terms, with the targets of these microRNAs and subsequently to calculate a probability value based on the hypergeometric distribution (49).
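A hypergeometric enrichment probability of this kind can be computed as in the following sketch. The exact definition of the gene universe and the tail used by mESAdb are not stated in the text; the example assumes an upper-tail test (probability of observing at least the given number of annotated targets), and the function names and the numbers in the usage example are hypothetical.

```python
from math import comb


def expected_count(universe, annotated, drawn):
    """Expected number of annotated genes among the drawn target genes."""
    return drawn * annotated / universe


def hypergeom_upper_tail(observed, universe, annotated, drawn):
    """P(X >= observed) for a hypergeometric random variable X.

    universe:  total number of genes considered
    annotated: genes in the universe carrying the functional term
    drawn:     target genes of the selected microRNAs
    observed:  target genes that carry the term
    """
    upper = min(annotated, drawn)
    favourable = sum(
        comb(annotated, i) * comb(universe - annotated, drawn - i)
        for i in range(observed, upper + 1)
    )
    return favourable / comb(universe, drawn)


if __name__ == "__main__":
    # Hypothetical numbers: 200 of 20000 genes carry a term and 15 of the
    # 300 predicted targets carry it.
    print(expected_count(20000, 200, 300))
    print(hypergeom_upper_tail(15, 20000, 200, 300))
```

Comparing the observed count with the expected count gives the direction of the effect, while the tail probability indicates how unlikely the overlap would be by chance.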
Functional and expression correlates of a single microRNA can be assessed using this module to enable a quick search involving multiple modules of mESAdb (Figure 1). Terms from GO, HUGE, KEGG and target genes associated with the given microRNA can be extracted; and the observed and expected counts as well as hypergeometric P-values can be downloaded. Expression profile of the selected microRNA also can be visualized using the aforementioned bar plots and downloaded as .txt files.
mESAdb is a highly interactive and flexible database with an ability to analyze and visualize selected expression profiles for a given subset of microRNAs in a multivariate manner using correspondence and co-inertia analyses. One can also study a single microRNA of interest using bar plots associated with a gene expression enrichment index, based on the ϕ-coefficient. This index provides a significance value for the relative enrichment of a microRNA(s) in a particular class with respect to others (Supplementary Data). Furthermore, the user can obtain information about the functional enrichment of a microRNA or a group of microRNAs using different databases, including GO, KEGG and HUGE Navigator.
The default expression data sets currently focus on tissue- and stage-specificity; however, users can add any microarray data containing other types of expression classes, e.g. cancer versus normal, treatment versus control (Figure 3). This allows for great flexibility in analyzing one’s own research data.
As an example, we demonstrate that the user can compare two data sets with respect to a list of microRNA clusters that are common to both mice and humans. Using the ‘expression–expression’ module of mESAdb, we chose a human (36) and a mouse (37) data set (Table 1) and uploaded a microRNA list (mESAdb supporting material; http://konulab.fen.bilkent.edu.tr/mirna/supplementary_files.php; the list included let-7a-i, mir-130a-b, mir-15a-b, mir-181a-b, mir-200a-b, mir-23a-b, mir-26a-b, mir-29a-c, mir-30a-d and mir-99a-b clusters). We then performed the coinertia analysis using only the tissues common to both data sets, namely, brain (B), liver (Li), lung (Lu), kidney (K) and heart (H) (Figure 4). Through coinertia analysis, mESAdb allows for comparison and visualization of two expression data sets by plotting them side by side in terms of the expression of selected microRNAs for the given tissues. Accordingly, we found that microRNAs in our list were expressed similarly in human and mouse data sets because the location of the projected tissues closely corresponded between the two plots (Figure 4). mESAdb also enables visualization of the expression of selected microRNAs from both data sets by simultaneously overlaying them on a two-dimensional plot. In this microRNA-oriented view, similarly expressed microRNAs are found closer in space. The analysis of our microRNA list indicated that several microRNAs formed clusters based on their expression, in particular, mir-181a and mir-181b, and mir-200a and mir-200b (Figure 5 and Supplementary Data). Indeed, mir-181a and mir-181b, which are similar in sequence and differ by only 3 nt, exhibit a common sequence motif (i.e. AACATTCA) in their first 8 nt. Similarly, mir-200a and mir-200b are similar in sequence and contain a common motif (i.e. TAA[C/T]ACTG) in their first 8 nt.
Using the ‘expression analysis’ module, miR-181a and miR-181b were found to be expressed primarily in the brain and lung (Figure 6a) whereas the miR-200a-b cluster was clearly expressed mostly in the kidney and lung both in mice and humans (Figure 6b). Our findings suggested that expression patterns of mir-181a-b and mir-200a-b were highly conserved between human and mice.
In conclusion, mESAdb focuses on providing a meta-analysis tool/database to enhance our understanding in an important field in microRNA biology, i.e. discovery of associations between microRNA sequence and expression. mESAdb is advantageous because it allows interactive analysis of selected subsets of microRNAs in addition to analysis of single microRNA types. Its modular and expandable nature makes mESAdb a unique and functional database for comparative analysis of microRNA sequence and expression.
mESAdb is freely available at http://konulab.fen.bilkent.edu.tr/mirna/. mESAdb is located on a Linux server (Apache/2.2.4; Ubuntu 8.04 LTS, Kernel: 2.6.24-24-server; PHP 5.2.3-1; R 2.11.1) equipped with four Intel® Xeon® CPU E5335, 2.00 GHz processors and 8 GB RAM. Microarray data sets incorporated into mESAdb, as well as the R code used in correspondence analysis, are available for download at the mESAdb site.
The modular nature of mESAdb allows for incorporation of additional data sets and statistical tools. Future extensions to mESAdb will include the addition of microarray data sets from GEO, particularly focusing on different aspects of human pathogenesis. The use of R packages enhances the modular nature of mESAdb; thus, future addition of statistical and visual tools for sequence/expression/function analysis of microRNAs is planned.
Supplementary Data are available at NAR Online.
The Scientific and Technological Research Council of Turkey (TUBITAK) and Bilkent University, Ankara. Funding for open access charge: Partially waived by Oxford University Press.
Conflict of interest statement. None declared.
We thank Alper Tolga Kocatas for help in optimizing MySQL queries for faster execution, Sergen Eren for proofreading microarray data set processing, Rengul Cetin-Atalay for providing rack space for the server and Michelle Adams for her helpful comments on the article.