Creating a word-cloud involves two main steps: generating the keywords to display, and displaying those keywords. Genes2WordCloud generates the keywords in several ways, depending on the source chosen. In each case the process can be divided into two main tasks: a) obtaining the text related to the user input (Figure ), and b) text-mining that text (Figure ). The text for generating word-clouds can be obtained for six different purposes (Figure ):
Fetching text for Genes2WordCloud. Text to display the word-clouds can originate from six sources. In some cases several steps are taken to convert the input selection to a body of text for further processing.
Text processing pipeline. The extracted text from the different options shown in Figure 1 is then processed by standard text mining algorithms. Several steps are taken to process the text for word-cloud display.
a) Obtaining information about a single gene or a set of genes.
The text for a single gene, or a list of genes, is extracted from several alternative sources: GeneRIF, Gene Ontology [5], PubMed abstracts, PubMed MeSH terms, or mammalian phenotype annotations from the Mouse Genome Informatics-Mouse Phenotype browser (MGI-MP) [6]. Each of these sources provides text that describes properties of genes. Given one or more gene IDs, the software extracts text about those genes from these sources.
b-c) Generating a word-cloud from a body of free text or from a given URL.
Free text or text extracted from a URL can also be used to generate word-clouds.
d-e) Generating a word-cloud from articles published by a specific author based on an author's name or from any PubMed search.
Given an author's name, a word-cloud is created from the PubMed abstracts returned for that author. Alternatively, the abstracts returned by any other PubMed search query can be used.
f) A word-cloud created from the most popular articles published in the journal BMC Bioinformatics.
All BMC and PLoS journals, including the journal BMC Bioinformatics, provide an updated list of the most viewed articles from a specific journal. Genes2WordCloud provides an option to generate word-clouds from a collection of the most popular abstracts of the journal BMC Bioinformatics.
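
For the PubMed-based options (d-e), one plausible way to retrieve abstracts is through the NCBI E-utilities. The text does not state which interface Genes2WordCloud actually uses, so the following Python sketch is an assumption, not the tool's implementation. It only builds the request URLs: an ESearch call that returns PubMed IDs for an author or free query, and an EFetch call that returns the abstracts for those IDs. The function names `build_esearch_url` and `build_efetch_url` are hypothetical.

```python
from urllib.parse import urlencode

# Base URL of the NCBI E-utilities. Using this service here is an
# assumption; the article does not say how abstracts are retrieved.
EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def build_esearch_url(term, author=False, max_results=150):
    """Build an ESearch URL that returns PubMed IDs for a query.

    When author is True the term is restricted to the author field.
    """
    if author:
        term = f"{term}[Author]"
    params = {
        "db": "pubmed",
        "term": term,
        "retmax": max_results,
        "retmode": "json",
    }
    return f"{EUTILS}/esearch.fcgi?{urlencode(params)}"


def build_efetch_url(pmids):
    """Build an EFetch URL that returns plain-text abstracts for PubMed IDs."""
    params = {
        "db": "pubmed",
        "id": ",".join(pmids),
        "rettype": "abstract",
        "retmode": "text",
    }
    return f"{EUTILS}/efetch.fcgi?{urlencode(params)}"
```

Calling `build_esearch_url("Smith J", author=True)` yields a URL whose `term` parameter is `Smith J[Author]` in URL-encoded form; the IDs in the response would then be passed to `build_efetch_url`.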
Whichever option is used, the text is limited to a maximum of 150 abstracts or 500 annotations; when a query returns more than these limits, the abstracts or annotations are picked at random from the results. Once bodies of text have been extracted from these alternative sources, the text is processed in several steps (Figure ).
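
The cap on abstracts and annotations described above amounts to random sampling without replacement. A minimal Python sketch follows; the limits of 150 and 500 come from the text, while the function name `cap_records` and the optional `seed` parameter are hypothetical and added only to make the example reproducible.

```python
import random

MAX_ABSTRACTS = 150    # limit stated in the text
MAX_ANNOTATIONS = 500  # limit stated in the text


def cap_records(records, limit, seed=None):
    """Return at most `limit` records, sampled at random if there are more.

    `seed` is an illustrative extra so that results can be reproduced.
    """
    records = list(records)
    if len(records) <= limit:
        return records
    rng = random.Random(seed)
    # Sampling without replacement: no record is picked twice.
    return rng.sample(records, limit)
```

For example, `cap_records(abstracts, MAX_ABSTRACTS)` returns the input unchanged when it holds 150 abstracts or fewer, and a random subset of 150 otherwise.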
The Porter stemming algorithm is used to reduce words such as "stem", "stems", and "stemming" to a single root, e.g., "stem". The identified root is not always a real English word. Therefore, to obtain readable word-clouds, after all the words have been stemmed, each stemmed word is replaced by the shortest word of its family.
In addition, some words are removed from the text entirely. First, all common English words such as "the", "is", or "are" are removed. Then common biological terms such as "experiments", "abstracts", and "contributes" are removed. These terms were chosen by hand curation after experimenting with many word-clouds, and users can continually refine this selection by suggesting words to be removed. The text from GeneRIF, Gene Ontology, and MGI-MP annotations is also processed to remove common terms. Finally, other terms such as the input gene names, the names of authors, or the keywords from PubMed searches are removed to avoid self-referencing.
Next, words are counted: their normalized occurrence provides the weight used by the WordCram applet to determine their size, position, and angle in the output word-cloud. In principle, WordCram starts drawing words in the center of the display and gradually fills the space with other words to maximize compactness. By default, words are placed starting at the center and are oriented horizontally or vertically, but options for wave, swirl, starting from the left, and a few other alternatives are available for locating words. In addition, heaped, mostly horizontal, and random angles are available as alternative word orientations.
Once the text has been extracted and processed, it is displayed as a word-cloud. Genes2WordCloud uses a word-cloud viewer based on the open-source Java package WordCram. Genes2WordCloud is implemented using Java, Processing, AJAX, MySQL, and PHP.
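
The text-processing steps described above (stemming, replacing each stem by the shortest word of its family, removing common and self-referencing terms, and computing normalized counts) can be illustrated with a short Python sketch. Several parts are assumptions: `toy_stem` is a crude suffix stripper standing in for the Porter algorithm, not an implementation of it; the stop-word and biological-term lists contain only a few sample entries; and dividing by the highest count is just one possible reading of "normalized occurrence". All function names are hypothetical.

```python
import re
from collections import Counter, defaultdict

# Tiny illustrative lists; the curated lists in the tool are much longer.
STOP_WORDS = {"the", "is", "are", "a", "of", "and", "in", "to"}
COMMON_BIO_TERMS = {"experiments", "abstracts", "contributes"}


def toy_stem(word):
    """Crude suffix stripper (NOT the Porter algorithm, only a stand-in)."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            word = word[: -len(suffix)]
            break
    # Collapse a doubled final consonant, e.g. "stemm" -> "stem".
    last = word[-1]
    if len(word) > 3 and last == word[-2] and last.isalpha() and last not in "aeiou":
        word = word[:-1]
    return word


def word_weights(text, self_terms=()):
    """Map each displayed word to a weight in (0, 1]."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())

    # Group surface forms by stem, then show each family as its shortest member.
    families = defaultdict(set)
    for tok in tokens:
        families[toy_stem(tok)].add(tok)
    display = {
        stem: min(forms, key=lambda w: (len(w), w))
        for stem, forms in families.items()
    }

    # Common words, common biological terms, and self-referencing terms
    # (e.g. input gene names) are excluded by comparing stems.
    excluded = STOP_WORDS | COMMON_BIO_TERMS | {t.lower() for t in self_terms}
    excluded_stems = {toy_stem(w) for w in excluded}

    counts = Counter(
        display[toy_stem(tok)]
        for tok in tokens
        if toy_stem(tok) not in excluded_stems
    )
    if not counts:
        return {}
    top = max(counts.values())
    # Assumed normalization: divide every count by the highest count.
    return {word: n / top for word, n in counts.items()}
```

The resulting dictionary of word weights is the kind of input a word-cloud renderer needs to choose font sizes; how WordCram maps weights to size, position, and angle is not covered by this sketch.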