GO data can be downloaded from the Gene Ontology website [1]. The data contains two sets of information that are used: the parent-child relationships for each node and the definitions of each node or term. The data collection and analysis techniques are described in this section.
The downloaded GO data is used to populate one table with GO IDs and the ID definitions, and another table with a description of relationships between the GO IDs, which can use terms such as is_a or part_of to define the relationships.
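As a minimal sketch of how the two tables described above could be populated, the following Python fragment splits OBO-style text into a term table and a relationship table. The OBO stanza layout, the function name parse_obo, and the sample GO IDs are assumptions made for illustration; the text does not specify the download format or the database schema, and the term name stands in here for the definition.

```python
def parse_obo(text):
    """Split OBO-style text into a term table and a relationship table.

    Returns (terms, relations): terms maps GO ID -> name, and relations is a
    list of (child_id, relation_type, parent_id) tuples.
    """
    terms = {}
    relations = []
    current = None
    in_term = False
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("["):
            # A new stanza begins; only [Term] stanzas are of interest.
            in_term = (line == "[Term]")
            current = None
            continue
        if not in_term or ":" not in line:
            continue
        key, _, value = line.partition(":")
        value = value.split("!")[0].strip()  # drop trailing "! comment"
        if key == "id":
            current = value
            terms.setdefault(current, "")
        elif current is None:
            continue
        elif key == "name":
            terms[current] = value
        elif key == "is_a":
            relations.append((current, "is_a", value))
        elif key == "relationship":
            rel_type, parent = value.split()[:2]
            relations.append((current, rel_type, parent))
    return terms, relations


# Hypothetical two-term sample; the IDs and names are made up.
SAMPLE = """
[Term]
id: GO:0000001
name: root term

[Term]
id: GO:0000002
name: child term
is_a: GO:0000001 ! root term
relationship: part_of GO:0000001 ! root term
"""

terms, relations = parse_obo(SAMPLE)
```

Here terms plays the role of the GO ID table and relations the role of the relationship table, with is_a and part_of stored as the relation type.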
In order to associate GO terms with gene IDs (accession), the gene2go files were retrieved from Entrez Gene for the human and mouse genomes. A similar dataset for D. melanogaster was acquired from FlyBase. Each gene can have multiple GO annotations, so this is a many-to-many association table.
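The many-to-many association can be sketched as a pair of lookup maps built from tab-delimited rows. This is an illustration only: the column order (tax_id, GeneID, GO_ID, ...) is assumed to follow the gene2go layout, the function name load_associations is hypothetical, and the gene IDs and GO IDs in the sample rows are made up.

```python
from collections import defaultdict


def load_associations(lines, tax_id):
    """Build gene -> GO terms and GO term -> genes maps for one organism."""
    gene_to_go = defaultdict(set)
    go_to_gene = defaultdict(set)
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip the header and blank lines
        fields = line.rstrip("\n").split("\t")
        if fields[0] != tax_id:
            continue  # keep only the requested organism
        gene, go_id = fields[1], fields[2]
        gene_to_go[gene].add(go_id)
        go_to_gene[go_id].add(gene)
    return gene_to_go, go_to_gene


# Hypothetical rows; 9606 and 10090 are the NCBI taxonomy IDs for human and mouse.
SAMPLE_ROWS = [
    "#tax_id\tGeneID\tGO_ID\tEvidence\tQualifier\tGO_term\tPubMed\tCategory",
    "9606\t111\tGO:0000001\tIEA\t-\tterm one\t-\tProcess",
    "9606\t111\tGO:0000002\tIEA\t-\tterm two\t-\tProcess",
    "9606\t222\tGO:0000001\tIEA\t-\tterm one\t-\tProcess",
    "10090\t333\tGO:0000001\tIEA\t-\tterm one\t-\tProcess",
]

gene_to_go, go_to_gene = load_associations(SAMPLE_ROWS, tax_id="9606")
```

Keeping both directions of the map makes it cheap to list the GO terms of a gene and the genes annotated to a GO term.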
A table, whose columns are shown in Table , is used to maintain node information and to carry out the statistical analysis. Each step of the method listed below fills in one of the columns. The columns are filled in the following order:
Columns in the GO statistics database table.
• Level (depth in a tree): A recursive depth-first search in a bottom-up fashion is carried out to determine the level of GO terms associated with the experiment, as explained in Figure .
Figure 5 Result of a GObar analysis of human genes with AT-AC-U12 type splice sites. The result of a GObar analysis is an SVG (scalable vector graphics) image, with a red path signifying branches that are disproportionately over-represented in the gene list.
• Number of trails up: This is obtained from the table of GO ID relationships by counting the number of parents for a node.
The following two items are calculated once at the beginning for all the genes in the genome, and again for each analysis of a gene list.
• BC: The bare count (BC) is the number of genes associated with each GO term (node).
• DC: The distributed count (DC) of a node is accumulated from the nodes below it. Starting from the lowest node(s) in the tree (determined by the Level column), the total count, BC + DC, is propagated to the node's immediate parents. If a node has more than one parent, the total count is divided by the number of trails up, which is the same as the number of parents.
Populating the tree with the reference dataset
We have populated the GO tree with datasets from Entrez Gene for human and mouse data, FlyBase [4] for fly data and SGD for yeast data. In the case of Entrez Gene, two sets of maps exist, a gene id map and a gene to gene id map. At the end of this process each GO node gets a list of genes. The term bare counts denotes the counts of genes at each node. The genes on children nodes also contribute to the counts on any given node; these contributions are tracked separately and are called distributed counts. Thus, the distributed count of a node is the sum of contributions of the nodes below it in the gene ontology hierarchy. Each node divides the sum of its bare count and its distributed count equally among its parents, adding one share to the distributed count of each parent. This process can be recursively applied, starting from the lowest levels (or greatest depths) of the tree and working up the tree.
If the accounting of distributed counts is to be done properly, defining the depth of each node in the tree is important. The rule for assigning depth to each node is that, if a node gets multiple levels, then the highest depth is always assigned to it. This can be done by picking the leaves of the tree (nodes with no children) and travelling recursively all the way up to the root (node with no parents). For each path to the top, depth is assigned to each node based on the number of steps to the node from the root. If a node already has a depth assigned to it, then the depth is replaced with the current depth only if it is bigger. This is explained in Figure . Once the leaves have been exhausted, all the nodes in the tree will have depths assigned to them.
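The depth rule can be illustrated with a short Python sketch. Rather than enumerating every leaf-to-root path as described above, it uses an equivalent memoized formulation: the depth of a node is one more than the largest depth among its parents, and a root has depth 0. The function name assign_depths and the toy graph are assumptions for illustration.

```python
def assign_depths(parents):
    """Return {node: depth}, where depth is the longest distance from a root.

    `parents` maps each node to the list of its parent nodes; a root has an
    empty list. When a node is reachable by several paths, the largest depth
    wins, matching the rule stated in the text.
    """
    depths = {}

    def depth(node):
        if node not in depths:
            ups = parents.get(node, [])
            depths[node] = 1 + max(depth(p) for p in ups) if ups else 0
        return depths[node]

    for node in parents:
        depth(node)
    return depths


# Hypothetical graph: C is reachable from the root directly and through A,
# so it receives the larger of the two candidate depths.
PARENTS = {
    "root": [],
    "A": ["root"],
    "B": ["root"],
    "C": ["A", "root"],
    "D": ["C", "B"],
}

depths = assign_depths(PARENTS)
```

In this example C gets depth 2 (through A) rather than 1 (directly under the root), and D gets depth 3.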
In order to calculate the distributed counts for each node, the list of nodes is ordered based on their depths. Starting from nodes with the highest depths the counts are propagated up, as described above, summing up the bare count and distributed count and partitioning the sum equally amongst all the parents. After exhausting the list of nodes, all the nodes should have a bare count and a distributed count assigned to them.
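A minimal sketch of this propagation step follows, assuming the depths have already been assigned. The function name distribute_counts, the node labels, and the counts are hypothetical; the toy graph has one node with two parents, so its total is split in half.

```python
def distribute_counts(parents, depths, bare):
    """Propagate counts upward, deepest nodes first.

    Each node splits (bare count + distributed count) equally among its
    parents. Returns {node: distributed count}.
    """
    dc = {node: 0.0 for node in depths}
    # Deepest first, so a node's own distributed count is complete before
    # it is passed on to its parents.
    for node in sorted(depths, key=depths.get, reverse=True):
        ups = parents.get(node, [])
        if not ups:
            continue  # a root has nowhere to propagate to
        share = (bare.get(node, 0) + dc[node]) / len(ups)
        for parent in ups:
            dc[parent] += share
    return dc


# Hypothetical example: n3 has two parents, so each receives half its total.
PARENTS = {"n0": [], "other": [], "n1": ["n0"], "n2": ["n0"], "n3": ["n0", "other"]}
DEPTHS = {"n0": 0, "other": 0, "n1": 1, "n2": 1, "n3": 1}
BARE = {"n0": 5, "n1": 4, "n2": 2, "n3": 6}

dc = distribute_counts(PARENTS, DEPTHS, BARE)
```

With these numbers the distributed count of n0 is 4 + 2 + 6/2 = 9, and the other parent of n3 receives the remaining 3.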
Populating the tree with the experimental dataset
In order to calculate probabilities for a given experimental dataset, we need to first populate the GO tree with the experimental dataset. A procedure identical to the one used in the previous section is implemented, resulting in a GO tree with just the dataset of interest on it.
Calculating the probabilities
Let BCi, DCi be the bare and distributed counts respectively at node i for the genomic dataset, and let bci, dci be the bare and distributed counts respectively at node i for the experimental dataset. Then, for Node 0 in Figure 8:
DC0 = (BC1 + DC1) + (BC2 + DC2) + (BC3 + DC3)/2 (3)
dc0 = (bc1 + dc1) + (bc2 + dc2) + (bc3 + dc3)/2 (4)
The following are defined for ease of notation:
N1 = (BC1 + DC1) (5)
N2 = (BC2 + DC2) (6)
N3 = (BC3 + DC3)/2 (7)
N0 = DC0 = N1 + N2 + N3 (8)
n1 = (bc1 + dc1) (9)
n2 = (bc2 + dc2) (10)
n3 = (bc3 + dc3)/2 (11)
n0 = dc0 = n1 + n2 + n3 (12)
Then, the probability that a dataset is a random selection from the genes in the genome is given by the Hypergeometric formula (explained above)
The expected value for n1 is given by
We define PD as the deviation of the count on node i from its expected value; it is given by
We use P and PD to prune the trees, as described in the next section.
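For orientation, the standard hypergeometric point probability and its mean can be sketched in the notation above, for node 1 under its parent node 0. This is a textbook form and not necessarily the exact expression used for P here (for example, a tail sum may be intended). It also assumes integer counts, so fractional distributed counts would first need rounding, which is an assumption of this sketch. The function names are hypothetical, and PD is not computed because its definition is not reproduced in this text.

```python
from math import comb


def hypergeom_pmf(N0, N1, n0, n1):
    """Probability of drawing exactly n1 genes of node 1 when n0 genes are
    drawn at random, without replacement, from the N0 genes under node 0,
    of which N1 belong to node 1."""
    return comb(N1, n1) * comb(N0 - N1, n0 - n1) / comb(N0, n0)


def expected_count(N0, N1, n0):
    """Mean of the hypergeometric distribution: n0 * N1 / N0."""
    return n0 * N1 / N0


# Hypothetical counts: 10 genomic genes under node 0, 4 of them under node 1,
# and an experimental list that places 5 genes under node 0, 2 under node 1.
p = hypergeom_pmf(N0=10, N1=4, n0=5, n1=2)
mean_n1 = expected_count(N0=10, N1=4, n0=5)
```

Here p evaluates to 120/252 and the expected count is 2, so a node holding exactly 2 of the 5 genes shows no deviation from expectation.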
Pruning the tree
Listing all the nodes of the GO tree for a given dataset is not very informative, especially if only a few nodes are populated or if a large number of GO terms are populated by a small number of genes. This also defeats the purpose of helping users narrow down the GO terms of interest.
A node can only be pruned if every node under it also satisfies the pruning condition. The tree is pruned using the following rules to make the viewing manageable:
1. If n0 < nc, stop traversing the tree; that is, do not show anything below such a node. The population cutoff nc can be set with the level of details option at step 4 on the GObar webpage, shown in Figure 3. This determines how low the population of genes in a node can go before it gets pruned. Less Detailed corresponds to a minimum of 6 genes, Detailed corresponds to a minimum of 3 genes and Very Detailed shows every node.
Figure 3 The front page of the GObar website. Selections are made for each step, and the list of genes is entered in the final step before launching the program. The pruning of the tree is controlled in step 4. A node can only be pruned if every node under it also satisfies the pruning condition.
2. Prune nodes that have P > Pth. The threshold Pth is arbitrarily set at 0.1.
3. If ni deviates significantly upward from <ni>, the path is highlighted in red. The cutoff for PD can be set using the deviation stringency option in step 4 on the webpage, shown in Figure 3. Less strict corresponds to a deviation cutoff value of 0.2, Strict corresponds to a cutoff of 0.5 and Very strict corresponds to a cutoff value of 0.8.
The pruning is done starting with leaves (nodes with no children) on the tree, and stops when it reaches a node that should not be pruned according to the rules above. Fine variations of the pruning conditions are not allowed as these do not offer useful biological information and make the tool difficult to use.
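The leaf-upward pruning can be sketched as follows. This is one possible reading, not the tool's actual code: rules 1 and 2 are combined into a single pruning condition (count below nc, or P above Pth), and a node is removed only if it and every node under it satisfy that condition. The function name visible_nodes, the node labels, and the numbers are hypothetical; the red highlighting of rule 3 is not modelled.

```python
def visible_nodes(children, count, p_value, root, n_cutoff=3, p_threshold=0.1):
    """Return the set of nodes that survive pruning.

    A node is prunable if its count is below n_cutoff or its P exceeds
    p_threshold. It is actually removed only when every node under it is
    removed as well.
    """
    decided = {}  # node -> True if kept (memoized for shared children)

    def prunable(node):
        return count[node] < n_cutoff or p_value[node] > p_threshold

    def visit(node):
        if node not in decided:
            # Evaluate all children first, so pruning proceeds from the leaves.
            below = [visit(child) for child in children.get(node, [])]
            decided[node] = any(below) or not prunable(node)
        return decided[node]

    visit(root)
    return {node for node, keep in decided.items() if keep}


# Hypothetical tree with made-up counts and probabilities.
CHILDREN = {"root": ["a", "b"], "a": ["a1", "a2"], "b": ["b1"]}
COUNT = {"root": 20, "a": 12, "a1": 8, "a2": 2, "b": 4, "b1": 4}
P_VALUE = {"root": 1.0, "a": 0.02, "a1": 0.01, "a2": 0.03, "b": 0.4, "b1": 0.3}

kept = visible_nodes(CHILDREN, COUNT, P_VALUE, root="root")
```

In this example the branch under b disappears entirely, a2 is dropped for having fewer than 3 genes, and the root is retained because a surviving node lies below it. The defaults correspond to the Detailed setting (minimum of 3 genes) and the Pth value of 0.1 given above.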
Visualizing the tree and user-interaction
] is used to create the layout of the GO tree, and scalable vector graphics (SVG
The tool also allows all the genes in the GO tree below any node to be downloaded. The downloaded list is a comma-separated values (CSV) file, which contains the gene, the GO terms for each gene and a short description.
In the downloaded list, the uploaded genes are highlighted, since the list will also contain genes that belong to the nodes but are not in the uploaded list.