Human and mouse transcript data
We obtained known and predicted transcripts from Ensembl (human build NCBI35, mouse build NCBIM33). Ensembl transcripts were filtered for repetitive sequence regions with RepeatBeater (graciously provided by Dr. Coward). Similarity searches between ESTs and Ensembl transcripts were conducted for each organism (human or mouse) with megablast [34]. ESTs that matched transcripts with less than 95% sequence identity or over fewer than 150 base pairs were rejected (timegablast parameters: -error 0.05 -required-length 150 -assemble-hsps).
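For illustration, the filtering step can be sketched as follows. This is a minimal sketch, not the pipeline itself: it assumes BLAST-style tab-separated output with percent identity in the third column and alignment length in the fourth, and the file name is hypothetical.

```python
# Sketch: reject EST-to-transcript matches below 95% identity or spanning
# fewer than 150 base pairs, as described above. Assumes BLAST-style
# tab-separated output (query, subject, % identity, alignment length, ...);
# the file name and column layout are illustrative assumptions.

MIN_IDENTITY = 95.0  # percent sequence identity
MIN_LENGTH = 150     # base pairs

def filter_matches(path):
    kept = []
    with open(path) as handle:
        for line in handle:
            fields = line.rstrip("\n").split("\t")
            est_id, transcript_id = fields[0], fields[1]
            identity = float(fields[2])
            align_length = int(fields[3])
            if identity >= MIN_IDENTITY and align_length >= MIN_LENGTH:
                kept.append((est_id, transcript_id))
    return kept

for est, transcript in filter_matches("est_vs_ensembl.megablast.tsv"):
    print(est, transcript)
```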
Mining of dbEST
Our EST mining pipeline is based on TissueInfo, previously described in [14]. Briefly, TissueInfo identifies ESTs that match a set of query sequences and provides curated data and analysis tools to query the set of matching ESTs for information about the tissues and organs from which the EST was sequenced. Curated data consists of a tissue ontology (formerly called TissueHierarchy). The TissueInfo ontology is provided in OWL format on the TissueInfo web site [35].
]. TissueInfo also provides tissue annotations that map most ESTs in dbEST to concepts in the TissueInfo ontology (formerly called curated tissue information). This mapping makes it possible to effectively mine data from multiple EST sequencing projects. Together, the ontology and the EST annotations provided by TissueInfo support the rich semantic queries needed to implement the cancer biomarker approach described here.
Cancer TissueInfo extensions
To mine dbEST for transcripts differentially expressed in cancer, we extended TissueInfo as follows: (1) We extended EST annotations to note whether the EST was sequenced from tumor tissue. For instance, if an EST library description is "cervical tumor", we annotate this EST with "+cervix, /cancer", to reflect the fact that the anatomical provenance of the tissue was the cervix, and that the tissue was tumorous. (2) We implemented ways to calculate hitTumor() and hitNonTumor() for each query sequence. The calculation hitTumor() returns the count of ESTs matching the query sequence whose tissue provenances have the /cancer attribute. Conversely, the calculation hitNonTumor() returns the count of matching ESTs whose tissue provenances lack the /cancer attribute. (3) We devised and implemented a P-value calculation method to allow ranking transcripts by the likelihood that they are differentially expressed in cancer tissues (see below).
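The two counts can be illustrated with a short sketch. The data structure (a list of annotation strings per query sequence) is an assumption for the example, not the TissueInfo implementation:

```python
# Minimal sketch of the hitTumor()/hitNonTumor() counts described above.
# Each matching EST is assumed to carry a tissue annotation string such as
# "+cervix, /cancer"; the data structure is illustrative, not TissueInfo's.

def hit_tumor(matching_est_annotations):
    """Count ESTs whose tissue provenance carries the /cancer attribute."""
    return sum(1 for ann in matching_est_annotations if "/cancer" in ann)

def hit_non_tumor(matching_est_annotations):
    """Count ESTs whose tissue provenance lacks the /cancer attribute."""
    return sum(1 for ann in matching_est_annotations if "/cancer" not in ann)

annotations = ["+cervix, /cancer", "+cervix", "+eye", "+cervix, /cancer"]
print(hit_tumor(annotations), hit_non_tumor(annotations))  # -> 2 2
```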
Cancer P-value calculations
As shown in the figure, grouping transcripts by the tissue in which they are most abundantly expressed (mostExpressedIn() calculation, see [14]) removes the variation of the ratio hitTumor()/hitNonTumor() observed across tissues. We define groups of transcripts G such that each transcript t in the group is mostly expressed in the same tissue (i.e., t.mostExpressedIn() is constant within a group G). For each transcript t in a group G, we define:
x = t.hitTumor()
y = t.hitNonTumor()
From these definitions, we can calculate the TissueInfo cancer P-value:
P-value_cancer = FisherTwoTail(x, y, N1, N2)
where FisherTwoTail is a two-tailed Fisher Exact Test, as described and implemented in [36].
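For orientation, a minimal sketch of this calculation with scipy is shown below. The section does not define N1 and N2; the sketch assumes they are the total tumor and non-tumor EST counts for the group G, so that the second row of the contingency table is (N1 - x, N2 - y).

```python
# Sketch of the two-tailed Fisher Exact Test above using scipy. The
# contingency table layout is an assumption: N1 and N2 are read here as the
# total tumor and non-tumor EST counts for the group.

from scipy.stats import fisher_exact

def cancer_p_value(x, y, n1, n2):
    _, p = fisher_exact([[x, y], [n1 - x, n2 - y]], alternative="two-sided")
    return p

# Example: a transcript with 30 tumor and 5 non-tumor EST hits, in a group
# totalling 200 tumor and 400 non-tumor EST hits.
print(cancer_p_value(30, 5, 200, 400))
```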
False discovery rate
The control experiment is designed to estimate the number of transcripts that are falsely marked as being differentially expressed between tumor and non-tumor tissue at different thresholds of the P-value. Each query transcript matches a number of ESTs that can come from a number of different libraries. In the control experiment, all non-tumor libraries are randomly split into two groups based on their raw tissue information. For instance, the raw tissue labels "whole eye", "eye", "eye anterior segment", "Pterygium" and "Trabecular meshwork" are all equivalent to +eye in the TissueInfo ontology; each raw tissue designation is randomly assigned to one group or the other at each step of the simulation. (Because there are a number of raw tissue designations for a given tissue type, and because groupings are redone at each simulation step, the grouping procedure is not expected to result in more homogeneous groups in the control than in the real experiment.) After this grouping, the number of ESTs matching a transcript in each group is counted. The P-value of the difference of the counts in the two library sets is then calculated as for the cancer P-value calculations above, grouping transcripts by the tissue in which they are most abundantly expressed. This control experiment was repeated 10,000 times. We counted the number of transcripts identified below a given P-value threshold for a genome and averaged this number over the number of simulations. This value is the estimated number of false positives (FP) observed in the control, and can be used to calculate the False Discovery Rate (FDR) as shown in the figures; when mouse and human data are combined, the approach presented in this manuscript generates fewer than 22% false positive predictions.
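One simulation step can be sketched as follows. This is an illustration, not the TissueInfo code; the dict-based data structure is an assumption.

```python
# Sketch of one step of the control experiment: raw tissue designations of
# non-tumor libraries are randomly reassigned to one of two groups, and the
# EST hits of a transcript are counted in each group. The dict-based data
# structure is an illustrative assumption.

import random

def split_and_count(hits_by_raw_tissue):
    """hits_by_raw_tissue maps a raw tissue label (e.g. 'whole eye',
    'Pterygium') to the number of matching ESTs carrying that label for
    one transcript. Returns the hit counts (x, y) in the two groups."""
    labels = list(hits_by_raw_tissue)
    random.shuffle(labels)
    group_a = set(labels[: len(labels) // 2])
    x = sum(n for lab, n in hits_by_raw_tissue.items() if lab in group_a)
    y = sum(n for lab, n in hits_by_raw_tissue.items() if lab not in group_a)
    return x, y

# The counts (x, y) are then tested with the same two-tailed Fisher Exact
# Test as the cancer P-value. Repeating the split 10,000 times and averaging
# the number of transcripts below a P-value threshold gives the estimated
# false positive count FP; the FDR at that threshold is FP divided by the
# number of transcripts reported at the same threshold in the real experiment.
```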
Gene list construction
Gene lists were obtained from the supplementary material of published articles whenever possible. We collected 13 gene lists from six microarray studies [7] and one meta-analysis study [11]. In addition to HM200, we evaluated smaller subsets of HM200 constructed by selecting the 10, 50, and 100 genes ranked highest by the TissueInfo cancer P-value. We called these lists HM10, HM50, and HM100. We constructed a gene list to be used as a negative control (NC01-2000) by selecting 420 genes with a TissueInfo cancer P-value between 0.9 and 1.0. Genes in NC01-2000 are not expected to be good predictors in cancer classification tasks. Finally, for each dataset, we considered the gene list that consists of all the probesets on that array (gene list: full). Accession code conversions (e.g., from Affymetrix to GenBank, or from SwissProt to Ensembl gene IDs) were done using EnsMart/BioMart [38]. The Rhodes2004 gene list was obtained by manually typing the HUGO IDs that appear in a figure of [11]. The figure contains 67 gene names, but only 57 of these mapped to HUGO gene names and could be mapped to Ensembl gene IDs (some genes could not be identified unambiguously by the name listed in the figure, for instance "p100" or "OK/SW-ol56"). Gene lists used for our evaluation can be found with the programs distributed as Supplementary Material to this manuscript.
The datasets used in our evaluation were obtained from the Gene Expression Omnibus [39], the Broad Institute web site [40], or the Tmm database at Columbia University [41]. Datasets were parsed and processed with the TissueInfo software (distributed under the GPL and available from [35]). A complete list of datasets is given in Supplementary Table 2 [see Additional file 3].
Microarray classification tasks
Supplementary Table 3 [see Additional file 4] lists the conditions and number of samples for each classification task for which a machine learning classifier was trained (see below).
A Support Vector Machine (SVM) is a modern machine learning algorithm [43] that has been found to outperform other machine learning approaches (e.g., artificial neural networks) in a variety of application domains [41]. Array signals were mean normalized across each condition of the array. After normalization, and following [33], low signals were set to a constant value. Specifically, for one-channel arrays, if the signal was lower than f1, the signal was set to f1 (we used f1 = 300 consistently throughout the evaluation). For two-color arrays, if abs((signal - 1) + 1) <= f2, the signal was set to 1.0 (we used f2 = 1.15 consistently throughout the evaluation). This signal transformation sets low signals to a constant value. When used with feature scaling (see below), this significantly reduced the impact of noisy features on the outcome of the predictions. The resulting signals were used as features of the SVM. Features were scaled to the range [-1,+1] to reduce the risk that features with large variance dominate the decision function of the SVM. Learning was performed with a linear kernel. Since we trained with small training sets (n < 50), we did not optimize the learning parameter C of the SVM. This parameter was set consistently for each dataset and classification task to the value (Σ_i x_i·x_i / n)^-1, where n is the number of training examples and x_i is the feature vector of example i. We measured the performance of the SVM with the leave-one-out protocol. This protocol provides a nearly unbiased estimate of the classification error and is tractable for small training sets [45]. We performed the machine learning evaluation with two different support vector machine implementations (Thorsten Joachims' SVMlight program [46] and libSVM). Results were qualitatively similar across implementations (data shown for the SVMlight evaluation).
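The protocol can be summarized in a short sketch using scikit-learn as a stand-in for the SVMlight/libSVM programs used in the study; only the one-channel transform (floor at f1 = 300) is shown, and the C setting follows the formula above.

```python
# Sketch of the evaluation protocol described above: floor low signals at
# f1, scale features to [-1, +1], train a linear-kernel SVM with
# C = (sum_i x_i.x_i / n)^-1, and measure leave-one-out accuracy. This is
# an illustration, not the study's SVMlight pipeline.

import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import LeaveOneOut

F1 = 300.0  # one-channel low-signal floor

def preprocess(signals):
    x = np.maximum(signals, F1)             # set low signals to f1
    lo, hi = x.min(axis=0), x.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)  # avoid division by zero
    return 2.0 * (x - lo) / span - 1.0      # scale each feature to [-1, +1]

def leave_one_out_accuracy(signals, labels):
    x = preprocess(np.asarray(signals, dtype=float))
    y = np.asarray(labels)
    c = 1.0 / np.einsum("ij,ij->i", x, x).mean()  # (avg x.x)^-1
    correct = 0
    for train, test in LeaveOneOut().split(x):
        clf = SVC(kernel="linear", C=c)
        clf.fit(x[train], y[train])
        correct += int(clf.predict(x[test])[0] == y[test][0])
    return correct / len(y)
```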
Similar to [32], we estimated the significance of the classifiers trained from each dataset/classification task. This test evaluates the likelihood that the machine learning performance observed on the training set results from correlations in the features unrelated to the class that the classifier is trained to predict. For microarray data, features encode information about the expression levels of genes, which may contain an experimental noise/error component. Other factors that can affect classification tasks are the size of the training set and the number of features. In this test, many artificial training sets are constructed by shuffling the labels of the reference training set. This does not affect the size of the training set or the number of features, but controls for random noise. Each artificial training set is used to train and evaluate a classifier (see Machine Learning above). The test counts how many classifiers built with artificial training sets achieve the same performance as, or better than, the reference training set. Dividing this count by the number of artificial training sets tested yields a P-value that indicates how dependent the performance measured for the reference training set is on experimental noise or feature distributions. Classifier P-values reported in this manuscript were obtained after constructing 1,000 artificial training sets for each dataset/classification pair.
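A minimal sketch of this test is shown below; it reuses the hypothetical leave_one_out_accuracy() function from the previous sketch.

```python
# Sketch of the label-shuffling significance test described above. The
# reference leave-one-out accuracy is compared against accuracies obtained
# after randomly permuting the class labels; the fraction of permuted runs
# doing at least as well is the classifier P-value.

import numpy as np

def classifier_p_value(signals, labels, n_permutations=1000, seed=0):
    rng = np.random.default_rng(seed)
    reference = leave_one_out_accuracy(signals, labels)
    at_least_as_good = 0
    for _ in range(n_permutations):
        shuffled = rng.permutation(labels)  # keeps set size and features
        if leave_one_out_accuracy(signals, shuffled) >= reference:
            at_least_as_good += 1
    return at_least_as_good / n_permutations
```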
Ranking gene lists
SVMs are trained for a classification task with different gene lists. This results in multiple accuracy measurements for each classification task (one accuracy for each gene list on a given classification task). To facilitate comparison of the predictive ability of gene lists, we rank gene lists by decreasing leave-one-out accuracy on each classification task. The gene list that obtains the best leave-one-out accuracy is assigned a rank of 1 (if several gene lists reach the same accuracy, they are given equal rank). The gene list with the second best accuracy is assigned a rank of 2, and so on. Supplementary Table 4 [see Additional file 5] lists the raw performance measures and the ranks for each combination of gene list and classification task in our evaluation.
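The tie handling described above (ties share a rank, and the next distinct accuracy takes the following rank) corresponds to what scipy calls dense ranking; a small sketch with illustrative accuracy values:

```python
# Sketch of the gene-list ranking described above: rank by decreasing
# leave-one-out accuracy, with tied accuracies sharing a rank ("dense"
# ranking). The accuracy values are illustrative.

from scipy.stats import rankdata

accuracies = {"HM200": 0.92, "HM100": 0.92, "Rhodes2004": 0.88, "full": 0.85}
names = list(accuracies)
ranks = rankdata([-accuracies[n] for n in names], method="dense")
for name, rank in sorted(zip(names, ranks), key=lambda p: p[1]):
    print(rank, name)  # -> 1 HM200 / 1 HM100 / 2 Rhodes2004 / 3 full
```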
Comparing gene lists
In the table, we use a Fisher Exact Test to determine which gene lists are more similar to HM200. The result of this test must be considered a pseudo P-value, because we do not control for correlations that may exist across the gene lists and would create artificially low P-values. A number of such correlations can be excluded because we always compare gene lists obtained from microarray data with gene lists obtained from expressed sequence tags (excluding the comparison between HM200 and NC01-2000). For instance, this protocol excludes correlations due to the ability of probes to hybridize to the array (since this effect does not apply to HM200). However, we cannot totally exclude the possibility that the overlap observed in the table is due to the level of expression that is necessary to detect gene expression with either microarray data or with our EST mining approach (this effect would reduce the total pool of genes that should be considered in the Fisher Exact Test). To acknowledge this effect, we report a pseudo P-value and use it as a gene list similarity measure. The pseudo P-value is useful because it takes into account the effect of different gene list sizes, while the number of overlapping genes does not.
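A minimal sketch of this similarity measure, assuming the lists are drawn from a pool of N candidate genes (the pool size and list contents are illustrative):

```python
# Sketch of the gene-list similarity measure described above: a Fisher
# Exact Test on the overlap between two gene lists drawn from a pool of
# pool_size candidate genes. Pool size and list contents are illustrative.

from scipy.stats import fisher_exact

def overlap_pseudo_p(list_a, list_b, pool_size):
    a, b = set(list_a), set(list_b)
    both = len(a & b)
    only_a = len(a) - both
    only_b = len(b) - both
    neither = pool_size - both - only_a - only_b
    _, p = fisher_exact([[both, only_a], [only_b, neither]])
    return p
```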
Description of Ingenuity Pathways Analysis
Data in Supplementary Table 6 [see Additional file 7] were generated through the use of Ingenuity Pathways Analysis, a web-delivered application that enables biologists to discover, visualize, and explore therapeutically relevant networks significant to their experimental results, such as gene expression array data sets. For a detailed description of Ingenuity Pathways Analysis, see [47].
Swiss-Prot accession codes for proteins coded by genes in the HM200 gene set were uploaded to Ingenuity Pathways Analysis. Each gene accession code was mapped to its corresponding gene object in the Ingenuity Pathways Knowledge Base. These genes, called Focus Genes, were then used as the starting point for generating biological networks. To start building networks, the application queries the Ingenuity Pathways Knowledge Base for interactions between Focus Genes and all other gene objects stored in the knowledge base, and generates a set of networks with a network size of 20 genes/proteins. Ingenuity Pathways Analysis then computes a score for each network according to the fit of the user's set of significant genes. The score is derived from a P-value and indicates the likelihood of the Focus Genes in a network being found together due to random chance. A score of 2 indicates that there is a 1 in 100 chance that the Focus Genes are together in a network due to random chance. Therefore, scores of 2 or higher have at least a 99% confidence of not being generated by random chance alone. Biological functions are then calculated and assigned to each network.
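The score description above (a score of 2 corresponds to a 1 in 100 chance) is consistent with the score being the negative base-10 logarithm of the network P-value; this reading is an inference from the text, not a formula stated in the source:

```python
# The score/P-value relationship implied above (score 2 <-> P = 0.01) is
# consistent with score = -log10(P); this is an inference from the text.
from math import log10

def network_score(p_value):
    return -log10(p_value)

print(network_score(0.01))  # -> 2.0
```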
Biological functions were assigned to each gene network by using the findings that have been extracted from the scientific literature and stored in the Ingenuity Pathways Knowledge Base. The biological functions assigned to each network are ranked according to the significance of that biological function to the network. A Fisher Exact Test is used to calculate a P-value determining the probability that the biological function assigned to that network is explained by chance alone.
GO category analysis
We searched GO functional categories for categories statistically enriched in HM200. We used the EASE software [22] with 189 SwissProt/UniProt identifiers (from HM200) and compared them to 17,915 human proteins in UniProt. The P-value reported is the EASE Score [22].
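For orientation, one common formulation of the EASE Score (a one-tailed Fisher Exact Test made conservative by removing one gene from the category overlap) can be sketched as follows; the category counts in the example are illustrative, while the totals 189 and 17,915 follow the text.

```python
# Sketch of the EASE Score idea: a one-tailed Fisher Exact Test for GO
# category enrichment, made conservative by removing one gene from the
# list/category overlap. Category counts are illustrative assumptions.

from scipy.stats import fisher_exact

LIST_TOTAL = 189    # SwissProt/UniProt identifiers from HM200
POP_TOTAL = 17915   # human proteins in UniProt

def ease_score(list_hits, pop_hits):
    """One-tailed Fisher P-value with one gene removed from the overlap
    (one common formulation of the EASE Score)."""
    table = [
        [max(list_hits - 1, 0), LIST_TOTAL - list_hits],
        [pop_hits - list_hits, (POP_TOTAL - LIST_TOTAL) - (pop_hits - list_hits)],
    ]
    _, p = fisher_exact(table, alternative="greater")
    return p

print(ease_score(list_hits=25, pop_hits=400))
```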