The spot intensities of the two channels (Cy3 and Cy5) on each microarray were individually normalized using the Qspline method (6) with a log-normal distribution as target (M = ln 1000, S = ln 1000). For the majority of the arrays, namely those for which the relative location of the spots was provided by SMD, the channels were further normalized to correct for spatial biases using a Gaussian smoother with σ = 0.8 (6). After adding a regularization background intensity of 100 to the normalized intensities, a log-ratio was calculated for each gene on each spotted array. The background value of 100 was chosen semi-empirically to make the spread of log-ratios independent of the spot intensities.
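The effect of the regularization background can be illustrated with a minimal sketch. The function name, the direction of the ratio (Cy5 over Cy3) and the use of the natural logarithm are assumptions made for illustration; only the background value of 100 comes from the text.

```python
import math

BACKGROUND = 100.0  # regularization background intensity quoted in the text


def regularized_log_ratio(cy5, cy3, background=BACKGROUND):
    """Log-ratio of two normalized channel intensities.

    The constant background is added to both channels first, which shrinks
    the log-ratios of weak (noisy) spots towards zero.
    Assumption: ratio taken as Cy5/Cy3 with a natural logarithm.
    """
    return math.log((cy5 + background) / (cy3 + background))


# The same two-fold difference gives a much smaller log-ratio for a dim spot
# than for a bright spot.
dim = regularized_log_ratio(20.0, 10.0)
bright = regularized_log_ratio(20000.0, 10000.0)
```

In this sketch a two-fold change at low intensity is strongly damped, whereas at high intensity the log-ratio approaches ln 2, which is the behaviour the regularization is meant to achieve.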
Currently, arrays have been included from six eukaryotes, namely Arabidopsis thaliana, Caenorhabditis elegans, Drosophila melanogaster, Homo sapiens, Mus musculus and Saccharomyces cerevisiae, as well as six prokaryotes, namely Bacillus subtilis, Campylobacter jejuni, Escherichia coli, Helicobacter pylori, Salmonella typhi and Vibrio cholerae. For each species, the log-ratios for all arrays were combined into a matrix, with a log-ratio of zero assigned to missing values. Because the same gene is not always referred to by the same name or identifier on all arrays, we used our gene synonyms resource to resolve these differences (http://www.bork.embl.de/synonyms/).
One problem when analyzing microarray data is that different arrays may be strongly correlated, e.g. replicate arrays, adjacent time points in time series or similar experiments performed by different laboratories. Singular value decomposition, a powerful method for dealing with such correlations, was used to obtain a new basis (the principal components) for which the covariance matrix of the data is diagonal. The principal components can be interpreted in a biological context by studying the loading factors and the classification of the arrays contributing most to each principal component. As the log-ratios of each array are already guaranteed to be centered at 0 due to the normalization procedure described above, the missing values have a minimal influence on the components. For the subsequent analysis, each gene was represented by its projection onto the principal component basis.
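The matrix construction and change of basis can be sketched as follows. This is a minimal illustration using NumPy's `numpy.linalg.svd`; the function names and the toy data are hypothetical, and genes are assumed to be rows and arrays columns.

```python
import numpy as np


def build_matrix(log_ratios, genes, arrays):
    """Genes x arrays matrix of log-ratios; missing entries are left at 0."""
    matrix = np.zeros((len(genes), len(arrays)))
    gene_index = {g: i for i, g in enumerate(genes)}
    array_index = {a: j for j, a in enumerate(arrays)}
    for (gene, array), value in log_ratios.items():
        matrix[gene_index[gene], array_index[array]] = value
    return matrix


def project_onto_components(matrix):
    """Project each gene onto the principal component basis.

    Returns the projections (one row per gene), the singular values and
    the components (rows of vt, i.e. loadings over arrays).
    """
    u, s, vt = np.linalg.svd(matrix, full_matrices=False)
    projections = matrix @ vt.T  # identical to u * s
    return projections, s, vt


# Toy data: arrays a1 and a2 are perfect replicates, g3 is missing on them.
log_ratios = {
    ("g1", "a1"): 1.0, ("g1", "a2"): 1.0,
    ("g2", "a1"): -1.0, ("g2", "a2"): -1.0,
    ("g3", "a3"): 0.5,
}
matrix = build_matrix(log_ratios, ["g1", "g2", "g3"], ["a1", "a2", "a3"])
projections, singular_values, components = project_onto_components(matrix)
```

In the toy example the two replicate arrays collapse into a single component, and the projected coordinates are mutually uncorrelated, which is the property the method relies on.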
Raw log-odds scores for a functional association between any two genes were calculated with reference to KEGG maps (7). Two genes are thus considered to be functionally related if their protein products co-occur in at least one KEGG map. Similarly, non-related genes were defined as genes with KEGG assignment but no shared assignment.
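The definition of related and non-related gene pairs can be expressed as a short sketch. The function name, gene names and map identifiers below are hypothetical placeholders, not actual KEGG data.

```python
from itertools import combinations


def classify_pairs(gene_to_maps):
    """Split gene pairs into related and non-related sets.

    gene_to_maps: dict mapping a gene to the set of maps it is assigned to.
    Genes without any assignment are excluded from both sets.
    """
    related, non_related = set(), set()
    annotated = sorted(g for g, maps in gene_to_maps.items() if maps)
    for a, b in combinations(annotated, 2):
        if gene_to_maps[a] & gene_to_maps[b]:
            related.add((a, b))      # co-occur in at least one map
        else:
            non_related.add((a, b))  # both annotated, nothing shared
    return related, non_related


# Hypothetical assignments for illustration only.
assignments = {
    "geneA": {"map_1", "map_2"},
    "geneB": {"map_2"},
    "geneC": {"map_3"},
    "geneD": set(),  # no assignment: ignored
}
related, non_related = classify_pairs(assignments)
```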
For each separate principal component, a two-dimensional Gaussian kernel density estimate, $f_{\mathrm{related}}(x_1, x_2)$, was calculated for pairs of related genes. Similarly, a one-dimensional Gaussian kernel density estimate, $f_{\mathrm{all}}(x)$, was calculated for all genes, allowing a log-odds score for the projection of a gene pair onto a principal component to be calculated as follows:

$$\text{log-odds}(x_1, x_2) = \log \frac{f_{\mathrm{related}}(x_1, x_2)}{f_{\mathrm{all}}(x_1)\, f_{\mathrm{all}}(x_2)}$$
A total log-odds score for a given pair of genes is calculated as the sum of log-odds scores from the first N principal components. To include only components with a good signal-to-noise ratio, for each species N was determined by visually inspecting a logarithmic plot of the singular values as a function of component number.
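The scoring step can be sketched in a few lines of NumPy. This is an illustration under stated assumptions: the per-component log-odds is taken as the log of the pair density over the product of the two single-gene densities, the kernels use a fixed bandwidth `h`, the pair density is symmetrized, and all function names are hypothetical. The bandwidth actually used is not stated in the text.

```python
import numpy as np


def gauss(d, h):
    """One-dimensional Gaussian kernel with bandwidth h."""
    return np.exp(-0.5 * (d / h) ** 2) / (h * np.sqrt(2.0 * np.pi))


def f_all(x, values, h):
    """1D kernel density estimate over the projections of all genes."""
    return gauss(x - values, h).mean()


def f_related(x1, x2, pairs, h):
    """2D product-kernel density estimate over related gene pairs.

    pairs has shape (n_pairs, 2); each pair is counted in both orders
    so that the estimate is symmetric in x1 and x2.
    """
    a, b = pairs[:, 0], pairs[:, 1]
    forward = (gauss(x1 - a, h) * gauss(x2 - b, h)).mean()
    reverse = (gauss(x1 - b, h) * gauss(x2 - a, h)).mean()
    return 0.5 * (forward + reverse)


def total_log_odds(p1, p2, projections, related_idx, n_components, h=0.5):
    """Sum of per-component log-odds over the first n_components.

    p1, p2: projections of the two genes; projections: genes x components;
    related_idx: integer array (n_pairs, 2) of row indices of related genes.
    """
    total = 0.0
    for k in range(n_components):
        values = projections[:, k]
        pairs = values[related_idx]
        odds = f_related(p1[k], p2[k], pairs, h) / (
            f_all(p1[k], values, h) * f_all(p2[k], values, h)
        )
        total += np.log(odds)
    return total
```

With training pairs whose projections are similar, a gene pair with matching projections receives a higher total score than a pair with opposite projections.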
Using raw log-odds scores causes certain genes, in particular those encoding cell cycle proteins, to have high log-odds scores to hundreds of other genes, many of which are not functionally related. We therefore down-weighted the log-odds score between two genes by the number of higher-scoring links for the most highly connected of the two genes. This way, we penalize links to the most highly connected genes, which improves the overall accuracy of the predicted associations considerably.
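One possible reading of this down-weighting is sketched below. The text does not give the exact functional form, so dividing the raw score by one plus the number of higher-scoring links is purely an assumption for illustration, as are the function name and the toy scores. The sketch is meant for positive scores.

```python
from collections import defaultdict


def down_weight(scores):
    """Down-weight raw log-odds scores of links to highly connected genes.

    scores: dict mapping a gene pair (a, b) to its raw log-odds score.
    ASSUMED form: score / (1 + n), where n is the number of higher-scoring
    links, taken for whichever of the two genes has more of them.
    """
    links_by_gene = defaultdict(list)
    for (a, b), score in scores.items():
        links_by_gene[a].append(score)
        links_by_gene[b].append(score)

    def higher_scoring(gene, score):
        return sum(1 for other in links_by_gene[gene] if other > score)

    return {
        (a, b): score / (1 + max(higher_scoring(a, score), higher_scoring(b, score)))
        for (a, b), score in scores.items()
    }


# Toy example: "hub" has three links, "p" and "q" have one each.
raw = {("hub", "x"): 6.0, ("hub", "y"): 4.0, ("hub", "z"): 2.0, ("p", "q"): 2.0}
weighted = down_weight(raw)
```

In the toy example the weakest hub link and the isolated link start with the same raw score, but only the hub link is penalized.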
For each species, we benchmarked the final predicted associations against co-occurrence on KEGG maps. Figure 2 shows the number and accuracy of our predictions for six model eukaryotes. At a specificity of 80%, more than 1000 true positive associations are predicted for 4 of the 6 eukaryotes and more than 10 000 true positives are predicted for S. cerevisiae alone. Assuming that this performance is representative of the performance for functionally uncharacterized proteins, ArrayProspector contains in excess of 200 000 correct functional associations.
Figure 2. Performance of the method. (A) shows for each species the number of predicted functional associations known to be correct versus specificity (also known as accuracy, i.e. the fraction of predictions being correct). Predictors are considered to be better ...
Calibration curves for converting the down-weighted log-odds scores to probabilistic confidence scores were obtained by fitting sigmoid functions to plots of specificity versus score. This calibration strategy is consistent with the one used for genomic context evidence in the STRING server (2).
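A calibration curve of this kind can be sketched as follows. The parametrization (a two-parameter logistic with an upper asymptote of 1) and the fitting procedure (linear least squares in logit space via `numpy.polyfit`) are assumptions chosen for brevity; the text does not specify how the sigmoid functions were fitted. The function names and the benchmark points are hypothetical.

```python
import numpy as np


def confidence(score, slope, midpoint):
    """Sigmoid mapping a down-weighted log-odds score to a confidence."""
    return 1.0 / (1.0 + np.exp(-slope * (np.asarray(score, float) - midpoint)))


def fit_sigmoid(scores, specificities, eps=1e-6):
    """Fit slope and midpoint of the sigmoid to (score, specificity) points.

    The specificities are logit-transformed, which turns the sigmoid into
    a straight line that is fitted by ordinary least squares.
    """
    y = np.clip(np.asarray(specificities, float), eps, 1.0 - eps)
    logit = np.log(y / (1.0 - y))
    slope, intercept = np.polyfit(np.asarray(scores, float), logit, 1)
    return slope, -intercept / slope


# Synthetic benchmark points generated from a known curve.
scores = np.linspace(0.0, 6.0, 13)
observed = confidence(scores, 1.5, 3.0)
slope, midpoint = fit_sigmoid(scores, observed)
```

Once fitted, `confidence(score, slope, midpoint)` converts any score into a value between 0 and 1 that can be read as the expected fraction of correct predictions at that score.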