Preprocessing of gene expression matrix
Prior to the search by PDNS, our method first applies a preprocessing step to transform the input data matrix
M to a Behavior Matrix
M'. This preprocessing step aims to highlight the trajectory patterns of genes. Indeed, according to [
43-
45], in microarray data analysis, genes are considered to be in the same cluster if their trajectory patterns of expression levels are similar across a set of conditions. Within the transformed matrix
M', each row represents the trajectory pattern of a gene across all the combined conditions while each column represents the trajectory pattern of all the genes under a pair of particular conditions in the data matrix
M. The whole matrix
M' provides thus useful information for the identification of relevant biclusters and the definition of a meaningful neighborhood of a local search algorithm.
Formally, the behavior matrix M' is constructed progressively by merging a pair of columns (conditions) from the input data matrix M. Since M has n rows and m columns, there is m(m−1)/2 distinct combinations between columns, represented by J''. So, M' has n rows and m(m−1)/2 columns. M' is defined as follows:
with
i ![[set membership]](/corehtml/pmc/pmcents/x2208.gif)
[1..
n],
l ![[set membership]](/corehtml/pmc/pmcents/x2208.gif)
[1..
J''],
k ![[set membership]](/corehtml/pmc/pmcents/x2208.gif)
[1..
m- 1],
q ![[set membership]](/corehtml/pmc/pmcents/x2208.gif)
[2..
m] and
q ≥
k + 1.
Figure shows an illustrative example. We can observe, by considering each row of M', the trajectory (or behavior) pattern of each gene through all the combined conditions, i.e., up (1), down (-1) and no change (0), of all rows (genes) over combined columns (combined conditions). Similarly, the combinations of all the paired conditions give useful information since a bicluster may be composed of a subset of non contiguous conditions. Our PDNS algorithm uses M' to define its search space as well as its neighborhood that is critical for the search process.
Pattern-driven neighborhood search for biclustering - general procedure
Our proposed PDNS method can be considered as an Iterated Local Search procedure [
46]. It alternates between two basic components: a descent-based improvement procedure and a perturbation operator. PDNS uses the descent procedure to discover locally optimal solutions and the perturbation operator to displace the search to a new starting point in an unexplored search region.
The key originality of PDNS concerns the use of bicluster pattern both in its search space and neighborhood definition. The bicluster pattern is a characteristic representation of a bicluster. It is used to evaluate genes/conditions of bicluster. This representation is defined by the behavior matrix of the bicluster, i.e., the trajectory patterns of the genes under all combined conditions of the bicluster. This representation is important because it is well recognized that in microarray data, genes are considered to belong to the same cluster if they have similar trajectory patterns of expression levels [
43-
45,
47].
Starting from an initial bicluster (call it current solution s), PDNS uses the descent strategy to explore the pattern-based neighborhood and moves to an improving neighboring solution at each iteration. By using the bicluster pattern, we define a set of rules which allow us to qualify the goodness (or badness) of a gene and condition. Using these rules (explained in a later section "Neighborhood and its exploration"), PDNS iteratively replaces within the current bicluster bad genes/conditions by good ones, thus progressively improves the quality of the bicluster under consideration. This iterative improvement procedure stops when the last bicluster attains a fixed quality threshold according to the ASR evaluation function (see next section) or when a fixed number Y of iterations is reached. At this point, PDNS triggers a perturbation phase by replacing randomly 10% of genes and conditions of the recorded best bicluster found so far. This perturbed bicluster is used as a new starting point for the next round of the descent search.
The whole PDNS algorithm stops when the best bicluster is not updated for a fixed number Z of perturbations. The general PDNS procedure is described in Figure . We describe in the following sections the ingredients of PDNS.
The ASR evaluation function
Many functions exist for bicluster evaluation. One of the most popular evaluation functions is the
Mean Squared Residue (MSR) [
24]. It has been used by several biclustering algorithms [
33,
38,
42,
48-
51]. However, MSR is deficient to assess correctly the quality of certain types of biclusters like multiplicative models [
30,
33,
52,
53].
In this paper, we use the Average Spearman's Rho (ASR) function which avoids the drawback of MSR [
54]. Let
(I', J') be a bicluster in a data matrix
M(I, J), the ASR evaluation function is then defined by:
where
ρij (
i ≠ j) is the Spearman's rank correlation [
55] associated with the row indices
i and
j in the bicluster (
I', J'),
ρkl (
k ≠ l) is the Spearman's rank correlation associated with the column indices
k and
l in the bicluster (
I', J'). According to this definition, ASR(
I', J')
![[set membership]](/corehtml/pmc/pmcents/x2208.gif)
[-1..1].
A high (resp. low) ASR value, close to 1 (resp. close to -1), indicates that the genes/conditions of the bicluster are strongly (resp. weakly) correlated.
Let us notice that the existing evaluation functions can roughly be classified into two families: numerical measures and qualitative measures. Numerical measures, like Pearson's correlation or Euclidean distance, are easy to compute but they are quite sensitive toward outliers and noise. Qualitative measures, like measures that consider only ups, downs and no change of conditions, are very sensitive to precise the values of changes. As ASR is based on Spearman's rank correlation it can be considered as a good compromise between numerical and qualitative measures.
Configuration representation
PDNS uses a solution representation based on the behavior matrix M' obtained from the preprocessing step described previously. More precisely, given a bicluster B = (I', J'), we encode the bicluster by its behaviour matrix s = (I',K) which is the sub-matrix of M' including only the set of genes in I' and all the combinations of paired conditions in J' (see example of Figure ). It is clear that s has the same rows as B, its number K of columns is equal to |J'|(|J'| - 1). In the rest of this paper, s is called a configuration (or solution). As it is shown below in Section "Neighborhood and its exploration", such a configuration representation enables the definition of dedicated move operators to improve progressively the quality of the generated biclusters.
Initial solution
Our algorithm needs an initial bicluster to start its search. The initial bicluster can be provided by any means. For instance, this can be done randomly with a risk of starting with an initial solution of bad quality. A more interesting strategy is to employ a fast greedy algorithm to obtain rapidly a bicluster of reasonable quality. We use this strategy in this work and adopt two well-known algorithms: one is presented by Cheng and Church [
24] and the other is called OPSM which is introduced in [
29]. As explained above, each initial bicluster is encoded into its behavior matrix before being improved by PDNS.
Neighborhood and its exploration
The neighborhood is one of the most critical elements of any local search algorithm. The neighborhood can be defined by a move operator. Given a solution
s, let
mv be the move operator that can be applied to
s. Then each application of
mv transforms
s into a new solution
s'. This is typically denoted by
s' ←
s
mv.
In this paper, we devise two specially designed move operators operating respectively on rows (genes) and columns (combinations of pairwise conditions) of a given solution. Both operators are based on the general drop/add operation which removes some elements and adds new elements in the given solution. The critical issue here is the criterion that is employed to determine the elements to be removed and added. In our case, this decision is based on the "behavior pattern".
Our first move operator, denoted by
mvg, performs changes by removing a number of rows (genes) of the bicluster and adding other genes in order to obtain more coherent biclusters. Let
s = (
I', K) be a solution, we first extract from the behavior matrix
M' the associated sub-matrix

. Let
R and
C denote respectively the index set of rows and columns of

. From

we build the bicluster pattern
P of
s which is defined by a vector indexed by
C.
P[
j],
j
C, takes the dominating value
k ![[set membership]](/corehtml/pmc/pmcents/x2208.gif)
{1, 0, -1} such that
k has the highest appearances in the column
i of

(see example of Figure ).
Now for each gene
gi,
i
R of the solution
s, we define the quality of
gi as the percentage of concordances between the behavior pattern of
g and the behavior pattern
P of bicluster
s. Let α be a fixed quality threshold of genes. Let
D denote the set of bad genes of
s such that their quality does not reach the quality threshold fixed by α. Let
G denote the set of good genes missing from
s such that their quality surpasses the quality threshold α. Then our first move operator
mvg removes from
s all the bad genes of
D and adds a number of genes selected from
G.
Figure shows an example where one bad gene (g4) is deleted and one good gene (g10) is added. g4 is bad because its behavior pattern has a low concordance with the bicluster behavior pattern (only 50% which is inferior than the quality threshold α = 70%). Similarly, g10 is good because its quality (83%) is higher than α. This replacement increases thus the coherence of the resulting bicluster. In the general case, the number of deleted gene may differ from the number of added genes. Notice that this move operator does not change the columns of the solution.
Our second move operator, denoted by mvc, performs changes by removing a number of columns (combined conditions) and adding other columns in order to obtain more coherent biclusters. Similar to the first move operator, mvc uses a quality threshold β for each column. The quality of each column is defined as the percentage of concordances between the column pattern and the value of this column in the bicluster pattern.
Then, when our second move operator mvc detects a bad condition from the current bicluster, we test if the dominating value of each condition of the current bicluster has the same value with the corresponding value in the bicluster pattern. If it is different, this condition is considered bad (and removed from the current bicluster). To add a good condition from the current bicluster, we select a condition under the same subset of genes from the "behavior matrix" M' which has a dominating value higher than a fixed threshold β. Notice that this move operator does not change the rows of the solution (see example of Figure ). In the general case, the number of deleted columns may differ from the number of added columns at each application of this move operator.
For a given solution, our PDNS algorithm applies these two move operators to reach a local optimum s (with an ASR value higher than the fixed threshold_ASR threshold). This local optimum solution s is composed of a group of genes and columns, each column representing the trajectory pattern of two conditions across the group of genes. Among the combinations of conditions in s, some conditions may be combined with only a few other conditions. These conditions are in fact insignificant conditions for the extracted bicluster. For this reason, during the decoding process (transforming s into a bicluster B), we retain only conditions which are combined with at least 50% other selected conditions. For instance, if we have s = {(g1, g2, g3, g4); (c1c2, c1c3, c1c4, c2c3)}, condition c4 will not be kept in the final bicluster because it is not combined at least with 50% of the other conditions, i.e., c2 and c3. The bicluster obtained is thus B = {(g1, g2, g3, g4); (c1, c2, c3)}.