In the following, we outline the use of FCA to reason about similarity among a set of diseases. We start by addressing what it means for two diseases to be similar by sharing a molecular mechanism, and discuss how we can approach this with FCA. Then we consider the renal disease data set from Bhavnani et al., first looking at the concept lattice to assess similarity and, second, focusing on a sublattice indicated by the structure of the lattice to consider its suitability for further analysis.
Network Dependence:
As discussed above, the definition of similarity that has been used [2, 3, 4, 9] is that either the sets of genes overlap or there is some structural connection in a network (e.g., a shared edge in a PPI network, or co-occurrence in a cell-signaling feedback loop). Here we give a sketch of why this is a reasonable definition, assuming that we can define a global graph showing how gene products interact in cellular systems. We let $\Gamma = (G, E)$ be this (simple) graph, where $G$ is the set of all genes, and an edge indicates that gene products interact or are closely involved in a biochemical event. In this setting, two diseases share mechanism if the sets of involved genes, $A_1, A_2 \subseteq G$, determine subgraphs $\Gamma(A_1)$, $\Gamma(A_2)$ that are non-independent. (We assume these subgraphs are connected.)
Our problem is analogous to deciding whether two vector spaces $V_1$, $V_2$ are independent, which is precisely when $\dim(V_1 + V_2) = \dim V_1 + \dim V_2$, where the dimension of a vector space $V$ is the number of vectors in a basis of $V$. The analogue of a vector space for graphs is a graphic matroid [26] of a graph, where a basis is a spanning tree of the graph. And, the analogue of vector space dimension is the matroid rank $\rho$, which is the number of edges in a spanning tree of the graph. For the (connected) subgraph $\Gamma(A)$ determined by the gene set $A \subseteq G$, the rank is $|A| - 1$.
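As a small illustration of the rank computation, the following Python sketch computes the graphic matroid rank of an induced subgraph as the number of vertices minus the number of connected components, using a simple union-find. The function name `induced_rank` and the toy gene names are hypothetical, introduced only for this example.

```python
def induced_rank(genes, edges):
    """Rank of the graphic matroid of the subgraph induced by `genes`.

    This equals the number of edges in a spanning forest, i.e.
    (number of vertices) - (number of connected components).
    """
    parent = {g: g for g in genes}

    def find(g):
        # Walk to the component representative, halving the path as we go.
        while parent[g] != g:
            parent[g] = parent[parent[g]]
            g = parent[g]
        return g

    rank = 0
    for u, v in edges:
        if u in parent and v in parent:   # keep only edges inside `genes`
            ru, rv = find(u), find(v)
            if ru != rv:                  # edge joins two components
                parent[ru] = rv
                rank += 1
    return rank


# Toy interaction graph (hypothetical genes g1..g5).
edges = [("g1", "g2"), ("g2", "g3"), ("g1", "g3"), ("g4", "g5")]
connected = induced_rank({"g1", "g2", "g3"}, edges)          # connected: |A| - 1
split = induced_rank({"g1", "g2", "g4", "g5"}, edges)        # two separate trees
```

For the connected gene set the result is |A| − 1 = 2; for the second set, which induces two disjoint edges, the rank is the sum of the two tree ranks.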
So, our problem is actually determining whether $\rho(\Gamma(A_1 \cup A_2)) = \rho(\Gamma(A_1)) + \rho(\Gamma(A_2))$. Since the right-hand side is $|A_1| + |A_2| - 2$, this happens only when $\Gamma(A_1 \cup A_2)$ is spanned by a forest of two disconnected spanning trees. Otherwise, if there were a single spanning tree, the matroid rank would be either one larger or at least one smaller. Therefore, non-independence occurs when the subgraph $\Gamma(A_1 \cup A_2)$ has a connected spanning tree. This can occur because the gene sets are not disjoint, $A_1 \cap A_2 \neq \emptyset$, and/or there is at least one edge $(g_1, g_2) \in E$ where $g_1 \in A_1$ and $g_2 \in A_2$. This is precisely the condition used in the earlier papers.
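The resulting condition (overlapping gene sets, or at least one edge joining them) can be checked directly. The sketch below is a minimal illustration, assuming the network is given as a list of undirected gene pairs; `shares_mechanism` and the toy data are hypothetical names, not part of the original analysis.

```python
def shares_mechanism(a1, a2, edges):
    """True if the gene sets overlap or some network edge joins a1 to a2."""
    a1, a2 = set(a1), set(a2)
    if a1 & a2:                      # non-empty intersection
        return True
    # Otherwise look for an undirected edge with one endpoint in each set.
    return any(
        (u in a1 and v in a2) or (u in a2 and v in a1)
        for u, v in edges
    )


# Toy network (hypothetical genes).
edges = [("g1", "g2"), ("g2", "g3"), ("g4", "g5")]
overlap = shares_mechanism({"g1", "g2"}, {"g2", "g3"}, edges)   # shared gene g2
bridged = shares_mechanism({"g1"}, {"g2", "g3"}, edges)         # edge g1-g2
separate = shares_mechanism({"g1", "g2"}, {"g4", "g5"}, edges)  # no connection
```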
For us, this means that we cannot use formal concepts directly on the gene-disease associations and be sure that we have a complete picture of similarity, because we only deal with intersections of the sets of associated genes. We can handle this by extending each gene set $A_i$, $i = 1, 2$, by the genes $\mathcal{N}(A_i) \cap A_j$, $i \neq j = 1, 2$, corresponding to the overlap of the other gene set with neighbors of the gene products in some network representing molecular interactions. So, in defining our context for FCA, we can extend the annotated genes for each disease in this way. In our analysis of the renal disease example, we extend the annotated genes by neighbors in the MiMI PPI database [25].
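The extension step can be sketched as follows. This is an illustrative version only: it assumes the interaction network is a list of undirected pairs, and it generalizes the two-disease description to several diseases by taking, for each disease, the network neighbors of its genes that are annotated to any other disease. The names `neighbors` and `extend_gene_sets` are hypothetical.

```python
def neighbors(genes, edges):
    """Genes adjacent to any member of `genes` in an undirected network."""
    near = set()
    for u, v in edges:
        if u in genes:
            near.add(v)
        if v in genes:
            near.add(u)
    return near


def extend_gene_sets(gene_sets, edges):
    """Extend each gene set A_i by N(A_i) ∩ A_j over the other diseases j."""
    extended = {}
    for i, a_i in gene_sets.items():
        near = neighbors(a_i, edges)
        # Union of the gene sets annotated to every other disease.
        others = set().union(*(a_j for j, a_j in gene_sets.items() if j != i))
        extended[i] = set(a_i) | (near & others)
    return extended


# Toy data (hypothetical): g1 interacts with g3; g5 is annotated to no disease.
edges = [("g1", "g3"), ("g2", "g5")]
gene_sets = {"d1": {"g1", "g2"}, "d2": {"g3", "g4"}}
extended = extend_gene_sets(gene_sets, edges)
```

After the extension, the two originally disjoint gene sets intersect in the genes joined by the network edge, so the relationship becomes visible to an analysis based only on intersections.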
Disease Dependence:
Having reduced the problem of deciding similarity to inspecting intersections of gene sets associated with diseases, we can stay completely within the concept lattice to find relationships among them. In particular, we want to find families of diseases that are maximal in the sense that if we add another disease, the set of shared genes becomes substantially smaller. As an example, suppose we have three diseases m1, m2, m3, where m1 and m2 share a large proportion of their associated genes, m3 shares relatively few with each of m1 and m2, and few or none with both. In this scenario, the gene set cardinalities drop significantly when subconcepts involving m1 and m2 are formed by adding m3. In this sense, m3 delineates the sublattice of superconcepts of the concept with intent {m1, m2}.
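A toy version of this three-disease scenario makes the drop in extent size concrete. The gene sets below are invented for illustration, and `extent` is a hypothetical helper that returns the genes associated with every disease in a given set.

```python
def extent(context, diseases):
    """Genes associated with every disease in `diseases`."""
    sets = [context[d] for d in diseases]
    # With no diseases given, every gene in the context qualifies.
    return set.intersection(*sets) if sets else set().union(*context.values())


# Hypothetical context: disease -> associated genes.
context = {
    "m1": {"g1", "g2", "g3", "g4", "g5"},
    "m2": {"g1", "g2", "g3", "g4", "g6"},
    "m3": {"g4", "g7", "g8"},
}
pair = extent(context, ["m1", "m2"])            # genes shared by m1 and m2
triple = extent(context, ["m1", "m2", "m3"])    # genes shared by all three
ratio = len(triple) / len(pair)                 # size drop when m3 is added
```

Here m1 and m2 share four genes, but only one of them remains once m3 is added, so the extent shrinks to a quarter of its size.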
In general, if we want to identify these maximal families, we need to find the concepts that have subconcepts with dramatically smaller extent. This can be done by traversing the concept lattice from the coatoms (concepts covered by top), and visiting subconcepts with maximal extent looking for significant drops in the size of the extent. The following heuristic uses the ratio of extent size from subconcept to superconcept to identify these transitions, testing against a threshold θ.

 1  let C ← ∅
 2  let P ← coatoms
 3  while P ≠ ∅ do
 4      select p ∈ P
 5      let c ← arg max_{(A,B) ≺ p} |A|
 6      if extent(c) / extent(p) ≤ θ then
 7          let C ← C ∪ {p}
 8      end if
 9      let P ← P ∪ {c}
10  end while
When complete, the set C contains the concepts representing the strongest families in the lattice. By visiting only the largest subconcepts, the heuristic generally avoids enumerating the full lattice. We can further bound the time required by adding a condition on the minimum extent to step 9.
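The heuristic can be sketched in Python as below. This is an illustrative version under several assumptions: the context is a small dictionary from disease to gene set; the lattice is built by brute force over disease subsets (so, unlike the heuristic described above, it does enumerate all concepts first); selecting p is taken to also remove it from P; and the function names and the `min_extent` parameter are hypothetical.

```python
from itertools import combinations


def concepts(context):
    """Map each concept extent to its intent (brute force over disease subsets)."""
    diseases = sorted(context)
    all_genes = frozenset().union(*context.values())
    lattice = {}
    for r in range(len(diseases) + 1):
        for combo in combinations(diseases, r):
            # Extent: genes shared by every disease in the subset.
            ext = all_genes.intersection(*(context[d] for d in combo))
            # Intent: every disease whose gene set contains the extent.
            lattice[ext] = frozenset(d for d in diseases if ext <= context[d])
    return lattice


def strongest_families(context, theta=0.5, min_extent=1):
    """Intents of concepts whose largest subconcept keeps <= theta of the extent."""
    lattice = concepts(context)
    top = max(lattice, key=len)
    # Coatoms: proper subconcepts of top with nothing strictly in between.
    coatoms = [e for e in lattice
               if e < top and not any(e < f < top for f in lattice)]
    pending, seen, families = list(coatoms), set(coatoms), []
    while pending:
        p = pending.pop()                    # select p and remove it from P
        below = [e for e in lattice if e < p]
        if not below:
            continue                         # p is the bottom concept
        c = max(below, key=len)              # subconcept with the largest extent
        if len(c) / len(p) <= theta:
            families.append(lattice[p])
        if c not in seen and len(c) >= min_extent:
            seen.add(c)
            pending.append(c)                # add c to P, bounded by extent size
    return families


# Hypothetical context: disease -> associated genes.
context = {
    "m1": {"g1", "g2", "g3", "g4", "g5"},
    "m2": {"g1", "g2", "g3", "g4", "g6"},
    "m3": {"g4", "g7", "g8"},
}
families = strongest_families(context, theta=0.5)
```

On this toy context the concept with intent {m1, m2} is reported, because its largest subconcept retains only one of its four genes, while the concepts for m1 and m2 alone are not, since each keeps four of its five genes in that subconcept.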
To quantify the similarity among the discovered families of diseases, we use the Jaccard coefficient, defined as $\left|\bigcap_{A \in \mathcal{A}} A\right| / \left|\bigcup_{A \in \mathcal{A}} A\right|$ for each family $\mathcal{A}$. We can also substitute the union of superconcept extents into the denominator as an alternative measure of the relationship strength.
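A minimal sketch of the Jaccard coefficient for a family of diseases follows, assuming each disease is represented by its (extended) gene set. The function name `jaccard` and the toy context are hypothetical.

```python
def jaccard(family, context):
    """Size of the shared gene set divided by the size of the combined gene set."""
    sets = [context[d] for d in family]
    shared = set.intersection(*sets)
    combined = set.union(*sets)
    return len(shared) / len(combined) if combined else 0.0


# Hypothetical context: disease -> associated genes.
context = {
    "m1": {"g1", "g2", "g3", "g4", "g5"},
    "m2": {"g1", "g2", "g3", "g4", "g6"},
    "m3": {"g4", "g7", "g8"},
}
strong = jaccard(["m1", "m2"], context)        # 4 shared genes out of 6
weak = jaccard(["m1", "m2", "m3"], context)    # 1 shared gene out of 8
```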
Similarity of Renal Diseases:
We now consider the renal disease data set from Bhavnani et al., starting with the context of all 747 genes extended by PPI neighbors as objects, the seven diseases as attributes, and the incidence relation determined by whether the gene is significantly up- or down-regulated in the disease. Applying the heuristic (with θ = 1/2) to this lattice finds ten concepts (listed in Table 2 and highlighted in the corresponding lattice figure) representing the most strongly related disease families, primarily involving DN, FSGS, IgAN, MGN and SLE. The apparent relationships revealed by the lattice correspond to what we would expect, since these diseases share essential clinical and pathophysiological features (a degree of tubulointerstitial damage secondary to proteinuria driven by failure of the glomerular filtration barrier). However, as noted by Bhavnani et al., both MCD and TMD have small subject counts and, as a result, few significant regulatory associations. So, we cannot rule out that this is why they appear relatively independent in the lattice.
| Table 2. Renal disease families identified by the heuristic with θ = 1/2. |
Focusing on a Sublattice:
The second largest concept, FSGS-IgAN-MGN-SLE, is interesting because the extent is 85% of the union of the extents of its super-concepts, meaning the associated genes are relatively well preserved in the intersection. This concept has 133 genes in its extent, while the largest extent of its subconcepts (the one with DN) has only 35 genes. Note that the sublattice above this concept represents the same data set that Bhavnani et al. focus on in their final analysis, as they drop DN, MCD and TMD for gene set size issues. In our case, the extended gene set for DN has 140 genes, and so could have a stronger overlap with these four diseases than it does. The role that DN plays in delineating this sublattice may be worth evaluating, but we will primarily study the role of IgAN.
There are two things to observe about how IgAN fits within the selected sublattice. First, the concepts involving IgAN in the selected sublattice all have roughly the same number of genes. This suggests that the genes initially identified with IgAN are also common to the three other diseases, since the sizes of the extents change only slightly for concepts intersecting IgAN with these diseases. Second, MGN, FSGS and SLE have a large set of genes in common, but this set is not shared with IgAN. In this way, IgAN helps identify the concept FSGS-MGN-SLE by the heuristic criteria: the concepts that include IgAN at the same level and the immediate subconcept all have more than 100 fewer genes. These observations suggest two questions: (1) what characterizes the commonality between IgAN and the other three diseases, and (2) what characterizes the genes common among FSGS, MGN and SLE, but not associated with IgAN? For these questions, we use more traditional set enrichment analysis to help understand the constructed gene sets.
Interpreting Gene Sets:
Performing enrichment analysis using Genomatix GePS (www.genomatix.de) on the 133 genes common among FSGS, IgAN, MGN and SLE, we see that the genes are enriched for the terms extracellular matrix, regulation of biological process, glomerulonephritis, and mesangial and epithelial cells and fibroblasts (Tables 3–5, left column). These annotations nicely summarize the key biological processes and cell lineages known to be activated in progressive kidney disease irrespective of the underlying disease category; in other words, features shared by all of these diseases.
| Table 3. Enriched GO molecular function and cellular component categories for FSGS-IgAN-MGN-SLE and FSGS-MGN-SLE-specific gene sets. |
| Table 5. Enriched disease and tissue categories for FSGS-IgAN-MGN-SLE and FSGS-MGN-SLE-specific gene sets. |
On the other hand, the set of 120 genes specific to the FSGS-MGN-SLE concept (formed by subtracting out the gene set for the FSGS-IgAN-MGN-SLE concept) shows a significant enrichment for MHC I and II molecules, along with the terms inflammation, immune response, and antigen presentation and processing (Tables 3–5, right column). This indicates a significant presence of infiltrating cells in the renal tissue (presumably from the macrophage lineage). This is well known for SLE, but had not been described for MGN and FSGS in the past. Interestingly, according to the lattice structure, IgAN appears not to be as prominently affected by this interstitial inflammatory process. This is an interesting finding, as both SLE and IgAN are diseases characterized by a primary glomerular immune process, but SLE can show aggressive interstitial infiltrates, which would explain the specific enrichment profiles observed above. Ongoing studies by our group are currently characterizing the specific expression profiles generated by intrarenal macrophages in humans and mice in SLE.
Adding Regulation:
One of the nice aspects of FCA is that it allows us to add attributes to our analysis within the same framework. Since we have the direction of regulation for the renal disease associated genes, we can find the concept lattice using attributes indicating this direction, for example $\mathrm{MGN}_{\mathrm{up}}$ and $\mathrm{MGN}_{\mathrm{down}}$. We already know from Bhavnani et al. [5] that the regulatory direction partitions the genes: if a gene is up in one disease, it is up in all diseases. Though, in their case, it took some work to find this fact, it is immediate in the concept lattice, which is partitioned into disjoint lattices of concepts whose genes are up-regulated and concepts whose genes are down-regulated. It is easy to see that, if we combine this regulatory context with the context where incidence is either up- or down-regulation, then the concept lattice would remain partitioned by regulation. However, since we have added PPI network neighbors to the gene sets for the diseases, these added genes need not have regulation consistent with that of their neighbors. So, concepts in the lattice for the combined context may have both up- and down-regulated genes, which would mean that PPI edges cross regulatory classes. But, at least for the FSGS-IgAN-MGN and FSGS-IgAN-MGN-SLE concepts, the PPI edges preserve regulation: they connect only up-regulated genes to up-regulated genes.
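The following sketch illustrates, on invented data, how direction-specific attributes can be built and how one might check whether PPI edges cross regulatory classes. The attribute naming (`"MGN_up"`), the function names, and the input layout (disease mapped to a gene-to-direction dictionary) are all assumptions made for this example.

```python
def regulation_context(regulation):
    """Map attributes such as 'MGN_up' to the genes regulated in that direction."""
    context = {}
    for disease, directions in regulation.items():
        for gene, direction in directions.items():
            context.setdefault(f"{disease}_{direction}", set()).add(gene)
    return context


def crossing_edges(edges, direction_of):
    """PPI edges whose two endpoints have different regulatory directions."""
    return [
        (u, v) for u, v in edges
        if u in direction_of and v in direction_of
        and direction_of[u] != direction_of[v]
    ]


# Hypothetical regulation data: disease -> {gene: direction}.
regulation = {
    "MGN": {"g1": "up", "g2": "up", "g3": "down"},
    "SLE": {"g1": "up", "g3": "down"},
}
context = regulation_context(regulation)
# A single direction per gene suffices because direction is consistent across diseases.
direction_of = {g: d for dirs in regulation.values() for g, d in dirs.items()}
edges = [("g1", "g2"), ("g2", "g3")]
crossing = crossing_edges(edges, direction_of)
```

An empty result from `crossing_edges` for the genes of a concept would correspond to the situation described above, where the PPI edges preserve regulation.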