Pseudocode for Step 1

The following pseudocode, which focuses on Step 1, corresponds to the main amcBPPS function, after which routines implementing Steps 2 and 3 are called. The output from Step 1 is used to create (in Step 2) a FD-table and a set of seed sequences for mcBPPS sampling (in Step 3). Note that this Step 1 pseudocode creates single category FD-tables, but it can be easily modified to create multiple category FD-tables.

**function** amcBPPS(*SeqAln*)//*creates a CD hierarchical alignment.*

**input:** a multiple alignment of protein sequences (*SeqAln*).

**output:** a hierarchy (tree) and corresponding contrast alignments (CHA).

//

*assign each sequence to its own disjoint set* (

*see Tarjan*[

45])

*.***for each** sequence

*s* *SeqAln***do***s*.labeled := false;

*s*.rank := ∞; Set(

*s*) := {

*s*};

**end for**dheap

*H*;//

*Priority queue; for the data structure and algorithm see*[

45]

*.***for each** sequence pair

<

*s*_{1},

*s*_{2}>

**do**:

**if** the sequences are from the same phylum

**then****if** sequences ≥ 95% identical then merge their disjoint sets

**end if****else if** sequences ≥ 40% identical

**then***key* := PercentIdentity(

*s*_{1}*, s*_{2});//

*Using the pair-wise sequence identity as the key*…

Insert(

*key*, <

*s*_{1},

*s*_{2} >,

*H*);//

*store cross-phyla sequence pairs on priority queue*.

**end if**end for

//*Obtain an array of simple contrast alignments (CA) for distinct subgroups*.

*r* := 0; *g* := 0;

**while**<

*s*_{1},

*s*_{2} > := deleteMax(

*H*)

≠

Ø

**do***r*++;

*s*_{1}.rank := min(

*r, s*_{1}.rank);

*s*_{2}.rank := min(

*r, s*_{2}.rank);

**if** Â¬

*s*_{1}.labeled

Â¬

*s*_{2}.labeled

Set(

*s*_{1})

≠

Set(

*s*_{2})

**then**Set(

*s*_{1}) := Set(

*s*_{2}) := Set(

*s*_{1}) ∩ Set(

*s*_{2})

*;*//

*merge their sets*.

**if** NumPhyla(Set(

*s*_{1})) ≥

*N*_{min} then//

*(by default, N*_{min}*=4).***for each***s* Set(

*s*_{1})

**do***s*.labeled := true;

**end for***g*++;

*Seed*[

*g*] := {};//

*Seed set for group g.***for each***p* {

*p*:

*p*=

*s*.phylum ^

*s* Set(

*s*_{1})

*}***do**:

*Seed*[

*g*] :=

*Seed*[

*g*] ∩ {

*s’*:

*s’*.rank

=

min(

*s’*.rank:

*s’* Set(

*s*_{1})

*s’*.phylum

=

*p*)};

*FD-tables*[

*g*] :=

; //

*column 2: subgroup g vs other proteins in class*.

//

*call mcBPPS sampler*[

42]

*to identify a contrast alignment for subgroup g*.

*CHA*[

*g*] := mcBPPS(

*FD-tables*[

*g*],

*Seed*[g],

*SeqAln*);

**end if****end if**end while

*mc* := CreateFullHierarchy (*FD-tables*,*CHA*, g);*//Step 2: create mcBPPS input.*

**return** mcBPPS(*mc. FD-table*, *mc*.*Seed*, *SeqAln*)*//Step 3: optimize CD hierarchy.*

end function

Pseudocode for Step 2. Step 2 (i.e., the CreateFullHierarchy() routine) is subdivided into three sub-steps. For Step 2a, the MergeSimilarSets() function finds cliques of similar sequence sets by applying the Bron-Kerbosch algorithm [

61] and then combines the sets within each clique:

**function** MergeSimilarSets(*SqSets*)

**input:** sequence sets (

*SqSets*) from Step 2.

**output: **a reduced, non-redundant collection of sets and associated patterns.

//

*obtain an undirected graph of similar sequence sets*.

Create a node for each input set

**for each** pair of sets

*I*,

*J* within

*SqSets***do****if** the smaller set intersects with

<

80% of the larger set

**then** continue;

Find pattern optimally discriminating sequences in sets I and J from other sequences;

//

*The optimum pattern is defined based on the mcBPPS statistical model*.

**if** the two patterns intersect by

<

33% or by

<

5 pattern positions

**then** continue;

LLR

_{i,j} := LLR with foreground = set I, background = , Â¬(set J ∩ set I) & set J pattern.

LLR

_{j,i} := LLR with foreground = set J, background =, Â¬(set J ∩ set I) & set I pattern.

**if** LLR

_{i,j} ≥ 80% of LLR

_{j,i} LLR

_{j,i} ≥ 80% of LLR

_{i,j}**then** AddEdge(

*I,J*)

**end if****end for**Find the cliques in the graph using the Bron-Kerbosch algorithm [

61].

**for each** clique

**do**Create a consensus set of those sequences present in ≥ 50% of the clique sets.

Compute pattern optimally discriminating consensus set from other sequences.

Replace the sets belonging to the clique with the consensus set and pattern.

**end for**end function

By determining whether the sets substantially overlap, are roughly equal in size, and have similar discriminating patterns, the first two ‘if’ statement within MergeSimilarSets() merely prune the search by skipping over sets that are unlikely to correspond to the same protein subgroup. (Note that, if missed, sets corresponding to the same subgroup are likely to be detected in subsequent steps). To determine whether two different yet overlapping sets correspond to the same functionally-divergent subgroup, the procedure computes the BPPS log-likelihood using the pattern from one set with the partition defined by the other set and vice versa. If the patterns are more or less interchangeable between sets then an edge is added between the nodes corresponding to these sets. Next the Bron-Kerbosch algorithm is used to identify set cliques, each of which is then merged into a single (consensus) set. MergeSimilarSets() is applied iteratively to the modified sets from the previous iteration until it fails to identify and combine any additional similar sets.

Step 2b combines subgroup sets into larger supersets using the following FindSuperSets() function:

**function** FindSuperSets(*SqSets*)//*Obtain supersets from overlapping existing sets.*

**input:** sequence sets (

*SqSets*) from Step 2a.

**for each** Set

**do** Assign it to a unique disjoint set

**end for***// (see Tarjan*[

45]

*).***for each** pair of sets

*I, J***do**//

*find candidate supersets***if** the intersection of the smaller set ≥ 66% of the larger set

**then****endif****end for****for each** Disjoint set ‘dset’ containing at least 2 subsets

**do**Superset := the union of the subsets;

Superpattern := the pattern optimally discriminating the Superset from Â¬ Superset;

**if** Any subsets in dset fail to contribute their

*‘fair share’* to the superset LLR

**then****else** Save the superset and superpattern

**endif****end for****return:** The saved supersets and superpatterns.

end function

FindSuperSets() first identifies collections of (possibly minimally) overlapping sequence sets as possible candidates for merging into supersets. Next, it combines into a superset those sets that contribute their ‘fair share’ to the optimum LLR for the proposed superset—where the ‘fair share’ is defined as contributing at least 80% of the estimated average contribution of each sequence to the LLR times the number of sequences in the subset. (Based on the statistical formulation [

40,

42], each sequence will contribute equally, on average, to the log-likelihood. For such calculations, however, the sequences are down-weighted for redundancy, as previously described [

40]).

Next the function CreateSuperSets() is called to create additional supersets from the current sets that fail to overlap or that overlap only moderately. As long as new supersets are created, this function is called repeatedly (this merges subsets into supersets that might otherwise have been overlooked).

**function** CreateSuperSets(*SqSets*)//Create supersets by combining (possibly distinct) sets.

**input**: sequence sets (

*SqSets*).

**output**: new supersets.

**for each** set I

**do**SuperSet := set I; SuperPattern := Ø;

**for each** set J that at least slightly overlaps with set I

**do****if** both SuperSet & set J contribute their

*‘fair share’* a significant LLR

**then****end for****if** set I

SuperSet

**then** save the current Superset

**endif****end for**end function

Step 2c uses the sets obtained in the previous steps to construct a tree hierarchy, from which a FD-table is then obtained—, along with corresponding seed alignments and initial partitions—as follows:

**function** CreateTree(*SqSets*)//*obtain an optimized tree*.

**input:** sequence sets (

*SqSets*) and corresponding patterns from steps 1–2 above.

**output:** a FD-table + corresponding starting subgroup sets, patterns, and seed alignments

wdiGrph := RtnDiGraph(

*SqSets*);//

*returns a weighted directed graph of set relationships*Tree := ShortestPathTree(wdiGrph);//

*as defined by Tarjan*[

45]

Tree := RefineTree(Tree);//

*eliminates insignificant nodes and overlap between sets.*FD-Table := TreeToFDtable(Tree);

sma := CreateSeedAlignments(Tree); //

*characteristic, cross-phyla seqs for each set*.

end function

where the RtnDiGraph() functions is defined as:

**function** RtnDiGraph (*SqSets*)

**input:** sequence sets (

*SqSets*) from Steps 2a and 2b.

**output:** a weighted directed acyclic graph representing the set relationships.

Create a weighted directed graph where each set is a node

**for each** pair of sets

*I, J***do**//

*find pairs of sets where set I set J*.

*//simple heuristics for speed*.

**if** setI is smaller than setJ

**then** continue

**else if** setI ∩ setJ

<

50% of setI

**then** continue

**else if** setJ

<

33% larger than setI ∩ setJ

**then** continue

**endif***//identify those pairs where set I is a (typically fuzzy) subset of set J*Compute the optimum pattern and LLR for set I versus set J – set I;

**if** LLR is

not significant then continue;//

*significance defined as for FindSuperSets ()***if** Set I fails to contribute its

*‘fair share’* to the superset LRR

**then** continue;

Add an arc pointing from node J to node I & weighted by –LLR;

**end for****for each** set that lacks a Superset

**do**Compute the optimum pattern and LLR for the set versus the complementary set;

Add an arc pointing from the root to the corresponding node & weighted by –LLR

**end for**end function

Note that the RtnDiGraph () function returns a directed acyclic graph (DAG), for which the ShortestPathTree() algorithm [

45] finds a minimum spanning tree emanating from the root node. Because the distances assigned to the arcs in the graph correspond to the negatives of the LLRs, this tree maximizes the total LLR as defined for the corresponding FD-table (see [

42]). Incidentally, in this sense, this approach is akin to using the data to infer the DAG and parameters corresponding to a Bayesian network [

62] and then determining the most likely paths through the DAG from a predefined root node. This approach avoids the computational expense of using MCMC sampling to optimally define both the FD-table and the corresponding pattern-partition pairs concurrently by using an heuristic approach that is substantially faster yet still based on statistical criteria.

The sequence sets corresponding to the tree returned by the ShortestPathTree() algorithm are still fuzzily defined and thus typically contain sequences that belong to one or more distinct protein subgroups and thus that are not proper subsets of their respective supersets. The following RefineTree() function eliminates inappropriate overlap between sets while also eliminating nodes from the tree that, as a result of the refinement process, are no longer statistically significant:

**function** RefineTree (*Tree*)//*return a refined tree representing subgroup relationships*.

**input:** a tree where each node corresponds to a sequence set

**output:** refined tree

**do****do**//

*eliminate insignificant nodes from the tree…***while** an arc has been removed;

**do** //

*eliminate overlap between the sequence sets…***while** some nodes were newly labeled as candidates;

Define the root node set as containing all sequences absent from the other node sets;

Merge each leaf node with only a few sequences into its parent node;

Merge nodes with a single child into their parent nodes; //

*this step is optional*Relocate nodes that, due to previous step, are no longer properly placed in the tree.

**while** the tree has been changed in any way;

end function

The tree returned by the RefineTree() function is output as a Newick-format character string (a formal language specification for trees), which is then parsed and translated into a FD-table within the CreateTree() routine. This routine also creates a seed alignment for each row in the FD-table using a few of the most characteristic sequences in each set. These, along with the corresponding patterns (one for each column), are then used as input to the mcBPPS procedure (Step 3).