|Home | About | Journals | Submit | Contact Us | Français|
Although trauma is the leading cause of death for those below 45 years of age, there is a dearth of information about the temporal behavior of the underlying biological mechanisms in those who survive the initial trauma only to later suffer from syndromes such as multiple organ failure. Levels of serum cytokines potentially affect the clinical outcomes of trauma; understanding how cytokine levels modulate intra-cellular signaling pathways can yield insights into molecular mechanisms of disease progression and help to identify targeted therapies. However, developing such analyses is challenging since it necessitates the integration and interpretation of large amounts of heterogeneous, quantitative and qualitative data. Here we present the Pathway Semantics Algorithm (PSA), an algebraic process of node and edge analyses of evoked biological pathways over time for in silico discovery of biomedical hypotheses, using data from a prospective controlled clinical study of the role of cytokines in multiple organ failure (MOF) at a major US trauma center. A matrix algebra approach was used in both the PSA node and PSA edge analyses with different matrix configurations and computations based on the biomedical questions to be examined. In the edge analysis, a percentage measure of crosstalk called XTALK was also developed to assess cross-pathway interference.
In the node/molecular analysis of the first 24 hours from trauma, PSA uncovered 7 molecules evoked computationally that differentiated outcomes of MOF or non-MOF (NMOF), of which 3 molecules had not been previously associated with any shock / trauma syndrome. In the edge/molecular interaction analysis, PSA examined four categories of functional molecular interaction relationships – activation, expression, inhibition, and transcription – and found that the interaction patterns and crosstalk changed over time and outcome. The PSA edge analysis suggests that a diagnosis, prognosis or therapy based on molecular interaction mechanisms may be most effective within a certain time period and for a specific functional relationship.
In recent years, advances in technology have made it possible to measure a wide variety of molecules and molecular interactions in cell lines, bio-fluids and tissues. The increasing availability of these data has opened new avenues of biomedical research, and challenged the scientific community to uncover the meaning of molecular data in contexts ranging from cell signaling pathways to phenotype/genotype associations to personalized medicine . Plausible and meaningful molecular hypotheses that support clinical diagnosis, prognosis and therapies must be derived from a deluge of quantitative and qualitative experimental data that are spread over a variety of experimental paradigms such as clinical outcome, time, cell cycle phase, or molecular localization.
Current approaches to collecting data about molecular patterns in disease include the use of high throughput measurement techniques such as mass spectrometry and microarray immunoassays. Mass spectrometry is the most common technique for “unbiased” discovery where all protein and peptide components of tissues and biofluids are identified within the capability of the equipment. Microarray immunoassays are more sensitive and specific; they measure the concentrations of pre-determined analytes using immunological reactions. Both assay methods have benefits and drawbacks for clinical usage .
A wide variety of analytical approaches, both qualitative and quantitative, are being explored to understand these data . Text mining algorithms search published literature for information about molecular function and disease associations while graphical analysis uses algorithms from computer science to identify subgraph motifs in canonical pathway networks of molecular interactions found in diseases. Network-based graphical analysis using gene expression patterns has been shown to generate novel hypotheses about the classification of breast cancer metastasis, including the finding that some gene associations can only be detected using network rather than conventional analysis . Statistical biomedical informatics methods, such as gene set enrichment analysis (GSEA), identify gene sets, based on gene expression data, that are correlated with phenotypic classes, and generate hypotheses for further exploration [22, 23]. Systems biology tools model in silico biological pathway systems using computational methods that parallel in vitro cell-line and in vivo animal models for hypothesis discovery and instantiation .
Although these approaches are useful, there are limitations for the study of disease progression over time. For example, the most significant molecular interactions associated with the disease may appear in a non-canonical pathway  that text mining and in silico modeling may overlook. Time-based models of biological pathways can be explored using ordinary differential equations (ODEs); however, they usually model a small group of canonical pathways within a single cell and are not easily computable at the organism level. For example, an ODE model of one NF-kappa B signaling pathway in one cell activated by one TNF-α signaling molecule uses 18 nonlinear differential equations, with 33 independent variables and 16 dependent variables in a simplified reaction kinetics model .
Studies of scientific discovery have demonstrated that most new findings arise from data-driven hypotheses generated from unexpected observations rather than from verification of pre-determined hypotheses based on theories . In a bedside-to-bench approach, discovery is driven by patient data collected at the bedside. Mechanisms or therapies are confirmed later at the lab bench. Data-driven, evidence-based molecular patterns are a fundamental component of personalized medicine research. Notable diagnostic successes based on the molecular patterns found in patient data include the validation of 14-3-3 proteins found in cerebrospinal fluid (CSF) as diagnostic of transmissible spongiform encephalopathies  and the validation of a panel of 18 urinary molecules that discriminate antibody-associated vasculitis from other renal diseases .
Here we present the Pathway Semantics Algorithm (PSA) that converts the directed graphs of the most likely biological pathways evoked from patients’ molecular data into transformed matrices of various formats for algebraic analysis, with the goal of generating hypotheses addressing specific biomedical questions about the meaning, or “semantics”, of the pathways. The term hypothesis is used in its broadest sense as a potential explanation or conclusion that is to be tested by collecting and presenting evidence . Generating hypotheses computationally based on scientific and plausible reasoning extends the domain of search beyond that which was originally observed or “known”, while reducing the size of the solution space. In the sample disease progression analysis given in Section 4, the pathway generation algorithm gave a potential solution space of more than 1,000 molecule/time points. Using PSA algebraic node analysis, the solution space was reduced to 7 molecules that differentiated outcomes at different times. The pathway graphs contain two major types of entities: nodes that correspond to specific bio-molecules and edges that correspond to the interactions between the molecules. PSA constructs the matrices to represent biomedical queries for comparative analyses of the pathways over stratifications such as time and/or outcome. Matrix construction is specific to the query because scientific discovery is strongly influenced by data representation . The transformation of graphs to matrices enables the application of powerful computationally tractable techniques that scale well from matrix algebra to develop mathematical comparison methods, analyses, and metrics leading to useful insights into disease progression across time and clinical outcomes.
PSA was applied to patient data from a shock / trauma study of multiple organ failure, first analyzing the nodes of the likely biological pathways and then examining the edges. A matrix format called a Temporal Dependency Matrix(TDM) was instrumental in revealing novel patterns of molecules evoked from patient data over time in shock/trauma, where disease progression is rapid yet not clinically visible. The computational results predicted seven molecules, based on input from the original assays, associated with the biological mechanisms underlying multiple organ failure; only three had been recognized as associated with any shock / trauma syndrome. Next PSA examined the edges of the pathway graphs, corresponding to interactions between molecules including genes, RNAs, proteins, or chemicals. We applied matrix methods to investigate patterns of molecular interactions across time and across clinical outcomes in terms of four functional relationship categories: activation, expression, transcription and inhibition. Applying graph theory and linear algebra, we found that the interaction patterns of relationship sub-graphs changed rapidly within the first 24 hours of insult, and that these patterns differed across clinical outcomes of multiple organ failure (MOF) and non-multiple organ failure (non-MOF). In addition, we developed a numerical metric of crosstalk in molecular pathways called XTALK. In contrast to current practice that merely classifies a network in strictly binary fashion as having crosstalk or not, XTALK quantifies crosstalk among molecular interactions from 0% to 100%, thereby leading to a deeper, fine-grained understanding of crosstalk and its variation due to disease progression. Results obtained suggest that a diagnosis, prognosis or therapy based on molecular interaction mechanisms may be most effective within a certain time period and for a certain functional interaction relationship.
In the following sections, we first present background information and definitions relating to molecular interactions and mathematical notation, followed by a description of the Pathway Semantics Algorithm, its application to analysis of patterns of molecules and molecular interactions in the first 24 hours of trauma progression, the results and a discussion of their meaning, concluding with our plans for future work.
At a sub-cellular level, molecular interactions can be analyzed using the rules of biochemistry when they are represented as sets of differential equations. However, due to computational complexity and lack of interaction parameter rate data, this approach is not suitable for larger comparative analyses. Instead, molecular interactions, such as protein-protein or gene-protein interactions, are commonly combined into biological pathway networks represented as graphs, where the node, or vertex, is the molecule and the edge is the interaction. This representation facilitates the use of qualitative and quantitative methods derived from graph theory and algebra because the same biological pathway network graph can be mapped to a matrix in different ways, allowing for a choice of mathematical methods appropriate to the biomedical question under study.
Biological pathway networks can be generated manually through direct observation of patient data, such as performed in morphoproteomic tissue analysis , and computationally through software, such as Ingenuity Pathway Analysis (IPA) [www.ingenuity.com] that uses a proprietary algorithm to evoke likely pathways generated from the measurements of molecular data in bio-fluids and tissues.
Comparative analyses of nodes in pathways can reveal key molecules that likely play significant roles in disease progression over all time, or only at certain times. Comparative analyses of edges, or links between the nodes, in pathways parallels research into “link communities” in social networks, where one person may be connected to several overlapping communities of home, work, and interests [34, 35]. In both social and biological networks, the edges are directional, showing the influence from one node (a person or molecule) upon another in a multi-directional cascade. Biological link communities also overlap; a molecule may participate in several different interaction categories simultaneously with the same target molecule, or inversely, several interactions may occur simultaneously with different molecules to achieve the same target function. This latter property has been defined as degeneracy – the ability of structurally different elements to perform the same function or yield the same output; in contrast, redundancy requires identical elements to perform the same function. [36, 37]. Degeneracy is a key property underlying the robustness of complex adaptive biological systems, such as the immune system [38–40].
We define crosstalk in biological pathways to consist of the redundant signaling messages sent over degenerate edges to achieve the same biological function. This is consistent with Bruni’s definition that crosstalk exists when edges are functionally compatible to, or dependent, on other edges . Crosstalk relates to how pathways determine functional specificity, how ubiquitous messengers transmit specific information, and how similar messages crosslink within the system while undesired signals are minimized. Quantifying crosstalk in patient data-driven biological pathways can give insights into the relative robustness of different biological functions and suggest timing and approaches for therapies directed at pathway modulation. For simplicity, this study measured crosstalk in one molecular interaction function at a time in each pathway; cascades of “mixed-function” molecular interactions that overall would result in execution of the same target function were not considered.
Notation and definitions used correspond to those used by Ingenuity Pathways Analysis (IPA) , the pathway generation software used in this study. The term node is used rather than vertex, and the term edge rather than arc.
A molecule is any gene, RNA, protein or chemical. A molecule is represented by a node on the directed graph of a biological pathway.
A relationship is a functional interaction from one molecule to another. The relationships used in this study are defined by Ingenuity Systems (Ingenuity Systems, personal communication) as follows:
A relationship is represented by an edge on the directed graph of a biological pathway.
A directed graph, in mathematical terminology, has specific properties that can be exploited computationally. IPA designates relationships as direct or indirect, in a different sense of the word “direct”. A direct relationship is a direct physical contact interaction between the two molecules; it includes chemical modifications, such as phosphorylations, if there is evidence that the two factors involved interact directly rather than through an intermediary. It is represented by a solid line edge. An indirect relationship is an interaction that does not require physical contact but is explicitly documented in the literature . It is represented by a dotted line edge. A relationship graph is a directed graph whose edges are in the same relationship category. Molecules or edges are called invariant when they are the same in different stratifications. For example, edges are invariant over all time in one outcome if they do not change over all time periods for that outcome; alternatively, edges are invariant over outcome if they are the same in both outcomes in one time period or more as specified.
Let B(E,N) be a directed graph with E edges and N nodes that represents a biological pathway with relationship interactions as edges and molecules as nodes. Then A is a relationship sub-graph of B with A B when ∀E in A are in the same relationship category.
The incidence matrix M = [mij] of a directed graph B = B(E,N) is a E × N′ matrix, M(E,N′) where E = number of edges and N′ = number of nodes (with duplicate nodes for self-loops) such that mij = −1 if edge i leaves node j, +1 if edge i enters node j, 0 otherwise .
The Pathway Semantics Algorithm augments pathway generation and core analyses, such as those in IPA, through customized preprocessing of the measured molecular data and post-processing of the evoked pathways so that both input data and output matrices are tailored to the biological and clinical questions under study. The goal is to narrow down potential answers to those most likely and useful as clinical hypotheses (see Fig. 1).
PSA first processes the input data to generate biological pathways (Steps 1–2) and then maps the results to matrices constructed to answer the biomedical questions under study (Steps 3–4). If biological pathways are already available, for example, from morphoproteomic tissue analysis , only Steps 3 and 4 need be performed. See Fig. 2.
This process selects characteristic subsets of the measured molecules. The assayed molecules are assembled into Significance Sets of those molecules that statistically differentiate the disease states over the stratifications under study, such as outcome, time period of measurement, cell cycle phase observed, or a combination of stratifications. The statistical analysis is utilized as a feature extraction tool to identify significant molecules.
The Significance Set for each stratification group plus the statistically observed average values (means or medians as appropriate) for each molecule in the group are input to a pathway generation algorithm that expands each set to include its likely neighboring molecules, based on published literature and pathway databases. A network diagram is then created of the biological pathways showing the interactions among the molecules for each stratification group.
Matrix representations, suitable for the biomedical questions under study, are created from the network diagrams. For example, the molecules, or nodes, in the network diagrams can be mapped to node matrices (or vectors) of molecules over stratifications such as disease state and time. In a similar manner, molecular interactions, or edges, can be converted to edge matrices (or vectors) of molecular interactions over stratifications such as functional interaction types. In the simplest form, the resulting matrices have a 1 in a row/column cell if the row molecule (or molecular interaction) is present in the column stratification; 0 otherwise.
Algebra is used to compare the matrices to identify differential patterns of molecules and molecular interactions of biomedical significance over outcome, time and other stratifications. The specific calculations used depend on the biomedical questions represented by the matrices. For example, in PSA node analysis, matrices of node molecules over time and outcome can be added, subtracted, or logically compared through “ands” and “ors”. Similar calculations can be done with matrices of edge molecular interactions over time, outcome and functional relationship. In addition, when molecular interactions are represented as edges in an incidence matrix, matrix properties such as rank can be used to infer biological processes such as crosstalk.
Definition. The rank R of a matrix M is the maximal number of its linearly independent columns or rows . Rank can be calculated using Gaussian elimination or singular value decomposition.
If rank R is greater than or equal to E, the number of edges (rows), then all the edges act independently. The percentage, or ratio, of independent edges = R/E, and the ratio of dependent edges is 1 – R/E.
We propose the biological interpretation that the maximum number of independent molecular interactions (edges) required for a molecular function is the same as the rank of the incidence matrix constructed from the functional relationship sub-graph, and that a measure of crosstalk for that function can be based on the percentage of dependent edges.
Definition. The XTALK ratio of a directed graph B = B(E,N) with incidence matrix M(E,N′) is defined as 1 – (rank (M(E,N′))/E).
If XTALK = 0%, then all edges act independently for a particular function. The XTALK measure includes normalization by the total number of edges in a graph to allow comparisons of crosstalk over time and outcome.
In Fig. 3, rank R = 2, number of edges = 3. XTALK = 1–(2/3) = 33%, suggesting there exists one-third crosstalk in the biological functional relationship represented by the graph.
Trauma refers to serious bodily injury such as penetrating injuries from gunshots and stab wounds, blunt injuries such as those sustained during automotive accidents, and burns; trauma is the cause of 74% of all deaths for people ages 15–24 . The term shock / trauma is used in this manuscript to refer to trauma that is associated with the clinical signs of shock, defined physiologically as oxygen consumption (VO2) inadequate to meet the oxygen demands of peripheral tissue. Disease progression in shock / trauma is rapid and deadly; patients who survive the initial trauma may suffer morbidity from potentially preventable syndromes such as multiple organ failure (MOF) [47, 48]. MOF is unique in that the organs that fail are not necessarily injured from the trauma and that late MOF may arise days to weeks after the initial incident. The pathophysiology underlying MOF is still not well understood [49, 50]. Patterns of signaling molecules called cytokines  have been associated with patient outcomes in trauma and critical care for some time [52–56], and analysis of the biological pathways evoked from cytokines may offer insights into disease progression. Cytokines are small proteins released by stimulated macrophages, monocytes, T cells, and other cells; they bind to specific receptors to induce a wide variety of local and systemic responses particularly within the innate and adaptive immune systems .
PSA was applied to data from the Jastrow MOF study  that associated certain cytokine patterns within the first 24 hours from trauma with the outcome of multiple organ failure before other symptoms were visible . In contrast, traditional predictors of MOF were not significantly different between MOF and non-MOF outcomes. The PSA goal was uncover patterns of evoked molecules and molecular interactions associated with shock / trauma progression that would lead to clinical hypotheses.
De-identified patient data from the Jastrow study  were extracted from the UTHSC-H Trauma Research Database with the approval of the Committee for the Protection of Human Subjects (Institutional Review Board / IRB) of the UTHSC-H (HSC-SHIS-09-0237). The data included serum cytokine measurements, collection times, and MOF outcomes for 48 patients from an IRB approved prospective observational trauma study conducted in the shock / trauma Intensive Care Unit (STICU) at Memorial Hermann Hospital, a Level I trauma center in Houston, Texas from January through December 2005. The 48 patients had a mean age of 39 ± 3 years, 67% were male, 88% of the insult was blunt mechanism, and the mean Injury Severity Score was 25 ± 2. MOF developed in 11 (23%) of the patients. Twenty-seven cytokines were measured every 4 hours from the start of the resuscitation protocol and were later assayed by the Bio-Plex Human Cytokine 27-Plex Panel. The measurement times were adjusted to time from insult, and grouped into 4 hour time periods starting at hour 2 from insult and ending at the study limit of hour 24. Twenty-seven cytokines were measured by Bio-Plex immunoassay. All were used for the PSA-Node analysis; eleven were used for the PSA-Edge analysis (see Table 1).
The cytokine data were partitioned for analysis purposes into 6 groups by time periods: hours 2–6, 6–10, 10–14, 14–18, 18–22 and 22–24. The four-hour time period was chosen because that was the scheduled time between clinical measurements. The clinical data were pre-processed before Step 1 (see Fig. 2) as follows:
Additional details on data preparation can be found in the Supplement 1, Section 1.
In Steps 1 and 2, the Pathway Semantics Algorithm (PSA) reduces the dimensionality of the pre-processed input data to generate targeted biological pathways. The description that follows is for the PSA-Node analysis based on 27 cytokines.
Notation: I = number of time periods; A = number of significant molecules in a time period. Significance Sets Si=1,I of molecules ci=1,I;a=1,A that statistically differentiated the K outcomes qk=1,K over time periods xi=1,I were created based on the non-parametric Mann–Whitney–Wilcoxon (MWW) test (p<.05) executed in each of 6 time periods within the first 24 hours from insult. Outcomes were q1 = MOF (multiple organ failure) or q2 = NMOF (non-multiple organ failure). Time periods from insult were xi=1,6 = 2–6, 6–10, 10–14, 14–18, 18–22 and 22–24. The Significance Sets S1, S2 and S6 contained the names of 10 of the 27 measured cytokines; S3 and S5 contained 14 cytokines; and S4 had 15 cytokines. The names of the cytokines differed in each Si. For example, S1 contained: c1,1= Eotaxin; c1,2= G-CSF; c1,3= GM-CSF; c1,4= IFN-γ; c1,5= IL-1ra; c1,6= IL-6; c1,7= IL-8; c1,8= IP-10; c1,9= MCP-1 and c1,10= MIP-1 (See Table 2).
Ingenuity Pathways Analysis (IPA) was used to find the likely biological pathway networks associated with the levels of the measured molecules. IPA provides a literature and pathway database search along with a pathway generation algorithm that utilizes weighted lists of molecules (Ingenuity® Systems, www.ingenuity.com). The algorithm breaks “ties” about which neighbors to add to an evoked network based on the relative weightings of the input molecules . Because the analytes were signaling molecules, the relative numbers of molecular signals, rather than the relative weights of the molecules, generate more representative biological pathways . Therefore, two additional data modifications were performed. First, the units for the median values vi,a,k were converted from concentrations in pg/mL to v′i,a,k, the number of molecules per liter (pmol/L) based on the mass of the cytokine in kDa as reported in UniProt. Second, certain cytokines must be present in multiples or have multiple receptors to send signals. Therefore the v′i,a,k were further adjusted to v″i,a,k by how many molecules were required for one signal. The adjusted calculation details are given in Supplement 1, Section 2.
An IPA data template was prepared for each Si with the assayed molecule weightings v″i,a,k (intensities) for both outcomes qk in time period xi and the molecule’s “Gene/Protein ID”. The molecule was identified by its UniProt Knowledgebase (UniProtKB) Accession Number, based on the best match for human (subunit A or chain A). Each v″i,a,k was entered as an “Observation/Expression k”, with k=1 for MOF and k=2 for non-MOF. The 6 datasets generated 12 time-stamped network groups with one to three 35-molecule networks in each group (the default 35-molecule limit is adjustable.) Each group was exported as a text list of molecules (network nodes) and as a graphic image of molecular interactions (network edges) (See Fig. 4).
For the PSA-Node analysis, two biomedical questions were addressed: first, were there molecular patterns in the evoked pathways that were time-shifted differently in outcomes of MOF vs. non-MOF, and secondly, were there molecules that were primarily associated with only one outcome over time?
Given the analysis focus on time, the questions were embedded in a matrix format called a Temporal Dependency Matrix (TDM), using the 12 pathway network graphs (6 for MOF and 6 for non-MOF) generated in Step 2. A general example of the TDM format is shown in Fig. 5.
In Fig. 5, TDMq1 (above) and TDMq2 (below) show 6 molecules mr, r=1…6 over 3 time periods xi, i=1…3 in 2 outcomes qk, k=1,2. To identify molecular patterns by outcome and over time, a summary list mr was compiled of the names of the molecules present in any of the biological networks evoked from the assayed molecules. Then a temporal dependency matrix (TDM) matrix was constructed for each outcome qk, with the molecule names mr as the first column and the time periods xi as the headers across the remaining columns. If the molecule was present in the time period in the outcome, a 1 was placed in the row r, column i cell zkri of the TDM for outcome k; otherwise 0. The rationale behind this process was to facilitate computational comparisons over time and outcome using matrix algebra and logic.
For the trauma application, a summary list Tr of the 193 molecule names mr that were evoked in any outcome at any time were entered into both columns 1 of two temporal dependency matrices TDMMOF(mr, xi) and TDMNMOF(mr, xi). The subscript r ranged from 1 to 193 (number of molecules) and the subscript i ranged from 1 to 6 (number of time periods). For clarity of notation, the TDMs were subscripted by “MOF” for k=1 and “NMOF” for k=2. The headers for the six columns 2 –7 were set as the time periods xi and a 1 or 0 was placed in matrix/row/column cells zkri denoting the presence or absence of the molecule as depicted in the example matrices in Fig. 5. Matrix algebra was then used to compare the TDMs over disease state stratifications to elucidate disease progression and explore the given biomedical questions.
Identify molecules mr that appear at least once in both outcomes in the same time period xi and at least once in either outcome in a different time period.
Danger-associated molecular patterns (DAMP) in the systemic inflammatory response syndrome (SIRS) and sepsis induce the production of pro and anti-inflammatory mediators by pattern-recognition receptors (PRR). A dysfunctional acute inflammatory response may lead to MOF [58, 59].
In this study, are there molecules that are “time-shifted” in different outcomes? Is a molecular interaction continuing past its “normal” innate response?
If the identified molecules appear in both outcomes at different times, then additional research may show how to modulate those molecules to minimize negative outcomes.
To simplify notation in the following TDMs, the k subscript is deleted. It is assumed = 1 in the row / column cells in ZMOF; k = 2 is indicated by “′” in ŹNMOF. Z″ is the summation of both TDMs, and k = 0 is indicated by “″” in its row / column cells.
Let Z″ = ZMOF + Z′NMOF
The cells z″ri of the resulting matrix Z″ had a 2 if the molecule mr was present in both outcomes in time period xi, a 1 if it was present in one outcome or the other, and 0 if it was not present in either. A molecule mr was selected if there was at least one 2 and one 1 in its row. Using these criteria, four molecules were identified that appeared at least once in both outcomes in the same time period and at least once in either outcome in a different time period: CIITA, HIRA, IG9, and KSR2.
Identify molecules that appeared only in one outcome or the other in more than one time period.
Are there molecules in the pathways triggered by the measured cytokines that are associated only with one outcome in at least 2 of the 6 time periods under study?
Molecules that meet these criteria may reveal underlying mechanisms that have not yet been associated with specific clinical outcomes.
To simplify notation, the k subscript is deleted. It is assumed = 1 in the row / column cells zri (MOF) and k=2 in the row / column cells z′ri (NMOF). I = 6, the number of time periods.
Let MOF_SELECT (mr) = 1,
NMOF_SELECT (mr) was also calculated using equation (1) exchanging zri and z′ri. Based on these criteria, four molecules were identified as appearing only in MOF: Egfr-Erbb2, IFI6, MRAS and NOD1; no molecules appeared solely in non-MOF.
The matrix analysis in Step 4 identified eight molecules from the 193 molecules evoked by the assayed cytokines whose patterns at different times differentiate outcomes. Literature searches were performed on each molecule to ascertain associations with multiple organ failure or other shock / trauma syndromes. Although IG9  was generated by Ingenuity Pathway Analysis (IPA), no other published references to the named molecule were found. The investigator confirmed that research on IG9 had ceased and requested that it be deleted from the findings. (T. M. Calderon, personal communication). See Table 3.
Based on a PubMed search for the molecule name and the MeSH term “shock,” which includes syndromes other than MOF, only three of the seven molecules listed in Table 3 have been previously been associated with shock / trauma: CIITA, EGFR and NOD1 (see Supplement 1, Section 4). All three maintain intestinal epithelial cell homeostasis during immune and inflammatory responses and appear in MOF pathways in this study. This is consistent with previous findings that pathophysiology of the gut (epithelium, mucosal immune system, and the commensal bacteria) contributes to critical illness  and to multiple organ failure .
Although four molecules - HIRA, IFI6, KSR2, and MRAS - have not yet been associated with shock / trauma, their biological functions seem to be consistent with trauma progression. MRAS appears in hours 2–10 solely in MOF; it is implicated in the regulation of integrin-mediated leukocyte adhesion in inflammatory and immune responses . IFI6 appears in hours 14– 22 solely in MOF; it regulates apoptosis, suggesting that programmed cell death is essential to MOF . HIRA is observed in non-MOF in the first hours, and later in MOF. It promotes nucleosome assembly . This may indicate either the activation of gene transcription or silencing, with different timings associated with different outcomes. Likewise, KSR2 is associated with both outcomes early on, but appears solely in MOF in hours 22–24. It regulates insulin sensitivity  and, through inhibition of MAP3K8, decreases pro-inflammatory mediators ,. Hence, the presence of KSR2 may reflect the up-regulation of pathways in an attempt to modulate the inflammatory response after injury. This may be an underlying mechanism related to the fact that insulin resistance and hyperglycemia are common in non-diabetic critically ill patients . See Table 4 for a summary list of the seven molecules that differentiated outcomes over time.
The PSA Edge analysis addressed two biomedical questions in the trauma study: did the types of molecular interactions change over time, and did the crosstalk within the interaction categories also change over time? As a demonstration of edge analysis, PSA Steps 1 and 2 were re-run using eleven of the 27 cytokines chosen by the clinicians as those most likely related to multiple organ failure (see Table 1). The number of cytokines was limited due to edge export restrictions of the pathway generation software (Ingenuity Pathway Analysis) and the fact that, as a result, all edges had to be manually transcribed visually from the generated pathway graphs. IPA generated 12 combined network graphs of the most likely biological pathways evoked from the assay results of the 11 cytokines during 6 time periods and 2 outcomes. There were a total of 132 different molecules evoked in silico across all 24 hours.
The PSA Edge analysis evaluated three of the six time periods in the study: hours 6 – 10, 10 – 14, and 22 – 24 hours from trauma; two outcomes: multiple organ failure (MOF) and non-multiple organ failure (non-MOF); and four relationship categories of molecular interactions: activation, expression (including metabolism and synthesis for chemicals), inhibition and transcription, for a total of 24 relationship sub-graphs. Both direct and indirect interactions were used in the edge analysis. See Fig. 6 for the highlighted expression relationship sub-graph for MOF at hours 6 – 10; all are shown in Supplement 2.
Four relationship sub-graphs were extracted from each of the 6 evoked network pathway graphs for both outcomes over the three chosen time periods. The 24 sub-graphs were identified by interactively highlighting the edges for each of the 4 interaction categories of activation, expression, inhibition and transcription. The sub-graphs were represented as cyclic digraphs (directed graphs with cycles). Each directed edge, or arc, of a sub-graph was a one-way interaction relationship from one molecule to another. The sub-graphs could also contain loops, or cycles because feedback, feed forward, and self-loops occurred in molecular interactions. This necessitated the use of incidence matrices for computation and limited graph metrics to those for cyclic digraphs. 1,264 graph edges were manually logged by visual inspection into a FileMaker database (www.filemaker.com). Each edge record was identified by its outcome, time period, “FROM” molecule, “TO” molecule, and molecular interaction relationship category.
Using custom software, the edge records for each relationship sub-graph for each time and outcome were converted to an incidence matrix, called an Edge-Molecule (EM) matrix, where each row represented a from-to edge, and each column represented a molecule, with doubles for self-loops. A -1 was placed in the from molecule column, a +1 in the to column and 0 otherwise. All 132 unique molecules evoked in Steps 1 and 2 were placed in the column header row. 12 molecules had self-loop feedback and required duplicate columns: CCL11, CCNA1, Cyclin A, Cyclin E, IL6, TNF, IFNG, IL1, IL10, Hsp70, RARB, and MYBL2. The final number of molecule name columns in each EM matrix was 144, with the number of row edges (molecule – molecule interactions) changing according to the interaction type and the time period. Fig. 7 shows a portion of the EM matrix for the Fig. 6 graph.
The 24 EM matrices were exported for mathematical analysis into MATLAB (www.mathworks.com).
A descriptive analysis was performed to count the number of edges in each relationship in each outcome over time and to identify edges that were unchanged over time and outcome.
The crosstalk for each relationship, time period, and outcome was calculated as the measure XTALK using linear algebra as shown in Section 3. Relationship sub-graphs were then analyzed using XTALK to uncover which functional relationships had the most or the least crosstalk in different outcomes and how crosstalk changed over stratifications such as time.
Based on the edge count, the most interactions per time period were in the activation function category, except in hours 22 – 24 for non-MOF when activation interactions were fewer than expression interactions. Inhibition and transcription interactions were most active in hours 10 – 14. See Fig. 8.
Only two molecular interactions were present in both MOF and non-MOF over all time periods; both affected transcription: PDGF BB→ CSF2 (GM-CSF) and IL1 (IL-1β)→ IL8. PDGF BB is a platelet-derived growth factor homodimer that causes mitosis in cells of mesenchymal origin; here it affects the transcription of CSF2, which encodes a cytokine that controls the production, differentiation, and function of granulocytes and macrophages. IL1 is a cytokine produced by activated macrophages that mediates the inflammatory response, in this case by increasing transcription of IL8, a chemokine that functions as a neutrophil polymorphonuclear cell (PMN) chemoattractant. It is also a potent angiogenic factor.
Although the majority of molecular interactions were similar in each time period over both outcomes, distinct differences were revealed by a count of the edges unique to MOF or non-MOF. See Fig. 9. In hours 6 to 10 from trauma, there were twice as many unique activation interactions in non-MOF than MOF; whereas by hours 10 – 14, MOF surpassed non-MOF with a greater number of unique interactions in all categories. In hours 22 – 24, MOF had twice as many unique activation edges than non-MOF, although both had the same number of unique expression edges. There were few unique inhibition or transcription interactions. Overall, there were more interactions that appeared solely in MOF than in non-MOF. Another point of interest is that IL6 was involved in ~50% of the unique expression interactions in both outcomes in the first 6 – 10 hours, while IFNG became dominant in hours 10 – 14.
XTALK, a measure of crosstalk based on the dependency between the functional edges as calculated by matrix rank, ranged from 0% to a high of 71%, and changed over time. (See Fig. 10). Activation crosstalk was calculated at ~69% in hours 6 – 10, staying steady to 71% at hours 10 – 14, and decreasing in hours 22 – 24 to 45% in MOF and 32% in non-MOF. In hours 6 – 10, expression edge crosstalk was 51% in MOF and 46% in non-MOF. This increased in hours 10 – 14 with MOF rising to 62% and non-MOF to 54%. Crosstalk then decreased in hours 22 – 24 to 27% in MOF and 31% in non-MOF. There was no crosstalk in inhibition interactions in hours 6 – 10 and 22 – 24; however, crosstalk increased to 17% in MOF and 20% in non-MOF in hours 10 – 14. 9% transcription crosstalk was calculated in both outcomes in hours 6 – 10, rising to ~21% in hours 10 – 14, then decreasing to 0% by hours 22 – 24.
In hours 6 – 10, there were twice as many unique activation edges in non-MOF compared to MOF; however the reverse was the case in the later time periods. This may imply that in non-MOF, a large number of favorable molecular interactions were underway early on, so fewer unique activations were needed as the pathways approached a favorable outcome of non-MOF. The percentage of activation crosstalk was about the same in hours 6 – 10 and 10 – 14 in both outcomes, decreasing only in hours 22 – 24.
By hours 10 – 14, MOF had more than three times the number of unique expression edges than non-MOF, implying a higher energy consumption in MOF metabolism than in non-MOF at this time. The percentage of expression crosstalk was slightly lower in non-MOF than MOF in the first 2 time periods, changing to slightly higher by the end.
Unique inhibition interactions appeared solely in MOF in the last 2 time periods. Crosstalk appeared in both outcomes only during hours 10 – 14; it was slightly higher in non-MOF. Again, this suggests an attempt to damp down molecular interactions in both outcomes starting in hours 10 – 14 that was continued in hours 22 – 24 by additional unique inhibitory interactions in MOF.
Unique transcription interactions appeared in both outcomes in hours 10 – 14, with the majority in MOF. Crosstalk in transcription interactions increased initially, and disappeared in both outcomes by hours 22 – 24 when only 2 transcription interactions occurred in each outcome.
Today it is generally accepted that there is a need to develop computational, data-driven algorithms to exploit the vast quantity of molecular information available in knowledge bases in order to advance systems biology and to improve patient care [63–67]. Due to several successes [68–70], in silico hypotheses generators are no longer denigrated as “fishing expeditions” .
The Pathway Semantics Algorithm (PSA) presented in this manuscript is an initial in silico data integration and analysis step towards formulating hypotheses about disease progression for personalized diagnosis, prognosis, and therapies that can be validated in the laboratory and in the clinic. PSA is based on a novel, flexible approach that uses graph theory and numerical algebra to computationally compare non-canonical biological pathways evoked from patient data over time. The use of matrix representation and algebra, as used in the Pathway Semantics Algorithm (PSA), offers a way to computationally integrate qualitative and quantitative approaches for improved hypothesis generation about disease progression. PSA identifies molecular patterns in biological pathways derived from patient data, an important benefit that supports personalized medicine. PSA preprocesses the molecular concentration data, tailoring it to the biological and clinical questions under study, before submitting it to a network generation algorithm (in this case, IPA). PSA then algebraically post-processes the evoked pathway networks to reveal changing molecular patterns not easily observed in the static text and graphical formats output by IPA. This algebraic post-processing changes the data representation. It is important because the data representation space is one of the four inter-related problem spaces in scientific discovery, along with the hypothesis space, the experiment space, and the experimental paradigm. Changes in data representation uncover regularities and invariants, facilitate categorization, and suggest alternative search strategies key to scientific discovery . PSA differs from graphical analysis since it does not start with predetermined graphs of canonical pathways. Instead, PSA is data-driven; the algorithmis initialized with clinical data from patients upon which biological pathway networks are constructed based on most likely interactions even if they are not part of canonical pathways. As a result, PSA supports personalized medicine. Although both Gene Set Enrichment Analysis (GSEA) [22, 23] and PSA generate hypotheses correlated with phenotype, their inputs, methods and goals are substantially different. The goal of GSEA is to provide a more robust way to compare independently derived gene expression data sets (possibly obtained with different platforms) and obtain more consistent results than single gene analysis. In contrast, the goal of PSA is to efficiently generate clinically useful hypotheses about disease progression over time using matrix algebra. PSA frames quantitative and qualitative data in matrix representation to answer biomedical questions and the PSA matrix node analysis can be applied to the gene sets evoked from GSEA for further hypothesis generation. Insights can be gained, not only into expression of genes as in GSEA, but also to changes in activation, inhibition, transcription and other activities of molecular interactions over time. Finally, PSA uses mathematical algorithms for matrix representation and computation that are readily available and can be implemented in a wide variety of software.
PSA was applied to a prospective observational study of shock / trauma, a research area where patient data is sparse and difficult to obtain even at a Level I trauma center; randomized controlled trials are not an option. By using patients’ molecular cytokine data to evoke non-canonical biological pathways from the Ingenuity Pathway Knowledge Base, PSA expanded the existing information to include the most likely molecules and molecular interactions evoked by the patients’ cytokines. With the expanded information set, and its representation as pathway graphs, PSA was able to use computational tools and algorithms from graph theory and numerical algebra to compare patterns of molecules and molecular interactions over different stratifications. In particular, PSA was able to analyze patterns over time – an absolute necessity for clinicians who treat disease as it unfolds . This feature shows the potential of PSA to support temporal reasoning in medical decision-making and support systems.
Applied to the trauma study, PSA Node analysis identified and qualified 7 molecules in patterns across time of the progression of multiple organ failure; of these, only 3 had been previously associated with any shock / trauma syndrome. A literature search confirmed that the molecules’ biological functions were consistent with the current understanding of MOF. PSA also highlighted the dynamic nature of trauma response, indicating that molecular patterns are specific to certain time periods from insult. PSA uncovered novel molecular patterns in shock / trauma using an unbiased data-driven approach that integrated what was known about the patient and what was known about molecular interactions. The appearance of these patterns made sense within the disease context, and suggested hypothetical answers to the biomedical questions about which molecules differentiated patient outcomes. All 7 of these molecules were in the evoked biological pathways over time and were not measured directly. Instead, they were inferred from published literature documenting molecular interactions.
The results from the PSA Edge analyses suggest that molecular interaction activity – and the nature of that activity – changed dramatically within the first 24 hours of trauma. In both outcomes, the number of interactions peaked during hours 10 – 14 from insult, lessening to about half of the initial activity by hours 22 – 24; this may be due to the effects of interventions during the first 24 hours combined with the innate systemic response. There were core sets of molecular interactions that were invariant over outcomes in each time period plus unique interactions only in one outcome or the other. This suggests a primary molecular response to the injury that was modulated by the unique interactions edges towards favorable or unfavorable outcomes. MOF had fewer unique interactions early in response, but by hours 10 – 14, MOF had almost three times as many unique edges as non-MOF – perhaps an excessive number.
Multiple organ failure has been characterized as an adaptive, multilevel time-based stress response with marked changes in gene expression [73–75]. We believe that ours is the first study to quantify the changing aspects of gene expression in MOF over time. By examining edge interactions in silico, changes in functional relationships and their crosstalk over time and outcome were revealed.
Molecules must be activated before they can be transcribed and then expressed, and inhibition can halt any step in the gene regulation process. It is known that cells respond quickly to stress by altering their metabolism; they can induce apoptosis or cell-cycle arrest and alter nuclear pathways for DNA repair . Activation interactions dominated the initial response in both outcomes through hours 10 –14, showing the immediate cellular response to stress. Expression was higher in MOF, suggesting a higher metabolic load on the system. Inhibition and transcription interactions were a small proportion of the overall count.
For demonstration purposes, we performed a simple analysis that did not include interaction cascades of different functions in order to focus on a “black box” of four dominant functions. Even with this limitation, differences were observed across time and outcome. This is important because it suggests that a diagnosis, prognosis or therapy based on molecular data might only be valid within a certain time period and for a certain functional relationship, due to the degeneracy in the biological network. For example, because there appear to be few inhibition relationships and little or no inhibition crosstalk in initial trauma, it may be worth exploring increasing inhibition interactions early on in order to limit the excessive unique expression interactions in MOF in hours 10 – 14. Crosstalk decreased over time in the first 24 hours from trauma, suggesting that therapies should consider time from insult as well as which interaction functions they are targeting in order to be effective. This also suggests that trauma therapies may have to be administered in a particular sequence, similar to certain cancer therapies.
The quality of the PSA analysis results depends on the quality of the patient data, the clinical study protocol, the assay method, the choice of statistical analysis, and the accuracy of the biological pathway networks generated by the Ingenuity Pathway Algorithm from its knowledge base.
Validation. The Pathway Semantics Algorithm uses generally accepted methods of statistics and matrix algebra, along with a widely used commercial algorithm and knowledge base for pathway generation. Therefore, the overall Pathway Semantics Algorithm and its resulting hypotheses have at least face validity. This has been confirmed in the previous sections though correlation of the results with published literature and expert opinion as is the usual practice .
Because PSA was illustrated based on cytokine time series data from a completed trauma patient study, it was not possible to re-test the patients for empirical validation of the hypotheses generated. Subsequent to the trauma research, PSA was applied to a study of cytokine time series data of a mouse model of inflammatory immune response in hemophilia. Molecular patterns predicted by PSA to occur at specific times were later validated in the mouse model, as documented in the author’s dissertation. .
Evaluation. PSA’s extensive use of matrix algebra for analysis minimizes computational complexity while allowing computationally tractable scaling over large numbers of molecules, molecular interactions, outcomes, time periods and other stratifications. In addition, the matrix algebra reduces the size of the solution space, that is, the set of hypotheses generated from the evoked pathways in response to specific biomedical questions. For example, in the trauma PSA node analysis, the 193 molecules in the pathways evoked by the assayed cytokines over 6 time periods resulted in a potential solution space of 1,158 molecules/times. Algebra reduced that to 7 molecules that differentiated outcomes at different times. Finally, the XTALK measure derived from PSA can be shown to be robust under small changes. Expanding the Fig. 3 three-node graph to 4 nodes only modifies XTALK from 33% to 25% as shown in Fig. 11.
General Applicability. It is well understood that intracellular signaling processes play an important role in disease progression [79–81]. The Pathway Semantics Algorithm is designed to be generally applicable to the development of hypotheses regarding the roles of signaling molecules, such as cytokines, in disease progression, independent of data set, disease, disease state, or specific method of pathway generation. In addition, as mentioned previously, PSA has been empirically validated in a mouse model of immune response in hemophilia also based on cytokine time series data and published in the author’s dissertation. The authors believe that validation with independent data for a different species and a different disease over a different time progression shows that PSA is a general method; it was not “optimized” for a specific data set, domain, or context.
Following are some key application considerations:
The Pathway Semantics Algorithm identified different patterns of molecules and molecular interactions over time, outcomes, and functional relationships in biological networks that would not be easily found through direct assays, literature or database searches. By framing biomedical questions within a variety of matrix representations, PSA had the flexibility to analyze combined quantitative and qualitative data over a wide range of stratifications and generate hypotheses addressing those specific biomedical questions.
The algorithm was illustrated with an application to disease progression in trauma; the results show promise for further clinical investigation. The seven evoked molecules that differentiated outcomes of MOF and non-MOF in specific time periods suggest novel hypotheses for underlying mechanisms of shock / trauma progression. The differences in the number of edges, the number of unique edges, and XTALK showed the utility of evaluating a molecular interaction not just as a connection between two molecules, but as a directed interaction from one molecule to another that may carry out one or many specific functions . The crosstalk measure XTALK provided a novel perspective on the changing functional interaction relationships in disease progression; the results supported the existence of the property of degeneracy in biological networks. Next steps in this work include exploring the biological significance of other matrix-based numerical algebra methods, analysis of other diseases of clinical interest, and laboratory validation of results. Substantial progress has been made in this regard. PSA was applied and empirically validated in a mouse model of hemophilia; the results are being prepared for separate publication at the request of the co-authors.
We acknowledge the participation of the Memorial Hermann Hospital STICU, Houston in the prospective study that collected the data used in this study.
Funding: This work was supported by the National Institutes of Health [T15-LM07093-16 to M.F.M.; GM-38529, GM-08792, CReFF UCRC #M01RR002558 to D.W.M.] and by the Harvey S Rosenberg, Endowed Chair in Pathology for the Morphoproteomics Initiative, UT Medical School at Houston to M.F.M.
Publisher's Disclaimer: This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final citable form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.
Supplementary information: Supplementary data are available at the Journal of Biomedical Informatics online.
Conflict of Interest: None declared.