A consequence of our highly industrialized society is exposure to an increasing number of chemicals that may influence human health. Environmental factors are implicated in many complex diseases, including asthma, cancer, diabetes and Parkinson's disease. However, the mechanisms of action of most chemicals and the etiologies of environmentally influenced diseases are not well understood. The Comparative Toxicogenomics Database (CTD; http://ctdbase.org) promotes understanding of the effects of environmental chemicals on human health. CTD integrates manually curated data reported in the peer-reviewed literature with select public data sets to provide a freely available resource for exploring cross-species chemical-gene and protein interactions and chemical- and gene-disease relationships. CTD provides transitive inferences between chemicals, genes and diseases that are intended to help users develop experimentally testable hypotheses about mechanisms of chemical action and disease etiologies. A transitive inference between a chemical and a disease is made when one or more genes have curated interactions with both the chemical and the disease. Likewise, a transitive inference between a gene and a disease is made when one or more chemicals have curated interactions with both the gene and the disease. In CTD, there are two classes of transitive inferences: (a) inferred relationships that also have direct evidence curated from the published literature and (b) inferred relationships that do not yet have directly curated evidence. Recent reports citing Swanson's ABC model underscore the potential value of transitive inferences for predicting disease treatments. Data in CTD facilitate similar discovery processes for chemical-gene-disease interaction networks.
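The transitive inference rule described above can be sketched in a few lines of Python. This is only a minimal illustration, not CTD's implementation: the chemical, gene and disease names, the three data sets and the function `infer_chemical_disease` are all hypothetical and were chosen for the example.

```python
from collections import defaultdict

# Hypothetical curated data (illustrative only, not taken from CTD).
chemical_gene = {("chemA", "GENE1"), ("chemA", "GENE2"), ("chemB", "GENE2")}
gene_disease = {("GENE1", "diseaseX"), ("GENE2", "diseaseX"), ("GENE2", "diseaseY")}
curated_chemical_disease = {("chemA", "diseaseX")}


def infer_chemical_disease(chemical_gene, gene_disease):
    """Map each (chemical, disease) pair to the set of genes linking them."""
    diseases_by_gene = defaultdict(set)
    for gene, disease in gene_disease:
        diseases_by_gene[gene].add(disease)

    inferences = defaultdict(set)
    for chemical, gene in chemical_gene:
        # Every disease curated to this gene yields an inference for the chemical.
        for disease in diseases_by_gene.get(gene, ()):
            inferences[(chemical, disease)].add(gene)
    return dict(inferences)


inferences = infer_chemical_disease(chemical_gene, gene_disease)
for pair, genes in sorted(inferences.items()):
    # Class (a): also directly curated; class (b): inferred only.
    label = "curated and inferred" if pair in curated_chemical_disease else "inferred only"
    print(pair, sorted(genes), label)
```

In this toy data set, the pair (chemA, diseaseX) falls into class (a) because it is both inferred through two genes and directly curated, whereas the remaining pairs fall into class (b). Gene-disease inferences could be sketched the same way by swapping the roles of genes and chemicals.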
Figure. Transitive chemical-disease inferences and the computational approaches used to score inferences.
All inferences in CTD are built upon manually curated chemical-gene interactions, gene-disease relationships or chemical-disease (C–D) relationships. Integration of these components allows inferences to be constructed reciprocally. For example, inferred chemical relationships can be viewed for a given disease, and inferred disease relationships can be viewed for a given chemical. The former provide insight into the potential environmental influences on a disease, whereas the latter provide insight into the potential health effects of exposure to a chemical. The gene sets that underlie these inferences are unique to CTD and provide a foundation for developing novel hypotheses about the mechanisms by which specific environmental factors affect human health. (Analogous data are provided for gene-disease inferences.) As the data in CTD have grown, the number of inferences has increased exponentially. To assist users with interpretation and prioritization of inferences, we developed a statistical method for ranking CTD inferences.
We modeled CTD data as a network where chemicals, genes and diseases are nodes, and the relationships between them are edges. Like other biological networks, the CTD network is a scale-free random network that contains highly connected hub nodes. The presence of hubs introduces a statistical challenge when evaluating networks, as not all edges are equally likely to occur. For C–D inferences, we construct a local network that consists of the chemical, the disease and the set of genes that interact with both the chemical and the disease. To rank-order C–D inferences, the similarity among the local networks has to be compared. In these comparisons, hub nodes will appear in multiple local networks by chance and make inferences appear more similar unless they are discounted. The following example illustrates the scale of this statistical problem both in terms of the number of disease inferences for a chemical and the topology of the local network for a particular C–D inference.
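To make the notions of a local network and a hub node concrete, the sketch below builds the local network for one C–D pair and reports the degree of each shared gene. It is an illustration under stated assumptions only: the edge lists, the node names (including `HUB1`) and both helper functions are hypothetical, and the degree is counted simply as the number of edges touching a node.

```python
from collections import Counter

# Hypothetical edges (illustrative only, not CTD data).
chemical_gene = [("chemA", "HUB1"), ("chemA", "GENE2"),
                 ("chemB", "HUB1"), ("chemC", "HUB1")]
gene_disease = [("HUB1", "diseaseX"), ("GENE2", "diseaseX"),
                ("HUB1", "diseaseY"), ("HUB1", "diseaseZ")]


def node_degrees(chemical_gene, gene_disease):
    """Count the edges touching every node in the whole network."""
    degree = Counter()
    for a, b in list(chemical_gene) + list(gene_disease):
        degree[a] += 1
        degree[b] += 1
    return degree


def local_network(chemical, disease, chemical_gene, gene_disease):
    """Return the genes that connect `chemical` to `disease`."""
    genes_of_chemical = {g for c, g in chemical_gene if c == chemical}
    genes_of_disease = {g for g, d in gene_disease if d == disease}
    return genes_of_chemical & genes_of_disease


degree = node_degrees(chemical_gene, gene_disease)
shared = local_network("chemA", "diseaseX", chemical_gene, gene_disease)
for gene in sorted(shared):
    print(gene, degree[gene])
```

Here `HUB1` takes part in the local network for (chemA, diseaseX) but is also connected to several other chemicals and diseases, so it would appear in many other local networks as well. This is the kind of node whose contribution a ranking statistic would need to discount.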
Bisphenol A (BPA) is a ubiquitous endocrine disruptor that has been associated with developmental abnormalities and cancer. In the July 2011 release of CTD, BPA had abundant and varied types of C–D relationships, including four that were directly curated, seven that were curated and inferred, and 798 that were only inferred. BPA was associated with breast neoplasms both by curated evidence and by inference via 73 common interacting genes. The local network for this inference consists of the chemical (BPA), the disease (breast neoplasms) and each of the 73 genes. A subset of these 73 genes is also associated with many other diseases and chemicals. In this example, such hub genes include BCL2, CYP1A1, ESR1, IL1B, NOS2, PTGS2 and TNF, each of which has over 400 curated interacting chemicals. In addition, BPA and breast neoplasms have been targeted for in-depth CTD curation and are hubs themselves. BPA has curated interactions with 1,235 genes, and breast neoplasms has 266 curated gene relationships. In developing a mechanism for statistically ranking inferences, it was also important to determine the relative influence of hub versus non-hub data.
Two previously published studies used local topology-based statistics to assess the reliability of protein-protein interactions generated from high-throughput assays, such as yeast two-hybrid technology. These studies examined the reliability of an interaction between two proteins (A and B) based on how many other proteins (called common neighbors) interacted with both A and B. These data were modeled as a network where each protein was a node and the interactions were edges connecting the nodes. The number of interactions for a node is defined as the node degree. Goldberg and Roth applied four different methods to calculate a probability that a given interaction between proteins A and B was reliable based on the node degree of A and B and the number of additional proteins that interacted with both A and B. Among these methods, the hypergeometric clustering coefficient performed best, but this method did not take into account the node degree of the additional proteins. Li and Liang developed two common neighbor statistics to assess the reliability of a given protein-protein interaction. Similar to the hypergeometric clustering coefficient, one metric (p1) took into account the number of common neighbors and the degree of the two proteins that form the interaction of interest. The second metric (p2) took into account the degree of each common neighbor. The authors presented a sequential process of evaluating interactions with each statistic rather than presenting a combined statistic. We explored whether these methods could be modified for ranking C–D inferences by substituting protein A with a chemical, protein B with a disease, and the common protein neighbors with the set of genes underlying a C–D inference.
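As a rough illustration of a statistic that depends only on the number of common neighbors and the degrees of the two end nodes, the sketch below computes a generic hypergeometric tail probability. It is an assumption that this captures the spirit of the hypergeometric clustering coefficient and of p1; it does not reproduce either published formula, and it deliberately ignores the degree of each common neighbor, which is what p2 addresses. The function name and all parameter values are hypothetical.

```python
from math import comb, log


def common_neighbor_pvalue(total, degree_a, degree_b, shared):
    """Probability of observing at least `shared` common neighbors by chance.

    Assumes A's and B's neighbors are drawn at random from `total`
    candidate nodes (a hypergeometric null model).
    """
    upper = min(degree_a, degree_b)
    tail = sum(
        comb(degree_a, i) * comb(total - degree_a, degree_b - i)
        for i in range(shared, upper + 1)
    )
    return tail / comb(total, degree_b)


# Hypothetical C-D inference: 10 genes overall, the chemical and the
# disease each have 3 curated genes, and 2 of those genes are shared.
p = common_neighbor_pvalue(total=10, degree_a=3, degree_b=3, shared=2)
score = -log(p)  # a larger score means the overlap is less likely by chance
print(round(p, 4), round(score, 4))
```

In the C–D setting, `degree_a` would stand for the number of genes curated to the chemical, `degree_b` for the number curated to the disease, and `shared` for the size of the gene set underlying the inference. Taking the negative logarithm is one common way to turn the small probability into a score that increases with the strength of the evidence.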
Here, we present a novel method that combines and weights the p1 and p2 metrics, taking into account the properties of the local networks containing the chemical, the disease and each of the genes used to make CTD inferences. This method addresses the challenges presented by the large number of possible inferences, as well as the presence of hub data. The score rewards inferences according to the number of genes used to make the inference, and penalizes networks containing nodes with a high node degree. The accompanying figure illustrates the difference between the hypergeometric clustering coefficient and the p1 and p2 metrics. We provide several examples to demonstrate the value of the statistic as well as the biological relevance of the inferences.