We used MINE to explore four high-dimensional datasets from diverse fields. Three datasets have previously been analyzed and contain many well-understood relationships. These datasets are (i) social, economic, health, and political indicators from the World Health Organization (WHO) and its partners (7); (ii) yeast gene expression profiles from a classic paper reporting genes whose transcript levels vary periodically with the cell cycle (26); and (iii) performance statistics from the 2008 Major League Baseball (MLB) season (27). For our fourth analysis, we applied MINE to a dataset that has not yet been exhaustively analyzed: a set of bacterial abundance levels in the human gut microbiota (29). All relationships discussed in this section are significant at a false discovery rate of 5%; p-values and q-values are listed in the SOM.
We explored the WHO dataset (357 variables, 63,546 variable pairs) with MIC, the commonly used Pearson correlation coefficient (ρ), and Kraskov's mutual information estimator (Table S9). All three statistics detected many linear relationships. However, mutual information gave low ranks to many non-linear relationships that were highly ranked by MIC. Two-thirds of the top 150 relationships found by mutual information were strongly linear (|ρ| ≥ 0.97), whereas most of the top 150 relationships found by MIC had |ρ| below this threshold. Further, although equitability is difficult to assess for general associations, the results on some specific relationships suggest that MIC comes closer than mutual information to this goal. Using the non-linearity measure MIC − ρ², we found several interesting relationships, many of which are confirmed by existing literature (30). For example, we identified a superposition of two functional associations between female obesity and income per person: one from the Pacific Islands, where female obesity is a sign of status (33), and one from the rest of the world, where weight and status do not appear to be linked in this way.
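The non-linearity measure MIC − ρ² and the pair count quoted above can be illustrated with a short sketch. The MIC values passed in below are hypothetical placeholders (computing MIC itself is outside the scope of this example), and the helper names `pearson_r` and `nonlinearity` are our own.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def nonlinearity(mic, xs, ys):
    """MIC minus squared Pearson correlation; mic is supplied by the caller."""
    return mic - pearson_r(xs, ys) ** 2


# 357 variables give 357 * 356 / 2 unordered variable pairs.
n_pairs = math.comb(357, 2)

xs = [-2, -1, 0, 1, 2]
parabola = [x * x for x in xs]   # strong but non-linear dependence
line = [2 * x + 1 for x in xs]   # perfectly linear dependence

# Hypothetical MIC of 1.0 for both noiseless relationships.
score_parabola = nonlinearity(1.0, xs, parabola)
score_line = nonlinearity(1.0, xs, line)
```

Under these assumptions the parabola scores close to 1 (high MIC, ρ near 0) and the line scores close to 0 (high MIC, ρ² near 1), which is why ranking by MIC − ρ² surfaces strong relationships that a linear measure misses.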
Application of MINE to Global Indicators from the World Health Organization
We next explored a yeast gene expression dataset (6,223 genes) that was previously analyzed with a special-purpose statistic developed by Spellman et al. to identify genes whose transcript levels oscillate during the cell cycle (26). Of the genes identified by Spellman et al. and MIC, 70% and 69%, respectively, were also identified in a later study with more time points conducted by Tu et al. However, MIC identified genes at a wider range of frequencies than Spellman et al., and MAS sorted those genes by frequency. Of the genes identified by MINE as having high frequency (MAS > 75th percentile), 80% were identified by Spellman et al., whereas of the low-frequency genes (MAS < 25th percentile), Spellman et al. identified only 20%. For example, although both methods found the well-known cell-cycle regulator HTB1, which is required for chromatin assembly, only MIC detected the heat-shock protein HSP12, which Tu et al. confirmed to be in the top 4% of periodic genes in yeast. HSP12, along with 43% of the genes identified by MINE but not by Spellman et al., was also in the top third of statistically significant periodic genes in yeast according to the more sophisticated specialty statistic of Ahdesmaki et al., which was specifically designed for finding periodic relationships without a pre-specified frequency in biological systems (24). Because of MIC's generality and the small size of this dataset (n = 24), relatively few of the genes analyzed (5%) had significant MIC scores after multiple testing correction at a false discovery rate of 5%. However, using a less conservative false discovery rate of 15% yielded a larger list of significant genes (16% of all genes analyzed), and this larger list still attained a 68% confirmation rate by Tu et al.
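The comparison of high- and low-frequency genes can be sketched as follows. This is an illustrative reconstruction under stated assumptions: MAS scores are supplied in a dictionary, the percentile cut-offs are taken over that same set of genes, and the gene names and scores are invented toy values rather than data from the study.

```python
import statistics


def overlap_by_mas_quartile(mas_by_gene, reference_set):
    """Fractions of top- and bottom-quartile MAS genes found in reference_set.

    mas_by_gene: dict mapping gene name to MAS score (hypothetical input).
    reference_set: genes reported by another method.
    """
    # Quartile cut points: 25th, 50th, and 75th percentiles of the MAS scores.
    q1, _, q3 = statistics.quantiles(mas_by_gene.values(), n=4)
    high = [g for g, s in mas_by_gene.items() if s > q3]
    low = [g for g, s in mas_by_gene.items() if s < q1]

    def fraction_in_reference(genes):
        return sum(g in reference_set for g in genes) / len(genes)

    return fraction_in_reference(high), fraction_in_reference(low)


# Toy MAS scores and a toy reference list (both hypothetical).
mas = {f"g{i}": i / 10 for i in range(1, 9)}
reference = {"g1", "g7", "g8"}
high_fraction, low_fraction = overlap_by_mas_quartile(mas, reference)
```

With real data, `high_fraction` and `low_fraction` would play the role of the 80% and 20% figures reported above.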
Application of MINE to S. cerevisiae Gene Expression Data
In the MLB dataset (131 variables), MIC and ρ both identified many linear relationships, but interesting differences emerged. On the basis of ρ, the strongest three correlates with player salary are walks, intentional walks, and runs batted in. In contrast, the strongest three associations according to MIC are hits, total bases, and a popular aggregate offensive statistic called Replacement Level Marginal Lineup Value (27) (Fig. S12, Table S12). We leave it to baseball enthusiasts to decide which of these statistics are (or should be!) more strongly tied to salary.
Our analysis of gut microbiota focused on the relationships between prevalence levels of the trillions of bacterial species that colonize the gut of humans and other mammals (35). The dataset consisted of large-scale sequencing of 16S ribosomal RNA from the distal gut microbiota of mice colonized with a human fecal sample (29). After successful colonization, a subset of the mice was shifted from a low-fat/plant-polysaccharide-rich (LF/PP) diet to a high-fat/high-sugar ‘Western’ diet. Our initial analysis identified 9,472 significant relationships (out of 22,414,860) between ‘species’-level groups called operational taxonomic units (OTUs); significantly more of these relationships occurred between OTUs in the same bacterial family than expected by chance (30% vs. 24 ± 0.6%).
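One way to obtain a chance expectation such as the 24 ± 0.6% quoted above is a label-permutation test. The sketch below is an assumption about how such a null could be built, not a description of the procedure used in the study; the OTU names, family labels, and function names are hypothetical.

```python
import random
import statistics


def same_family_fraction(pairs, family_of):
    """Fraction of OTU pairs whose two members share a family label."""
    return sum(family_of[a] == family_of[b] for a, b in pairs) / len(pairs)


def permutation_null(pairs, family_of, n_perm=1000, seed=0):
    """Mean and standard deviation of the same-family fraction when the
    family labels are shuffled across OTUs (assumed null model)."""
    rng = random.Random(seed)
    otus = list(family_of)
    labels = [family_of[o] for o in otus]
    fractions = []
    for _ in range(n_perm):
        rng.shuffle(labels)
        shuffled = dict(zip(otus, labels))
        fractions.append(same_family_fraction(pairs, shuffled))
    return statistics.mean(fractions), statistics.stdev(fractions)


# Toy data: six OTUs in three families, three "significant" pairs.
family_of = {"a": "F1", "b": "F1", "c": "F2", "d": "F2", "e": "F3", "f": "F3"}
pairs = [("a", "b"), ("c", "d"), ("a", "c")]

observed = same_family_fraction(pairs, family_of)
null_mean, null_sd = permutation_null(pairs, family_of)
```

Comparing `observed` with `null_mean` and `null_sd` gives the same kind of contrast as the reported 30% against 24 ± 0.6%.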
Examining the 1,001 top-scoring non-linear relationships (MIC − ρ² > 0.2), we observed that a common association type was ‘non-coexistence’: when one species is abundant the other is less abundant than expected by chance, and vice versa. Additionally, we found that 312 of the top 500 non-linear relationships were affected by one or more factors for which data were available (host diet, host sex, identity of human donor, collection method, and location in the gastrointestinal tract; SOM, Section 4.7). Many are non-coexistence relationships that are explained by diet (Table S13). These diet-explained non-coexistence relationships occur at a range of taxonomic depths (inter-phylum, inter-family, and intra-family) and form a highly interconnected network of non-linear relationships.
Associations Between Bacterial Species in the Gut Microbiota of ‘Humanized’ Mice
The remaining 188 of the 500 highly ranked non-linear relationships were not affected by any of the factors in the dataset, and included many non-coexistence relationships (Table S14). These unexplained non-coexistence relationships may suggest interspecies competition and/or additional selective factors that shape gut microbial ecology, and therefore represent promising directions for future study.