Perhaps the most critical component of any gene classifier discovery effort is independent validation. Especially in the field of ecotoxicology, there is a current prevalence of candidates but few field-ready classifiers for exposure and risk assessment. With a limited number of biological replicates in most microarray studies due to cost considerations, it is not clear how well these transcriptomics-based gene classifiers will perform in validations. Rather than directly pursuing more complicated multi-class classifiers in our current study, a choice was made to discover simpler binary classifiers for two reasons. One was to avoid potential computational issues arising from the complex biological interactions of EDCs all targeting the various components of the HPG axis. Second, a multi-class problem can be decomposed into a number of binary classifications problems [
34], and initial binary classifiers successfully identified could later be combined computationally to tackle a multi-class classification problem [
35]. While the microarray training dataset used in this study was not designed specifically for gene classifier discovery, it does represent typical microarray data obtained in ecotoxicology experiments, both in terms of the extent of biological sample replication, and the relevance of EDCs to chemicals of concern in fish and wildlife.
In order to make predictions on samples of interest of their chemical exposure history, a typical classification algorithm requires the availability of training data or its representation as a computational model derived beforehand. An approach like GSEA offers a flexible alternative. There are two possible scenarios for interpretation when GSEA determines that a classifier as assayed on samples of interest is enriched on the rank-ordered gene lists generated from reference GEPs associated with a known chemical. For an established classifier on samples of unknown, the chemical exposure history of these samples would be diagnosed by the chemical condition associated with the reference GEPs. For a candidate classifier on known samples, on the other hand, a connection of corresponding chemical conditions between these samples and reference GEPs would effectively validate the classifier. In a typical study of biomarkers, the same algorithm is used in their discovery and validation. In this research, however, independent algorithms were used in the two stages. The successful linkage, in 10 out of a total of 11 cases, of treatment conditions of classifiers to those of their respective GEPs in the training data demonstrated not only the strength of GA-SVM/GA-KNN for discovery, but also the utility of GSEA for validation.
The inconsistent predictive performance of several classifiers on both the original training data and later-generated validation data could be attributed in part to the data complexity of their respective chemical tissue conditions unresolved by computational modeling at their existing sample size. This complexity reflects the differential biological impact of an EDC on an individual fish relative to its overall population. In other words, with all other factors equal, the performance of a classifier is probably a function of both the (mechanisms of) action of the EDC and the population variation in organisms’ response to exposure of these chemicals, both genetic and environmental. Accordingly, the sample size required for classifier discovery would likely vary among different chemical-tissue conditions. When biological replication is inadequate for a given level of data complexity, model overfitting could result in an incorporation of both signal and noise, leading to unsuccessful predictions later [
36,
37].
As one of a variety of measures proposed to capture the data complexity of GEPs, F-ratio statistic varies in its degree of correlation with the performance of classifiers[
11,
12,
14], probably because of a wide range of linear separability present among microarray datasets [
13]. Many complexity measures including F-ratio are thought to be especially effective in characterizing linearly separable data [
10]. In the current study, a similar inconsistent correlation was also observed between F-ratio and classifier performance. In conditions like fadrozole-brain, prochloraz-brain, and prochloraz-ovary in the training data, F-ratio reflected the performance of classifiers (Table ). In other conditions such as prochloraz-brain across the training and validation data, this correlation broke down (Table ). And the F-ratio of the same condition, for example flutamide-ovary, also varied widely between the training and validation data in spite of the fact that fish samples involved were co-exposed in the same batch. Similar challenges of data complexity impacting classifier performance were also observed in cancer classification data [
15]. Thus, a better characterization of microarray data complexity probably requires an exploration of a full suite of statistical measures and depends highly on chemical-tissue conditions.
The discrepancy between GSEA validation results and the corresponding heat maps is difficult to interpret. The heat maps of prochloraz classifiers achieved a perfect separation of treated and control samples from the training data as expected. Given the partial success of these same classifiers on the validation data by GSEA, it was expected that this would also be reflected in the heat maps. While model overfitting cannot be completely ruled out here, it is likely that such a strict bifurcation of samples in heat maps via two-way clustering of both gene classifiers and samples is simply a more stringent criterion than FDR 0.25, a commonly accepted threshold for GSEA. Also notable is that, while GA-SVM/GA-KNN generated classifiers from the entire transcriptome for most of the conditions or from most individual TF networks for a given condition, a majority of these classifiers were later disqualified during visual examination of their heat maps under a strict requirement of bifurcation of treated and control samples. Some of these disqualified classifiers could be false negatives.
While the chemicals included in this study all target the HPG axis and preliminary classifiers were indeed found from this compiled group of genes for all 19 conditions, none of these classifiers was later qualified by their heat maps. A similar pattern was also observed in master regulators, a group of hub TFs anchoring gene regulatory networks previously found to be associated with various chemical-tissue conditions for this same set of EDCs [
18]. In contrast, a previous study of master regulators suggested that they tended to be stable and reliable classifiers [
9]. In addition, almost half of the 515 networks generated a qualified classifier for at least one condition. There are several possible explanations for these observations. First, strict separation of treated and control samples in a heat map might be a criterion too stringent for a qualifying classifier. Second, the target genes in the HPG axis operate in a variety of tissue types [
38] and many of them respond to multiple chemicals. Within a given tissue, these genes may not be involved in endocrine regulatory functions. And lastly, with regard to the master regulators, hub TFs may just be classifiers more suitable for problems with better-defined biological endpoints, as in the case of cancer morphology, than for chemicals with diverse treatment effects [
15].
With binary classifiers found for six chemical-tissue conditions by the transcriptome-wide searches, and 15 conditions by the network-specific searches, each containing many gene features, their distribution statistics across chemicals, tissues, and TF networks could offer some biologically important insights into the MOAs of EDCs. Assuming a positive connection between biological impact and the number of such classifiers and gene features involved, the 10 EDCs under study had the greatest effects on brain tissue, then ovary and testis. Chemical-wise, prochloraz would rank as the most active chemical (in terms of networks affected), followed by flutamide, fipronil, 17α-ethynyl estradiol, vinclozolin, ketoconazole, muscimol, fadrozole, 17β-trenbolone, and trilostane. Another interesting perspective came from the top TF networks which were shown to generate classifiers for a number of chemical-tissue conditions. Some of these networks, as denoted by their hub TFs, included XBP1, HIF1AB, YBX1, SMAD2, and YY1B. The cellular functions behind these TFs include stress response (XBP1) [
39], regulation of oxygen homeostasis (HIF1AB) [
40], regulation of transcription and/or translation (YBX1, YY1B) [
41,
42], and signal transduction (SMAD2) [
43], suggesting far-reaching biological effects of the test chemicals on zebrafish beyond the HPG axis.
With a current knowledge of molecular pathways associated with different types of chemical stressors often lacking, an attempt was made in this study to develop more mechanistically-based gene classifiers by conducting TF network-specific searches. The idea behind this approach was to first construct transcriptome-wide TF networks based on the GEPs from fish samples treated with multiple EDCs, and then to identify a subset of these networks significantly impacted by individual endocrine-active chemicals. Classifiers later identified specific to these individual networks would in effect be tied to EDCs through common TF networks, which would add a mechanistic perspective to these classifiers. In comparison, a classifier from the transcriptome-wide search was more likely to be a biologically random assemblage of gene features. Based on hundreds of individual TF networks previously constructed and linked to various EDCs by GSEA [
18], the network-specific search generated a substantial number of classifiers for multiple conditions in this study despite greatly reduced search space. This approach also yielded classifiers for more chemical-tissue conditions than the transcriptome-wide search. Upon confirmation, these classifiers could provide an explicit linkage among a chemical stressor, its underlying regulatory TF networks, and apical (whole animal) responses.
A common pattern emerging from both training and validation datasets is that multiple classifiers sharing no common gene features could predict the same target GEP. Assuming a subset of these classifiers is valid, such an overlapping prediction reflects complex interactions between the test chemicals and biological pathways within the tissues that comprise the HPG axis [
17,
18], some of which are expected according to the current biological understanding of these chemicals. For example, among the 10 EDCs, both vinclozolin and flutamide are androgen receptor antagonists in multiple androgen/estrogen-responsive tissues while ketoconazole, fadrozole, and prochloraz are overlapping inhibitors of cytochrome P450 11A, 17, and 19 respectively in the gonad. In both the training and validation data, the classifiers for these conditions often simultaneously diagnosed the same GEP, as is the case between prochloraz and ketoconazole. Perhaps more significant is when this relationship of multiple classifiers all predicting a single GEP occurred with seemingly unrelated EDCs, such as prochloraz and fipronil, the latter a GABA (gamma-amino butyric acid) receptor antagonist. Such a many-to-many relationship among classifiers and GEPs from different conditions would be informative with regard to the underlying pathways affected by these chemicals and their possible grouping into a smaller number of classes, thus reducing the need to develop classifiers for individual chemical-conditions.
GSEA proved to be a flexible tool in ecotoxicological application of gene classifiers without relying on their original training data. Future improvement could be made by adopting Cmap, a conceptually similar but more refined approach developed for human biomedical studies [
8]. Cmap connects small molecule drugs to one another or a small molecule drug to a human disease state by mapping a group of gene signatures to GEPs previously assembled as a large reference collection of ranked gene lists. Each GEP is generated from samples treated with a chemical of interest. With its demonstrated flexibility and effectiveness [
44], the Cmap algorithm could be an excellent analytical tool for ecotoxicological applications. The main barrier to its application is the need to generate and assemble a large collection of GEPs linked to a wide variety of chemical compounds of environmental significance. Conceivably, Cmap could be used in conjunction with preexisting gene classifiers to make predictions on field samples with unknown exposures, validate candidate gene classifiers using pre-established reference GEPs, categorize environmental stressors into a more manageable number of chemical classes based on pathways affected, and link chemical stressors to their apical biological endpoints.