We have analyzed the gene expression profile of 9 cell lines cultured in a normoxic or hypoxic environment in order to identify the hypoxia signature. We demonstrated that, differently from unsupervised approaches, l1
regularization with double optimization identified a cluster of 11 stable probesets separating hypoxic from normoxic cell lines even when hidden in the neuroblastoma cells transcriptome characterized by the high disturbance of genetic alterations. Biological signatures can be derived from cell lines based datasets using many different informatics approaches (for review see [19
]). This is the first report describing the use of the l1
regularization with double optimization protocol described in [38
] to distinguish datasets based on the biological status of the cells.
The first attempts to identify the hypoxia signature relied on three different unsupervised clustering analyses. These approaches detected major differences among the transcriptome of the cell lines driven by the characteristics of the cell lines themselves of which the cascade of events triggered by MYCN expression was a major component. However, the disturbance generated by these transcriptional patterns was such that the detection of more subtle changes induced by hypoxia was completely masked. This conclusion is supported by the results obtained with the supervised univariate analysis t-test which was able to identify a strong response associated to MYCN expression and, to a lesser extent to MYCN amplification, but not the hypoxia dependent response. The heterogeneity of neuroblastoma and neuroblastoma cell lines has been previously observed [43
The impossibility to obtain a hypoxia signature by unsupervised approaches prompted us to consider different algorithms based on a robust supervised variable selection technique, capable of detecting the hypoxia-induced transcriptome even in the presence of the disturbance of a strong competing signal. The l1
regularization allowed us to build a powerful discriminative rule and to define a signature of probesets also taking into account the presence of variables (probesets) correlated (collinear) with each other. The use of cross validation allows the selection protocol to generate an unbiased and objective output [21
] beyond the theoretical results that guarantee the robustness of the core algorithm [36
]. The strong discriminative power is proven by the 17% classification error that is a very low value when dealing with 18 samples and nearly 50 thousands variables.
We adopted a validation framework based on a double loop of leave-one-out cross-validation in order to extract unbiased estimates of the classification error. The outer loop produces 18 lists of relevant variables, from which we extract a common list by setting a frequency threshold. It can be appreciated from visual inspection of the frequency distribution that, for values lower than 30%, a large number of probesets is included, which are extremely unstable. For frequency above 70%, the number of selected probesets slowly decreases, and a plateau is present between 30 and 70%. Therefore, we set our frequency threshold to 50% that is the intermediate value of such a frequency plateau. The correlation parameter ε can be potentially tuned between 0 and +∞ in order to extract lists of probesets with different correlation degree. However, values of ε equal to or smaller than 1 provide the same minimal list which comprises 9 probesets. This list is minimal in that it does not include correlated probesets, and it can be viewed as the smallest set of variables needed to predict the hypoxic status without any prior information. Conversely, by increasing the correlation parameter ε we are able to expand the list to 11 probesets, which is obtained for ε ≥ 100. Since we are interested in all genes involved in the hypoxic condition, we define such a correlation-aware list as our hypoxia signature in neuroblastoma cell lines. This signature identifies also hypoxic status in astrocyte cell lines. We estimated a similar low cross validation error demonstrating that its application goes beyond the neuroblastoma lineage and providing an additional proof of its discriminatory power on out-of-set data. However, the neuroblastoma hypoxia signature is less efficient in discriminating hypoxic dendritic cells indicating the existence of a limited spectrum of hypoxic cell types that can be identified by this signature. Dendritic cells are terminally differentiated mononuclear phagocytes, biologically very far from the other cell types, deriving from hematopoietic precursors, short lived and programmed to serve as immunomodulatory cells [44
]. This conclusion supports the concept of the heterogeneity of the response to hypoxia in different cell types [13
]. The enrichment of the 11 probeset signature in the hypoxic phenotype of neuroblastoma cells and astrocytes provides biological validation of our approach by establishing a clear link between our signature and the hypoxic microenvironment.
The 11 probesets represents 8 genes all of which are known from the literature to be modulated by hypoxia in different cell types and to be part of key biological processes associated with the response to hypoxia, indicating, once more, the biological roots of our signature. ALDOC and PDK1 belong to the glycolytic pathway and are known to be up regulated by hypoxia in neuroblastoma [45
]. Potentiating of the oxygen-independent glycolytic pathway to comply with the energy demand, is one of the major cellular response to hypoxia [46
]. The energetic balance in cell metabolism has to be controlled by different mechanisms. For example, AK3L1 is known to be modulated by hypoxia and catalyzes the interconversion of adenine nucleotides playing an important role in cellular energy homeostasis in mitochondria [47
]. DDIT4, induced by hypoxic stimulus, is an inhibitor of the mTOR signalling pathway, that results in inhibition of protein synthesis which, in turn, may affect the cellular tolerance to hypoxia by promoting energy homeostasis [48
]. VEGF, a direct target of HIF-1, is secreted by a large variety of different hypoxic cells and promotes angiogenesis thereby favouring tumor growth and metastasis [4
]. BNIP3 and E2IG5 are two genes promoting hypoxia-induced apoptosis observed mainly at very low oxygen concentrations [49
]. E2IG5 is localized to mitochondria and facilitates apoptotic cell death via permeability transition, cytochrome c release, and caspase 9 activation [50
]. Hypoxia is also known to increase histone H3 methylation through histone methyltransferase G9a [51
]. WDR5B encodes for a protein that is the core component of histone methylation complexes, which are essential for histone H3 methylation. Thus, hypoxia might regulate chromatin organization and gene transcription by modulating WDR5B. Finally, the GenBank entry W57613
is part of the signature it is associated with a transcribed hypothetical protein FLJ11267 and was not previously known to be induced by hypoxia.
The novelty of our work is to introduce a rigorous and robust feature selection technique that can be exported to other experimental models and that is able to identify discriminative genes even in an adverse setting where the cell lines express great heterogeneity. The hypoxia signatures present in the literature show different sizes and composition [4
]. The MSigDB [55
] represents a valuable source of gene sets associated to the response to hypoxia. A first attempt to discuss our 11 selected probesets within the contest of 9 hypoxia signatures contained in the MSigDB is based on the analysis of the overlap among signatures (Table ). One limitation of this comparison is that it must be based on gene names rather than probesets, because of the heterogeneity of the platforms. All the genes of our hypoxia signature, but one, are represented at least once in the 9 signatures. DDIT4 is the only hypoxia inducible gene that is included only in our signature. This comparison lends further support to the conclusion that the l1
algorithm selected biologically relevant genes that have been included in other hypoxia signatures. There are at least three major reasons for the variability among the hypoxia signatures. The first is the diversity of the cell types as shown by Chi et al
] and Mense et al
] and supported by the observations on the heterogeneity among neuroblastoma cell lines by Fredlund et al
] and ourselves in this paper. Each cell responds to hypoxia on the bases of its own genetic make up, epigenetic constrains and differentiation stage. In fact, we show that our signature does not apply to dendritic cells, that are biologically very different from astrocytes and neuroblastoma. The second reason is the difference in the experimental setting and gene expression platforms. The need to collapse the microarray probes to gene names for comparisons is a direct consequence of this problem. The third, and more important issue, is the criterion used for assembling the signature. The majority of the signatures described so far, are based upon the representation of hypoxia associated biochemical pathways or the inclusion of differentially expressed genes rather than the essentiality and the discriminating power that we have chosen. Having defined the hypoxia signature as the minimal number of probesets capable of distinguishing normoxic and hypoxic gene expression profiles, our list is relatively short, not specific for a biochemical pathway, not relying on prior biological knowledge, but endowed with high discriminating power.
Hypoxia gene signatures overlapping
The classification performance of our signature is evident in representations that indicate as first approximation the up-regulation of the signature in hypoxic condition. However, the multidimensional visualization is needed to fully appreciate the strong discriminative power of our signature because it takes into account its multivariate nature. In fact, when projecting over the individual probesets of the signature, the two classes are only approximately separated, since they appear either partially overlapping or very close. Indeed, since the l1-l2 regularization is a multivariate method, there is no need to expect a single probeset to have perfect discriminatory power on the classes, but one has to take into account the 11-dimensional model. While the normoxic cell lines are highly grouped and close to low expression values, the hypoxic lines are well spread over the multidimensional space, though well separated by the normoxic ones. Again, this behavior can be detected only by means of a multivariate analysis, since the analysis of individually regulated genes allows detecting only those probesets which multidimensional representation would see the hypoxic cell lines very well lumped together.
The advances in genome biology provide a growing and impressive amount of data. The challenge is to unmask specific, biologically relevant gene clusters that may be hidden by the disturbance of changes in an overwhelming number of unrelated genes. Our study demonstrated that the l1-l2 regularization framework is able to discriminate between two statuses of a cell that, albeit biologically very different, does not elicit a modulation of gene expression comparable in magnitude to that induced, for example, by genetic alterations. This scenario mimics the situation occurring in the tumor mass in which the signal will be perceived by cell differing in their genetic makeup, differentiation and progression in the cell cycle. The strategy described here can be readily applied to the detection of the response to other environmental signals such as small metabolites or pH changes to allow the creation of a database of tissue environment related variables that will ultimately be a great asset in unraveling the biology of the tumor and the possibly the description of better prognostic signatures.