Our results with the sporulation data confirm that PCA can find a reduced set of variables that are useful for understanding the experiments. The application of PCA to time series is somewhat controversial because of the problems with uneven time intervals and the dependencies between data points. In this case, PCA identifies basic temporal patterns, such as the magnitude, change, and concavity of overall expression, as the important features that characterize genes. Application of PCA (unpublished result) to the publicly available cell division cycle data also reveals that PCA can identify periodic patterns in time series data (Spellman et al. 1998). For example, the cell cycle data reveal a 110 min period for the cdc15 synchronized experiment, consistent with the cell cycle duration.
Reduction of dimensionality in the sporulation data aids in data visualization; we can immediately see the unimodal quality of the sporulation data (). The unimodal distribution of expression in the most informative two dimensions suggests the genes do not fall into well-defined clusters.
In the initial presentation of the data, the investigators used clustering techniques to identify several gene classes relevant to sporulation: “metabolic”, “early I”, “early II”, “middle early”, “middle”, “middle late”, and “late” (Chu et al. 1998). For each class a canonical expression profile was calculated from a set of sample genes. These classes are plotted in ; each ellipse in the plot represents a class. The location and dimensions of each ellipse were calculated from the average and standard deviation of the sample genes of the class. They are drawn so that approximately 68% (+/− 1SD in both dimensions) of the genes in the class are enclosed; in they are drawn to enclose 95% (+/− 1.96SD) of the genes in the class.
Figure 4. A. All genes plotted with respect to the first and second principal components. Ellipses represent clusters identified in the original publication of the sporulation data. Ellipses are drawn to include 68% of the genes in the cluster. B. Ellipses are labelled ...
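The ellipse construction described above can be sketched in a few lines. This is only an illustration under stated assumptions: the ellipses are taken to be axis-aligned in the plane of the first two principal components, the sample standard deviation is used, and the function names and coordinates are hypothetical rather than taken from the original analysis.

```python
from statistics import mean, stdev


def class_ellipse(points, n_sd=1.0):
    """Return (center, semi_axes) of an axis-aligned ellipse for one gene class.

    points: list of (pc1, pc2) coordinates of the class's sample genes.
    n_sd:   half-width of the ellipse in standard deviations
            (1.0 and 1.96 are the two values mentioned in the text).
    """
    pc1 = [p[0] for p in points]
    pc2 = [p[1] for p in points]
    center = (mean(pc1), mean(pc2))
    # Assumption: sample standard deviation, computed per dimension.
    semi_axes = (n_sd * stdev(pc1), n_sd * stdev(pc2))
    return center, semi_axes


def inside(point, center, semi_axes):
    """True if the point lies within or on the ellipse."""
    dx = (point[0] - center[0]) / semi_axes[0]
    dy = (point[1] - center[1]) / semi_axes[1]
    return dx * dx + dy * dy <= 1.0


# Hypothetical coordinates for the sample genes of one class.
sample = [(1.0, -0.5), (1.4, -0.9), (0.8, -0.2), (1.2, -0.6)]
center, axes_1sd = class_ellipse(sample, n_sd=1.0)
_, axes_196sd = class_ellipse(sample, n_sd=1.96)
```

The wider ellipse is simply the narrower one scaled by 1.96 about the same center, which is why classes that look separate at 1 SD can overlap substantially at 1.96 SD.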
An approximate understanding of a class’s expression dynamic can be obtained quickly by looking at its location in the space of the first two principal components. For example, genes occupying the lower right quadrant (high PCA1, low PCA2) are up-regulated early but return to background later in sporulation. These genes have expression levels that decrease over time but maintain a high overall expression level relative to the control. Examples of these genes are ZIP1 (synaptonemal complex formation), IME2 (meiosis regulator), and HOP1 (homologous chromosome pairing), classified as “early I” or “metabolic” genes.
Exploring other quadrants can rapidly identify genes of potential interest. Genes with low overall expression levels that decrease over the course of sporulation can be found in the lower left quadrant. Many genes involved in metabolic or catabolic processes, such as ERG6 (ergosterol synthesis), FBP1 (gluconeogenesis), and SAM2 (methionine biosynthesis), are found in this quadrant. Genes in the upper left are initially repressed and return to normal. Many of these genes are involved in protein synthesis. Examples include ISF1 (RNA splicing), BAP3 (valine transporter), and DBP3 (RNA helicase). The early repression may correspond to the cells’ initial cessation of protein synthesis and growth; the renewed expression may function to pack the maturing spores with translation machinery (Chu et al. 1998).
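The quadrant reading described above amounts to a lookup on the signs of the first two component scores. The sketch below is illustrative only: it assumes the scores are centered so that zero separates "high" from "low", and the function name and wording of the descriptions are hypothetical paraphrases of the text. The upper right quadrant is not characterized in the passage, so it is labelled accordingly.

```python
def describe_quadrant(pc1, pc2):
    """Map a gene's (PC1, PC2) scores to the qualitative pattern for its quadrant.

    Assumption: scores are centered, so the sign of each score
    distinguishes 'high' from 'low'.
    """
    if pc1 >= 0 and pc2 < 0:
        return "lower right: high overall expression, decreasing over time"
    if pc1 < 0 and pc2 < 0:
        return "lower left: low overall expression, decreasing over time"
    if pc1 < 0 and pc2 >= 0:
        return "upper left: initially repressed, returning to normal"
    return "upper right: high PC1 and high PC2 (not characterized in the text)"


# Hypothetical scores for three genes.
for name, scores in {"geneA": (2.0, -1.0), "geneB": (-1.5, -0.4), "geneC": (-0.7, 1.2)}.items():
    print(name, "->", describe_quadrant(*scores))
```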
Principal components analysis is often used as a preprocessing step to clustering (Everitt 1993). However, our work suggests that clustering genes with certain expression data sets may not be appropriate. In , the genes are not located in clusters; rather, they are spread throughout this space. Focusing on the upper right quadrant in , it can be seen that the clusters presented in the original publication have a considerable amount of overlap. For unimodal or other smoothly varying distributions, distinctions drawn by clustering methodologies may be more confusing than helpful. In particular, these clusters highlight the potential biases introduced when clusters are analyzed using traditional cognitive categories. This observation corroborates the original investigators’ finding that the clusters are somewhat arbitrary; many genes were found to have high correlation with multiple cluster representatives (Chu et al. 1998). Perhaps it is more useful to determine the closest neighbors of a gene, rather than to seek well-defined clusters.
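As an illustration of the nearest-neighbor alternative suggested above, the following sketch ranks genes by Euclidean distance to a query gene in principal component space. The choice of Euclidean distance, the function name, and the gene names and coordinates are all assumptions made for the example; the text does not specify a distance measure.

```python
import math


def nearest_neighbors(query, coords, k=3):
    """Return the k genes closest to `query` in principal component space.

    coords: dict mapping gene name -> tuple of component scores.
    Assumption: Euclidean distance on the retained components.
    """
    q = coords[query]
    distances = [
        (math.dist(q, scores), name)
        for name, scores in coords.items()
        if name != query
    ]
    distances.sort()
    return [name for _, name in distances[:k]]


# Hypothetical component scores for four genes.
coords = {
    "geneA": (0.0, 0.0),
    "geneB": (0.1, 0.0),
    "geneC": (5.0, 5.0),
    "geneD": (0.0, 0.3),
}
neighbors = nearest_neighbors("geneA", coords, k=2)
```

Unlike a cluster assignment, the neighbor list changes smoothly as the query gene moves through the space, which suits a unimodal distribution with no natural boundaries.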
When we retain only the largest principal components, we lose the information about the experiments that explains the remaining variance in the data (5% of the sporulation variability is not explained by the first three components). However, our analysis identifies the variables that should be used for overall classification of genes, and thereby allows investigators to focus on the other, more subtle, variables whose values may be helpful in understanding the differences in gene expression under different conditions.
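The fraction of variance left unexplained after retaining the first few components can be computed from the singular values of the data matrix. The sketch below uses NumPy on a randomly generated matrix; the matrix shape, the column centering, and the function name are assumptions for illustration and do not reproduce the sporulation analysis or its 5% figure.

```python
import numpy as np


def explained_variance_ratio(X):
    """Fraction of total variance captured by each principal component.

    X: array of shape (genes, conditions).
    Assumption: columns are mean-centered before decomposition.
    """
    Xc = X - X.mean(axis=0)
    # Squared singular values of the centered matrix are proportional
    # to the variances along the principal components.
    singular_values = np.linalg.svd(Xc, compute_uv=False)
    variances = singular_values ** 2
    return variances / variances.sum()


# Hypothetical data: 200 genes, 7 conditions with unequal spread.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 7)) * np.array([5.0, 3.0, 2.0, 0.5, 0.4, 0.3, 0.2])

ratios = explained_variance_ratio(X)
unexplained_after_3 = 1.0 - np.cumsum(ratios)[2]
print(f"variance not explained by first three components: {unexplained_after_3:.1%}")
```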