Ann Appl Stat. Author manuscript; available in PMC 2010 October 8.
Published in final edited form as:
Ann Appl Stat. 2010 March 1; 4(1): 396–421.
doi:  10.1214/09-AOAS279
PMCID: PMC2951685
NIHMSID: NIHMS239265

Variable Selection and Updating In Model-Based Discriminant Analysis for High Dimensional Data with Food Authenticity Applications*

Abstract

Food authenticity studies are concerned with determining if food samples have been correctly labelled or not. Discriminant analysis methods are an integral part of the methodology for food authentication. Motivated by food authenticity applications, a model-based discriminant analysis method that includes variable selection is presented. The discriminant analysis model is fitted in a semi-supervised manner using both labeled and unlabeled data. The method is shown to give excellent classification performance on several high-dimensional multiclass food authenticity datasets with more variables than observations. The variables selected by the proposed method provide information about which variables are meaningful for classification purposes. A headlong search strategy for variable selection is shown to be efficient in terms of computation and achieves excellent classification performance. In applications to several food authenticity datasets, our proposed method outperformed default implementations of Random Forests, AdaBoost, transductive SVMs and Bayesian Multinomial Regression by substantial margins.

Keywords: Food authenticity studies, headlong search, model-based discriminant analysis, normal mixture models, semi-supervised learning, updating classification rules, variable selection

1 Introduction

Foods that are expensive are subject to potential fraud where rogue suppliers may attempt to provide a cheaper inauthentic alternative in place of the more expensive authentic food. Food authenticity studies are concerned with assessing the veracity of the labeling of food samples. Discriminant analysis methods are of prime importance in food authenticity studies where samples whose authenticity is being assessed are classified using a discriminant analysis method and the labeling and classification are compared. Samples determined to have potentially inaccurate labeling can be sent for further testing to determine if fraudulent labeling has been used.

Model-based discriminant analysis (Bensmail and Celeux 1996; Fraley and Raftery 2002) provides a framework for discriminant analysis based on parsimonious normal mixture models. This approach to discriminant analysis has been shown to be effective in practice and, being based on a statistical model, it allows uncertainty to be treated appropriately.

In many applications, only a subset of the variables in a discriminant analysis contains any group membership information, and including variables which have no group information increases the complexity of the analysis and can degrade classification performance. Therefore, variable selection needs to be part of any discriminant analysis procedure. Additionally, if a subset of variables is found to be sufficient for classification purposes, then it suggests the potential for collecting that smaller subset of variables using inexpensive methods rather than the full high-dimensional data.

Variable selection can be completed as a preprocessing step prior to discriminant analysis (a filtering approach) or as part of the analysis procedure (a wrapper approach). Completing variable selection prior to the discriminant analysis can lead to variables that have weak individual classification performance being omitted from the subsequent analysis. However, such variables could be important for classification purposes when jointly considered with others. Hence, performing variable selection as part of the discriminant analysis procedure is preferred.

Combining variable selection and linear or quadratic discriminant analysis has been considered previously in the literature; see McLachlan (1992, Chapter 12) for a review. Many of these methods are based on measuring the Mahalanobis distance between groups before and after the inclusion of a variable into the discriminant analysis model. In the machine learning literature, Kohavi and John (1997) developed a wrapper approach to variable selection in supervised learning, of which discriminant analysis is a special case.

Variable selection is of particular importance in situations where there are more variables than observations available; that is, large p, small n (n ≪ p) problems (West 2003). These situations arise with increasing frequency in statistical applications, including genetics, proteomics, image processing and food science. The two food science applications considered in Section 2 involve data sets with many more variables than observations.

In this paper, a version of model-based discriminant analysis is developed by adapting the model-based clustering with variable selection method of Raftery and Dean (2006). This method of discriminant analysis builds a discriminant rule in a stepwise manner by considering the inclusion of extra variables into the model and also considering removing existing variables from the model based on their importance. The stepwise selection procedure is iterated until convergence.

A brief review of model-based clustering and discriminant analysis is given in Section 3. The underlying model for model-based clustering with variable selection is reviewed in Section 3.1 and this model is extended to model-based discriminant analysis with variable selection in Section 3.2. In Section 3.3, the fitting of the discriminant analysis model is extended to incorporate semi-supervised updating using both the labeled and unlabeled observations (Dean et al. 2006) in order to improve the classification performance.

Search strategies for selecting the variables for inclusion and exclusion are discussed in Section 3.4. A headlong search strategy is proposed that combines good classification performance and computational efficiency. The proposed methodology is applied to the high dimensional datasets in Section 4 and the methodology and results are discussed in Section 5.

2 Data

2.1 Food Authenticity & Near Infrared Spectroscopy

An authentic food is one that is what it claims to be. Important aspects of food description include its process history, geographic origin, species/variety and purity. Food producers, regulators, retailers and consumers need to be assured of the authenticity of food products.

Food authenticity studies are concerned with establishing whether foods are authentic or not. Many analytical chemistry techniques are used in food authenticity studies, including gas chromatography, mass spectroscopy, and vibrational spectroscopic techniques (Raman, ultraviolet, mid-infrared, near-infrared and visible). All of these techniques have been shown to be capable of discriminating between certain sets of similar biological materials. Downey (1996) and Reid et al. (2006) provide reviews of food authenticity studies with an emphasis on spectroscopic methods. Near infrared (NIR) spectroscopy provides a quick and efficient method of collecting data for use in food authenticity studies (Downey 1996). It is particularly useful because it requires very little sample preparation and is non-destructive to the samples being tested.

We consider two food authenticity data sets which consist of combined visible and near-infrared spectroscopic measurements from food samples of different types. The aim of the food authenticity study is to classify the food samples into known groups. The two studies are outlined in detail in Sections 2.2 and 2.3:

  • Classifying meats into species (Beef, Chicken, Lamb, Pork, Turkey)
  • Classifying olive oils into geographic origin (Crete, Peloponnese, Other).

In both studies, combined visible and near infrared spectra were collected in reflectance mode using an NIRSystems 6500 instrument over the wavelength range 400–2498 nm at 2 nm intervals. The visible portion of the spectrum is the range 400–800 nm and the near-infrared region is the range 800–2498 nm. Hence, the values collected for each food sample consist of 1050 reflectance values taken at 2 nm intervals (see, for example, Figure 1). For the meat samples, twenty five separate scans were collected during a single passage of the spectrophotometer and averaged, after which the mean spectrum of a reference ceramic tile (16 scans) was recorded and subtracted from the mean spectrum. A similar process was used for the olive oil data, but fewer scans were used. Full details of the spectral data collection process are given in McElhinney et al. (1999) and Downey et al. (2003).
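As a check on these dimensions, the wavelength grid can be reconstructed directly; a minimal R sketch (the variable names are ours, purely for illustration):

    # Wavelength grid for the combined visible and near-infrared spectra:
    # 400 nm to 2498 nm in steps of 2 nm gives 1050 reflectance values per sample.
    wavelengths <- seq(400, 2498, by = 2)
    length(wavelengths)         # 1050
    sum(wavelengths <= 800)     # 201 wavelengths in the visible range (400-800 nm)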

Figure 1
The near-infrared spectra recorded for three examples of each meat species in the study. The discontinuity at 1100 nm is due to a sensor change at that value. The samples are colored as Beef=red, Lamb=green, Pork=blue, Turkey=orange, Chicken=yellow.

The reflectance values in the visible and near-infrared region are produced by vibrations in the chemical bonds in the substance being analyzed. The data are highly correlated due to the presence of a large number of overlapping broad peaks in this region of the electromagnetic spectrum and the presence of combinations and overtones. As a result, even though the data are very highly correlated, the reflectance values at adjacent wavelengths can have different sources and reflectance values at very different wavelengths can have the same source. So, the information encoded in each spectrum is recorded in a complex manner and spread over a range of locations. Osborne et al. (1993) provide an extensive review of the chemical and technological aspects of near-infrared spectroscopy and its application. Further information on the combined spectra and their structure is given in Section 4 where the results of the analysis of the data are given.

Because of the complex nature of the combined spectroscopic data, there is interest in determining if a small subset of reflectance values contains as much information for authentication purposes as the whole spectrum does. If a small number of variables contains sufficient information for authentication purposes, then this indicates the possibility of developing portable sensors for food authenticity studies that are more rapid and have a lower cost than recording the combined visible and near-infrared spectrum. In fact, portable sensors have been developed on a commercial basis for the authentication of Scotch whiskies (Connolly 2006) using ultraviolet spectroscopic technology. Hence, there are motivations from both the application and the modeling viewpoints for incorporating feature selection into the classification methods used on these data.

The problem of feature selection is especially difficult because the number of possible subsets of wavelengths that could be selected in this problem is 2^1050. So, efficient search strategies need to be used so that a good set of features can be selected without searching over all possible subsets.

2.2 Homogenized Meat Data

McElhinney et al. (1999) constructed a collection of combined visible and near-infrared spectra from 231 homogenized meat samples in order to assess the effectiveness of visible and near-infrared spectroscopy as a tool for determining the correct species of the samples. The samples collected for this study consist of 55 Chicken, 55 Turkey, 55 Pork, 32 Beef and 34 Lamb samples. The samples were collected over an extended period of time and from a number of sources in order to ensure a representative sample of meats.

For each sample, a spectrum consisting of 1050 reflectance measurements was recorded (as outlined in Section 2.1). A plot of all of the spectra is shown in Figure 1. We can see that there is a discrimination between the red meats (beef and lamb) and the white meats (chicken, turkey and pork) over some of the visible region (400–800 nm), but discrimination within meat colors is less clear.

2.3 Greek Olive Oils Data

Downey et al. (2003) recorded near-infrared spectra from a total of 65 extra virgin olive oil samples that were collected from three different regions in Greece (18 Crete, 28 Peloponnese, 19 Other). Each data value consists of 1050 reflectance values over the visible and near-infrared range. The aim of their study was to assess the effectiveness of near-infrared spectroscopy in determining the geographical origin (see Figure 2) of the oils.

Figure 2
Regions of Greece where the olive oil samples were collected.

3 Model-based Clustering and Discriminant Analysis

Model-based clustering (Banfield and Raftery 1993; Fraley and Raftery 1998, 2002; McLachlan and Peel 2000) uses mixture models as a framework for cluster analysis. The underlying model in model-based clustering is a normal mixture model with G components, that is,

f(x) = \sum_{g=1}^{G} \tau_g \, f(x \mid \mu_g, \Sigma_g),

where f(·|μg, Σg) is a multivariate normal density with mean μg and covariance Σg.

A central idea in model-based clustering is the use of constraints on the group covariance matrices Σg; these constraints use the eigenvalue decomposition of the covariance matrices to impose shape restrictions on the groups. The decomposition is of the form \Sigma_g = \lambda_g D_g A_g D_g^T, where λg is the largest eigenvalue, Dg is an orthonormal matrix of eigenvectors, and Ag is a diagonal matrix of scaled eigenvalues. The interpretations of the parameters in the covariance decomposition are: λg = volume; Ag = shape; Dg = orientation. These parameters can be constrained in various ways to be equal or variable across groups. Additionally, the shape and orientation matrices can be set equal to the identity matrix.
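To make the decomposition concrete, the following sketch recovers the volume, shape and orientation parameters from a sample covariance matrix using base R; this is purely illustrative, and the constrained fitting itself is handled by the mclust package.

    # Eigenvalue decomposition Sigma_g = lambda_g * D_g %*% A_g %*% t(D_g):
    # lambda_g (volume) is the largest eigenvalue, A_g (shape) is a diagonal
    # matrix of scaled eigenvalues, D_g (orientation) holds the eigenvectors.
    set.seed(1)
    X <- matrix(rnorm(200), ncol = 2) %*% matrix(c(2, 0.5, 0.5, 1), 2, 2)
    Sigma <- cov(X)

    eig <- eigen(Sigma)
    lambda <- eig$values[1]            # volume: the largest eigenvalue
    A <- diag(eig$values / lambda)     # shape: eigenvalues scaled by the largest
    D <- eig$vectors                   # orientation: orthonormal eigenvectors

    # Reconstruct the covariance matrix from its volume, shape and orientation.
    all.equal(Sigma, lambda * D %*% A %*% t(D))   # TRUE (up to numerical error)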

Bensmail and Celeux (1996) developed model-based discriminant analysis methods using the same covariance decomposition. An extension of model-based discriminant analysis that allows for updating of the classification rule using the unlabeled data was developed by Dean et al. (2006) and will be described in more detail in Section 3.3. Model-based clustering and discriminant analysis can be implemented in the statistics package R (R Development Core Team 2007) using the mclust package (Fraley and Raftery 1999, 2003, 2007).

3.1 Model-based Clustering with Variable selection

We argue that variable selection needs to be part of the discriminant analysis procedure, because completing variable selection prior to discriminant analysis may lose important grouping information. This argument is supported by the result of Chang (1983), who showed that the principal components corresponding to the larger eigenvalues do not necessarily contain information about group structure. This suggests that the commonly used filter approach of selecting the first few principal components to explain a minimum percentage of variation can be suboptimal. A similar argument can be made that selecting discriminating variables without reference to the grouping variable may miss important variables. In addition, some variables may contain strong group information when used in combination with other variables, but not on their own. Another critique of completing a variable (or feature) selection step before supervised learning (filtering) is given by Kohavi and John (1997, Section 2.4).

Raftery and Dean (2006) developed a stepwise variable selection wrapper for model-based clustering. With their method, variables are selected in a stepwise manner. Their method involves the stages:

  • A variable is proposed for addition to the set of selected clustering variables. The Bayesian Information Criterion (BIC) is used to compare a model in which the variable contains extra information about the clustering beyond the information in the already selected variables versus a model in which the variable does not contain additional clustering information beyond that in the already selected variables. The variable with the greatest positive BIC difference is added to the model. If no proposed variable has a positive BIC difference, then no variable is added.
  • BIC is used to consider whether a variable should be removed from the model; this step is the reverse of the variable addition step. If all of the selected variables contain clustering information, then none is removed from the set of selected clustering variables.

This process is iterated until no further variables are added or removed. This approach, which combines variable selection and cluster analysis, avoids the problems of completing variable selection independently of the clustering. While the stepwise variable selection wrapper proposed in Raftery and Dean (2006) and other wrapper approaches can give excellent clustering results, there is a considerable computational burden with wrapper approaches when compared to filtering approaches; this is because the model needs to be fitted each time a variable is added to or removed from the set of selected clustering variables.
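For reference, the clustering version of this wrapper is implemented in the clustvarsel package for R, which implements the Raftery and Dean (2006) procedure; the following minimal sketch assumes that package's default interface and uses the banknote data shipped with mclust purely as an illustration.

    # Stepwise variable selection wrapped around model-based clustering
    # (Raftery and Dean 2006); see the clustvarsel documentation for details.
    library(clustvarsel)
    library(mclust)

    data(banknote)                  # example data from the mclust package
    X <- banknote[, -1]             # drop the class label; cluster the six measurements
    out <- clustvarsel(X, G = 1:3)  # greedy stepwise search over clustering variables
    out                             # prints the selected subset of clustering variables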

3.2 Model-based Discriminant Analysis With Variable Selection

We adapt the ideas of Raftery and Dean (2006) to produce a discriminant analysis technique that includes a stepwise variable selection wrapper. This discriminant analysis method uses a stepwise variable selection procedure to find a subset of variables that gives good classification results.

Each stage of the algorithm involves two steps:

  • Determine if a variable (not already selected) would contribute to the discriminant analysis model. To do this, BIC is used to compare a discriminant analysis model in which the variable contains group information beyond the information in the already selected variables versus a model in which the variable does not contain group information beyond that in the already selected variables. Variables for which the BIC difference is positive are candidates for addition to the set of selected variables; the procedure for searching for variables to add to the model is given in Section 3.4.
  • Determine if any selected variables should be removed from the discriminant analysis model. This step is the reverse of the variable addition step. Variables for which the BIC model comparison suggests that the variable does not contain group information are candidates for removal from the set of selected variables; the procedure for searching for variables to remove from the model is outlined in Section 3.4.

Let (x_1, x_2, ..., x_n) be the observed data values and let (l_1, l_2, ..., l_n) be the group indicator variables for these observations, where l_{ig} = 1 if observation i belongs to group g and l_{ig} = 0 otherwise.

Suppose that the observation x_i is partitioned into three parts: x_i^{(c)}, the variables already chosen; x_i^{(p)}, the variable being proposed; and x_i^{(o)}, the remaining variables. The decision on whether to include or exclude a proposed variable is based on the comparison of two models:

  • Grouping: p(x_i | l_i) = p(x_i^{(c)}, x_i^{(p)}, x_i^{(o)} | l_i) = p(x_i^{(o)} | x_i^{(p)}, x_i^{(c)}) p(x_i^{(p)}, x_i^{(c)} | l_i).
  • No Grouping: p(x_i | l_i) = p(x_i^{(c)}, x_i^{(p)}, x_i^{(o)} | l_i) = p(x_i^{(o)} | x_i^{(p)}, x_i^{(c)}) p(x_i^{(p)} | x_i^{(c)}) p(x_i^{(c)} | l_i).

Figure 3 shows the difference between the “Grouping” and “No Grouping” models for x_i. If the Grouping model holds, x_i^{(p)} provides information about which group the data value belongs to beyond that provided by x_i^{(c)}, while if the No Grouping model holds, x_i^{(p)} provides no extra information.

Figure 3
A graphical model representation of the Grouping and the No Grouping models.

The Grouping and No Grouping models are specified as follows:

  • Grouping: We let p(x_i^{(p)}, x_i^{(c)} | l_i) be a normal density with parsimonious covariance structure as described in Table 1. That is,
    (x_i^{(p)}, x_i^{(c)}) \mid (l_{ig} = 1) \sim N(\mu_g^{(p,c)}, \Sigma_g^{(p,c)}), \quad l_i \sim \text{Multinomial}(1, \tau),
    where τ = (τ_1, τ_2, ..., τ_G).
    Table 1
    Constrained covariance structures in model-based clustering as implemented in the mclust package for R.
  • No Grouping: We let p(x_i^{(c)} | l_i) be a normal density with parsimonious covariance structure. In addition, p(x_i^{(p)} | x_i^{(c)}) is assumed to have a linear regression model structure. That is,
    x_i^{(c)} \mid (l_{ig} = 1) \sim N(\mu_g^{(c)}, \Sigma_g^{(c)}), \quad l_i \sim \text{Multinomial}(1, \tau), \quad x_i^{(p)} \mid x_i^{(c)} \sim N(\alpha + \beta^T x_i^{(c)}, \sigma^2),
    where τ = (τ_1, τ_2, ..., τ_G).

The same model structure is assumed for p(x_i^{(o)} | x_i^{(c)}, x_i^{(p)}) in the Grouping model as in the No Grouping model. Therefore, this part of the model does not influence the choice to include x_i^{(p)} in the model or not.

The decision as to whether the Grouping or No Grouping model is appropriate is made using the BIC approximation of the log Bayes factor. The logarithm of the Bayes factor is

\log(\text{Bayes Factor}) = \log \frac{p(x_i \mid M_G)}{p(x_i \mid M_{NG})},
(1)

where M_G is the Grouping model, M_NG is the No Grouping model and

p(x_i \mid M_k) = \int p(x_i \mid \theta_k, M_k) \, p(\theta_k \mid M_k) \, d\theta_k

is the integrated likelihood of model M_k. We use the BIC approximation of the integrated likelihood in the form

\text{BIC} = \log(\text{maximized likelihood}) - \frac{d}{2} \log(n),

where d is the number of parameters in the model and n is the sample size (Schwarz 1978). Following Raftery and Dean (2006), the log Bayes factor (1) can be reduced to

\log(\text{Bayes Factor}) = \log \frac{p(x_i^{(p)}, x_i^{(c)} \mid M_G)}{p(x_i^{(p)} \mid x_i^{(c)}, M_{NG}) \, p(x_i^{(c)} \mid M_{NG})} \approx \text{BIC(Grouping)} - \text{BIC(No Grouping)},
(2)

which involves only (x_i^{(c)}, x_i^{(p)}) and not x_i^{(o)}. Variables with a positive value of BIC(Grouping) − BIC(No Grouping) are candidates for addition to the model.
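To make the comparison concrete, the following simplified R sketch computes this BIC difference for a single candidate variable using labeled observations only. It uses an unconstrained group-specific covariance in place of the ten constrained structures of Table 1, and the multinomial term for the labels is dropped because it is common to both models and cancels in the difference; the function and variable names are illustrative, not the authors' software.

    # Simplified BIC comparison for adding one candidate variable, using labeled
    # data only and an unconstrained group-specific covariance.
    library(mvtnorm)

    bic <- function(loglik, d, n) loglik - (d / 2) * log(n)

    grouping_bic <- function(Y, labels) {
      # Group-specific Gaussian log-likelihood for Y (e.g. cbind(x_prop, x_chosen))
      Y <- as.matrix(Y)
      n <- nrow(Y); p <- ncol(Y); ll <- 0; d <- 0
      for (g in unique(labels)) {
        Yg <- Y[labels == g, , drop = FALSE]
        mu <- colMeans(Yg)
        Sigma <- cov(Yg) * (nrow(Yg) - 1) / nrow(Yg)   # maximum likelihood covariance
        ll <- ll + sum(dmvnorm(Yg, mean = mu, sigma = Sigma, log = TRUE))
        d <- d + p + p * (p + 1) / 2                   # mean and covariance parameters
      }
      bic(ll, d, n)
    }

    no_grouping_bic <- function(x_prop, x_chosen, labels) {
      # Group-specific Gaussian for the chosen variables plus a linear regression
      # of the proposed variable on the chosen variables.
      fit <- lm(x_prop ~ x_chosen)
      ll_reg <- as.numeric(logLik(fit))
      d_reg <- length(coef(fit)) + 1                   # regression coefficients + error variance
      grouping_bic(x_chosen, labels) + bic(ll_reg, d_reg, length(x_prop))
    }

    # BIC(Grouping) - BIC(No Grouping); a positive value supports adding the variable.
    bic_difference <- function(x_prop, x_chosen, labels) {
      grouping_bic(cbind(x_prop, x_chosen), labels) -
        no_grouping_bic(x_prop, x_chosen, labels)
    }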

At each variable addition stage, the BIC of the Grouping model is calculated using each of the ten covariance structures given in Table 1 and the structure with the highest BIC is selected as the Grouping model for comparison purposes.
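In software terms, this step can be sketched with mclust's eigenvalue-decomposition discriminant models, assuming the MclustDA interface with modelType = "EDDA" (which fits the constrained covariance structures and selects among them by BIC); Y and labels are illustrative names for the candidate variables and the known groups.

    # Fit the Grouping model over the chosen + proposed variables under the
    # constrained covariance structures and keep the one with the highest BIC.
    library(mclust)
    fit <- MclustDA(Y, class = labels, modelType = "EDDA")
    summary(fit)      # reports the covariance structure selected (e.g. "EEV")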

At each stage, we also check whether an already chosen variable should be removed from the model. This decision is made on the basis of the BIC difference in a similar way to before. In this case, x_i^{(p)} takes the role of the variable to be dropped, x_i^{(c)} takes the role of the remaining chosen variables and x_i^{(o)} are the other variables. Variables with a positive value of BIC(No Grouping) − BIC(Grouping) are candidates for removal from the model; in this case, the BIC for the No Grouping model is computed for all covariance structures in Table 1 and the structure with the highest BIC is selected as the No Grouping model.

3.3 Discriminant Analysis with Updating

In standard discriminant analysis, the unlabeled data are not used in the model fitting procedure. However, these data contain information that is potentially important, especially when very few labeled data values are available. We can model both the labeled and unlabeled data as coming from the same model, but with the unlabeled data missing the labeling variable; this leads to a mixture model for the unlabeled data. Hence, the unlabeled data can be used to help fit a model to the data. This idea has been investigated by many authors including Ganesalingam and McLachlan (1978) and O'Neill (1978) and more recently by Dean et al. (2006), Chapelle et al. (2006), Toher et al. (2007) and Liang et al. (2007).

Let (x_1, l_1), (x_2, l_2), ..., (x_N, l_N) be the labeled data and y_1, y_2, ..., y_M be the unlabeled data. We let z = (z_1, z_2, ..., z_M) be the unobserved (missing) labels for the unlabeled data. In this framework, the Grouping and No Grouping models for the observed data are of the form:

  • Grouping: We let p(x_i^{(p)}, x_i^{(c)} | l_i) be a normal density with parsimonious covariance structure as described in Table 1, namely
    (x_i^{(p)}, x_i^{(c)}) \mid (l_{ig} = 1) \sim N(\mu_g^{(p,c)}, \Sigma_g^{(p,c)}), \quad l_i \sim \text{Multinomial}(1, \tau).
    Also, p(y_j^{(p)}, y_j^{(c)}) is a mixture of normals with parsimonious covariance structures, namely
    (y_j^{(p)}, y_j^{(c)}) \sim \sum_{g=1}^{G} \tau_g N(\mu_g^{(p,c)}, \Sigma_g^{(p,c)}).
  • No Grouping: We let p(x_i^{(c)} | l_i) be a normal density with parsimonious covariance structure, namely
    x_i^{(c)} \mid (l_{ig} = 1) \sim N(\mu_g^{(c)}, \Sigma_g^{(c)}), \quad l_i \sim \text{Multinomial}(1, \tau).
    We also let p(y_j^{(c)}) be a mixture of normal densities with parsimonious covariance structure, namely
    y_j^{(c)} \sim \sum_{g=1}^{G} \tau_g N(\mu_g^{(c)}, \Sigma_g^{(c)}).
    In addition, we assume a linear regression model for p(x_i^{(p)} | x_i^{(c)}) and p(y_j^{(p)} | y_j^{(c)}), namely
    x_i^{(p)} \mid x_i^{(c)} \sim N(\alpha + \beta^T x_i^{(c)}, \sigma^2) \quad \text{and} \quad y_j^{(p)} \mid y_j^{(c)} \sim N(\alpha + \beta^T y_j^{(c)}, \sigma^2).

In both models, we assume an identical model structure for p(x_i^{(o)} | x_i^{(c)}, x_i^{(p)}) and p(y_j^{(o)} | y_j^{(c)}, y_j^{(p)}), and this does not affect the choice to include a variable in the model or not.

This model can be fitted using the EM algorithm (Dempster et al. 1977) by introducing the missing labels z into the model. The calculations involved in fitting the model to both the labeled and unlabeled data follow those outlined in Dean et al. (2006). The maximum likelihood estimates for the regression part of the model correspond to least squares estimates of the regression parameters.
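A simplified sketch of this EM fit is given below; it keeps the labeled observations at their known memberships, updates the memberships of the unlabeled observations in the E-step, and uses unconstrained group covariances rather than the structures of Table 1. The names (X for the labeled data, Y for the unlabeled data) are illustrative, and the densities are computed on the raw scale for brevity; a practical implementation would work on the log scale.

    # Simplified EM for fitting the model to labeled data (X, labels) and
    # unlabeled data Y: labeled observations keep their known memberships,
    # unlabeled memberships are updated in the E-step.
    library(mvtnorm)

    em_update <- function(X, labels, Y, n_iter = 100) {
      classes <- sort(unique(as.character(labels)))
      G <- length(classes); n <- nrow(X); m <- nrow(Y)
      L <- outer(as.character(labels), classes, "==") * 1   # fixed n x G indicator matrix
      z <- matrix(1 / G, m, G)   # memberships of the unlabeled data (uniform start;
                                 # in practice, initialize from a fit to the labeled data)
      for (iter in seq_len(n_iter)) {
        # M-step: weighted estimates from labeled indicators and unlabeled memberships.
        W <- rbind(L, z); XY <- rbind(X, Y)
        tau <- colSums(W) / (n + m)
        mu <- lapply(seq_len(G), function(g) colSums(W[, g] * XY) / sum(W[, g]))
        Sigma <- lapply(seq_len(G), function(g) {
          C <- sweep(XY, 2, mu[[g]])
          crossprod(C * W[, g], C) / sum(W[, g])
        })
        # E-step: update the posterior memberships of the unlabeled observations only.
        dens <- sapply(seq_len(G), function(g)
          tau[g] * dmvnorm(Y, mean = mu[[g]], sigma = Sigma[[g]]))
        z <- dens / rowSums(dens)
      }
      list(tau = tau, mu = mu, Sigma = Sigma, z = z)
    }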

The final estimates of the posterior probabilities of group membership produced by the EM algorithm are used to classify the unlabeled observations. Thus each observation j is classified into the group g that maximizes \hat{z}_{jg} over g, where

\hat{z}_{jg} = \frac{\hat{\tau}_g \, p(y_j^{(c)} \mid \hat{\mu}_g^{(c)}, \hat{\Sigma}_g^{(c)})}{\sum_{g'=1}^{G} \hat{\tau}_{g'} \, p(y_j^{(c)} \mid \hat{\mu}_{g'}^{(c)}, \hat{\Sigma}_{g'}^{(c)})},

y_j^{(c)} is the set of chosen variables for observation j, and {(\hat{\tau}_g, \hat{\mu}_g^{(c)}, \hat{\Sigma}_g^{(c)}) : g = 1, 2, ..., G} are the maximum likelihood estimates of the unknown model parameters for this set of chosen variables.
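A minimal sketch of this classification step, assuming the estimates tau_hat, mu_hat and Sigma_hat for the chosen variables have already been obtained (unconstrained covariances and illustrative names, not the authors' code):

    # Classify unlabeled observations (the rows of Y, restricted to the chosen
    # variables) using estimated mixing proportions, means and covariances.
    library(mvtnorm)

    classify <- function(Y, tau_hat, mu_hat, Sigma_hat) {
      G <- length(tau_hat)
      # log of tau_g * phi(y_j | mu_g, Sigma_g) for each observation and group
      log_dens <- sapply(seq_len(G), function(g)
        log(tau_hat[g]) + dmvnorm(Y, mean = mu_hat[[g]], sigma = Sigma_hat[[g]], log = TRUE))
      # Normalize each row to obtain the posterior probabilities z_hat[j, g].
      z_hat <- exp(log_dens - apply(log_dens, 1, max))
      z_hat <- z_hat / rowSums(z_hat)
      list(z = z_hat, class = max.col(z_hat))   # assign y_j to the group maximizing z_hat[j, g]
    }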

3.3.1 Example

An illustrative example of the BIC calculations when the proposed algorithm is applied to the meat spectroscopy data is shown in Figures 4–6; half the data of each type were randomly selected as training data in this example.

Figure 4
A plot of the BIC difference for each wavelength. The wavelength with the greatest BIC difference is 626 nm.
Figure 6
A plot of the BIC difference for each wavelength given that the first two wavelengths chosen (626 nm and 814 nm) are already accepted. The wavelength with the greatest BIC difference is 774 nm.

The variable selection algorithm begins by selecting 626 nm as the wavelength with the greatest difference between the Grouping and No Grouping models (Figure 4), and the E covariance structure was chosen. It is worth noting that wavelengths close to 626 nm still have strong evidence of grouping even though the spectra are smoothly varying. This phenomenon is due to the fact that the spectrum consists of a number of overlapping peaks and the reflectances at adjacent locations can have different sources. As a result, extra grouping information can be available at wavelengths that are very close together.

Subsequently, the 814 nm wavelength is added to the model (Figure 5) and the EEV covariance structure was chosen. At the third stage, the 774 nm wavelength is selected (Figure 6) and the VEV covariance structure was chosen. The procedure continues until thirteen wavelengths are selected (details of the iterations are given in Table 2) and the VEV covariance structure is chosen at all subsequent stages.

Figure 5
A plot of the BIC difference for each wavelength given that wavelength 626 nm is already accepted. The wavelength with the greatest BIC difference is 814 nm. Note that wavelengths close to 626 nm still have positive BIC difference values.
Table 2
A full example of the variable selection procedure used to classify the meat samples into five types. The updating procedure was used in this example.

Interestingly, many of the chosen wavelengths are in the visible range (400–800 nm) of the spectrum indicating that color is important when separating the meat samples. The closest two wavelengths that were selected were 2310 nm and 2316 nm and a number of wavelengths that were selected are approximately 20 nm apart. In summary, the selected wavelengths are spread out mainly in the visible region but some wavelengths were selected in the near-infrared region.

3.4 Headlong Model Search Strategy

The variable selection algorithm demonstrated in Section 3.3.1 is a greedy search strategy. At the variable addition stages of the algorithm, the variable with the greatest BIC difference is added and at variable removal stages the variable with the greatest BIC difference is removed. The process of finding the variable with the greatest BIC difference involves calculating the BIC difference for all variables under consideration; for the spectroscopic data there are typically just under 1050 variables under consideration at the variable addition stages. Hence, this search strategy is computationally demanding; this feature is shared by other wrapper variable selection methods too.

A less computationally expensive alternative is to use a headlong search strategy (Badsberg 1992). The variable added or removed in the headlong search strategy need not be the best in terms of having the greatest BIC difference; it merely needs to be the first variable considered whose difference is greater than some pre-specified value (here min.evidence). We found that min.evidence = 0 gave good results for the applications in this paper. The headlong strategy has close connections to the “first-improvement” moves used in local search algorithms (e.g., Hoos and Stützle 2005, Chapter 2.1). This means that instead of adding the variable with the greatest evidence for Grouping versus No Grouping, the first variable found to have a certain amount of evidence for Grouping versus No Grouping is added. At the variable addition stages of the algorithm, the remaining variables are examined in turn from an ordered list. The initial order of the list is based on the variables’ original BIC differences at the univariate addition stage; this ordering was used in a similar context in Yeung et al. (2005). We experimented with the initial ordering and also tried using increasing wavelength and decreasing wavelength. The classification performance was not sensitive to the initial ordering but the selected variables did depend on the ordering; with the increasing and decreasing wavelength orderings there was a bias towards selecting low and high wavelengths, respectively.

Here is a summary of the algorithm; an illustrative sketch of the search loop follows the list.

  1. Select the first variable that is added to be the one that has the most evidence for Grouping versus No Grouping in terms of greatest BIC difference (the same as the first step of the greedy search algorithm). Create a list of the remaining variables in decreasing order of BIC differences.
  2. Select the second variable that is added to be the first variable in the list of remaining variables with BIC difference for Grouping versus No Grouping, including the first variable selected, greater than min.evidence. Any variable checked and not used at this stage is placed at the end of the list of remaining variables.
  3. Select the next variable that is added to be the first variable in the list of remaining variables with BIC difference for Grouping versus No Grouping, including the previous variables selected, greater than min.evidence. If no variable has BIC difference greater than min.evidence then no variable is added at this stage. Any variable checked and not used at this stage is placed, in turn, at the end of the list of remaining variables.
  4. Check in turn each variable currently selected (in reverse order of inclusion) for evidence of No Grouping (versus Grouping), including the other selected variables, and remove the first variable with BIC difference greater than min.evidence. If no variable has BIC difference greater than min.evidence then no variable is removed at this stage. The removed variable is placed at the end of the list of other remaining variables.
  5. Iterate steps 3 and 4 until two consecutive steps have been rejected, then stop.
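The following R sketch outlines this headlong loop under the simplifying assumptions of the earlier BIC sketch in Section 3.2 (labeled data only, unconstrained covariances); grouping_bic, bic and bic_difference are the illustrative helpers defined there, and min_evidence plays the role of min.evidence.

    # Headlong stepwise search (simplified): variables are examined in order and
    # the first one clearing min_evidence is added or removed. X is the n x p data
    # matrix of labeled observations and labels holds their groups.
    headlong_search <- function(X, labels, min_evidence = 0) {
      # Step 1: start from the single variable with the greatest BIC difference,
      # comparing the univariate Grouping model with a single normal density.
      first_diffs <- sapply(seq_len(ncol(X)), function(j)
        grouping_bic(X[, j, drop = FALSE], labels) -
          bic(as.numeric(logLik(lm(X[, j] ~ 1))), 2, nrow(X)))
      chosen <- which.max(first_diffs)
      remaining <- setdiff(order(first_diffs, decreasing = TRUE), chosen)

      repeat {
        added <- removed <- FALSE
        # Steps 2-3: add the first remaining variable with evidence above the threshold.
        for (j in remaining) {
          if (bic_difference(X[, j], X[, chosen, drop = FALSE], labels) > min_evidence) {
            chosen <- c(chosen, j); remaining <- setdiff(remaining, j); added <- TRUE
            break
          }
          remaining <- c(setdiff(remaining, j), j)   # examined but not used: move to the end
        }
        # Step 4: remove the first chosen variable (in reverse order of inclusion)
        # showing evidence of No Grouping given the other chosen variables.
        for (j in rev(chosen)) {
          if (length(chosen) > 1 &&
              bic_difference(X[, j], X[, setdiff(chosen, j), drop = FALSE], labels) <
                -min_evidence) {
            chosen <- setdiff(chosen, j); remaining <- c(remaining, j); removed <- TRUE
            break
          }
        }
        if (!added && !removed) break   # stop when both steps are rejected
      }
      chosen
    }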

4 Results

The proposed methodology was applied to the two food authenticity data sets outlined in Section 2.1. In each case, the data were split so that 50% of the data were used as labeled data and 50% as unlabeled. The methodology was applied to 50 random splits of labeled and unlabeled data and the mean and standard deviation of the classification rate were computed.
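A sketch of this evaluation protocol is given below; half of each class is taken as labeled, following the description above, and fit_and_classify() is a hypothetical placeholder for the full variable selection and updating procedure applied to one split.

    # Repeat the random labeled/unlabeled split 50 times and summarize the
    # classification rate; half of each class is treated as labeled.
    evaluate_splits <- function(X, labels, n_splits = 50) {
      rates <- replicate(n_splits, {
        labeled <- unlist(lapply(split(seq_along(labels), labels),
                                 function(idx) sample(idx, floor(length(idx) / 2))))
        pred <- fit_and_classify(X[labeled, ], labels[labeled], X[-labeled, ])
        mean(pred == labels[-labeled])   # classification rate on the unlabeled half
      })
      c(mean = mean(rates), sd = sd(rates))
    }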

The results were compared to previously reported performance results for these data and to several widely used alternative techniques: Random Forests (Breiman 2001), AdaBoost (Freund and Schapire 1997), Bayesian Multinomial Regression (Madigan et al. 2005), and Transductive Support Vector Machines (Vapnik 1995; Joachims 1999; Collobert et al. 2006).

We used the default settings in the R (R Development Core Team 2007) implementations of Random Forests (randomForest version 4.5-30) (Liaw and Wiener 2002) and AdaBoost (adabag version 1.1) (Cortés et al. 2007). Various other parameter settings were explored, but the results did not vary greatly with the choice of parameter values. For Bayesian Multinomial Regression we used cross-validation to choose among the prior variance values {10^p : p = −4, −3, −2, −1, 0, 1, 2, 3, 4}, as suggested in Genkin et al. (2005). For the Transductive SVM analysis we used the UniverSVM software version 1.1 (Sinz and Roffilli 2007) with a linear kernel and parameters (c, s, z) = (100, −0.3, 0.1); other parameter values were considered but the values reported yielded the best classifications.
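For illustration, one of the comparison methods run with its default settings might look as follows; X_train, labels_train, X_test and labels_test are assumed to have been constructed from a random split of the data.

    # Random Forests with default settings (one of the comparison methods).
    library(randomForest)
    rf <- randomForest(x = X_train, y = factor(labels_train))
    pred <- predict(rf, newdata = X_test)
    mean(pred == labels_test)          # classification rate on the held-out half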

4.1 Meats Data

The results achieved on the homogenized meat data (Section 2.2) are reported in Table 3. These results show that the variable selection and updating method gives comparable or better performance than previous analyses of these data; improved classification rates have been achieved relative to those of McElhinney et al. (1999), who used factorial discriminant analysis (FDA), k-nearest neighbors (kNN), discriminant partial least squares regression (PLS) and soft independent modeling of class analogy (SIMCA). Furthermore, comparable classification performance has been achieved relative to Dean et al. (2006), who used model-based discriminant analysis with updating on a reduced form of the data derived from wavelet thresholding. The variable selection and updating procedure gave substantially better performance than the other competing classification methods.

Table 3
Classification performance on the Meats data for the variable selection algorithm with updating and for previous analyses of these data. Mean classification performance for the 50 random splits of the data are reported with standard deviations in parentheses. ...

An examination of the misclassification table (Table 4) for the variable selection and updating method shows that many of the misclassifications were due to the difficulty in separating the chicken and turkey groups. Interestingly, no misclassifications were made between the red and white meats.

Table 4
Average classification results for the different meat types for the Variable Selection and Updating classification method.

The chosen wavelengths show us which parts of the spectrum are of importance when classifying samples into different species. We recorded the chosen wavelengths for each of the 50 sets of results and these are shown in Figure 7. We can see that a large proportion (51%) of the chosen wavelengths are in the visible region (400 nm–800 nm) but some regions in the near-infrared spectrum are also chosen. Liu and Chen (2000, Table 1) assign many of the spectral features in the visible part of the spectrum to different forms of myoglobin such as deoxymyoglobin (430, 440, 445 nm), oxymyoglobin (545, 560, 575, 585 nm), metmyoglobin (485, 495, 500, 505 nm) and sulfmyoglobin (635 nm). Sulfmyoglobin is a product of the reaction of myoglobin with H2S generated by bacteria, and Arnalds et al. (2004) found the region of the spectrum close to 635 nm to be important when separating the red and white meat samples. The peak at 1100 nm is the wavelength where the sensor changes in the near-infrared spectrometer and the peak at 1068 nm can be attributed to third overtones of C-H stretch mode and C-H combination bonds from meat constituents other than oxymyoglobin (Liu et al. 2000). The near infrared region consisting of wavelengths near 1510 nm has been attributed to protein, and a cluster of chosen wavelengths is close to this region. In all cases, between 13 and 19 wavelengths were chosen for classification purposes.

Figure 7
Wavelengths chosen in the five meat classification problem for the variable selection and updating method. The height of the bars shows how many times the wavelength was chosen in 50 random splits of the data.

Following McElhinney et al. (1999) and Dean et al. (2006), we combined the chicken and turkey groups into a poultry group to determine how well we can classify the homogenized meat samples into four types. The classification results are reported in Table 5 and the misclassifications from the variable selection method with updating are shown in Table 6. There is a significant improvement in classification performance for all of the methods. Again, the white and red meats are separated with zero error.

Table 5
Classification performance on the Meats data for the variable selection algorithm with updating and for previous analyses of these data after combining the chicken and turkey into a poultry group. Mean classification performance for the 50 random splits ...
Table 6
Average classification results for the different meat types after combining the chicken and turkey into a poultry group. The results shown are for the variable selection and updating method.

The wavelengths chosen for the four group classification problem (Figure 8) still have a substantial proportion chosen from the visible part of the spectrum (52%). In this application, between 13 and 21 wavelengths were chosen for classification purposes. The VEV covariance structure was chosen in almost every run as the final model for both the four and five group meat classification problems.

Figure 8
Wavelengths chosen in the four meat classification problem for the variable selection and updating method.

4.2 Greek Olive Oil Data

The methods were applied to the Greek olive oil data (Section 2.3) with 50% of the data being treated as training data and 50% as test data. Fifty random splits of training and test data were used. The misclassification rates achieved on these data are reported in Table 7. Variable selection and updating provides one of the best classification rates for these data. Downey et al. (2003) did report a better misclassification rate (6.1%) using factorial discriminant analysis (FDA) but the choice of a subset of wavelengths, data pre-processing method and classification method (from partial least squares, factorial discriminant analysis and k-nearest neighbors) was made with reference to the test data classification performance. In contrast, our model selection was done without any reference to the test data classification performance.

Table 7
Classification performance on the Olive Oil data for the variable selection algorithm with updating and for previous analyses of these data. Mean classification performance for the 50 random splits of the data are reported with standard deviations in ...

A cross tabulation of the classifications with the true origin of the olive oils (Table 8) reveals the difficulty in classifying the oils.

Table 8
Average classification results for the olive oil groups. The results shown are for the variable selection and updating method.

In contrast to the meat classification problem, the chosen wavelengths for this problem (Figure 9) are concentrated in the near-infrared region (800–2498 nm) but some wavelengths in the visible region are also selected. The most commonly chosen wavelength is 2080 nm which has been attributed to an O-H stretching/O-H bend combination (Osborne et al. 1984). Wavelengths near 2310, 2346 and 2386 nm are due to C-H stretching vibrations and other vibrational modes. In particular, wavelengths in the 2310 nm region have previously been assigned to fat content. In all cases, between 6 and 29 wavelengths were selected with a mean of 15 wavelengths being chosen. The EEE covariance structure was chosen for every final model for the olive oil classification problem.

Figure 9
Wavelengths chosen in the olive oil classification problem using variable selection and updating method. The height of the bars shows how many times the wavelength was chosen in 50 random splits of the data.

4.3 Sensitivity to Spectral Resolution

In order to determine the sensitivity of the selected wavelengths to the resolution of the spectrometer used in this study, we investigated the effect of reducing the number of reflectance values by computing the mean reflectance value over sets of adjacent wavelengths and using these as inputs into the variable selection model. The results of this analysis are outlined for the olive oil authentication problem, and similar results were found for the meat species authenticity study.

We found that the classification error for the olive oil samples increased slightly as soon as any adjacent wavelengths were aggregated (Table 9). However, the classification error then remained steady when between 2 and 30 adjacent wavelengths were aggregated, and it deteriorated seriously once more than 30 adjacent wavelengths were aggregated. This suggests that a considerable amount of the group information is retained even at low resolutions, but that there is more information in the raw data themselves.
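A minimal sketch of the aggregation used in this sensitivity analysis, computing the mean reflectance over non-overlapping blocks of k adjacent wavelengths (spectra is an illustrative name for the n x 1050 matrix of reflectance values):

    # Average reflectance values over non-overlapping blocks of k adjacent wavelengths.
    aggregate_wavelengths <- function(spectra, k) {
      p <- ncol(spectra)
      blocks <- split(seq_len(p), ceiling(seq_len(p) / k))   # groups of k adjacent columns
      sapply(blocks, function(cols) rowMeans(spectra[, cols, drop = FALSE]))
    }

    # Example: aggregating pairs of adjacent wavelengths halves the dimension,
    # e.g. aggregate_wavelengths(spectra, k = 2) gives an n x 525 matrix.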

Table 9
The change in classification performance for the variable selection and updating method as the number of adjacent wavelengths being aggregated increases.

The spectral regions selected when analyzing the data in aggregated form were found to be stable. In both applications, the selected regions were very similar for the aggregated data, but fewer variables tended to be selected because of the aggregation process. Figure 10 shows the chosen wavelengths when the raw spectra, two adjacent wavelengths and three adjacent wavelengths are aggregated and then analyzed for the olive oil classification problem. This shows that the selection procedure chooses very specific spectral regions on both the raw and aggregated scale.

Figure 10
The chosen wavelengths when the raw olive oil spectra are analyzed and when adjacent wavelengths are aggregated.

5 Discussion

The discriminant analysis method presented in this paper gave much better results than popular statistical and machine learning techniques such as Random Forests (Breiman 2001), AdaBoost (Freund and Schapire 1997), Bayesian Multinomial Regression (Genkin et al. 2005; Madigan et al. 2005) and Transductive SVMs (Vapnik 1995; Joachims 1999) for the high-dimensional food authenticity datasets analysed here. The performance is further improved by the addition of the updating procedure, which includes the unlabeled data in the estimation. The results show that the headlong search method for variable selection is an efficient method for selecting wavelengths.

In addition to the improvement in classification results in the example data sets given, the number of variables needed for classification was substantially reduced from 1050 to less than thirty. The variable selection results in the food authenticity application suggest the possibility of developing authenticity sensors that only use reflectance values over a carefully selected subset of the near-infrared and visible spectral range. The regions of the spectrum selected by the method can be interpreted in terms of the underlying chemical properties of the foods under analysis.

We have compared our method with four established leading classification methods from statistics and machine learning for which standard software implementations are available. One of these, AdaBoost, was identified by Leo Breiman as “the best off-the-shelf classifier in the world” (Hastie et al. 2001). It is possible that the large improvement in performance of our method relative to the established methods we have compared it with is due to the fact that our data have many variables of which only a very small proportion (1-3%) are useful. The variables that are not useful may introduce a great deal of noise and degrade performance, and so other methods that do not reduce the number of variables may suffer from this.

Although the methods were developed for the food authenticity application outlined herein, the method could be applied in contexts such as the analysis of gene expression data and document classification. The results of the variable selection procedure could mean a substantial savings in terms of time for data collection and space for future data storage.

A range of recent approaches to variable selection in a classification context includes the DALASS approach of Trendafilov and Jolliffe (2007), variable selection for kernel Fisher discriminant analysis (Louw and Steep 2006) and the stepwise stopping rule approach of Munita et al. (2006). A number of different search algorithms (proposed as alternatives to backward/forward/stepwise search) wrapped around different discriminant functions are compared by Pacheco et al. (2006), and genetic search algorithms wrapped around Fisher discriminant analysis are considered by Chiang and Pell (2004). Another example of variable selection methods in the context of classification using spectroscopic data is given by Indahl and Naes (2004).

In terms of other approaches to variable selection, a good review of recent work on the problem of variable or feature selection in classification was given by Guyon and Elisseeff (2003) from a machine learning perspective. A good review of methods involving Support Vector Machines (SVMs), along with a proposed criterion for exhaustive variable selection, is given by Mary-Huard et al. (2007). An extension allowing variable selection for the multiclass problem using SVMs is given by Wang and Xiatong (2007). An alternative approach for combining pairwise classifiers, based on Hastie and Tibshirani (1998), is given by Szepannek and Weihs (2006). Greenshtein (2006) looks at theoretical aspects of the n ≪ p classification and variable selection problem in terms of empirical risk minimization subject to l1 constraints. Finally, an alternative to single subset variable selection through Bayesian Model Averaging (Madigan and Raftery 1994) is given by Dash and Cooper (2004).

Footnotes

*Murphy was supported by Science Foundation of Ireland Basic Research Grant (04/BR/M0057) and Research Frontiers Programme Grant (2007/RFP/MATF281). Raftery was supported by NICHD grant R01 HD054511 and NSF grant ATM 0724721. All three authors were supported by NIH grant 8 R01 EB002137-02.

Contributor Information

Thomas Brendan Murphy, School of Mathematical Sciences University College Dublin, Ireland.

Nema Dean, Department of Statistics University of Glasgow, Scotland.

Adrian E. Raftery, Department of Statistics University of Washington, Seattle, USA.

References

  • Arnalds T, McElhinney J, Fearn T, Downey G. A Hierarchical Discriminant Analysis for Species Identification in Raw Meat by Visible and Near Infrared Spectroscopy. Journal of Near Infrared Spectroscopy. 2004;12:183–188.
  • Badsberg JH. Model search in contingency tables by CoCo. In: Dodge Y, Whittaker J, editors. Computational Statistics. Heidelberg: Physica Verlag; 1992. Vol. 1, pp. 251–256.
  • Banfield JD, Raftery AE. Model-based Gaussian and non-Gaussian clustering. Biometrics. 1993;49:803–821.
  • Bensmail H, Celeux G. Regularized Gaussian discriminant analysis through eigenvalue decomposition. Journal of the American Statistical Association. 1996;91:1743–1748.
  • Breiman L. Random Forests. Machine Learning. 2001;45:5–32.
  • Chang W-C. On using principal components before separating a mixture of two multivariate normal distributions. Journal of the Royal Statistical Society. Series C. Applied Statistics. 1983;32:267–275.
  • Chapelle O, Schölkopf B, Zien A, editors. Semi-Supervised Learning. MIT Press; Cambridge, MA: 2006.
  • Chiang LH, Pell RJ. Genetic algorithms combined with discriminant analysis for key variable identification. Journal of Process Control. 2004;14:143–155.
  • Collobert R, Sinz F, Weston J, Bottou L. Large Scale Transductive SVMs. Journal of Machine Learning Research. 2006;7:1687–1712.
  • Connolly C. Spectroscopic and Analytical Developments Ltd fingerprints brand spirits with ultraviolet spectrophotometry. Sensor Review. 2006;26:94–97.
  • Cortés EA, Martínez MG, Rubio NG. adabag: Applies Adaboost.M1 and Bagging, R package version 1.1. 2007.
  • Dash D, Cooper GF. Model Averaging for Prediction with Discrete Bayesian Networks. Journal of Machine Learning Research. 2004;5:1177–1203.
  • Dean N, Murphy TB, Downey G. Using unlabelled data to update classification rules with applications in food authenticity studies. Journal of the Royal Statistical Society, Series C: Applied Statistics. 2006;55:1–14.
  • Dempster AP, Laird NM, Rubin DB. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B: Methodological. 1977;39:1–38. with discussion.
  • Downey G. Authentication of food and food ingredients by near infrared spectroscopy. Journal of Near Infrared Spectroscopy. 1996;4:47–61.
  • Downey G, McIntyre P, Davies AN. Geographical classification of extra virgin olive oils from the eastern Mediterranean by chemometric analysis of visible and near infrared spectroscopic data. Applied Spectroscopy. 2003;57:158–163. [PubMed]
  • Fraley C, Raftery AE. How many clusters? Which clustering method? Answers via Model-Based Cluster Analysis. Computer Journal. 1998;41:578–588.
  • Fraley C, Raftery AE. MCLUST: Software for model-based clustering. Journal of Classification. 1999;16:297–306.
  • Fraley C, Raftery AE. Model-Based Clustering, Discriminant Analysis, and Density Estimation. Journal of the American Statistical Association. 2002;97:611–612.
  • Fraley C, Raftery AE. Enhanced model-based clustering, density estimation and discriminant analysis software: MCLUST. Journal of Classification. 2003;20:263–296.
  • Fraley C, Raftery AE. mclust: Model-Based Clustering / Normal Mixture Modeling, R package version 3.1-1. 2007.
  • Freund Y, Schapire RE. A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting. Journal of Computer and System Sciences. 1997;55:119–139.
  • Ganesalingam S, McLachlan GJ. The efficiency of a linear discriminant function based on unclassified initial samples. Biometrika. 1978;65:658–662.
  • Genkin A, Lewis DD, Madigan D. BMR: Bayesian Multinomial Regression Software. 2005. http://www.stat.rutgers.edu/~madigan/BMR/
  • Greenshtein E. Best subset selection, persistence in high-dimensional statistical learning and optimization under l1 constraint. The Annals of Statistics. 2006;34:2367–2386.
  • Guyon I, Elisseeff A. An Introduction to Variable and Feature Selection. Journal of Machine Learning Research. 2003;3:1157–1182.
  • Hastie T, Tibshirani R, Friedman JH. The Elements of Statistical Learning. Springer-Verlag; New York: 2001.
  • Hastie T, Tibshirani R. Classification by Pairwise Coupling. Annals of Statistics. 1998;26:451–471.
  • Hoos HH, Stützle T. Stochastic Local Search: Foundations and Applications. Morgan Kaufmann; San Francisco: 2005.
  • Indahl U, Naes T. A variable selection strategy for supervised classification with continuous spectroscopic data. Journal of Chemometrics. 2004;18:53–61.
  • Joachims T. ICML '99: Proceedings of the Sixteenth International Conference on Machine Learning. Morgan Kaufmann Publishers Inc.; San Francisco, CA, USA: 1999. Transductive Inference for Text Classification using Support Vector Machines; pp. 200–209.
  • Kohavi R, John G. Wrappers for feature selection. Artificial Intelligence. 1997;91:273–324.
  • Liang F, Mukherjee S, West M. The Use of Unlabeled Data in Predictive Modeling. Statistical Science. 2007;22:189–205.
  • Liaw A, Wiener M. Classification and Regression by randomForest. R News. 2002;2:18–22.
  • Liu Y, Chen YR. Two-Dimensional Correlation Spectroscopy Study of Visible and Near-Infrared Spectral Variations of Chicken Meats in Cold Storage. Applied Spectroscopy. 2000;54:1458–1470.
  • Liu Y, Chen YR, Ozaki Y. Two-Dimensional Visible/Near Infrared Correlation Spectroscopy Study of Thermal Treatment of Chicken Meat. Journal of Agricultural and Food Chemistry. 2000;48:901–908. [PubMed]
  • Louw N, Steep SJ. Variable selection in kernel Fisher discriminant analysis by means of recursive feature elimination. Computational Statistics & Data Analysis. 2006;51:2043–2055.
  • Madigan D, Genkin A, Lewis DD, Fradkin D, Castle JP. Bayesian Multinomial Logistic Regression for Author Identification. In: Knuth KH, Abbas AE, Morris RD, editors. Bayesian Inference and Maximum Entropy Methods in Science and Engineering. 2005;803:509–516.
  • Madigan D, Raftery AE. Model selection and accounting for model uncertainty in graphical models using Occam's window. Journal of the American Statistical Association. 1994;89:1535–1546.
  • Mary-Huard T, Robin S, Daudin J-J. A penalized criterion for variable selection in classification. Journal of Multivariate Analysis. 2007;98:695–705.
  • McElhinney J, Downey G, Fearn T. Chemometric processing of visible and near infrared reflectance spectra for species identification in selected raw homogenised meats. Journal of Near Infrared Spectroscopy. 1999;7:145–154.
  • McLachlan GJ. Discriminant Analysis and Statistical Pattern Recognition. Wiley; New York: 1992.
  • McLachlan GJ, Peel D. Finite Mixture Models. Wiley Inter-science; New York: 2000.
  • Munita CS, Barroso LP, Oliveira PMS. Stopping rule for variable selection using stepwise discriminant analysis. Journal of Radioanalytical and Nuclear Chemistry. 2006;269:335–338.
  • O'Neill TJ. Normal discrimination with unclassified observations. Journal of the American Statistical Association. 1978;73:821–826.
  • Osborne BG, Fearn T, Hindle PH. Practical NIR Spectroscopy with Applications in Food and Beverage Analysis. Longman Scientific & Technical; Harlow, UK: 1993.
  • Osborne BG, Fearn T, Miller AR, Douglas S. Application of near infrared reflectance spectroscopy to the compositional analysis of biscuits and biscuit doughs. Journal of the Science of Food and Agriculture. 1984;35:99–105.
  • Pacheco J, Casado S, Núñez L, Gómez O. Analysis of new variable selection methods for discriminant analysis. Computational Statistics & Data Analysis. 2006;51:1463–1478.
  • R Development Core Team . R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing; Vienna, Austria: 2007. ISBN 3-900051-07-0.
  • Raftery AE, Dean N. Variable Selection for Model-Based Clustering. Journal of the American Statistical Association. 2006;101:168–178.
  • Reid LM, O'Donnell CP, Downey G. Recent technological advances in the determination of food authenticity. Trends in Food Science and Technology. 2006;17:344–353.
  • Schwarz G. Estimating the dimension of a model. Annals of Statistics. 1978;6:461–464.
  • Sinz F, Roffilli M. UniverSVM software. Version 1.1. 2007 http://mloss.org/software/view/19/
  • Szepannek G, Weihs C. Variable selection for discrimination of more than two classes where data are sparse. In: Spiliopoulou M, Kruse R, Borgelt C, Nurnberger A, Gaul W, editors. From Data and Information Analysis to Knowledge Engineering. Studies in Classification, Data Analysis and Knowledge Organization; 2006. pp. 700–707.
  • Toher D, Downey G, Murphy TB. A comparison of model-based and regression classification techniques applied to near infrared spectroscopic data in food authentication studies. Chemometrics and Intelligent Laboratory Systems. 2007;89:102–115.
  • Trendafilov NT, Jolliffe IT. DALASS: Variable selection in discriminant analysis via the LASSO. Computational Statistics & Data Analysis. 2007;51:3718–3736.
  • Vapnik V. The Nature of Statistical Learning Theory. 2nd ed Springer; 1995.
  • Wang L, Xiatong S. On L1-Norm Multiclass Support Vector Machines: Methodology and Theory. Journal of the American Statistical Association. 2007;102:583–594.
  • West M. Bayesian Statistics 7. Oxford University Press; 2003. Bayesian factor regression models in the “large p, small n” paradigm; pp. 723–732.
  • Yeung KY, Bumgarner R, Raftery AE. Bayesian Model Averaging: Development of an improved multi-class, gene selection and classification tool for microarray data. Bioinformatics. 2005;21:2394–2402. [PubMed]