AdpA is the key transcriptional activator for a number of genes of various functions in the A-factor regulatory cascade in Streptomyces griseus, forming an AdpA regulon. Trypsin-like activity was detected at a late stage of growth in the wild-type strain but not in an A-factor-deficient mutant. Consistent with these observations, two trypsin genes, sprT and sprU, in S. griseus were found to be members of the AdpA regulon; AdpA activated the transcription of both genes by binding to the operators located at about −50 nucleotide positions with respect to the transcriptional start point. The transcription of sprT and sprU, induced by AdpA, was most active at the onset of sporulation. Most trypsin activity exerted by S. griseus was attributed to SprT, because trypsin activity in an sprT-disrupted mutant was greatly reduced but that in an sprU-disrupted mutant was only slightly reduced. This was consistent with the observation that the amount of the sprT mRNA was much greater than that of the sprU transcript. Disruption of both sprT and sprU (mutant ΔsprTU) reduced trypsin activity to almost zero, indicating that no trypsin genes other than these two were present in S. griseus. Even the double mutant ΔsprTU grew normally and developed aerial hyphae and spores over the same time course as the wild-type strain.
Accurate classification into genotypes is critical in understanding evolution of divergent viruses. Here we report a new approach, MuLDAS, which classifies a query sequence based on the statistical genotype models learned from the known sequences. Thus, MuLDAS utilizes full spectra of well characterized sequences as references, typically of an order of hundreds, in order to estimate the significance of each genotype assignment.
MuLDAS starts by aligning the query sequence to the reference multiple sequence alignment and calculating the subsequent distance matrix among the sequences. They are then mapped to a principal coordinate space by multidimensional scaling, and the coordinates of the reference sequences are used as features in developing linear discriminant models that partition the space by genotype. The genotype of the query is then given as the maximum a posteriori estimate. MuLDAS tests the model confidence by leave-one-out cross-validation and also provides some heuristics for the detection of 'outlier' sequences that fall far outside or in-between genotype clusters. We have tested our method by classifying HIV-1 and HCV nucleotide sequences downloaded from NCBI GenBank, achieving the overall concordance rates of 99.3% and 96.6%, respectively, with the benchmark test dataset retrieved from the respective databases of Los Alamos National Laboratory.
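The classification step described above can be sketched in a few lines. This is a minimal illustration, not MuLDAS's implementation: it assumes multidimensional scaling has already been done, keeps only one principal coordinate per sequence for brevity, and uses the textbook shared-variance Gaussian form of LDA. The coordinates, genotype labels, and function names are hypothetical.

```python
import math
from collections import defaultdict

def fit_lda_1d(xs, labels):
    """Estimate class means, class priors, and a pooled within-class variance."""
    groups = defaultdict(list)
    for x, g in zip(xs, labels):
        groups[g].append(x)
    n = len(xs)
    means = {g: sum(v) / len(v) for g, v in groups.items()}
    priors = {g: len(v) / n for g, v in groups.items()}
    within_ss = sum((x - means[g]) ** 2 for x, g in zip(xs, labels))
    pooled_var = within_ss / (n - len(groups))
    return means, priors, pooled_var

def posteriors(x, model):
    """Posterior probability of each genotype for coordinate x."""
    means, priors, var = model
    log_scores = {g: math.log(priors[g]) - (x - means[g]) ** 2 / (2 * var)
                  for g in means}
    top = max(log_scores.values())  # subtract the maximum for numerical stability
    weights = {g: math.exp(s - top) for g, s in log_scores.items()}
    total = sum(weights.values())
    return {g: w / total for g, w in weights.items()}

def map_genotype(x, model):
    """Maximum a posteriori genotype assignment."""
    post = posteriors(x, model)
    return max(post, key=post.get)

def loocv_accuracy(xs, labels):
    """Leave-one-out cross-validation: refit without sample i, then predict it."""
    hits = 0
    for i in range(len(xs)):
        model = fit_lda_1d(xs[:i] + xs[i + 1:], labels[:i] + labels[i + 1:])
        hits += map_genotype(xs[i], model) == labels[i]
    return hits / len(xs)

# Hypothetical principal coordinates of nine reference sequences.
coords = [-2.1, -1.9, -2.3, 0.1, -0.2, 0.3, 2.0, 2.2, 1.8]
genotypes = ["A", "A", "A", "B", "B", "B", "C", "C", "C"]
model = fit_lda_1d(coords, genotypes)
query_genotype = map_genotype(1.9, model)
```

A posterior close to 1 for a single genotype suggests a confident call; a query with comparable posteriors for two genotypes would be a natural candidate for the "in-between" outlier heuristics mentioned above.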
The highly accurate genotype assignment coupled with several measures for evaluating the results makes MuLDAS useful in analyzing the sequences of rapidly evolving viruses such as HIV-1 and HCV. A web-based genotype prediction server is available at http://www.muldas.org/MuLDAS/.
Gene expression in a developing embryo occurs in particular cells (spatial patterns) in a time-specific manner (temporal patterns), which leads to the differentiation of cell fates. Images of a Drosophila melanogaster embryo at a given developmental stage, showing a particular gene expression pattern revealed by a gene-specific probe, can be compared for spatial overlaps. The comparison is fundamentally important to formulating and testing gene interaction hypotheses. Expression pattern comparison is most biologically meaningful when images from a similar time point (developmental stage) are compared. In this paper, we present LdaPath, a novel formulation of Linear Discriminant Analysis (LDA) for automatic developmental stage range classification. It employs multivariate linear regression with the L1-norm penalty controlled by a regularization parameter for feature extraction and visualization. LdaPath computes an entire solution path for all values of regularization parameter with essentially the same computational cost as fitting one LDA model. Thus, it facilitates efficient model selection. It is based on the equivalence relationship between LDA and the least squares method for multi-class classifications. This equivalence relationship is established under a mild condition, which we show empirically to hold for many high-dimensional datasets, such as expression pattern images. Our experiments on a collection of 2705 expression pattern images show the effectiveness of the proposed algorithm. Results also show that the LDA model resulting from LdaPath is sparse, and irrelevant features may be removed. Thus, LdaPath provides a general framework for simultaneous feature selection and feature extraction.
Gene expression pattern image; dimensionality reduction; Linear Discriminant Analysis; linear regression
Linear discriminant analysis (LDA) is a classical statistical approach for dimensionality reduction and classification. In many cases, the projection direction of the classical and extended LDA methods is not considered optimal for special applications. Herein we combine the Partial Least Squares (PLS) method with LDA algorithm, and then propose two improved methods, named LDA-PLS and ex-LDA-PLS, respectively. The LDA-PLS amends the projection direction of LDA by using the information of PLS, while ex-LDA-PLS is an extension of LDA-PLS by combining the result of LDA-PLS and LDA, making the result closer to the optimal direction by an adjusting parameter. Comparative studies are provided between the proposed methods and other traditional dimension reduction methods such as Principal component analysis (PCA), LDA and PLS-LDA on two data sets. Experimental results show that the proposed method can achieve better classification performance.
Although many studies based on gene expression data have been reported in great detail, one major challenge for methodologists remains the choice of classification method. The main purpose of this research was to compare the performance of linear discriminant analysis (LDA) and its modification methods for the classification of cancer based on gene expression data.
The classification performance of linear discriminant analysis (LDA) and its modification methods was evaluated by applying these methods to six public cancer gene expression datasets. These methods included linear discriminant analysis (LDA), prediction analysis for microarrays (PAM), shrinkage centroid regularized discriminant analysis (SCRDA), shrinkage linear discriminant analysis (SLDA) and shrinkage diagonal discriminant analysis (SDDA). The procedures were performed by software R 2.80.
PAM selected fewer feature genes than the other methods for most datasets, the Brain dataset being the exception. Of the two shrinkage discriminant analysis methods, SLDA selected more genes than SDDA for most datasets, the exception being the 2-class lung cancer dataset. SLDA selected more genes than SCRDA for the 2-class lung cancer, SRBCT and Brain datasets; the opposite held for the remaining datasets. The average test error of the LDA modification methods was lower than that of the LDA method.
The classification performance of the LDA modification methods was superior to that of traditional LDA with respect to the average error, and there was no significant difference among these modification methods.
Linear discriminant analysis (LDA) is one of the most popular classification algorithms for brain-computer interfaces (BCI). LDA assumes that the data follow Gaussian distributions with equal covariance matrices across the classes concerned; however, this assumption often does not hold in actual BCI applications, where heteroscedastic class distributions are usually observed. This paper proposes an enhanced version of LDA, namely z-score linear discriminant analysis (Z-LDA), which introduces a new decision boundary definition strategy to handle heteroscedastic class distributions. Z-LDA defines the decision boundary through z-scores that use both the mean and the standard deviation of the projected data, so the boundary adapts to heteroscedastic distributions. Results from both a simulation dataset and two actual BCI datasets consistently show that Z-LDA achieves significantly higher average classification accuracies than conventional LDA, indicating the superiority of the newly proposed decision boundary definition strategy.
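The idea of a z-score decision boundary can be illustrated with a short sketch. It assumes the samples have already been projected onto the LDA direction and that a sample is assigned to the class for which its absolute z-score is smallest; that rule, the function names, and the example values are assumptions made for illustration and may differ from the published Z-LDA procedure.

```python
import statistics

def fit_class_stats(projected, labels):
    """Mean and sample standard deviation of the projected scores per class."""
    stats = {}
    for g in set(labels):
        vals = [y for y, lab in zip(projected, labels) if lab == g]
        stats[g] = (statistics.mean(vals), statistics.stdev(vals))
    return stats

def z_score_predict(y, stats):
    """Assign y to the class with the smallest absolute z-score."""
    return min(stats, key=lambda g: abs(y - stats[g][0]) / stats[g][1])

def nearest_mean_predict(y, stats):
    """Conventional boundary for equal priors: the midpoint of the class means."""
    return min(stats, key=lambda g: abs(y - stats[g][0]))

# Hypothetical projected training scores: one tight class, one widely spread class.
scores = [-1.2, -1.0, -0.8, -1.1, -0.9, 0.0, 4.0, 2.0, -1.0, 5.0]
classes = ["tight"] * 5 + ["wide"] * 5
stats = fit_class_stats(scores, classes)

y_new = 0.2
by_midpoint = nearest_mean_predict(y_new, stats)  # ignores the class spreads
by_z_score = z_score_predict(y_new, stats)        # accounts for the class spreads
```

In this toy case the two rules disagree: the score 0.2 lies closer to the mean of the tight class, but it is many standard deviations away from that class and well within one standard deviation of the wide class.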
When designing programs or software for the implementation of Monte Carlo (MC) hypothesis tests, we can save computation time by using sequential stopping boundaries. Such boundaries imply stopping resampling after relatively few replications if the early replications indicate a very large or very small p-value. We study a truncated sequential probability ratio test (SPRT) boundary and provide a tractable algorithm to implement it. We review two properties desired of any MC p-value, the validity of the p-value and a small resampling risk, where resampling risk is the probability that the accept/reject decision will be different than the decision from complete enumeration. We show how the algorithm can be used to calculate a valid p-value and confidence intervals for any truncated SPRT boundary. We show that a class of SPRT boundaries is minimax with respect to resampling risk and recommend a truncated version of boundaries in that class by comparing their resampling risk (RR) to the RR of fixed boundaries with the same maximum resample size. We study the lack of validity of some simple estimators of p-values and offer a new simple valid p-value for the recommended truncated SPRT boundary. We explore the use of these methods in a practical example and provide the MChtest R package to perform the methods.
Bootstrap; B-value; Permutation; Resampling Risk; Sequential Design; Sequential Probability Ratio Test
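A truncated sequential boundary of the kind described above can be sketched as follows. This is a minimal illustration of a Wald-type SPRT on the exceedance indicators of a Monte Carlo test, stopped at a maximum number of replications; the hypotheses `p0` and `p1`, the error rates, the cap `max_n`, and all function names are illustrative assumptions, not the boundaries recommended in the paper. The sketch only reports where resampling stops. It does not compute a p-value, since the abstract notes that simple p-value estimators can be invalid under sequential stopping.

```python
import math
import random

def truncated_sprt(draw_exceeds, p0=0.04, p1=0.06,
                   alpha0=0.001, beta0=0.001, max_n=9999):
    """Resample until the log-likelihood ratio leaves (lower, upper) or max_n is hit.

    draw_exceeds() returns True when a resampled statistic is at least as
    extreme as the observed one. Returns (hits, n, decision).
    """
    upper = math.log((1 - beta0) / alpha0)   # crossing favours p >= p1
    lower = math.log(beta0 / (1 - alpha0))   # crossing favours p <= p0
    step_hit = math.log(p1 / p0)
    step_miss = math.log((1 - p1) / (1 - p0))
    llr, hits = 0.0, 0
    for n in range(1, max_n + 1):
        if draw_exceeds():
            hits += 1
            llr += step_hit
        else:
            llr += step_miss
        if llr >= upper:
            return hits, n, "large"
        if llr <= lower:
            return hits, n, "small"
    return hits, max_n, "truncated"

# Hypothetical use: the true exceedance probability is 0.5, far above 0.05,
# so resampling should stop long before the cap is reached.
rng = random.Random(0)
result = truncated_sprt(lambda: rng.random() < 0.5)
```

The saving comes from the early exits: when the early replications point clearly to a large or small p-value, far fewer than `max_n` resamples are drawn.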
Automated adverse outcome surveillance tools and methods have potential utility in quality improvement and medical product surveillance activities. Their use for assessing hospital performance on the basis of patient outcomes has received little attention. We compared risk-adjusted sequential probability ratio testing (RA-SPRT) implemented in an automated tool to Massachusetts public reports of 30-day mortality after isolated coronary artery bypass graft surgery.
A total of 23,020 isolated adult coronary artery bypass surgery admissions performed in Massachusetts hospitals between January 1, 2002 and September 30, 2007 were retrospectively re-evaluated. The RA-SPRT method was implemented within an automated surveillance tool to identify hospital outliers in yearly increments. We used an overall type I error rate of 0.05, an overall type II error rate of 0.10, and a threshold that signaled if the odds of dying 30 days after surgery were at least twice those expected. Annual hospital outlier status, based on the state-reported classification, was considered the gold standard. An event was defined as at least one occurrence of a higher-than-expected hospital mortality rate during a given year.
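The mechanics of a risk-adjusted SPRT can be sketched as follows. The abstract does not give the formulas, so this uses the standard log-likelihood ratio for testing an odds ratio against a patient's predicted risk, with Wald's thresholds computed from the stated error rates (0.05 and 0.10) and the stated odds ratio of 2. The function name, the example risks, and the handling of the "overall" error rates across repeated yearly looks are assumptions.

```python
import math

def ra_sprt(risks, outcomes, odds_ratio=2.0, alpha=0.05, beta=0.10):
    """Accumulate the risk-adjusted log-likelihood ratio patient by patient.

    risks: predicted probability of death for each patient (risk model output).
    outcomes: 1 if the patient died within the follow-up window, else 0.
    Returns (status, patients_seen, log_likelihood_ratio).
    """
    upper = math.log((1 - beta) / alpha)   # crossing signals excess mortality
    lower = math.log(beta / (1 - alpha))   # crossing indicates no excess
    llr = 0.0
    for i, (p, died) in enumerate(zip(risks, outcomes), start=1):
        # Under the alternative, the odds of death are odds_ratio times expected.
        llr += died * math.log(odds_ratio) - math.log(1 - p + odds_ratio * p)
        if llr >= upper:
            return "signal", i, llr
        if llr <= lower:
            return "no excess", i, llr
    return "continue", len(risks), llr

# Hypothetical hospital-year: every patient has a 2% predicted risk.
all_died = ra_sprt([0.02] * 10, [1] * 10)
all_survived = ra_sprt([0.02] * 200, [0] * 200)
```

Each death among low-risk patients moves the statistic sharply toward the upper threshold, while each survival moves it only slightly toward the lower one, which is why a handful of unexpected deaths can trigger an alert.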
We examined a total of 83 hospital-year observations. The RA-SPRT method alerted 6 events among three hospitals for 30-day mortality compared with 5 events among two hospitals using the state public reports, yielding a sensitivity of 100% (5/5) and specificity of 98.8% (79/80).
The automated RA-SPRT method performed well, detecting all of the true institutional outliers with a small false positive alerting rate. Such a system could provide confidential automated notification to local institutions in advance of public reporting providing opportunities for earlier quality improvement interventions.
Making an accurate diagnosis of schizophrenia and related psychoses early in the course of the disease is important for initiating treatment and counseling patients and families. In this study, we developed classification models for early disease diagnosis using structural MRI (sMRI) and neuropsychological (NP) testing. We used sMRI measurements and NP test results from 28 patients with recent-onset schizophrenia and 47 healthy subjects, drawn from the larger sample of the Mind Clinical Imaging Consortium. We developed diagnostic models based on Linear Discriminant Analysis (LDA) following two approaches; namely, (a) stepwise (STP) LDA on the original measurements, and (b) LDA on variables created through Principal Component Analysis (PCA) and selected using the Humphrey-Ilgen parallel analysis. Error estimation of the modeling algorithms was evaluated by leave-one-out external cross-validation. These analyses were performed on sMRI and NP variables separately and in combination. The following classification accuracy was obtained for different variables and modeling algorithms. sMRI only: (a) STP-LDA: 64.3% sensitivity and 76.6% specificity, (b) PCA-LDA: 67.9% sensitivity and 72.3% specificity. NP only: (a) STP-LDA: 71.4% sensitivity and 80.9% specificity, (b) PCA-LDA: 78.5% sensitivity and 91.5% specificity. Combined sMRI-NP: (a) STP-LDA: 64.3% sensitivity and 83.0% specificity, (b) PCA-LDA: 89.3% sensitivity and 93.6% specificity. (i) Maximal diagnostic accuracy was achieved by combining sMRI and NP variables. (ii) NP variables were more informative than sMRI, indicating that cognitive deficits can be detected earlier than volumetric structural abnormalities. (iii) PCA-LDA yielded more accurate classification than STP-LDA. As these sMRI and NP tests are widely available, they can increase accuracy of early intervention strategies and possibly be used in evaluating treatment response.
Schizophrenia; Schizophreniform; Schizoaffective; PCA; LDA; Biomarkers; Neuropsychology; MRI; Cross-validation; Diagnosis; MCIC
This study investigated the feasibility of using the near infrared hyperspectral imaging (NIR-HSI) technique for non-destructive identification of sesame oil. Hyperspectral images of four varieties of sesame oil were obtained in the spectral region of 874–1734 nm. Reflectance values were extracted from each region of interest (ROI) of each sample. Competitive adaptive reweighted sampling (CARS), successive projections algorithm (SPA) and x-loading weights (x-LW) were carried out to identify the most significant wavelengths. Based on the sixty-four, seven and five wavelengths suggested by CARS, SPA and x-LW, respectively, two classification models (least squares-support vector machine, LS-SVM, and linear discriminant analysis, LDA) were established. Among the established models, the CARS-LS-SVM and CARS-LDA models performed best, with the highest classification rate (100%) in both calibration and prediction sets. The SPA-LS-SVM and SPA-LDA models also performed well (95.59% and 98.53% classification rate in the prediction set) with only seven wavelengths (938, 1160, 1214, 1406, 1656, 1659 and 1663 nm). The x-LW-LS-SVM and x-LW-LDA models obtained satisfactory results (>80% classification rate in the prediction set) with only five wavelengths (921, 925, 995, 1453 and 1663 nm). The results showed that the NIR-HSI technique could be used to identify the varieties of sesame oil rapidly and non-destructively, and that CARS, SPA and x-LW were effective wavelength selection methods.
To determine the prognostic significance of data collected early after starting certolizumab pegol (CZP) to predict low disease activity (LDA) at Week 52.
Data through Week 12 from 703 CZP-treated patients in the RA PreventIon of structural Damage (RAPID 1) trial were used as variables to predict LDA (DAS28 [ESR] ≤3.2) at Week 52. We identified variables, developed prediction models using classification trees, and tested performance using training and testing datasets. Additional prediction models were constructed using CDAI and an alternate outcome definition (composite of LDA or ACR50).
Using Week 6 and 12 data and across several different prediction models, response (LDA) and nonresponse at 1 year were predicted with relatively high accuracy (70–90%) for most patients. The best performing model predicting nonresponse by 12 weeks was 90% accurate and applied to 46% of the population. Model accuracy for predicted responders (30% of the RAPID1 population) was 74%. The area under the receiver operating characteristic curve was 0.76. Depending on the desired certainty of prediction at 12 weeks, ~12–24% of patients required >12 weeks of treatment to be accurately classified. CDAI-based models, and those evaluating the composite outcome (LDA or ACR50), achieved comparable accuracy.
We could accurately predict within 12 weeks of starting CZP whether most established RA patients with high baseline disease activity would likely achieve/not achieve LDA at 1 year. Decision trees may be useful to guide prospective management for RA patients treated with CZP and other biologics.
Cancer diagnosis is one of the most important tasks of biomedical research and has become the main objective of medical investigations. The present paper proposes an analytical strategy for distinguishing between normal and malignant colorectal tissues by combining the use of near-infrared (NIR) spectroscopy with chemometrics. The successive projection algorithm-linear discriminant analysis (SPA-LDA) was used to seek a reduced subset of variables/wavenumbers and build a diagnostic model of LDA. For comparison, the partial least squares-discriminant analysis (PLS-DA) based on full-spectrum classification was also used as the reference. Principal component analysis (PCA) was used for a preliminary analysis. A total of 186 spectra from 20 patients with partial colorectal resection were collected and divided into three subsets for training, optimizing, and testing the model. The results showed that, compared to PLS-DA, SPA-LDA provided a more parsimonious model using only three wavenumbers/variables (4065, 4173, and 5758 cm−1) to achieve sensitivities of 84.6%, 92.3%, and 92.3% for the training, validation, and test sets, respectively, and a specificity of 100% for each subset. This indicates that the combination of NIR spectroscopy and the SPA-LDA algorithm can serve as a potential tool for distinguishing between normal and malignant colorectal tissues.
Objective. This study aimed at evaluating linear discriminant analysis (LDA) and support vector machine (SVM) classifiers for estimating final Gleason score preoperatively using multiparametric magnetic resonance imaging (mp-MRI) and clinical parameters. Materials and Methods. Thirty-three patients who underwent mp-MRI on a 3T clinical MR scanner and radical prostatectomy were enrolled in this study. The input features for classifiers were age, the presence of a palpable prostate abnormality, prostate specific antigen (PSA) level, index lesion size, and Likert scales of T2 weighted MRI (T2w-MRI), diffusion weighted MRI (DW-MRI), and dynamic contrast enhanced MRI (DCE-MRI) estimated by an experienced radiologist. SVM based recursive feature elimination (SVM-RFE) was used for eliminating features. Principal component analysis (PCA) was applied for data uncorrelation. Results. Using a standard PCA before final Gleason score classification resulted in mean sensitivities of 51.19% and 64.37% and mean specificities of 72.71% and 39.90% for LDA and SVM, respectively. Using a Gaussian kernel PCA resulted in mean sensitivities of 86.51% and 87.88% and mean specificities of 63.99% and 56.83% for LDA and SVM, respectively. Conclusion. SVM classifier resulted in a slightly higher sensitivity but a lower specificity than LDA method for final Gleason score prediction for prostate cancer for this limited patient population.
To evaluate the effects of electric cortical stimulation in the experimentally induced focal traumatic brain injury (TBI) rat model on motor recovery and plasticity of the injured brain.
Twenty male Sprague-Dawley rats were pre-trained on a single pellet reaching task (SPRT) and on a Rotarod task (RRT) for 14 days. Then, the TBI model was induced by a weight drop device (40 g in weight, 25 cm in height) on the dominant motor cortex, and the electrode was implanted over the perilesional cortical surface. All rats were divided into two groups as follows: an electrical stimulation (ES) group with anodal continuous stimulation (50 Hz and 194 µs duration) and a sham-operated control (SOC) group with no electrical stimulation. For rehabilitation, the rats were trained on the SPRT and RRT for 14 days and were assessed with Garcia's neurologic examination. Histopathological and immunostaining evaluations were performed after the experiment.
There were no differences in the slice number in the histological analysis. Garcia's neurologic scores and SPRT performance were significantly increased in the ES group (p<0.05), yet there was no difference in RRT between the two groups. The ES group showed more expression of c-Fos around the injured brain area than the SOC group.
Electric cortical stimulation with rehabilitation is considered to be one of the trial methods for motor recovery in TBI. However, more studies should be conducted for the TBI model in order to establish better stimulation methods.
Cortical stimulation; Traumatic brain injury; Rehabilitation; Motor recovery
In class prediction problems using microarray data, gene selection is essential to improve the prediction accuracy and to identify potential marker genes for a disease. Among numerous existing methods for gene selection, support vector machine-based recursive feature elimination (SVM-RFE) has become one of the leading methods and is being widely used. The SVM-based approach performs gene selection using the weight vector of the hyperplane constructed by the samples on the margin. However, the performance can be easily affected by noise and outliers, when it is applied to noisy, small sample size microarray data.
In this paper, we propose a recursive gene selection method using the discriminant vector of the maximum margin criterion (MMC), which is a variant of classical linear discriminant analysis (LDA). To overcome the computational drawback of classical LDA and the problem of high dimensionality, we present efficient and stable algorithms for MMC-based RFE (MMC-RFE). The MMC-RFE algorithms naturally extend to multi-class cases. The performance of MMC-RFE was extensively compared with that of SVM-RFE using nine cancer microarray datasets, including four multi-class datasets.
Our extensive comparison has demonstrated that for binary-class datasets MMC-RFE tends to show intermediate performance between hard-margin SVM-RFE and SVM-RFE with a properly chosen soft-margin parameter. Notably, MMC-RFE achieves significantly better performance with a smaller number of genes than SVM-RFE for multi-class datasets. The results suggest that MMC-RFE is less sensitive to noise and outliers due to the use of average margin, and thus may be useful for biomarker discovery from noisy data.
AIM: To evaluate the usefulness of differentially expressed proteins from colorectal cancer (CRC) tissues for differentiating cancer and normal tissues.
METHODS: A Proteomic approach was used to identify the differentially expressed proteins between CRC and normal tissues. The proteins were extracted using Tris buffer and thiourea lysis buffer (TLB) for extraction of aqueous soluble and membrane-associated proteins, respectively. Chemometrics, namely principal component analysis (PCA) and linear discriminant analysis (LDA), were used to assess the usefulness of these proteins for identifying the cancerous state of tissues.
RESULTS: The differentially expressed proteins identified comprised 37 aqueous soluble proteins in Tris extracts and 24 membrane-associated proteins in TLB extracts. Based on the protein spot intensities on 2D-gel images, PCA with an eigenvalue > 1 criterion reduced the data to 12 and seven principal components (PCs) for Tris and TLB extracts, respectively, and six PCs from each extract were subsequently used for LDA. The LDA classification for the Tris extract showed that 82.7% of original samples and 82.7% of the cross-validated samples were correctly classified. The LDA for the TLB extract showed that 78.8% of original samples and 71.2% of the cross-validated samples were correctly classified.
CONCLUSION: The classification of CRC tissues by PCA and LDA provided a promising distinction between normal and cancer types. These methods can possibly be used for identification of potential biomarkers among the differentially expressed proteins identified.
Colorectal cancer; Proteomics; Marker protein; Principal component analysis; Linear discriminant analysis
Raman spectroscopy is a molecular vibrational spectroscopic technique that is capable of optically probing the biomolecular changes associated with diseased transformation. The purpose of this study was to explore near-infrared (NIR) Raman spectroscopy for identifying dysplasia from normal gastric mucosa tissue. A rapid-acquisition dispersive-type NIR Raman system was utilised for tissue Raman spectroscopic measurements at 785 nm laser excitation. A total of 76 gastric tissue samples obtained from 44 patients who underwent endoscopy investigation or gastrectomy operation were used in this study. The histopathological examinations showed that 55 tissue specimens were normal and 21 were dysplasia. Both the empirical approach and multivariate statistical techniques, including principal components analysis (PCA), and linear discriminant analysis (LDA), together with the leave-one-sample-out cross-validation method, were employed to develop effective diagnostic algorithms for classification of Raman spectra between normal and dysplastic gastric tissues. High-quality Raman spectra in the range of 800–1800 cm−1 can be acquired from gastric tissue within 5 s. There are specific spectral differences in Raman spectra between normal and dysplasia tissue, particularly in the spectral ranges of 1200–1500 cm−1 and 1600–1800 cm−1, which contained signals related to amide III and amide I of proteins, CH3CH2 twisting of proteins/nucleic acids, and the C=C stretching mode of phospholipids, respectively. The empirical diagnostic algorithm based on the ratio of the Raman peak intensity at 875 cm−1 to the peak intensity at 1450 cm−1 gave the diagnostic sensitivity of 85.7% and specificity of 80.0%, whereas the diagnostic algorithms based on PCA-LDA yielded the diagnostic sensitivity of 95.2% and specificity 90.9% for separating dysplasia from normal gastric tissue. 
Receiver operating characteristic (ROC) curves further confirmed that the most effective diagnostic algorithm can be derived from the PCA-LDA technique. Therefore, NIR Raman spectroscopy in conjunction with multivariate statistical technique has potential for rapid diagnosis of dysplasia in the stomach based on the optical evaluation of spectral features of biomolecules.
dysplasia; near-infrared Raman spectroscopy; optical diagnosis; stomach; principal components analysis; linear discriminant analysis
Extracting relevant information from microarray data is a very complex task due to the characteristics of the data sets, as they comprise a large number of features while few samples are generally available. In this sense, feature selection is a very important aspect of the analysis helping in the tasks of identifying relevant genes and also for maximizing predictive information.
Due to its simplicity and speed, Stepwise Forward Selection (SFS) is a widely used feature selection technique. In this work, we carry out a comparative study of SFS and Genetic Algorithms (GA) as general frameworks for the analysis of microarray data, with the aim of identifying groups of genes with high predictive capability and biological relevance. Six standard and machine learning-based techniques (Linear Discriminant Analysis (LDA), Support Vector Machines (SVM), Naive Bayes (NB), C-MANTEC Constructive Neural Network, K-Nearest Neighbors (kNN) and Multilayer perceptron (MLP)) are used within both frameworks, using six freely available public datasets for the task of predicting cancer outcome.
Better cancer outcome prediction results were obtained using the GA framework, although this approach, in comparison to the SFS one, leads to a larger selection set and uses a large number of comparisons between genetic profiles, and is thus computationally more intensive. The GA framework also yielded a set of genes that can be considered more biologically relevant. Regarding the different classifiers used, standard feedforward neural networks (MLP), LDA and SVM led to similar and best results, while C-MANTEC and kNN followed closely but with lower accuracy. Further, C-MANTEC, MLP and LDA yielded a more limited set of genes in comparison to SVM, NB and kNN, and in particular C-MANTEC proved the most robust classifier in terms of changes in the parameter settings.
This study shows that if prediction accuracy is the objective, the GA-based approach leads to better results than the SFS approach, independently of the classifier used. Regarding classifiers, even if C-MANTEC did not achieve the best overall results, its performance was competitive, with very robust behaviour in terms of the parameters of the algorithm, and thus it can be considered a candidate technique for future studies.
Microarray; Genetic algorithms; Constructive neural networks; Feature Selection
In the field of computer-aided mammographic mass detection, many different features and classifiers have been tested. Frequently, the relevant features and optimal topology for the artificial neural network (ANN)-based approaches at the classification stage are unknown, and thus determined by trial-and-error experiments. In this study, we analyzed a classifier that evolves ANNs using genetic algorithms (GAs), which combines feature selection with the learning task. The classifier named “Phased Searching with NEAT in a Time-Scaled Framework” was analyzed using a dataset with 800 malignant and 800 normal tissue regions in a 10-fold cross-validation framework. The classification performance measured by the area under a receiver operating characteristic (ROC) curve was 0.856 ± 0.029. The result was also compared with four other well-established classifiers that include fixed-topology ANNs, support vector machines (SVMs), linear discriminant analysis (LDA), and bagged decision trees. The results show that Phased Searching outperformed the LDA and bagged decision tree classifiers, and was only significantly outperformed by SVM. Furthermore, the Phased Searching method required fewer features and discarded superfluous structure or topology, thus incurring a lower feature computational and training and validation time requirement. Analyses performed on the network complexities evolved by Phased Searching indicate that it can evolve optimal network topologies based on its complexification and simplification parameter selection process. From the results, the study also concluded that the three classifiers – SVM, fixed-topology ANN, and Phased Searching with NeuroEvolution of Augmenting Topologies (NEAT) in a Time-Scaled Framework – are performing comparably well in our mammographic mass detection scheme.
Computer-aided detection (CAD); Machine Learning; Mammographic mass detection; NeuroEvolution of Augmenting Topologies (NEAT); Optimal Feature Selection
The combination of Brain-Computer Interface (BCI) technology, allowing online monitoring and decoding of brain activity, with virtual and mixed reality (MR) systems may help to shape and guide implicit and explicit learning using ecological scenarios. Real-time information on ongoing brain states acquired through BCI might be exploited for controlling data presentation in virtual environments. Discrimination of brain states during mixed reality experience is thus critical for adapting specific data features to contingent brain activity. In this study we recorded electroencephalographic (EEG) data while participants experienced MR scenarios implemented through the eXperience Induction Machine (XIM). The XIM is a novel framework modeling the integration of a sensing system that evaluates and measures physiological and psychological states with a number of actuators and effectors that coherently react to the user's actions. We then assessed continuous EEG-based discrimination of spatial navigation, reading and calculation performed in MR, using linear discriminant analysis (LDA) and support vector machine (SVM) classifiers. Dynamic single trial classification showed high accuracy of LDA and SVM classifiers in detecting multiple brain states as well as in differentiating between high and low mental workload, using a 5 s time-window shifting every 200 ms. Our results indicate overall better performance of LDA with respect to SVM and suggest applicability of our approach in a BCI-controlled MR scenario. Ultimately, successful prediction of brain states might be used to drive adaptation of data representation in order to boost information processing in MR.
mental states decoding; EEG; mixed reality; XIM
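The windowing scheme described above (a 5 s window shifted every 200 ms) can be sketched as follows. This is a minimal illustration only: the sampling rate of 250 Hz and the function name `sliding_windows` are assumptions for the example and are not taken from the study.

```python
def sliding_windows(n_samples, fs, window_s=5.0, step_s=0.2):
    """Return (start, stop) sample indices of overlapping analysis windows.

    fs is the sampling rate in Hz (hypothetical value in the usage below);
    window_s and step_s follow the 5 s / 200 ms scheme from the abstract.
    """
    win = int(round(window_s * fs))    # window length in samples
    step = int(round(step_s * fs))     # shift between consecutive windows
    return [(start, start + win)
            for start in range(0, n_samples - win + 1, step)]


# Usage: 10 s of signal at an assumed 250 Hz sampling rate.
windows = sliding_windows(n_samples=2500, fs=250)
print(len(windows), windows[0], windows[-1])
```

Each window would then be turned into a feature vector and passed to the classifier, so a new decision becomes available every 200 ms once the first 5 s of data have been collected.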
As more diagnostic testing options become available to physicians, it becomes more difficult to combine various types of medical information in order to optimize the overall diagnosis. To improve diagnostic performance, here we introduce an approach to optimize a decision-fusion technique to combine heterogeneous information, such as from different modalities, feature categories, or institutions. For classifier comparison we used two performance metrics: the area under the receiver operating characteristic (ROC) curve (AUC) and the normalized partial area under the curve (pAUC). This study used four classifiers: linear discriminant analysis (LDA), artificial neural network (ANN), and two variants of our decision-fusion technique, AUC-optimized (DF-A) and pAUC-optimized (DF-P) decision fusion. We applied each of these classifiers with 100-fold cross-validation to two heterogeneous breast cancer data sets: one of mass lesion features and a much more challenging one of microcalcification lesion features. For the calcification data set, DF-A outperformed the other classifiers in terms of AUC (p<0.02) and achieved AUC=0.85±0.01. The DF-P surpassed the other classifiers in terms of pAUC (p<0.01) and reached pAUC=0.38±0.02. For the mass data set, DF-A outperformed both the ANN and the LDA (p<0.04) and achieved AUC=0.94±0.01. Although for this data set there were no statistically significant differences among the classifiers' pAUC values (pAUC=0.57±0.07 to 0.67±0.05, p>0.10), the DF-P did significantly improve specificity versus the LDA at both 98% and 100% sensitivity (p<0.04). In conclusion, decision fusion directly optimized clinically significant performance measures, such as AUC and pAUC, and sometimes outperformed two well-known machine-learning techniques when applied to two different breast cancer data sets.
decision fusion; heterogeneous data; receiver operating characteristic (ROC) curve; area under the curve (AUC); partial area under the curve (pAUC); classification; machine learning; breast cancer
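The two performance metrics named in the abstract above can be illustrated with a short sketch. The abstract does not give the exact pAUC definition, so the version here is an assumption: the mean specificity over the sensitivity range from 0.9 to 1, computed on the empirical (step) ROC curve. The scores are made-up values.

```python
def auc(pos, neg):
    """Empirical AUC: fraction of (positive, negative) pairs ranked
    correctly, with ties counted as half (Mann-Whitney statistic)."""
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def roc_points(pos, neg):
    """(FPR, TPR) operating points obtained by thresholding at each score."""
    pts = [(0.0, 0.0)]
    for t in sorted(set(pos) | set(neg), reverse=True):
        tpr = sum(p >= t for p in pos) / len(pos)
        fpr = sum(n >= t for n in neg) / len(neg)
        pts.append((fpr, tpr))
    return pts


def normalized_pauc(pos, neg, min_sens=0.9, grid=1000):
    """Assumed definition: average specificity over sensitivities in
    [min_sens, 1], so a perfect classifier scores 1."""
    pts = roc_points(pos, neg)
    total = 0.0
    for i in range(grid):
        s = min_sens + (1.0 - min_sens) * (i + 0.5) / grid
        # best (lowest) false-positive rate that still reaches sensitivity s
        fpr = min(f for f, t in pts if t >= s)
        total += 1.0 - fpr
    return total / grid


# Hypothetical classifier outputs for malignant (pos) and benign (neg) cases.
pos = [0.9, 0.8, 0.4, 0.35]
neg = [0.5, 0.3, 0.2, 0.1]
print(auc(pos, neg), normalized_pauc(pos, neg))
```

The example shows why the two measures can rank classifiers differently: AUC averages over all operating points, whereas the partial area only rewards specificity in the high-sensitivity region.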
Bioactivity profiling using high-throughput in vitro assays can reduce the cost and time required for toxicological screening of environmental chemicals and can also reduce the need for animal testing. Several public efforts are aimed at discovering patterns or classifiers in high-dimensional bioactivity space that predict tissue, organ or whole animal toxicological endpoints. Supervised machine learning is a powerful approach to discover combinatorial relationships in complex in vitro/in vivo datasets. We present a novel model to simulate complex chemical-toxicology data sets and use this model to evaluate the relative performance of different machine learning (ML) methods.
The classification performance of Artificial Neural Networks (ANN), K-Nearest Neighbors (KNN), Linear Discriminant Analysis (LDA), Naïve Bayes (NB), Recursive Partitioning and Regression Trees (RPART), and Support Vector Machines (SVM) in the presence and absence of filter-based feature selection was analyzed using K-way cross-validation testing and independent validation on simulated in vitro assay data sets with varying levels of model complexity, number of irrelevant features and measurement noise. While the prediction accuracy of all ML methods decreased as non-causal (irrelevant) features were added, some ML methods performed better than others. In the limit of using a large number of features, ANN and SVM were always in the top performing set of methods while RPART and KNN (k = 5) were always in the poorest performing set. The addition of measurement noise and irrelevant features decreased the classification accuracy of all ML methods, with LDA suffering the greatest performance degradation. LDA performance is especially sensitive to the use of feature selection. Filter-based feature selection generally improved performance, most strikingly for LDA.
We have developed a novel simulation model to evaluate machine learning methods for the analysis of data sets in which in vitro bioassay data is being used to predict in vivo chemical toxicology. From our analysis, we can recommend that several ML methods, most notably SVM and ANN, are good candidates for use in real world applications in this area.
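As a rough illustration of filter-based feature selection combined with K-way cross-validation, the sketch below scores each feature by a simple class-separation statistic and keeps the top-ranked ones, recomputing the ranking on the training part of every fold. The scoring statistic, the interleaved fold assignment, the decision to select inside each fold, and the toy data are all assumptions for this example; the abstract does not specify them.

```python
from statistics import mean, pstdev


def filter_scores(X, y):
    """Score each feature by |difference of class means| / summed spread."""
    scores = []
    for j in range(len(X[0])):
        a = [row[j] for row, label in zip(X, y) if label == 0]
        b = [row[j] for row, label in zip(X, y) if label == 1]
        spread = pstdev(a) + pstdev(b) + 1e-12   # avoid division by zero
        scores.append(abs(mean(a) - mean(b)) / spread)
    return scores


def top_features(X, y, m):
    """Indices of the m highest-scoring features."""
    scores = filter_scores(X, y)
    return sorted(range(len(scores)), key=lambda j: -scores[j])[:m]


def kfold_indices(n, k):
    """Interleaved test-index sets for k folds."""
    return [list(range(i, n, k)) for i in range(k)]


def select_per_fold(X, y, k, m):
    """Run the filter on the training part of each fold only."""
    chosen = []
    for test_idx in kfold_indices(len(X), k):
        held_out = set(test_idx)
        train = [i for i in range(len(X)) if i not in held_out]
        chosen.append(top_features([X[i] for i in train],
                                   [y[i] for i in train], m))
    return chosen


# Toy assay matrix: feature 0 is causal, features 1 and 2 are irrelevant.
X = [[0.1, 5.0, 1.0], [0.2, 3.0, 2.0], [0.0, 4.0, 3.0], [0.3, 6.0, 4.0],
     [2.1, 5.0, 1.0], [2.2, 3.0, 2.0], [2.0, 4.0, 3.0], [2.3, 6.0, 4.0]]
y = [0, 0, 0, 0, 1, 1, 1, 1]
print(select_per_fold(X, y, k=4, m=1))
```

Selecting features inside each fold keeps the held-out samples from influencing the ranking, which would otherwise make the cross-validated accuracy look better than it is.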
There are many instances where it is desirable to determine, at a distance, whether a subject is carrying a hidden load. Automated detection systems based on gait analysis have been proposed to detect subjects that carry hidden loads. However, very little baseline gait kinematic analysis has been performed to determine the effect on human gait of carrying evenly distributed (front to back) loads while ambulating. The work in this paper establishes, via high-resolution motion capture trials, the baseline separability of load carriage conditions into loaded and unloaded categories using several standard lower body kinematic parameters. A total of 23 participants (19 for training and 4 for testing) were studied. Satisfactory classification of participants into the correct loading condition was achieved by employing linear discriminant analysis (LDA). Six lower body kinematic parameters, including ranges of motion and path lengths from the phase portraits, were used to train the LDA to discriminate loaded and unloaded walking conditions. Baseline performance from 4 participants who were not included in the training data sets shows that the use of LDA provides 92.5% correct classification over two loaded and unloaded walking conditions. The results suggest that there are gait pattern changes due to external loads, and LDA could be applied successfully to classify the gait patterns with an unknown load condition.
Locomotion; Gait analysis; External loads; Linear discriminant analysis
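The two-class discrimination described above can be sketched with Fisher's linear discriminant. For brevity this example uses two kinematic features instead of the six in the study, assumes equal class priors, and runs on invented values; the feature names (a range of motion in degrees and a phase-portrait path length) are placeholders, not the study's measurements.

```python
def fit_lda_2d(X0, X1):
    """Fit a two-class, two-feature Fisher discriminant.

    Returns (w, c): a sample x is assigned to class 1 when w . x > c.
    """
    def mean2(X):
        n = len(X)
        return (sum(x[0] for x in X) / n, sum(x[1] for x in X) / n)

    m0, m1 = mean2(X0), mean2(X1)
    # pooled within-class scatter matrix [[a, b], [b, d]]
    a = b = d = 0.0
    for X, m in ((X0, m0), (X1, m1)):
        for x in X:
            dx, dy = x[0] - m[0], x[1] - m[1]
            a += dx * dx
            b += dx * dy
            d += dy * dy
    det = a * d - b * b
    diff = (m1[0] - m0[0], m1[1] - m0[1])
    # w = S^-1 (m1 - m0), using the closed-form 2x2 inverse
    w = ((d * diff[0] - b * diff[1]) / det,
         (-b * diff[0] + a * diff[1]) / det)
    mid = ((m0[0] + m1[0]) / 2, (m0[1] + m1[1]) / 2)
    c = w[0] * mid[0] + w[1] * mid[1]   # threshold at the class midpoint
    return w, c


def predict(model, x):
    w, c = model
    return int(w[0] * x[0] + w[1] * x[1] > c)


# Invented training data: class 0 = unloaded, class 1 = loaded.
unloaded = [(40.0, 1.0), (42.0, 1.2), (41.0, 0.9), (43.0, 1.1)]
loaded = [(35.0, 1.6), (36.0, 1.8), (34.0, 1.5), (37.0, 1.7)]
model = fit_lda_2d(unloaded, loaded)
print(predict(model, (41.5, 1.0)), predict(model, (35.5, 1.7)))
```

Holding out whole participants for testing, as the study did with 4 of the 23, would correspond to calling `predict` only on trials from people whose data were never passed to `fit_lda_2d`.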
Low density arrays (LDAs) have recently been introduced as a novel approach to gene expression profiling. Based on real time quantitative RT-PCR (QRT-PCR), these arrays enable a more focused and sensitive approach to the study of gene expression than gene chips, while offering higher throughput than more established approaches to QRT-PCR. We have now evaluated LDAs as a means of determining the expression of multiple genes simultaneously in human tissues and cells.
Comparisons between LDAs reveal low variability, with correlation coefficients close to 1. By performing 2-fold and 10-fold serial dilutions of cDNA samples in the LDAs we determined a clear linear relationship between the gene expression data points over 5 orders of magnitude. We also showed that it is possible to use LDAs to accurately and quantitatively detect 2-fold changes in target copy number as well as measuring genes that are expressed with low and high copy numbers in the range of 1 × 10² – 1 × 10⁶ copies. Furthermore, the data generated by the LDA from a cell based pharmacological study were comparable to data generated by conventional QRT-PCR.
LDAs represent a valuable new approach for sensitive and quantitative gene expression profiling.
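The linearity check over a dilution series can be illustrated with a standard-curve fit. The sketch below assumes that expression is read out as threshold cycle (Ct) values and uses idealized, made-up data in which every doubling of template lowers Ct by exactly one cycle; neither the readout nor the numbers come from the study.

```python
import math


def fit_line(xs, ys):
    """Ordinary least-squares fit y = slope * x + intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx


def amplification_efficiency(slope):
    """Efficiency from the standard-curve slope (1.0 = perfect doubling)."""
    return 10 ** (-1.0 / slope) - 1.0


# Idealized 10-fold dilution series spanning 1e2 to 1e6 copies.
copies = [1e2, 1e3, 1e4, 1e5, 1e6]
ct = [40.0 - math.log2(c) for c in copies]      # hypothetical Ct values

slope, intercept = fit_line([math.log10(c) for c in copies], ct)
two_fold_shift = -slope * math.log10(2)          # Ct change for a 2-fold change
print(slope, amplification_efficiency(slope), two_fold_shift)
```

With real measurements the slope would deviate from the ideal value of about -3.32, and the size of the residuals around the fitted line indicates how well 2-fold differences can be resolved.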
Recent studies suggest that gene expression profiles are a promising alternative for clinical cancer classification. One major problem in applying DNA microarrays for classification is the dimension of the obtained data sets. In this paper we propose a multiclass gene selection method based on Partial Least Squares (PLS) for selecting genes for classification. The new idea is to solve the multiclass selection problem with the PLS method by decomposing it into a set of two-class sub-problems: one versus rest (OvR) and one versus one (OvO). We also apply the OvR and OvO two-class decompositions to other recently published gene selection methods. Ranked gene lists are highly unstable in the sense that a small change in the data set often leads to big changes in the obtained ordered lists. In this paper, we assess the stability of the proposed methods. As classifiers, we use linear support vector machines (SVMs) in several variants (one versus one, one versus rest, and multiclass SVM (MSVM)) and linear discriminant analysis (LDA). We use the balanced bootstrap to estimate the prediction error and to test the variability of the obtained ordered lists.
This paper focuses on effective identification of informative genes. As a result, a new strategy to find a small subset of significant genes is designed. Our results on real multiclass cancer data show that our method has a very high accuracy rate for different combinations of classification methods, giving concurrently very stable feature rankings.
This paper shows that the proposed strategies can substantially improve the performance of selected gene sets. OvR and OvO techniques applied to existing gene selection methods improve results as well. The presented method yields a more reliable classifier with lower classification error. At the same time, the method generates more stable ordered feature lists in comparison with existing methods.
This article was reviewed by Prof Marek Kimmel, Dr Hans Binder (nominated by Dr Tomasz Lipniacki) and Dr Yuriy Gusev
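The one-versus-rest and one-versus-one decompositions and the balanced bootstrap mentioned in the gene selection abstract above can be sketched as follows. The function names and the tumor labels are hypothetical, and the balanced bootstrap is assumed to be the usual variant in which every sample appears the same number of times across all resamples; the paper may differ in detail.

```python
import random
from itertools import combinations


def one_vs_rest(labels):
    """One binary label vector per class: 1 for that class, 0 for the rest."""
    return {c: [1 if l == c else 0 for l in labels]
            for c in sorted(set(labels))}


def one_vs_one(labels):
    """One sub-problem per class pair: (sample indices, binary labels)."""
    problems = {}
    for a, b in combinations(sorted(set(labels)), 2):
        idx = [i for i, l in enumerate(labels) if l in (a, b)]
        problems[(a, b)] = (idx, [1 if labels[i] == a else 0 for i in idx])
    return problems


def balanced_bootstrap(n, n_rounds, seed=0):
    """Resamples of size n in which each index occurs exactly n_rounds
    times in total (concatenate n_rounds copies, shuffle, then split)."""
    pool = list(range(n)) * n_rounds
    random.Random(seed).shuffle(pool)
    return [pool[r * n:(r + 1) * n] for r in range(n_rounds)]


# Hypothetical three-class tumor labels for six samples.
labels = ["A", "B", "C", "A", "B", "C"]
print(one_vs_rest(labels)["A"])
print(one_vs_one(labels)[("A", "B")])
print(balanced_bootstrap(n=6, n_rounds=3))
```

A gene ranking would be computed for each binary sub-problem and each resample; comparing the rankings across resamples then gives a measure of how stable the ordered gene list is.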