Related Articles

1.  A classification approach for genotyping viral sequences based on multidimensional scaling and linear discriminant analysis 
BMC Bioinformatics  2010;11:434.
Accurate classification into genotypes is critical in understanding evolution of divergent viruses. Here we report a new approach, MuLDAS, which classifies a query sequence based on the statistical genotype models learned from the known sequences. Thus, MuLDAS utilizes full spectra of well characterized sequences as references, typically of an order of hundreds, in order to estimate the significance of each genotype assignment.
MuLDAS starts by aligning the query sequence to the reference multiple sequence alignment and calculating the subsequent distance matrix among the sequences. They are then mapped to a principal coordinate space by multidimensional scaling, and the coordinates of the reference sequences are used as features in developing linear discriminant models that partition the space by genotype. The genotype of the query is then given as the maximum a posteriori estimate. MuLDAS tests the model confidence by leave-one-out cross-validation and also provides some heuristics for the detection of 'outlier' sequences that fall far outside or in-between genotype clusters. We have tested our method by classifying HIV-1 and HCV nucleotide sequences downloaded from NCBI GenBank, achieving the overall concordance rates of 99.3% and 96.6%, respectively, with the benchmark test dataset retrieved from the respective databases of Los Alamos National Laboratory.
The highly accurate genotype assignment coupled with several measures for evaluating the results makes MuLDAS useful in analyzing the sequences of rapidly evolving viruses such as HIV-1 and HCV. A web-based genotype prediction server is available at
PMCID: PMC2936400  PMID: 20727194
2.  Neuropsychological Testing and Structural Magnetic Resonance Imaging as Diagnostic Biomarkers Early in the Course of Schizophrenia and Related Psychoses 
Neuroinformatics  2011;9(4):321-333.
Making an accurate diagnosis of schizophrenia and related psychoses early in the course of the disease is important for initiating treatment and counseling patients and families. In this study, we developed classification models for early disease diagnosis using structural MRI (sMRI) and neuropsychological (NP) testing. We used sMRI measurements and NP test results from 28 patients with recent-onset schizophrenia and 47 healthy subjects, drawn from the larger sample of the Mind Clinical Imaging Consortium. We developed diagnostic models based on Linear Discriminant Analysis (LDA) following two approaches; namely, (a) stepwise (STP) LDA on the original measurements, and (b) LDA on variables created through Principal Component Analysis (PCA) and selected using the Humphrey-Ilgen parallel analysis. Error estimation of the modeling algorithms was evaluated by leave-one-out external cross-validation. These analyses were performed on sMRI and NP variables separately and in combination. The following classification accuracy was obtained for different variables and modeling algorithms. sMRI only: (a) STP-LDA: 64.3% sensitivity and 76.6% specificity, (b) PCA-LDA: 67.9% sensitivity and 72.3% specificity. NP only: (a) STP-LDA: 71.4% sensitivity and 80.9% specificity, (b) PCA-LDA: 78.5% sensitivity and 91.5% specificity. Combined sMRI-NP: (a) STP-LDA: 64.3% sensitivity and 83.0% specificity, (b) PCA-LDA: 89.3% sensitivity and 93.6% specificity. (i) Maximal diagnostic accuracy was achieved by combining sMRI and NP variables. (ii) NP variables were more informative than sMRI, indicating that cognitive deficits can be detected earlier than volumetric structural abnormalities. (iii) PCA-LDA yielded more accurate classification than STP-LDA. As these sMRI and NP tests are widely available, they can increase accuracy of early intervention strategies and possibly be used in evaluating treatment response.
PMCID: PMC3116989  PMID: 21246418
Schizophrenia; Schizophreniform; Schizoaffective; PCA; LDA; Biomarkers; Neuropsychology; MRI; Cross-validation; Diagnosis; MCIC
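The leave-one-out cross-validation used for error estimation in the abstract above can be sketched as follows. This is a minimal illustration, not the study's pipeline: a nearest-centroid classifier stands in for LDA, and the function names and toy data are assumptions made for the example. In an "external" scheme, any variable selection or PCA step would also have to be refit inside the loop on the training subjects only.

```python
import math

def fit_centroids(X, y):
    """Compute the mean feature vector of each class (stand-in for fitting LDA)."""
    sums, counts = {}, {}
    for x, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(x))
        for i, v in enumerate(x):
            acc[i] += v
        counts[label] = counts.get(label, 0) + 1
    return {label: [v / counts[label] for v in acc] for label, acc in sums.items()}

def predict(centroids, x):
    """Assign x to the class whose centroid is nearest in Euclidean distance."""
    return min(centroids, key=lambda label: math.dist(centroids[label], x))

def loocv_accuracy(X, y):
    """Hold out each sample once, refit on the rest, and score the held-out sample."""
    correct = 0
    for i in range(len(X)):
        X_train = X[:i] + X[i + 1:]
        y_train = y[:i] + y[i + 1:]
        centroids = fit_centroids(X_train, y_train)  # refit without sample i
        correct += predict(centroids, X[i]) == y[i]
    return correct / len(X)

# Hypothetical toy data: two well-separated groups in two dimensions.
X = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [5.0, 5.0], [5.0, 6.0], [6.0, 5.0]]
y = ["control", "control", "control", "patient", "patient", "patient"]
print(loocv_accuracy(X, y))
```

Because the model is refit for every held-out sample, the cost grows with the number of subjects, which is acceptable for samples of the size reported here (75 subjects).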
3.  Transcriptional Control by A-Factor of Two Trypsin Genes in Streptomyces griseus 
Journal of Bacteriology  2005;187(1):286-295.
AdpA is the key transcriptional activator for a number of genes of various functions in the A-factor regulatory cascade in Streptomyces griseus, forming an AdpA regulon. Trypsin-like activity was detected at a late stage of growth in the wild-type strain but not in an A-factor-deficient mutant. Consistent with these observations, two trypsin genes, sprT and sprU, in S. griseus were found to be members of the AdpA regulon; AdpA activated the transcription of both genes by binding to the operators located at about −50 nucleotide positions with respect to the transcriptional start point. The transcription of sprT and sprU, induced by AdpA, was most active at the onset of sporulation. Most trypsin activity exerted by S. griseus was attributed to SprT, because trypsin activity in an sprT-disrupted mutant was greatly reduced but that in an sprU-disrupted mutant was only slightly reduced. This was consistent with the observation that the amount of the sprT mRNA was much greater than that of the sprU transcript. Disruption of both sprT and sprU (mutant ΔsprTU) reduced trypsin activity to almost zero, indicating that no trypsin genes other than these two were present in S. griseus. Even the double mutant ΔsprTU grew normally and developed aerial hyphae and spores over the same time course as the wild-type strain.
PMCID: PMC538825  PMID: 15601713
4.  A New Method Combining LDA and PLS for Dimension Reduction 
PLoS ONE  2014;9(5):e96944.
Linear discriminant analysis (LDA) is a classical statistical approach for dimensionality reduction and classification. In many cases, the projection direction of the classical and extended LDA methods is not considered optimal for special applications. Herein we combine the Partial Least Squares (PLS) method with LDA algorithm, and then propose two improved methods, named LDA-PLS and ex-LDA-PLS, respectively. The LDA-PLS amends the projection direction of LDA by using the information of PLS, while ex-LDA-PLS is an extension of LDA-PLS by combining the result of LDA-PLS and LDA, making the result closer to the optimal direction by an adjusting parameter. Comparative studies are provided between the proposed methods and other traditional dimension reduction methods such as Principal component analysis (PCA), LDA and PLS-LDA on two data sets. Experimental results show that the proposed method can achieve better classification performance.
PMCID: PMC4018361  PMID: 24820185
5.  Comparison of linear discriminant analysis methods for the classification of cancer based on gene expression data 
More studies based on gene expression data have been reported in great detail; however, one major challenge for methodologists is the choice of classification method. The main purpose of this research was to compare the performance of linear discriminant analysis (LDA) and its modification methods for the classification of cancer based on gene expression data.
The classification performance of linear discriminant analysis (LDA) and its modification methods was evaluated by applying these methods to six public cancer gene expression datasets. These methods included linear discriminant analysis (LDA), prediction analysis for microarrays (PAM), shrinkage centroid regularized discriminant analysis (SCRDA), shrinkage linear discriminant analysis (SLDA) and shrinkage diagonal discriminant analysis (SDDA). The procedures were performed by software R 2.80.
PAM picked out fewer feature genes than the other methods from most datasets, the Brain dataset being the exception. Of the two shrinkage discriminant analysis methods, SLDA selected more genes than SDDA from most datasets, the 2-class lung cancer dataset being the exception. SLDA selected more genes than SCRDA from the 2-class lung cancer, SRBCT and Brain datasets, whereas the opposite held for the remaining datasets. The average test error of the LDA modification methods was lower than that of the LDA method.
The classification performance of the LDA modification methods was superior to that of traditional LDA with respect to the average error, and there was no significant difference among these modification methods.
PMCID: PMC2800110  PMID: 20003274
6.  Z-Score Linear Discriminant Analysis for EEG Based Brain-Computer Interfaces 
PLoS ONE  2013;8(9):e74433.
Linear discriminant analysis (LDA) is one of the most popular classification algorithms for brain-computer interfaces (BCI). LDA assumes Gaussian distribution of the data, with equal covariance matrices for the concerned classes; however, this assumption usually does not hold in actual BCI applications, where heteroscedastic class distributions are usually observed. This paper proposes an enhanced version of LDA, namely z-score linear discriminant analysis (Z-LDA), which introduces a new decision boundary definition strategy to handle heteroscedastic class distributions. Z-LDA defines the decision boundary through z-scores, utilizing both mean and standard deviation information of the projected data, which can adaptively adjust the decision boundary to fit heteroscedastic distributions. Results derived from both a simulation dataset and two actual BCI datasets consistently show that Z-LDA achieves significantly higher average classification accuracies than conventional LDA, indicating the superiority of the newly proposed decision boundary definition strategy.
PMCID: PMC3772882  PMID: 24058565
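The idea of a z-score decision rule on projected data can be illustrated with a small sketch. It assumes the data have already been projected onto one discriminant axis, and it assigns a point to the class with the smallest absolute z-score; this rule, the function names and the toy numbers are assumptions for illustration and may differ from the exact Z-LDA formulation in the paper.

```python
from statistics import mean, stdev

def fit_class_stats(projected):
    """Map each label to (mean, sample standard deviation) of its projected scores."""
    return {label: (mean(vals), stdev(vals)) for label, vals in projected.items()}

def predict_nearest_mean(y, stats):
    """Equal-variance rule: pick the class whose projected mean is closest."""
    return min(stats, key=lambda label: abs(y - stats[label][0]))

def predict_zscore(y, stats):
    """Assumed z-score rule: pick the class with the smallest |y - mean| / std."""
    return min(stats, key=lambda label: abs(y - stats[label][0]) / stats[label][1])

# Hypothetical projected scores: class A is tight, class B is widely spread.
projected = {"A": [-1.0, 0.0, 1.0], "B": [6.0, 10.0, 14.0]}
stats = fit_class_stats(projected)

y = 3.0
print(predict_nearest_mean(y, stats), predict_zscore(y, stats))
```

In this toy case the two rules disagree for y = 3.0: the point is closer to the mean of A in raw distance, but it lies three standard deviations from A and fewer than two from B, so the z-score rule moves the boundary toward the tighter class.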
7.  On Using Truncated Sequential Probability Ratio Test Boundaries for Monte Carlo Implementation of Hypothesis Tests 
When designing programs or software for the implementation of Monte Carlo (MC) hypothesis tests, we can save computation time by using sequential stopping boundaries. Such boundaries imply stopping resampling after relatively few replications if the early replications indicate a very large or very small p-value. We study a truncated sequential probability ratio test (SPRT) boundary and provide a tractable algorithm to implement it. We review two properties desired of any MC p-value, the validity of the p-value and a small resampling risk, where resampling risk is the probability that the accept/reject decision will be different than the decision from complete enumeration. We show how the algorithm can be used to calculate a valid p-value and confidence intervals for any truncated SPRT boundary. We show that a class of SPRT boundaries is minimax with respect to resampling risk and recommend a truncated version of boundaries in that class by comparing their resampling risk (RR) to the RR of fixed boundaries with the same maximum resample size. We study the lack of validity of some simple estimators of p-values and offer a new simple valid p-value for the recommended truncated SPRT boundary. We explore the use of these methods in a practical example and provide the MChtest R package to perform the methods.
PMCID: PMC2467508  PMID: 18633453
Bootstrap; B-value; Permutation; Resampling Risk; Sequential Design; Sequential Probability Ratio Test
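The sequential stopping idea in the abstract above can be sketched with a textbook Wald SPRT on resampling outcomes, truncated at a maximum number of replications. This is not the paper's algorithm and not the MChtest package interface: the function name, the hypotheses p0 and p1 placed around a 0.05 significance level, the error rates and the return format are all assumptions for illustration. The sketch returns only the stopping decision and the counts, since the abstract notes that simple p-value estimators can lack validity under sequential stopping.

```python
import math

def truncated_sprt(exceedances, p0=0.04, p1=0.06, alpha=0.01, beta=0.01, max_n=10000):
    """Wald SPRT of p = p0 against p = p1 on a stream of 0/1 outcomes.

    Each outcome is 1 when a resampled statistic is at least as extreme as the
    observed one. Returns (decision, n, s): the decision label, the number of
    replications used, and the number of exceedances seen.
    """
    upper = math.log((1 - beta) / alpha)       # crossing favours p >= p1
    lower = math.log(beta / (1 - alpha))       # crossing favours p <= p0
    step_hit = math.log(p1 / p0)               # log-likelihood ratio of a 1
    step_miss = math.log((1 - p1) / (1 - p0))  # log-likelihood ratio of a 0
    llr = 0.0
    n = s = 0
    for hit in exceedances:
        n += 1
        if hit:
            s += 1
            llr += step_hit
        else:
            llr += step_miss
        if llr >= upper:
            return "p_above", n, s
        if llr <= lower:
            return "p_below", n, s
        if n >= max_n:
            break
    # Reached the truncation point (or ran out of outcomes) without crossing.
    return "truncated", n, s

# A stream with no exceedances stops early in favour of a small p-value.
print(truncated_sprt([0] * 10000))
```

When the stream is clearly on one side of the significance level, the test stops after far fewer replications than `max_n`; outcomes near the level tend to run to the truncation point.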
8.  Evaluation of an automated safety surveillance system using risk adjusted sequential probability ratio testing 
Automated adverse outcome surveillance tools and methods have potential utility in quality improvement and medical product surveillance activities. Their use for assessing hospital performance on the basis of patient outcomes has received little attention. We compared risk-adjusted sequential probability ratio testing (RA-SPRT) implemented in an automated tool to Massachusetts public reports of 30-day mortality after isolated coronary artery bypass graft surgery.
A total of 23,020 isolated adult coronary artery bypass surgery admissions performed in Massachusetts hospitals between January 1, 2002 and September 30, 2007 were retrospectively re-evaluated. The RA-SPRT method was implemented within an automated surveillance tool to identify hospital outliers in yearly increments. We used an overall type I error rate of 0.05, an overall type II error rate of 0.10, and a threshold that signaled if the odds of dying within 30 days after surgery were at least twice those expected. Annual hospital outlier status, based on the state-reported classification, was considered the gold standard. An event was defined as at least one occurrence of a higher-than-expected hospital mortality rate during a given year.
We examined a total of 83 hospital-year observations. The RA-SPRT method alerted 6 events among three hospitals for 30-day mortality compared with 5 events among two hospitals using the state public reports, yielding a sensitivity of 100% (5/5) and specificity of 98.8% (79/80).
The automated RA-SPRT method performed well, detecting all of the true institutional outliers with a small false positive alerting rate. Such a system could provide confidential automated notification to local institutions in advance of public reporting providing opportunities for earlier quality improvement interventions.
PMCID: PMC3262755  PMID: 22168892
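The sensitivity and specificity reported above follow directly from the stated ratios (5/5 and 79/80). A minimal sketch of the arithmetic, with a hypothetical helper name:

```python
def sensitivity_specificity(tp, fn, tn, fp):
    """Return (sensitivity, specificity) as fractions from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)  # share of true events that were alerted
    specificity = tn / (tn + fp)  # share of non-events that were not alerted
    return sensitivity, specificity

# Counts taken from the reported ratios: 5 of 5 events alerted,
# 79 of 80 non-event hospital-years not alerted (one false alert).
sens, spec = sensitivity_specificity(tp=5, fn=0, tn=79, fp=1)
print(f"sensitivity = {sens:.3f}, specificity = {spec:.4f}")
```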
9.  Identification of Different Varieties of Sesame Oil Using Near-Infrared Hyperspectral Imaging and Chemometrics Algorithms 
PLoS ONE  2014;9(5):e98522.
This study investigated the feasibility of using the near-infrared hyperspectral imaging (NIR-HSI) technique for non-destructive identification of sesame oil. Hyperspectral images of four varieties of sesame oil were obtained in the spectral region of 874–1734 nm. Reflectance values were extracted from each region of interest (ROI) of each sample. Competitive adaptive reweighted sampling (CARS), successive projections algorithm (SPA) and x-loading weights (x-LW) were carried out to identify the most significant wavelengths. Based on the sixty-four, seven and five wavelengths suggested by CARS, SPA and x-LW, respectively, two classification models (least squares-support vector machine, LS-SVM, and linear discriminant analysis, LDA) were established. Among the established models, CARS-LS-SVM and CARS-LDA models performed well with the highest classification rate (100%) in both calibration and prediction sets. SPA-LS-SVM and SPA-LDA models obtained better results (95.59% and 98.53% of classification rate in prediction set) with only seven wavelengths (938, 1160, 1214, 1406, 1656, 1659 and 1663 nm). The x-LW-LS-SVM and x-LW-LDA models also obtained satisfactory results (>80% of classification rate in prediction set) with only five wavelengths (921, 925, 995, 1453 and 1663 nm). The results showed that the NIR-HSI technique could be used to identify the varieties of sesame oil rapidly and non-destructively, and that CARS, SPA and x-LW were effective wavelength selection methods.
PMCID: PMC4039481  PMID: 24879306
10.  Predicting Future Response to Certolizumab Pegol in Rheumatoid Arthritis Patients: Features at 12 Weeks Associated With Low Disease Activity at 1 Year 
Arthritis care & research  2012;64(5):658-667.
To determine the prognostic significance of data collected early after starting certolizumab pegol (CZP) to predict low disease activity (LDA) at Week 52.
Data through Week 12 from 703 CZP-treated patients in the RA PreventIon of structural Damage (RAPID 1) trial were used as variables to predict LDA (DAS28 [ESR] ≤3.2) at Week 52. We identified variables, developed prediction models using classification trees, and tested performance using training and testing datasets. Additional prediction models were constructed using CDAI and an alternate outcome definition (composite of LDA or ACR50).
Using Week 6 and 12 data and across several different prediction models, response (LDA) and nonresponse at 1 year were predicted with relatively high accuracy (70–90%) for most patients. The best performing model predicting nonresponse by 12 weeks was 90% accurate and applied to 46% of the population. Model accuracy for predicted responders (30% of the RAPID1 population) was 74%. The area under the receiver operating characteristic curve was 0.76. Depending on the desired certainty of prediction at 12 weeks, ~12–24% of patients required >12 weeks of treatment to be accurately classified. CDAI-based models, and those evaluating the composite outcome (LDA or ACR50), achieved comparable accuracy.
We could accurately predict within 12 weeks of starting CZP whether most established RA patients with high baseline disease activity would likely achieve/not achieve LDA at 1 year. Decision trees may be useful to guide prospective management for RA patients treated with CZP and other biologics.
PMCID: PMC3330194  PMID: 22231904
11.  The Effect of Electric Cortical Stimulation after Focal Traumatic Brain Injury in Rats 
Annals of Rehabilitation Medicine  2012;36(5):596-608.
To evaluate the effects of electric cortical stimulation in the experimentally induced focal traumatic brain injury (TBI) rat model on motor recovery and plasticity of the injured brain.
Twenty male Sprague-Dawley rats were pre-trained on a single pellet reaching task (SPRT) and on a Rotarod task (RRT) for 14 days. Then, the TBI model was induced by a weight drop device (40 g in weight, 25 cm in height) on the dominant motor cortex, and the electrode was implanted over the perilesional cortical surface. All rats were divided into two groups as follows: an electrical stimulation (ES) group with anodal continuous stimulation (50 Hz and 194 µs duration) or a sham-operated control (SOC) group with no electrical stimulation. The rats were trained on SPRT and RRT for 14 days for rehabilitation, and Garcia's neurologic examination was performed. Histopathological and immunostaining evaluations were performed after the experiment.
There were no differences in the slice number in the histological analysis. Garcia's neurologic scores and SPRT were significantly increased in the ES group (p<0.05), yet there was no difference in RRT between the two groups. The ES group showed more expression of c-Fos around the brain injured area than the SOC group.
Electric cortical stimulation with rehabilitation is considered to be one of the trial methods for motor recovery in TBI. However, more studies should be conducted for the TBI model in order to establish better stimulation methods.
PMCID: PMC3503934  PMID: 23185723
Cortical stimulation; Traumatic brain injury; Rehabilitation; Motor recovery
12.  Developmental Stage Annotation of Drosophila Gene Expression Pattern Images via an Entire Solution Path for LDA 
Gene expression in a developing embryo occurs in particular cells (spatial patterns) in a time-specific manner (temporal patterns), which leads to the differentiation of cell fates. Images of a Drosophila melanogaster embryo at a given developmental stage, showing a particular gene expression pattern revealed by a gene-specific probe, can be compared for spatial overlaps. The comparison is fundamentally important to formulating and testing gene interaction hypotheses. Expression pattern comparison is most biologically meaningful when images from a similar time point (developmental stage) are compared. In this paper, we present LdaPath, a novel formulation of Linear Discriminant Analysis (LDA) for automatic developmental stage range classification. It employs multivariate linear regression with the L1-norm penalty controlled by a regularization parameter for feature extraction and visualization. LdaPath computes an entire solution path for all values of regularization parameter with essentially the same computational cost as fitting one LDA model. Thus, it facilitates efficient model selection. It is based on the equivalence relationship between LDA and the least squares method for multi-class classifications. This equivalence relationship is established under a mild condition, which we show empirically to hold for many high-dimensional datasets, such as expression pattern images. Our experiments on a collection of 2705 expression pattern images show the effectiveness of the proposed algorithm. Results also show that the LDA model resulting from LdaPath is sparse, and irrelevant features may be removed. Thus, LdaPath provides a general framework for simultaneous feature selection and feature extraction.
PMCID: PMC2491494  PMID: 18769656
Gene expression pattern image; dimensionality reduction; Linear Discriminant Analysis; linear regression
13.  Recursive gene selection based on maximum margin criterion: a comparison with SVM-RFE 
BMC Bioinformatics  2006;7:543.
In class prediction problems using microarray data, gene selection is essential to improve the prediction accuracy and to identify potential marker genes for a disease. Among numerous existing methods for gene selection, support vector machine-based recursive feature elimination (SVM-RFE) has become one of the leading methods and is being widely used. The SVM-based approach performs gene selection using the weight vector of the hyperplane constructed by the samples on the margin. However, the performance can be easily affected by noise and outliers, when it is applied to noisy, small sample size microarray data.
In this paper, we propose a recursive gene selection method using the discriminant vector of the maximum margin criterion (MMC), which is a variant of classical linear discriminant analysis (LDA). To overcome the computational drawback of classical LDA and the problem of high dimensionality, we present efficient and stable algorithms for MMC-based RFE (MMC-RFE). The MMC-RFE algorithms naturally extend to multi-class cases. The performance of MMC-RFE was extensively compared with that of SVM-RFE using nine cancer microarray datasets, including four multi-class datasets.
Our extensive comparison has demonstrated that for binary-class datasets MMC-RFE tends to show intermediate performance between hard-margin SVM-RFE and SVM-RFE with a properly chosen soft-margin parameter. Notably, MMC-RFE achieves significantly better performance with a smaller number of genes than SVM-RFE for multi-class datasets. The results suggest that MMC-RFE is less sensitive to noise and outliers due to the use of average margin, and thus may be useful for biomarker discovery from noisy data.
PMCID: PMC1790716  PMID: 17187691
14.  Chemometrics of differentially expressed proteins from colorectal cancer patients 
AIM: To evaluate the usefulness of differentially expressed proteins from colorectal cancer (CRC) tissues for differentiating cancer and normal tissues.
METHODS: A Proteomic approach was used to identify the differentially expressed proteins between CRC and normal tissues. The proteins were extracted using Tris buffer and thiourea lysis buffer (TLB) for extraction of aqueous soluble and membrane-associated proteins, respectively. Chemometrics, namely principal component analysis (PCA) and linear discriminant analysis (LDA), were used to assess the usefulness of these proteins for identifying the cancerous state of tissues.
RESULTS: Differentially expressed proteins identified were 37 aqueous soluble proteins in Tris extracts and 24 membrane-associated proteins in TLB extracts. Based on the protein spots intensity on 2D-gel images, PCA by applying an eigenvalue > 1 was successfully used to reduce the number of principal components (PCs) into 12 and seven PCs for Tris and TLB extracts, respectively, and subsequently six PCs, respectively from both the extracts were used for LDA. The LDA classification for Tris extract showed 82.7% of original samples were correctly classified, whereas 82.7% were correctly classified for the cross-validated samples. The LDA for TLB extract showed that 78.8% of original samples and 71.2% of the cross-validated samples were correctly classified.
CONCLUSION: The classification of CRC tissues by PCA and LDA provided a promising distinction between normal and cancer types. These methods can possibly be used for identification of potential biomarkers among the differentially expressed proteins identified.
PMCID: PMC3084394  PMID: 21547128
Colorectal cancer; Proteomics; Marker protein; Principal component analysis; Linear discriminant analysis
15.  Diagnostic potential of near-infrared Raman spectroscopy in the stomach: differentiating dysplasia from normal tissue 
British Journal of Cancer  2008;98(2):457-465.
Raman spectroscopy is a molecular vibrational spectroscopic technique that is capable of optically probing the biomolecular changes associated with diseased transformation. The purpose of this study was to explore near-infrared (NIR) Raman spectroscopy for identifying dysplasia from normal gastric mucosa tissue. A rapid-acquisition dispersive-type NIR Raman system was utilised for tissue Raman spectroscopic measurements at 785 nm laser excitation. A total of 76 gastric tissue samples obtained from 44 patients who underwent endoscopy investigation or gastrectomy operation were used in this study. The histopathological examinations showed that 55 tissue specimens were normal and 21 were dysplasia. Both the empirical approach and multivariate statistical techniques, including principal components analysis (PCA) and linear discriminant analysis (LDA), together with the leave-one-sample-out cross-validation method, were employed to develop effective diagnostic algorithms for classification of Raman spectra between normal and dysplastic gastric tissues. High-quality Raman spectra in the range of 800–1800 cm⁻¹ can be acquired from gastric tissue within 5 s. There are specific spectral differences in Raman spectra between normal and dysplasia tissue, particularly in the spectral ranges of 1200–1500 cm⁻¹ and 1600–1800 cm⁻¹, which contained signals related to amide III and amide I of proteins, CH₃CH₂ twisting of proteins/nucleic acids, and the C=C stretching mode of phospholipids, respectively. The empirical diagnostic algorithm based on the ratio of the Raman peak intensity at 875 cm⁻¹ to the peak intensity at 1450 cm⁻¹ gave a diagnostic sensitivity of 85.7% and specificity of 80.0%, whereas the diagnostic algorithms based on PCA-LDA yielded a diagnostic sensitivity of 95.2% and specificity of 90.9% for separating dysplasia from normal gastric tissue.
Receiver operating characteristic (ROC) curves further confirmed that the most effective diagnostic algorithm can be derived from the PCA-LDA technique. Therefore, NIR Raman spectroscopy in conjunction with multivariate statistical technique has potential for rapid diagnosis of dysplasia in the stomach based on the optical evaluation of spectral features of biomolecules.
PMCID: PMC2361456  PMID: 18195711
dysplasia; near-infrared Raman spectroscopy; optical diagnosis; stomach; principal components analysis; linear discriminant analysis
16.  Optimized approach to decision fusion of heterogeneous data for breast cancer diagnosis 
Medical physics  2006;33(8):2945-2954.
As more diagnostic testing options become available to physicians, it becomes more difficult to combine various types of medical information together in order to optimize the overall diagnosis. To improve diagnostic performance, here we introduce an approach to optimize a decision-fusion technique to combine heterogeneous information, such as from different modalities, feature categories, or institutions. For classifier comparison we used two performance metrics: the area under the receiver operating characteristic (ROC) curve (AUC) and the normalized partial area under the curve (pAUC). This study used four classifiers: linear discriminant analysis (LDA), artificial neural network (ANN), and two variants of our decision-fusion technique, AUC-optimized (DF-A) and pAUC-optimized (DF-P) decision fusion. We applied each of these classifiers with 100-fold cross-validation to two heterogeneous breast cancer data sets: one of mass lesion features and a much more challenging one of microcalcification lesion features. For the calcification data set, DF-A outperformed the other classifiers in terms of AUC (p<0.02) and achieved AUC=0.85±0.01. The DF-P surpassed the other classifiers in terms of pAUC (p<0.01) and reached pAUC=0.38±0.02. For the mass data set, DF-A outperformed both the ANN and the LDA (p<0.04) and achieved AUC=0.94±0.01. Although for this data set there were no statistically significant differences among the classifiers' pAUC values (pAUC=0.57±0.07 to 0.67±0.05, p>0.10), the DF-P did significantly improve specificity versus the LDA at both 98% and 100% sensitivity (p<0.04). In conclusion, decision fusion directly optimized clinically significant performance measures, such as AUC and pAUC, and sometimes outperformed two well-known machine-learning techniques when applied to two different breast cancer data sets.
PMCID: PMC2569003  PMID: 16964873
decision fusion; heterogeneous data; receiver operating characteristic (ROC) curve; area under the curve (AUC); partial area under the curve (pAUC); classification; machine learning; breast cancer
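The AUC metric used for classifier comparison above can be computed directly from classifier scores as the probability that a randomly chosen positive case scores higher than a randomly chosen negative case, with ties counted as one half. The sketch below shows only that pairwise definition; the function name and the toy scores are assumptions, and the normalized pAUC is not reproduced because the abstract does not give its exact definition.

```python
def auc(pos_scores, neg_scores):
    """Area under the ROC curve via pairwise comparison of positive and negative scores."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0   # positive case ranked above negative case
            elif p == n:
                wins += 0.5   # ties count as half a win
    return wins / (len(pos_scores) * len(neg_scores))

# Hypothetical classifier outputs for malignant (positive) and benign (negative) cases.
malignant = [0.9, 0.8, 0.4]
benign = [0.7, 0.3, 0.2]
print(auc(malignant, benign))
```

The double loop is quadratic in the number of cases, which is fine for a sketch; a rank-based computation gives the same value more efficiently on large data sets.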
17.  A comparison of machine learning algorithms for chemical toxicity classification using a simulated multi-scale data model 
BMC Bioinformatics  2008;9:241.
Bioactivity profiling using high-throughput in vitro assays can reduce the cost and time required for toxicological screening of environmental chemicals and can also reduce the need for animal testing. Several public efforts are aimed at discovering patterns or classifiers in high-dimensional bioactivity space that predict tissue, organ or whole animal toxicological endpoints. Supervised machine learning is a powerful approach to discover combinatorial relationships in complex in vitro/in vivo datasets. We present a novel model to simulate complex chemical-toxicology data sets and use this model to evaluate the relative performance of different machine learning (ML) methods.
The classification performance of Artificial Neural Networks (ANN), K-Nearest Neighbors (KNN), Linear Discriminant Analysis (LDA), Naïve Bayes (NB), Recursive Partitioning and Regression Trees (RPART), and Support Vector Machines (SVM) in the presence and absence of filter-based feature selection was analyzed using K-way cross-validation testing and independent validation on simulated in vitro assay data sets with varying levels of model complexity, number of irrelevant features and measurement noise. While the prediction accuracy of all ML methods decreased as non-causal (irrelevant) features were added, some ML methods performed better than others. In the limit of using a large number of features, ANN and SVM were always in the top performing set of methods while RPART and KNN (k = 5) were always in the poorest performing set. The addition of measurement noise and irrelevant features decreased the classification accuracy of all ML methods, with LDA suffering the greatest performance degradation. LDA performance is especially sensitive to the use of feature selection. Filter-based feature selection generally improved performance, most strikingly for LDA.
We have developed a novel simulation model to evaluate machine learning methods for the analysis of data sets in which in vitro bioassay data is being used to predict in vivo chemical toxicology. From our analysis, we can recommend that several ML methods, most notably SVM and ANN, are good candidates for use in real world applications in this area.
PMCID: PMC2409339  PMID: 18489778
18.  Gait analysis to classify external load conditions using linear discriminant analysis 
Human movement science  2009;28(2):226-235.
There are many instances where it is desirable to determine, at a distance, whether a subject is carrying a hidden load. Automated detection systems based on gait analysis have been proposed to detect subjects that carry hidden loads. However, very little baseline gait kinematic analysis has been performed to determine the load carriage effect while ambulating with evenly distributed (front to back) loads on human gait. The work in this paper establishes, via high resolution motion capture trials, the baseline separability of load carriage conditions into loaded and unloaded categories using several standard lower body kinematic parameters. A total of 23 participants (19 for training and 4 for testing) were studied. Satisfactory classification of participants into the correct loading condition was achieved by employing linear discriminant analysis (LDA). Six lower body kinematic parameters including ranges of motion and path lengths from the phase portraits were used to train the LDA to discriminate loaded and unloaded walking conditions. Baseline performance from 4 participants who were not included in training data sets show that the use of LDA provides a 92.5% correct classification over two loaded and unloaded walking conditions. The results suggest that there are gait pattern changes due to external loads, and LDA could be applied successfully to classify the gait patterns with an unknown load condition.
PMCID: PMC2908309  PMID: 19162355
Locomotion; Gait analysis; External loads; Linear discriminant analysis
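The two-class discrimination described in this abstract can be illustrated with a minimal Fisher LDA sketch. It is not the authors' implementation: the function names, the two-feature toy values, and the decision threshold at the midpoint of the class means (which assumes equal priors and a shared covariance) are all assumptions made for illustration.

```python
from typing import List, Sequence, Tuple

Vector = List[float]
Matrix = List[List[float]]


def mean_vector(rows: Sequence[Sequence[float]]) -> Vector:
    """Column-wise mean of a list of feature vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def within_class_scatter(groups: Sequence[Sequence[Sequence[float]]],
                         means: Sequence[Vector]) -> Matrix:
    """Sum over classes of (x - mean)(x - mean)^T."""
    d = len(means[0])
    s = [[0.0] * d for _ in range(d)]
    for rows, mu in zip(groups, means):
        for row in rows:
            diff = [x - m for x, m in zip(row, mu)]
            for i in range(d):
                for j in range(d):
                    s[i][j] += diff[i] * diff[j]
    return s


def solve(a: Matrix, b: Vector) -> Vector:
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [bi] for row, bi in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("within-class scatter matrix is singular")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit_lda(class0: Sequence[Sequence[float]],
            class1: Sequence[Sequence[float]]) -> Tuple[Vector, float]:
    """Return the Fisher direction w and a midpoint decision threshold."""
    mu0, mu1 = mean_vector(class0), mean_vector(class1)
    sw = within_class_scatter([class0, class1], [mu0, mu1])
    w = solve(sw, [b - a for a, b in zip(mu0, mu1)])
    midpoint = [(a + b) / 2 for a, b in zip(mu0, mu1)]
    threshold = sum(wi * mi for wi, mi in zip(w, midpoint))
    return w, threshold


def predict(w: Vector, threshold: float, x: Sequence[float]) -> int:
    """Return 1 (e.g. loaded) if the projection exceeds the threshold, else 0."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) > threshold else 0


# Hypothetical toy values for two kinematic features; not data from the study.
UNLOADED = [[1.0, 2.0], [1.5, 1.8], [2.0, 2.2], [1.2, 2.5]]
LOADED = [[3.0, 4.0], [3.5, 3.8], [4.0, 4.4], [3.2, 4.1]]
```

A held-out participant would be scored with `w, t = fit_lda(UNLOADED, LOADED)` followed by `predict(w, t, features)`. The study used six kinematic parameters; the same functions accept vectors of any length, provided the scatter matrix is non-singular.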
19.  Evaluation of low density array technology for quantitative parallel measurement of multiple genes in human tissue 
BMC Genomics  2006;7:34.
Low density arrays (LDAs) have recently been introduced as a novel approach to gene expression profiling. Based on real time quantitative RT-PCR (QRT-PCR), these arrays enable a more focused and sensitive approach to the study of gene expression than gene chips, while offering higher throughput than more established approaches to QRT-PCR. We have now evaluated LDAs as a means of determining the expression of multiple genes simultaneously in human tissues and cells.
Comparisons between LDAs reveal low variability, with correlation coefficients close to 1. By performing 2-fold and 10-fold serial dilutions of cDNA samples in the LDAs we determined a clear linear relationship between the gene expression data points over 5 orders of magnitude. We also showed that it is possible to use LDAs to accurately and quantitatively detect 2-fold changes in target copy number as well as measuring genes that are expressed with low and high copy numbers in the range of 1 × 10^2 – 1 × 10^6 copies. Furthermore, the data generated by the LDA from a cell based pharmacological study were comparable to data generated by conventional QRT-PCR.
LDAs represent a valuable new approach for sensitive and quantitative gene expression profiling.
PMCID: PMC1403755  PMID: 16504128
20.  Comparison of Predicted Probabilities of Proportional Hazards Regression and Linear Discriminant Analysis Methods Using a Colorectal Cancer Molecular Biomarker Database 
Cancer Informatics  2007;3:115-122.
Although a majority of studies in cancer biomarker discovery claim to use proportional hazards regression (PHREG) to study the ability of a biomarker to predict survival, few studies use the predicted probabilities obtained from the model to test the quality of the model. In this paper, we compared the quality of predictions by a PHREG model to that of a linear discriminant analysis (LDA) in both training and test set settings.
The PHREG and LDA models were built on a dataset of 491 colorectal cancer (CRC) patients comprising demographic and clinicopathologic variables, and phenotypic expression of p53 and Bcl-2. Two variable selection methods, stepwise discriminant analysis and backward selection, were used to identify the final models. The endpoint of prediction in these models was five-year post-surgery survival. We also used a linear regression model to examine the effect of bin size in the training set on the accuracy of prediction in the test set.
The two variable selection techniques resulted in different models when stage was included in the list of variables available for selection. However, the proportion of survivors and non-survivors correctly identified was identical in both of these models. When stage was excluded from the variable list, the error rate for the LDA model was 42% as compared to an error rate of 34% for the PHREG model.
This study suggests that a PHREG model can perform as well as or better than a traditional classifier such as LDA in classifying patients into prognostic classes. Also, this study suggests that in the absence of the tumor stage as a variable, Bcl-2 expression is a strong prognostic molecular marker of CRC.
PMCID: PMC2675853  PMID: 19455238
predictive models; linear discriminant analysis; proportional hazards regression; colorectal cancer; survival
21.  Stable feature selection and classification algorithms for multiclass microarray data 
Biology Direct  2012;7:33.
Recent studies suggest that gene expression profiles are a promising alternative for clinical cancer classification. One major problem in applying DNA microarrays for classification is the dimension of the obtained data sets. In this paper we propose a multiclass gene selection method based on Partial Least Squares (PLS) for selecting genes for classification. The new idea is to solve the multiclass selection problem with the PLS method and decomposition into a set of two-class sub-problems: one versus rest (OvR) and one versus one (OvO). We also apply OvR and OvO two-class decomposition to other recently published gene selection methods. Ranked gene lists are highly unstable in the sense that a small change of the data set often leads to big changes in the obtained ordered lists. In this paper, we assess the stability of the proposed methods. We use the linear support vector machines (SVM) technique in different variants: one versus one, one versus rest, multiclass SVM (MSVM) and the linear discriminant analysis (LDA) as a classifier. We use balanced bootstrap to estimate the prediction error and to test the variability of the obtained ordered lists.
This paper focuses on effective identification of informative genes. As a result, a new strategy to find a small subset of significant genes is designed. Our results on real multiclass cancer data show that our method has a very high accuracy rate for different combinations of classification methods, giving concurrently very stable feature rankings.
This paper shows that the proposed strategies can improve the performance of selected gene sets substantially. OvR and OvO techniques applied to existing gene selection methods improve results as well. The presented method makes it possible to obtain a more reliable classifier with lower classification error. At the same time, the method generates more stable ordered feature lists in comparison with existing methods.
This article was reviewed by Prof Marek Kimmel, Dr Hans Binder (nominated by Dr Tomasz Lipniacki) and Dr Yuriy Gusev
PMCID: PMC3599581  PMID: 23031190
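The balanced bootstrap mentioned in this abstract can be sketched as follows. This is a generic illustration of the resampling scheme, in which every sample index appears exactly the same number of times across all replicates; it is not the authors' code, and the function names, the seed, and the out-of-bag helper are assumptions.

```python
import random
from typing import List, Sequence


def balanced_bootstrap(n_samples: int, n_replicates: int,
                       seed: int = 0) -> List[List[int]]:
    """Draw n_replicates resamples of size n_samples in which each index
    occurs exactly n_replicates times in total across all resamples."""
    rng = random.Random(seed)
    # Concatenate n_replicates copies of every index, permute, then cut
    # the permuted pool into consecutive blocks of n_samples indices.
    pool = list(range(n_samples)) * n_replicates
    rng.shuffle(pool)
    return [pool[i * n_samples:(i + 1) * n_samples]
            for i in range(n_replicates)]


def out_of_bag(replicate: Sequence[int], n_samples: int) -> List[int]:
    """Indices never drawn in this replicate; usable as a test set."""
    drawn = set(replicate)
    return [i for i in range(n_samples) if i not in drawn]
```

In a sketch like this, a classifier would be trained on each replicate and its error estimated on the corresponding out-of-bag indices; comparing the ranked gene lists obtained from different replicates gives one way to gauge ranking stability.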
22.  Defining and Evaluating Classification Algorithm for High-Dimensional Data Based on Latent Topics 
PLoS ONE  2014;9(1):e82119.
Automatic text categorization is one of the key techniques in information retrieval and the data mining field. The classification is usually time-consuming when the training dataset is large and high-dimensional. Many methods have been proposed to solve this problem, but few can achieve satisfactory efficiency. In this paper, we present a method which combines the Latent Dirichlet Allocation (LDA) algorithm and the Support Vector Machine (SVM). LDA is first used to generate a reduced-dimensional representation of topics as features in the vector space model (VSM). It reduces the number of features dramatically but keeps the necessary semantic information. The SVM is then employed to classify the data based on the generated features. We evaluate the algorithm on the 20 Newsgroups and Reuters-21578 datasets. The experimental results show that the classification based on our proposed LDA+SVM model achieves high performance in terms of precision, recall and F1 measure. Further, it can achieve this within a much shorter time-frame. Our process improves greatly upon the previous work in this field and displays strong potential to achieve a streamlined classification process for a wide range of applications.
PMCID: PMC3886981  PMID: 24416136
23.  Sentinel lymph node biopsy for breast cancer using methylene blue dye manifests a short learning curve among experienced surgeons: a prospective tabular cumulative sum (CUSUM) analysis 
BMC Surgery  2009;9:2.
The benefits of sentinel lymph node biopsy (SLNB) for breast cancer patients with histologically negative axillary nodes, in whom axillary lymph node dissection (ALND) is thereby avoided, are now established. Low false negative rate, certainly with blue dye technique, mostly reflects the established high inherent accuracy of SLNB and low axillary nodal metastatic load (subject to patient selection). SLN identification rate is influenced by volume, injection site and choice of mapping agent, axillary nodal metastatic load, SLN location and skill at axillary dissection. Being more subject to technical failure, SLN identification seems to be a more reasonable variable for learning curve assessment than false negative rate.
Methylene blue is as good an SLN mapping agent as Isosulfan blue and is much cheaper. Addition of radio-colloid mapping to blue dye does not achieve a sufficiently higher identification rate to justify the cost. Methylene blue is therefore the agent of choice for SLN mapping in developing countries.
The American Society of Breast Surgeons recommends that, for competence, surgeons should perform 20 SLNB but admits that the learning curve with a standardized technique may be "much shorter". One appropriate remedy for this dilemma is to plot individual learning curves.
Using methylene blue dye, experienced breast surgeons performed SLNB in selected patients with breast cancer (primary tumor < 5 cm and clinically negative ipsilateral axilla). Intraoperative assessment and completion ALND were performed for standardization on the first 13 of 24 cases. SLN identification was plotted for each surgeon on a tabular cumulative sum (CUSUM) chart with sequential probability ratio test (SPRT) limits based on a target identification rate of 85%.
The CUSUM plot crossed the SPRT limit line after 8 consecutive, positively identified SLN, signaling achievement of an acceptable level of competence.
Tabular CUSUM charting, based on a justified choice of parameters, indicates that the learning curve for SLNB using methylene blue dye is completed after 8 consecutive, positively identified SLN. CUSUM charting may be used to plot individual learning curves for trainee surgeons by applying a proxy parameter for failure in the presence of a mentor (such as failed SLN identification within 15 minutes).
PMCID: PMC2640353  PMID: 19173714
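The CUSUM charting with SPRT limits described in this abstract can be sketched as below, using one common formulation from the learning-curve literature. The abstract gives only the 85% target identification rate (an acceptable failure rate of 0.15); the unacceptable failure rate of 0.30 and the error rates alpha = beta = 0.10 are hypothetical values chosen for illustration, as are all function names.

```python
import math
from typing import List, Optional, Sequence, Tuple


def sprt_parameters(p0: float, p1: float, alpha: float,
                    beta: float) -> Tuple[float, float, float]:
    """Return (s, h0, h1): the per-case decrement and the lower/upper limits.

    p0 is the acceptable failure rate, p1 the unacceptable failure rate,
    alpha and beta the type I and type II error rates.
    """
    log_fail = math.log(p1 / p0)
    log_success = math.log((1 - p0) / (1 - p1))
    total = log_fail + log_success
    s = log_success / total
    h0 = math.log((1 - alpha) / beta) / total
    h1 = math.log((1 - beta) / alpha) / total
    return s, h0, h1


def cusum_path(successes: Sequence[bool], s: float) -> List[float]:
    """Running sum: each success subtracts s, each failure adds (1 - s)."""
    path, total = [], 0.0
    for ok in successes:
        total += -s if ok else (1.0 - s)
        path.append(total)
    return path


def first_acceptance(path: Sequence[float], h0: float) -> Optional[int]:
    """1-based case number at which the path first reaches the lower limit."""
    for case_number, value in enumerate(path, start=1):
        if value <= -h0:
            return case_number
    return None


# p0 = 0.15 follows from the 85% target; p1, alpha and beta are assumptions.
S, H0, H1 = sprt_parameters(p0=0.15, p1=0.30, alpha=0.10, beta=0.10)
```

With these hypothetical parameters, an unbroken run of successes reaches the lower limit at case 12 rather than the 8 cases reported, which is expected because the study's own choice of parameters is not stated in the abstract.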
24.  Predicting low disease activity and remission using early treatment response to anti-TNF therapy in patients with rheumatoid arthritis: Exploratory analyses from the TEMPO trial 
Annals of the rheumatic diseases  2011;71(2):206-212.
To derive and validate decision trees to categorize rheumatoid arthritis (RA) patients 12 weeks after starting etanercept with or without methotrexate into three groups: patients predicted to achieve low disease activity (LDA) at 1 year; patients predicted to not achieve LDA at 1 year; and patients who needed additional time on therapy to be categorized.
Data from RA patients enrolled in TEMPO were analyzed. Classification and Regression Trees were used to develop and validate decision-tree models with week 12 and earlier assessments that predicted long-term LDA. LDA, defined as DAS28 ≤ 3.2 or Clinical Disease Activity Index (CDAI) ≤ 10.0, was measured at 52 or 48 weeks. Demographics, laboratory data, and clinical data at baseline and through week 12 were analyzed as predictors of response.
Thirty-nine percent (67/172) of patients receiving etanercept and 60% (115/193) of patients receiving etanercept plus methotrexate achieved LDA at week 52. For patients receiving etanercept, 53% were predicted to have LDA, 39% were predicted to not have LDA, and 8% could not be categorized using DAS28 criteria at week 12. For patients receiving etanercept plus methotrexate, 63% were predicted to have LDA, 25% were predicted to not have LDA, and 12% could not be categorized.
Most (80%–90%) patients in TEMPO initiating etanercept with or without methotrexate could be predicted within 12 weeks of starting therapy as likely to have LDA or not at week 52. However, approximately 10%–20% of patients needed additional time on therapy to decide whether to continue treatment.
PMCID: PMC3698970  PMID: 21998118
etanercept; methotrexate; arthritis; rheumatoid; decision tree; prediction
25.  The PCA and LDA Analysis on the Differential Expression of Proteins in Breast Cancer 
Disease markers  2011;29(5):231-242.
Breast cancer is a leading cause of mortality in women. In Malaysia, it is the most common cancer to affect women. The most common form of breast cancer is infiltrating ductal carcinoma (IDC). A proteomic approach was undertaken to identify protein profile changes between cancerous and normal breast tissues from 18 patients. Two protein extracts, aqueous soluble and membrane associated, were studied. Thirty-four differentially expressed proteins were identified. The intensities of the proteins were used as variables in PCA, and the reduced data of six principal components (PCs) were subjected to LDA in order to evaluate the potential of these proteins as collective biomarkers for breast cancer. The protein intensities of SEC13-like 1 (isoform b) and calreticulin contributed the most to the first PC while the protein intensities of fibrinogen beta chain precursor and ATP synthase D chain contributed the most to the second PC. Transthyretin precursor and apolipoprotein A-1 precursor contributed the most to the third PC. The results of LDA indicated good classification of samples into normal and cancerous types when the first 6 PCs were used as the variables. The percentage of correct classification was 91.7% for the originally grouped tissue samples and 88.9% for cross-validated samples.
PMCID: PMC3835531  PMID: 21206008
Breast cancer; proteomics; biomarkers; PCA; LDA
