Linear discriminant analysis (LDA) is one of the most popular classification algorithms for brain-computer interfaces (BCI). LDA assumes that the data are Gaussian with equal covariance matrices across the classes concerned. However, this assumption rarely holds in actual BCI applications, where heteroscedastic class distributions are usually observed. This paper proposes an enhanced version of LDA, namely z-score linear discriminant analysis (Z-LDA), which introduces a new decision boundary definition strategy to handle heteroscedastic class distributions. Z-LDA defines the decision boundary through the z-score, using both the mean and the standard deviation of the projected data, so the boundary adapts to heteroscedastic distributions. Results from a simulated dataset and two actual BCI datasets consistently show that Z-LDA achieves significantly higher average classification accuracies than conventional LDA, indicating the superiority of the newly proposed decision boundary definition strategy.
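The idea of a z-score decision rule on projected data can be illustrated with a minimal sketch. It assumes the data have already been projected onto a single LDA direction and that a sample is assigned to the class with the smallest absolute z-score; that exact rule, the function names, and the toy scores are illustrative assumptions, not the formulation of the paper.

```python
from statistics import mean, stdev

def fit_class_stats(projected_by_class):
    """Return {label: (mean, std)} for 1-D projected training scores."""
    return {label: (mean(v), stdev(v)) for label, v in projected_by_class.items()}

def zlda_predict(score, stats):
    """Assumed z-score rule: pick the class with the smallest |z| for this score."""
    return min(stats, key=lambda label: abs(score - stats[label][0]) / stats[label][1])

def midpoint_predict(score, stats):
    """Equal-variance rule for comparison: pick the nearest class mean."""
    return min(stats, key=lambda label: abs(score - stats[label][0]))

# Hypothetical projected scores: class "a" is tight, class "b" is widely spread.
train = {"a": [-1.2, -1.0, -0.8, -1.1, -0.9], "b": [0.0, 4.0, 2.0, -1.0, 5.0]}
stats = fit_class_stats(train)

# A score of 0.2 is closer to the mean of "a" in raw distance,
# but many standard deviations away from it.
print(midpoint_predict(0.2, stats), zlda_predict(0.2, stats))
```

In this toy case the two rules disagree because the spread of class "b" is much larger than that of class "a", which is the heteroscedastic situation the abstract describes.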
When designing programs or software for the implementation of Monte Carlo (MC) hypothesis tests, we can save computation time by using sequential stopping boundaries. Such boundaries imply stopping resampling after relatively few replications if the early replications indicate a very large or very small p-value. We study a truncated sequential probability ratio test (SPRT) boundary and provide a tractable algorithm to implement it. We review two properties desired of any MC p-value, the validity of the p-value and a small resampling risk, where resampling risk is the probability that the accept/reject decision will be different than the decision from complete enumeration. We show how the algorithm can be used to calculate a valid p-value and confidence intervals for any truncated SPRT boundary. We show that a class of SPRT boundaries is minimax with respect to resampling risk and recommend a truncated version of boundaries in that class by comparing their resampling risk (RR) to the RR of fixed boundaries with the same maximum resample size. We study the lack of validity of some simple estimators of p-values and offer a new simple valid p-value for the recommended truncated SPRT boundary. We explore the use of these methods in a practical example and provide the MChtest R package to perform the methods.
Bootstrap; B-value; Permutation; Resampling Risk; Sequential Design; Sequential Probability Ratio Test
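Sequential stopping for a Monte Carlo test can be sketched with a generic truncated Wald SPRT on exceedance indicators. This is a textbook-style illustration under stated assumptions, not the algorithm or the boundaries recommended in the paper or implemented in MChtest: the hypotheses `p0` and `p1`, the error rates `a_err` and `b_err`, and all function names are hypothetical. It returns only the stopping point and decision, since simple p-value estimators after sequential stopping may not be valid.

```python
import math

def truncated_sprt(draw_exceeds, p0, p1, a_err, b_err, n_max):
    """Resample until a Wald boundary is crossed or n_max replications are used.

    draw_exceeds() returns True when a resampled statistic is at least as
    extreme as the observed one. p0 < p1 bracket the significance level.
    Returns (replications used, exceedances, decision label).
    """
    upper = math.log((1 - b_err) / a_err)   # crossing favours p = p1 (large p-value)
    lower = math.log(b_err / (1 - a_err))   # crossing favours p = p0 (small p-value)
    step_hit = math.log(p1 / p0)
    step_miss = math.log((1 - p1) / (1 - p0))
    llr, hits = 0.0, 0
    for n in range(1, n_max + 1):
        if draw_exceeds():
            hits += 1
            llr += step_hit
        else:
            llr += step_miss
        if llr >= upper:
            return n, hits, "p-value above level"
        if llr <= lower:
            return n, hits, "p-value below level"
    return n_max, hits, "truncated"

# Deterministic toy inputs: every replication exceeds, or none does.
always = truncated_sprt(lambda: True, 0.04, 0.06, 0.01, 0.01, 1000)
never_short = truncated_sprt(lambda: False, 0.04, 0.06, 0.01, 0.01, 100)
never_long = truncated_sprt(lambda: False, 0.04, 0.06, 0.01, 0.01, 1000)
print(always, never_short, never_long)
```

When every replication exceeds the observed statistic, the sketch stops after a handful of draws, whereas a run with no exceedances needs far more draws and is cut off if `n_max` is small.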
Making an accurate diagnosis of schizophrenia and related psychoses early in the course of the disease is important for initiating treatment and counseling patients and families. In this study, we developed classification models for early disease diagnosis using structural MRI (sMRI) and neuropsychological (NP) testing. We used sMRI measurements and NP test results from 28 patients with recent-onset schizophrenia and 47 healthy subjects, drawn from the larger sample of the Mind Clinical Imaging Consortium. We developed diagnostic models based on Linear Discriminant Analysis (LDA) following two approaches; namely, (a) stepwise (STP) LDA on the original measurements, and (b) LDA on variables created through Principal Component Analysis (PCA) and selected using the Humphrey-Ilgen parallel analysis. Error estimation of the modeling algorithms was evaluated by leave-one-out external cross-validation. These analyses were performed on sMRI and NP variables separately and in combination. The following classification accuracy was obtained for different variables and modeling algorithms. sMRI only: (a) STP-LDA: 64.3% sensitivity and 76.6% specificity, (b) PCA-LDA: 67.9% sensitivity and 72.3% specificity. NP only: (a) STP-LDA: 71.4% sensitivity and 80.9% specificity, (b) PCA-LDA: 78.5% sensitivity and 91.5% specificity. Combined sMRI-NP: (a) STP-LDA: 64.3% sensitivity and 83.0% specificity, (b) PCA-LDA: 89.3% sensitivity and 93.6% specificity. (i) Maximal diagnostic accuracy was achieved by combining sMRI and NP variables. (ii) NP variables were more informative than sMRI, indicating that cognitive deficits can be detected earlier than volumetric structural abnormalities. (iii) PCA-LDA yielded more accurate classification than STP-LDA. As these sMRI and NP tests are widely available, they can increase accuracy of early intervention strategies and possibly be used in evaluating treatment response.
Schizophrenia; Schizophreniform; Schizoaffective; PCA; LDA; Biomarkers; Neuropsychology; MRI; Cross-validation; Diagnosis; MCIC
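Leave-one-out external cross-validation can be sketched as follows: the whole modeling step is refit with each subject held out, and sensitivity and specificity are tallied over the held-out predictions. The nearest-mean classifier, the one-dimensional features, and the data are hypothetical stand-ins, not the STP-LDA or PCA-LDA models of the study.

```python
def loocv_sens_spec(samples, fit, predict):
    """samples: list of (feature, label) pairs, label 1 = patient, 0 = control."""
    tp = fn = tn = fp = 0
    for i, (x, y) in enumerate(samples):
        # Refit on everything except the held-out case, so any variable
        # selection inside fit() never sees the test subject.
        model = fit(samples[:i] + samples[i + 1:])
        guess = predict(model, x)
        if y == 1:
            tp += guess == 1
            fn += guess != 1
        else:
            tn += guess == 0
            fp += guess != 0
    return tp / (tp + fn), tn / (tn + fp)

def fit_nearest_mean(train):
    """Stand-in model: one mean per class."""
    means = {}
    for label in (0, 1):
        vals = [x for x, y in train if y == label]
        means[label] = sum(vals) / len(vals)
    return means

def predict_nearest_mean(means, x):
    return min(means, key=lambda label: abs(x - means[label]))

# Hypothetical 1-D scores for four controls and four patients.
data = [(0.0, 0), (0.1, 0), (0.2, 0), (0.3, 0),
        (1.0, 1), (1.1, 1), (1.2, 1), (0.2, 1)]
sensitivity, specificity = loocv_sens_spec(data, fit_nearest_mean, predict_nearest_mean)
print(sensitivity, specificity)
```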
In this paper, we propose a sequential probability ratio test (SPRT) to overcome the problem of limited samples in studies related to complex genetic diseases. The results of this novel approach are compared with those obtained from the traditional transmission disequilibrium test (TDT) on simulated data. Whereas TDT classifies single-nucleotide polymorphisms (SNPs) into only two groups (SNPs associated with the disease and the others), SPRT has the flexibility of assigning SNPs to a third group, that is, those for which we do not have enough evidence and should keep sampling. It is shown that SPRT results in smaller ratios of false positives and negatives, as well as better accuracy and sensitivity values for classifying SNPs when compared with TDT. By using SPRT, data with small sample sizes become usable for an accurate association analysis.
transmission disequilibrium test; sequential probability ratio test; SNPs; simulation study; family-based association study
Accurate classification into genotypes is critical in understanding evolution of divergent viruses. Here we report a new approach, MuLDAS, which classifies a query sequence based on the statistical genotype models learned from the known sequences. Thus, MuLDAS utilizes full spectra of well characterized sequences as references, typically of an order of hundreds, in order to estimate the significance of each genotype assignment.
MuLDAS starts by aligning the query sequence to the reference multiple sequence alignment and calculating the subsequent distance matrix among the sequences. They are then mapped to a principal coordinate space by multidimensional scaling, and the coordinates of the reference sequences are used as features in developing linear discriminant models that partition the space by genotype. The genotype of the query is then given as the maximum a posteriori estimate. MuLDAS tests the model confidence by leave-one-out cross-validation and also provides some heuristics for the detection of 'outlier' sequences that fall far outside or in-between genotype clusters. We have tested our method by classifying HIV-1 and HCV nucleotide sequences downloaded from NCBI GenBank, achieving the overall concordance rates of 99.3% and 96.6%, respectively, with the benchmark test dataset retrieved from the respective databases of Los Alamos National Laboratory.
The highly accurate genotype assignment coupled with several measures for evaluating the results makes MuLDAS useful in analyzing the sequences of rapidly evolving viruses such as HIV-1 and HCV. A web-based genotype prediction server is available at http://www.muldas.org/MuLDAS/.
A previously developed neural-machine interface (NMI) based on neuromuscular-mechanical fusion has shown promise for recognizing user locomotion modes; however, errors of the NMI during mode transitions were observed, which may challenge its real-world application. This study aimed to investigate whether prior knowledge of the walking environment could further improve NMI performance. Linear Discriminant Analysis (LDA)-based classifiers were designed to identify user intent based on electromyographic (EMG) signals from residual muscles of leg amputees and ground reaction force (GRF) measured from the prosthetic leg. Prior knowledge of the terrain in front of the user adjusted the prior probability in the discriminant function. Therefore, the boundaries of LDA were adaptive to the prior knowledge of the walking environment. This algorithm was evaluated on a dataset collected from one patient with a transfemoral (TF) amputation. The preliminary results showed that the NMI with adaptive prior probabilities outperformed the NMI without prior knowledge; it produced 98.7% accuracy for identifying tested locomotion modes, accurately predicted all the task transitions with 261–390 ms prediction time, and generated stable decisions during task transitions. These results indicate the potential of using prior knowledge about the walking environment to further improve the NMI for prosthetic legs.
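How a class prior shifts an LDA boundary can be shown with the standard linear discriminant function g_k(x) = x'Pμ_k − ½ μ_k'Pμ_k + ln π_k, where P is the shared inverse covariance. The sketch below is illustrative only: the two mode labels, the class means, the identity inverse covariance, and the prior values are hypothetical and not taken from the study.

```python
import math

def lda_score(x, mu, precision, prior):
    """Linear discriminant score with a log-prior term."""
    p_mu = [sum(row[j] * mu[j] for j in range(len(mu))) for row in precision]
    return (sum(xi * pi for xi, pi in zip(x, p_mu))
            - 0.5 * sum(mi * pi for mi, pi in zip(mu, p_mu))
            + math.log(prior))

def classify(x, means, precision, priors):
    """Return the class label with the highest discriminant score."""
    return max(means, key=lambda k: lda_score(x, means[k], precision, priors[k]))

# Hypothetical two-feature example with an identity inverse covariance.
means = {"level": [0.0, 0.0], "stairs": [2.0, 0.0]}
precision = [[1.0, 0.0], [0.0, 1.0]]
x = [1.2, 0.0]

equal = classify(x, means, precision, {"level": 0.5, "stairs": 0.5})
# Terrain information suggests flat ground ahead, so raise the prior for "level".
informed = classify(x, means, precision, {"level": 0.9, "stairs": 0.1})
print(equal, informed)
```

The same feature vector is assigned to different modes under the two sets of priors, because the ln π_k term moves the decision boundary toward the class judged less likely.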
Dementia is one of the most common neurological disorders among the elderly. Identifying those at high risk of dementia is important for administering early treatment to slow the progression of dementia symptoms. However, accurate classification requires a large amount of feature information per subject. Hence, identification of demented subjects can be cast as a pattern recognition problem on high-dimensional nonlinear datasets. In this paper, we introduce trace ratio linear discriminant analysis (TR-LDA) for dementia diagnosis. An improved ITR algorithm (iITR) is developed to solve the TR-LDA problem. This novel method can be integrated with advanced missing value imputation methods and used to analyze nonlinear datasets in many real-world medical diagnosis problems. Finally, extensive simulations are conducted to show the effectiveness of the proposed method. The results demonstrate that our method achieves higher accuracy in identifying demented patients than other state-of-the-art algorithms.
Dimensionality reduction; feature extraction; medical diagnosis
A growing number of studies based on gene expression data have been reported in great detail; however, one major challenge for methodologists is the choice of classification method. The main purpose of this research was to compare the performance of linear discriminant analysis (LDA) and its modification methods for the classification of cancer based on gene expression data.
The classification performance of linear discriminant analysis (LDA) and its modification methods was evaluated by applying these methods to six public cancer gene expression datasets. These methods included linear discriminant analysis (LDA), prediction analysis for microarrays (PAM), shrinkage centroid regularized discriminant analysis (SCRDA), shrinkage linear discriminant analysis (SLDA) and shrinkage diagonal discriminant analysis (SDDA). The procedures were performed by software R 2.80.
PAM selected fewer feature genes than the other methods from most datasets, the Brain dataset being the exception. Of the two shrinkage discriminant analysis methods, SLDA selected more genes than SDDA from most datasets, the 2-class lung cancer dataset being the exception. SLDA selected more genes than SCRDA from the 2-class lung cancer, SRBCT and Brain datasets; the result was the opposite for the remaining datasets. The average test error of the LDA modification methods was lower than that of LDA.
The classification performance of the LDA modification methods was superior to that of traditional LDA with respect to the average error, and there was no significant difference among these modification methods.
Automated adverse outcome surveillance tools and methods have potential utility in quality improvement and medical product surveillance activities. Their use for assessing hospital performance on the basis of patient outcomes has received little attention. We compared risk-adjusted sequential probability ratio testing (RA-SPRT) implemented in an automated tool to Massachusetts public reports of 30-day mortality after isolated coronary artery bypass graft surgery.
A total of 23,020 isolated adult coronary artery bypass surgery admissions performed in Massachusetts hospitals between January 1, 2002 and September 30, 2007 were retrospectively re-evaluated. The RA-SPRT method was implemented within an automated surveillance tool to identify hospital outliers in yearly increments. We used an overall type I error rate of 0.05, an overall type II error rate of 0.10, and a threshold that signaled if the odds of dying within 30 days after surgery were at least twice those expected. Annual hospital outlier status, based on the state-reported classification, was considered the gold standard. An event was defined as at least one occurrence of a higher-than-expected hospital mortality rate during a given year.
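The stated settings (type I error 0.05, type II error 0.10, odds ratio 2) can be turned into a small worked sketch using the commonly described risk-adjusted SPRT log-likelihood ratio, in which each case contributes ln(OR / (1 − p + OR·p)) after a death and ln(1 / (1 − p + OR·p)) after a survival, with p the expected risk for that patient. This is an illustration under that assumption; the automated tool's actual implementation is not described here, and the function names and the example case stream are hypothetical.

```python
import math

def ra_sprt_limits(alpha, beta):
    """Wald limits (lower, upper) on the cumulative log-likelihood ratio."""
    return math.log(beta / (1 - alpha)), math.log((1 - beta) / alpha)

def ra_sprt_increment(died, expected_risk, odds_ratio):
    """Log-likelihood ratio contribution of one case with a given expected risk."""
    denom = 1 - expected_risk + odds_ratio * expected_risk
    return math.log(odds_ratio / denom) if died else math.log(1 / denom)

def first_signal(cases, odds_ratio, alpha, beta):
    """Return the 1-based index of the case at which the upper limit is crossed, else None.

    Only the upper (alerting) limit is checked; handling of the lower limit
    is omitted to keep the sketch short.
    """
    _, upper = ra_sprt_limits(alpha, beta)
    total = 0.0
    for i, (died, risk) in enumerate(cases, start=1):
        total += ra_sprt_increment(died, risk, odds_ratio)
        if total >= upper:
            return i
    return None

lower, upper = ra_sprt_limits(0.05, 0.10)
# Hypothetical stream: consecutive deaths among patients with 2% expected risk.
signal_at = first_signal([(True, 0.02)] * 10, odds_ratio=2, alpha=0.05, beta=0.10)
print(round(lower, 3), round(upper, 3), signal_at)
```

With these error rates the upper limit is ln 18, about 2.89, so a run of deaths among low-risk patients crosses it quickly, while each survival lowers the cumulative statistic only slightly.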
We examined a total of 83 hospital-year observations. The RA-SPRT method alerted 6 events among three hospitals for 30-day mortality compared with 5 events among two hospitals using the state public reports, yielding a sensitivity of 100% (5/5) and specificity of 98.8% (79/80).
The automated RA-SPRT method performed well, detecting all of the true institutional outliers with a small false positive alerting rate. Such a system could provide confidential automated notification to local institutions in advance of public reporting providing opportunities for earlier quality improvement interventions.
To investigate the effect of electrical stimulation (ES) on the recovery of motor skill and neuronal cell proliferation.
The male Sprague-Dawley rats were implanted with an epidural electrode over the peri-ischemic area after photothrombotic stroke in the dominant sensorimotor cortex. All rats were randomly assigned into the ES group and control group. The behavioral test of a single pellet reaching task (SPRT) and neurological examinations including the Schabitz's photothrombotic neurological score and the Menzies test were conducted for 2 weeks. After 14 days, coronal sections were obtained and immunostained for neuronal cell differentiation markers including bromodeoxyuridine (BrdU), neuron-specific nuclear protein (NeuN), and doublecortin (DCX).
On the SPRT, the motor function in paralytic forelimbs of the ES group was significantly improved. There were no significant differences in neurological examinations and neuronal cell differentiation markers except for the significantly increased number of DCX+ cells in the corpus callosum of the ES group (p<0.05). But in the ES group, the number of NeuN+ cells in the ischemic cortex and the number of NeuN+ cells and DCX+ cells in the ischemic striatum tended to increase. In the ES group, NeuN+ cells in the ischemic hemisphere and DCX+ cells and BrdU+ cells in the opposite hemisphere tended to increase compared to those in the contralateral.
The continuous epidural ES of the ischemic sensorimotor cortex induced a significant improvement in the motor function and tended to increase neural cell proliferation in the ischemic hemisphere and the neural regeneration in the opposite hemisphere.
Cerebral ischemia; Electrical stimulation; Stroke; Cell proliferation; Motor skills
Metabolomic data analysis becomes increasingly challenging when dealing with clinical samples with diverse demographic and genetic backgrounds and various pathological conditions or treatments. Although many classification tools, such as projection to latent structures (PLS), support vector machine (SVM), linear discriminant analysis (LDA), and random forest (RF), have been successfully used in metabolomics, their performance including strengths and limitations in clinical data analysis has not been clear to researchers due to the lack of systematic evaluation of these tools. In this paper we comparatively evaluated the four classifiers, PLS, SVM, LDA, and RF, in the analysis of clinical metabolomic data derived from gas chromatography mass spectrometry platform of healthy subjects and patients diagnosed with colorectal cancer, where cross-validation, R2/Q2 plot, receiver operating characteristic curve, variable reduction, and Pearson correlation were performed. RF outperforms the other three classifiers in the given clinical data sets, highlighting its comparative advantages as a suitable classification and biomarker selection tool for clinical metabolomic data analysis.
To determine the prognostic significance of data collected early after starting certolizumab pegol (CZP) to predict low disease activity (LDA) at Week 52.
Data through Week 12 from 703 CZP-treated patients in the RA PreventIon of structural Damage (RAPID 1) trial were used as variables to predict LDA (DAS28 [ESR] ≤3.2) at Week 52. We identified variables, developed prediction models using classification trees, and tested performance using training and testing datasets. Additional prediction models were constructed using CDAI and an alternate outcome definition (composite of LDA or ACR50).
Using Week 6 and 12 data and across several different prediction models, response (LDA) and nonresponse at 1 year was predicted with relatively high accuracy (70–90%) for most patients. The best performing model predicting nonresponse by 12 weeks was 90% accurate and applied to 46% of the population. Model accuracy for predicted responders (30% of the RAPID1 population) was 74%. The area under the receiver operator curve was 0.76. Depending on the desired certainty of prediction at 12 weeks, ~12–24% of patients required >12 weeks of treatment to be accurately classified. CDAI-based models, and those evaluating the composite outcome (LDA or ACR50), achieved comparable accuracy.
We could accurately predict within 12 weeks of starting CZP whether most established RA patients with high baseline disease activity would likely achieve/not achieve LDA at 1 year. Decision trees may be useful to guide prospective management for RA patients treated with CZP and other biologics.
We consider the supervised classification setting, in which the data consist of p features measured on n observations, each of which belongs to one of K classes. Linear discriminant analysis (LDA) is a classical method for this problem. However, in the high-dimensional setting where p ≫ n, LDA is not appropriate for two reasons. First, the standard estimate for the within-class covariance matrix is singular, and so the usual discriminant rule cannot be applied. Second, when p is large, it is difficult to interpret the classification rule obtained from LDA, since it involves all p features. We propose penalized LDA, a general approach for penalizing the discriminant vectors in Fisher’s discriminant problem in a way that leads to greater interpretability. The discriminant problem is not convex, so we use a minorization-maximization approach in order to efficiently optimize it when convex penalties are applied to the discriminant vectors. In particular, we consider the use of L1 and fused lasso penalties. Our proposal is equivalent to recasting Fisher’s discriminant problem as a biconvex problem. We evaluate the performances of the resulting methods on a simulation study, and on three gene expression data sets. We also survey past methods for extending LDA to the high-dimensional setting, and explore their relationships with our proposal.
classification; feature selection; high dimensional; lasso; linear discriminant analysis; supervised learning
Classical statistical approaches for multiclass probability estimation are typically based on regression techniques such as multiple logistic regression, or density estimation approaches such as linear discriminant analysis (LDA) and quadratic discriminant analysis (QDA). These methods often make certain assumptions on the form of probability functions or on the underlying distributions of subclasses. In this article, we develop a model-free procedure to estimate multiclass probabilities based on large-margin classifiers. In particular, the new estimation scheme is employed by solving a series of weighted large-margin classifiers and then systematically extracting the probability information from these multiple classification rules. A main advantage of the proposed probability estimation technique is that it does not impose any strong parametric assumption on the underlying distribution and can be applied for a wide range of large-margin classification methods. A general computational algorithm is developed for class probability estimation. Furthermore, we establish asymptotic consistency of the probability estimates. Both simulated and real data examples are presented to illustrate competitive performance of the new approach and compare it with several other existing methods.
Fisher consistency; Hard classification; Multicategory classification; Probability estimation; Soft classification; SVM
To derive and validate decision trees to categorize rheumatoid arthritis (RA) patients 12 weeks after starting etanercept with or without methotrexate into three groups: patients predicted to achieve low disease activity (LDA) at 1 year; patients predicted to not achieve LDA at 1 year; and patients who needed additional time on therapy to be categorized.
Data from RA patients enrolled in TEMPO were analyzed. Classification and Regression Trees were used to develop and validate decision-tree models with week 12 and earlier assessments that predicted long-term LDA. LDA, defined as DAS28 ≤ 3.2 or Clinical Disease Activity Index (CDAI) ≤ 10.0, was measured at 52 or 48 weeks. Demographics, laboratory data, and clinical data at baseline and through week 12 were analyzed as predictors of response.
Thirty-nine percent (67/172) of patients receiving etanercept and 60% (115/193) of patients receiving etanercept plus methotrexate achieved LDA at week 52. For patients receiving etanercept, 53% were predicted to have LDA, 39% were predicted to not have LDA, and 8% could not be categorized using DAS28 criteria at week 12. For patients receiving etanercept plus methotrexate, 63% were predicted to have LDA, 25% were predicted to not have LDA, and 12% could not be categorized.
Most (80%–90%) patients in TEMPO initiating etanercept with or without methotrexate could be predicted within 12 weeks of starting therapy as likely to have LDA or not at week 52. However, approximately 10%–20% of patients needed additional time on therapy to decide whether to continue treatment.
etanercept; methotrexate; arthritis; rheumatoid; decision tree; prediction
In the preceding decade, various studies on glioblastoma (Gb) demonstrated that
signatures obtained from gene expression microarrays correlate better with survival than
with histopathological classification. However, there is not a universal consensus
formula to predict patient survival.
We developed a gene signature using the expression profile of 47 Gbs through an
unsupervised procedure and two groups were obtained. Subsequent to a training procedure
through leave-one-out cross-validation, we fitted a discriminant (linear discriminant
analysis (LDA)) equation using the four most discriminant probesets. This was repeated
for two other published signatures and the performance of LDA equations was evaluated on
an independent test set, which contained status of IDH1 mutation, EGFR
amplification, MGMT methylation and gene VEGF expression, among other
clinical and molecular information.
The unsupervised local signature was composed of 69 probesets and clearly defined two
Gb groups, which would agree with primary and secondary Gbs. This hypothesis was
confirmed by predicting cases from the independent data set using the equations
developed by us. The high survival group predicted by equations based on our local and
one of the published signatures contained a significantly higher percentage of cases
displaying IDH1 mutation and non-amplification of EGFR. In contrast,
only the equation based on the published signature showed in the poor survival group a
significant high percentage of cases displaying a hypothesised methylation of
MGMT gene promoter and overexpression of gene VEGF.
We have produced a robust equation to confidently discriminate Gb subtypes based on the
normalised expression level of only four genes.
In order to describe multiclass classification performance, several figures of merit (FOM) have been proposed. Among the earliest and most widely known of these is the three-class Hotelling trace (3-HT). The goal of this paper is to present theoretical and empirical data demonstrating the failure of 3-HT as a measure of three-class task performance. To help do this, we contrast it with a newly proposed three-class FOM, the volume under the three-class receiver operating characteristic (ROC) surface (VUS). The VUS is obtained from a decision theory based three-class ROC analysis method which has been proved to extend the decision theoretic, linear discriminant analysis (LDA), and psychophysical foundations of binary ROC analysis to a three-class paradigm. We demonstrate empirically that the VUS and 3-HT do not have a monotonic relationship in general when describing three-class task performance. Numerical experiments demonstrated that the VUS provided reasonable results, while the 3-HT failed to distinguish the case where all objects could be perfectly classified from the case where only one pair of the classes could be perfectly classified. We have provided theoretical explanations of this failure of 3-HT. The significance of this work goes beyond merely demonstrating the problems of the 3-HT; it demonstrates that a FOM that is mathematically correct and has a strong theoretical basis can provide results that violate a common sense understanding of three-class task performance. This fact raises the question of “how to evaluate a classification performance evaluation method?” We believe the answer to this question lies in the theoretical foundations of binary ROC analysis. We have thus contrasted the two FOMs in terms of three fundamental theories underlying binary ROC analysis: decision theory, binary linear discriminant analysis, and the equivalence of two psychophysical classification procedures.
These theoretical investigations demonstrated the importance of extending and unifying all the fundamental theories of binary classification in the development of a three-class FOM; violating one of these fundamental binary classification theories may, as it did for the L-HT, provide predictions of three-class task performance that do not agree with a common sense understanding of three-class task performance.
L-class Hotelling trace; L-class linear discriminant analysis; receiver operating characteristic (ROC) analysis; three-class classification
Purpose of review
In theory, use of aspirin in IVF is based on its anti-inflammatory, vasodilatory, and platelet aggregation inhibition properties, which improve blood flow to a woman's implantation site. It is hypothesized that this effect on blood flow will improve success rates.
Clinical studies investigating the use of low-dose aspirin (LDA) as an adjuvant therapy to IVF have produced conflicting results. The conflicting results have come as a consequence of the heterogeneous mixture of clinical trials with lack of adequate power. Even after multiple meta-analyses, differing estimates of effect were calculated as to whether aspirin should be used in conjunction with IVF.
Conflicting results leave the question of the effects of LDA in IVF unanswered. More trials are required for analysis to have adequate statistical power and until then the data remain unclear. At this point, there are not enough data to show that aspirin has a beneficial effect on the outcomes of IVF, but absence of effect is not adequate grounds to overturn the current clinical practice for those using LDA in efforts aimed at achieving success with IVF.
aspirin; implantation; in-vitro fertilization; pregnancy
Translesion synthesis (TLS) employs low fidelity polymerases to replicate past damaged DNA in a potentially error-prone process. Regulatory mechanisms that prevent TLS-associated mutagenesis are unknown; however, our recent studies suggest that the PCNA-binding protein Spartan plays a role in suppression of damage-induced mutagenesis. Here, we show that Spartan negatively regulates error-prone TLS that is dependent on POLD3, the accessory subunit of the replicative DNA polymerase Pol δ. We demonstrate that the putative zinc metalloprotease domain SprT in Spartan directly interacts with POLD3 and contributes to suppression of damage-induced mutagenesis. Depletion of Spartan induces complex formation of POLD3 with Rev1 and the error-prone TLS polymerase Pol ζ, and elevates mutagenesis that relies on POLD3, Rev1 and Pol ζ. These results suggest that Spartan negatively regulates POLD3 function in Rev1/Pol ζ-dependent TLS, revealing a previously unrecognized regulatory step in error-prone TLS.
AdpA is the key transcriptional activator for a number of genes of various functions in the A-factor regulatory cascade in Streptomyces griseus, forming an AdpA regulon. Trypsin-like activity was detected at a late stage of growth in the wild-type strain but not in an A-factor-deficient mutant. Consistent with these observations, two trypsin genes, sprT and sprU, in S. griseus were found to be members of the AdpA regulon; AdpA activated the transcription of both genes by binding to the operators located at about −50 nucleotide positions with respect to the transcriptional start point. The transcription of sprT and sprU, induced by AdpA, was most active at the onset of sporulation. Most trypsin activity exerted by S. griseus was attributed to SprT, because trypsin activity in an sprT-disrupted mutant was greatly reduced but that in an sprU-disrupted mutant was only slightly reduced. This was consistent with the observation that the amount of the sprT mRNA was much greater than that of the sprU transcript. Disruption of both sprT and sprU (mutant ΔsprTU) reduced trypsin activity to almost zero, indicating that no trypsin genes other than these two were present in S. griseus. Even the double mutant ΔsprTU grew normally and developed aerial hyphae and spores over the same time course as the wild-type strain.
AIM: To clarify the gender differences in the clinical features and risk factors of low-dose aspirin (LDA) (81-100 mg daily)-associated peptic ulcer in Japanese patients.
METHODS: There were 453 patients under treatment with LDA (298 males, 155 females) who underwent esophagogastroduodenoscopy at the Department of Gastroenterology and Hepatology of Hiratsuka City Hospital between January 2003 and December 2007. They had kept taking the LDA or started treatment during the study period and kept taking LDA during the whole period of observation. Of these, 119 patients (87 males, 32 females) were diagnosed as having LDA-associated peptic ulcer. We examined the clinical factors associated with LDA-associated peptic ulcer in both sexes.
RESULTS: A history of peptic ulcer was found to be the risk factor for LDA-associated peptic ulcer common to both sexes. In female patients, age greater than 70 years (prevalence ORs 8.441, 95% CI: 1.797-33.649, P = 0.0069) was found to be another significant risk factor, and the time to diagnosis as having LDA-associated peptic ulcer by endoscopy was significantly shorter than that in the male patients (P = 0.0050).
CONCLUSION: We demonstrated gender differences in the clinical features and risk factors of LDA-associated peptic ulcer. Special attention should be paid to elderly female patients taking LDA.
Low-dose aspirin; Gender; Peptic ulcer
Bioactivity profiling using high-throughput in vitro assays can reduce the cost and time required for toxicological screening of environmental chemicals and can also reduce the need for animal testing. Several public efforts are aimed at discovering patterns or classifiers in high-dimensional bioactivity space that predict tissue, organ or whole animal toxicological endpoints. Supervised machine learning is a powerful approach to discover combinatorial relationships in complex in vitro/in vivo datasets. We present a novel model to simulate complex chemical-toxicology data sets and use this model to evaluate the relative performance of different machine learning (ML) methods.
The classification performance of Artificial Neural Networks (ANN), K-Nearest Neighbors (KNN), Linear Discriminant Analysis (LDA), Naïve Bayes (NB), Recursive Partitioning and Regression Trees (RPART), and Support Vector Machines (SVM) in the presence and absence of filter-based feature selection was analyzed using K-way cross-validation testing and independent validation on simulated in vitro assay data sets with varying levels of model complexity, number of irrelevant features and measurement noise. While the prediction accuracy of all ML methods decreased as non-causal (irrelevant) features were added, some ML methods performed better than others. In the limit of using a large number of features, ANN and SVM were always in the top performing set of methods while RPART and KNN (k = 5) were always in the poorest performing set. The addition of measurement noise and irrelevant features decreased the classification accuracy of all ML methods, with LDA suffering the greatest performance degradation. LDA performance is especially sensitive to the use of feature selection. Filter-based feature selection generally improved performance, most strikingly for LDA.
We have developed a novel simulation model to evaluate machine learning methods for the analysis of data sets in which in vitro bioassay data is being used to predict in vivo chemical toxicology. From our analysis, we can recommend that several ML methods, most notably SVM and ANN, are good candidates for use in real world applications in this area.
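The evaluation idea described above (simulated data with a few causal features, many irrelevant ones, and a filter applied before classification) can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the authors' simulation model: the generator `simulate`, the effect size, the filter score, and the use of a nearest-centroid classifier in place of the methods compared in the study are all choices made for this sketch.

```python
import random
import statistics

def simulate(n, n_noise, rng):
    """Hypothetical generator: 2 causal features shifted by class, plus pure-noise features."""
    X, y = [], []
    for i in range(n):
        label = i % 2
        shift = 1.5 if label == 1 else -1.5
        row = [rng.gauss(shift, 1.0), rng.gauss(shift, 1.0)]
        row += [rng.gauss(0.0, 1.0) for _ in range(n_noise)]
        X.append(row)
        y.append(label)
    return X, y

def filter_scores(X, y):
    """Filter score per feature: |difference of class means| / pooled standard deviation."""
    scores = []
    for j in range(len(X[0])):
        a = [row[j] for row, lab in zip(X, y) if lab == 0]
        b = [row[j] for row, lab in zip(X, y) if lab == 1]
        pooled = ((statistics.pvariance(a) + statistics.pvariance(b)) / 2) ** 0.5
        scores.append(abs(statistics.fmean(a) - statistics.fmean(b)) / pooled)
    return scores

def nearest_centroid_accuracy(X_tr, y_tr, X_te, y_te, cols):
    """Fit class centroids on the training set (restricted to cols) and score the test set."""
    centroids = {}
    for lab in (0, 1):
        rows = [r for r, l in zip(X_tr, y_tr) if l == lab]
        centroids[lab] = [statistics.fmean([r[j] for r in rows]) for j in cols]
    correct = 0
    for r, lab in zip(X_te, y_te):
        dist = {k: sum((r[j] - c) ** 2 for j, c in zip(cols, cen))
                for k, cen in centroids.items()}
        correct += min(dist, key=dist.get) == lab
    return correct / len(y_te)

rng = random.Random(0)
X_train, y_train = simulate(200, 50, rng)
X_test, y_test = simulate(200, 50, rng)   # independent validation set

# Scores are computed on the training set only, so the test set stays untouched.
scores = filter_scores(X_train, y_train)
selected = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)[:2]

acc_all = nearest_centroid_accuracy(X_train, y_train, X_test, y_test, list(range(len(scores))))
acc_sel = nearest_centroid_accuracy(X_train, y_train, X_test, y_test, selected)
print("selected features:", sorted(selected))
print("accuracy, all features:", acc_all)
print("accuracy, selected features:", acc_sel)
```

Computing the filter scores on the training split alone matters: ranking features on the full data set would leak information from the validation samples into the model.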
Experimental pEC50s for 216 selective respiratory syncytial virus (RSV) inhibitors are used to develop classification models as a potential screening tool for a large library of target compounds. A variable selection algorithm coupled with random forests (VS-RF) is used to extract the physicochemical features most relevant to RSV inhibition. Based on the selected small set of descriptors, four other widely used approaches, i.e., support vector machine (SVM), Gaussian process (GP), linear discriminant analysis (LDA) and k nearest neighbors (kNN) routines, are also employed and compared with the VS-RF method in terms of several rigorous evaluation criteria. The obtained results indicate that the VS-RF model is a powerful tool for classification of RSV inhibitors, producing the highest overall accuracy of 94.34% for the external prediction set, which significantly outperforms the other four methods, whose average accuracy is 80.66%. The proposed model, with excellent prediction capacity from internal to external quality, should be important for screening and optimization of potential RSV inhibitors prior to chemical synthesis in drug development.
RSV; variable selection; Mold2 descriptors; random forest
Computer-aided detection (CADe) and diagnosis (CAD) have been a rapidly growing, active area of research in medical imaging. Machine learning (ML) plays an essential role in CAD, because objects such as lesions and organs may not be represented accurately by a simple equation; thus, medical pattern recognition essentially requires “learning from examples.” One of the most popular uses of ML is the classification of objects such as lesion candidates into certain classes (e.g., abnormal or normal, and lesions or non-lesions) based on input features (e.g., contrast and area) obtained from segmented lesion candidates. The task of ML is to determine “optimal” boundaries for separating classes in the multidimensional feature space formed by the input features. ML algorithms for classification include linear discriminant analysis (LDA), quadratic discriminant analysis (QDA), multilayer perceptrons, and support vector machines (SVM). Recently, pixel/voxel-based ML (PML) has emerged in medical image processing/analysis; it uses pixel/voxel values in images directly, instead of features calculated from segmented lesions, as input information, so feature calculation or segmentation is not required. In this paper, ML techniques used in CAD schemes for detection and diagnosis of lung nodules in thoracic CT and for detection of polyps in CT colonography (CTC) are surveyed and reviewed.
machine learning in medical imaging; computer-aided diagnosis; classification; pixel-based machine learning; lung nodule; colorectal polyp; CT colonography
There are many instances where it is desirable to determine, at a distance, whether a subject is carrying a hidden load. Automated detection systems based on gait analysis have been proposed to detect subjects that carry hidden loads. However, very little baseline gait kinematic analysis has been performed to determine the effect on human gait of ambulating with evenly distributed (front to back) loads. The work in this paper establishes, via high resolution motion capture trials, the baseline separability of load carriage conditions into loaded and unloaded categories using several standard lower body kinematic parameters. A total of 23 participants (19 for training and 4 for testing) were studied. Satisfactory classification of participants into the correct loading condition was achieved by employing linear discriminant analysis (LDA). Six lower body kinematic parameters, including ranges of motion and path lengths from the phase portraits, were used to train the LDA to discriminate loaded and unloaded walking conditions. Baseline performance from 4 participants who were not included in the training data sets shows that the use of LDA provides a 92.5% correct classification over two loaded and unloaded walking conditions. The results suggest that there are gait pattern changes due to external loads, and that LDA could be applied successfully to classify gait patterns with an unknown load condition.
Locomotion; Gait analysis; External loads; Linear discriminant analysis
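To make the two-class LDA step concrete, the following minimal Python sketch fits Fisher's linear discriminant on two features and classifies held-out points. It is an illustration under stated assumptions, not the study's pipeline: the study used six kinematic parameters, whereas the two unit-less features, the data values, and the function names `fit_lda` and `predict` here are hypothetical, and the threshold is placed at the midpoint of the projected class means (equal priors assumed).

```python
def mean_vec(rows):
    """Mean of a list of 2-D points."""
    n = len(rows)
    return [sum(r[0] for r in rows) / n, sum(r[1] for r in rows) / n]

def scatter(rows, m):
    """Within-class scatter matrix entries (sxx, sxy, syy) about the mean m."""
    sxx = sum((r[0] - m[0]) ** 2 for r in rows)
    sxy = sum((r[0] - m[0]) * (r[1] - m[1]) for r in rows)
    syy = sum((r[1] - m[1]) ** 2 for r in rows)
    return sxx, sxy, syy

def fit_lda(class0, class1):
    """Fisher LDA for two classes: w = Sw^{-1} (m1 - m0), midpoint threshold."""
    m0, m1 = mean_vec(class0), mean_vec(class1)
    a0, b0, c0 = scatter(class0, m0)
    a1, b1, c1 = scatter(class1, m1)
    a, b, c = a0 + a1, b0 + b1, c0 + c1        # pooled scatter Sw = [[a, b], [b, c]]
    det = a * c - b * b                        # assumed non-zero (Sw invertible)
    dx, dy = m1[0] - m0[0], m1[1] - m0[1]
    w = [(c * dx - b * dy) / det, (-b * dx + a * dy) / det]
    proj0 = w[0] * m0[0] + w[1] * m0[1]
    proj1 = w[0] * m1[0] + w[1] * m1[1]
    return w, 0.5 * (proj0 + proj1)

def predict(w, threshold, x):
    """Return 1 (e.g. loaded) if the projection exceeds the threshold, else 0."""
    return 1 if w[0] * x[0] + w[1] * x[1] > threshold else 0

# Hypothetical training data: class 0 = unloaded, class 1 = loaded.
unloaded = [(1.0, 2.0), (2.0, 1.0), (2.0, 3.0), (3.0, 2.0)]
loaded = [(5.0, 4.0), (6.0, 3.0), (6.0, 5.0), (7.0, 4.0)]

w, threshold = fit_lda(unloaded, loaded)

# Hypothetical held-out points, standing in for participants excluded from training.
test_points = [((2.5, 2.0), 0), ((6.0, 4.5), 1)]
correct = sum(predict(w, threshold, x) == label for x, label in test_points)
print("weights:", w, "threshold:", threshold)
print("held-out accuracy:", correct / len(test_points))
```

Keeping the test participants entirely out of the fitting step, as the abstract describes, is what makes the reported held-out accuracy a fair estimate of performance on unseen subjects.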