Recently a series of algorithms have been developed, providing automatic tools for tracing C. elegans embryonic cell lineage. In these algorithms, 3D images collected from a confocal laser scanning microscope were processed, the output of which is cell lineage with cell division history and cell positions with time. However, current image segmentation algorithms suffer from high error rate especially after 350-cell stage because of low signal-noise ratio as well as low resolution along the Z axis (0.5-1 microns). As a result, correction of the errors becomes a huge burden. These errors are mainly produced in the segmentation of nuclei. Thus development of a more accurate image segmentation algorithm will alleviate the hurdle for automated analysis of cell lineage.
This paper presents a new type of nuclei segmentation method embracing an bi-directional prediction procedure, which can greatly reduce the number of false negative errors, the most common errors in the previous segmentation. In this method, we first use a 2D region growing technique together with the level-set method to generate accurate 2D slices. Then a modified gradient method instead of the existing 3D local maximum method is adopted to detect all the 2D slices located in the nuclei center, each of which corresponds to one nucleus. Finally, the bi-directional pred- iction method based on the images before and after the current time point is introduced into the system to predict the nuclei in low quality parts of the images. The result of our method shows a notable improvement in the accuracy rate. For each nucleus, its precise location, volume and gene expression value (gray value) is also obtained, all of which will be useful in further downstream analyses.
The result of this research demonstrates the advantages of the bi-directional prediction method in the nuclei segmentation over that of StarryNite/MatLab StarryNite. Several other modifications adopted in our nuclei segmentation system are also discussed.
The invariant lineage of the nematode Caenorhabditis elegans has potential as a powerful tool for the description of mutant phenotypes and gene expression patterns. We previously described procedures for the imaging and automatic extraction of the cell lineage from C. elegans embryos. That method uses time-lapse confocal imaging of a strain expressing histone-GFP fusions and a software package, StarryNite, processes the thousands of images and produces output files that describe the location and lineage relationship of each nucleus at each time point.
We have developed a companion software package, AceTree, which links the images and the annotations using tree representations of the lineage. This facilitates curation and editing of the lineage. AceTree also contains powerful visualization and interpretive tools, such as space filling models and tree-based expression patterning, that can be used to extract biological significance from the data.
By pairing a fast lineaging program written in C with a user interface program written in Java we have produced a powerful software suite for exploring embryonic development.
Comparative genomic analysis of important signaling pathways in C. briggase and C. elegans reveals both conserved features and also differences. To build a framework to address the significance of these features we determined the C. briggsae embryonic cell lineage, using the tools StarryNite and AceTree. We traced both cell divisions and cell positions for all cells through all but the last round of cell division and for selected cells through the final round. We found the lineage to be remarkably similar to that of C. elegans. Not only did the founder cells give rise to similar numbers of progeny, the relative cell division timing and positions were largely maintained. These lineage similarities appear to give rise to similar cell fates as judged both by the positions of lineally-equivalent cells and by the patterns of cell deaths in both species. However, some reproducible differences were seen, e.g., the P4 cell cycle length is more than 40% longer in C. briggsae than that in C. elegans (p < 0.01). The extensive conservation of embryonic development between such divergent species suggests that substantial evolutionary distance between these two species has not altered these early developmental cellular events, although the developmental defects of transpecies hybrids suggest that the details of the underlying molecular pathways have diverged sufficiently so as to not be interchangeable.
C. briggsae; C. elegans; embryo; cell lineage; signaling pathway
Motivation: Deciphering the regulatory and developmental mechanisms for multicellular organisms requires detailed knowledge of gene interactions and gene expressions. The availability of large datasets with both spatial and ontological annotation of the spatio-temporal patterns of gene expression in mouse embryo provides a powerful resource to discover the biological function of embryo organization. Ontological annotation of gene expressions consists of labelling images with terms from the anatomy ontology for mouse development. If the spatial genes of an anatomical component are expressed in an image, the image is then tagged with a term of that anatomical component. The current annotation is done manually by domain experts, which is both time consuming and costly. In addition, the level of detail is variable, and inevitably errors arise from the tedious nature of the task. In this article, we present a new method to automatically identify and annotate gene expression patterns in the mouse embryo with anatomical terms.
Results: The method takes images from in situ hybridization studies and the ontology for the developing mouse embryo, it then combines machine learning and image processing techniques to produce classifiers that automatically identify and annotate gene expression patterns in these images. We evaluate our method on image data from the EURExpress study, where we use it to automatically classify nine anatomical terms: humerus, handplate, fibula, tibia, femur, ribs, petrous part, scapula and head mesenchyme. The accuracy of our method lies between 70% and 80% with few exceptions. We show that other known methods have lower classification performance than ours. We have investigated the images misclassified by our method and found several cases where the original annotation was not correct. This shows our method is robust against this kind of noise.
Availability: The annotation result and the experimental dataset in the article can be freely accessed at http://www2.docm.mmu.ac.uk/STAFF/L.Han/geneannotation/.
Supplementary Information: Supplementary data are available at Bioinformatics online.
Metabolic networks are represented by the set of metabolic pathways. Metabolic pathways are a series of biochemical reactions, in which the product (output) from one reaction serves as the substrate (input) to another reaction. Many pathways remain incompletely characterized. One of the major challenges of computational biology is to obtain better models of metabolic pathways. Existing models are dependent on the annotation of the genes. This propagates error accumulation when the pathways are predicted by incorrectly annotated genes. Pairwise classification methods are supervised learning methods used to classify new pair of entities. Some of these classification methods, e.g., Pairwise Support Vector Machines (SVMs), use pairwise kernels. Pairwise kernels describe similarity measures between two pairs of entities. Using pairwise kernels to handle sequence data requires long processing times and large storage. Rational kernels are kernels based on weighted finite-state transducers that represent similarity measures between sequences or automata. They have been effectively used in problems that handle large amount of sequence information such as protein essentiality, natural language processing and machine translations.
We create a new family of pairwise kernels using weighted finite-state transducers (called Pairwise Rational Kernel (PRK)) to predict metabolic pathways from a variety of biological data. PRKs take advantage of the simpler representations and faster algorithms of transducers. Because raw sequence data can be used, the predictor model avoids the errors introduced by incorrect gene annotations. We then developed several experiments with PRKs and Pairwise SVM to validate our methods using the metabolic network of Saccharomyces cerevisiae. As a result, when PRKs are used, our method executes faster in comparison with other pairwise kernels. Also, when we use PRKs combined with other simple kernels that include evolutionary information, the accuracy values have been improved, while maintaining lower construction and execution times.
The power of using kernels is that almost any sort of data can be represented using kernels. Therefore, completely disparate types of data can be combined to add power to kernel-based machine learning methods. When we compared our proposal using PRKs with other similar kernel, the execution times were decreased, with no compromise of accuracy. We also proved that by combining PRKs with other kernels that include evolutionary information, the accuracy can also also be improved. As our proposal can use any type of sequence data, genes do not need to be properly annotated, avoiding accumulation errors because of incorrect previous annotations.
Metabolic network; Pairwise rational kernels; Supervised network inference; Finite-state transducers; Pairwise support vector machine
Social media platforms such as Twitter are rapidly becoming key resources for public health surveillance applications, yet little is known about Twitter users’ levels of informedness and sentiment toward tobacco, especially with regard to the emerging tobacco control challenges posed by hookah and electronic cigarettes.
To develop a content and sentiment analysis of tobacco-related Twitter posts and build machine learning classifiers to detect tobacco-relevant posts and sentiment towards tobacco, with a particular focus on new and emerging products like hookah and electronic cigarettes.
We collected 7362 tobacco-related Twitter posts at 15-day intervals from December 2011 to July 2012. Each tweet was manually classified using a triaxial scheme, capturing genre, theme, and sentiment. Using the collected data, machine-learning classifiers were trained to detect tobacco-related vs irrelevant tweets as well as positive vs negative sentiment, using Naïve Bayes, k-nearest neighbors, and Support Vector Machine (SVM) algorithms. Finally, phi contingency coefficients were computed between each of the categories to discover emergent patterns.
The most prevalent genres were first- and second-hand experience and opinion, and the most frequent themes were hookah, cessation, and pleasure. Sentiment toward tobacco was overall more positive (1939/4215, 46% of tweets) than negative (1349/4215, 32%) or neutral among tweets mentioning it, even excluding the 9% of tweets categorized as marketing. Three separate metrics converged to support an emergent distinction between, on one hand, hookah and electronic cigarettes corresponding to positive sentiment, and on the other hand, traditional tobacco products and more general references corresponding to negative sentiment. These metrics included correlations between categories in the annotation scheme (phihookah-positive=0.39; phie-cigs-positive=0.19); correlations between search keywords and sentiment (χ2
4=414.50, P<.001, Cramer’s V=0.36), and the most discriminating unigram features for positive and negative sentiment ranked by log odds ratio in the machine learning component of the study. In the automated classification tasks, SVMs using a relatively small number of unigram features (500) achieved best performance in discriminating tobacco-related from unrelated tweets (F score=0.85).
Novel insights available through Twitter for tobacco surveillance are attested through the high prevalence of positive sentiment. This positive sentiment is correlated in complex ways with social image, personal experience, and recently popular products such as hookah and electronic cigarettes. Several apparent perceptual disconnects between these products and their health effects suggest opportunities for tobacco control education. Finally, machine classification of tobacco-related posts shows a promising edge over strictly keyword-based approaches, yielding an improved signal-to-noise ratio in Twitter data and paving the way for automated tobacco surveillance applications.
social media; twitter messaging; smoking; natural language processing
Computerized evaluation of histological preparations of prostate tissues involves identification of tissue components such as stroma (ST), benign/normal epithelium (BN) and prostate cancer (PCa). Image classification approaches have been developed to identify and classify glandular regions in digital images of prostate tissues; however their success has been limited by difficulties in cellular segmentation and tissue heterogeneity. We hypothesized that utilizing image pixels to generate intensity histograms of hematoxylin (H) and eosin (E) stains deconvoluted from H&E images numerically captures the architectural difference between glands and stroma. In addition, we postulated that joint histograms of local binary patterns and local variance (LBPxVAR) can be used as sensitive textural features to differentiate benign/normal tissue from cancer. Here we utilized a machine learning approach comprising of a support vector machine (SVM) followed by a random forest (RF) classifier to digitally stratify prostate tissue into ST, BN and PCa areas. Two pathologists manually annotated 210 images of low- and high-grade tumors from slides that were selected from 20 radical prostatectomies and digitized at high-resolution. The 210 images were split into the training (n = 19) and test (n = 191) sets. Local intensity histograms of H and E were used to train a SVM classifier to separate ST from epithelium (BN + PCa). The performance of SVM prediction was evaluated by measuring the accuracy of delineating epithelial areas. The Jaccard J = 59.5 ± 14.6 and Rand Ri = 62.0 ± 7.5 indices reported a significantly better prediction when compared to a reference method (Chen et al., Clinical Proteomics 2013, 10:18) based on the averaged values from the test set. To distinguish BN from PCa we trained a RF classifier with LBPxVAR and local intensity histograms and obtained separate performance values for BN and PCa: JBN = 35.2 ± 24.9, OBN = 49.6 ± 32, JPCa = 49.5 ± 18.5, OPCa = 72.7 ± 14.8 and Ri = 60.6 ± 7.6 in the test set. Our pixel-based classification does not rely on the detection of lumens, which is prone to errors and has limitations in high-grade cancers and has the potential to aid in clinical studies in which the quantification of tumor content is necessary to prognosticate the course of the disease. The image data set with ground truth annotation is available for public use to stimulate further research in this area.
Machine learning; Image analysis; Prostate cancer; Tissue classification; Tissue quantification
Introduce the notion of cross-sectional relatedness as an informational dependence relation between sentences in the conclusion section of a breast radiology report and sentences in the findings section of the same report. Assess inter-rater agreement of breast radiologists. Develop and evaluate a support vector machine (SVM) classifier for automatically detecting cross-sectional relatedness. A standard reference is manually created from 444 breast radiology reports by the first author. A subset of 37 reports is annotated by five breast radiologists. Inter-rater agreement is computed among their annotations and standard reference. Thirteen numerical features are developed to characterize pairs of sentences; the optimal feature set is sought through forward selection. Inter-rater agreement is F-measure 0.623. SVM classifier has F-measure of 0.699 in the 12-fold cross-validation protocol against standard reference. Report length does not correlate with the classifier’s performance (correlation coefficient = −0.073). SVM classifier has average F-measure of 0.505 against annotations by breast radiologists. Mediocre inter-rater agreement is possibly caused by: (1) definition is insufficiently actionable, (2) fine-grained nature of cross-sectional relatedness on sentence level, instead of, for instance, on paragraph level, and (3) higher-than-average complexity of 37-report sample. SVM classifier performs better against standard reference than against breast radiologists’s annotations. This is supportive of (3). SVM’s performance on standard reference is satisfactory. Since optimal feature set is not breast specific, results may transfer to non-breast anatomies. Applications include a smart report viewing environment and data mining.
Electronic supplementary material
The online version of this article (doi:10.1007/s10278-013-9612-9) contains supplementary material, which is available to authorized users.
Radiology reports; Information retrieval; Support vector machine; Text mining; Inter-rater agreement; Textual entailment
Efficient and accurate prediction of protein function from sequence is one of the standing problems in Biology. The generalised use of sequence alignments for inferring function promotes the propagation of errors, and there are limits to its applicability. Several machine learning methods have been applied to predict protein function, but they lose much of the information encoded by protein sequences because they need to transform them to obtain data of fixed length.
We have developed a machine learning methodology, called peptide programs (PPs), to deal directly with protein sequences and compared its performance with that of Support Vector Machines (SVMs) and BLAST in detailed enzyme classification tasks. Overall, the PPs and SVMs had a similar performance in terms of Matthews Correlation Coefficient, but the PPs had generally a higher precision. BLAST performed globally better than both methodologies, but the PPs had better results than BLAST and SVMs for the smaller datasets.
The higher precision of the PPs in comparison to the SVMs suggests that dealing with sequences is advantageous for detailed protein classification, as precision is essential to avoid annotation errors. The fact that the PPs performed better than BLAST for the smaller datasets demonstrates the potential of the methodology, but the drop in performance observed for the larger datasets indicates that further development is required.
Possible strategies to address this issue include partitioning the datasets into smaller subsets and training individual PPs for each subset, or training several PPs for each dataset and combining them using a bagging strategy.
To compare linear and Laplacian SVMs on a clinical text classification task; to evaluate the effect of unlabeled training data on Laplacian SVM performance.
The development of machine-learning based clinical text classifiers requires the creation of labeled training data, obtained via manual review by clinicians. Due to the effort and expense involved in labeling data, training data sets in the clinical domain are of limited size. In contrast, electronic medical record (EMR) systems contain hundreds of thousands of unlabeled notes that are not used by supervised machine learning approaches. Semi-supervised learning algorithms use both labeled and unlabeled data to train classifiers, and can outperform their supervised counterparts.
We trained support vector machines (SVMs) and Laplacian SVMs on a training reference standard of 820 abdominal CT, MRI, and Ultrasound reports labeled for the presence of potentially malignant liver lesions that require follow up (positive class prevalence 77%). The Laplacian SVM used 19,845 randomly sampled unlabeled notes in addition to the training reference standard. We evaluated SVMs and Laplacian SVMs on a test set of 520 labeled reports.
The Laplacian SVM trained on labeled and unlabeled radiology reports significantly outperformed supervised SVMs (Macro-F1 0.773 vs. 0.741, Sensitivity 0.943 vs. 0.911, Positive Predictive value 0.877 vs. 0.883). Performance improved with the number of labeled and unlabeled notes used to train the Laplacian SVM (pearson’s ρ=0.529 for correlation between number of unlabeled notes and macro-F1 score). These results suggest that practical semi-supervised methods such as the Laplacian SVM can leverage the large, unlabeled corpora that reside within EMRs to improve clinical text classification.
Semi-supervised learning; Support vector machine; Graph Laplacian; Natural language processing
Extraction of clinical information such as medications or problems from clinical text is an important task of clinical natural language processing (NLP). Rule-based methods are often used in clinical NLP systems because they are easy to adapt and customize. Recently, supervised machine learning methods have proven to be effective in clinical NLP as well. However, combining different classifiers to further improve the performance of clinical entity recognition systems has not been investigated extensively. Combining classifiers into an ensemble classifier presents both challenges and opportunities to improve performance in such NLP tasks.
We investigated ensemble classifiers that used different voting strategies to combine outputs from three individual classifiers: a rule-based system, a support vector machine (SVM) based system, and a conditional random field (CRF) based system. Three voting methods were proposed and evaluated using the annotated data sets from the 2009 i2b2 NLP challenge: simple majority, local SVM-based voting, and local CRF-based voting.
Evaluation on 268 manually annotated discharge summaries from the i2b2 challenge showed that the local CRF-based voting method achieved the best F-score of 90.84% (94.11% Precision, 87.81% Recall) for 10-fold cross-validation. We then compared our systems with the first-ranked system in the challenge by using the same training and test sets. Our system based on majority voting achieved a better F-score of 89.65% (93.91% Precision, 85.76% Recall) than the previously reported F-score of 89.19% (93.78% Precision, 85.03% Recall) by the first-ranked system in the challenge.
Our experimental results using the 2009 i2b2 challenge datasets showed that ensemble classifiers that combine individual classifiers into a voting system could achieve better performance than a single classifier in recognizing medication information from clinical text. It suggests that simple strategies that can be easily implemented such as majority voting could have the potential to significantly improve clinical entity recognition.
Growing evidence documents the potential of machine learning for developing brain based diagnostic methods for major depressive disorder (MDD). As symptom severity may influence brain activity, we investigated whether the severity of MDD affected the accuracies of machine learned MDD-vs-Control diagnostic classifiers.
Forty-five medication-free patients with DSM-IV defined MDD and 19 healthy controls participated in the study. Based on depression severity as determined by the Hamilton Rating Scale for Depression (HRSD), MDD patients were sorted into three groups: mild to moderate depression (HRSD 14–19), severe depression (HRSD 20–23), and very severe depression (HRSD ≥ 24). We collected functional magnetic resonance imaging (fMRI) data during both resting-state and an emotional-face matching task. Patients in each of the three severity groups were compared against controls in separate analyses, using either the resting-state or task-based fMRI data. We use each of these six datasets with linear support vector machine (SVM) binary classifiers for identifying individuals as patients or controls.
The resting-state fMRI data showed statistically significant classification accuracy only for the very severe depression group (accuracy 66%, p = 0.012 corrected), while mild to moderate (accuracy 58%, p = 1.0 corrected) and severe depression (accuracy 52%, p = 1.0 corrected) were only at chance. With task-based fMRI data, the automated classifier performed at chance in all three severity groups.
Binary linear SVM classifiers achieved significant classification of very severe depression with resting-state fMRI, but the contribution of brain measurements may have limited potential in differentiating patients with less severe depression from healthy controls.
•SVM binary classifiers achieved significant classification of very severe depression with resting state fMRI data.•Prefrontal, anterior cingulate and insula were the most discriminative brain regions.•No significant classification could be achieved for less severe MDD with resting state data.•With emotional task data, SVM classifier performed at chance for all MDD severity groups.
Major depression; Severity of symptoms; Diagnosis; Functional magnetic resonance imaging; Machine learning; Classification; Support vector machine
Classification is the problem of assigning each input object to one of a finite number of classes. This problem has been extensively studied in machine learning and statistics, and there are numerous applications to bioinformatics as well as many other fields. Building a multiclass classifier has been a challenge, where the direct approach of altering the binary classification algorithm to accommodate more than two classes can be computationally too expensive. Hence the indirect approach of using binary decomposition has been commonly used, in which retrieving the class posterior probabilities from the set of binary posterior probabilities given by the individual binary classifiers has been a major issue.
In this work, we present an extension of a recently introduced probabilistic kernel-based learning algorithm called the Classification Relevance Units Machine (CRUM) to the multiclass setting to increase its applicability. The extension is achieved under the error correcting output codes framework. The probabilistic outputs of the binary CRUMs are preserved using a proposed linear-time decoding algorithm, an alternative to the generalized Bradley-Terry (GBT) algorithm whose application to large-scale prediction settings is prohibited by its computational complexity. The resulting classifier is called the Multiclass Relevance Units Machine (McRUM).
The evaluation of McRUM on a variety of real small-scale benchmark datasets shows that our proposed Naïve decoding algorithm is computationally more efficient than the GBT algorithm while maintaining a similar level of predictive accuracy. Then a set of experiments on a larger scale dataset for small ncRNA classification have been conducted with Naïve McRUM and compared with the Gaussian and linear SVM. Although McRUM's predictive performance is slightly lower than the Gaussian SVM, the results show that the similar level of true positive rate can be achieved by sacrificing false positive rate slightly. Furthermore, McRUM is computationally more efficient than the SVM, which is an important factor for large-scale analysis.
We have proposed McRUM, a multiclass extension of binary CRUM. McRUM with Naïve decoding algorithm is computationally efficient in run-time and its predictive performance is comparable to the well-known SVM, showing its potential in solving large-scale multiclass problems in bioinformatics and other fields of study.
To harness the full potential of social media for epidemiological surveillance of drug abuse trends, the field needs a greater level of automation in processing and analyzing social media content.
The objective of the study is to describe the development of supervised machine-learning techniques for the eDrugTrends platform to automatically classify tweets by type/source of communication (personal, official/media, retail) and sentiment (positive, negative, neutral) expressed in cannabis- and synthetic cannabinoid–related tweets.
Tweets were collected using Twitter streaming Application Programming Interface and filtered through the eDrugTrends platform using keywords related to cannabis, marijuana edibles, marijuana concentrates, and synthetic cannabinoids. After creating coding rules and assessing intercoder reliability, a manually labeled data set (N=4000) was developed by coding several batches of randomly selected subsets of tweets extracted from the pool of 15,623,869 collected by eDrugTrends (May-November 2015). Out of 4000 tweets, 25% (1000/4000) were used to build source classifiers and 75% (3000/4000) were used for sentiment classifiers. Logistic Regression (LR), Naive Bayes (NB), and Support Vector Machines (SVM) were used to train the classifiers. Source classification (n=1000) tested Approach 1 that used short URLs, and Approach 2 where URLs were expanded and included into the bag-of-words analysis. For sentiment classification, Approach 1 used all tweets, regardless of their source/type (n=3000), while Approach 2 applied sentiment classification to personal communication tweets only (2633/3000, 88%). Multiclass and binary classification tasks were examined, and machine-learning sentiment classifier performance was compared with Valence Aware Dictionary for sEntiment Reasoning (VADER), a lexicon and rule-based method. The performance of each classifier was assessed using 5-fold cross validation that calculated average F-scores. One-tailed t test was used to determine if differences in F-scores were statistically significant.
In multiclass source classification, the use of expanded URLs did not contribute to significant improvement in classifier performance (0.7972 vs 0.8102 for SVM, P=.19). In binary classification, the identification of all source categories improved significantly when unshortened URLs were used, with personal communication tweets benefiting the most (0.8736 vs 0.8200, P<.001). In multiclass sentiment classification Approach 1, SVM (0.6723) performed similarly to NB (0.6683) and LR (0.6703). In Approach 2, SVM (0.7062) did not differ from NB (0.6980, P=.13) or LR (F=0.6931, P=.05), but it was over 40% more accurate than VADER (F=0.5030, P<.001). In multiclass task, improvements in sentiment classification (Approach 2 vs Approach 1) did not reach statistical significance (eg, SVM: 0.7062 vs 0.6723, P=.052). In binary sentiment classification (positive vs negative), Approach 2 (focus on personal communication tweets only) improved classification results, compared with Approach 1, for LR (0.8752 vs 0.8516, P=.04) and SVM (0.8800 vs 0.8557, P=.045).
The study provides an example of the use of supervised machine learning methods to categorize cannabis- and synthetic cannabinoid–related tweets with fairly high accuracy. Use of these content analysis tools along with geographic identification capabilities developed by the eDrugTrends platform will provide powerful methods for tracking regional changes in user opinions related to cannabis and synthetic cannabinoids use over time and across different regions.
social media; Twitter; cannabis; synthetic cannabinoids; machine learning; sentiment analysis; eDrugTrends
Automated identification of cell cycle phases of individual live cells in a large population captured via automated fluorescence microscopy technique is important for cancer drug discovery and cell cycle studies. Time-lapse fluorescence microscopy images provide an important method to study the cell cycle process under different conditions of perturbation. Existing methods are limited in dealing with such time-lapse data sets while manual analysis is not feasible. This paper presents statistical data analysis and statistical pattern recognition to perform this task.
The data is generated from Hela H2B GFP cells imaged during a 2-day period with images acquired 15 minutes apart using an automated time-lapse fluorescence microscopy. The patterns are described with four kinds of features, including twelve general features, Haralick texture features, Zernike moment features, and wavelet features. To generate a new set of features with more discriminate power, the commonly used feature reduction techniques are used, which include Principle Component Analysis (PCA), Linear Discriminant Analysis (LDA), Maximum Margin Criterion (MMC), Stepwise Discriminate Analysis based Feature Selection (SDAFS), and Genetic Algorithm based Feature Selection (GAFS). Then, we propose a Context Based Mixture Model (CBMM) for dealing with the time-series cell sequence information and compare it to other traditional classifiers: Support Vector Machine (SVM), Neural Network (NN), and K-Nearest Neighbor (KNN). Being a standard practice in machine learning, we systematically compare the performance of a number of common feature reduction techniques and classifiers to select an optimal combination of a feature reduction technique and a classifier. A cellular database containing 100 manually labelled subsequence is built for evaluating the performance of the classifiers. The generalization error is estimated using the cross validation technique. The experimental results show that CBMM outperforms all other classifies in identifying prophase and has the best overall performance.
The application of feature reduction techniques can improve the prediction accuracy significantly. CBMM can effectively utilize the contextual information and has the best overall performance when combined with any of the previously mentioned feature reduction techniques.
Due to the lack of annotated data sets, there are few studies on machine learning based approaches to extract named entities (NEs) in clinical text. The 2009 i2b2 NLP challenge is a task to extract six types of medication related NEs, including medication names, dosage, mode, frequency, duration, and reason from hospital discharge summaries. Several machine learning based systems have been developed and showed good performance in the challenge. Those systems often involve two steps: 1) recognition of medication related entities; and 2) determination of the relation between a medication name and its modifiers (e.g., dosage). A few machine learning algorithms including Conditional Random Field (CRF) and Maximum Entropy have been applied to the Named Entity Recognition (NER) task at the first step. In this study, we developed a Support Vector Machine (SVM) based method to recognize medication related entities. In addition, we systematically investigated various types of features for NER in clinical text. Evaluation on 268 manually annotated discharge summaries from i2b2 challenge showed that the SVM-based NER system achieved the best F-score of 90.05% (93.20% Precision, 87.12% Recall), when semantic features generated from a rule-based system were included.
Classification and variable selection play an important role in knowledge discovery in high-dimensional data. Although Support Vector Machine (SVM) algorithms are among the most powerful classification and prediction methods with a wide range of scientific applications, the SVM does not include automatic feature selection and therefore a number of feature selection procedures have been developed. Regularisation approaches extend SVM to a feature selection method in a flexible way using penalty functions like LASSO, SCAD and Elastic Net.
We propose a novel penalty function for SVM classification tasks, Elastic SCAD, a combination of SCAD and ridge penalties which overcomes the limitations of each penalty alone.
Since SVM models are extremely sensitive to the choice of tuning parameters, we adopted an interval search algorithm, which in comparison to a fixed grid search finds rapidly and more precisely a global optimal solution.
Feature selection methods with combined penalties (Elastic Net and Elastic SCAD SVMs) are more robust to a change of the model complexity than methods using single penalties. Our simulation study showed that Elastic SCAD SVM outperformed LASSO (L1) and SCAD SVMs. Moreover, Elastic SCAD SVM provided sparser classifiers in terms of median number of features selected than Elastic Net SVM and often better predicted than Elastic Net in terms of misclassification error.
Finally, we applied the penalization methods described above on four publicly available breast cancer data sets. Elastic SCAD SVM was the only method providing robust classifiers in sparse and non-sparse situations.
The proposed Elastic SCAD SVM algorithm provides the advantages of the SCAD penalty and at the same time avoids sparsity limitations for non-sparse data. We were first to demonstrate that the integration of the interval search algorithm and penalized SVM classification techniques provides fast solutions on the optimization of tuning parameters.
The penalized SVM classification algorithms as well as fixed grid and interval search for finding appropriate tuning parameters were implemented in our freely available R package 'penalizedSVM'.
We conclude that the Elastic SCAD SVM is a flexible and robust tool for classification and feature selection tasks for high-dimensional data such as microarray data sets.
Predicting a protein's structural class from its amino acid sequence is a fundamental problem in computational biology. Much recent work has focused on developing new representations for protein sequences, called string kernels, for use with support vector machine (SVM) classifiers. However, while some of these approaches exhibit state-of-the-art performance at the binary protein classification problem, i.e. discriminating between a particular protein class and all other classes, few of these studies have addressed the real problem of multi-class superfamily or fold recognition. Moreover, there are only limited software tools and systems for SVM-based protein classification available to the bioinformatics community.
We present a new multi-class SVM-based protein fold and superfamily recognition system and web server called SVM-Fold, which can be found at . Our system uses an efficient implementation of a state-of-the-art string kernel for sequence profiles, called the profile kernel, where the underlying feature representation is a histogram of inexact matching k-mer frequencies. We also employ a novel machine learning approach to solve the difficult multi-class problem of classifying a sequence of amino acids into one of many known protein structural classes. Binary one-vs-the-rest SVM classifiers that are trained to recognize individual structural classes yield prediction scores that are not comparable, so that standard "one-vs-all" classification fails to perform well. Moreover, SVMs for classes at different levels of the protein structural hierarchy may make useful predictions, but one-vs-all does not try to combine these multiple predictions. To deal with these problems, our method learns relative weights between one-vs-the-rest classifiers and encodes information about the protein structural hierarchy for multi-class prediction. In large-scale benchmark results based on the SCOP database, our code weighting approach significantly improves on the standard one-vs-all method for both the superfamily and fold prediction in the remote homology setting and on the fold recognition problem. Moreover, our code weight learning algorithm strongly outperforms nearest-neighbor methods based on PSI-BLAST in terms of prediction accuracy on every structure classification problem we consider.
By combining state-of-the-art SVM kernel methods with a novel multi-class algorithm, the SVM-Fold system delivers efficient and accurate protein fold and superfamily recognition.
Pattern recognition methods, such as computer aided diagnosis (CAD) systems, can help clinicians in their diagnosis by marking abnormal regions in an image. We propose a machine learning system based on a one-class support vector machine (OC-SVM) classifier for the detection of abnormalities in magnetic resonance images (MRI) applied to patients with intractable epilepsy. The system learns the features associated with healthy control subjects, allowing a voxelwise assessment of the deviation of a test subject pattern from the learned patterns. While any number of various features can be chosen and learned, here we focus on two texture parameters capturing image patterns associated with epileptogenic lesions on T1-weighted brain MRI e.g. heterotopia and blurred junction between the grey and white matter. The CAD output consists of patient specific 3D maps locating clusters of suspicious voxels ranked by size and degree of deviation from control patterns. System performance was evaluated using realistic simulations of challenging detection tasks as well as clinical data of 77 healthy control subjects and of eleven patients (13 lesions). It was compared to that of a mass univariate statistical parametric mapping (SPM) single subject analysis based on the same set of features. For all simulations, OC-SVM yielded significantly higher values of the area under the ROC curve (AUC) and higher sensitivity at low false positive rate. For the clinical data, both OC-SVM and SPM successfully detected 100% of the lesions in the MRI positive cases (3/13). For the MRI negative cases (10/13), OC-SVM detected 7/10 lesions and SPM analysis detected 5/10 lesions. In all experiments, OC-SVM produced fewer false positive detections than SPM. OC-SVM may be a versatile system for unbiased lesion detection.
Curation of information from bioscience literature into biological knowledge databases is a crucial way of capturing experimental information in a computable form. During the biocuration process, a critical first step is to identify from all published literature the papers that contain results for a specific data type the curator is interested in annotating. This step normally requires curators to manually examine many papers to ascertain which few contain information of interest and thus, is usually time consuming. We developed an automatic method for identifying papers containing these curation data types among a large pool of published scientific papers based on the machine learning method Support Vector Machine (SVM). This classification system is completely automatic and can be readily applied to diverse experimental data types. It has been in use in production for automatic categorization of 10 different experimental datatypes in the biocuration process at WormBase for the past two years and it is in the process of being adopted in the biocuration process at FlyBase and the Saccharomyces Genome Database (SGD). We anticipate that this method can be readily adopted by various databases in the biocuration community and thereby greatly reducing time spent on an otherwise laborious and demanding task. We also developed a simple, readily automated procedure to utilize training papers of similar data types from different bodies of literature such as C. elegans and D. melanogaster to identify papers with any of these data types for a single database. This approach has great significance because for some data types, especially those of low occurrence, a single corpus often does not have enough training papers to achieve satisfactory performance.
We successfully tested the method on ten data types from WormBase, fifteen data types from FlyBase and three data types from Mouse Genomics Informatics (MGI). It is being used in the curation work flow at WormBase for automatic association of newly published papers with ten data types including RNAi, antibody, phenotype, gene regulation, mutant allele sequence, gene expression, gene product interaction, overexpression phenotype, gene interaction, and gene structure correction.
Our methods are applicable to a variety of data types with training set containing several hundreds to a few thousand documents. It is completely automatic and, thus can be readily incorporated to different workflow at different literature-based databases. We believe that the work presented here can contribute greatly to the tremendous task of automating the important yet labor-intensive biocuration effort.
This study was to assess whether active learning strategies can be integrated with supervised word sense disambiguation (WSD) methods, thus reducing the number of annotated samples, while keeping or improving the quality of disambiguation models.
We developed support vector machine (SVM) classifiers to disambiguate 197 ambiguous terms and abbreviations in the MSH WSD collection. Three different uncertainty sampling-based active learning algorithms were implemented with the SVM classifiers and were compared with a passive learner (PL) based on random sampling. For each ambiguous term and each learning algorithm, a learning curve that plots the accuracy computed from the test set as a function of the number of annotated samples used in the model was generated. The area under the learning curve (ALC) was used as the primary metric for evaluation.
Our experiments demonstrated that active learners (ALs) significantly outperformed the PL, showing better performance for 177 out of 197 (89.8%) WSD tasks. Further analysis showed that to achieve an average accuracy of 90%, the PL needed 38 annotated samples, while the ALs needed only 24, a 37% reduction in annotation effort. Moreover, we analyzed cases where active learning algorithms did not achieve superior performance and identified three causes: (1) poor models in the early learning stage; (2) easy WSD cases; and (3) difficult WSD cases, which provide useful insight for future improvements.
This study demonstrated that integrating active learning strategies with supervised WSD methods could effectively reduce annotation cost and improve the disambiguation models.
Active Learning; Word Sense Disambiguation; Natural Language Processing; Machine Learning; Uncertainty Sampling; Annotation
Motivation: Scholarly biomedical publications report on the findings of a research investigation. Scientists use a well-established discourse structure to relate their work to the state of the art, express their own motivation and hypotheses and report on their methods, results and conclusions. In previous work, we have proposed ways to explicitly annotate the structure of scientific investigations in scholarly publications. Here we present the means to facilitate automatic access to the scientific discourse of articles by automating the recognition of 11 categories at the sentence level, which we call Core Scientific Concepts (CoreSCs). These include: Hypothesis, Motivation, Goal, Object, Background, Method, Experiment, Model, Observation, Result and Conclusion. CoreSCs provide the structure and context to all statements and relations within an article and their automatic recognition can greatly facilitate biomedical information extraction by characterizing the different types of facts, hypotheses and evidence available in a scientific publication.
Results: We have trained and compared machine learning classifiers (support vector machines and conditional random fields) on a corpus of 265 full articles in biochemistry and chemistry to automatically recognize CoreSCs. We have evaluated our automatic classifications against a manually annotated gold standard, and have achieved promising accuracies with ‘Experiment’, ‘Background’ and ‘Model’ being the categories with the highest F1-scores (76%, 62% and 53%, respectively). We have analysed the task of CoreSC annotation both from a sentence classification as well as sequence labelling perspective and we present a detailed feature evaluation. The most discriminative features are local sentence features such as unigrams, bigrams and grammatical dependencies while features encoding the document structure, such as section headings, also play an important role for some of the categories. We discuss the usefulness of automatically generated CoreSCs in two biomedical applications as well as work in progress.
Availability: A web-based tool for the automatic annotation of articles with CoreSCs and corresponding documentation is available online at http://www.sapientaproject.com/software
http://www.sapientaproject.com also contains detailed information pertaining to CoreSC annotation and links to annotation guidelines as well as a corpus of manually annotated articles, which served as our training data.
Supplementary data are available at Bioinformatics online.
For modern biology, precise genome annotations are of prime importance, as they allow the accurate definition of genic regions. We employ state-of-the-art machine learning methods to assay and improve the accuracy of the genome annotation of the nematode Caenorhabditis elegans. The proposed machine learning system is trained to recognize exons and introns on the unspliced mRNA, utilizing recent advances in support vector machines and label sequence learning. In 87% (coding and untranslated regions) and 95% (coding regions only) of all genes tested in several out-of-sample evaluations, our method correctly identified all exons and introns. Notably, only 37% and 50%, respectively, of the presently unconfirmed genes in the C. elegans genome annotation agree with our predictions, thus we hypothesize that a sizable fraction of those genes are not correctly annotated. A retrospective evaluation of the Wormbase WS120 annotation  of C. elegans reveals that splice form predictions on unconfirmed genes in WS120 are inaccurate in about 18% of the considered cases, while our predictions deviate from the truth only in 10%–13%. We experimentally analyzed 20 controversial genes on which our system and the annotation disagree, confirming the superiority of our predictions. While our method correctly predicted 75% of those cases, the standard annotation was never completely correct. The accuracy of our system is further corroborated by a comparison with two other recently proposed systems that can be used for splice form prediction: SNAP and ExonHunter. We conclude that the genome annotation of C. elegans and other organisms can be greatly enhanced using modern machine learning technology.
Eukaryotic genes contain introns, which are intervening sequences that are excised from a gene transcript with the concomitant ligation of flanking segments called exons. The process of removing introns is called splicing. It involves biochemical mechanisms that to date are too complex to be modeled comprehensively and accurately. However, abundant sequencing results can serve as a blueprint database exemplifying what this process accomplishes. Using this database, we employ discriminative machine learning techniques to predict the mature mRNA given the unspliced pre-mRNA. Our method utilizes support vector machines and recent advances in label sequence learning, originally developed for natural language processing. The system, called mSplicer, was trained and evaluated on the genome of the nematode C. elegans, a well-studied model organism. We were able to show that mSplicer correctly predicts the splice form in most cases. Surprisingly, our predictions on currently unconfirmed genes deviate considerably from the public genome annotation. It is hypothesized that a sizable fraction of those genes are not correctly annotated. A retrospective evaluation and additional sequencing results show the superiority of mSplicer's predictions. It is concluded that the annotation of nematode and other genomes can be greatly enhanced using modern machine learning.
The goal of this study is to develop, test, and compare multinomial logistic regression (MLR) and support vector machines (SVM) in classifying preschool-aged children physical activity data acquired from an accelerometer. In this study, 69 children aged 3–5 years old were asked to participate in a supervised protocol of physical activities while wearing a triaxial accelerometer. Accelerometer counts, steps, and position were obtained from the device. We applied K-means clustering to determine the number of natural groupings presented by the data. We used MLR and SVM to classify the six activity types. Using direct observation as the criterion method, the 10-fold cross-validation (CV) error rate was used to compare MLR and SVM classifiers, with and without sleep. Altogether, 58 classification models based on combinations of the accelerometer output variables were developed. In general, the SVM classifiers have a smaller 10-fold CV error rate than their MLR counterparts. Including sleep, a SVM classifier provided the best performance with a 10-fold CV error rate of 24.70%. Without sleep, a SVM classifier-based triaxial accelerometer counts, vector magnitude, steps, position, and 1- and 2-min lag and lead values achieved a 10-fold CV error rate of 20.16% and an overall classification error rate of 15.56%. SVM supersedes the classical classifier MLR in categorizing physical activities in preschool-aged children. Using accelerometer data, SVM can be used to correctly classify physical activities typical of preschool-aged children with an acceptable classification error rate.
Accelerometers; activity monitoring; classification; multinomial logistic regression classifiers; support vector machines classifiers
Pain often exists in the absence of observable injury; therefore, the gold standard for pain assessment has long been self-report. Because the inability to verbally communicate can prevent effective pain management, research efforts have focused on the development of a tool that accurately assesses pain without depending on self-report. Those previous efforts have not proven successful at substituting self-report with a clinically valid, physiology-based measure of pain. Recent neuroimaging data suggest that functional magnetic resonance imaging (fMRI) and support vector machine (SVM) learning can be jointly used to accurately assess cognitive states. Therefore, we hypothesized that an SVM trained on fMRI data can assess pain in the absence of self-report. In fMRI experiments, 24 individuals were presented painful and nonpainful thermal stimuli. Using eight individuals, we trained a linear SVM to distinguish these stimuli using whole-brain patterns of activity. We assessed the performance of this trained SVM model by testing it on 16 individuals whose data were not used for training. The whole-brain SVM was 81% accurate at distinguishing painful from non-painful stimuli (p<0.0000001). Using distance from the SVM hyperplane as a confidence measure, accuracy was further increased to 84%, albeit at the expense of excluding 15% of the stimuli that were the most difficult to classify. Overall performance of the SVM was primarily affected by activity in pain-processing regions of the brain including the primary somatosensory cortex, secondary somatosensory cortex, insular cortex, primary motor cortex, and cingulate cortex. Region of interest (ROI) analyses revealed that whole-brain patterns of activity led to more accurate classification than localized activity from individual brain regions. Our findings demonstrate that fMRI with SVM learning can assess pain without requiring any communication from the person being tested. We outline tasks that should be completed to advance this approach toward use in clinical settings.