Summary: Differential dependency network (DDN) is a caBIG® (cancer Biomedical Informatics Grid) analytical tool for detecting and visualizing statistically significant topological changes in transcriptional networks representing two biological conditions. Developed under caBIG®'s In Silico Research Centers of Excellence (ISRCE) Program, DDN enables differential network analysis and provides an alternative way of defining network biomarkers predictive of phenotypes. DDN also serves as a useful systems biology tool for users across biomedical research communities to infer how genetic, epigenetic or environmental variables may affect biological networks and clinical phenotypes. Besides the standalone Java application, we have also developed a Cytoscape plug-in, CytoDDN, to integrate network analysis and visualization seamlessly.
Availability: The Java and MATLAB source code can be downloaded at the authors' web site http://www.cbil.ece.vt.edu/software.htm
Supplementary information: Supplementary data are available at Bioinformatics online.
Gene expression profiling has been widely used to study molecular signatures of many diseases and to develop molecular diagnostics for disease prediction. Gene selection, as an important step for improved diagnostics, screens tens of thousands of genes and identifies a small subset that discriminates between disease types. A two-step gene selection method is proposed to identify informative gene subsets for accurate classification of multiclass phenotypes. In the first step, individually discriminatory genes (IDGs) are identified by using one-dimensional weighted Fisher criterion (wFC). In the second step, jointly discriminatory genes (JDGs) are selected by sequential search methods, based on their joint class separability measured by multidimensional weighted Fisher criterion (wFC). The performance of the selected gene subsets for multiclass prediction is evaluated by artificial neural networks (ANNs) and/or support vector machines (SVMs). By applying the proposed IDG/JDG approach to two microarray studies, that is, small round blue cell tumors (SRBCTs) and muscular dystrophies (MDs), we successfully identified a much smaller yet efficient set of JDGs for diagnosing SRBCTs and MDs with high prediction accuracies (96.9% for SRBCTs and 92.3% for MDs, resp.). These experimental results demonstrated that the two-step gene selection method is able to identify a subset of highly discriminative genes for improved multiclass prediction.
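The first step described above, ranking genes one at a time by a Fisher-type criterion, can be sketched as follows. This is a minimal illustration that uses the plain (unweighted) one-dimensional Fisher ratio of between-class to within-class variance; the class-pair weighting that defines wFC is not specified in the abstract and is omitted here. The function names and the toy data layout are assumptions made for illustration only.

```python
from statistics import mean, pvariance


def fisher_score(values, labels):
    """Unweighted 1-D Fisher ratio for one gene: between-class / within-class variance."""
    overall = mean(values)
    between = 0.0
    within = 0.0
    for c in sorted(set(labels)):
        group = [v for v, y in zip(values, labels) if y == c]
        prior = len(group) / len(values)          # empirical class prior
        between += prior * (mean(group) - overall) ** 2
        within += prior * pvariance(group)
    if within == 0:
        # Perfectly tight classes: infinitely separable unless all means coincide.
        return float("inf") if between > 0 else 0.0
    return between / within


def rank_genes(expression, labels, top_k):
    """expression maps a gene name to its list of values, one per sample."""
    scores = {gene: fisher_score(vals, labels) for gene, vals in expression.items()}
    return sorted(scores, key=scores.get, reverse=True)[:top_k]
```

Genes with the highest scores would play the role of the individually discriminatory genes; the second step (sequential search on a multidimensional criterion) is not covered by this sketch.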
We investigate the problems of multiclass cancer classification
with gene selection from gene expression data. Two differently
constructed multiclass classifiers with gene selection are
proposed, which are fuzzy support vector machine (FSVM) with gene
selection and binary classification tree based on SVM with gene
selection. Using F test and recursive feature elimination based on
SVM as gene selection methods, binary classification tree based on
SVM with F test, binary classification tree based on SVM with
recursive feature elimination based on SVM, and FSVM with
recursive feature elimination based on SVM are tested in our
experiments. To accelerate computation, preselecting the strongest
genes is also used. The proposed techniques are applied to analyze
breast cancer data, small round blue-cell tumors, and acute
leukemia data. Compared to existing multiclass cancer classifiers
and binary classification tree based on SVM with F test or binary
classification tree based on SVM with recursive feature
elimination based on SVM mentioned in this paper, FSVM based on
recursive feature elimination based on SVM can find the most important
genes that affect certain types of cancer with high recognition accuracy.
Protein remote homology detection and fold recognition are central problems in computational biology. Supervised learning algorithms based on support vector machines are currently one of the most effective methods for solving these problems. These methods are primarily used to solve binary classification problems and they have not been extensively used to solve the more general multiclass remote homology prediction and fold recognition problems.
We present a comprehensive evaluation of a number of methods for building SVM-based multiclass classification schemes in the context of the SCOP protein classification. These methods include schemes that directly build an SVM-based multiclass model, schemes that employ a second-level learning approach to combine the predictions generated by a set of binary SVM-based classifiers, and schemes that build and combine binary classifiers for various levels of the SCOP hierarchy beyond those defining the target classes.
Analyzing the performance achieved by the different approaches on four different datasets we show that most of the proposed multiclass SVM-based classification approaches are quite effective in solving the remote homology prediction and fold recognition problems and that the schemes that use predictions from binary models constructed for ancestral categories within the SCOP hierarchy tend to not only lead to lower error rates but also reduce the number of errors in which a superfamily is assigned to an entirely different fold and a fold is predicted as being from a different SCOP class. Our results also show that the limited size of the training data makes it hard to learn complex second-level models, and that models of moderate complexity lead to consistently better results.
Predicting protein subnuclear localization is a challenging problem. Previous approaches based on non-sequence information, including Gene Ontology annotations and kernel fusion, each have limitations. The aim of this work is twofold: first, to propose a novel individual feature extraction method; second, to develop an ensemble method that improves prediction performance using comprehensive information represented as a high-dimensional feature vector obtained by 11 feature extraction methods.
A novel two-stage multiclass support vector machine is proposed to predict protein subnuclear localizations. It considers only feature extraction methods based on amino acid classifications and physicochemical properties. To speed up the system, an automatic search method for the kernel parameter is used. The prediction performance of our method is evaluated on four datasets: the Lei dataset, the multi-localization dataset, the SNL9 dataset and a new independent dataset. The overall prediction accuracy is 75.2% for 6 localizations on the Lei dataset and 72.1% for 9 localizations on the SNL9 dataset in leave-one-out cross-validation, 71.7% for the multi-localization dataset, and 69.8% for the new independent dataset. Comparisons with existing methods show that our method performs better for both single-localization and multi-localization proteins and achieves more balanced sensitivities and specificities on large-size and small-size subcellular localizations. The overall accuracy improvements are 4.0% and 4.7% for single-localization proteins and 6.5% for multi-localization proteins. The reliability and stability of our classification model are further confirmed by permutation analysis.
It can be concluded that our method is effective and valuable for predicting protein subnuclear localizations. A web server has been designed to implement the proposed method. It is freely available at http://bioinformatics.awowshop.com/snlpred_page.php.
We describe Support Vector Machine (SVM) applications to classification and clustering of channel current data. SVMs are variational-calculus based methods that are constrained to have structural risk minimization (SRM), i.e., they provide noise tolerant solutions for pattern recognition. The SVM approach encapsulates a significant amount of model-fitting information in the choice of its kernel. In work thus far, novel, information-theoretic, kernels have been successfully employed for notably better performance over standard kernels. Currently there are two approaches for implementing multiclass SVMs. One is called external multi-class that arranges several binary classifiers as a decision tree such that they perform a single-class decision making function, with each leaf corresponding to a unique class. The second approach, namely internal-multiclass, involves solving a single optimization problem corresponding to the entire data set (with multiple hyperplanes).
Each SVM approach encapsulates a significant amount of model-fitting information in its choice of kernel. In work thus far, novel, information-theoretic, kernels were successfully employed for notably better performance over standard kernels. Two SVM approaches to multiclass discrimination are described: (1) internal multiclass (with a single optimization), and (2) external multiclass (using an optimized decision tree). We describe benefits of the internal-SVM approach, along with further refinements to the internal-multiclass SVM algorithms that offer significant improvement in training time without sacrificing accuracy. In situations where the data isn't clearly separable, making for poor discrimination, signal clustering is used to provide robust and useful information – to this end, novel, SVM-based clustering methods are also described. As with the classification, there are Internal and External SVM Clustering algorithms, both of which are briefly described.
Intrusion detection systems have been used in the past, along with various techniques, to detect intrusions in networks effectively. However, most of these systems detect intruders only with a high false alarm rate. In this paper, we propose a new intelligent agent-based intrusion detection model for mobile ad hoc networks using a combination of attribute selection, outlier detection, and enhanced multiclass SVM classification methods. For this purpose, an effective preprocessing technique is proposed that improves the detection accuracy and reduces the processing time. Moreover, two new algorithms, namely, an Intelligent Agent Weighted Distance Outlier Detection algorithm and an Intelligent Agent-based Enhanced Multiclass Support Vector Machine algorithm, are proposed for detecting intruders in a distributed database environment that uses intelligent agents for trust management and coordination in transaction processing. The experimental results of the proposed model show that this system detects anomalies with a low false alarm rate and a high detection rate when tested with the KDD Cup 99 data set.
The human fungal pathogen Candida albicans is able to change its shape in response to various environmental signals. We analyzed the C. albicans BIG1 homolog, which might be involved in β-1,6-glucan biosynthesis in Saccharomyces cerevisiae. C. albicans BIG1 is a functional homolog of an S. cerevisiae BIG1 gene, because the slow growth of an S. cerevisiae big1 mutant was restored by introduction of C. albicans BIG1. CaBig1p was expressed constitutively in both the yeast and hyphal forms. A specific localization of CaBig1p at the endoplasmic reticulum or plasma membrane similar to the subcellular localization of S. cerevisiae Big1p was observed in yeast form. The content of β-1,6-glucan in the cell wall was decreased in the Cabig1Δ strain in comparison with the wild-type or reconstituted strain. The C. albicans BIG1 disruptant showed reduced filamentation on a solid agar medium and in a liquid medium. The Cabig1Δ mutant showed markedly attenuated virulence in a mouse model of systemic candidiasis. Adherence to human epithelial HeLa cells and fungal burden in kidneys of infected mice were reduced in the Cabig1Δ mutant. Deletion of CaBIG1 abolished hyphal growth and invasiveness in the kidneys of infected mice. Our results indicate that adhesion failure and morphological abnormality contribute to the attenuated virulence of the Cabig1Δ mutant.
The National Cancer Institute (NCI) is developing an integrated biomedical informatics infrastructure, the cancer Biomedical Informatics Grid (caBIG®), to support collaboration within the cancer research community. A key part of the caBIG architecture is the establishment of terminology standards for representing data. In order to evaluate the suitability of existing controlled terminologies, the caBIG Vocabulary and Data Elements Workspace (VCDE WS) working group has developed a set of criteria that serve to assess a terminology's structure, content, documentation, and editorial process. This paper describes the evolution of these criteria and the results of their use in evaluating four standard terminologies: the Gene Ontology (GO), the NCI Thesaurus (NCIt), the Common Terminology for Adverse Events (known as CTCAE), and the laboratory portion of the Logical Objects, Identifiers, Names and Codes (LOINC). The resulting caBIG criteria are presented as a matrix that may be applicable to any terminology standardization effort.
Terminology; Ontology; Auditing; Evaluation
Motivation: Direct sequencing of microbes in human ecosystems (the human microbiome) has complemented single genome cultivation and sequencing to understand and explore the impact of commensal microbes on human health. As sequencing technologies improve and costs decline, the sophistication of data has outgrown available computational methods. While several existing machine learning methods have been adapted for analyzing microbiome data recently, there is not yet an efficient and dedicated algorithm available for multiclass classification of human microbiota.
Results: By combining instance-based and model-based learning, we propose a novel sparse distance-based learning method for simultaneous class prediction and feature (variable or taxa, which is used interchangeably) selection from multiple treatment populations on the basis of 16S rRNA sequence count data. Our proposed method simultaneously minimizes the intraclass distance and maximizes the interclass distance with many fewer estimated parameters than other methods. It is very efficient for problems with small sample sizes and unbalanced classes, which are common in metagenomic studies. We implemented this method in a MATLAB toolbox called MetaDistance. We also propose several approaches for data normalization and variance stabilization transformation in MetaDistance. We validate this method on several real and simulated 16S rRNA datasets to show that it outperforms existing methods for classifying metagenomic data. This article is the first to address simultaneous multifeature selection and class prediction with metagenomic count data.
Availability: The MATLAB toolbox is freely available online at http://metadistance.igs.umaryland.edu/.
Supplementary Information: Supplementary data are available at Bioinformatics online.
Predicting a protein's structural or functional class from its amino acid sequence or structure is a fundamental problem in computational biology. Recently, there has been considerable interest in using discriminative learning algorithms, in particular support vector machines (SVMs), for classification of proteins. However, because sufficiently many positive examples are required to train such classifiers, all SVM-based methods are hampered by limited coverage.
In this study, we develop a hybrid machine learning approach for classifying proteins, and we apply the method to the problem of assigning proteins to structural categories based on their sequences or their 3D structures. The method combines a full-coverage but lower accuracy nearest neighbor method with higher accuracy but reduced coverage multiclass SVMs to produce a full coverage classifier with overall improved accuracy. The hybrid approach is based on the simple idea of "punting" from one method to another using a learned threshold.
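The "punting" idea can be sketched in a few lines. The version here assumes the hybrid trusts the multiclass SVM when its top score clears a threshold and otherwise falls back to the nearest neighbor prediction, and that the threshold is chosen by accuracy on held-out examples. The abstract does not give the scoring scale, the direction of the hand-off, or the threshold-learning procedure, so the names `hybrid_predict` and `learn_threshold` and the input layout are hypothetical.

```python
def hybrid_predict(svm_scores, nn_label, threshold):
    """Use the top-scoring SVM class if it is confident enough, else punt to NN.

    svm_scores: dict mapping class label -> SVM score (hypothetical layout).
    nn_label:   label proposed by the full-coverage nearest neighbor method.
    Returns (predicted label, name of the method that produced it).
    """
    best = max(svm_scores, key=svm_scores.get)
    if svm_scores[best] >= threshold:
        return best, "svm"
    return nn_label, "nn"


def learn_threshold(examples, candidates):
    """Pick the candidate threshold with the highest held-out accuracy.

    examples: list of (svm_scores, nn_label, true_label) triples.
    """
    def accuracy(t):
        hits = sum(hybrid_predict(s, n, t)[0] == y for s, n, y in examples)
        return hits / len(examples)

    return max(candidates, key=accuracy)
```

Because every query receives either the SVM label or the nearest neighbor label, the combined classifier keeps full coverage while using the SVM wherever it is confident.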
In cross-validated experiments on the SCOP hierarchy, the hybrid methods consistently outperform the individual component methods at all levels of coverage.
Code and data sets are available at
Classification is the problem of assigning each input object to one of a finite number of classes. This problem has been extensively studied in machine learning and statistics, and there are numerous applications to bioinformatics as well as many other fields. Building a multiclass classifier has been a challenge, where the direct approach of altering the binary classification algorithm to accommodate more than two classes can be computationally too expensive. Hence the indirect approach of using binary decomposition has been commonly used, in which retrieving the class posterior probabilities from the set of binary posterior probabilities given by the individual binary classifiers has been a major issue.
In this work, we present an extension of a recently introduced probabilistic kernel-based learning algorithm called the Classification Relevance Units Machine (CRUM) to the multiclass setting to increase its applicability. The extension is achieved under the error correcting output codes framework. The probabilistic outputs of the binary CRUMs are preserved using a proposed linear-time decoding algorithm, an alternative to the generalized Bradley-Terry (GBT) algorithm whose application to large-scale prediction settings is prohibited by its computational complexity. The resulting classifier is called the Multiclass Relevance Units Machine (McRUM).
The evaluation of McRUM on a variety of real small-scale benchmark datasets shows that our proposed Naïve decoding algorithm is computationally more efficient than the GBT algorithm while maintaining a similar level of predictive accuracy. A set of experiments on a larger-scale dataset for small ncRNA classification was then conducted with Naïve McRUM and compared with the Gaussian and linear SVM. Although McRUM's predictive performance is slightly lower than that of the Gaussian SVM, the results show that a similar true positive rate can be achieved by sacrificing the false positive rate slightly. Furthermore, McRUM is computationally more efficient than the SVM, which is an important factor for large-scale analysis.
We have proposed McRUM, a multiclass extension of binary CRUM. McRUM with Naïve decoding algorithm is computationally efficient in run-time and its predictive performance is comparable to the well-known SVM, showing its potential in solving large-scale multiclass problems in bioinformatics and other fields of study.
Variable selection on high throughput biological data, such as gene expression or single nucleotide polymorphisms (SNPs), becomes inevitable to select relevant information and, therefore, to better characterize diseases or assess genetic structure. There are different ways to perform variable selection in large data sets. Statistical tests are commonly used to identify differentially expressed features for explanatory purposes, whereas Machine Learning wrapper approaches can be used for predictive purposes. In the case of multiple highly correlated variables, another option is to use multivariate exploratory approaches to give more insight into cell biology, biological pathways or complex traits.
A simple extension of a sparse PLS exploratory approach is proposed to perform variable selection in a multiclass classification framework.
sPLS-DA has a classification performance similar to other wrapper or sparse discriminant analysis approaches on public microarray and SNP data sets. More importantly, sPLS-DA is clearly competitive in terms of computational efficiency and superior in terms of interpretability of the results via valuable graphical outputs. sPLS-DA is available in the R package mixOmics, which is dedicated to the analysis of large biological data sets.
Recently, a growing number of neuroimaging studies have begun to investigate the brains of schizophrenic patients and their healthy siblings to identify heritable biomarkers of this complex disorder. The objective of this study was to use multiclass pattern analysis to investigate the inheritable characters of schizophrenia at the individual level, by comparing whole-brain resting-state functional connectivity of patients with schizophrenia to their healthy siblings.
Twenty-four schizophrenic patients, twenty-five healthy siblings and twenty-two matched healthy controls underwent the resting-state functional Magnetic Resonance Imaging (rs-fMRI) scanning. A linear support vector machine along with principal component analysis was used to solve the multi-classification problem. By reconstructing the functional connectivities with high discriminative power, three types of functional connectivity-based signatures were identified: (i) state connectivity patterns, which characterize the nature of disruption in the brain network of patients with schizophrenia; (ii) trait connectivity patterns, reflecting shared connectivities of dysfunction in patients with schizophrenia and their healthy siblings, thereby providing a possible neuroendophenotype and revealing the genetic vulnerability to develop schizophrenia; and (iii) compensatory connectivity patterns, which underlie special brain connectivities by which healthy siblings might compensate for an increased genetic risk for developing schizophrenia.
Our multiclass pattern analysis achieved 62.0% accuracy via leave-one-out cross-validation (p < 0.001). The identified state patterns related to the default mode network, the executive control network and the cerebellum. For the trait patterns, functional connectivities between the cerebellum and the prefrontal lobe, the middle temporal gyrus, the thalamus and the middle temporal poles were identified. Connectivities among the right precuneus, the left middle temporal gyrus, the left angular and the left rectus, as well as connectivities between the cingulate cortex and the left rectus showed higher discriminative power in the compensatory patterns.
Based on our experimental results, we saw some indication of differences in functional connectivity patterns in the healthy siblings of schizophrenic patients compared to other healthy individuals who have no relations with the patients. Our preliminary investigation suggested that the use of resting-state functional connectivities as classification features to discriminate among schizophrenic patients, their healthy siblings and healthy controls is meaningful.
Schizophrenia; Healthy siblings; Functional magnetic resonance imaging; Resting-state; Functional connectivity; Multiclass pattern analysis
Multiclass classification of microarray data samples with a reduced number of genes is a rich and challenging problem in Bioinformatics research. The problem gets harder as the number of classes is increased. In addition, the performance of most classifiers is tightly linked to the effectiveness of mandatory gene selection methods. Critical to gene selection is the availability of estimates about the maximum number of genes that can be handled by any classification algorithm. Lack of such estimates may lead to either computationally demanding explorations of a search space with thousands of dimensions or classification models based on gene sets of unrestricted size. In the former case, unbiased but possibly overfitted classification models may arise. In the latter case, biased classification models unable to support statistically significant findings may be obtained.
A novel bound on the maximum number of genes that can be handled by binary classifiers in binary mediated multiclass classification algorithms of microarray data samples is presented. The bound suggests that high-dimensional binary output domains might favor the existence of accurate and sparse binary mediated multiclass classifiers for microarray data samples.
A comprehensive experimental work shows that the bound is indeed useful to induce accurate and sparse multiclass classifiers for microarray data samples.
Various statistical and machine learning methods have been successfully applied to the classification of DNA microarray data. Simple instance-based classifiers such as nearest neighbor (NN) approaches perform remarkably well in comparison to more complex models, and are currently experiencing a renaissance in the analysis of data sets from biology and biotechnology. While binary classification of microarray data has been extensively investigated, studies involving multiclass data are rare. The question remains open whether there exists a significant difference in performance between NN approaches and more complex multiclass methods. Comparative studies in this field commonly assess different models based on their classification accuracy only; however, this approach lacks the rigor needed to draw reliable conclusions and is inadequate for testing the null hypothesis of equal performance. Comparing novel classification models to existing approaches requires focusing on the significance of differences in performance.
We investigated the performance of instance-based classifiers, including a NN classifier able to assign a degree of class membership to each sample. This model alleviates a major problem of conventional instance-based learners, namely the lack of confidence values for predictions. The model translates the distances to the nearest neighbors into 'confidence scores'; the higher the confidence score, the closer is the considered instance to a pre-defined class. We applied the models to three real gene expression data sets and compared them with state-of-the-art methods for classifying microarray data of multiple classes, assessing performance using a statistical significance test that took into account the data resampling strategy. Simple NN classifiers performed as well as, or significantly better than, their more intricate competitors.
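One way to turn neighbor distances into per-class confidence scores is inverse-distance weighting, sketched below. The abstract does not state the exact distance-to-confidence mapping, so the inverse-distance rule, the Euclidean metric, and the function name `knn_confidence` are assumptions for illustration rather than the authors' model.

```python
import math
from collections import defaultdict


def knn_confidence(train, query, k=3):
    """Return {label: confidence} for a query, with confidences summing to 1.

    train: list of (feature vector, label) pairs.
    Each of the k nearest neighbors votes with weight 1 / distance, so closer
    neighbors contribute more to the confidence of their class.
    """
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    weights = defaultdict(float)
    for vector, label in nearest:
        # Small constant avoids division by zero for exact matches.
        weights[label] += 1.0 / (math.dist(vector, query) + 1e-9)
    total = sum(weights.values())
    return {label: w / total for label, w in weights.items()}
```

The predicted class is the label with the highest confidence, and the score itself indicates how far the prediction can be trusted.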
Given its highly intuitive underlying principles – simplicity, ease-of-use, and robustness – the k-NN classifier complemented by a suitable distance-weighting regime constitutes an excellent alternative to more complex models for multiclass microarray data sets. Instance-based classifiers using weighted distances are not limited to microarray data sets, but are likely to perform competitively in classifications of high-dimensional biological data sets such as those generated by high-throughput mass spectrometry.
Selection of influential genes with microarray data often faces the difficulties of a large number of genes and a relatively small group of subjects. In addition to the curse of dimensionality, many gene selection methods weight the contribution from each individual subject equally. This equal-contribution assumption cannot account for the possible dependence among subjects who associate similarly to the disease, and may restrict the selection of influential genes.
A novel approach to gene selection is proposed based on kernel similarities and kernel weights. We do not assume uniformity for subject contribution. Weights are calculated via regularized least squares support vector regression (RLS-SVR) of class levels on kernel similarities and are used to weight subject contribution. The cumulative sums of weighted expression levels are then ranked to select responsible genes. These procedures also work for multiclass classification. We demonstrate this algorithm on acute leukemia, colon cancer, small, round blue cell tumors of childhood, breast cancer, and lung cancer studies, using kernel Fisher discriminant analysis and support vector machines as classifiers. Other procedures are compared as well.
This approach is easy to implement and fast in computation for both binary and multiclass problems. The gene set provided by the RLS-SVR weight-based approach contains fewer genes and achieves a higher accuracy than other procedures.
Obstructive sleep apnea (OSA) has become an important public health concern. Polysomnography (PSG) is traditionally considered an established and effective diagnostic tool providing information on the severity of OSA and the degree of sleep fragmentation. However, the numerous steps in the PSG test to diagnose OSA are costly and time consuming. This study aimed to apply the multiclass Mahalanobis-Taguchi system (MMTS) based on anthropometric information and questionnaire data to predict OSA. Implementation results showed that MMTS had an accuracy of 84.38% on the OSA prediction and achieved better performance compared to other approaches such as logistic regression, neural networks, support vector machine, C4.5 decision tree, and rough set. Therefore, MMTS can assist doctors in prediagnosis of OSA before running the PSG test, thereby enabling the more effective use of medical resources.
Summary: The affordability of high-throughput sequencing has created an unprecedented surge in the use of genomic data in basic, translational and clinical research. The rapid evolution of sequencing technology, coupled with its broad adoption across biology and medicine, necessitates fast, collaborative interdisciplinary discussion. SEQanswers provides a real-time knowledge-sharing resource to address this need, covering experimental and computational aspects of sequencing and sequence analysis. Developers of popular analysis tools are among the >4000 active members, and ~40 peer-reviewed publications have referenced SEQanswers.
Availability: The SEQanswers community is freely accessible at http://SEQanswers.com/
Supplementary data are available at Bioinformatics online.
Summary: We have developed an RNA-Seq analysis workflow for single-ended Illumina reads, termed RseqFlow. This workflow includes a set of analytic functions, such as quality control for sequencing data, signal tracks of mapped reads, calculation of expression levels, identification of differentially expressed genes and coding SNP calling. This workflow is formalized and managed by the Pegasus Workflow Management System, which maps the analysis modules onto available computational resources, automatically executes the steps in the appropriate order and supervises the whole running process. RseqFlow is available as a Virtual Machine with all the necessary software, which eliminates any complex configuration and installation steps.
Availability and implementation: http://genomics.isi.edu/rnaseq
Supplementary information: Supplementary data are available at Bioinformatics online.
To develop a security infrastructure to support controlled and secure access to data and analytical resources in a biomedical research Grid environment, while facilitating resource sharing among collaborators.
A Grid security infrastructure, called Grid Authentication and Authorization with Reliably Distributed Services (GAARDS), is developed as a key architecture component of the NCI-funded cancer Biomedical Informatics Grid (caBIG™). The GAARDS is designed to support in a distributed environment 1) efficient provisioning and federation of user identities and credentials; 2) group-based access control support with which resource providers can enforce policies based on community accepted groups and local groups; and 3) management of a trust fabric so that policies can be enforced based on required levels of assurance.
GAARDS is implemented as a suite of Grid services and administrative tools. It provides three core services: Dorian for management and federation of user identities, Grid Trust Service for maintaining and provisioning a federated trust fabric within the Grid environment, and Grid Grouper for enforcing authorization policies based on both local and Grid-level groups.
The GAARDS infrastructure is available as a stand-alone system and as a component of the caGrid infrastructure. More information about GAARDS can be accessed at http://www.cagrid.org.
GAARDS provides a comprehensive system to address the security challenges associated with environments in which resources may be located at different sites, requests to access the resources may cross institutional boundaries, and user credentials are created, managed, and revoked dynamically in a decentralized manner.
We show that the Confusion Entropy, a measure of performance in multiclass problems, has a strong (monotone) relation with the multiclass generalization of a classical metric, the Matthews Correlation Coefficient. Analytical results are provided for the limit cases of general no-information (n-face dice rolling) of the binary classification. Computational evidence supports the claim in the general case.
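To make the two measures concrete, the following is a minimal sketch that computes both from a confusion matrix. The abstract does not spell out the formulas, so the definitions used here are assumptions: the multiclass Matthews Correlation Coefficient is taken in its usual confusion-matrix form, and the Confusion Entropy in its commonly cited form with logarithm base 2(N-1) for N classes. The function names are illustrative only.

```python
import math


def multiclass_mcc(c):
    """Multiclass MCC from a square confusion matrix c[true][predicted]."""
    n = len(c)
    total = sum(sum(row) for row in c)
    correct = sum(c[k][k] for k in range(n))
    true_counts = [sum(c[k]) for k in range(n)]                        # row sums
    pred_counts = [sum(c[i][k] for i in range(n)) for k in range(n)]   # column sums
    num = correct * total - sum(t * p for t, p in zip(true_counts, pred_counts))
    den = math.sqrt(
        (total ** 2 - sum(p * p for p in pred_counts))
        * (total ** 2 - sum(t * t for t in true_counts))
    )
    # Convention (assumption): return 0 when the denominator vanishes.
    return num / den if den else 0.0


def confusion_entropy(c):
    """Confusion Entropy (assumed definition) for N >= 2 classes."""
    n = len(c)
    total = sum(sum(row) for row in c)
    base = 2 * (n - 1)
    cen = 0.0
    for j in range(n):
        # All counts in row j and column j (the diagonal entry is counted twice).
        mass_j = sum(c[j][k] + c[k][j] for k in range(n))
        if mass_j == 0:
            continue
        cen_j = 0.0
        for k in range(n):
            if k == j:
                continue
            # Only misclassifications involving class j contribute.
            for count in (c[j][k], c[k][j]):
                if count > 0:
                    p = count / mass_j
                    cen_j -= p * math.log(p, base)
        cen += (mass_j / (2 * total)) * cen_j
    return cen


perfect = [[5, 0], [0, 5]]   # every sample classified correctly
uniform = [[5, 5], [5, 5]]   # binary "coin toss": no information
```

Under these definitions, a perfect classifier gives MCC = 1 and Confusion Entropy = 0, while the uninformative binary matrix gives MCC = 0 and Confusion Entropy = 1, which illustrates the inverse monotone relation between the two measures at the extremes.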
Due to the large number of genes in a typical microarray dataset, feature selection looks set to play an important role in reducing noise and computational cost in gene expression-based tissue classification while improving accuracy at the same time. Surprisingly, this does not appear to be the case for all multiclass microarray datasets. The reason is that many feature selection techniques applied to microarray datasets are either rank-based, and hence do not take into account correlations between genes, or wrapper-based, which incur high computational cost and often yield difficult-to-reproduce results. In studies where correlations between genes are considered, attempts to establish the merit of the proposed techniques are hampered by evaluation procedures which are less than meticulous, resulting in overly optimistic estimates of accuracy.
We present two realistically evaluated correlation-based feature selection techniques which incorporate, in addition to the two existing criteria involved in forming a predictor set (relevance and redundancy), a third criterion called the degree of differential prioritization (DDP). DDP functions as a parameter to strike the balance between relevance and redundancy, providing our techniques with the novel ability to differentially prioritize the optimization of relevance against redundancy (and vice versa). This ability proves useful in producing optimal classification accuracy while using reasonably small predictor set sizes for nine well-known multiclass microarray datasets.
For multiclass microarray datasets, especially the GCM and NCI60 datasets, DDP enables our filter-based techniques to produce accuracies better than those reported in previous studies which employed similarly realistic evaluation procedures.
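The role of a parameter that trades relevance against redundancy can be illustrated with a small greedy selection sketch. The abstract does not give the scoring formula, so everything below is an assumption made for illustration: relevance is measured as a between-class to within-class sum-of-squares ratio, redundancy as the mean absolute Pearson correlation with the genes already selected, and the two are combined as relevance^alpha * (1 - redundancy)^(1 - alpha), with alpha standing in for the DDP. This is not claimed to be the authors' exact criterion.

```python
import math


def relevance(values, labels):
    """Between-class / within-class sum-of-squares ratio for one feature."""
    overall = sum(values) / len(values)
    bss = wss = 0.0
    for cls in set(labels):
        members = [v for v, lab in zip(values, labels) if lab == cls]
        mean_c = sum(members) / len(members)
        bss += len(members) * (mean_c - overall) ** 2
        wss += sum((v - mean_c) ** 2 for v in members)
    return bss / (wss + 1e-12)  # small constant avoids division by zero


def abs_pearson(x, y):
    """Absolute Pearson correlation; 0.0 if either feature is constant."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return abs(cov) / math.sqrt(vx * vy) if vx > 0 and vy > 0 else 0.0


def select_predictor_set(features, labels, size, alpha):
    """Greedy forward selection; alpha in [0, 1] plays the role of the DDP."""
    rel = [relevance(f, labels) for f in features]
    selected = []
    while len(selected) < min(size, len(features)):
        best, best_score = None, -1.0
        for i, f in enumerate(features):
            if i in selected:
                continue
            if selected:
                red = sum(abs_pearson(f, features[s]) for s in selected) / len(selected)
            else:
                red = 0.0
            antiredundancy = max(0.0, 1.0 - red)  # clamp floating-point overshoot
            score = rel[i] ** alpha * antiredundancy ** (1.0 - alpha)
            if score > best_score:
                best, best_score = i, score
        selected.append(best)
    return selected


# Toy data (hypothetical): feature 1 nearly duplicates feature 0,
# feature 2 is less relevant but carries different information.
labels = [0, 0, 1, 1, 2, 2]
features = [
    [0.0, 0.2, 2.0, 2.2, 2.0, 2.2],
    [0.0, 0.5, 4.0, 4.5, 4.0, 4.5],
    [0.0, 0.3, 0.0, 0.3, 2.0, 2.3],
]
by_relevance_only = select_predictor_set(features, labels, size=2, alpha=1.0)
balanced = select_predictor_set(features, labels, size=2, alpha=0.5)
```

On this toy data, alpha = 1.0 (relevance only) selects features 0 and 1, a redundant pair, whereas alpha = 0.5 selects features 0 and 2. Tuning the weight between the two criteria is the kind of differential prioritization the abstract describes.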
This paper proposes that interoperability across biomedical databases can be improved by utilizing a repository of Common Data Elements (CDEs), UML model class-attributes and simple lexical algorithms to facilitate the building of domain models. This is examined in the context of an existing system, the National Cancer Institute (NCI)'s cancer Biomedical Informatics Grid (caBIG™). The goal is to demonstrate the deployment of open source tools that can be used to effectively map models and enable the reuse of existing information objects and CDEs in the development of new models for translational research applications. This effort is intended to help developers reuse appropriate CDEs to enable interoperability of their systems when developing within the caBIG™ framework or other frameworks that use metadata repositories.
The Dice (di-grams) and Dynamic algorithms are compared, and both perform similarly when matching UML model class-attributes to CDE class object-property pairs. With the algorithms used, the baselines for automatically finding the matches are reasonable for the data models examined. This suggests that automatic mapping of UML models and CDEs is feasible within the caBIG™ framework and potentially any framework that uses a metadata repository.
This work opens up the possibility of using mapping algorithms to reduce the cost and time required to map local data models to a reference data model such as those used within caBIG™. This effort contributes to facilitating the development of interoperable systems within caBIG™ as well as other metadata frameworks. Such efforts are critical to address the need to develop systems to handle enormous amounts of diverse data that can be leveraged from new biomedical methodologies.
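As an illustration of the di-gram approach, the sketch below scores two names with the Dice coefficient over sets of character bigrams, 2|X ∩ Y| / (|X| + |Y|), and picks the best-scoring candidate. The normalization step, the function names and the example attribute and CDE names are hypothetical; the abstract does not describe the exact preprocessing used, and the Dynamic algorithm is not covered here.

```python
def normalize(name):
    """Lower-case and drop everything except letters and digits."""
    return "".join(ch for ch in name.lower() if ch.isalnum())


def bigrams(text):
    """Set of adjacent character pairs in the normalized text."""
    s = normalize(text)
    return {s[i:i + 2] for i in range(len(s) - 1)}


def dice(a, b):
    """Dice coefficient on character-bigram sets, in the range [0, 1]."""
    x, y = bigrams(a), bigrams(b)
    if not x and not y:
        # Strings too short to form bigrams: fall back to exact comparison.
        return 1.0 if normalize(a) == normalize(b) else 0.0
    return 2 * len(x & y) / (len(x) + len(y))


def best_match(attribute, candidates):
    """Return (candidate, score) with the highest Dice score."""
    return max(((c, dice(attribute, c)) for c in candidates), key=lambda pair: pair[1])


# Hypothetical UML attribute and CDE names, for illustration only.
cde_names = ["Patient Birth Date", "Patient Gender Code", "Specimen Collection Date"]
match, score = best_match("patientBirthDate", cde_names)
```

For example, `dice("night", "nacht")` is 0.25, since the two words share one bigram ("ht") out of four each. In the matching example, "patientBirthDate" maps to "Patient Birth Date" with a score of 1.0, because case and spacing are removed before comparison.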
The cancer Biomedical Informatics Grid (caBIG) was launched in 2003 by the US National Cancer Institute with the aim of connecting research teams through the use of shared infrastructure and software to collect, analyse and share data. It was an ambitious project, and the issue it aimed to address was huge and far-reaching. With such developments as the mapping of the human genome and the advancement of new technologies for the analysis of genes and proteins, cancer researchers have never produced so much complex data, nor have they understood so much about cancer on a molecular level. This new ‘molecular understanding’ of cancer, according to the caBIG 2007 ‘Pilot Report’, leads to molecular or ‘personalised’ medicine being the way forward in cancer research and treatment, and connects basic research to clinical care in an unprecedented way. But the former ‘silo-like’ nature of research does not lend itself to this brave new world of molecular medicine—individual labs and institutes working in isolation, “in effect, as cottage industries, each collecting and interpreting data using a unique language of their own” will not advance cancer research as it should be advanced. The solution proposed by the NCI in caBIG was to produce an integrated informatics grid (‘caGrid’) to incorporate open source, open access tools to collect, analyse and share data, enabling everyone to use the same methods and language for these tasks.
caBIG is primarily a US-based endeavour, and though the tools are openly available for users worldwide, it is in US NCI-funded cancer centres that they have been actively introduced and promoted with the eventual hope, according to the pilot report, of being able to do the same worldwide. caBIG also has a collaboration in place with the UK organisation NCRI to exchange technologies and research data. The European Association for Cancer Research, a member association for cancer researchers, conducted an online survey in January 2011 to identify the penetration of the ambitious caBIG project into European laboratories. The survey was sent to 6396 researchers based in Europe, with 764 respondents, a total response rate of 11.94%.