This is an editorial report of the supplement to BMC Genomics that includes 15 papers selected from the BIOCOMP'10 - The 2010 International Conference on Bioinformatics & Computational Biology as well as other sources with a focus on genomics studies.
BIOCOMP'10 was held on July 12-15 in Las Vegas, Nevada. The congress covered a large variety of research areas, and genomics was one of the major focuses because of the fast development in this field. We set out to launch a supplement to BMC Genomics with manuscripts selected from this congress and invited submissions. With a rigorous peer review process, we selected 15 manuscripts that showed work in cutting-edge genomics fields and proposed innovative methodology. We hope this supplement presents the current computational and statistical challenges faced in genomics studies, and shows the enormous promises and opportunities in the genomic future.
Significant interest exists in establishing synergistic research in bioinformatics, systems biology and intelligent computing. Supported by the United States National Science Foundation (NSF), International Society of Intelligent Biological Medicine (http://www.ISIBM.org), International Journal of Computational Biology and Drug Design (IJCBDD) and International Journal of Functional Informatics and Personalized Medicine, the ISIBM International Joint Conferences on Bioinformatics, Systems Biology and Intelligent Computing (ISIBM IJCBS 2009) attracted more than 300 papers and 400 researchers and medical doctors world-wide. It was the only inter/multidisciplinary conference aimed to promote synergistic research and education in bioinformatics, systems biology and intelligent computing. The conference committee was very grateful for the valuable advice and suggestions from honorary chairs, steering committee members and scientific leaders including Dr. Michael S. Waterman (USC, Member of United States National Academy of Sciences), Dr. Chih-Ming Ho (UCLA, Member of United States National Academy of Engineering and Academician of Academia Sinica), Dr. Wing H. Wong (Stanford, Member of United States National Academy of Sciences), Dr. Ruzena Bajcsy (UC Berkeley, Member of United States National Academy of Engineering and Member of United States Institute of Medicine of the National Academies), Dr. Mary Qu Yang (United States National Institutes of Health and Oak Ridge, DOE), Dr. Andrzej Niemierko (Harvard), Dr. A. Keith Dunker (Indiana), Dr. Brian D. Athey (Michigan), Dr. Weida Tong (FDA, United States Department of Health and Human Services), Dr. Cathy H. Wu (Georgetown), Dr. Dong Xu (Missouri), Drs. Arif Ghafoor and Okan K Ersoy (Purdue), Dr. Mark Borodovsky (Georgia Tech, President of ISIBM), Dr. Hamid R. Arabnia (UGA, Vice-President of ISIBM), and other scientific leaders. The committee presented the 2009 ISIBM Outstanding Achievement Awards to Dr. Joydeep Ghosh (UT Austin), Dr. Aidong Zhang (Buffalo) and Dr. Zhi-Hua Zhou (Nanjing) for their significant contributions to the field of intelligent biological medicine.
Significant interest exists in establishing radiologic imaging as a valid biomarker for assessing the response of cancer to a variety of treatments. To address this problem, we have chosen to study patients with metastatic colorectal carcinoma to learn whether statistical learning theory (SLT) can improve the performance of radiologists using CT in predicting patient response to therapy compared with the more traditional RECIST (Response Evaluation Criteria in Solid Tumors) standard.
Predictions of survival after 8 months in 38 patients with metastatic colorectal carcinoma using the Support Vector Machine (SVM) technique improved 30% when using additional information compared to WHO (World Health Organization) or RECIST measurements alone. With both Logistic Regression (LR) and SVM, there was no significant difference in performance between WHO and RECIST. The SVM and LR techniques also demonstrated that one radiologist consistently outperformed another.
This preliminary research study has demonstrated that SLT algorithms, properly used in a clinical setting, have the potential to address questions and criticisms associated with both RECIST and WHO scoring methods. We also propose that tumor heterogeneity, shape, etc. obtained from CT and/or MRI scans be added to the SLT feature vector for processing.
Short interfering RNAs (siRNAs) can be used to knock down gene expression in functional genomics. For a target gene of interest, many siRNA molecules may be designed, but their efficiency of expression inhibition often varies.
To facilitate gene functional studies, we have developed a new machine learning method to predict siRNA potency based on random forests and support vector machines. Since there were many potential sequence features, random forests were used to select the most relevant features affecting gene expression inhibition. Support vector machine classifiers were then constructed using the selected sequence features for predicting siRNA potency. Interestingly, gene expression inhibition is significantly affected by nucleotide dimer and trimer compositions of siRNA sequence.
The findings in this study should help design potent siRNAs for functional genomics, and might also provide further insights into the molecular mechanism of RNA interference.
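The nucleotide, dimer and trimer compositions mentioned above can be computed directly from an siRNA sequence. The following is a minimal sketch, assuming relative k-mer frequencies over the RNA alphabet as the feature encoding; the function names `kmer_composition` and `sirna_features` are hypothetical, and the study's actual feature set is not specified here. The resulting feature dictionary could then be passed to any random forest or support vector machine implementation.

```python
from itertools import product


def kmer_composition(seq, k):
    """Return the relative frequency of every k-mer over the alphabet ACGU."""
    seq = seq.upper().replace("T", "U")  # accept DNA-style input as well
    kmers = ["".join(p) for p in product("ACGU", repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    total = len(seq) - k + 1  # number of sliding windows of width k
    for i in range(total):
        word = seq[i:i + k]
        if word in counts:  # windows with ambiguous bases are skipped
            counts[word] += 1
    return {w: (c / total if total > 0 else 0.0) for w, c in counts.items()}


def sirna_features(seq):
    """Concatenate mono-, di- and trinucleotide compositions (4 + 16 + 64 features)."""
    feats = {}
    for k in (1, 2, 3):
        feats.update(kmer_composition(seq, k))
    return feats
```

Keys of different lengths cannot collide, so the three compositions can share one dictionary.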
In the classification of Mass Spectrometry (MS) proteomics data, peak detection, feature selection, and learning classifiers are critical to classification accuracy. To better understand which methods are more accurate when classifying data, some publicly available peak detection algorithms for Matrix assisted Laser Desorption Ionization Mass Spectrometry (MALDI-MS) data were recently compared; however, the issue of different feature selection methods and different classification models as they relate to classification performance has not been addressed. With the application of intelligent computing, much progress has been made in the development of feature selection methods and learning classifiers for the analysis of high-throughput biological data. The main objective of this paper is to compare the methods of feature selection and different learning classifiers when applied to MALDI-MS data and to provide a subsequent reference for the analysis of MS proteomics data.
We compared a well-known method of feature selection, Support Vector Machine Recursive Feature Elimination (SVMRFE), and a recently developed method, Gradient based Leave-one-out Gene Selection (GLGS), that effectively performs microarray data analysis. We also compared several learning classifiers including K-Nearest Neighbor Classifier (KNNC), Naïve Bayes Classifier (NBC), Nearest Mean Scaled Classifier (NMSC), uncorrelated normal based quadratic Bayes Classifier recorded as UDC, Support Vector Machines, and a distance metric learning for Large Margin Nearest Neighbor classifier (LMNN) based on Mahalanobis distance. To compare, we conducted a comprehensive experimental study using three types of MALDI-MS data.
Regarding feature selection, SVMRFE outperformed GLGS in classification. As for the learning classifiers, when classification models derived from the best training were compared, SVMs performed the best with respect to the expected testing accuracy. However, the distance metric learning LMNN outperformed SVMs and other classifiers on evaluating the best testing. In such cases, the optimum classification model based on LMNN is worth investigating for future study.
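As an illustration of how a nearest-neighbor classifier of the kind compared above can be evaluated, here is a minimal sketch of k-nearest-neighbor prediction with leave-one-out accuracy. It assumes plain Euclidean distance, which is the special case of a Mahalanobis distance with an identity matrix; LMNN would instead learn that matrix from the training data. The function names are hypothetical and this is not the evaluation protocol of the study.

```python
import math
from collections import Counter


def knn_predict(train_X, train_y, x, k=3):
    """Predict the label of x by majority vote among its k nearest training rows."""
    dists = sorted((math.dist(row, x), label) for row, label in zip(train_X, train_y))
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]


def loo_accuracy(X, y, k=3):
    """Leave-one-out accuracy: each sample is predicted from all the others."""
    hits = 0
    for i in range(len(X)):
        rest_X = X[:i] + X[i + 1:]
        rest_y = y[:i] + y[i + 1:]
        hits += knn_predict(rest_X, rest_y, X[i], k) == y[i]
    return hits / len(X)
```

With an odd `k` and two classes there are no tied votes; other settings may need an explicit tie-breaking rule.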
Gene expression time series array data has become a useful resource for investigating gene functions and the interactions between genes. However, the gene expression arrays are always mixed with noise, and many nonlinear regulatory relationships have been omitted in many linear models. Because of those practical limitations, inference of gene regulatory model from expression data is still far from satisfactory.
In this study, we present a model-based computational approach, Slice Pattern Model (SPM), to identify gene regulatory networks from time series gene expression array data. To assess the stability and reliability of our model, we test an artificial gene network with both the traditional linear model and SPM. SPM can handle multiple transcriptional time lags and reconstructs the gene network more accurately. Using SPM, a 17 time-series gene expression data set from the yeast cell cycle is used to reconstruct the regulatory network. Under the reliability threshold θ = 55%, 18 relationships between genes are identified and the transcriptional regulatory network is reconstructed. Results from previous studies demonstrate that most of the gene relationships identified by SPM are correct.
With the help of pattern recognition and similarity analysis, the effect of noise is limited in the SPM method. At the same time, a genetic algorithm is introduced to optimize the parameters of the gene network model, based on a statistical method in our experiments. The experimental results demonstrate that the gene regulatory model reconstructed using SPM is more stable and reliable than models obtained from the traditional linear model.
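The idea of a transcriptional time lag can be illustrated with a simple lagged correlation between a candidate regulator and a target gene. The sketch below is not the Slice Pattern Model, whose details are not given here; it only assumes equally spaced time points and uses Pearson correlation as a stand-in similarity measure. The names `pearson` and `best_lag` are hypothetical.

```python
import math


def pearson(a, b):
    """Pearson correlation of two equal-length series (0.0 if either is constant)."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / math.sqrt(var_a * var_b)


def best_lag(regulator, target, max_lag=3):
    """Return (lag, r) with the largest |r| between regulator[t] and target[t + lag]."""
    best = (0, 0.0)
    for lag in range(max_lag + 1):
        a = regulator[:len(regulator) - lag]
        b = target[lag:]
        if len(a) < 3:  # too few overlapping points to correlate
            break
        r = pearson(a, b)
        if abs(r) > abs(best[1]):
            best = (lag, r)
    return best
```

A positive `r` at the best lag would suggest activation and a negative one repression, although correlation alone cannot establish regulation.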
Protein-DNA interactions are involved in many biological processes essential for cellular function. To understand the molecular mechanism of protein-DNA recognition, it is necessary to identify the DNA-binding residues in DNA-binding proteins. However, structural data are available for only a few hundred protein-DNA complexes. With the rapid accumulation of sequence data, it becomes an important but challenging task to accurately predict DNA-binding residues directly from amino acid sequence data.
A new machine learning approach has been developed in this study for predicting DNA-binding residues from amino acid sequence data. The approach used both the labelled data instances collected from the available structures of protein-DNA complexes and the abundant unlabeled data found in protein sequence databases. The evolutionary information contained in the unlabeled sequence data was represented as position-specific scoring matrices (PSSMs) and several new descriptors. The sequence-derived features were then used to train random forests (RFs), which could handle a large number of input variables and avoid model overfitting. The use of evolutionary information was found to significantly improve classifier performance. The RF classifier was further evaluated using a separate test dataset, and the predicted DNA-binding residues were examined in the context of three-dimensional structures.
The results suggest that the RF-based approach gives rise to more accurate prediction of DNA-binding residues than previous studies. A new web server called BindN-RF has thus been developed to make the RF classifier accessible to the biological research community.
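Residue-level predictors of this kind typically describe each residue by the PSSM rows of its sequence neighborhood. The following is a minimal sketch of such a sliding-window encoding, assuming the PSSM is given as a list of per-residue score rows; the window size, the zero padding at the termini and the function name `window_features` are assumptions for illustration, not the settings used for BindN-RF.

```python
def window_features(pssm, center, half_width=2, pad=0.0):
    """Concatenate the PSSM rows in a window centred on residue index `center`.

    Positions that fall outside the sequence are filled with `pad`, so every
    residue yields a vector of the same length: (2 * half_width + 1) * n_cols.
    """
    n_cols = len(pssm[0])
    feats = []
    for pos in range(center - half_width, center + half_width + 1):
        if 0 <= pos < len(pssm):
            feats.extend(pssm[pos])
        else:
            feats.extend([pad] * n_cols)
    return feats
```

Each such vector, paired with a binding or non-binding label, would form one training instance for a classifier.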
The advent of high-throughput next generation sequencing technologies has fostered enormous potential applications of supercomputing techniques in genome sequencing, epi-genetics, metagenomics, personalized medicine, discovery of non-coding RNAs and protein-binding sites. To this end, the 2008 International Conference on Bioinformatics and Computational Biology (Biocomp) – 2008 World Congress on Computer Science, Computer Engineering and Applied Computing (Worldcomp) was designed to promote synergistic inter/multidisciplinary research and education in response to the current research trends and advances. The conference attracted more than two thousand scientists, medical doctors, engineers, professors and students, who gathered in Las Vegas, Nevada, USA during July 14–17, and was a great success. Supported by International Society of Intelligent Biological Medicine (ISIBM), International Journal of Computational Biology and Drug Design (IJCBDD), International Journal of Functional Informatics and Personalized Medicine (IJFIPM) and the leading research laboratories from Harvard, M.I.T., Purdue, UIUC, UCLA, Georgia Tech, UT Austin, U. of Minnesota, U. of Iowa etc., the conference received thousands of research papers. Each submitted paper was reviewed by at least three reviewers and accepted papers were required to satisfy reviewers' comments. Finally, the review board and the committee decided to select only 19 high-quality research papers for inclusion in this supplement to BMC Genomics, based solely on the peer reviews. The conference committee was very grateful for the Plenary Keynote Lectures given by: Dr. Brian D. Athey (University of Michigan Medical School), Dr. Vladimir N. Uversky (Indiana University School of Medicine), Dr. David A. Patterson (Member of United States National Academy of Sciences and National Academy of Engineering, University of California at Berkeley) and Anousheh Ansari (Prodea Systems, Space Ambassador).
The theme of the conference to promote synergistic research and education has been achieved successfully.
Many protein regions and some entire proteins have no definite tertiary structure, presenting instead as dynamic, disordered ensembles under different physicochemical circumstances. These proteins and regions are known as Intrinsically Unstructured Proteins (IUP). IUP have been associated with a wide range of protein functions, along with roles in diseases characterized by protein misfolding and aggregation.
Identifying IUP is an important task in structural and functional genomics. We extract useful features from sequences and develop machine learning algorithms for this task. We compare our IUP predictor with PONDRs (mainly neural-network-based predictors), disEMBL (also based on neural networks) and Globplot (based on disorder propensity).
We find that augmenting the feature set with features derived from physicochemical properties of amino acids (such as hydrophobicity and complexity) and using an ensemble method are both beneficial. The IUP predictor is a viable alternative software tool for identifying IUP protein regions and proteins.
Sudden death syndrome (SDS) of soybean (Glycine max L. Merr.) is an economically important disease, caused by the semi-biotrophic fungus Fusarium solani f. sp. glycines, recently renamed Fusarium virguliforme (Fv). Due to the complexity and length of the soybean-Fusarium interaction, the molecular mechanisms underlying plant resistance and susceptibility to the pathogen are not fully understood. F. virguliforme has a very wide host range for the ability to cause root rot and a very narrow host range for the ability to cause a leaf scorch. Arabidopsis thaliana is a host for many types of phytopathogens including bacteria, fungi, viruses and nematodes. Deciphering the variations among transcript abundances (TAs) of functional orthologous genes of soybean and A. thaliana involved in the interaction will provide insights into plant resistance to F. virguliforme.
In this study, we report the analyses of microarrays measuring TA in whole plants after A. thaliana cv 'Columbia' was challenged with the fungal pathogen F. virguliforme. Infection caused significant variations in TAs. The number of transcripts that increased in abundance was nearly four times the number that decreased. A putative resistance pathway involved in responding to the pathogen infection in A. thaliana was identified and compared to that reported in soybean.
Microarray experiments allow the interrogation of tens of thousands of transcripts simultaneously and thus the identification of plant pathways likely to be involved in plant resistance to Fusarial pathogens. Dissection of the set of functional orthologous genes between soybean and A. thaliana enabled a broad view of the functional relationships and molecular interactions among plant genes involved in F. virguliforme resistance.
Dimension reduction is a critical issue in the analysis of microarray data, because the high dimensionality of gene expression microarray data sets hurts the generalization performance of classifiers. It consists of two types of methods, i.e. feature selection and feature extraction. Principal component analysis (PCA) and partial least squares (PLS) are two frequently used feature extraction methods, and in previous works the top several components of PCA or PLS are selected for modeling according to the descending order of eigenvalues. In this paper, we prove that not all the top features are useful, and that features should instead be selected from all the components by feature selection methods.
We demonstrate a framework for selecting feature subsets from all the newly extracted components, leading to reduced classification error rates on gene expression microarray data. Here we have considered both an unsupervised method, PCA, and a supervised method, PLS, for extracting new components, genetic algorithms for feature selection, and support vector machines and k nearest neighbor for classification. Experimental results illustrate that our proposed framework is effective at selecting feature subsets and reducing classification error rates.
The top features newly extracted by PCA or PLS are not the only important ones; therefore, feature selection should be performed to select subsets from the new features to improve the generalization performance of classifiers.
Prostate cancer is one of the leading causes of cancer death in men. Androgen ablation, the most commonly-used therapy for progressive prostate cancer, is ineffective once the cancer cells become androgen-independent. The regulatory mechanisms that cause this transition (from androgen-dependent to androgen-independent) remain unknown. In this study, based on the microarray data comparing global gene expression patterns in the prostate tissue between androgen-dependent and -independent prostate cancer patients, we identify a set of transcription factors and microRNAs that potentially cause such difference, using a model-based computational approach.
From 335 position weight matrices in the TRANSFAC database and 564 microRNAs in the microRNA registry, our model identifies 5 transcription factors and 7 microRNAs as potentially responsible for the level of androgen dependency. All 5 transcription factors are predicted to inhibit transcription in androgen-independent samples compared with the dependent ones. Six out of 7 microRNAs, however, demonstrated stimulatory effects. We also find that the expression levels of three predicted transcription factors, including AP-1, STAT3 (signal transducers and activators of transcription 3), and DBP (albumin D-box), are significantly different between androgen-dependent and -independent patients. In addition, microRNA microarray data from other studies confirm that several predicted microRNAs, including miR-21, miR-135a, and miR-135b, demonstrate differential expression in prostate cancer cells compared with normal tissues.
We present a model-based computational approach to identify transcription factors and microRNAs influencing the progression of androgen-dependent prostate cancer to androgen-independent prostate cancer. This result suggests that the capabilities of transcription factors to initiate transcription and of microRNAs to facilitate mRNA degradation are both decreased in androgen-independent prostate cancer. The proposed model-based approach indicates that considering combinatorial effects of transcription factors and microRNAs in a unified model provides additional transcriptional and post-transcriptional regulatory mechanisms on global gene expression in prostate cancer with different hormone-dependency.
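For readers unfamiliar with position weight matrices, the sketch below shows how a matrix of per-position base counts can be turned into log-odds weights and used to score candidate binding sites in a promoter sequence. It assumes a uniform background, a pseudocount of 1 and an upper-case ACGT sequence; the function names are hypothetical, and this is a generic illustration of PWM scoring rather than the model used in the study.

```python
import math


def pwm_log_odds(counts, background=0.25, pseudocount=1.0):
    """Convert per-position base counts into a log2-odds weight matrix."""
    pwm = []
    for col in counts:
        total = sum(col.values()) + 4 * pseudocount
        pwm.append({
            base: math.log2(((col.get(base, 0) + pseudocount) / total) / background)
            for base in "ACGT"
        })
    return pwm


def best_site(pwm, seq):
    """Return (score, start) of the highest-scoring window in an ACGT sequence."""
    width = len(pwm)
    scores = [
        (sum(pwm[i][seq[start + i]] for i in range(width)), start)
        for start in range(len(seq) - width + 1)
    ]
    return max(scores)
```

In practice, a threshold on the score decides whether a window counts as a putative binding site.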
Adjuvant Radiotherapy (RT) after surgical removal of tumors has proved beneficial in long-term tumor control and treatment planning. For many years, it has been widely accepted that the radio-sensitivity of tumors decreases with tumor size, and RT models based on Poisson statistics have been used extensively to validate clinical data.
We found that the Poisson statistics used in RT were actually derived from bacterial cells, despite many validations against clinical data. However, cancerous cells have abnormal cellular communications and use chemical messengers to signal both surrounding normal and cancerous cells to develop new blood vessels, to invade, to metastasize and, in general, to overcome intercellular spatial confinements. We therefore investigated the cell killing effects of adjuvant RT and found that radio-sensitivity is not a monotonic function of volume, as was previously believed. We present a detailed analysis and explanation to justify this statement. Based on the equivalent uniform dose (EUD), we present an equivalent radio-sensitivity model.
We conclude that radio-sensitivity is a complex function of tumor volume, since tumor responses to radiotherapy also depend on cellular communications.
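For reference, the conventional Poisson model that the abstract argues against can be written in a few lines. The sketch below assumes purely exponential cell kill, so that tumor control probability is exp(-N exp(-alpha D)) for N clonogenic cells, and uses the generalized EUD form (sum of v_i D_i^a)^(1/a). These are textbook forms shown only for orientation; the parameter names are assumptions, and this is not the equivalent radio-sensitivity model proposed by the authors.

```python
import math


def poisson_tcp(n_clonogens, alpha, dose):
    """Poisson tumor control probability with exponential cell survival.

    The expected number of surviving clonogens is N * exp(-alpha * D), and the
    probability that none survives is exp(-expected survivors).
    """
    return math.exp(-n_clonogens * math.exp(-alpha * dose))


def generalized_eud(doses, a, volumes=None):
    """Generalized equivalent uniform dose for a non-zero exponent `a`.

    `volumes` are fractional volumes that sum to 1; equal fractions are
    assumed when they are omitted.
    """
    if volumes is None:
        volumes = [1.0 / len(doses)] * len(doses)
    return sum(v * d ** a for v, d in zip(volumes, doses)) ** (1.0 / a)
```

Under this model, control probability falls monotonically as the number of clonogens (and hence tumor volume) grows at a fixed dose, which is the assumption the study questions.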
Microarray technology is widely applied to address complex scientific questions. However, there remain fundamental issues on how to design experiments to ensure that the resulting data enable robust statistical analysis. Interwoven loop design has several advantages over other designs. However, it suffers from design complexity. We have implemented an online web application which allows users to find optimal loop designs for two-color microarray experiments. Given a number of conditions (such as treatments or time points) and replicates, the application will find the best possible design of the experiment and output experimental parameters. It is freely available from .
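To make the notion of a loop design concrete, the sketch below enumerates the arrays of a loop design in which condition i is co-hybridized with condition i + jump, for one or more jump sizes. It assumes that an interwoven loop is described by such a set of jumps and that the first and second element of each pair stand for the two dye channels; the function name `loop_design` is hypothetical, and the search for an optimal set of jumps performed by the web application is not reproduced.

```python
def loop_design(conditions, jumps=(1,)):
    """List the (channel 1, channel 2) condition pairs of a loop design.

    For every jump size, condition i is paired with condition i + jump
    (wrapping around), so each condition appears once per jump in each channel.
    """
    n = len(conditions)
    arrays = []
    for jump in jumps:
        for i in range(n):
            arrays.append((conditions[i], conditions[(i + jump) % n]))
    return arrays
```

A single jump of 1 gives the classical loop; adding further jumps interweaves extra loops through the same conditions at the cost of more arrays.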
Our first predictor of protein disorder was published just over a decade ago in the Proceedings of the IEEE International Conference on Neural Networks (Romero P, Obradovic Z, Kissinger C, Villafranca JE, Dunker AK (1997) Identifying disordered regions in proteins from amino acid sequence. Proceedings of the IEEE International Conference on Neural Networks, 1: 90–95). By now more than twenty other laboratory groups have joined the efforts to improve the prediction of protein disorder. While the various prediction methodologies used for protein intrinsic disorder resemble those methodologies used for secondary structure prediction, the two types of structures are entirely different. For example, the two structural classes have very different dynamic properties, with the irregular secondary structure class being much less mobile than the disorder class. The prediction of secondary structure has been useful. On the other hand, the prediction of intrinsic disorder has been revolutionary, leading to major modifications of the more than 100 year-old views relating protein structure and function. Experimentalists have been providing evidence over many decades that some proteins lack fixed structure or are disordered (or unfolded) under physiological conditions. In addition, experimentalists are also showing that, for many proteins, their functions depend on the unstructured rather than structured state; such results are in marked contrast to the greater than hundred year old views such as the lock and key hypothesis. Despite extensive data on many important examples, including disease-associated proteins, the importance of disorder for protein function has been largely ignored. Indeed, to our knowledge, current biochemistry books don't present even one acknowledged example of a disorder-dependent function, even though some reports of disorder-dependent functions are more than 50 years old. 
The results from genome-wide predictions of intrinsic disorder and the results from other bioinformatics studies of intrinsic disorder are demanding attention for these proteins.
Disorder prediction has been important for showing that the relatively few experimentally characterized examples are members of a very large collection of related disordered proteins that are wide-spread over all three domains of life. Many significant biological functions are now known to depend directly on, or are importantly associated with, the unfolded or partially folded state. Here our goal is to review the key discoveries and to weave these discoveries together to support novel approaches for understanding sequence-function relationships.
Intrinsically disordered protein is common across the three domains of life, but especially common among the eukaryotic proteomes. Signaling sequences and sites of posttranslational modifications are frequently, or very likely most often, located within regions of intrinsic disorder. Disorder-to-order transitions are coupled with the adoption of different structures with different partners. Also, the flexibility of intrinsic disorder helps different disordered regions to bind to a common binding site on a common partner. Such capacity for binding diversity plays important roles in both protein-protein interaction networks and likely also in gene regulation networks. Such disorder-based signaling is further modulated in multicellular eukaryotes by alternative splicing, for which such splicing events map to regions of disorder much more often than to regions of structure. Associating alternative splicing with disorder rather than structure alleviates theoretical and experimentally observed problems associated with the folding of different length, isomeric amino acid sequences. The combination of disorder and alternative splicing is proposed to provide a mechanism for easily "trying out" different signaling pathways, thereby providing the mechanism for generating signaling diversity and enabling the evolution of cell differentiation and multicellularity. Finally, several recent small molecules of interest as potential drugs have been shown to act by blocking protein-protein interactions based on intrinsic disorder of one of the partners. Study of these examples has led to a new approach for drug discovery, and bioinformatics analysis of the human proteome suggests that various disease-associated proteins are very rich in such disorder-based drug discovery targets.
Supported by National Science Foundation (NSF), International Society of Intelligent Biological Medicine (ISIBM), International Journal of Computational Biology and Drug Design and International Journal of Functional Informatics and Personalized Medicine, IEEE 7th Bioinformatics and Bioengineering attracted more than 600 papers and 500 researchers and medical doctors. It was the only synergistic inter/multidisciplinary IEEE conference with 24 Keynote Lectures, 7 Tutorials, 5 Cutting-Edge Research Workshops and 32 Scientific Sessions including 11 Special Research Interest Sessions that were designed dynamically at Harvard in response to the current research trends and advances. The committee was very grateful for the IEEE Plenary Keynote Lectures given by: Dr. A. Keith Dunker (Indiana), Dr. Jun Liu (Harvard), Dr. Brian Athey (Michigan), Dr. Mark Borodovsky (Georgia Tech and President of ISIBM), Dr. Hamid Arabnia (Georgia and Vice-President of ISIBM), Dr. Ruzena Bajcsy (Berkeley and Member of United States National Academy of Engineering and Member of United States Institute of Medicine of the National Academies), Dr. Mary Yang (United States National Institutes of Health and Oak Ridge, DOE), Dr. Chih-Ming Ho (UCLA and Member of United States National Academy of Engineering and Academician of Academia Sinica), Dr. Andy Baxevanis (United States National Institutes of Health), Dr. Arif Ghafoor (Purdue), Dr. John Quackenbush (Harvard), Dr. Eric Jakobsson (UIUC), Dr. Vladimir Uversky (Indiana), Dr. Laura Elnitski (United States National Institutes of Health) and other world-class scientific leaders. The Harvard meeting was a large academic event fully sponsored by IEEE, both financially and academically. After a rigorous peer-review process, the committee selected 27 high-quality research papers from 600 submissions. The committee is grateful for contributions from keynote speakers Dr. Russ Altman (IEEE BIBM conference keynote lecturer on combining simulation and machine learning to recognize function in 4D), Dr. Mary Qu Yang (IEEE BIBM workshop keynote lecturer on new initiatives of detecting microscopic disease using machine learning and molecular biology, http://ieeexplore.ieee.org/servlet/opac?punumber=4425386) and Dr. Jack Y. Yang (IEEE BIBM workshop keynote lecturer on data mining and knowledge discovery in translational medicine) from the first IEEE Computer Society BioInformatics and BioMedicine (IEEE BIBM) international conference and workshops, November 2-4, 2007, Silicon Valley, California, USA.
An important subfamily of membrane proteins are the transmembrane α-helical proteins, in which the membrane-spanning regions are made up of α-helices. Given the obvious biological and medical significance of these proteins, it is of tremendous practical importance to identify the location of transmembrane segments. The difficulty of inferring the secondary or tertiary structure of transmembrane proteins using experimental techniques has led to a surge of interest in applying techniques from machine learning and bioinformatics to infer secondary structure from primary structure in these proteins. We are therefore interested in determining which physicochemical properties are most useful for discriminating transmembrane segments from non-transmembrane segments in transmembrane proteins, and for discriminating intrinsically unstructured segments from intrinsically structured segments in transmembrane proteins, and in using the results of these investigations to develop classifiers to identify transmembrane segments in transmembrane proteins.
We determined that the most useful properties for discriminating transmembrane segments from non-transmembrane segments and for discriminating intrinsically unstructured segments from intrinsically structured segments in transmembrane proteins were hydropathy, polarity, and flexibility, and used the results of this analysis to construct classifiers to discriminate transmembrane segments from non-transmembrane segments using four classification techniques: two variants of the Self-Organizing Global Ranking algorithm, a decision tree algorithm, and a support vector machine algorithm. All four techniques exhibited good performance, with out-of-sample accuracies of approximately 75%.
Several interesting observations emerged from our study: intrinsically unstructured segments and transmembrane segments tend to have opposite properties; transmembrane proteins appear to be much richer in intrinsically unstructured segments than other proteins; and, in approximately 70% of transmembrane proteins that contain intrinsically unstructured segments, the intrinsically unstructured segments are close to transmembrane segments.
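Hydropathy, one of the discriminating properties named above, is commonly summarized as a sliding-window average along the sequence. The sketch below uses the Kyte-Doolittle hydropathy scale with a 19-residue window and a cutoff of 1.6, which is a conventional heuristic for flagging candidate membrane-spanning positions; the window, the cutoff and the function names are assumptions for illustration and are not the classifiers built in the study. It expects only the 20 standard amino acids and an odd window width.

```python
# Kyte-Doolittle hydropathy index for the 20 standard amino acids.
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(seq, window=19):
    """Mean hydropathy of each full window; entry j is centred on residue j + window // 2."""
    seq = seq.upper()
    half = window // 2
    profile = []
    for i in range(half, len(seq) - half):
        win = seq[i - half:i + half + 1]
        profile.append(sum(KD[aa] for aa in win) / window)
    return profile


def candidate_tm_positions(seq, window=19, threshold=1.6):
    """Residue indices whose surrounding window is hydrophobic enough to flag."""
    half = window // 2
    profile = hydropathy_profile(seq, window)
    return [j + half for j, h in enumerate(profile) if h >= threshold]
```

The same windowing could be applied to polarity or flexibility scales to build a multi-property feature vector for each residue.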
Since the high dimensionality of gene expression microarray data sets degrades the generalization performance of classifiers, feature selection, which selects relevant features and discards irrelevant and redundant features, has been widely used in the bioinformatics field. Multi-task learning is a novel technique to improve prediction accuracy of tumor classification by using information contained in such discarded redundant features, but which features should be discarded or used as input or output remains an open issue.
We demonstrate a framework for automatically selecting features to be input, output, and discarded by using a genetic algorithm, and propose two algorithms: GA-MTL (Genetic algorithm based multi-task learning) and e-GA-MTL (an enhanced version of GA-MTL). Experimental results demonstrate that this framework is effective at selecting features for multi-task learning, and that GA-MTL and e-GA-MTL perform better than other heuristic methods.
Genetic algorithms are a powerful technique to select features for multi-task learning automatically; GA-MTL and e-GA-MTL are shown to improve generalization performance of classifiers on microarray data sets.
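The search over feature roles can be pictured as a genetic algorithm whose chromosome assigns each feature one of three roles: input, output or discarded. The sketch below is a generic genetic algorithm with truncation selection, one-point crossover and per-gene mutation; the encoding, the operators, the default parameters and the caller-supplied `fitness` function are all assumptions for illustration, not the published GA-MTL or e-GA-MTL procedures. It requires at least two features and a population of at least four.

```python
import random

ROLES = ("input", "output", "discard")


def evolve_roles(n_features, fitness, pop_size=20, generations=30,
                 mutation_rate=0.1, seed=0):
    """Evolve a role assignment (one role per feature) that maximizes `fitness`."""
    rng = random.Random(seed)
    pop = [[rng.choice(ROLES) for _ in range(n_features)] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        parents = pop[:pop_size // 2]  # the better half survives unchanged (elitism)
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, n_features)  # one-point crossover
            child = a[:cut] + b[cut:]
            child = [rng.choice(ROLES) if rng.random() < mutation_rate else gene
                     for gene in child]
            children.append(child)
        pop = parents + children
    return max(pop, key=fitness)
```

In a realistic setting, `fitness` would train a multi-task learner with the proposed input and output features and return its validation accuracy, which makes each evaluation expensive.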
The prognosis for many cancers could be improved dramatically if they could be detected while still at the microscopic disease stage. It follows from a comprehensive statistical analysis that a number of antigens such as hTERT, PCNA and Ki-67 can be considered as cancer markers, while another set of antigens such as P27KIP1 and FHIT are possible markers for normal tissue. Because more than one marker must be considered to obtain a classification of cancer or no cancer, and if cancer, to classify it as malignant, borderline, or benign, we must develop an intelligent decision system that can fulfill such an unmet medical need.
We have developed an intelligent decision system using machine learning techniques and markers to characterize tissue as cancerous, non-cancerous or borderline. The system incorporates learning techniques such as variants of support vector machines, neural networks, decision trees, self-organizing feature maps (SOFM) and recursive maximum contrast trees (RMCT). These variants and the algorithms we have developed tend to detect microscopic pathological changes based on features derived from gene expression levels and metabolic profiles. We have also used immunohistochemistry techniques to measure the gene expression profiles from a number of antigens such as cyclin E, P27KIP1, FHIT, Ki-67, PCNA, Bax, Bcl-2, P53, Fas, FasL and hTERT in several particular types of neuroendocrine tumors such as pheochromocytomas, paragangliomas, and the adrenocortical carcinomas (ACC), adenomas (ACA), and hyperplasia (ACH) involved with Cushing's syndrome. We provided statistical evidence that higher expression levels of antigens such as hTERT, PCNA and Ki-67 are associated with a higher risk that the tumors are malignant or borderline as opposed to benign. We also investigated whether higher expression levels of antigens such as P27KIP1 and FHIT are associated with a decreased risk of adrenomedullary tumors. While no significant difference was found between cell-arrest antigens such as P27KIP1 for malignant, borderline, and benign tumors, there was a significant difference between expression levels of such antigens in normal adrenal medulla samples and in adrenomedullary tumors.
Our framework focused not only on different classification schemes and feature selection algorithms, but also on ensemble methods such as boosting and bagging in an effort to improve upon the accuracy of the individual classifiers. It is evident that when machine learning and statistical learning techniques are combined appropriately into one integrated intelligent medical decision system, the prediction power can be enhanced significantly. This research has many potential applications; it might provide an alternative diagnostic tool and a better understanding of the mechanisms involved in malignant transformation as well as information that is useful for treatment planning and cancer prevention.
Many protein regions and some entire proteins have no definite tertiary structure, existing instead as dynamic, disordered ensembles under different physicochemical circumstances. Identification of these disordered protein regions is important for protein production, protein structure prediction and determination, and protein function annotation. A number of different disorder prediction software packages and web services have been developed since the first predictor was designed by Dunker's lab in 1997. However, most of the software packages use a pre-defined threshold to select ordered or disordered residues. In many situations, users need to choose ordered or disordered residues at different sensitivity and specificity levels.
Here we benchmark a state-of-the-art disorder predictor, DISpro, on a large protein disorder dataset created from the Protein Data Bank and systematically evaluate the relationship between sensitivity and specificity. We also extend its functionality to allow users to trade off specificity and sensitivity by setting different decision thresholds. Moreover, we compare DISpro with seven other automated disorder predictors on the 95 protein targets used in the seventh edition of the Critical Assessment of Techniques for Protein Structure Prediction (CASP7). DISpro is ranked as one of the best predictors.
The evaluation and extension of DISpro make it a more valuable and useful tool for structural and functional genomics.
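The trade-off between sensitivity and specificity at different decision thresholds can be sketched as follows. This is a generic illustration, not DISpro's code: the function name, the per-residue scores and the labels (1 for disordered, 0 for ordered) are all made up for the example.

```python
def sensitivity_specificity(scores, labels, threshold):
    """Return (sensitivity, specificity) when residues with
    score >= threshold are called disordered (label 1)."""
    tp = fn = tn = fp = 0
    for score, label in zip(scores, labels):
        predicted = score >= threshold
        if label and predicted:
            tp += 1
        elif label and not predicted:
            fn += 1
        elif not label and predicted:
            fp += 1
        else:
            tn += 1
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return sensitivity, specificity


# Hypothetical per-residue disorder scores and true labels.
scores = [0.9, 0.8, 0.4, 0.3, 0.2, 0.1]
labels = [1, 1, 1, 0, 0, 0]
for t in (0.5, 0.25):
    print(t, sensitivity_specificity(scores, labels, t))
```

On this toy data, lowering the threshold from 0.5 to 0.25 raises sensitivity from 2/3 to 1 while specificity drops from 1 to 2/3, which is the kind of trade-off a user-adjustable threshold exposes.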
This is the first report of simultaneous identification of transcription factor (TF) and microRNA (miRNA) binding sites from gene expression microarray data, using our MotifModeler informatics program. Based on the assumption that gene expression is controlled by combinatorial effects of transcription factors binding in the 5'-upstream regulatory region and miRNAs binding in the 3'-untranslated region (3'-UTR), we developed a model for (1) predicting the most influential cis-acting elements under a given biological condition, and (2) estimating the effects of those elements on gene expression levels. Because the regulatory elements (TF and miRNA binding sites) that mediate differential gene expression in fetal alcohol syndrome were unknown, we applied the model to microarray data from an alcohol exposure paradigm. The model predicted strong inhibitory effects of 5' cis-acting elements and stimulatory effects of 3'-UTR elements under alcohol treatment. The current predictive model yielded, for the first time, a key hypothesis: a novel role for miRNAs in the gene expression changes associated with abnormal mouse embryo development after alcohol exposure. This suggests that disturbance of miRNA functions may contribute to the alcohol-induced developmental deficiencies.
Knowing where a protein occurs in the cell is an important step in understanding its function. It is highly desirable to predict a protein's subcellular location automatically from its sequence. The most studied methods for predicting subcellular localization rely on signal peptides, on sequence homology, and on the correlation between subcellular location and the total amino acid composition of proteins. Taking both amino acid composition and amino acid pair composition into consideration helps improve the prediction accuracy.
We constructed a dataset of protein sequences from the SWISS-PROT database and segmented them into 12 classes based on their subcellular locations. SVM modules were trained to predict the subcellular location based on amino acid composition and amino acid pair composition. Results were calculated after 10-fold cross validation. The radial basis function (RBF) kernel outperformed the polynomial and linear kernels. Total prediction accuracy reached 71.8% for amino acid composition and 77.0% for amino acid pair composition. To observe the impact of the number of subcellular locations, we constructed two more datasets of nine and five subcellular locations. Total accuracy was further improved to 79.9% and 85.66%.
A new SVM-based approach is presented based on amino acid and amino acid pair composition. Results show that data simulation and taking more protein features into consideration improve the accuracy to a great extent. It was also noticed that the data set needs to be crafted to take account of the distribution of data in all the classes.
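The two feature types named above can be computed with a few lines of code. The sketch below is an assumption about one common way to define them (a 20-dimensional residue frequency vector and a 400-dimensional frequency vector over overlapping adjacent pairs); the function names are hypothetical, and the exact encoding used in the study may differ. Non-standard residue codes are simply dropped, which is a simplification.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
PAIRS = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]  # 400 pairs


def _clean(seq):
    # Keep only standard residues (simplification: others are dropped).
    return "".join(c for c in seq.upper() if c in AMINO_ACIDS)


def aa_composition(seq):
    """20-dimensional vector of residue frequencies."""
    seq = _clean(seq)
    n = len(seq)
    return [seq.count(a) / n if n else 0.0 for a in AMINO_ACIDS]


def pair_composition(seq):
    """400-dimensional vector of overlapping adjacent-pair frequencies."""
    seq = _clean(seq)
    counts = dict.fromkeys(PAIRS, 0)
    for i in range(len(seq) - 1):
        counts[seq[i:i + 2]] += 1
    n = len(seq) - 1
    return [counts[p] / n if n > 0 else 0.0 for p in PAIRS]
```

Either vector (or their concatenation) could then serve as the input to an SVM with an RBF kernel, as described in the abstract.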
The budding yeast Saccharomyces cerevisiae is a eukaryotic organism with extensive genetic redundancy. Large-scale gene deletion analysis has shown that over 80% of the ~6200 predicted genes are nonessential and that the functions of 30% of all ORFs remain unclassified, implying that yeast cells can tolerate deletion of a substantial number of individual genes. For example, the members of a class of zinc finger proteins containing C2H2 zinc fingers in tandem arrays of two or three are predicted to be transcription factors; however, seven of the thirty-one predicted genes of this class are nonessential, and their functions are poorly understood. In this study we completed a transcriptomic profiling of three mutants lacking C2H2 zinc finger proteins: ypr013cΔ, ypr015cΔ and ypr013cΔypr015cΔ.
Gene expression patterns were remarkably different between wild type and the mutants. The results indicate altered expression of 79 genes in ypr013cΔ, 185 genes in ypr015cΔ and 426 genes in the double mutant when compared with that of the wild type strain. More than 80% of the alterations in the double mutant were not observed in either one of the single deletion mutants. Functional categorization based on Munich Information Center for Protein Sequences (MIPS) revealed up-regulation of genes related to transcription and down-regulation of genes involved in cell rescue and defense, suggesting a decreased response to stress conditions. Genes related to cell cycle and DNA processing whose expression was affected by single or double deletions were also identified.
Our results suggest that microarray analysis can define the biological roles of zinc finger proteins with unknown functions and identify target genes that are regulated by these putative transcription factors. These findings also suggest that YPR013C and YPR015C share some biological processes, in addition to having their own regulatory pathways.
Several classification and feature selection methods have been studied for the identification of differentially expressed genes in microarray data. Classification methods such as SVM, RBF Neural Nets, MLP Neural Nets, Bayesian, Decision Tree and Random Forest methods have been used in recent studies. The accuracy of these methods has been calculated with validation methods such as v-fold validation. However, there is a lack of comparison between these methods to find a better framework for classification, clustering and analysis of microarray gene expression results.
In this study, we compared the efficiency of classification methods including SVM, RBF Neural Nets, MLP Neural Nets, Bayesian, Decision Tree and Random Forest methods. The v-fold cross validation was used to calculate the accuracy of the classifiers. Some of the common clustering methods, including K-means, DBC, and EM clustering, were applied to the datasets and the efficiency of these methods was analysed. Further, the efficiency of feature selection methods including support vector machine recursive feature elimination (SVM-RFE), Chi Squared, and CSF was compared. In each case these methods were applied to eight different binary (two class) microarray datasets. We evaluated the class prediction efficiency of each gene list in training and test cross-validation using supervised classifiers.
We presented a study in which we compared some of the commonly used classification, clustering, and feature selection methods. We applied these methods to eight publicly available datasets, and compared how these methods performed in class prediction of test datasets. We reported that the choice of feature selection method, the number of genes in the gene list, and the number of cases (samples) substantially influence classification success. Based on features chosen by these methods, error rates and accuracy of several classification algorithms were obtained. Results revealed the importance of feature selection in accurately classifying new samples and showed how an integrated feature selection and classification algorithm performs and can identify significant genes.
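The v-fold cross validation used to estimate classifier accuracy can be sketched in a few lines. This is a generic illustration under stated assumptions, not the study's pipeline: the function names are hypothetical, and `fit` and `predict` are caller-supplied callables standing in for any of the classifiers compared.

```python
import random


def v_fold_indices(n_samples, v, seed=0):
    """Yield (train_indices, test_indices) for each of v folds."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::v] for i in range(v)]  # near-equal sized folds
    for i in range(v):
        train = [j for k, fold in enumerate(folds) if k != i for j in fold]
        yield train, folds[i]


def cross_validated_accuracy(X, y, v, fit, predict, seed=0):
    """Fraction of samples predicted correctly while held out.

    `fit(X_train, y_train)` returns a model and `predict(model, x)`
    returns a label; both are supplied by the caller (assumed here).
    """
    correct = 0
    for train, test in v_fold_indices(len(y), v, seed):
        model = fit([X[i] for i in train], [y[i] for i in train])
        correct += sum(predict(model, X[i]) == y[i] for i in test)
    return correct / len(y)
```

Because every sample is held out exactly once, the returned accuracy is computed only on predictions for samples the model did not see during training. Any feature selection step should be repeated inside each training fold so that the held-out samples do not influence which genes are chosen.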
Proteins are involved in many interactions with other proteins, leading to networks that regulate and control a wide variety of physiological processes. Some of these proteins, called hub proteins or hubs, bind to many different protein partners. Protein intrinsic disorder, via diversity arising from structural plasticity or flexibility, provides a means for hubs to associate with many partners (Dunker AK, Cortese MS, Romero P, Iakoucheva LM, Uversky VN: Flexible Nets: The roles of intrinsic disorder in protein interaction networks. FEBS J 2005, 272:5129-5148).
Here we present a detailed examination of two divergent examples: 1) p53, which uses different disordered regions to bind to different partners and which also has several individual disordered regions that each bind to multiple partners, and 2) 14-3-3, which is a structured protein that associates with many different intrinsically disordered partners. For both examples, three-dimensional structures of multiple complexes reveal that the flexibility and plasticity of intrinsically disordered protein regions as well as induced-fit changes in the structured regions are both important for binding diversity.
These data support the conjecture that hub proteins often utilize intrinsic disorder to bind to multiple partners and provide detailed information about induced fit in structured regions.