1.  The emerging genomics and systems biology research lead to systems genomics studies 
BMC Genomics  2014;15(Suppl 11):I1.
Synergistically integrating multi-layer genomic data at systems level not only can lead to deeper insights into the molecular mechanisms related to disease initiation and progression, but also can guide pathway-based biomarker and drug target identification. With the advent of high-throughput next-generation sequencing technologies, sequencing both DNA and RNA has generated multi-layer genomic data that can provide DNA polymorphism, non-coding RNA, messenger RNA, gene expression, isoform and alternative splicing information. Systems biology on the other hand studies complex biological systems, particularly systematic study of complex molecular interactions within specific cells or organisms. Genomics and molecular systems biology can be merged into the study of genomic profiles and implicated biological functions at cellular or organism level. The prospectively emerging field can be referred to as systems genomics or genomic systems biology.
The Mid-South Bioinformatics Centre (MBC) and Joint Bioinformatics Ph.D. Program of University of Arkansas at Little Rock and University of Arkansas for Medical Sciences are particularly interested in promoting education and research advancement in this prospectively emerging field. Based on past investigations and research outcomes, MBC is further utilizing differential gene and isoform/exon expression from RNA-seq and co-regulation from ChIP-seq specific for different phenotypes, in combination with protein-protein interactions and protein-DNA interactions, to construct high-level gene networks for an integrative genome-phenome investigation at systems biology level.
PMCID: PMC4304174  PMID: 25558922
2.  MicroRNA and messenger RNA profiling reveals new biomarkers and mechanisms for RDX induced neurotoxicity 
BMC Genomics  2014;15(Suppl 11):S1.
RDX is a well-known pollutant that induces neurotoxicity. MicroRNA (miRNA) and messenger RNA (mRNA) profiles are useful tools for toxicogenomics studies. Integrating miRNA and mRNA expression data is therefore worthwhile for understanding RDX-induced neurotoxicity.
Rats were treated with or without RDX for 48 h. Both miRNA and mRNA profiling were conducted on brain tissues. Nine miRNAs were significantly regulated by RDX; of these, 6 were up-regulated and 3 were down-regulated. The putative target genes of the RDX-regulated miRNAs were highly enriched in nervous system function genes and pathways. Fifteen genes differentially expressed in the mRNA profiles after RDX exposure were putative targets of the regulated miRNAs. The induction of miR-71, miR-27ab, miR-98, and miR-135a expression by RDX could reduce the expression of the genes POLE4, C5ORF13, SULF1 and ROCK2, and eventually induce neurotoxicity. Over-expression of miR-27ab, or reduction of the expression of unknown miRNAs by RDX, could up-regulate HMGCR expression and contribute to neurotoxicity. RDX-regulated miRNAs and genes involved in immune and inflammation responses could contribute to RDX-induced neurotoxicity and other toxicities, as well as to the animals' defensive response to RDX exposure.
Our results demonstrate that integrating miRNA and mRNA profiles is valuable for identifying novel biomarkers and molecular mechanisms for RDX-induced neurological disorder and neurotoxicity.
PMCID: PMC4304176  PMID: 25559034
3.  Gene regulation mediated by microRNAs in response to green tea polyphenol EGCG in mouse lung cancer 
BMC Genomics  2014;15(Suppl 11):S3.
Epigallocatechin-3-gallate (EGCG) has been demonstrated to inhibit cancer in experimental studies through its antioxidant activity and modulations on cellular functions by binding specific proteins. We demonstrated previously that EGCG upregulates the expression of microRNA (i.e. miR-210) by binding HIF-1α, resulting in reduced cell proliferation and anchorage-independent growth. However, the binding affinities of EGCG to HIF-1α and many other targets are higher than the EGCG plasma peak level in experimental animals administered with high dose of EGCG, raising a concern whether the microRNA regulation by HIF-1α is involved in the anti-cancer activity of EGCG in vivo.
We employed functional genomic approaches to elucidate the role of microRNA in the EGCG inhibition of tobacco carcinogen-induced lung tumors in A/J mice. By analysing the microRNA profiles, we found modest changes in the expression levels of 21 microRNAs. By correlating these 21 microRNAs with the mRNA expression profiles using computational methods, we identified 26 potential target genes of the 21 microRNAs. Further exploration using pathway analysis revealed that the pathways most impacted by EGCG treatment are the regulatory networks associated with AKT, NF-κB, MAP kinases, and the cell cycle, and that the identified miRNA targets are involved in the networks of AKT, MAP kinases and cell cycle regulation.
These results demonstrate that the miRNA-mediated regulation is actively involved in the major aspects of the anti-cancer activity of EGCG in vivo.
PMCID: PMC4304179  PMID: 25559244
4.  Comparison of feature selection and classification for MALDI-MS data 
BMC Genomics  2009;10(Suppl 1):S3.
In the classification of Mass Spectrometry (MS) proteomics data, peak detection, feature selection, and learning classifiers are critical to classification accuracy. To better understand which methods are more accurate when classifying data, some publicly available peak detection algorithms for Matrix assisted Laser Desorption Ionization Mass Spectrometry (MALDI-MS) data were recently compared; however, the issue of different feature selection methods and different classification models as they relate to classification performance has not been addressed. With the application of intelligent computing, much progress has been made in the development of feature selection methods and learning classifiers for the analysis of high-throughput biological data. The main objective of this paper is to compare the methods of feature selection and different learning classifiers when applied to MALDI-MS data and to provide a subsequent reference for the analysis of MS proteomics data.
We compared a well-known method of feature selection, Support Vector Machine Recursive Feature Elimination (SVMRFE), and a recently developed method, Gradient based Leave-one-out Gene Selection (GLGS), that effectively performs microarray data analysis. We also compared several learning classifiers including K-Nearest Neighbor Classifier (KNNC), Naïve Bayes Classifier (NBC), Nearest Mean Scaled Classifier (NMSC), uncorrelated normal based quadratic Bayes Classifier recorded as UDC, Support Vector Machines, and a distance metric learning for Large Margin Nearest Neighbor classifier (LMNN) based on Mahalanobis distance. To compare, we conducted a comprehensive experimental study using three types of MALDI-MS data.
Regarding feature selection, SVMRFE outperformed GLGS in classification. As for the learning classifiers, when classification models derived from the best training were compared, SVMs performed the best with respect to the expected testing accuracy. However, the distance metric learning LMNN outperformed SVMs and other classifiers on evaluating the best testing. In such cases, the optimum classification model based on LMNN is worth investigating for future study.
PMCID: PMC2709264  PMID: 19594880
5.  Word-based characterization of promoters involved in human DNA repair pathways 
BMC Genomics  2009;10(Suppl 1):S18.
DNA repair genes provide an important contribution towards the surveillance and repair of DNA damage. These genes produce a large network of interacting proteins whose mRNA expression is likely to be regulated by similar regulatory factors. Full characterization of promoters of DNA repair genes and the similarities among them will more fully elucidate the regulatory networks that activate or inhibit their expression. To address this goal, the authors introduce a technique to find regulatory genomic signatures, which represents a specific application of the genomic signature methodology to classify DNA sequences as putative functional elements within a single organism.
The effectiveness of the regulatory genomic signatures is demonstrated via analysis of promoter sequences for genes in DNA repair pathways of humans. The promoters are divided into two classes, the bidirectional promoters and the unidirectional promoters, and distinct genomic signatures are calculated for each class. The genomic signatures include statistically overrepresented words, word clusters, and co-occurring words. The robustness of this method is confirmed by the ability to identify sequences that exist as motifs in TRANSFAC and JASPAR databases, and in overlap with verified binding sites in this set of promoter regions.
The word-based signatures are shown to be effective by finding occurrences of known regulatory sites. Moreover, the signatures of the bidirectional and unidirectional promoters of human DNA repair pathways are clearly distinct, exhibiting virtually no overlap. In addition to providing an effective characterization method for related DNA sequences, the signatures elucidate putative regulatory aspects of DNA repair pathways, which are notably under-characterized.
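The idea of "statistically overrepresented words" can be illustrated with a minimal sketch. This is not the authors' method: it simply compares k-letter word frequencies in a foreground set of sequences against a background set and keeps words whose frequency ratio exceeds a cutoff. The function names, the pseudocount, and the `min_ratio` cutoff are all hypothetical choices made for illustration, and the sketch assumes a plain A/C/G/T alphabet.

```python
from collections import Counter


def count_words(seqs, k):
    """Count every overlapping k-letter word across a list of DNA sequences."""
    counts = Counter()
    for s in seqs:
        s = s.upper()
        for i in range(len(s) - k + 1):
            counts[s[i:i + k]] += 1
    return counts


def overrepresented_words(foreground, background, k=3, min_ratio=2.0, pseudo=1.0):
    """Return (word, ratio) pairs for words enriched in foreground vs background.

    The ratio compares pseudocount-smoothed relative frequencies; the
    smoothing and the cutoff are illustrative assumptions, not a published test.
    """
    fg = count_words(foreground, k)
    bg = count_words(background, k)
    fg_total = sum(fg.values())
    bg_total = sum(bg.values())
    n_words = 4 ** k  # number of possible words over A, C, G, T
    hits = []
    for word in fg:
        fg_freq = (fg[word] + pseudo) / (fg_total + pseudo * n_words)
        bg_freq = (bg[word] + pseudo) / (bg_total + pseudo * n_words)
        ratio = fg_freq / bg_freq
        if ratio >= min_ratio:
            hits.append((word, ratio))
    # Highest ratio first; ties broken alphabetically for a stable order.
    return sorted(hits, key=lambda item: (-item[1], item[0]))
```

A real analysis would replace the simple ratio with a proper significance test and would also group related words into clusters, as the abstract describes.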
PMCID: PMC2709261  PMID: 19594877
6.  Prediction of DNA-binding residues from protein sequence information using random forests 
BMC Genomics  2009;10(Suppl 1):S1.
Protein-DNA interactions are involved in many biological processes essential for cellular function. To understand the molecular mechanism of protein-DNA recognition, it is necessary to identify the DNA-binding residues in DNA-binding proteins. However, structural data are available for only a few hundreds of protein-DNA complexes. With the rapid accumulation of sequence data, it becomes an important but challenging task to accurately predict DNA-binding residues directly from amino acid sequence data.
A new machine learning approach has been developed in this study for predicting DNA-binding residues from amino acid sequence data. The approach used both the labelled data instances collected from the available structures of protein-DNA complexes and the abundant unlabeled data found in protein sequence databases. The evolutionary information contained in the unlabeled sequence data was represented as position-specific scoring matrices (PSSMs) and several new descriptors. The sequence-derived features were then used to train random forests (RFs), which could handle a large number of input variables and avoid model overfitting. The use of evolutionary information was found to significantly improve classifier performance. The RF classifier was further evaluated using a separate test dataset, and the predicted DNA-binding residues were examined in the context of three-dimensional structures.
The results suggest that the RF-based approach gives rise to more accurate prediction of DNA-binding residues than previous studies. A new web server called BindN-RF has thus been developed to make the RF classifier accessible to the biological research community.
PMCID: PMC2709252  PMID: 19594868
7.  High-throughput next-generation sequencing technologies foster new cutting-edge computing techniques in bioinformatics 
BMC Genomics  2009;10(Suppl 1):I1.
The advent of high-throughput next-generation sequencing technologies has fostered enormous potential applications of supercomputing techniques in genome sequencing, epigenetics, metagenomics, personalized medicine, and the discovery of non-coding RNAs and protein-binding sites. To this end, the 2008 International Conference on Bioinformatics and Computational Biology (Biocomp) – 2008 World Congress on Computer Science, Computer Engineering and Applied Computing (Worldcomp) was designed to promote synergistic inter/multidisciplinary research and education in response to the current research trends and advances. The conference attracted more than two thousand scientists, medical doctors, engineers, professors and students, who gathered in Las Vegas, Nevada, USA during July 14–17, and it was a great success. Supported by the International Society of Intelligent Biological Medicine (ISIBM), International Journal of Computational Biology and Drug Design (IJCBDD), International Journal of Functional Informatics and Personalized Medicine (IJFIPM) and the leading research laboratories from Harvard, M.I.T., Purdue, UIUC, UCLA, Georgia Tech, UT Austin, U. of Minnesota, U. of Iowa etc., the conference received thousands of research papers. Each submitted paper was reviewed by at least three reviewers, and accepted papers were required to satisfy the reviewers' comments. Finally, the review board and the committee selected only 19 high-quality research papers for inclusion in this supplement to BMC Genomics, based on the peer reviews only. The conference committee was very grateful for the Plenary Keynote Lectures given by: Dr. Brian D. Athey (University of Michigan Medical School), Dr. Vladimir N. Uversky (Indiana University School of Medicine), Dr. David A. Patterson (Member of United States National Academy of Sciences and National Academy of Engineering, University of California at Berkeley) and Anousheh Ansari (Prodea Systems, Space Ambassador).
The theme of the conference to promote synergistic research and education has been achieved successfully.
PMCID: PMC2709251  PMID: 19594867
8.  Genomics, molecular imaging, bioinformatics, and bio-nano-info integration are synergistic components of translational medicine and personalized healthcare research 
BMC Genomics  2008;9(Suppl 2):I1.
Supported by National Science Foundation (NSF), International Society of Intelligent Biological Medicine (ISIBM), International Journal of Computational Biology and Drug Design and International Journal of Functional Informatics and Personalized Medicine, IEEE 7th Bioinformatics and Bioengineering attracted more than 600 papers and 500 researchers and medical doctors. It was the only synergistic inter/multidisciplinary IEEE conference with 24 Keynote Lectures, 7 Tutorials, 5 Cutting-Edge Research Workshops and 32 Scientific Sessions including 11 Special Research Interest Sessions that were designed dynamically at Harvard in response to the current research trends and advances. The committee was very grateful for the IEEE Plenary Keynote Lectures given by: Dr. A. Keith Dunker (Indiana), Dr. Jun Liu (Harvard), Dr. Brian Athey (Michigan), Dr. Mark Borodovsky (Georgia Tech and President of ISIBM), Dr. Hamid Arabnia (Georgia and Vice-President of ISIBM), Dr. Ruzena Bajcsy (Berkeley and Member of United States National Academy of Engineering and Member of United States Institute of Medicine of the National Academies), Dr. Mary Yang (United States National Institutes of Health and Oak Ridge, DOE), Dr. Chih-Ming Ho (UCLA and Member of United States National Academy of Engineering and Academician of Academia Sinica), Dr. Andy Baxevanis (United States National Institutes of Health), Dr. Arif Ghafoor (Purdue), Dr. John Quackenbush (Harvard), Dr. Eric Jakobsson (UIUC), Dr. Vladimir Uversky (Indiana), Dr. Laura Elnitski (United States National Institutes of Health) and other world-class scientific leaders. The Harvard meeting was a large academic event 100% full-sponsored by IEEE financially and academically. After a rigorous peer-review process, the committee selected 27 high-quality research papers from 600 submissions. The committee is grateful for contributions from keynote speakers Dr. 
Russ Altman (IEEE BIBM conference keynote lecturer on combining simulation and machine learning to recognize function in 4D), Dr. Mary Qu Yang (IEEE BIBM workshop keynote lecturer on new initiatives of detecting microscopic disease using machine learning and molecular biology), and Dr. Jack Y. Yang (IEEE BIBM workshop keynote lecturer on data mining and knowledge discovery in translational medicine) from the first IEEE Computer Society BioInformatics and BioMedicine (IEEE BIBM) international conference and workshops, November 2-4, 2007, Silicon Valley, California, USA.
PMCID: PMC3226104  PMID: 18831773
9.  Predicting protein disorder by analyzing amino acid sequence 
BMC Genomics  2008;9(Suppl 2):S8.
Many protein regions and some entire proteins have no definite tertiary structure, presenting instead as dynamic, disordered ensembles under different physicochemical circumstances. These proteins and regions are known as Intrinsically Unstructured Proteins (IUP). IUP have been associated with a wide range of protein functions, along with roles in diseases characterized by protein misfolding and aggregation.
Identifying IUP is an important task in structural and functional genomics. We extract useful features from sequences and develop machine learning algorithms for this task. We compare our IUP predictor with PONDRs (mainly neural-network-based predictors), disEMBL (also based on neural networks) and Globplot (based on disorder propensity).
We find that augmenting features derived from physicochemical properties of amino acids (such as hydrophobicity, complexity etc.) and using an ensemble method are beneficial. The IUP predictor is a viable alternative software tool for identifying IUP protein regions and proteins.
PMCID: PMC2559898  PMID: 18831799
10.  Diversity of core promoter elements comprising human bidirectional promoters 
BMC Genomics  2008;9(Suppl 2):S3.
Bidirectional promoters lie between adjacent genes, which are transcribed from opposite strands of DNA. The functional mechanisms underlying the activation of bidirectional promoters are currently uncharacterised. To define the core promoter elements of bidirectional promoters in human, we mapped motifs for TATA, INR, BRE, DPE, INR, as well as CpG-islands.
We found a consistently high correspondence between C+G content, CpG-island presence and an average expression level increasing the median level for all genes in bidirectional promoters. These CpG-rich promoters showed discrete initiation patterns rather than broad regions of transcription initiation, as are typically seen for CpG-island promoters. CpG-islands encompass both TSSs within bidirectional promoters, providing an explanation for the symmetrical co-expression patterns of many of these genes. In contrast, TATA motifs appear to be asymmetrically positioned at one TSS or the other.
Our findings demonstrate that bidirectional promoters utilize a variety of core promoter elements to initiate transcription. CpG-islands dominate the regulatory landscape of this group of promoters.
PMCID: PMC2559893  PMID: 18831794
11.  Selecting subsets of newly extracted features from PCA and PLS in microarray data analysis 
BMC Genomics  2008;9(Suppl 2):S24.
Dimension reduction is a critical issue in the analysis of microarray data, because the high dimensionality of gene expression microarray data sets hurts the generalization performance of classifiers. It comprises two types of methods: feature selection and feature extraction. Principal component analysis (PCA) and partial least squares (PLS) are two frequently used feature extraction methods; in previous work, the top several components of PCA or PLS were selected for modeling according to the descending order of eigenvalues. In this paper, we prove that not all of the top features are useful, and that features should instead be selected from all the components by feature selection methods.
We demonstrate a framework for selecting feature subsets from all the newly extracted components, leading to reduced classification error rates on the gene expression microarray data. Here we have considered both an unsupervised method PCA and a supervised method PLS for extracting new components, genetic algorithms for feature selection, and support vector machines and k nearest neighbor for classification. Experimental results illustrate that our proposed framework is effective to select feature subsets and to reduce classification error rates.
The top features newly extracted by PCA or PLS are not the only important ones; therefore, feature selection should be performed to select subsets from the new features to improve the generalization performance of classifiers.
PMCID: PMC2559889  PMID: 18831790
12.  Analyzing adjuvant radiotherapy suggests a non monotonic radio-sensitivity over tumor volumes 
BMC Genomics  2008;9(Suppl 2):S9.
Adjuvant radiotherapy (RT) after surgical removal of tumors has proved beneficial for long-term tumor control and treatment planning. For many years, it has been widely accepted that the radio-sensitivity of tumors decreases with tumor size, and RT models based on Poisson statistics have been used extensively to validate clinical data.
We found that the Poisson statistics used in RT were actually derived from bacterial cells, despite many validations against clinical data. However, cancerous cells have abnormal cellular communication and use chemical messengers to signal both surrounding normal and cancerous cells to develop new blood vessels, to invade, to metastasize and, in general, to overcome intercellular spatial confinement. We therefore investigated the cell-killing effects of adjuvant RT and found that radio-sensitivity is not a monotonic function of volume, as was previously believed. We present a detailed analysis and explanation to justify this statement. Based on EUD, we present an equivalent radio-sensitivity model.
We conclude that radio-sensitivity is a complex function of tumor volume, since tumor response to radiotherapy also depends on cellular communication.
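For context, the Poisson-based model the abstract questions is often written in textbook form as TCP = exp(-N_s), where the expected number of surviving clonogenic cells is N_s = ρ · V · exp(-α · D). The sketch below implements that generic form only; it is not the model proposed in the paper, and the parameter values (`alpha`, `clonogen_density`) are hypothetical placeholders chosen purely for illustration.

```python
import math


def poisson_tcp(volume_cm3, dose_gy, alpha=0.3, clonogen_density=1e7):
    """Textbook Poisson tumor control probability (TCP).

    volume_cm3       -- tumor volume V
    dose_gy          -- total dose D
    alpha            -- radio-sensitivity per Gy (hypothetical value)
    clonogen_density -- clonogenic cells per cm^3 (hypothetical value)
    """
    # Expected number of clonogens that survive the full dose.
    surviving = clonogen_density * volume_cm3 * math.exp(-alpha * dose_gy)
    # Poisson probability that zero clonogens survive.
    return math.exp(-surviving)


# In this classical form, control probability can only fall as volume grows.
tcp_by_volume = [poisson_tcp(v, dose_gy=60.0) for v in (1.0, 10.0, 100.0)]
```

Because volume enters only as a multiplier on the surviving cell count, this form is monotonic in volume by construction, which is precisely the assumption the abstract argues does not hold once cellular communication is taken into account.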
PMCID: PMC2559899  PMID: 18831800
13.  ILOOP – a web application for two-channel microarray interwoven loop design 
BMC Genomics  2008;9(Suppl 2):S11.
Microarray technology is widely applied to address complex scientific questions. However, there remain fundamental issues on how to design experiments to ensure that the resulting data enable robust statistical analysis. Interwoven loop design has several advantages over other designs. However, it suffers from design complexity. We have implemented an online web application which allows users to find optimal loop designs for two-color microarray experiments. Given a number of conditions (such as treatments or time points) and replicates, the application will find the best possible design of the experiment and output experimental parameters. It is freely available from .
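As a rough illustration of what an interwoven loop looks like, the sketch below uses a generic construction: conditions are arranged in a cycle, and for each chosen "jump" size every condition is co-hybridized with the condition that many steps ahead. This is an assumption-laden sketch and not ILOOP's search for an optimal design; the function name, the `jumps` parameter, and the reading of each pair as (Cy3, Cy5) are hypothetical.

```python
def interwoven_loop(conditions, jumps=(1, 2)):
    """Build array pairings for a simple interwoven loop design.

    Each returned tuple is one array: (condition on first dye,
    condition on second dye). Every jump size adds one full loop.
    """
    n = len(conditions)
    arrays = []
    for jump in jumps:
        if not 0 < jump < n:
            raise ValueError("each jump must be between 1 and len(conditions) - 1")
        for i in range(n):
            # Pair condition i with the condition `jump` steps further on.
            arrays.append((conditions[i], conditions[(i + jump) % n]))
    return arrays


design = interwoven_loop(["A", "B", "C", "D", "E"], jumps=(1, 2))
```

In this construction each condition appears once per jump on each dye, so the design is dye-balanced; choosing which jumps to combine for a given number of arrays is the optimization problem the application addresses.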
PMCID: PMC2559875  PMID: 18831776
14.  Promoting synergistic research and education in genomics and bioinformatics 
BMC Genomics  2008;9(Suppl 1):I1.
Bioinformatics and Genomics are closely related disciplines that hold great promise for the advancement of research and development in complex biomedical systems, as well as public health, drug design, comparative genomics, personalized medicine and so on. Research and development in these two important areas are impacting science and technology.
High throughput sequencing and molecular imaging technologies marked the beginning of a new era for modern translational medicine and personalized healthcare. The impact of having the human sequence and personalized digital images in hand has also created tremendous demand for developing powerful supercomputing, statistical learning and artificial intelligence approaches to handle the massive bioinformatics and personalized healthcare data, which will obviously have a profound effect on how biomedical research will be conducted toward the improvement of human health and prolonging of human life in the future. The International Society of Intelligent Biological Medicine and its official journals, the International Journal of Functional Informatics and Personalized Medicine and the International Journal of Computational Biology and Drug Design, in collaboration with International Conference on Bioinformatics and Computational Biology (Biocomp), touch tomorrow's bioinformatics and personalized medicine through today's efforts in promoting the research, education and awareness of the upcoming integrated inter/multidisciplinary field. The 2007 international conference on Bioinformatics and Computational Biology (BIOCOMP07) was held in Las Vegas, the United States of America on June 25-28, 2007. The conference attracted over 400 papers, covering broad research areas in genomics, biomedicine and bioinformatics. Biocomp 2007 provides a common platform for the cross-fertilization of ideas, and helps shape knowledge and scientific achievements by bridging these two very important disciplines into an interactive and attractive forum. Keeping this objective in mind, Biocomp 2007 aims to promote interdisciplinary and multidisciplinary education and research. Twenty-five high-quality peer-reviewed papers were selected from 400+ submissions for this supplementary issue of BMC Genomics.
Those papers contributed to a wide range of important research fields including gene expression data analysis and applications, high-throughput genome mapping, sequence analysis, gene regulation, protein structure prediction, disease prediction by machine learning techniques, systems biology, database and biological software development. We always encourage participants to submit proposals for genomics sessions, special interest research sessions, workshops and tutorials to Professor Hamid R. Arabnia in order to ensure that Biocomp continuously plays the leadership role in promoting inter/multidisciplinary research and education in the fields. Biocomp received top conference ranking with a high score of 0.95/1.00. Biocomp is academically co-sponsored by the International Society of Intelligent Biological Medicine and the Research Laboratories and Centers of Harvard University – Massachusetts Institute of Technology, Indiana University - Purdue University, Georgia Tech – Emory University, UIUC, UCLA, Columbia University, University of Texas at Austin and University of Iowa etc. Biocomp - Worldcomp brings leading scientists together across the nation and all over the world and aims to promote synergistic components such as keynote lectures, special interest sessions, workshops and tutorials in response to the advances of cutting-edge research.
PMCID: PMC3226105  PMID: 18366597
15.  Investigation of transmembrane proteins using a computational approach 
BMC Genomics  2008;9(Suppl 1):S7.
An important subfamily of membrane proteins are the transmembrane α-helical proteins, in which the membrane-spanning regions are made up of α-helices. Given the obvious biological and medical significance of these proteins, it is of tremendous practical importance to identify the location of transmembrane segments. The difficulty of inferring the secondary or tertiary structure of transmembrane proteins using experimental techniques has led to a surge of interest in applying techniques from machine learning and bioinformatics to infer secondary structure from primary structure in these proteins. We are therefore interested in determining which physicochemical properties are most useful for discriminating transmembrane segments from non-transmembrane segments in transmembrane proteins, and for discriminating intrinsically unstructured segments from intrinsically structured segments in transmembrane proteins, and in using the results of these investigations to develop classifiers to identify transmembrane segments in transmembrane proteins.
We determined that the most useful properties for discriminating transmembrane segments from non-transmembrane segments and for discriminating intrinsically unstructured segments from intrinsically structured segments in transmembrane proteins were hydropathy, polarity, and flexibility, and used the results of this analysis to construct classifiers to discriminate transmembrane segments from non-transmembrane segments using four classification techniques: two variants of the Self-Organizing Global Ranking algorithm, a decision tree algorithm, and a support vector machine algorithm. All four techniques exhibited good performance, with out-of-sample accuracies of approximately 75%.
Several interesting observations emerged from our study: intrinsically unstructured segments and transmembrane segments tend to have opposite properties; transmembrane proteins appear to be much richer in intrinsically unstructured segments than other proteins; and, in approximately 70% of transmembrane proteins that contain intrinsically unstructured segments, the intrinsically unstructured segments are close to transmembrane segments.
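Hydropathy, one of the properties singled out above, is classically used in a sliding-window scan. The sketch below applies the Kyte-Doolittle hydropathy scale with a 19-residue window and a 1.6 cutoff, the conventional heuristic for flagging candidate membrane-spanning stretches. It is not one of the paper's four classifiers, and the function names are hypothetical.

```python
# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(seq, window=19):
    """Mean hydropathy of every full window along a protein sequence."""
    scores = [KYTE_DOOLITTLE[aa] for aa in seq.upper()]
    return [
        sum(scores[i:i + window]) / window
        for i in range(len(scores) - window + 1)
    ]


def candidate_tm_windows(seq, window=19, threshold=1.6):
    """Start positions of windows whose mean hydropathy meets the cutoff."""
    profile = hydropathy_profile(seq, window)
    return [i for i, h in enumerate(profile) if h >= threshold]
```

Sequences containing non-standard residue codes would raise a `KeyError` here; a learned classifier such as those in the abstract would combine hydropathy with polarity and flexibility rather than rely on a single threshold.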
PMCID: PMC2386072  PMID: 18366620
16.  Supervised learning-based tagSNP selection for genome-wide disease classifications 
BMC Genomics  2008;9(Suppl 1):S6.
Comprehensive evaluation of common genetic variations through association of single nucleotide polymorphisms (SNPs) with complex human diseases on the genome-wide scale is an active area in human genome research. One of the fundamental questions in a SNP-disease association study is to find an optimal subset of SNPs with predicting power for disease status. To find that subset while reducing study burden in terms of time and costs, one can potentially reconcile information redundancy from associations between SNP markers.
We have developed a feature selection method named Supervised Recursive Feature Addition (SRFA). This method combines supervised learning and statistical measures for the chosen candidate features/SNPs to reconcile the redundancy information and, in doing so, improve the classification performance in association studies. Additionally, we have proposed a Support Vector based Recursive Feature Addition (SVRFA) scheme in SNP-disease association analysis.
We have proposed using SRFA with different statistical learning classifiers and SVRFA for both SNP selection and disease classification and then applying them to two complex disease data sets. In general, our approaches outperform the well-known feature selection method of Support Vector Machine Recursive Feature Elimination and logic regression-based SNP selection for disease classification in genetic association studies. Our study further indicates that both genetic and environmental variables should be taken into account when doing disease predictions and classifications for the most complex human diseases that have gene-environment interactions.
PMCID: PMC2386071  PMID: 18366619
17.  Improving prediction accuracy of tumor classification by reusing genes discarded during gene selection 
BMC Genomics  2008;9(Suppl 1):S3.
Since the high dimensionality of gene expression microarray data sets degrades the generalization performance of classifiers, feature selection, which selects relevant features and discards irrelevant and redundant features, has been widely used in the bioinformatics field. Multi-task learning is a novel technique to improve prediction accuracy of tumor classification by using information contained in such discarded redundant features, but which features should be discarded or used as input or output remains an open issue.
We demonstrate a framework for automatically selecting features to be input, output, and discarded by using a genetic algorithm, and propose two algorithms: GA-MTL (Genetic algorithm based multi-task learning) and e-GA-MTL (an enhanced version of GA-MTL). Experimental results demonstrate that this framework is effective at selecting features for multi-task learning, and that GA-MTL and e-GA-MTL perform better than other heuristic methods.
Genetic algorithms are a powerful technique to select features for multi-task learning automatically; GA-MTL and e-GA-MTL are shown to improve generalization performance of classifiers on microarray data sets.
PMCID: PMC2386068  PMID: 18366616
18.  A hybrid machine learning-based method for classifying the Cushing's Syndrome with comorbid adrenocortical lesions 
BMC Genomics  2008;9(Suppl 1):S23.
The prognosis for many cancers could be improved dramatically if they could be detected while still at the microscopic disease stage. It follows from a comprehensive statistical analysis that a number of antigens such as hTERT, PCNA and Ki-67 can be considered as cancer markers, while another set of antigens such as P27KIP1 and FHIT are possible markers for normal tissue. Because more than one marker must be considered to obtain a classification of cancer or no cancer, and if cancer, to classify it as malignant, borderline, or benign, we must develop an intelligent decision system that can fulfill such an unmet medical need.
We have developed an intelligent decision system using machine learning techniques and markers to characterize tissue as cancerous, non-cancerous or borderline. The system incorporates learning techniques such as variants of support vector machines, neural networks, decision trees, self-organizing feature maps (SOFM) and recursive maximum contrast trees (RMCT). The variants and algorithms we have developed tend to detect microscopic pathological changes based on features derived from gene expression levels and metabolic profiles. We have also used immunohistochemistry techniques to measure the gene expression profiles from a number of antigens such as cyclin E, P27KIP1, FHIT, Ki-67, PCNA, Bax, Bcl-2, P53, Fas, FasL and hTERT in several particular types of neuroendocrine tumors such as pheochromocytomas, paragangliomas, and the adrenocortical carcinomas (ACC), adenomas (ACA), and hyperplasia (ACH) involved with Cushing's syndrome. We provided statistical evidence that higher expression levels of hTERT, PCNA, Ki-67, etc. are associated with a higher risk that the tumors are malignant or borderline as opposed to benign. We also investigated whether higher expression levels of P27KIP1, FHIT, etc. are associated with a decreased risk of adrenomedullary tumors. While no significant difference was found between cell-arrest antigens such as P27KIP1 for malignant, borderline, and benign tumors, there was a significant difference between expression levels of such antigens in normal adrenal medulla samples and in adrenomedullary tumors.
Our framework focused not only on different classification schemes and feature selection algorithms, but also on ensemble methods such as boosting and bagging in an effort to improve upon the accuracy of the individual classifiers. It is evident that when machine learning and statistical learning techniques are combined appropriately into one integrated intelligent medical decision system, the prediction power can be enhanced significantly. This research has many potential applications; it might provide an alternative diagnostic tool and a better understanding of the mechanisms involved in malignant transformation as well as information that is useful for treatment planning and cancer prevention.
PMCID: PMC2386065  PMID: 18366613
19.  Prediction-based approaches to characterize bidirectional promoters in the mammalian genome 
BMC Genomics  2008;9(Suppl 1):S2.
Machine learning approaches are emerging as a way to discriminate various classes of functional elements. Previous attempts to create Regulatory Potential (RP) scores to discriminate functional DNA from nonfunctional DNA included using Markov models trained to identify sequences from promoters and enhancers from ancestral repeats. We proposed that knowledge gleaned from those methods could be further refined using a multiple class predictor to separate classes of promoter elements from enhancers or nonfunctional DNA.
We extended our previous work, which identified over 5,000 candidate bidirectional promoters in the human genome, to map the orthologous promoter regions in the mouse genome. Our algorithm measured the robustness of evidence provided by the spliced EST annotations and incorporated evidence from annotations of UCSC Known Genes and GenBank mRNA. In preparation for de novo prediction of this promoter type, we examined characteristic features of the dataset as a whole. For instance, bidirectional promoters score very highly among all functional elements for Regulatory Potential Scores. This result was unexpected due to the limited sequence conservation found in these noncoding regions. We demonstrate that bidirectional promoters can be classified apart from other genomic features including non-bidirectional promoters, i.e. those promoters having no nearby upstream genes. Furthermore, bidirectional promoters consistently score at the level of very highly conserved functional elements in the genome, namely developmental enhancers. The high scores are due to sequence-based characteristics within the promoters, not the surrounding exons. These results indicate that high-scoring RP regions can be deconvoluted into various functional classes of genomic elements. Using a multiple class predictor, we are able to discriminate bidirectional promoters from enhancers, non-bidirectional promoters, and non-promoter regions on the basis of RP scores and CpG islands.
We examine orthology at bidirectional promoters, use discriminatory machine learning approaches to differentiate multiple types of promoters from other functional and nonfunctional features in the genome and begin the process of deconvoluting classes of functional regions that score well with RP scores. These types of approaches precede supervised learning techniques to discover unannotated promoter regions.
PMCID: PMC2386062  PMID: 18366609
20.  Supervised learning method for the prediction of subcellular localization of proteins using amino acid and amino acid pair composition 
BMC Genomics  2008;9(Suppl 1):S16.
Knowing where a protein occurs in the cell is an important step in understanding its function. It is highly desirable to predict a protein's subcellular locations automatically from its sequence. The most studied approaches to predicting subcellular localization rely on signal peptides, on location inferred from sequence homology, and on the correlation between subcellular location and the total amino acid composition of proteins. Taking both amino acid composition and amino acid pair composition into consideration helps improve prediction accuracy.
We constructed a dataset of protein sequences from the SWISS-PROT database and segmented them into 12 classes based on their subcellular locations. SVM modules were trained to predict the subcellular location based on amino acid composition and amino acid pair composition. Results were calculated after 10-fold cross validation. The Radial Basis Function (RBF) kernel outperformed polynomial and linear kernel functions. Total prediction accuracy reached 71.8% for amino acid composition and 77.0% for amino acid pair composition. To observe the impact of the number of subcellular locations, we constructed two more datasets of nine and five subcellular locations. Total accuracy further improved to 79.9% and 85.66%, respectively.
A new SVM-based approach is presented that uses amino acid and amino acid pair composition. The results show that data simulation and taking more protein features into consideration improve the accuracy to a great extent. It was also noticed that the data set needs to be crafted to take account of the distribution of data across all the classes.
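The two feature types named in this abstract, amino acid composition and amino acid pair composition, might be computed along the following lines. This is a minimal sketch, not the authors' code: the function names, the feature ordering, and the handling of non-standard residues are assumptions, and the abstract does not say whether the pairs are restricted to adjacent residues as assumed here.

```python
from itertools import product

# The 20 standard amino acids in one-letter code.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# All 400 ordered residue pairs; this feature ordering is an assumption.
PAIRS = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]


def _standard_residues(sequence):
    # Non-standard characters are dropped, so pairs are formed over the
    # remaining residues (an illustrative simplification).
    return [r for r in sequence.upper() if r in AMINO_ACIDS]


def amino_acid_composition(sequence):
    """Return 20 fractions: the relative frequency of each amino acid."""
    residues = _standard_residues(sequence)
    if not residues:
        return [0.0] * len(AMINO_ACIDS)
    return [residues.count(a) / len(residues) for a in AMINO_ACIDS]


def pair_composition(sequence):
    """Return 400 fractions: the relative frequency of each adjacent pair."""
    residues = _standard_residues(sequence)
    pairs = [a + b for a, b in zip(residues, residues[1:])]
    if not pairs:
        return [0.0] * len(PAIRS)
    counts = {}
    for p in pairs:
        counts[p] = counts.get(p, 0) + 1
    return [counts.get(p, 0) / len(pairs) for p in PAIRS]
```

Each protein then becomes a fixed-length vector (20 or 400 values) regardless of sequence length, which is the form a kernel classifier such as an SVM expects.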
PMCID: PMC2386058  PMID: 18366605
21.  A comparative study of different machine learning methods on microarray gene expression data 
BMC Genomics  2008;9(Suppl 1):S13.
Several classification and feature selection methods have been studied for the identification of differentially expressed genes in microarray data. Classification methods such as SVM, RBF Neural Nets, MLP Neural Nets, Bayesian, Decision Tree and Random Forest methods have been used in recent studies. The accuracy of these methods has been calculated with validation methods such as v-fold validation. However, there is a lack of comparison between these methods to find a better framework for classification, clustering and analysis of microarray gene expression results.
In this study, we compared the efficiency of classification methods including SVM, RBF Neural Nets, MLP Neural Nets, Bayesian, Decision Tree and Random Forest methods. The v-fold cross validation was used to calculate the accuracy of the classifiers. Some of the common clustering methods, including K-means, DBC, and EM clustering, were applied to the datasets and the efficiency of these methods was analysed. Further, the efficiency of feature selection methods including support vector machine recursive feature elimination (SVM-RFE), Chi Squared, and CSF was compared. In each case these methods were applied to eight different binary (two class) microarray datasets. We evaluated the class prediction efficiency of each gene list in training and test cross-validation using supervised classifiers.
We presented a study in which we compared some of the commonly used classification, clustering, and feature selection methods. We applied these methods to eight publicly available datasets, and compared how these methods performed in class prediction of test datasets. We reported that the choice of feature selection method, the number of genes in the gene list, and the number of cases (samples) substantially influence classification success. Based on features chosen by these methods, error rates and accuracy of several classification algorithms were obtained. Results revealed the importance of feature selection in accurately classifying new samples and showed how an integrated feature selection and classification algorithm performs and is capable of identifying significant genes.
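The v-fold cross validation used to score the classifiers in this study can be illustrated with a small sketch. The helper names `v_fold_splits` and `cross_validated_accuracy`, and the `train_fn`/`predict_fn` callables, are hypothetical. The folds here are contiguous blocks with no shuffling or stratification, which is a simplification; the abstract does not describe how folds were drawn.

```python
def v_fold_splits(n_samples, v):
    """Yield (train_indices, test_indices) for each of the v folds."""
    if not 2 <= v <= n_samples:
        raise ValueError("v must be between 2 and the number of samples")
    indices = list(range(n_samples))
    # Distribute samples so that fold sizes differ by at most one.
    sizes = [n_samples // v + (1 if i < n_samples % v else 0) for i in range(v)]
    start = 0
    for size in sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


def cross_validated_accuracy(samples, labels, v, train_fn, predict_fn):
    """Train on v-1 folds, test on the held-out fold, and pool the results."""
    correct = 0
    for train_idx, test_idx in v_fold_splits(len(samples), v):
        model = train_fn([samples[i] for i in train_idx],
                         [labels[i] for i in train_idx])
        correct += sum(1 for i in test_idx
                       if predict_fn(model, samples[i]) == labels[i])
    return correct / len(samples)


# Usage with a trivial majority-class "classifier" (for illustration only).
labels = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
samples = list(range(len(labels)))
accuracy = cross_validated_accuracy(
    samples, labels, v=5,
    train_fn=lambda X, y: max(set(y), key=y.count),
    predict_fn=lambda model, x: model,
)
```

Every sample is used for testing exactly once, so the pooled accuracy is an estimate over the whole dataset rather than over a single held-out split.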
PMCID: PMC2386055  PMID: 18366602
22.  Flexible nets: disorder and induced fit in the associations of p53 and 14-3-3 with their partners 
BMC Genomics  2008;9(Suppl 1):S1.
Proteins are involved in many interactions with other proteins leading to networks that regulate and control a wide variety of physiological processes. Some of these proteins, called hub proteins or hubs, bind to many different protein partners. Protein intrinsic disorder, via diversity arising from structural plasticity or flexibility, provides a means for hubs to associate with many partners (Dunker AK, Cortese MS, Romero P, Iakoucheva LM, Uversky VN: Flexible Nets: The roles of intrinsic disorder in protein interaction networks. FEBS J 2005, 272:5129-5148).
Here we present a detailed examination of two divergent examples: 1) p53, which uses different disordered regions to bind to different partners and which also has several individual disordered regions that each bind to multiple partners, and 2) 14-3-3, which is a structured protein that associates with many different intrinsically disordered partners. For both examples, three-dimensional structures of multiple complexes reveal that the flexibility and plasticity of intrinsically disordered protein regions as well as induced-fit changes in the structured regions are both important for binding diversity.
These data support the conjecture that hub proteins often utilize intrinsic disorder to bind to multiple partners and provide detailed information about induced fit in structured regions.
PMCID: PMC2386051  PMID: 18366598
