RDX is a well-known pollutant to induce neurotoxicity. MicroRNAs (miRNA) and messenger RNA (mRNA) profiles are useful tools for toxicogenomics studies. It is worthy to integrate MiRNA and mRNA expression data to understand RDX-induced neurotoxicity.
Rats were treated with or without RDX for 48 h. Both miRNA and mRNA profiles were conducted using brain tissues. Nine miRNAs were significantly regulated by RDX. Of these, 6 and 3 miRNAs were up- and down-regulated respectively. The putative target genes of RDX-regulated miRNAs were highly nervous system function genes and pathways enriched. Fifteen differentially genes altered by RDX from mRNA profiles were the putative targets of regulated miRNAs. The induction of miR-71, miR-27ab, miR-98, and miR-135a expression by RDX, could reduce the expression of the genes POLE4, C5ORF13, SULF1 and ROCK2, and eventually induce neurotoxicity. Over-expression of miR-27ab, or reduction of the expression of unknown miRNAs by RDX, could up-regulate HMGCR expression and contribute to neurotoxicity. RDX regulated immune and inflammation response miRNAs and genes could contribute to RDX- induced neurotoxicity and other toxicities as well as animal defending reaction response to RDX exposure.
Our results demonstrate that integrating miRNA and mRNA profiles is valuable to indentify novel biomarkers and molecular mechanisms for RDX-induced neurological disorder and neurotoxicity.
Epigallocatechin-3-gallate (EGCG) has been demonstrated to inhibit cancer in experimental studies through its antioxidant activity and modulations on cellular functions by binding specific proteins. We demonstrated previously that EGCG upregulates the expression of microRNA (i.e. miR-210) by binding HIF-1α, resulting in reduced cell proliferation and anchorage-independent growth. However, the binding affinities of EGCG to HIF-1α and many other targets are higher than the EGCG plasma peak level in experimental animals administered with high dose of EGCG, raising a concern whether the microRNA regulation by HIF-1α is involved in the anti-cancer activity of EGCG in vivo.
We employed functional genomic approaches to elucidate the role of microRNA in the EGCG inhibition of tobacco carcinogen-induced lung tumors in A/J mice. By analysing the microRNA profiles, we found modest changes in the expression levels of 21 microRNAs. By correlating these 21 microRNAs with the mRNA expression profiles using the computation methods, we identified 26 potential targeted genes of the 21 microRNAs. Further exploration using pathway analysis revealed that the most impacted pathways of EGCG treatment are the regulatory networks associated to AKT, NF-κB, MAP kinases, and cell cycle, and the identified miRNA targets are involved in the networks of AKT, MAP kinases and cell cycle regulation
These results demonstrate that the miRNA-mediated regulation is actively involved in the major aspects of the anti-cancer activity of EGCG in vivo.
High throughput transcriptomics profiles such as those generated using microarrays have been useful in identifying biomarkers for different classification and toxicity prediction purposes. Here, we investigated the use of microarrays to predict chemical toxicants and their possible mechanisms of action.
In this study, in vitro cultures of primary rat hepatocytes were exposed to 105 chemicals and vehicle controls, representing 14 compound classes. We comprehensively compared various normalization of gene expression profiles, feature selection and classification algorithms for the classification of these 105 chemicals into14 compound classes. We found that normalization had little effect on the averaged classification accuracy. Two support vector machine (SVM) methods, LibSVM and sequential minimal optimization, had better classification performance than other methods. SVM recursive feature selection (SVM-RFE) had the highest overfitting rate when an independent dataset was used for a prediction. Therefore, we developed a new feature selection algorithm called gradient method that had a relatively high training classification as well as prediction accuracy with the lowest overfitting rate of the methods tested. Analysis of biomarkers that distinguished the 14 classes of compounds identified a group of genes principally involved in cell cycle function that were significantly downregulated by metal and inflammatory compounds, but were induced by anti-microbial, cancer related drugs, pesticides, and PXR mediators.
Our results indicate that using microarrays and a supervised machine learning approach to predict chemical toxicants, their potential toxicity and mechanisms of action is practical and efficient. Choosing the right feature and classification algorithms for this multiple category classification and prediction is critical.
Biomarker; Microarray; Hepatocytes; Chemical; Classification
This is an editorial report of the supplement to BMC Genomics that includes 15 papers selected from the BIOCOMP'10 - The 2010 International Conference on Bioinformatics & Computational Biology as well as other sources with a focus on genomics studies.
BIOCOMP'10 was held on July 12-15 in Las Vegas, Nevada. The congress covered a large variety of research areas, and genomics was one of the major focuses because of the fast development in this field. We set out to launch a supplement to BMC Genomics with manuscripts selected from this congress and invited submissions. With a rigorous peer review process, we selected 15 manuscripts that showed work in cutting-edge genomics fields and proposed innovative methodology. We hope this supplement presents the current computational and statistical challenges faced in genomics studies, and shows the enormous promises and opportunities in the genomic future.
Microarray data have a high dimension of variables and a small sample size. In microarray data analyses, two important issues are how to choose genes, which provide reliable and good prediction for disease status, and how to determine the final gene set that is best for classification. Associations among genetic markers mean one can exploit information redundancy to potentially reduce classification cost in terms of time and money.
To deal with redundant information and improve classification, we propose a gene selection method, Recursive Feature Addition, which combines supervised learning and statistical similarity measures. To determine the final optimal gene set for prediction and classification, we propose an algorithm, Lagging Prediction Peephole Optimization. By using six benchmark microarray gene expression data sets, we compared Recursive Feature Addition with recently developed gene selection methods: Support Vector Machine Recursive Feature Elimination, Leave-One-Out Calculation Sequential Forward Selection and several others.
On average, with the use of popular learning machines including Nearest Mean Scaled Classifier, Support Vector Machine, Naive Bayes Classifier and Random Forest, Recursive Feature Addition outperformed other methods. Our studies also showed that Lagging Prediction Peephole Optimization is superior to random strategy; Recursive Feature Addition with Lagging Prediction Peephole Optimization obtained better testing accuracies than the gene selection method varSelRF.
gene selection; microarray; classification; supervised-learning; similarity
The recent advancement in array CGH (aCGH) research has significantly improved tumor identification using DNA copy number data. A number of unsupervised learning methods have been proposed for clustering aCGH samples. Two of the major challenges for developing aCGH sample clustering are the high spatial correlation between aCGH markers and the low computing efficiency. A mixture hidden Markov model based algorithm was developed to address these two challenges.
The hidden Markov model (HMM) was used to model the spatial correlation between aCGH markers. A fast clustering algorithm was implemented and real data analysis on glioma aCGH data has shown that it converges to the optimal cluster rapidly and the computation time is proportional to the sample size. Simulation results showed that this HMM based clustering (HMMC) method has a substantially lower error rate than NMF clustering. The HMMC results for glioma data were significantly associated with clinical outcomes.
We have developed a fast clustering algorithm to identify tumor subtypes based on DNA copy number aberrations. The performance of the proposed HMMC method has been evaluated using both simulated and real aCGH data. The software for HMMC in both R and C++ is available in ND INBRE website http://ndinbre.org/programs/bioinformatics.php.
Along with obesity, physical inactivity, and family history of metabolic disorders, African American ethnicity is a risk factor for type 2 diabetes (T2D) in the United States. However, little is known about the differences in gene expression and transcriptomic profiles of blood in T2D between African Americans (AA) and Caucasians (CAU), and microarray analysis of peripheral white blood cells (WBCs) from these two ethnic groups will facilitate our understanding of the underlying molecular mechanism in T2D and identify genetic biomarkers responsible for the disparities.
A whole human genome oligomicroarray of peripheral WBCs was performed on 144 samples obtained from 84 patients with T2D (44 AA and 40 CAU) and 60 healthy controls (28 AA and 32 CAU). The results showed that 30 genes had significant difference in expression between patients and controls (a fold change of <-1.4 or >1.4 with a P value <0.05). These known genes were mainly clustered in three functional categories: immune responses, lipid metabolism, and organismal injury/abnormaly. Transcriptomic analysis also showed that 574 genes were differentially expressed in AA diseased versus AA control, compared to 200 genes in CAU subjects. Pathway study revealed that "Communication between innate and adaptive immune cells"/"Primary immunodeficiency signaling" are significantly down-regulated in AA patients and "Interferon signaling"/"Complement System" are significantly down-regulated in CAU patients.
These newly identified genetic markers in WBCs provide valuable information about the pathophysiology of T2D and can be used for diagnosis and pharmaceutical drug design. Our results also found that AA and CAU patients with T2D express genes and pathways differently.
Speckles in ultrasound imaging affect image quality and can make the post-processing difficult. Speckle reduction technologies have been employed for removing speckles for some time. One of the effective speckle reduction technologies is anisotropic diffusion. Anisotropic diffusion technology can remove the speckles effectively while preserving the edges of the image and thus has drawn great attention from image processing scientists. However, the proposed methods in the past have different disadvantages, such as being sensitive to the number of iterations or low capability of preserving the details of the ultrasound images. Thus a detail preserved anisotropic diffusion speckle reduction with less sensitive to the number of iterations is needed. This paper aims to develop this kind of technologies.
In this paper, we propose a robust detail preserving anisotropic diffusion filter (RDPAD) for speckle reduction. In order to get robust diffusion, the proposed method integrates Tukey error norm function into the detail preserving anisotropic diffusion filter (DPAD) developed recently. The proposed method could prohibit over-diffusion and thus is less sensitive to the number of iterations
The proposed anisotropic diffusion can preserve the important structure information of the original image while reducing speckles. It is also less sensitive to the number of iterations. Experimental results on real ultrasound images show the effectiveness of the proposed anisotropic diffusion filter.
Microarray data have been used for gene signature selection to predict clinical outcomes. Many studies have attempted to identify factors that affect models' performance with only little success. Fine-tuning of model parameters and optimizing each step of the modeling process often results in over-fitting problems without improving performance.
We propose a quantitative measurement, termed consistency degree, to detect the correlation between disease endpoint and gene expression profile. Different endpoints were shown to have different consistency degrees to gene expression profiles. The validity of this measurement to estimate the consistency was tested with significance at a p-value less than 2.2e-16 for all of the studied endpoints. According to the consistency degree score, overall survival milestone outcome of multiple myeloma was proposed to extend from 730 days to 1561 days, which is more consistent with gene expression profile.
For various clinical endpoints, the maximum predictive powers of different microarray-based models are limited by the correlation between endpoint and gene expression profile of disease samples as indicated by the consistency degree score. In addition, previous defined clinical outcomes can also be reassessed and refined more coherent according to related disease gene expression profile. Our findings point to an entirely new direction for assessing the microarray-based predictive models and provide important information to gene signature based clinical applications.
The use of gene signatures can potentially be of considerable value in the field of clinical diagnosis. However, gene signatures defined with different methods can be quite various even when applied the same disease and the same endpoint. Previous studies have shown that the correct selection of subsets of genes from microarray data is key for the accurate classification of disease phenotypes, and a number of methods have been proposed for the purpose. However, these methods refine the subsets by only considering each single feature, and they do not confirm the association between the genes identified in each gene signature and the phenotype of the disease. We proposed an innovative new method termed Minimize Feature's Size (MFS) based on multiple level similarity analyses and association between the genes and disease for breast cancer endpoints by comparing classifier models generated from the second phase of MicroArray Quality Control (MAQC-II), trying to develop effective meta-analysis strategies to transform the MAQC-II signatures into a robust and reliable set of biomarker for clinical applications.
We analyzed the similarity of the multiple gene signatures in an endpoint and between the two endpoints of breast cancer at probe and gene levels, the results indicate that disease-related genes can be preferably selected as the components of gene signature, and that the gene signatures for the two endpoints could be interchangeable. The minimized signatures were built at probe level by using MFS for each endpoint. By applying the approach, we generated a much smaller set of gene signature with the similar predictive power compared with those gene signatures from MAQC-II.
Our results indicate that gene signatures of both large and small sizes could perform equally well in clinical applications. Besides, consistency and biological significances can be detected among different gene signatures, reflecting the studying endpoints. New classifiers built with MFS exhibit improved performance with both internal and external validation, suggesting that MFS method generally reduces redundancies for features within gene signatures and improves the performance of the model. Consequently, our strategy will be beneficial for the microarray-based clinical applications.
In microarray experiments with small sample sizes, it is a challenge to estimate p-values accurately and decide cutoff p-values for gene selection appropriately. Although permutation-based methods have proved to have greater sensitivity and specificity than the regular t-test, their p-values are highly discrete due to the limited number of permutations available in very small sample sizes. Furthermore, estimated permutation-based p-values for true nulls are highly correlated and not uniformly distributed between zero and one, making it difficult to use current false discovery rate (FDR)-controlling methods.
We propose a model-based information sharing method (MBIS) that, after an appropriate data transformation, utilizes information shared among genes. We use a normal distribution to model the mean differences of true nulls across two experimental conditions. The parameters of the model are then estimated using all data in hand. Based on this model, p-values, which are uniformly distributed from true nulls, are calculated. Then, since FDR-controlling methods are generally not well suited to microarray data with very small sample sizes, we select genes for a given cutoff p-value and then estimate the false discovery rate.
Simulation studies and analysis using real microarray data show that the proposed method, MBIS, is more powerful and reliable than current methods. It has wide application to a variety of situations.
Significant interest exists in establishing synergistic research in bioinformatics, systems biology and intelligent computing. Supported by the United States National Science Foundation (NSF), International Society of Intelligent Biological Medicine (http://www.ISIBM.org), International Journal of Computational Biology and Drug Design (IJCBDD) and International Journal of Functional Informatics and Personalized Medicine, the ISIBM International Joint Conferences on Bioinformatics, Systems Biology and Intelligent Computing (ISIBM IJCBS 2009) attracted more than 300 papers and 400 researchers and medical doctors world-wide. It was the only inter/multidisciplinary conference aimed to promote synergistic research and education in bioinformatics, systems biology and intelligent computing. The conference committee was very grateful for the valuable advice and suggestions from honorary chairs, steering committee members and scientific leaders including Dr. Michael S. Waterman (USC, Member of United States National Academy of Sciences), Dr. Chih-Ming Ho (UCLA, Member of United States National Academy of Engineering and Academician of Academia Sinica), Dr. Wing H. Wong (Stanford, Member of United States National Academy of Sciences), Dr. Ruzena Bajcsy (UC Berkeley, Member of United States National Academy of Engineering and Member of United States Institute of Medicine of the National Academies), Dr. Mary Qu Yang (United States National Institutes of Health and Oak Ridge, DOE), Dr. Andrzej Niemierko (Harvard), Dr. A. Keith Dunker (Indiana), Dr. Brian D. Athey (Michigan), Dr. Weida Tong (FDA, United States Department of Health and Human Services), Dr. Cathy H. Wu (Georgetown), Dr. Dong Xu (Missouri), Drs. Arif Ghafoor and Okan K Ersoy (Purdue), Dr. Mark Borodovsky (Georgia Tech, President of ISIBM), Dr. Hamid R. Arabnia (UGA, Vice-President of ISIBM), and other scientific leaders. The committee presented the 2009 ISIBM Outstanding Achievement Awards to Dr. Joydeep Ghosh (UT Austin), Dr. Aidong Zhang (Buffalo) and Dr. Zhi-Hua Zhou (Nanjing) for their significant contributions to the field of intelligent biological medicine.
The increasing availability of large-scale protein-protein interaction data has made it possible to understand the basic components and organization of cell machinery from the network level. The arising challenge is how to analyze such complex interacting data to reveal the principles of cellular organization, processes and functions. Many studies have shown that clustering protein interaction network is an effective approach for identifying protein complexes or functional modules, which has become a major research topic in systems biology. In this review, recent advances in clustering methods for protein interaction networks will be presented in detail. The predictions of protein functions and interactions based on modules will be covered. Finally, the performance of different clustering methods will be compared and the directions for future research will be discussed.
Military and industrial activities have lead to reported release of 2,4-dinitrotoluene (2,4DNT) into soil, groundwater or surface water. It has been reported that 2,4DNT can induce toxic effects on humans and other organisms. However the mechanism of 2,4DNT induced toxicity is still unclear. Although a series of methods for gene network construction have been developed, few instances of applying such technology to generate pathway connected networks have been reported.
Microarray analyses were conducted using liver tissue of rats collected 24h after exposure to a single oral gavage with one of five concentrations of 2,4DNT. We observed a strong dose response of differentially expressed genes after 2,4DNT treatment. The most affected pathways included: long term depression, breast cancer regulation by stathmin1, WNT Signaling; and PI3K signaling pathways. In addition, we propose a new approach to construct pathway connected networks regulated by 2,4DNT. We also observed clear dose response pathway networks regulated by 2,4DNT.
We developed a new method for constructing pathway connected networks. This new method was successfully applied to microarray data from liver tissue of 2,4DNT exposed animals and resulted in the identification of unique dose responsive biomarkers in regards to affected pathways.
Starches are the main storage polysaccharides in plants and are distributed widely throughout plants including seeds, roots, tubers, leaves, stems and so on. Currently, microscopic observation is one of the most important ways to investigate and analyze the structure of starches. The position, shape, and size of the starch granules are the main measurements for quantitative analysis. In order to obtain these measurements, segmentation of starch granules from the background is very important. However, automatic segmentation of starch granules is still a challenging task because of the limitation of imaging condition and the complex scenarios of overlapping granules.
We propose a novel method to segment starch granules in microscopic images. In the proposed method, we first separate starch granules from background using automatic thresholding and then roughly segment the image using watershed algorithm. In order to reduce the oversegmentation in watershed algorithm, we use the roundness of each segment, and analyze the gradient vector field to find the critical points so as to identify oversegments. After oversegments are found, we extract the features, such as the position and intensity of the oversegments, and use fuzzy c-means clustering to merge the oversegments to the objects with similar features. Experimental results demonstrate that the proposed method can alleviate oversegmentation of watershed segmentation algorithm successfully.
We present a new scheme for starch granules segmentation. The proposed scheme aims to alleviate the oversegmentation in watershed algorithm. We use the shape information and critical points of gradient vector flow (GVF) of starch granules to identify oversegments, and use fuzzy c-mean clustering based on prior knowledge to merge these oversegments to the objects. Experimental results on twenty microscopic starch images demonstrate the effectiveness of the proposed scheme.
Ultrasound imaging technology has wide applications in cattle reproduction and has been used to monitor individual follicles and determine the patterns of follicular development. However, the speckles in ultrasound images affect the post-processing, such as follicle segmentation and finally affect the measurement of the follicles. In order to reduce the effect of speckles, a bilateral filter is developed in this paper.
We develop a new bilateral filter for speckle reduction in ultrasound images for follicle segmentation and measurement. Different from the previous bilateral filters, the proposed bilateral filter uses normalized difference in the computation of the Gaussian intensity difference. We also present the results of follicle segmentation after speckle reduction. Experimental results on both synthetic images and real ultrasound images demonstrate the effectiveness of the proposed filter.
Compared with the previous bilateral filters, the proposed bilateral filter can reduce speckles in both high-intensity regions and low intensity regions in ultrasound images. The segmentation of the follicles in the speckle reduced images by the proposed method has higher performance than the segmentation in the original ultrasound image, and the images filtered by Gaussian filter, the conventional bilateral filter respectively.
Acetylation is a crucial post-translational modification for histones, and plays a key role in gene expression regulation. Due to limited data and lack of a clear acetylation consensus sequence, a few researches have focused on prediction of lysine acetylation sites. Several systematic prediction studies have been conducted for human and yeast, but less for Arabidopsis thaliana.
Concerning the insufficient observation on acetylation site, we analyzed contributions of the peptide-alignment-based distance definition and 3D structure factors in acetylation prediction. We found that traditional structure contributes little to acetylation site prediction. Identified acetylation sites of histones in Arabidopsis thaliana are conserved and cross predictable with that of human by peptide based methods. However, the predicted specificity is overestimated, because of the existence of non-observed acetylable site. Here, by performing a complete exploration on the factors that affect the acetylability of lysines in histones, we focused on the relative position of lysine at nucleosome level, and defined a new structure feature to promote the performance in predicting the acetylability of all the histone lysines in A. thaliana.
We found a new spacial correlated acetylation factor, and defined a ε-N spacial location based feature, which contains five core spacial ellipsoid wired areas. By incorporating the new feature, the performance of predicting the acetylability of all the histone lysines in A. Thaliana was promoted, in which the previous mispredicted acetylable lysines were corrected by comparing to the peptide-based prediction.
Many essential cellular processes, such as cellular metabolism, transport, cellular metabolism and most regulatory mechanisms, rely on physical interactions between proteins. Genome-wide protein interactome networks of yeast, human and several other animal organisms have already been established, but this kind of network reminds to be established in the field of plant.
We first predicted the protein protein interaction in Arabidopsis thaliana with methods, including ortholog, SSBP, gene fusion, gene neighbor, phylogenetic profile, coexpression, protein domain, and used Naïve Bayesian approach next to integrate the results of these methods and text mining data to build a genome-wide protein interactome network. Furthermore, we adopted the data of GO enrichment analysis, pathway, published literature to validate our network, the confirmation of our network shows the feasibility of using our network to predict protein function and other usage.
Our interactome is a comprehensive genome-wide network in the organism plant Arabidopsis thaliana, and provides a rich resource for researchers in related field to study the protein function, molecular interaction and potential mechanism under different conditions.
Sheepshead minnow (Cyprinodon variegatus) are small fish capable of withstanding exposure to very low levels of dissolved oxygen, as well as extreme temperatures and salinities. It is an important model in understanding the impacts and biological response to hypoxia and co-occurring compounding stressors such as polycyclic aromatic hydrocarbons, endocrine disrupting chemicals, metals and herbicides. Here, we initiated a project to sequence and analyze over 10,000 ESTs generated from the Sheepshead minnow (Cyprinodon variegatus) as a resource for investigating stressor responses.
We sequenced 10,858 EST clones using a normalized cDNA library made from larval, embryonic and adult suppression subtractive hybridization-PCR (SSH) libraries. Post- sequencing processing led to 8,099 high quality sequences. Clustering analysis of these ESTs indentified 4,223 unique sequences containing 1,053 contigs and 3,170 singletons. BLASTX searches produced 1,394 significant (E-value < 10-5) hits and further Gene Ontology (GO) analysis annotated 388 of these genes. All the EST sequences were deposited by Expressed Sequence Tags database (dbEST) in GenBank (GenBank: GE329585 to GE337683). Gene discovery and annotations are presented and discussed. This set of ESTs represents a significant proportion of the Sheepshead minnow (Cyprinodon variegatus) transcriptome, and provides a material basis for the development of microarrays useful for further gene expression studies in association with stressors such as hypoxia, cadmium, chromium and pyrene.
One of the most challenging tasks in the post-genomic era is to reconstruct the transcriptional regulatory networks. The goal is to reveal, for each gene that responds to a certain biological event, which transcription factors affect its expression, and how a set of transcription factors coordinate to accomplish temporal and spatial specific regulations.
Here we propose a supervised machine learning approach to address these questions. We focus our study on the gene transcriptional regulation of the cell cycle in the budding yeast, thanks to the large amount of data available and relatively well-understood biology, although the main ideas of our method can be applied to other data as well. Our method starts with building an ensemble of decision trees for each microarray data to capture the association between the expression levels of yeast genes and the binding of transcription factors to gene promoter regions, as determined by chromatin immunoprecipitation microarray (ChIP-chip) experiment. Cross-validation experiments show that the method is more accurate and reliable than the naive decision tree algorithm and several other ensemble learning methods. From the decision tree ensembles, we extract logical rules that explain how a set of transcription factors act in concert to regulate the expression of their targets. We further compute a profile for each rule to show its regulation strengths at different time points. We also propose a spline interpolation method to integrate the rule profiles learned from several time series expression data sets that measure the same biological process. We then combine these rule profiles to build a transcriptional regulatory network for the yeast cell cycle. Compared to the results in the literature, our method correctly identifies all major known yeast cell cycle transcription factors, and assigns them into appropriate cell cycle phases. Our method also identifies many interesting synergetic relationships among these transcription factors, most of which are well known, while many of the rest can also be supported by other evidences.
The high accuracy of our method indicates that our method is valid and robust. As more gene expression and transcription factor binding data become available, we believe that our method is useful for reconstructing large-scale transcriptional regulatory networks in other species as well.
In the classification of Mass Spectrometry (MS) proteomics data, peak detection, feature selection, and learning classifiers are critical to classification accuracy. To better understand which methods are more accurate when classifying data, some publicly available peak detection algorithms for Matrix assisted Laser Desorption Ionization Mass Spectrometry (MALDI-MS) data were recently compared; however, the issue of different feature selection methods and different classification models as they relate to classification performance has not been addressed. With the application of intelligent computing, much progress has been made in the development of feature selection methods and learning classifiers for the analysis of high-throughput biological data. The main objective of this paper is to compare the methods of feature selection and different learning classifiers when applied to MALDI-MS data and to provide a subsequent reference for the analysis of MS proteomics data.
We compared a well-known method of feature selection, Support Vector Machine Recursive Feature Elimination (SVMRFE), and a recently developed method, Gradient based Leave-one-out Gene Selection (GLGS) that effectively performs microarray data analysis. We also compared several learning classifiers including K-Nearest Neighbor Classifier (KNNC), Naïve Bayes Classifier (NBC), Nearest Mean Scaled Classifier (NMSC), uncorrelated normal based quadratic Bayes Classifier recorded as UDC, Support Vector Machines, and a distance metric learning for Large Margin Nearest Neighbor classifier (LMNN) based on Mahanalobis distance. To compare, we conducted a comprehensive experimental study using three types of MALDI-MS data.
Regarding feature selection, SVMRFE outperformed GLGS in classification. As for the learning classifiers, when classification models derived from the best training were compared, SVMs performed the best with respect to the expected testing accuracy. However, the distance metric learning LMNN outperformed SVMs and other classifiers on evaluating the best testing. In such cases, the optimum classification model based on LMNN is worth investigating for future study.
Gene expression time series array data has become a useful resource for investigating gene functions and the interactions between genes. However, the gene expression arrays are always mixed with noise, and many nonlinear regulatory relationships have been omitted in many linear models. Because of those practical limitations, inference of gene regulatory model from expression data is still far from satisfactory.
In this study, we present a model-based computational approach, Slice Pattern Model (SPM), to identify gene regulatory network from time series gene expression array data. In order to estimate performances of stability and reliability of our model, an artificial gene network is tested by the traditional linear model and SPM. SPM can handle the multiple transcriptional time lags and more accurately reconstruct the gene network. Using SPM, a 17 time-series gene expression data in yeast cell cycle is retrieved to reconstruct the regulatory network. Under the reliability threshold, θ = 55%, 18 relationships between genes are identified and transcriptional regulatory network is reconstructed. Results from previous studies demonstrate that most of gene relationships identified by SPM are correct.
With the help of pattern recognition and similarity analysis, the effect of noise has been limited in SPM method. At the same time, genetic algorithm is introduced to optimize parameters of gene network model, which is performed based on a statistic method in our experiments. The results of experiments demonstrate that the gene regulatory model reconstructed using SPM is more stable and reliable than those models coming from traditional linear model.
Affymetrix GeneChip® high-density oligonucleotide arrays are widely used in biological and medical research because of production reproducibility, which facilitates the comparison of results between experiment runs. In order to obtain high-level classification and cluster analysis that can be trusted, it is important to perform various pre-processing steps on the probe-level data to control for variability in sample processing and array hybridization. Many proposed preprocessing methods are parametric, in that they assume that the background noise generated by microarray data is a random sample from a statistical distribution, typically a normal distribution. The quality of the final results depends on the validity of such assumptions.
We propose a Distribution Free Convolution Model (DFCM) to circumvent observed deficiencies in meeting and validating distribution assumptions of parametric methods. Knowledge of array structure and the biological function of the probes indicate that the intensities of mismatched (MM) probes that correspond to the smallest perfect match (PM) intensities can be used to estimate the background noise. Specifically, we obtain the smallest q2 percent of the MM intensities that are associated with the lowest q1 percent PM intensities, and use these intensities to estimate background.
Using the Affymetrix Latin Square spike-in experiments, we show that the background noise generated by microarray experiments typically is not well modeled by a single overall normal distribution. We further show that the signal is not exponentially distributed, as is also commonly assumed. Therefore, DFCM has better sensitivity and specificity, as measured by ROC curves and area under the curve (AUC) than MAS 5.0, RMA, RMA with no background correction (RMA-noBG), GCRMA, PLIER, and dChip (MBEI) for preprocessing of Affymetrix microarray data. These results hold for two spike-in data sets and one real data set that were analyzed. Comparisons with other methods on two spike-in data sets and one real data set show that our nonparametric methods are a superior alternative for background correction of Affymetrix data.
The advent of high-throughput next generation sequencing technologies have fostered enormous potential applications of supercomputing techniques in genome sequencing, epi-genetics, metagenomics, personalized medicine, discovery of non-coding RNAs and protein-binding sites. To this end, the 2008 International Conference on Bioinformatics and Computational Biology (Biocomp) – 2008 World Congress on Computer Science, Computer Engineering and Applied Computing (Worldcomp) was designed to promote synergistic inter/multidisciplinary research and education in response to the current research trends and advances. The conference attracted more than two thousand scientists, medical doctors, engineers, professors and students gathered at Las Vegas, Nevada, USA during July 14–17 and received great success. Supported by International Society of Intelligent Biological Medicine (ISIBM), International Journal of Computational Biology and Drug Design (IJCBDD), International Journal of Functional Informatics and Personalized Medicine (IJFIPM) and the leading research laboratories from Harvard, M.I.T., Purdue, UIUC, UCLA, Georgia Tech, UT Austin, U. of Minnesota, U. of Iowa etc, the conference received thousands of research papers. Each submitted paper was reviewed by at least three reviewers and accepted papers were required to satisfy reviewers' comments. Finally, the review board and the committee decided to select only 19 high-quality research papers for inclusion in this supplement to BMC Genomics based on the peer reviews only. The conference committee was very grateful for the Plenary Keynote Lectures given by: Dr. Brian D. Athey (University of Michigan Medical School), Dr. Vladimir N. Uversky (Indiana University School of Medicine), Dr. David A. Patterson (Member of United States National Academy of Sciences and National Academy of Engineering, University of California at Berkeley) and Anousheh Ansari (Prodea Systems, Space Ambassador). The theme of the conference to promote synergistic research and education has been achieved successfully.
BLAST programs are very efficient in finding similarities for sequences. However for large datasets such as ESTs, manual extraction of the information from the batch BLAST output is needed. This can be time consuming, insufficient, and inaccurate. Therefore implementation of a parser application would be extremely useful in extracting information from BLAST outputs.
We have developed a java application, Batch Blast Extractor, with a user friendly graphical interface to extract information from BLAST output. The application generates a tab delimited text file that can be easily imported into any statistical package such as Excel or SPSS for further analysis. For each BLAST hit, the program obtains and saves the essential features from the BLAST output file that would allow further analysis. The program was written in Java and therefore is OS independent. It works on both Windows and Linux OS with java 1.4 and higher. It is freely available from: