Results 1-21 (21)

1.  A Signal Processing Method to Explore Similarity in Protein Flexibility 
Advances in Bioinformatics  2010;2010:454671.
Understanding mechanisms of protein flexibility is of great importance to structural biology. The ability to detect similarities between proteins and their patterns is vital in discovering new information about unknown protein functions. A Distance Constraint Model (DCM) provides a means to generate a variety of flexibility measures based on a given protein structure. Although information about mechanical properties of flexibility is critical for understanding protein function for a given protein, the question of whether certain characteristics are shared across homologous proteins is difficult to assess. For a proper assessment, a quantified measure of similarity is necessary. This paper begins to explore image processing techniques to quantify similarities in signals and images that characterize protein flexibility. The dataset considered here consists of three different families of proteins, with three proteins in each family. The similarities and differences found within flexibility measures across homologous proteins do not align with sequence-based evolutionary methods.
PMCID: PMC3010618  PMID: 21197478
2.  Prediction of Carbohydrate-Binding Proteins from Sequences Using Support Vector Machines 
Advances in Bioinformatics  2010;2010:289301.
Carbohydrate-binding proteins are proteins that can interact with sugar chains but do not modify them. They are involved in many physiological functions, and we have developed a method for predicting them from their amino acid sequences. Our method is based on support vector machines (SVMs). We first clarified the definition of carbohydrate-binding proteins and then constructed positive and negative datasets with which the SVMs were trained. In a leave-one-out test on these datasets, our method achieved an area under the receiver operating characteristic (ROC) curve of 0.92. We also examined two amino acid grouping methods that enable effective learning of sequence patterns and evaluated the performance of these methods. When we applied our method in combination with the homology-based prediction method to the annotated human genome database, H-invDB, we found that the true positive rate of prediction was improved.
PMCID: PMC2948896  PMID: 20936154
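The area under the ROC curve quoted in this abstract can be read as the probability that a randomly chosen positive example scores higher than a randomly chosen negative one. The sketch below computes it that way from held-out scores. It is only an illustration of the metric, not the authors' SVM pipeline; the function name `roc_auc` and the toy labels and scores are assumptions.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison (ties count as 0.5).

    labels: iterable of 0/1 class labels (1 = positive class).
    scores: iterable of classifier scores, higher meaning "more positive".
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    # Hypothetical leave-one-out scores: one score per held-out sequence.
    print(roc_auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1]))
```

In a leave-one-out setting, each score would come from a model trained on all sequences except the one being scored.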
3.  Designing Efficient Spaced Seeds for SOLiD Read Mapping 
Advances in Bioinformatics  2010;2010:708501.
The advent of high-throughput sequencing technologies constituted a major advance in genomic studies, offering new prospects in a wide range of applications. We propose a rigorous and flexible algorithmic solution to mapping SOLiD color-space reads to a reference genome. The solution relies on an advanced method of seed design that combines a faithful probabilistic model of read matches with a novel seeding principle especially adapted to read mapping. Our method can handle both lossy and lossless frameworks and is able to distinguish, at the level of seed design, between SNPs and reading errors. We illustrate our approach by several seed designs and demonstrate their efficiency.
PMCID: PMC2945724  PMID: 20936175
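For readers unfamiliar with spaced seeds, the sketch below shows the basic idea: a seed is a pattern of "care" positions (`1`) and "don't care" positions (`0`), and a hit is reported wherever read and reference agree at every care position. It works on plain nucleotide strings and does not model SOLiD color space, the probabilistic seed design, or the lossy and lossless frameworks described in the abstract. The function name `seed_hits` and the example seed `1101` are assumptions.

```python
def seed_hits(read, ref, seed):
    """Return (read_pos, ref_pos) pairs where the spaced seed matches.

    seed is a string over {'0', '1'}: '1' positions must agree,
    '0' positions are ignored.
    """
    span = len(seed)
    care = [k for k, c in enumerate(seed) if c == "1"]

    # Index every reference window by the characters at the care positions.
    index = {}
    for j in range(len(ref) - span + 1):
        key = "".join(ref[j + k] for k in care)
        index.setdefault(key, []).append(j)

    # Look up every read window in the index.
    hits = []
    for i in range(len(read) - span + 1):
        key = "".join(read[i + k] for k in care)
        for j in index.get(key, []):
            hits.append((i, j))
    return hits


if __name__ == "__main__":
    # The read differs from ref[1:5] at one position, which the seed ignores.
    print(seed_hits("CGAA", "ACGTACGT", "1101"))
    # A contiguous seed of the same span misses this alignment.
    print(seed_hits("CGAA", "ACGTACGT", "1111"))
```

The tolerance for a mismatch at a "don't care" position is what lets a spaced seed find alignments that a contiguous seed of the same span misses.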
4.  Applying Small-Scale DNA Signatures as an Aid in Assembling Soybean Chromosome Sequences 
Advances in Bioinformatics  2010;2010:976792.
Previous work has established a genomic signature based on relative counts of the 16 possible dinucleotides. Until now, it has been generally accepted that the dinucleotide signature is characteristic of a genome and is relatively homogeneous across a genome. However, we found some local regions of the soybean genome with a signature differing widely from that of the rest of the genome. Those regions were mostly centromeric and pericentromeric, and enriched for repetitive sequences. We found that DNA binding energy also presented large-scale patterns across soybean chromosomes. These two patterns were helpful during assembly and quality control of soybean whole genome shotgun scaffold sequences into chromosome pseudomolecules.
PMCID: PMC2933861  PMID: 20827309
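A dinucleotide signature of the kind mentioned in this abstract can be sketched as follows. The abstract only says "relative counts of the 16 possible dinucleotides"; the code assumes the widely used odds-ratio form, in which each dinucleotide frequency is divided by the product of its two mononucleotide frequencies. The function name `dinucleotide_signature` is hypothetical, and strand symmetrization is left out for brevity.

```python
from collections import Counter
from itertools import product

BASES = "ACGT"


def dinucleotide_signature(seq):
    """Return {dinucleotide: f_xy / (f_x * f_y)} for all 16 dinucleotides.

    Assumption: odds-ratio normalization; positions with non-ACGT
    characters (for example N) are skipped.
    """
    seq = seq.upper()
    mono = Counter(b for b in seq if b in BASES)
    pairs = Counter(
        seq[i:i + 2]
        for i in range(len(seq) - 1)
        if seq[i] in BASES and seq[i + 1] in BASES
    )
    n_mono = sum(mono.values())
    n_pairs = sum(pairs.values())

    signature = {}
    for x, y in product(BASES, repeat=2):
        fx = mono[x] / n_mono if n_mono else 0.0
        fy = mono[y] / n_mono if n_mono else 0.0
        fxy = pairs[x + y] / n_pairs if n_pairs else 0.0
        signature[x + y] = fxy / (fx * fy) if fx and fy else 0.0
    return signature


if __name__ == "__main__":
    sig = dinucleotide_signature("ACGTACGT")
    print(sig["AC"], sig["AA"])
```

Computing this signature in windows along a chromosome and comparing each window with the genome-wide values is one way local deviations like those described above could be flagged.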
5.  Modelling Nonstationary Gene Regulatory Processes 
Advances in Bioinformatics  2010;2010:749848.
An important objective in systems biology is to infer gene regulatory networks from postgenomic data, and dynamic Bayesian networks have been widely applied as a popular tool to this end. The standard approach for nondiscretised data is restricted to a linear model and a homogeneous Markov chain. Recently, various generalisations based on changepoint processes and free allocation mixture models have been proposed. The former aim to relax the homogeneity assumption, whereas the latter are more flexible and, in principle, more adequate for modelling nonlinear processes. In our paper, we compare both paradigms and discuss theoretical shortcomings of the latter approach. We show that a model based on the changepoint process yields systematically better results than the free allocation model when inferring nonstationary gene regulatory processes from simulated gene expression time series. We further cross-compare the performance of both models on three biological systems: macrophages challenged with viral infection, circadian regulation in Arabidopsis thaliana, and morphogenesis in Drosophila melanogaster.
PMCID: PMC2913537  PMID: 20721277
6.  A Comprehensive Study of Progressive Cytogenetic Alterations in Clear Cell Renal Cell Carcinoma and a New Model for ccRCC Tumorigenesis and Progression 
Advances in Bioinformatics  2010;2010:428325.
We present a comprehensive study of cytogenetic alterations that occur during the progression of clear cell renal cell carcinoma (ccRCC). We used high-density high-throughput Affymetrix 100 K SNP arrays to obtain the whole genome SNP copy number information from 71 pretreatment tissue samples with RCC tumors; of those, 42 samples were of human ccRCC subtype. We analyzed patterns of cytogenetic loss and gain from different RCC subtypes and in particular, different stages and grades of ccRCC tumors, using a novel algorithm that we have designed. Based on patterns of cytogenetic alterations in chromosomal regions with frequent losses and gains, we inferred the involvement of candidate genes from these regions in ccRCC tumorigenesis and development. We then proposed a new model of ccRCC tumorigenesis and progression. Our study serves as a comprehensive overview of cytogenetic alterations in a collection of 572 ccRCC tumors from diversified studies and should facilitate the search for specific genes associated with the disease.
PMCID: PMC2909727  PMID: 20671976
7.  Finding Biomarker Signatures in Pooled Sample Designs: A Simulation Framework for Methodological Comparisons 
Advances in Bioinformatics  2010;2010:318573.
Detection of discriminating patterns in gene expression data can be accomplished by using various methods of statistical learning. It has been proposed that sample pooling in this context would have negative effects; however, pooling cannot always be avoided. We propose a simulation framework to explicitly investigate the parameters of patterns, experimental design, noise, and choice of method in order to find out which effects on classification performance are to be expected. We use a two-group classification task and simulated gene expression data with independent differentially expressed genes as well as bivariate linear patterns and the combination of both. Our results show a clear increase of prediction error with pool size. For pooled training sets, powered partial least squares discriminant analysis outperforms discriminant analysis, random forests, and support vector machines with linear or radial kernel for two of three simulated scenarios. The proposed simulation approach can be implemented to systematically investigate a number of additional scenarios of practical interest.
PMCID: PMC2909718  PMID: 20671968
8.  PROCARB: A Database of Known and Modelled Carbohydrate-Binding Protein Structures with Sequence-Based Prediction Tools 
Advances in Bioinformatics  2010;2010:436036.
Understanding the three-dimensional structures of proteins that interact with carbohydrates covalently (glycoproteins) as well as noncovalently (protein-carbohydrate complexes) is essential to the study of many biological processes, and these proteins play a significant role in normal and disease-associated functions. It is important to have a central repository of knowledge available about these protein-carbohydrate complexes as well as preprocessed data of predicted structures. Such a repository can be significantly enhanced by de novo tools that predict carbohydrate-binding sites for proteins in the absence of a structure or an experimentally known binding site. PROCARB is an open-access database comprising three independently working components, namely, (i) Core PROCARB module, consisting of three-dimensional structures of protein-carbohydrate complexes taken from Protein Data Bank (PDB), (ii) Homology Models module, consisting of manually developed three-dimensional models of N-linked and O-linked glycoproteins of unknown three-dimensional structure, and (iii) CBS-Pred prediction module, consisting of web servers to predict carbohydrate-binding sites using single sequence or server-generated PSSM. Several precomputed structural and functional properties of complexes are also included in the database for quick analysis. In particular, information about function, secondary structure, solvent accessibility, hydrogen bonds, literature references, and so forth, is included. In addition, each protein in the database is mapped to Uniprot, Pfam, PDB, and so forth.
PMCID: PMC2909730  PMID: 20671979
9.  A Topological Description of Hubs in Amino Acid Interaction Networks 
Advances in Bioinformatics  2010;2010:257512.
We represent proteins by amino acid interaction networks. Such a network is a graph whose vertices are the protein's amino acids and whose edges are the interactions between them. Once we have compared this type of graph to the general model of scale-free networks, we analyze the existence of highly interacting nodes, the hubs. We describe these nodes taking into account their position in the primary structure to study their frequency of occurrence in the folded proteins. Finally, we observe that their interaction level is a consequence of the general rules which govern the folding process.
PMCID: PMC2877201  PMID: 20585353
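The graph described in this abstract can be sketched in a few lines. The abstract does not say how an "interaction" is defined, so the code assumes a simple distance cutoff between one representative coordinate per residue (for example the C-alpha atom) and treats any residue with at least `min_degree` neighbours as a hub. The cutoff value, the threshold, and the function names are all assumptions.

```python
from math import dist


def interaction_network(coords, cutoff=7.0):
    """Build an adjacency map {residue_index: set(neighbour_indices)}.

    coords: list of (x, y, z) tuples, one per residue in sequence order.
    Two residues interact if their distance is at most `cutoff`
    (an assumed criterion, in the same units as the coordinates).
    """
    n = len(coords)
    adj = {i: set() for i in range(n)}
    for i in range(n):
        for j in range(i + 1, n):
            if dist(coords[i], coords[j]) <= cutoff:
                adj[i].add(j)
                adj[j].add(i)
    return adj


def hubs(adj, min_degree):
    """Return residue indices whose degree is at least min_degree."""
    return sorted(i for i, nb in adj.items() if len(nb) >= min_degree)


if __name__ == "__main__":
    # Toy coordinates: residue 0 sits close to residues 1-3; residue 4 is far away.
    coords = [(0, 0, 0), (5, 0, 0), (0, 5, 0), (0, 0, 5), (30, 0, 0)]
    adj = interaction_network(coords)
    print(hubs(adj, min_degree=3))
```

Because the indices follow sequence order, the returned hub positions can be related directly to the primary structure.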
10.  EREM: Parameter Estimation and Ancestral Reconstruction by Expectation-Maximization Algorithm for a Probabilistic Model of Genomic Binary Characters Evolution 
Advances in Bioinformatics  2010;2010:167408.
Evolutionary binary characters are features of species or genes, indicating the absence (value zero) or presence (value one) of some property. Examples include eukaryotic gene architecture (the presence or absence of an intron in a particular locus), gene content, and morphological characters. In many studies, the acquisition of such binary characters is assumed to represent a rare evolutionary event, and consequently, their evolution is analyzed using various flavors of parsimony. However, when gain and loss of the character are not rare enough, a probabilistic analysis becomes essential. Here, we present a comprehensive probabilistic model to describe the evolution of binary characters on a bifurcating phylogenetic tree. A fast software tool, EREM, is provided, using maximum likelihood to estimate the parameters of the model and to reconstruct ancestral states (presence and absence in internal nodes) and events (gain and loss events along branches).
PMCID: PMC2866244  PMID: 20467467
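As a point of reference for the parsimony analyses that this abstract contrasts with its probabilistic model, the sketch below applies Fitch's small-parsimony rule to one binary character on a bifurcating tree and counts the minimum number of state changes. This is not the EREM algorithm, which uses expectation-maximization. The nested-tuple tree encoding and the function name `fitch` are assumptions.

```python
def fitch(tree):
    """Fitch small parsimony for a single binary character.

    tree: either a leaf state (0 or 1) or a pair (left_subtree, right_subtree).
    Returns (possible_states_at_this_node, minimum_number_of_changes_below).
    """
    if isinstance(tree, int):
        return {tree}, 0
    left, right = tree
    left_states, left_changes = fitch(left)
    right_states, right_changes = fitch(right)
    shared = left_states & right_states
    if shared:
        # Children agree on at least one state: no extra change is needed.
        return shared, left_changes + right_changes
    # Children disagree: take the union and count one change.
    return left_states | right_states, left_changes + right_changes + 1


if __name__ == "__main__":
    # Presence (1) / absence (0) of a character in five leaves.
    states, changes = fitch(((1, 1), (0, (0, 1))))
    print(states, changes)
```

Parsimony of this kind treats every change as equally costly and rare, which is the assumption the abstract says breaks down when gains and losses are frequent.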
11.  Adaptive Evolution Hotspots at the GC-Extremes of the Human Genome: Evidence for Two Functionally Distinct Pathways of Positive Selection 
Advances in Bioinformatics  2010;2010:856825.
We recently reported that the human genome is “splitting” into two gene subgroups characterised by polarised GC content (Tang et al., 2007), and that such evolutionary change may be accelerated by programmed genetic instability (Zhao et al., 2008). Here we extend this work by mapping the presence of two separate high-evolutionary-rate (Ka/Ks) hotspots in the human genome: one characterized by low GC content, high intron length, and low gene expression, and the other by high GC content, high exon number, and high gene expression. This finding suggests that at least two different mechanisms mediate adaptive genetic evolution in higher organisms: (1) intron lengthening and reduced repair in hypermethylated lowly-transcribed genes, and (2) duplication and/or insertion events affecting highly-transcribed genes, creating low-essentiality satellite daughter genes in nearby regions of active chromatin. Since the latter mechanism is expected to be far more efficient than the former in generating variant genes that increase fitness, these results also provide a potential explanation for the controversial value of sequence analysis in defining positively selected genes.
PMCID: PMC2862947  PMID: 20454629
12.  Protein Bioinformatics Infrastructure for the Integration and Analysis of Multiple High-Throughput “omics” Data 
Advances in Bioinformatics  2010;2010:423589.
High-throughput “omics” technologies bring new opportunities for biological and biomedical researchers to ask complex questions and gain new scientific insights. However, the voluminous, complex, and context-dependent data being maintained in heterogeneous and distributed environments, together with the lack of well-defined data standards and standardized nomenclature, impose a major challenge that requires advanced computational methods and bioinformatics infrastructures for integration, mining, visualization, and comparative analysis to facilitate data-driven hypothesis generation and biological knowledge discovery. In this paper, we present the challenges in high-throughput “omics” data integration and analysis, introduce a protein-centric approach for systems integration of large and heterogeneous high-throughput “omics” data including microarray, mass spectrometry, protein sequence, protein structure, and protein interaction data, and use a scientific case study to illustrate how one can use varied “omics” data from different laboratories to make useful connections that could lead to new biological knowledge.
PMCID: PMC2847380  PMID: 20369061
13.  Testing the Coding Potential of Conserved Short Genomic Sequences 
Advances in Bioinformatics  2010;2010:287070.
Proposed is a procedure to test whether a genomic sequence contains coding DNA, called a coding potential region. The procedure tests the coding potential of conserved short genomic sequences, in which the assumptions on the probability models of gene structures are relaxed. Thus, it is expected to add candidate regions that contain coding DNA to the current genomic database. The procedure was applied to the set of highly conserved human-mouse sequences in the genome database at the University of California at Santa Cruz. For sequences containing RefSeq coding exons, the procedure detected 91.3% of the regions having coding potential in this set, which covers 83% of the human RefSeq coding exons, at a 2.6% false positive rate. The procedure detected 12,688 novel short regions with coding potential at a false discovery rate <0.05; 65.7% of the novel regions are between annotated genes.
PMCID: PMC2834954  PMID: 20224812
14.  Network Properties for Ranking Predicted miRNA Targets in Breast Cancer 
Advances in Bioinformatics  2010;2009:182689.
MicroRNAs control the expression of their target genes by translational repression and transcriptional cleavage. They are involved in various biological processes including development and progression of cancer. To uncover the biological role of miRNAs it is important to identify their target genes. The small number of experimentally validated target genes makes computer prediction methods very important. However, state-of-the-art prediction tools result in a great number of putative targets with an unpredictable number of false positives. In this paper, we propose and evaluate two approaches for ranking the biological relevance of putative targets of miRNAs which are associated with breast cancer.
PMCID: PMC2833297  PMID: 20224638
15.  Pathway-Based Feature Selection Algorithm for Cancer Microarray Data 
Advances in Bioinformatics  2010;2009:532989.
Classification of cancers based on gene expression produces better accuracy than classification based on clinical markers. Feature selection improves the accuracy of these classification algorithms by reducing the chance of overfitting that arises from a large number of features. We develop a new feature selection method called Biological Pathway-based Feature Selection (BPFS) for microarray data. Unlike most of the existing methods, our method integrates signaling and gene regulatory pathways with gene expression data to minimize the chance of overfitting of the method and to improve the test accuracy. Thus, BPFS selects a biologically meaningful feature set that is minimally redundant. Our experiments on published breast cancer datasets demonstrate that all of the top 20 genes found by our method are associated with cancer. Furthermore, the classification accuracy of our signature is up to 18% better than that of van't Veer's 70-gene signature, and up to 8% better than that of the best published feature selection method, I-RELIEF.
PMCID: PMC2831238  PMID: 20204186
16.  Evolution and Diversity of the Human Hepatitis D Virus Genome 
Advances in Bioinformatics  2010;2010:323654.
Human hepatitis delta virus (HDV) is the RNA virus with the smallest genome. The HDV genome is divided into a viroid-like sequence and a protein-coding sequence, which could have originated from different sources, and the genome was eventually constituted through RNA recombination. The genome subsequently diversified through accumulation of mutations selected by interactions of the mutated RNA and proteins with host factors to successfully form infectious virions. Therefore, we propose that the conservation of the HDV nucleotide sequence is highly related to its functionality. Genome analysis of known HDV isolates shows that the C-terminal coding sequences of the large delta antigen (LDAg) are more diverse than other regions of the protein-coding sequences, yet variants that retain the biological functionality of interacting with the heavy chain of clathrin can be selected and maintained. Since viruses interact with many host factors, including those involved in escaping the host immune response, designing a program to predict RNA genome evolution is a highly challenging task.
PMCID: PMC2829689  PMID: 20204073
17.  Accurate and Scalable Techniques for the Complex/Pathway Membership Problem in Protein Networks 
Advances in Bioinformatics  2010;2009:787128.
A protein network shows physical interactions as well as functional associations. An important usage of such networks is to discover unknown members of partially known complexes and pathways. A number of methods exist for such analyses, and they can be divided into two main categories based on their treatment of highly connected proteins. In this paper, we show that methods that are not affected by the degree (number of linkages) of a protein give more accurate predictions for certain complexes and pathways. We propose a network flow-based technique to compute the association probability of a pair of proteins. We extend the proposed technique using hierarchical clustering in order to scale well with the size of proteome. We also show that top-k queries are not suitable for a large number of cases, and threshold queries are more meaningful in these cases. Network flow technique with clustering is able to optimize meaningful threshold queries and answer them with high efficiency compared to a similar method that uses Monte Carlo simulation.
PMCID: PMC2826754  PMID: 20182643
19.  Algorithmic Assessment of Vaccine-Induced Selective Pressure and Its Implications on Future Vaccine Candidates 
Advances in Bioinformatics  2010;2010:178069.
Posttrial assessment of a vaccine's selective pressure on infecting strains may be realized through a bioinformatic tool such as parsimony phylogenetic analysis. Following a failed gonococcal pilus vaccine trial of Neisseria gonorrhoeae, we conducted a phylogenetic analysis of pilin DNA and predicted peptide sequences from clinical isolates to assess the extent of the vaccine's effect on the type of field strains that the volunteers contracted. Amplified pilin DNA sequences from infected vaccinees, placebo recipients, and vaccine specimens were phylogenetically analyzed. Cladograms show that the vaccine peptides have diverged substantially from their paternal isolate by clustering distantly from each other. Pilin genes of the field clinical isolates were heterogeneous, and their peptides produced clades comprised of vaccinated and placebo recipients' strains indicating that the pilus vaccine did not exert any significant selective pressure on gonorrhea field strains. Furthermore, sequences of the semivariable and hypervariable regions pointed out heterotachous rates of mutation and substitution.
PMCID: PMC2817498  PMID: 20150957
20.  Synonymous Codon Usage Analysis of Thirty Two Mycobacteriophage Genomes 
Advances in Bioinformatics  2010;2009:316936.
Synonymous codon usage of protein-coding genes of thirty-two completely sequenced mycobacteriophage genomes was studied using multivariate statistical analysis. One of the major factors influencing codon usage is identified to be compositional bias. Codons ending with either C or G are preferred in highly expressed genes, among which C-ending codons are highly preferred over G-ending codons. A strong negative correlation between effective number of codons (Nc) and GC3s content was also observed, showing that the codon usage was affected by gene nucleotide composition. Translational selection, operating at the level of translational accuracy, is also identified to play a role in shaping the codon usage. A high level of heterogeneity is seen within and between the genomes. Length of genes is also identified to influence the codon usage in 11 out of 32 phage genomes. Mycobacteriophage Cooper is identified to be the highly biased genome with better translation efficiency comparing well with the host specific tRNA genes.
PMCID: PMC2817497  PMID: 20150956
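The GC3s quantity referred to in this abstract can be illustrated with a short sketch. It assumes a commonly used definition: the fraction of codons with G or C at the third position, counted only over codons that have synonymous alternatives, so that Met (ATG), Trp (TGG), and the three stop codons are excluded. The function name `gc3s` and the example sequence are hypothetical, and the abstract does not say which software the authors used.

```python
# Codons without synonymous alternatives (Met, Trp) and stop codons,
# under the standard genetic code; excluded by the assumed definition.
EXCLUDED = {"ATG", "TGG", "TAA", "TAG", "TGA"}


def gc3s(cds):
    """Fraction of synonymous codons ending in G or C.

    cds: in-frame coding sequence (DNA). Trailing partial codons and
    codons containing non-ACGT characters are ignored.
    """
    cds = cds.upper()
    total = 0
    gc_ending = 0
    for i in range(0, len(cds) - 2, 3):
        codon = cds[i:i + 3]
        if codon in EXCLUDED or any(b not in "ACGT" for b in codon):
            continue
        total += 1
        if codon[2] in "GC":
            gc_ending += 1
    return gc_ending / total if total else 0.0


if __name__ == "__main__":
    # ATG and TAA are skipped; GCC and AAG end in C/G, GCA does not.
    print(gc3s("ATGGCCGCAAAGTAA"))
```

Plotting this value against Nc for each gene is the usual way to inspect the kind of correlation the abstract reports.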
21.  Tree-Based Methods for Discovery of Association between Flow Cytometry Data and Clinical Endpoints 
Advances in Bioinformatics  2010;2009:235320.
We demonstrate the application and comparative interpretations of three tree-based algorithms for the analysis of data arising from flow cytometry: classification and regression trees (CARTs), random forests (RFs), and logic regression (LR). Specifically, we consider the question of what best predicts CD4 T-cell recovery in HIV-1-infected persons starting antiretroviral therapy with CD4 count between 200 and 350 cells/μL. A comparison to a more standard contingency table analysis is provided. While contingency table analysis and RFs provide information on the importance of each potential predictor variable, CART and LR offer additional insight into the combinations of variables that together are predictive of the outcome. In all cases considered, baseline CD3-DR-CD56+CD16+ emerges as an important predictor variable, while the tree-based approaches identify additional variables as potentially informative. Application of tree-based methods to our data suggests that a combination of baseline immune activation states, with emphasis on CD8 T-cell activation, may be a better predictor than any single T-cell/innate cell subset analyzed. Taken together, we show that tree-based methods can be successfully applied to flow cytometry data to better inform and discover associations that may not emerge in the context of a univariate analysis.
PMCID: PMC2817388  PMID: 20145719
