Background
The computational identification of functional transcription factor binding sites (TFBSs) remains a major challenge of computational biology.
Results
We have analyzed the conserved promoter sequences for the complete set of human RefSeq genes using our conserved transcription factor binding site (CONFAC) software. CONFAC identified 16296 human-mouse ortholog gene pairs, and of those pairs, 9107 genes contained conserved TFBS in the 3 kb proximal promoter and first intron. To attempt to predict in vivo occupancy of transcription factor binding sites, we developed a novel marginal effect isolator algorithm that builds upon Bayesian methods for multigroup TFBS filtering and predicted the in vivo occupancy of two transcription factors with an overall accuracy of 84%.
Conclusion
Our analyses show that integration of chromatin immunoprecipitation data with conserved TFBS analysis can be used to generate accurate predictions of functional TFBS. They also show that TFBS cooccurrence can be used to predict transcription factor binding to promoters in vivo.
doi:10.1155/2008/369830
PMCID: PMC2768302
PMID: 19865592
Background. The computational identification of functional transcription factor binding sites (TFBSs) remains a major challenge of computational biology.
Results.
We have analyzed the conserved promoter sequences for the complete set of human RefSeq genes using our conserved transcription factor binding site (CONFAC) software. CONFAC identified 16296 human-mouse ortholog gene pairs, and of those pairs, 9107 genes contained conserved TFBS in the 3 kb proximal promoter and first intron. To attempt to predict in vivo occupancy of transcription factor binding sites, we developed a novel marginal effect isolator algorithm that builds upon Bayesian methods for multigroup TFBS filtering and predicted the in vivo occupancy of two transcription factors with an overall accuracy of 84%.
Conclusion. Our analyses show that integration of chromatin immunoprecipitation data with conserved TFBS analysis can be used to generate accurate predictions of functional TFBS. They also show that TFBS cooccurrence can be used to predict transcription factor binding to promoters in vivo.
doi:10.1155/2008/369830
PMCID: PMC2768302
PMID: 19865592
The large influx of data from high-throughput genomic and proteomic technologies has encouraged the researchers to seek approaches for understanding the structure of gene regulatory networks and proteomic networks. This work reviews some of the most important statistical methods used for modeling of gene regulatory networks (GRNs) and protein-protein interaction (PPI) networks. The paper focuses on the recent advances in the statistical graphical modeling techniques, state-space representation models, and information theoretic methods that were proposed for inferring the topology of GRNs. It appears that the problem of inferring the structure of PPI networks is quite different from that of GRNs. Clustering and probabilistic graphical modeling techniques are of prime importance in the statistical inference of PPI networks, and some of the recent approaches using these techniques are also reviewed in this paper. Performance evaluation criteria for the approaches used for modeling GRNs and PPI networks are also discussed.
doi:10.1155/2013/953814
PMCID: PMC3594945
In this paper we present a new ab initio approach for constructing an unrooted dendrogram using protein clusters, an approach that has the potential for estimating relationships among several thousands of species based on their putative proteomes. We employ an open-source software program called pClust that was developed for use in metagenomic studies. Sequence alignment is performed by pClust using the Smith-Waterman algorithm, which is known to give optimal alignment and, hence, greater accuracy than BLAST-based methods. Protein clusters generated by pClust are used to create protein profiles for each species in the dendrogram, these profiles forming a correlation filter library for use with a new taxon. To augment the dendrogram with a new taxon, a protein profile for the taxon is created using BLASTp, and this new taxon is placed into a position within the dendrogram corresponding to the highest correlation with profiles in the correlation filter library. This work was initiated because of our interest in plasmids, and each step is illustrated using proteomes from Gram-negative bacterial plasmids. Proteomes for 527 plasmids were used to generate the dendrogram, and to demonstrate the utility of the insertion algorithm twelve recently sequenced pAKD plasmids were used to augment the dendrogram.
doi:10.1155/2013/191586
PMCID: PMC3590580
Solving some mathematical problems such as NP-complete problems by conventional silicon-based computers is problematic and takes so long time. DNA computing is an alternative method of computing which uses DNA molecules for computing purposes. DNA computers have massive degrees of parallel processing capability. The massive parallel processing characteristic of DNA computers is of particular interest in solving NP-complete and hard combinatorial problems. NP-complete problems such as knapsack problem and other hard combinatorial problems can be easily solved by DNA computers in a very short period of time comparing to conventional silicon-based computers. Sticker-based DNA computing is one of the methods of DNA computing. In this paper, the sticker based DNA computing was used for solving the 0/1 knapsack problem. At first, a biomolecular solution space was constructed by using appropriate DNA memory complexes. Then, by the application of a sticker-based parallel algorithm using biological operations, knapsack problem was resolved in polynomial time.
doi:10.1155/2013/341419
PMCID: PMC3588402
Quantitative proteomics applications in mass spectrometry depend on the knowledge of the mass-to-charge ratio (m/z) values of proteotypic peptides for the proteins under study and their product ions. MRMPath and MRMutation, web-based bioinformatics software that are platform independent, facilitate the recovery of this information by biologists. MRMPath utilizes publicly available information related to biological pathways in the Kyoto Encyclopedia of Genes and Genomes (KEGG) database. All the proteins involved in pathways of interest are recovered and processed in silico to extract information relevant to quantitative mass spectrometry analysis. Peptides may also be subjected to automated BLAST analysis to determine whether they are proteotypic. MRMutation catalogs and makes available, following processing, known (mutant) variants of proteins from the current UniProtKB database. All these results, available via the web from well-maintained, public databases, are written to an Excel spreadsheet, which the user can download and save. MRMPath and MRMutation can be freely accessed. As a system that seeks to allow two or more resources to interoperate, MRMPath represents an advance in bioinformatics tool development. As a practical matter, the MRMPath automated approach represents significant time savings to researchers.
doi:10.1155/2013/527295
PMCID: PMC3570921
PMID: 23424586
The long-range interactions, required to the accurate predictions of tertiary structures of β-sheet-containing proteins, are still difficult to simulate. To remedy this problem and to facilitate β-sheet structure predictions, many efforts have been made by computational methods. However, known efforts on β-sheets mainly focus on interresidue contacts or amino acid partners. In this study, to go one step further, we studied β-sheets on the strand level, in which a statistical analysis was made on the terminal extensions of paired β-strands. In most cases, the two paired β-strands have different lengths, and terminal extensions exist. The terminal extensions are the extended part of the paired strands besides the common paired part. However, we found that the best pairing required a terminal alignment, and β-strands tend to pair to make bigger common parts. As a result, 96.97% of β-strand pairs have a ratio of 25% of the paired common part to the whole length. Also 94.26% and 95.98% of β-strand pairs have a ratio of 40% of the paired common part to the length of the two β-strands, respectively. Interstrand register predictions by searching interacting β-strands from several alternative offsets should comply with this rule to reduce the computational searching space to improve the performances of algorithms.
doi:10.1155/2013/909436
PMCID: PMC3569888
PMID: 23424587
doi:10.1155/2013/320436
PMCID: PMC3556398
PMID: 23365570
Background. HNF-1a is a transcription factor that regulates glucose metabolism by expression in various tissues. Aim. To dock potential ligands of HNF-1a using docking software in silico. Methods. We performed in silico studies using HNF-1a protein 2GYP·pdb and the following softwares: ISIS/Draw 2.5SP4, ARGUSLAB 4.0.1, and HEX5.1. Observations. The docking distances (in angstrom units: 1 angstrom unit (Å) = 0.1 nanometer or 1 × 10−10 metres) with ligands in decreasing order are as follows: resveratrol (3.8 Å), aspirin (4.5 Å), stearic acid (4.9 Å), retinol (6.0 Å), nitrazepam (6.8 Å), ibuprofen (7.9 Å), azulfidine (9.0 Å), simvastatin (9.0 Å), elaidic acid (10.1 Å), and oleic acid (11.6 Å). Conclusion. HNF-1a domain interacted most closely with resveratrol and aspirin
doi:10.1155/2012/705435
PMCID: PMC3535823
PMID: 23316227
Recent years have shown a gradual shift in the content of biomedical publications that is freely accessible, from titles and abstracts to full text. This has enabled new forms of automatic text analysis and has given rise to some interesting questions: How informative is the abstract compared to the full-text? What important information in the full-text is not present in the abstract? What should a good summary contain that is not already in the abstract? Do authors and peers see an article differently? We answer these questions by comparing the information content of the abstract to that in citances—sentences containing citations to that article. We contrast the important points of an article as judged by its authors versus as seen by peers. Focusing on the area of molecular interactions, we perform manual and automatic analysis, and we find that the set of all citances to a target article not only covers most information (entities, functions, experimental methods, and other biological concepts) found in its abstract, but also contains 20% more concepts. We further present a detailed summary of the differences across information types, and we examine the effects other citations and time have on the content of citances.
doi:10.1155/2012/750214
PMCID: PMC3514807
PMID: 23227044
Heart murmurs are the first signs of cardiac valve disorders. Several studies have been conducted in recent years to automatically differentiate normal heart sounds, from heart sounds with murmurs using various types of audio features. Entropy was successfully used as a feature to distinguish different heart sounds. In this paper, new entropy was introduced to analyze heart sounds and the feasibility of using this entropy in classification of five types of heart sounds and murmurs was shown. The entropy was previously introduced to analyze mammograms. Four common murmurs were considered including aortic regurgitation, mitral regurgitation, aortic stenosis, and mitral stenosis. Wavelet packet transform was employed for heart sound analysis, and the entropy was calculated for deriving feature vectors. Five types of classification were performed to evaluate the discriminatory power of the generated features. The best results were achieved by BayesNet with 96.94% accuracy. The promising results substantiate the effectiveness of the proposed wavelet packet entropy for heart sounds classification.
doi:10.1155/2012/327269
PMCID: PMC3512213
PMID: 23227043
B-cell epitope prediction aims to aid the design of peptide-based immunogens (e.g., vaccines) for eliciting antipeptide antibodies that protect against disease, but such antibodies fail to confer protection and even promote disease if they bind with low affinity. Hence, the Immune Epitope Database (IEDB) was searched to obtain published thermodynamic and kinetic data on binding interactions of antipeptide antibodies. The data suggest that the affinity of the antibodies for their immunizing peptides appears to be limited in a manner consistent with previously proposed kinetic constraints on affinity maturation in vivo and that cross-reaction of the antibodies with proteins tends to occur with lower affinity than the corresponding reaction of the antibodies with their immunizing peptides. These observations better inform B-cell epitope prediction to avoid overestimating the affinity for both active and passive immunization; whereas active immunization is subject to limitations of affinity maturation in vivo and of the capacity to accumulate endogenous antibodies, passive immunization may transcend such limitations, possibly with the aid of artificial affinity-selection processes and of protein engineering. Additionally, protein disorder warrants further investigation as a possible supplementary criterion for B-cell epitope prediction, where such disorder obviates thermodynamically unfavorable protein structural adjustments in cross-reactions between antipeptide antibodies and proteins.
doi:10.1155/2012/346765
PMCID: PMC3505629
PMID: 23209458
Atherosclerosis is a multifactorial disease involving a lot of genes and proteins recruited throughout its manifestation. The present study aims to exploit bioinformatic tools in order to analyze microarray data of atherosclerotic aortic lesions of ApoE knockout mice, a model widely used in atherosclerosis research. In particular, a dynamic analysis was performed among young and aged animals, resulting in a list of 852 significantly altered genes. Pathway analysis indicated alterations in critical cellular processes related to cell communication and signal transduction, immune response, lipid transport, and metabolism. Cluster analysis partitioned the significantly differentiated genes in three major clusters of similar expression profile. Promoter analysis applied to functional related groups of the same cluster revealed shared putative cis-elements potentially contributing to a common regulatory mechanism. Finally, by reverse engineering the functional relevance of differentially expressed genes with specific cellular pathways, putative genes acting as hubs, were identified, linking functionally disparate cellular processes in the context of traditional molecular description.
doi:10.1155/2012/453513
PMCID: PMC3502768
PMID: 23193398
The problems of modeling and intervention of biological phenomena have captured the interest of many researchers in the past few decades. The aim of the therapeutic intervention strategies is to move an undesirable state of a diseased network towards a more desirable one. Such an objective can be achieved by the application of drugs to act on some genes/metabolites that experience the undesirable behavior. For the purpose of design and analysis of intervention strategies, mathematical models that can capture the complex dynamics of the biological systems are needed. S-systems, which offer a good compromise between accuracy and mathematical flexibility, are a promising framework for modeling the dynamical behavior of biological phenomena. Due to the complex nonlinear dynamics of the biological phenomena represented by S-systems, nonlinear intervention schemes are needed to cope with the complexity of the nonlinear S-system models. Here, we present an intervention technique based on feedback linearization for biological phenomena modeled by S-systems. This technique is based on perfect knowledge of the S-system model. The proposed intervention technique is applied to the glycolytic-glycogenolytic pathway, and simulation results presented demonstrate the effectiveness of the proposed technique.
doi:10.1155/2012/534810
PMCID: PMC3502753
PMID: 23209459
The cell is a highly organized system of interacting molecules including proteins, mRNAs, and miRNAs. Analyzing the cell
from a systems perspective by integrating different types of data helps revealing the complexity of diseases. Although there is
emerging evidence that microRNAs have a functional role in cancer, the role of microRNAs in mediating cancer progression
and metastasis remains not fully explored. As the amount of available miRNA and mRNA gene expression data grows, more
systematic methods combining gene expression and biological networks become necessary to explore miRNA function. In this work
I integrated functional miRNA-target interactions with mRNA and miRNA expression to infer mRNA-mediated miRNA-miRNA
interactions. The inferred network represents miRNA modulation through common targets. The network is used to characterize
the functional role of microRNA response element (MRE) to mediate interactions between miRNAs targeting the MRE. Results
revealed that miRNA-1 is a key player in regulating prostate cancer progression. 11 miRNAs were identified as diagnostic and
prognostic biomarkers that act as tumor suppressor miRNAs. This work demonstrates the utility of a network analysis as opposed
to differential expression to find important miRNAs that regulate prostate cancer.
doi:10.1155/2012/839837
PMCID: PMC3502784
PMID: 23193399
Trypanosoma brucei is a protozoan parasite of major of interest in discovering new genes for drug targets. This parasite alternates its life cycle between the mammal host(s) (bloodstream form) and the insect vector (procyclic form), with two divergent glucose metabolism amenable to in vitro culture. While the metabolic network of the bloodstream forms has been well characterized, the flux distribution between the different branches of the glucose metabolic network in the procyclic form has not been addressed so far. We present a computational analysis (called Metaboflux) that exploits the metabolic topology of the procyclic form, and allows the incorporation of multipurpose experimental data to increase the biological relevance of the model. The alternatives resulting from the structural complexity of networks are formulated as an optimization problem solved by a metaheuristic where experimental data are modeled in a multiobjective function.
Our results show that the current metabolic model is in agreement with experimental data and confirms the observed high metabolic flexibility of glucose metabolism. In addition, Metaboflux offers a rational explanation for the high flexibility in the ratio between final products from glucose metabolism, thsat is, flux redistribution through the malic enzyme steps.
doi:10.1155/2012/159423
PMCID: PMC3477527
PMID: 23097667
Computational approaches to the disulphide bonding state and its connectivity pattern prediction are based on various descriptors. One descriptor is the amino acid sequence motifs flanking the cysteine residue motifs. Despite the existence of disulphide bonding information in many databases and applications, there is no complete reference and motif query available at the moment. Cysteine motif database (CMD) is the first online resource that stores all cysteine residues, their flanking motifs with their secondary structure, and propensity values assignment derived from the laboratory data. We extracted more than 3 million cysteine motifs from PDB and UniProt data, annotated with secondary structure assignment, propensity value assignment, and frequency of occurrence and coefficiency of their bonding status. Removal of redundancies generated 15875 unique flanking motifs that are always bonded and 41577 unique patterns that are always nonbonded. Queries are based on the protein ID, FASTA sequence, sequence motif, and secondary structure individually or in batch format using the provided APIs that allow remote users to query our database via third party software and/or high throughput screening/querying. The CMD offers extensive information about the bonded, free cysteine residues, and their motifs that allows in-depth characterization of the sequence motif composition.
doi:10.1155/2012/849830
PMCID: PMC3474208
PMID: 23091487
Reliable identification of copy number aberrations (CNA) from comparative genomic hybridization data would be improved by the availability of a generalised method for processing large datasets. To this end, we developed swatCGH, a data analysis framework and region detection heuristic for computational grids. swatCGH analyses sequentially displaced (sliding) windows of neighbouring probes and applies adaptive thresholds of varying stringency to identify the 10% of each chromosome that contains the most frequently occurring CNAs. We used the method to analyse a published dataset, comparing data preprocessed using four different DNA segmentation algorithms, and two methods for prioritising the detected CNAs. The consolidated list of the most commonly detected aberrations confirmed the value of swatCGH as a simplified high-throughput method for identifying biologically significant CNA regions of interest.
doi:10.1155/2012/876976
PMCID: PMC3449101
PMID: 23008709
Constraint-based metabolic models are currently the most comprehensive system-wide models of cellular metabolism. Several challenges arise when building an in silico constraint-based model of an organism that need to be addressed before flux balance analysis (FBA) can be applied for simulations. An algorithm called FBA-Gap is presented here that aids the construction of a working model based on plausible modifications to a given list of reactions that are known to occur in the organism. When applied to a working model, the algorithm gives a hypothesis concerning a minimal medium for sustaining the cell in culture. The utility of the algorithm is demonstrated in creating a new model organism and is applied to four existing working models for generating hypotheses about culture media. In modifying a partial metabolic reconstruction so that biomass may be produced using FBA, the proposed method is more efficient than a previously proposed method in that fewer new reactions are added to complete the model. The proposed method is also more accurate than other approaches in that only biologically plausible reactions and exchange reactions are used.
doi:10.1155/2012/323472
PMCID: PMC3444828
PMID: 22997515
Lattice models are a common abstraction used in the study of protein structure, folding, and refinement. They are advantageous because the discretisation of space can make extensive protein evaluations computationally feasible. Various approaches to the protein chain lattice fitting problem have been suggested but only a single backbone-only tool is available currently. We introduce LatFit, a new tool to produce high-accuracy lattice protein models. It generates both backbone-only and backbone-side-chain models in any user defined lattice. LatFit implements a new distance RMSD-optimisation fitting procedure in addition to the known coordinate RMSD method. We tested LatFit's accuracy and speed using a large nonredundant set of high resolution proteins (SCOP database) on three commonly used lattices: 3D cubic, face-centred cubic, and knight's walk. Fitting speed compared favourably to other methods and both backbone-only and backbone-side-chain models show low deviation from the original data (~1.5 Å RMSD in the FCC lattice). To our knowledge this represents the first comprehensive study of lattice quality for on-lattice protein models including side chains while LatFit is the only available tool for such models.
doi:10.1155/2012/148045
PMCID: PMC3426164
PMID: 22934109
The nucleotide sequences complexity in chromosome 3 of Caenorhabditis elegans (C. elegans) is studied. The complexity of these sequences is compared with some random sequences. Moreover, by using some parameters related to complexity such as fractal dimension and frequency, indicator matrix is given a first classification of sequences of C. elegans. In particular, the sequences with highest and lowest fractal value are singled out. It is shown that the intrinsic nature of the low fractal dimension sequences has many common features with the random sequences.
doi:10.1155/2012/287486
PMCID: PMC3418640
PMID: 22919380
Gene alterations are a major component of the landscape of tumor genomes. To assess the significance of these alterations in the development of prostate cancer, it is necessary to identify these alterations and analyze them from systems biology perspective. Here, we present a new method (EigFusion) for predicting outlier genes with potential gene rearrangement. EigFusion demonstrated excellent performance in identifying outlier genes with potential rearrangement by testing it to synthetic and real data to evaluate performance. EigFusion was able to identify previously unrecognized genes such as FABP5 and KCNH8 and confirmed their association with primary and metastatic prostate samples while confirmed the metastatic specificity for other genes such as PAH, TOP2A, and SPINK1. We performed protein network based approaches to analyze the network context of potential rearranged genes. Functional gene rearrangement Modules are constructed by integrating functional protein networks. Rearranged genes showed to be highly connected to well-known altered genes in cancer such as AR, RB1, MYC, and BRCA1. Finally, using clinical outcome data of prostate cancer patients, potential rearranged genes demonstrated significant association with prostate cancer specific death.
doi:10.1155/2012/373506
PMCID: PMC3394389
PMID: 22811706
The world has widely changed in terms of communicating, acquiring, and storing information. Hundreds of millions of people are involved in information retrieval tasks on a daily basis, in particular while using a Web search engine or searching their e-mail, making such field the dominant form of information access, overtaking traditional database-style searching. How to handle this huge amount of information has now become a challenging issue. In this paper, after recalling the main topics concerning information retrieval, we present a survey on the main works on literature retrieval and mining in bioinformatics. While claiming that information retrieval approaches are useful in bioinformatics tasks, we discuss some challenges aimed at showing the effectiveness of these approaches applied therein.
doi:10.1155/2012/573846
PMCID: PMC3388278
PMID: 22778730
Technological advancements in the field of genetics have led not only to an abundance of experimental data, but also caused an exponential increase of the number of published biomolecular studies. Text mining is widely accepted as a promising technique to help researchers in the life sciences deal with the amount of available literature. This paper presents a freely available web application built on top of 21.3 million detailed biomolecular events extracted from all PubMed abstracts. These text mining results were generated by a state-of-the-art event extraction system and enriched with gene family associations and abstract generalizations, accounting for lexical variants and synonymy. The EVEX resource locates relevant literature on phosphorylation, regulation targets, binding partners, and several other biomolecular events and assigns confidence values to these events. The search function accepts official gene/protein symbols as well as common names from all species. Finally, the web application is a powerful tool for generating homology-based hypotheses as well as novel, indirect associations between genes and proteins such as coregulators.
doi:10.1155/2012/582765
PMCID: PMC3375141
PMID: 22719757
Treatment of bipolar disorder with lithium therapy during pregnancy is a medical challenge. Bipolar disorder is more prevalent in women and its onset is often concurrent with peak reproductive age. Treatment typically involves administration of the element lithium, which has been classified as a class D drug (legal to use during pregnancy, but may cause birth defects) and is one of only thirty known teratogenic drugs. There is no clear recommendation in the literature on the maximum acceptable dosage regimen for pregnant, bipolar women. We recommend a maximum dosage regimen based on a physiologically based pharmacokinetic (PBPK) model. The model simulates the concentration of lithium in the organs and tissues of a pregnant woman and her fetus. First, we modeled time-dependent lithium concentration profiles resulting from lithium therapy known to have caused birth defects. Next, we identified maximum and average fetal lithium concentrations during treatment. Then, we developed a lithium therapy regimen to maximize the concentration of lithium in the mother's brain, while maintaining the fetal concentration low enough to reduce the risk of birth defects. This maximum dosage regimen suggested by the model was 400 mg lithium three times per day.
doi:10.1155/2012/352729
PMCID: PMC3369391
PMID: 22693500