Gene set enrichment analysis of large profiling and screening experiments can reveal unifying biological schemes based on previously accumulated knowledge represented as “gene sets”. Most existing implementations use a fixed fold-change or P value cutoff to generate regulated gene lists. However, the threshold selection is in most cases arbitrary and has a significant effect on the test outcome and the interpretation of the experiment. We developed a new gene set enrichment analysis method, FDR-FET, which dynamically optimizes the threshold choice and improves the sensitivity and selectivity of gene set enrichment analysis. The procedure translates experimental results into a series of regulated gene lists at multiple false discovery rate (FDR) cutoffs, and computes the P value of the overrepresentation of a gene set using Fisher’s exact test (FET) in each of these gene lists. The lowest P value is retained to represent the significance of the gene set. We also implemented improved methods to define a more relevant global reference set for the FET. We demonstrate the validity of the method using a published microarray study of three protease inhibitors of the human immunodeficiency virus and compare the results with those from other popular gene set enrichment analysis algorithms. Our results show that combining FDR with multiple cutoffs allows us to control the error while retaining genes that increase information content. We conclude that FDR-FET can selectively identify significantly affected biological processes. Our method can be used for any user-generated gene list in transcriptomics, proteomics, and other biological and scientific applications.
gene set enrichment analysis; false discovery rate; Fisher’s exact test; microarray profiling; protease inhibitors
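The FDR-FET procedure described in this abstract (regulated gene lists at several FDR cutoffs, a Fisher's exact test per list, lowest P value retained) can be sketched as follows. This is a minimal illustration and not the authors' implementation: the cutoff grid, the function names, and the use of a one-sided (overrepresentation) hypergeometric tail are assumptions, and the reference set is simply taken to be every gene that has an FDR value.

```python
from math import comb

def fisher_right_tail(k, K, n, N):
    """One-sided Fisher exact P value, P(X >= k), for overrepresentation.

    N = genes in the reference set, K = gene-set members in the reference,
    n = genes in the regulated list, k = gene-set members in that list.
    """
    upper = min(K, n)
    total = comb(N, n)
    return sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1)) / total

def fdr_fet(gene_fdr, gene_set, cutoffs=(0.01, 0.05, 0.10, 0.20)):
    """Return (lowest P value, cutoff that produced it) for one gene set.

    gene_fdr maps each gene in the reference set to its FDR.
    The cutoff grid is an assumed example, not taken from the paper.
    """
    reference = set(gene_fdr)
    members = gene_set & reference  # ignore gene-set members outside the reference
    best_p, best_cutoff = 1.0, None
    for cutoff in cutoffs:
        regulated = {g for g, q in gene_fdr.items() if q <= cutoff}
        if not regulated:
            continue
        overlap = len(regulated & members)
        p = fisher_right_tail(overlap, len(members), len(regulated), len(reference))
        if p < best_p:
            best_p, best_cutoff = p, cutoff
    return best_p, best_cutoff

# Toy data (hypothetical): 20 genes, 4 of the 5 gene-set members are strongly regulated.
toy_fdr = {f"g{i}": 0.5 for i in range(20)}
toy_fdr.update({"g0": 0.005, "g1": 0.005, "g2": 0.005, "g3": 0.005,
                "g4": 0.03, "g5": 0.04})
toy_set = {"g0", "g1", "g2", "g3", "g10"}
print(fdr_fet(toy_fdr, toy_set))
```

In the toy data the strictest cutoff gives the lowest P value because adding the two extra genes at the 0.05 cutoff dilutes the overlap; with other data a looser cutoff may win, which is the point of scanning several thresholds.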
The human progesterone receptor (hPR) belongs to the steroid receptor family. It may be found as a monomer (A or B) or as a dimer (AB). hPR is regarded as a prognostic biomarker for breast cancer. In a cellular dimer system, AB is the dominant species in most cases. However, when a cell coexpresses all three isoforms of hPR, the complexity of the action of this receptor increases. For example, hPR A suppresses the activity of hPR B, and the ratio of hPR A to hPR B may determine the physiology of a breast tumor. Also, persistent exposure of hPRs to nonendogenous ligands is a common risk factor for breast cancer. Hence we aimed to study the interactions of progesterone and some nonendogenous ligands with hPRs by molecular docking.
Methods and results
A pool of steroid derivatives, namely, progesterone, cholesterol, testosterone, testolactone, estradiol, estrone, norethindrone, exemestane, and norgestrel, was used for this in silico study. Docking was performed with AutoDock 4.2. We found that estrogens, including estradiol and estrone, had a higher affinity for the hPR A and B monomers than for the dimer, hPR AB, and a higher affinity than that of the endogenous ligand, progesterone. hPR A had a higher affinity for all the docked ligands than hPR B.
This study suggests that the exposure of estrogens to hPR A as well as hPR B, and more particularly to hPR A alone, is a risk factor for breast cancer.
human progesterone receptor; breast cancer; steroid derivatives; estrogens; molecular docking
Here we describe LifePrint, a sequence alignment-independent k-tuple distance method to estimate relatedness between complete genomes.
We designed a representative sample of all possible DNA tuples of length 9 (9-tuples). The final sample comprises 1878 tuples (called the LifePrint set of 9-tuples; LPS9) that are distinct from each other by at least two internal and noncontiguous nucleotide differences. For validation of our k-tuple distance method, we analyzed several real and simulated viroid genomes. Using different distance metrics, we scrutinized diverse viroid genomes to estimate the k-tuple distances between these genomic sequences. Then we used the estimated genomic k-tuple distances to construct phylogenetic trees using the neighbor-joining algorithm. A comparison of the accuracy of LPS9 and the previously reported 5-tuple method was made using symmetric differences between the trees estimated from each method and a simulated “true” phylogenetic tree.
The identified optimal search scheme for LPS9 allows only up to two nucleotide differences between each 9-tuple and the scrutinized genome. Similarity search results of simulated viroid genomes indicate that, in most cases, LPS9 is able to detect single-base substitutions between genomes efficiently. Analysis of simulated genomic variants with a high proportion of base substitutions indicates that LPS9 is able to discern relationships between genomic variants with up to 40% of nucleotide substitution.
Our LPS9 method generates more accurate phylogenetic reconstructions than the previously proposed 5-tuples strategy. LPS9-reconstructed trees show higher bootstrap proportion values than distance trees derived from the 5-tuple method.
phylogeny; sequence alignment; similarity search; tuple; viroid
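The alignment-free comparison described in this abstract (counting probe tuples in each genome while tolerating up to two nucleotide differences, then turning the counts into a distance) can be illustrated with a small sketch. The two probes below are hypothetical stand-ins for the 1878-tuple LPS9 set, and the Euclidean metric is only one of the "different distance metrics" the abstract mentions; neither is taken from the paper.

```python
from math import sqrt

def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))

def tuple_profile(genome, probes, max_mismatches=2):
    """Count, for each probe, the genome windows within max_mismatches of it."""
    k = len(probes[0])
    windows = [genome[i:i + k] for i in range(len(genome) - k + 1)]
    return [sum(hamming(p, w) <= max_mismatches for w in windows) for p in probes]

def profile_distance(p, q):
    """Euclidean distance between two count profiles (an assumed metric)."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

# Hypothetical 9-tuple probes and two toy "genomes" differing by one substitution.
probes = ["ACGTACGTT", "GGCCTTAAG"]
genome_a = "ACGTACGTTAGGCCTTAAGC"
genome_b = "ACGTACGTTAGGCCTAAAGC"
print(profile_distance(tuple_profile(genome_a, probes),
                       tuple_profile(genome_b, probes)))
```

A matrix of such pairwise distances is what a neighbor-joining routine would take as input to build the tree.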
Artificially synthesized RNA molecules have recently come under study since such molecules have the potential to yield a variety of novel functional molecules. When designing artificial RNA sequences, secondary structure should be taken into account since the functions of noncoding RNAs strongly depend on their structure. RNA inverse folding is a methodology for computationally exploring the RNA sequences that fold into a user-given target structure. In the present study, we developed a multi-objective genetic algorithm, MODENA (Multi-Objective DEsign of Nucleic Acids), for RNA inverse folding. MODENA explores the approximate set of weak Pareto optimal solutions in the objective function space of two objective functions, a structure stability score and a structure similarity score. MODENA can simultaneously design multiple different RNA sequences in a single run, whose lowest free energies range from a very stable value to a higher value near those of natural counterparts. MODENA and previous RNA inverse folding programs were benchmarked with 29 target structures taken from the Rfam database, and we found that MODENA can successfully design 23 RNA sequences folding into the target structures; this result is better than those of the other benchmarked RNA inverse folding programs. The multi-objective genetic algorithm gives a useful framework for functional biomolecular design. Executable files of MODENA can be obtained at http://rna.eit.hirosaki-u.ac.jp/modena/.
multi-objective genetic algorithm; secondary structure; RNA sequence design; Rfam
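To make the notion of weak Pareto optimality over two objectives concrete, the sketch below filters a list of candidate designs, keeping every candidate that no other candidate beats strictly on both scores. It assumes both scores are to be maximized and uses made-up candidate names and values; it illustrates the selection criterion only and is not MODENA's genetic algorithm.

```python
def weak_pareto_front(candidates):
    """Return names of weakly Pareto optimal candidates.

    candidates: list of (name, stability_score, similarity_score),
    with both scores assumed to be maximized. A candidate is weakly
    Pareto optimal if no other candidate is strictly better on BOTH scores.
    """
    front = []
    for name, s1, s2 in candidates:
        strictly_beaten = any(o1 > s1 and o2 > s2 for _, o1, o2 in candidates)
        if not strictly_beaten:
            front.append(name)
    return front

# Hypothetical designs: (name, stability, similarity)
designs = [("a", 1.0, 0.2), ("b", 0.5, 0.9), ("c", 0.4, 0.1), ("d", 1.0, 0.1)]
print(weak_pareto_front(designs))
```

Design "d" stays in the weak front even though "a" matches it on stability and beats it on similarity, because nothing is strictly better than "d" on both scores; a strict Pareto filter would drop it.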
Genotoxic stress is induced by a broad range of DNA-damaging agents and can lead to a variety of human diseases including cancer. DNA damage is also therapeutically induced for cancer treatment with the aim of eliminating tumor cells. However, the effectiveness of radio- and chemotherapy is strongly hampered by tumor cell resistance. A major reason for radio- and chemotherapeutic resistance is the simultaneous activation of cell survival pathways resulting in the activation of the transcription factor nuclear factor-kappa B (NF-κB). Here, we present a Boolean network model of the NF-κB signal transduction induced by genotoxic stress in epithelial cells. For the representation and analysis of the model, we used the formalism of logical interaction hypergraphs. Model reconstruction was based on a careful meta-analysis of published data. By calculating minimal intervention sets, we identified p53-induced protein with a death domain (PIDD), receptor-interacting protein 1 (RIP1), and protein inhibitor of activated STAT y (PIASy) as putative therapeutic targets to abrogate NF-κB activation resulting in apoptosis. Targeting these structures therapeutically may potentiate the effectiveness of radio- and chemotherapy. Thus, the presented model allows a better understanding of the signal transduction in tumor cells and provides candidates as new therapeutic target structures.
apoptosis; Boolean network; cancer therapy; DNA-damage response; NF-κB
In recent years, protein–protein interactions have become the object of increasing attention in many different fields, such as structural biology, molecular biology, systems biology, and drug discovery. From a structural biology perspective, it would be desirable to integrate current efforts into the structural proteomics programs. Given that experimental determination of many protein–protein complex structures is highly challenging, and in the context of current high-performance computational capabilities, different computer tools are being developed to help in this task. Among them, computational docking aims to predict the structure of a protein–protein complex starting from the atomic coordinates of its individual components, and in recent years, a growing number of docking approaches have been reported with increased predictive capabilities. The improvement of speed and accuracy of these docking methods, together with the modeling of the interaction networks that regulate the most critical processes in a living organism, will be essential for computational proteomics. The ultimate goal is the rational design of drugs capable of specifically inhibiting or modifying protein–protein interactions of therapeutic significance. While rational design of protein–protein interaction inhibitors is at a very early stage, the first results are promising.
protein-protein interactions; drug design; protein docking; structural prediction; virtual ligand screening; hot-spots
Protein–protein docking simulations can provide predicted structural models of complexes. In a docking simulation, several putative structural models are selected by scoring functions from an ensemble of many complex models. Scoring functions based on statistical analyses of heterodimers are usually designed to select, as the correct model, the complex model with the most abundant interaction mode found among the known complexes. However, because the formation schemes of heterodimers are extremely diverse, a single scoring function does not seem to be sufficient to describe the fitness of predicted models with interaction modes other than the most abundant one. Thus, it is necessary to classify the heterodimers in terms of their individual interaction modes, and then to construct a separate scoring function for each heterodimer type. In this study, we constructed a classification method for heterodimers based on the characteristics that discriminate near-native from decoy models, which were found by comparing the interfaces in terms of complementarity in hydrophobicity, electrostatic potential, and shape. Consequently, we found four heterodimer clusters, and then constructed multiple scoring functions, each optimized for one cluster. Our multiple scoring functions were applied to predictions in unbound docking.
classification of heterodimers; prediction of complex structures; scoring functions; protein-protein docking; CAPRI
The large numbers of protein sequences generated by whole genome sequencing projects require rapid and accurate methods of annotation. The detection of homology through computational sequence analysis is a powerful tool in determining the complex evolutionary and functional relationships that exist between proteins. Homology search algorithms employ amino acid substitution matrices to detect similarity between protein sequences. The substitution matrices in common use today are constructed using sequences aligned without reference to protein structure. Here we present amino acid substitution matrices constructed from the alignment of a large number of protein domain structures from the structural classification of proteins (SCOP) database. We show that when incorporated into the homology search algorithms BLAST and PSI-BLAST, the structure-based substitution matrices enhance the efficacy of detecting remote homologs.
computational biology; protein homology; amino acid substitution matrix; protein structure
Multivariate partial least square (PLS) regression allows the modeling of complex biological events by considering different factors at the same time. It is unaffected by data collinearity, making it a valuable method for modeling high-dimensional biological data (as derived from genomics, proteomics and peptidomics). In the presence of multiple responses, it is of particular interest how to appropriately “dissect” the model to reveal the importance of single attributes with regard to individual responses (for example, variable selection). In this paper, the performance of multivariate PLS regression coefficients in selecting relevant predictors for different responses in omics-type data was investigated by means of a receiver operating characteristic (ROC) analysis. For this purpose, simulated data, mimicking the covariance structures of microarray and liquid chromatography mass spectrometric data, were used to generate matrices of predictors and responses. The relevant predictors were set a priori. The influences of noise, the source of data with different covariance structure and the number of relevant predictors were investigated. Results demonstrate the applicability of PLS regression coefficients in selecting variables for each response of a multivariate PLS in omics-type data. Comparisons with other feature selection methods, such as variable importance in the projection scores, principal component regression, and least absolute shrinkage and selection operator regression, are also provided.
partial least square regression; regression coefficients; variable selection; biomarker discovery; omics-data
Identification of genes involved in the aging process is critical for understanding the mechanisms of age-dependent diseases such as cancer and diabetes. Measuring the lifespan of mutants, each lacking a single gene, is traditionally employed to identify longevity genes. While such screening is impractical for the whole genome due to the time-consuming nature of lifespan assays, it can be achieved by in silico genetic manipulations with systems biology approaches. In this review, we introduce pilot explorations applying two approaches of systems biology in aging studies. One approach is to predict the role of a specific gene in the aging process by comparing its expression profile and protein–protein interaction pattern with those of known longevity genes (top-down systems biology). The other approach is to construct mathematical models from previous kinetics data and predict how a specific protein contributes to aging and antiaging processes (bottom-up systems biology). These approaches allow researchers to simulate the effect of each gene’s product in aging by in silico genetic manipulations such as deletion or over-expression. Since simulation-based approaches are not as widely used as the other approaches, we focus our review on this effort in more detail. A combination of hypotheses from data mining, in silico experimentation from simulations, and wet laboratory validation will make the systematic identification of all longevity genes possible.
systems biology; yeast; aging; in silico; genetic manipulation; modeling
Probabilistic DNA sequence models have been intensively applied to genome research. Within the evolutionary biology framework, this article investigates the feasibility of rigorously estimating the probability of a set of orthologous DNA sequences that evolved from a common progenitor. We propose Monte Carlo integration algorithms that sample the unknown ancestral and/or root sequences a posteriori, conditional on a reference sequence, and apply pairwise Needleman–Wunsch alignment between the sampled and nonreference species sequences to estimate the probability. We test our algorithms on both simulated and real sequences and compare the probabilities calculated by Monte Carlo integration with those induced by a single multiple alignment.
evolution; Jukes-Cantor model; Monte Carlo integration; Needleman-Wunsch alignment; orthologous
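The pairwise Needleman–Wunsch alignment step named in this abstract is a standard global-alignment dynamic program. The sketch below computes only the optimal alignment score with a linear gap penalty; the match, mismatch, and gap values are arbitrary illustrative choices, not parameters from the article, and the Monte Carlo sampling of ancestral sequences is not shown.

```python
def needleman_wunsch_score(a, b, match=1, mismatch=-1, gap=-2):
    """Optimal global alignment score of strings a and b (linear gap penalty)."""
    # prev[j] holds the best score for aligning the processed prefix of a with b[:j].
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [i * gap]  # aligning a[:i] against the empty prefix costs i gaps
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cur.append(max(diag, prev[j] + gap, cur[j - 1] + gap))
        prev = cur
    return prev[-1]

print(needleman_wunsch_score("ACGT", "ACT"))
```

Keeping only two rows of the table is enough when the score alone is needed; recovering the alignment itself would require the full matrix or a traceback structure.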
Simple sequence repeats (SSRs) play important roles in gene regulation and genome evolution. Although there exist several online resources for SSR mining, most of them only extract general SSR patterns without providing functional information. Here, an online search tool, CG-SSR (Comparative Genomics SSR discovery), has been developed for discovering potential functional SSRs from vertebrate genomes through cross-species comparison. In addition to revealing SSR candidates in conserved regions among various species, it also combines accurate coordinate and functional genomics information. CG-SSR is the first comprehensive and efficient online tool for conserved SSR discovery.
microsatellites; genome; comparative genomics; functional SSR; gene ontology; conserved region
Bladder cancer is relatively common but early detection techniques such as cystoscopy and cytology are somewhat limited. We developed a broadly applicable, platform-independent and clinically relevant method based on simple ratios of gene expression to diagnose human cancers. In this study, we sought to determine whether this technique could be applied to the diagnosis of bladder cancer.
We developed a model for the diagnosis of bladder cancer using expression profiling data from 80 normal and tumor bladder tissues to identify statistically significant discriminating genes with reciprocal average expression levels in each tissue type. The expression levels of select genes were used to calculate individual gene pair expression ratios in order to assign diagnosis. The optimal model was examined in two additional published microarray data sets and using quantitative RT-PCR in a cohort of 13 frozen benign bladder urothelium samples and 13 bladder cancer samples from our institution.
A five-ratio test utilizing six genes proved to be 100% accurate (26 of 26 samples) for distinguishing benign from malignant bladder tissue samples (P < 10^−6).
We have provided a proof of principle study for the use of gene expression ratios in the diagnosis of bladder cancer. This technique may ultimately prove to be a useful adjunct to cytopathology in screening urine specimens for bladder cancer.
bladder cancer; gene expression profiling; diagnosis
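The idea of assigning a diagnosis from gene pair expression ratios can be illustrated with a short sketch. The abstract does not state how the individual ratios are combined, so the geometric mean with a threshold of 1 used here is an assumption, and the gene names and expression values are hypothetical placeholders rather than the six genes of the study.

```python
from math import prod

def ratio_call(expression, gene_pairs):
    """Classify one sample from gene pair expression ratios.

    expression: dict mapping gene name to expression level.
    gene_pairs: list of (gene_higher_in_tumor, gene_higher_in_normal).
    Combination rule (assumed): geometric mean of the ratios, > 1 means malignant.
    """
    ratios = [expression[up] / expression[down] for up, down in gene_pairs]
    geo_mean = prod(ratios) ** (1 / len(ratios))
    return ("malignant" if geo_mean > 1 else "benign"), geo_mean

# Hypothetical genes and expression values for a single sample.
pairs = [("UP1", "DOWN1"), ("UP2", "DOWN2")]
sample = {"UP1": 8.0, "DOWN1": 2.0, "UP2": 2.0, "DOWN2": 4.0}
print(ratio_call(sample, pairs))
```

Because each ratio is formed within a single sample, the call does not depend on platform-specific normalization across samples, which is the property the abstract describes as platform independence.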
A system was developed to evaluate and predict the interaction between protein pairs by using the widely used shape complementarity search method as the algorithm for docking simulations between the proteins. We used this system, which we call the affinity evaluation and prediction (AEP) system, to evaluate the interaction between 20 protein pairs. The system first executes a “round robin” shape complementarity search of the target protein group, and evaluates the interaction between the complex structures obtained by the search. These complex structures are selected by using a statistical procedure that we developed called ‘grouping’. At a prevalence of 5.0%, our AEP system predicted protein–protein interactions with a 50.0% recall, 55.6% precision, 95.5% accuracy, and an F-measure of 0.526. By optimizing the grouping process, our AEP system successfully predicted 10 protein pairs (among 20 pairs) that were biologically relevant combinations. Our ultimate goal is to construct an affinity database that will provide cell biologists and drug designers with crucial information obtained using our AEP system.
protein-protein interaction; affinity analysis; protein-protein docking; FFT; massive parallel computing
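The performance figures quoted in this abstract are mutually consistent, which can be checked with a small calculation. The confusion-matrix counts below are inferred, not stated in the abstract: they assume 400 candidate pairings, 20 of which are true interactions (5.0% prevalence), with 10 of those recovered among 18 positive predictions.

```python
def classification_metrics(tp, fp, fn, tn):
    """Standard binary classification metrics from confusion-matrix counts."""
    total = tp + fp + fn + tn
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "prevalence": (tp + fn) / total,
        "recall": recall,
        "precision": precision,
        "accuracy": (tp + tn) / total,
        "f_measure": 2 * precision * recall / (precision + recall),
    }

# Assumed counts chosen to match the reported values; not taken from the paper.
metrics = classification_metrics(tp=10, fp=8, fn=10, tn=372)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

The high accuracy alongside a recall of only 50% reflects the low prevalence: most pairings are true negatives, so accuracy alone says little about how well interacting pairs are found.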
It is expected that different markers may show different patterns of association with different pathogenic variants within a given gene. It would be helpful to combine the evidence implicating association at the level of the whole gene rather than just for individual markers or haplotypes. Doing this is complicated by the fact that different markers do not represent independent sources of information.
We propose combining the p values from all single locus and/or multilocus analyses of different markers according to the formula of Fisher, X = ∑(−2 ln(p_i)), and then assessing the empirical significance of this statistic using permutation testing. We present an example application to 19 markers around the HTRA2 gene in a case-control study of Parkinson’s disease.
Applying our approach shows that, although some individual tests produce low p values, overall association at the level of the gene is not supported.
Approaches such as this should be more widely used in assimilating the overall evidence supporting involvement of a gene in a particular disease. Information can be combined from biallelic and multiallelic markers and from single markers along with multimarker analyses. Single genes can be tested or results from groups of genes involved in the same pathway could be combined in order to test biologically relevant hypotheses. The approach has been implemented in a computer program called COMBASSOC which is made available for downloading.
Fisher; significance; genetic marker
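The combination scheme in this abstract (Fisher's statistic over all marker p values, with significance assessed by permuting case-control labels so that correlation between markers is respected) can be sketched as below. This is not COMBASSOC: the per-marker test (a two-sided Fisher exact test on carrier status), the function names, and the toy data are all assumptions made for illustration.

```python
import random
from math import comb, log

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact P value for the 2x2 table [[a, b], [c, d]]."""
    row1, col1, n = a + b, a + c, a + b + c + d
    def prob(x):
        return comb(col1, x) * comb(n - col1, row1 - x) / comb(n, row1)
    observed = prob(a)
    lo, hi = max(0, row1 - (n - col1)), min(row1, col1)
    tail = sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= observed * (1 + 1e-9))
    return min(1.0, tail)

def marker_pvalues(markers, labels):
    """markers: one 0/1 carrier list per marker; labels: 1 = case, 0 = control."""
    pvals = []
    for carriers in markers:
        a = sum(1 for g, y in zip(carriers, labels) if g and y)
        b = sum(1 for g, y in zip(carriers, labels) if g and not y)
        c = sum(1 for g, y in zip(carriers, labels) if not g and y)
        d = len(labels) - a - b - c
        pvals.append(fisher_exact_two_sided(a, b, c, d))
    return pvals

def fisher_statistic(pvals):
    """Fisher's combined statistic X = sum(-2 ln p_i)."""
    return sum(-2 * log(p) for p in pvals)

def gene_level_pvalue(markers, labels, n_perm=999, seed=0):
    """Empirical significance of X obtained by shuffling case-control labels."""
    rng = random.Random(seed)
    observed = fisher_statistic(marker_pvalues(markers, labels))
    shuffled = list(labels)
    exceed = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if fisher_statistic(marker_pvalues(markers, shuffled)) >= observed - 1e-12:
            exceed += 1
    return observed, (exceed + 1) / (n_perm + 1)

# Toy data (hypothetical): two perfectly correlated markers, 10 cases and 10 controls.
toy_labels = [1] * 10 + [0] * 10
toy_markers = [list(toy_labels), list(toy_labels)]
print(gene_level_pvalue(toy_markers, toy_labels, n_perm=199, seed=1))
```

Shuffling the labels while leaving the genotype columns intact keeps the correlation between markers in every permuted data set, which is why the empirical p value remains valid even though the markers are not independent sources of information.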
A method was constructed to discriminate between biologically relevant interfaces and artificial crystal-packing contacts in crystal structures. The method evaluates protein-protein interfaces in terms of complementarities for hydrophobicity, electrostatic potential and shape on the protein surfaces, and chooses the most probable biological interfaces among all possible contacts in the crystal. The method uses a discriminator named “COMP”, which is a linear combination of the complementarities for the above three surface features and does not correlate with the contact area. The discrimination of homo-dimer interfaces from symmetry-related crystal-packing contacts based on the COMP value achieved a modest success rate. Subsequent detailed review of the discrimination results raised the success rate to about 88.8%. In addition, our discrimination method yielded some clues for understanding the interaction patterns in several examples in the PDB. Thus, the COMP discriminator can also be used as an indicator of the “biological-ness” of protein-protein interfaces.
protein-protein interaction; complementarity analysis; homo-dimer interface; crystal-packing contact; biological interfaces
There is a need to identify the regulatory gene interactions through which anticancer drugs act on target cancer cells. Whole genome expression profiling offers promise in this regard, but can be complicated by the challenge of identifying the affected genes among the hundreds to thousands of genes that show changes in expression. A proteasome inhibitor, bortezomib, could be a potential therapeutic agent for treating adult T-cell leukemia (ATL) patients; however, the underlying mechanism by which bortezomib induces cell death in ATL cells via a gene regulatory network has not been fully elucidated. Here we show that a Bayesian statistical framework by VoyaGene® identified the secreted protein acidic and rich in cysteine (SPARC) gene, a tumor-invasiveness related gene, as a possible modulator of bortezomib-induced cell death in ATL cells. Functional analysis using RNAi experiments revealed that inhibition of SPARC expression by siRNA enhanced the apoptotic effect of bortezomib on ATL cells, in accordance with an increase in cleaved caspase 3. Targeting SPARC in combination with bortezomib may help to treat ATL patients. This work shows that a network biology approach can be used advantageously to identify the genetic interactions related to anticancer effects.
network biology; adult T cell leukemia; bortezomib; SPARC
Mobile phone technology makes use of radio frequency (RF) electromagnetic fields transmitted through a dense network of base stations in Europe. Possible harmful effects of RF fields on humans and animals are discussed, but their effect on plants has received little attention. In search of physiological processes of plant cells sensitive to RF fields, cell suspension cultures of Arabidopsis thaliana were exposed for 24 h to an RF field protocol representing typical microwave exposure in an urban environment. mRNA of exposed cultures and controls was used to hybridize Affymetrix-ATH1 whole genome microarrays. Differential expression analysis revealed significant changes in transcription of 10 genes, but they did not exceed a fold change of 2.5. Apart from the fact that three of them are dark-inducible, their functions do not point to any known responses of plants to environmental stimuli. The changes in transcription of these genes were compared with published microarray datasets and revealed a weak similarity of the microwave to light treatment experiments. Considering the large changes described in published experiments, it is questionable whether the small alterations caused by a 24 h continuous microwave exposure would have any impact on the growth and reproduction of whole plants.
suspension cultured plant cells; radio frequency electromagnetic fields; microarrays; Arabidopsis thaliana
The microtubule network, the major organelle of the eukaryotic cytoskeleton, is involved in cell division and differentiation as well as in many other cellular functions. In plants, microtubules seem to be involved in the ordered deposition of cellulose microfibrils by a so far unknown mechanism. Microtubule-associated proteins (MAP) typically contain various domains targeting or binding proteins with different functions to microtubules. Here we have investigated a proposed microtubule-targeting domain, TPX2, first identified in the Kinesin-like protein 2 in Xenopus. A TPX2-containing microtubule binding protein, PttMAP20, has recently been identified in poplar tissues undergoing xylogenesis. Furthermore, the herbicide 2,6-dichlorobenzonitrile (DCB), which is a known inhibitor of cellulose synthesis, was shown to bind specifically to PttMAP20. It is thus possible that PttMAP20 has a role in coupling cellulose biosynthesis and the microtubular networks in poplar secondary cell walls. To gain more insight into the occurrence, evolution and potential functions of TPX2-containing proteins, we have carried out bioinformatic analysis of all genes so far found to encode TPX2 domains, with special reference to poplar PttMAP20 and its putative orthologs in other plants.
TPX2 domain; MAP20; evolution; microtubule; cellulose; bioinformatics
Prion diseases are fatal neurodegenerative disorders that affect animals and humans. There is a need to gain understanding of prion disease pathogenesis and to develop diagnostic assays to detect prion diseases prior to the onset of clinical symptoms. The goal of this study was to identify genes that show altered expression early in the disease process in the spleen and brain of prion disease-infected mice. Using Affymetrix microarrays, we identified 67 genes that showed increased expression in the brains of prion disease-infected mice prior to the onset of clinical symptoms. These genes function in many cellular processes including immunity, the endosome/lysosome system, hormone activity, and the cytoskeleton. We confirmed a subset of these gene expression alterations using other methods and determined the time course in which these changes occur. We also identified 14 genes showing altered expression prior to the onset of clinical symptoms in spleens of prion disease infected mice. Interestingly, four genes, Atp1b1, Gh, Anp32a, and Grn, were altered at the very early time of 46 days post-infection. These gene expression alterations provide insights into the molecular mechanisms underlying prion disease pathogenesis and may serve as surrogate markers for the early detection and diagnosis of prion disease.
prion disease; microarrays; gene expression
We examined procedures for combining two different in silico drug-screening results to achieve a high hit ratio. When the 3D structure of the target protein and some active compounds are known, both structure-based and ligand-based in silico screening methods can be applied. In the present study, the machine-learning score modification multiple target screening (MSM-MTS) method was adopted as a structure-based screening method, and the machine-learning docking score index (ML-DSI) method was adopted as a ligand-based screening method. To combine the compound sets predicted by these two screening methods, we examined the product of the sets (consensus set) and the sum of the sets. As a result, the consensus set achieved a higher hit ratio than the sum of the sets and than either individual predicted set. In addition, the current combination was shown to be robust to structural diversity, both across different crystal structures and across snapshot structures from molecular dynamics simulations.
in silico; screening; consensus score; protein-based screening; protein-ligand docking; conformation of active site
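The two ways of combining predicted compound sets mentioned in this abstract, the product of the sets (consensus) and the sum of the sets, correspond to set intersection and set union. The sketch below computes the hit ratio of each on invented compound identifiers; the data are purely illustrative and do not reproduce the MSM-MTS or ML-DSI results.

```python
def hit_ratio(predicted, actives):
    """Fraction of predicted compounds that are true actives (0.0 if none predicted)."""
    return len(predicted & actives) / len(predicted) if predicted else 0.0

# Hypothetical compound identifiers, for illustration only.
actives = {"c1", "c2", "c3", "c4"}
structure_based = {"c1", "c2", "c5", "c6"}              # stand-in for one method's picks
ligand_based = {"c1", "c2", "c3", "c7", "c8", "c9"}     # stand-in for the other's picks

consensus = structure_based & ligand_based   # product of the sets
combined = structure_based | ligand_based    # sum of the sets

for name, chosen in [("structure-based", structure_based),
                     ("ligand-based", ligand_based),
                     ("consensus", consensus),
                     ("sum", combined)]:
    print(f"{name}: {hit_ratio(chosen, actives):.3f}")
```

The consensus set is smaller, so a higher hit ratio comes at the cost of fewer selected compounds; the sum keeps every candidate from both methods and inherits the false positives of each.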
In genomic studies, it is essential to select a small number of genes that are more significant than the others for research ranging from candidate gene studies to genome-wide association studies. In this study, we proposed a Bayesian method for identifying the promising candidate genes that are significantly more influential than the others. We employed the framework of variable selection and a Gibbs sampling based technique to identify significant genes. The proposed approach was applied to a genomics study of persons with chronic fatigue syndrome. Our studies show that the proposed Bayesian methodology is effective for deriving models for genomic studies and for providing information on significant genes.
Bayesian variable selection; genomics; Gibbs sampling; variable selection
Binarization is often recognized as one of the most important steps in high-level image analysis systems, particularly for object recognition. Its precise functioning largely determines the performance of the entire system. According to many researchers, segmentation finishes when the observer’s goal is satisfied. Experience has shown that the most effective methods continue to be the iterative ones. However, a problem with these algorithms is the stopping criterion. In this work, entropy is used as the stopping criterion when segmenting an image by recursively applying mean shift filtering. In this way, a new algorithm is introduced for the binarization of medical images, where the binarization is carried out after the segmented image has been obtained. The good performance of the proposed method, that is, the good quality of the binarization, is illustrated with several experimental results. The results obtained with this new algorithm are compared with those of another algorithm previously developed by the author and collaborators, and with Otsu’s method.
image segmentation; mean shift; algorithm; entropy; Otsu’s method
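The idea of using entropy as a stopping criterion for recursive mean shift filtering can be sketched on a toy list of gray levels. Several simplifications are assumed here and are not taken from the paper: the filter works in the intensity (range) domain only and ignores pixel positions, iteration stops when the change in Shannon entropy falls below a small tolerance, and the final binarization thresholds at the midpoint between the darkest and brightest remaining levels.

```python
from collections import Counter
from math import log2

def entropy(pixels):
    """Shannon entropy (bits) of the gray-level histogram."""
    n = len(pixels)
    return -sum(c / n * log2(c / n) for c in Counter(pixels).values())

def mean_shift_step(pixels, bandwidth):
    """Move each gray level to the rounded mean of all levels within `bandwidth` of it."""
    out = []
    for v in pixels:
        near = [u for u in pixels if abs(u - v) <= bandwidth]
        out.append(round(sum(near) / len(near)))
    return out

def segment(pixels, bandwidth=20, eps=1e-3, max_iter=50):
    """Repeat mean shift steps until the entropy change drops below eps (assumed rule)."""
    prev = entropy(pixels)
    for _ in range(max_iter):
        pixels = mean_shift_step(pixels, bandwidth)
        cur = entropy(pixels)
        if abs(prev - cur) < eps:
            break
        prev = cur
    return pixels

def binarize(pixels):
    """Threshold at the midpoint of the remaining gray-level range (assumed rule)."""
    threshold = (min(pixels) + max(pixels)) / 2
    return [1 if v > threshold else 0 for v in pixels]

toy_image = [10, 12, 14, 200, 205, 210]   # hypothetical gray levels
print(binarize(segment(toy_image)))
```

As the filter merges nearby gray levels the histogram collapses onto a few modes and its entropy stops changing, which gives the loop a data-driven point at which to hand the segmented image to the binarization step.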