The large numbers of protein sequences generated by whole genome sequencing projects require rapid and accurate methods of annotation. The detection of homology through computational sequence analysis is a powerful tool in determining the complex evolutionary and functional relationships that exist between proteins. Homology search algorithms employ amino acid substitution matrices to detect similarity between protein sequences. The substitution matrices in common use today are constructed using sequences aligned without reference to protein structure. Here we present amino acid substitution matrices constructed from the alignment of a large number of protein domain structures from the Structural Classification of Proteins (SCOP) database. We show that when incorporated into the homology search algorithms BLAST and PSI-BLAST, the structure-based substitution matrices enhance the efficacy of detecting remote homologs.
computational biology; protein homology; amino acid substitution matrix; protein structure
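As an illustration of how a substitution matrix can be derived from aligned residue pairs, the sketch below computes BLOSUM-style log-odds scores. This is an assumption made for illustration: the abstract does not state the exact procedure used, and the function name, the `scale` parameter and the toy input are hypothetical. In the setting described above, the aligned pairs would come from structure-based alignments of SCOP domains.

```python
import math
from collections import Counter

def log_odds_matrix(aligned_pairs, scale=2.0):
    """Build a symmetric log-odds score table from aligned residue pairs.

    aligned_pairs: iterable of (a, b) residue pairs taken from alignments.
    Returns {(a, b): score} with score = round(scale * log2(observed / expected)).
    Only residue pairs that were actually observed receive a score.
    """
    pair_counts = Counter()
    for a, b in aligned_pairs:
        # Pool (a, b) and (b, a) under one sorted key.
        pair_counts[tuple(sorted((a, b)))] += 1
    total_pairs = sum(pair_counts.values())

    # Background frequency of each residue over all aligned positions.
    residue_counts = Counter()
    for (a, b), n in pair_counts.items():
        residue_counts[a] += n
        residue_counts[b] += n
    total_residues = 2 * total_pairs
    freq = {r: n / total_residues for r, n in residue_counts.items()}

    scores = {}
    for (a, b), n in pair_counts.items():
        observed = n / total_pairs
        # Expected frequency of the unordered pair if residues were independent.
        expected = freq[a] * freq[b] * (1 if a == b else 2)
        s = round(scale * math.log2(observed / expected))
        scores[(a, b)] = scores[(b, a)] = s
    return scores
```

Pairs seen more often than chance receive positive scores and pairs seen less often receive negative scores, which is the property a homology search relies on.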
Multivariate partial least square (PLS) regression allows the modeling of complex biological events by considering different factors at the same time. It is unaffected by data collinearity, making it a valuable method for modeling high-dimensional biological data (as derived from genomics, proteomics and peptidomics). In the presence of multiple responses, it is of particular interest how to appropriately “dissect” the model to reveal the importance of single attributes with regard to individual responses (for example, variable selection). In this paper, the performance of multivariate PLS regression coefficients in selecting relevant predictors for different responses in omics-type data was investigated by means of a receiver operating characteristic (ROC) analysis. For this purpose, simulated data, mimicking the covariance structures of microarray and liquid chromatography mass spectrometric data, were used to generate matrices of predictors and responses. The relevant predictors were set a priori. The influences of noise, the source of data with different covariance structure and the size of relevant predictors were investigated. Results demonstrate the applicability of PLS regression coefficients in selecting variables for each response of a multivariate PLS in omics-type data. Comparisons with other feature selection methods, such as variable importance in the projection scores, principal component regression, and least absolute shrinkage and selection operator regression, were also provided.
partial least square regression; regression coefficients; variable selection; biomarker discovery; omics-data
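The ROC analysis mentioned above can be sketched as follows: given one importance score per predictor (for example, the absolute value of a regression coefficient for one response) and the set of predictors fixed as relevant a priori, the area under the ROC curve measures how well the scores rank relevant predictors above irrelevant ones. The function name and inputs are hypothetical, and fitting the PLS model itself is not shown.

```python
def roc_auc(scores, is_relevant):
    """Area under the ROC curve for ranking predictors by score.

    scores: one importance value per predictor.
    is_relevant: parallel booleans, True where the predictor was set as relevant.
    The result equals the probability that a randomly chosen relevant predictor
    outranks a randomly chosen irrelevant one (ties count as one half).
    """
    pos = [s for s, r in zip(scores, is_relevant) if r]
    neg = [s for s, r in zip(scores, is_relevant) if not r]
    if not pos or not neg:
        raise ValueError("need at least one relevant and one irrelevant predictor")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))
```

An area of 1.0 means every relevant predictor is ranked above every irrelevant one, while 0.5 corresponds to a ranking no better than chance.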
Identification of genes involved in the aging process is critical for understanding the mechanisms of age-dependent diseases such as cancer and diabetes. Longevity genes are traditionally identified by measuring the lifespan of mutants, each missing one gene. While such screening is impractical for the whole genome due to the time-consuming nature of lifespan assays, it can be achieved by in silico genetic manipulations with systems biology approaches. In this review, we will introduce pilot explorations applying two approaches of systems biology in aging studies. One approach is to predict the role of a specific gene in the aging process by comparing its expression profile and protein–protein interaction pattern with those of known longevity genes (top-down systems biology). The other approach is to construct mathematical models from previous kinetics data and predict how a specific protein contributes to aging and antiaging processes (bottom-up systems biology). These approaches allow researchers to simulate the effect of each gene’s product in aging by in silico genetic manipulations such as deletion or over-expression. Since simulation-based approaches are not as widely used as the other approaches, we will focus our review on this effort in more detail. A combination of hypotheses from data-mining, in silico experimentation from simulations, and wet laboratory validation will make the systematic identification of all longevity genes possible.
systems biology; yeast; aging; in silico; genetic manipulation; modeling
Probabilistic DNA sequence models have been intensively applied to genome research. Within the evolutionary biology framework, this article investigates the feasibility of rigorously estimating the probability of a set of orthologous DNA sequences that evolve from a common progenitor. We propose Monte Carlo integration algorithms to sample the unknown ancestral and/or root sequences a posteriori conditional on a reference sequence and apply pairwise Needleman–Wunsch alignment between the sampled and nonreference species sequences to estimate the probability. We test our algorithms on both simulated and real sequences and compare calculated probabilities from Monte Carlo integration to those induced by a single multiple alignment.
evolution; Jukes-Cantor model; Monte Carlo integration; Needleman-Wunsch alignment; orthologous
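The pairwise Needleman–Wunsch step referred to above can be sketched as a standard dynamic program for the optimal global alignment score. The scoring values used here (match 1, mismatch -1, linear gap penalty -1) are illustrative assumptions and are not taken from the article; the Monte Carlo sampling of ancestral sequences is not reproduced.

```python
def needleman_wunsch_score(a, b, match=1, mismatch=-1, gap=-1):
    """Optimal global alignment score of sequences a and b (linear gap cost)."""
    # prev[j] holds the best score for aligning the processed prefix of a
    # with the first j characters of b.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # aligning a[:i] against an empty prefix of b
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Best of: align the two characters, gap in b, gap in a.
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]
```

Only two rows of the table are kept because the sketch returns the score alone; recovering the alignment itself would require the full table or traceback pointers.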
Simple sequence repeats (SSRs) play important roles in gene regulation and genome evolution. Although there exist several online resources for SSR mining, most of them only extract general SSR patterns without providing functional information. Here, an online search tool, CG-SSR (Comparative Genomics SSR discovery), has been developed for discovering potential functional SSRs from vertebrate genomes through cross-species comparison. In addition to revealing SSR candidates in conserved regions among various species, it also combines accurate coordinate and functional genomics information. CG-SSR is the first comprehensive and efficient online tool for conserved SSR discovery.
microsatellites; genome; comparative genomics; functional SSR; gene ontology; conserved region
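The basic SSR extraction step that such tools build on can be sketched with a regular expression that finds perfect tandem repeats. This is a minimal illustration only and is not the CG-SSR implementation: the function name, the maximum unit length of 6 and the minimum of 3 repeats are assumptions, and the cross-species conservation and functional annotation steps are not shown.

```python
import re

def find_ssrs(sequence, max_unit=6, min_repeats=3):
    """Return (start, unit, repeat_count) for perfect tandem repeats.

    The unit is matched lazily, so the shortest repeating unit is reported
    (for example "A" rather than "AA" in a run of adenines).
    """
    pattern = re.compile(r"([ACGT]{1,%d}?)\1{%d,}" % (max_unit, min_repeats - 1))
    hits = []
    for m in pattern.finditer(sequence.upper()):
        unit = m.group(1)
        hits.append((m.start(), unit, len(m.group(0)) // len(unit)))
    return hits
```

Real SSR miners usually apply a different minimum repeat count to each unit length and may tolerate imperfect repeats; a single threshold is used here for brevity.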
Bladder cancer is relatively common but early detection techniques such as cystoscopy and cytology are somewhat limited. We developed a broadly applicable, platform-independent and clinically relevant method based on simple ratios of gene expression to diagnose human cancers. In this study, we sought to determine whether this technique could be applied to the diagnosis of bladder cancer.
We developed a model for the diagnosis of bladder cancer using expression profiling data from 80 normal and tumor bladder tissues to identify statistically significant discriminating genes with reciprocal average expression levels in each tissue type. The expression levels of select genes were used to calculate individual gene pair expression ratios in order to assign diagnosis. The optimal model was examined in two additional published microarray data sets and using quantitative RT-PCR in a cohort of 13 frozen benign bladder urothelium samples and 13 bladder cancer samples from our institution.
A five-ratio test utilizing six genes proved to be 100% accurate (26 of 26 samples) for distinguishing benign from malignant bladder tissue samples (P < 10^−6).
We have provided a proof of principle study for the use of gene expression ratios in the diagnosis of bladder cancer. This technique may ultimately prove to be a useful adjunct to cytopathology in screening urine specimens for bladder cancer.
bladder cancer; gene expression profiling; diagnosis
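A gene-pair ratio test of the kind described above can be sketched as follows. Each pair divides the expression of a gene that is on average higher in tumor by that of a gene that is on average higher in normal tissue. Combining several ratios through their geometric mean and calling a sample a tumor when the mean exceeds 1 is an assumption made for this sketch; the abstract does not specify the combination rule, and the gene names used in any call are hypothetical placeholders.

```python
import math

def ratio_diagnosis(expression, gene_pairs):
    """Classify one sample from gene-pair expression ratios.

    expression: {gene: positive expression level} for one sample.
    gene_pairs: (tumor_high_gene, normal_high_gene) tuples.
    Returns ("tumor" or "benign", geometric mean of the ratios).
    """
    log_sum = 0.0
    for tumor_gene, normal_gene in gene_pairs:
        # Work in log space so the geometric mean is a simple average.
        log_sum += math.log(expression[tumor_gene] / expression[normal_gene])
    geo_mean = math.exp(log_sum / len(gene_pairs))
    return ("tumor" if geo_mean > 1.0 else "benign"), geo_mean
```

Because the test uses ratios within a single sample, it does not depend on the absolute scale of the measurement platform, which is consistent with the platform independence claimed for the method.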
A system was developed to evaluate and predict the interaction between protein pairs by using the widely used shape complementarity search method as the algorithm for docking simulations between the proteins. We used this system, which we call the affinity evaluation and prediction (AEP) system, to evaluate the interaction between 20 protein pairs. The system first executes a “round robin” shape complementarity search of the target protein group, and evaluates the interaction between the complex structures obtained by the search. These complex structures are selected by using a statistical procedure that we developed called ‘grouping’. At a prevalence of 5.0%, our AEP system predicted protein–protein interactions with a 50.0% recall, 55.6% precision, 95.5% accuracy, and an F-measure of 0.526. By optimizing the grouping process, our AEP system successfully predicted 10 protein pairs (among 20 pairs) that were biologically relevant combinations. Our ultimate goal is to construct an affinity database that will provide cell biologists and drug designers with crucial information obtained using our AEP system.
protein-protein interaction; affinity analysis; protein-protein docking; FFT; massive parallel computing
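The reported performance figures are related to one another through the confusion matrix, as the sketch below shows. The counts used in the example are hypothetical: the abstract does not give them, and they were chosen only because 400 candidate pairings containing 20 true pairs (5.0% prevalence), with 18 predictions of which 10 are correct, reproduce the stated recall, precision, accuracy and F-measure.

```python
def classification_metrics(tp, fp, fn, tn):
    """Standard binary classification metrics from confusion-matrix counts."""
    total = tp + fp + fn + tn
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    return {
        "prevalence": (tp + fn) / total,
        "recall": recall,
        "precision": precision,
        "accuracy": (tp + tn) / total,
        "f_measure": 2 * precision * recall / (precision + recall),
    }

# Hypothetical counts (not from the abstract): 400 pairings, 20 true pairs,
# 18 predicted interactions of which 10 are correct.
metrics = classification_metrics(tp=10, fp=8, fn=10, tn=372)
```

At such a low prevalence the accuracy is dominated by true negatives, which is why recall, precision and the F-measure are the more informative figures.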
It is expected that different markers may show different patterns of association with different pathogenic variants within a given gene. It would be helpful to combine the evidence implicating association at the level of the whole gene rather than just for individual markers or haplotypes. Doing this is complicated by the fact that different markers do not represent independent sources of information.
We propose combining the p values from all single locus and/or multilocus analyses of different markers according to the formula of Fisher, X = ∑(−2 ln(p_i)), and then assessing the empirical significance of this statistic using permutation testing. We present an example application to 19 markers around the HTRA2 gene in a case-control study of Parkinson’s disease.
Applying our approach shows that, although some individual tests produce low p values, overall association at the level of the gene is not supported.
Approaches such as this should be more widely used in assimilating the overall evidence supporting involvement of a gene in a particular disease. Information can be combined from biallelic and multiallelic markers and from single markers along with multimarker analyses. Single genes can be tested or results from groups of genes involved in the same pathway could be combined in order to test biologically relevant hypotheses. The approach has been implemented in a computer program called COMBASSOC which is made available for downloading.
Fisher; significance; genetic marker
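The combination of p values by Fisher's formula, followed by permutation testing, can be sketched as below. This is not the COMBASSOC program: the function names are hypothetical, and `p_values_for` stands in for whatever single-marker and multimarker tests are run, taking the case-control labels and returning one p value per analysis. Permuting the labels and recomputing every p value preserves the correlation between markers, which is why the empirical distribution is used rather than the chi-squared distribution.

```python
import math
import random

def fisher_statistic(p_values):
    """Fisher's combined statistic: the sum of -2 ln(p_i). Each p_i must be > 0."""
    return sum(-2.0 * math.log(p) for p in p_values)

def empirical_significance(labels, p_values_for, n_permutations=1000, seed=0):
    """Return (observed statistic, empirical p value) by label permutation.

    labels: case-control status per subject.
    p_values_for: hypothetical callable mapping labels to a list of p values.
    """
    rng = random.Random(seed)
    observed = fisher_statistic(p_values_for(labels))
    shuffled = list(labels)
    exceed = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)  # break any association with phenotype
        if fisher_statistic(p_values_for(shuffled)) >= observed:
            exceed += 1
    # Adding 1 to numerator and denominator avoids an empirical p value of zero.
    return observed, (exceed + 1) / (n_permutations + 1)
```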
A method was constructed for discriminating between biologically relevant interfaces and artificial crystal-packing contacts in crystal structures. The method evaluates protein-protein interfaces in terms of complementarities for hydrophobicity, electrostatic potential and shape on the protein surfaces, and chooses the most probable biological interfaces among all possible contacts in the crystal. The method uses a discriminator named “COMP”, which is a linear combination of the complementarities for the above three surface features and does not correlate with the contact area. The discrimination of homo-dimer interfaces from symmetry-related crystal-packing contacts based on the COMP value achieved a modest success rate. Subsequent detailed review of the discrimination results raised the success rate to about 88.8%. In addition, our discrimination method yielded some clues for understanding the interaction patterns in several examples in the PDB. Thus, the COMP discriminator can also be used as an indicator of the “biological-ness” of protein-protein interfaces.
protein-protein interaction; complementarity analysis; homo-dimer interface; crystal-packing contact; biological interfaces
There is a need to identify the regulatory gene interactions through which anticancer drugs act on target cancer cells. Whole genome expression profiling offers promise in this regard, but is complicated by the challenge of identifying the relevant genes among the hundreds to thousands whose expression changes. A proteasome inhibitor, bortezomib, could be a potential therapeutic agent in treating adult T-cell leukemia (ATL) patients; however, the underlying mechanism by which bortezomib induces cell death in ATL cells via a gene regulatory network has not been fully elucidated. Here we show that a Bayesian statistical framework by VoyaGene® identified the secreted protein acidic and rich in cysteine (SPARC) gene, a tumor-invasiveness related gene, as a possible modulator of bortezomib-induced cell death in ATL cells. Functional analysis using RNAi experiments revealed that inhibition of SPARC expression by siRNA enhanced the apoptotic effect of bortezomib on ATL cells in accordance with an increase of cleaved caspase 3. Targeting SPARC may help to treat ATL patients in combination with bortezomib. This work shows that a network biology approach can be used advantageously to identify the genetic interaction related to anticancer effects.
network biology; adult T cell leukemia; bortezomib; SPARC
Mobile phone technology makes use of radio frequency (RF) electromagnetic fields transmitted through a dense network of base stations in Europe. Possible harmful effects of RF fields on humans and animals are discussed, but their effect on plants has received little attention. In search of physiological processes of plant cells sensitive to RF fields, cell suspension cultures of Arabidopsis thaliana were exposed for 24 h to an RF field protocol representing typical microwave exposure in an urban environment. mRNA of exposed cultures and controls was used to hybridize Affymetrix-ATH1 whole genome microarrays. Differential expression analysis revealed significant changes in transcription of 10 genes, but they did not exceed a fold change of 2.5. Apart from the fact that 3 of them are dark-inducible, their functions do not point to any known responses of plants to environmental stimuli. The changes in transcription of these genes were compared with published microarray datasets, revealing a weak similarity of the microwave to light treatment experiments. Considering the large changes described in published experiments, it is questionable if the small alterations caused by a 24 h continuous microwave exposure would have any impact on the growth and reproduction of whole plants.
suspension cultured plant cells; radio frequency electromagnetic fields; microarrays; Arabidopsis thaliana
The microtubule network, the major organelle of the eukaryotic cytoskeleton, is involved in cell division and differentiation as well as in many other cellular functions. In plants, microtubules seem to be involved in the ordered deposition of cellulose microfibrils by a so far unknown mechanism. Microtubule-associated proteins (MAPs) typically contain various domains targeting or binding proteins with different functions to microtubules. Here we have investigated a proposed microtubule-targeting domain, TPX2, first identified in the Kinesin-like protein 2 in Xenopus. A TPX2-containing microtubule binding protein, PttMAP20, has recently been identified in poplar tissues undergoing xylogenesis. Furthermore, the herbicide 2,6-dichlorobenzonitrile (DCB), which is a known inhibitor of cellulose synthesis, was shown to bind specifically to PttMAP20. It is thus possible that PttMAP20 may have a role in coupling cellulose biosynthesis and the microtubular networks in poplar secondary cell walls. In order to gain more insight into the occurrence, evolution and potential functions of TPX2-containing proteins, we have carried out a bioinformatic analysis of all genes so far found to encode TPX2 domains, with special reference to poplar PttMAP20 and its putative orthologs in other plants.
TPX2 domain; MAP20; evolution; microtubule; cellulose; bioinformatics
Prion diseases are fatal neurodegenerative disorders that affect animals and humans. There is a need to gain understanding of prion disease pathogenesis and to develop diagnostic assays to detect prion diseases prior to the onset of clinical symptoms. The goal of this study was to identify genes that show altered expression early in the disease process in the spleen and brain of prion disease-infected mice. Using Affymetrix microarrays, we identified 67 genes that showed increased expression in the brains of prion disease-infected mice prior to the onset of clinical symptoms. These genes function in many cellular processes including immunity, the endosome/lysosome system, hormone activity, and the cytoskeleton. We confirmed a subset of these gene expression alterations using other methods and determined the time course in which these changes occur. We also identified 14 genes showing altered expression prior to the onset of clinical symptoms in spleens of prion disease-infected mice. Interestingly, four genes, Atp1b1, Gh, Anp32a, and Grn, were altered at the very early time of 46 days post-infection. These gene expression alterations provide insights into the molecular mechanisms underlying prion disease pathogenesis and may serve as surrogate markers for the early detection and diagnosis of prion disease.
prion disease; microarrays; gene expression
We examined procedures for combining two different in silico drug-screening results to achieve a high hit ratio. When the 3D structure of the target protein and some active compounds are known, both structure-based and ligand-based in silico screening methods can be applied. In the present study, the machine-learning score modification multiple target screening (MSM-MTS) method was adopted as a structure-based screening method, and the machine-learning docking score index (ML-DSI) method was adopted as a ligand-based screening method. To combine the compound sets predicted by these two screening methods, we examined the product of the sets (consensus set) and the sum of the sets. As a result, the consensus set achieved a higher hit ratio than either the sum of the sets or the individual predicted sets. In addition, the current combination was shown to be robust to structural diversity, both across different crystal structures and across snapshot structures from molecular dynamics simulations.
in silico; screening; consensus score; protein-based screening; protein-ligand docking; conformation of active site
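The two ways of combining predicted compound sets described above correspond to set intersection (the consensus set) and set union (the sum of the sets). The sketch below shows how the hit ratio of each can be computed. The function names are hypothetical, and the screening methods themselves (MSM-MTS and ML-DSI) are not reproduced.

```python
def hit_ratio(predicted, actives):
    """Fraction of predicted compounds that are known actives."""
    return len(predicted & actives) / len(predicted) if predicted else 0.0

def combine_predictions(structure_based, ligand_based):
    """Return (consensus set, sum of the sets) for two predicted compound sets."""
    consensus = structure_based & ligand_based  # predicted by both methods
    union = structure_based | ligand_based      # predicted by either method
    return consensus, union
```

The consensus set is never larger than either input, so a higher hit ratio comes at the cost of fewer candidate compounds, while the sum of the sets trades hit ratio for coverage.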
In genomics studies, it is essential to select a small number of genes that are more significant than the others for research ranging from candidate gene studies to genome-wide association studies. In this study, we propose a Bayesian method for identifying the promising candidate genes that are significantly more influential than the others. We employed the framework of variable selection and a Gibbs sampling based technique to identify significant genes. The proposed approach was applied to a genomics study for persons with chronic fatigue syndrome. Our studies show that the proposed Bayesian methodology is effective for deriving models for genomic studies and for providing information on significant genes.
Bayesian variable selection; genomics; Gibbs sampling; variable selection
Binarization is often recognized to be one of the most important steps in most high-level image analysis systems, particularly for object recognition. Its accuracy largely determines the performance of the entire system. According to many researchers, segmentation finishes when the observer’s goal is satisfied. Experience has shown that the most effective methods continue to be the iterative ones. However, a problem with these algorithms is the stopping criterion. In this work, entropy is used as the stopping criterion when segmenting an image by recursively applying mean shift filtering. In this way, a new algorithm is introduced for the binarization of medical images, where the binarization is carried out after the segmented image has been obtained. The good performance of the proposed method, that is, the good quality of the binarization, is illustrated with several experimental results. The results obtained with this new algorithm are compared with those of another algorithm developed previously by the author and collaborators, and with those of Otsu’s method.
image segmentation; mean shift; algorithm; entropy; Otsu’s method
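The idea of using entropy as a stopping criterion for an iterative filter can be sketched as follows. The loop stops once the Shannon entropy of the gray-level histogram changes by no more than a tolerance between successive iterations. The function names, the tolerance and the iteration cap are assumptions, and `filter_step` is a placeholder for mean shift filtering, which is not implemented here.

```python
import math
from collections import Counter

def shannon_entropy(pixels):
    """Shannon entropy (in bits) of the gray-level histogram of a flat pixel list."""
    counts = Counter(pixels)
    n = len(pixels)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def filter_until_entropy_stable(pixels, filter_step, tol=1e-3, max_iter=50):
    """Apply filter_step repeatedly until the entropy change is at most tol.

    filter_step: hypothetical callable taking and returning a flat pixel list.
    Returns (filtered pixels, number of iterations performed).
    """
    current = list(pixels)
    entropy = shannon_entropy(current)
    for iteration in range(1, max_iter + 1):
        current = filter_step(current)
        new_entropy = shannon_entropy(current)
        if abs(entropy - new_entropy) <= tol:
            return current, iteration
        entropy = new_entropy
    return current, max_iter
```

Each filtering pass merges gray levels and so tends to lower the entropy; once the entropy stops changing, further passes no longer simplify the image, which makes it a natural point to stop and binarize.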