Related Articles

1.  Computational prediction of human proteins that can be secreted into the bloodstream 
Bioinformatics  2008;24(20):2370-2375.
We present a novel computational method for predicting which proteins from highly and abnormally expressed genes in diseased human tissues, such as cancers, can be secreted into the bloodstream, suggesting possible marker proteins for follow-up serum proteomic studies. A major challenge in tackling this problem is that our understanding of the downstream localization after proteins are secreted outside the cells is very limited and not sufficient to provide useful hints about secretion to the bloodstream. To bypass this difficulty, we have taken a data mining approach by first collecting, through extensive literature searches, human proteins that are known to be secreted into the bloodstream due to various pathological conditions as detected by previous proteomic studies, and then asking the question: ‘what do these secreted proteins have in common in terms of their physical and chemical properties, amino acid sequence and structural features that can be used to predict them?’ We have identified a list of features, such as signal peptides, transmembrane domains, glycosylation sites, disordered regions, secondary structural content, hydrophobicity and polarity measures that show relevance to protein secretion. Using these features, we have trained a support vector machine-based classifier to predict protein secretion to the bloodstream. On a large test set containing 98 secretory and 6601 non-secretory human proteins, our classifier achieved ∼90% prediction sensitivity and ∼98% prediction specificity. Several additional datasets are used to further assess the performance of our classifier. On a set of 122 proteins that were found to be of abnormally high abundance in human blood due to various cancers, our program predicted 62 as blood-secreted proteins. 
By applying our program to abnormally highly expressed genes in gastric cancer and lung cancer tissues detected through microarray gene expression studies, we predicted 13 and 31 as blood secreted, respectively, suggesting that they could serve as potential biomarkers for these two cancers. Our study demonstrated that our method can provide highly useful information to link genomic and proteomic studies for disease biomarker discovery. Our software can be accessed at
Supplementary information: Supplementary data are available at Bioinformatics online.
PMCID: PMC2562011  PMID: 18697770
2.  De novo computational prediction of non-coding RNA genes in prokaryotic genomes 
Bioinformatics  2009;25(22):2897-2905.
Motivation: The computational identification of non-coding RNA (ncRNA) genes represents one of the most important and challenging problems in computational biology. Existing methods for ncRNA gene prediction rely mostly on homology information, thus limiting their applications to ncRNA genes with known homologues.
Results: We present a novel de novo prediction algorithm for ncRNA genes using features derived from the sequences and structures of known ncRNA genes in comparison to decoys. Using these features, we have trained a neural network-based classifier and have applied it to Escherichia coli and Sulfolobus solfataricus for genome-wide prediction of ncRNAs. Our method has an average prediction sensitivity and specificity of 68% and 70%, respectively, for identifying windows with potential for ncRNA genes in E.coli. By combining windows of different sizes and using positional filtering strategies, we predicted 601 candidate ncRNAs and recovered 41% of known ncRNAs in E.coli. We experimentally investigated six novel candidates using Northern blot analysis and found expression of three candidates: one represents a potential new ncRNA, one is associated with stable mRNA decay intermediates and one is a case of either a potential riboswitch or transcription attenuator involved in the regulation of cell division. In general, our approach enables the identification of both cis- and trans-acting ncRNAs in partially or completely sequenced microbial genomes without requiring homology or structural conservation.
Availability: The source code and results are available at
Supplementary information: Supplementary data are available at Bioinformatics online.
PMCID: PMC2773258  PMID: 19744996
3.  cBar: a computer program to distinguish plasmid-derived from chromosome-derived sequence fragments in metagenomics data 
Bioinformatics  2010;26(16):2051-2052.
Summary: A huge amount of metagenomic sequence data has been produced as a result of the rapidly increasing efforts worldwide in studying microbial communities as a whole. Most, if not all, sequenced metagenomes are complex mixtures of chromosomal and plasmid sequence fragments from multiple organisms, possibly from different kingdoms. Computational methods for prediction of genomic elements such as genes are significantly different for chromosomes and plasmids, hence raising the need for separation of chromosomal from plasmid sequences in a metagenome. We present a program for classification of a metagenome set into chromosomal and plasmid sequences, based on their distinguishing pentamer frequencies. On a large training set consisting of all the sequenced prokaryotic chromosomes and plasmids, the program achieves ∼92% classification accuracy. On a large set of simulated metagenomes with sequence lengths ranging from 300 bp to 100 kbp, the program has classification accuracy from 64.45% to 88.75%. On a large independent test set, the program achieves 88.29% classification accuracy.
Availability: The program has been implemented as a standalone prediction program, cBar, which is available at∼ffzhou/cBar
Supplementary information: Supplementary data are available at Bioinformatics online.
PMCID: PMC2916713  PMID: 20538725
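The pentamer-frequency features mentioned in the cBar abstract can be illustrated with a minimal sketch. The abstract does not describe how cBar counts, normalizes, or classifies its features, so the function name `kmer_frequencies`, the handling of ambiguous bases, and the normalization below are assumptions made for illustration only, not the program's actual implementation.

```python
from collections import Counter
from itertools import product


def kmer_frequencies(sequence, k=5):
    """Return the relative frequency of every DNA k-mer in `sequence`.

    Hypothetical helper: windows containing characters other than
    A, C, G, T (for example N) are ignored, which is an assumption.
    """
    sequence = sequence.upper()
    # Count every overlapping window of length k.
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    # Enumerate all 4**k possible k-mers so the vector has a fixed length.
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    total = sum(counts[m] for m in kmers)
    if total == 0:
        return {m: 0.0 for m in kmers}
    return {m: counts[m] / total for m in kmers}


freqs = kmer_frequencies("ACGTACGTAC")
print(len(freqs))  # 1024 features for k=5
```

A fixed-length vector of this kind (1024 values for pentamers) is the sort of input a generic classifier could be trained on to separate two sequence classes; which classifier cBar uses is not stated in the abstract.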
4.  PROSPECT-PSPP: an automatic computational pipeline for protein structure prediction 
Nucleic Acids Research  2004;32(Web Server issue):W522-W525.
Knowledge of the detailed structure of a protein is crucial to our understanding of the biological functions of that protein. The gap between the number of solved protein structures and the number of protein sequences continues to widen rapidly in the post-genomics era due to long and expensive processes for solving structures experimentally. Computational prediction of structures from amino acid sequence has come to play a key role in narrowing the gap and has been successful in providing useful information for the biological research community. We have developed a prediction pipeline, PROSPECT-PSPP, an integration of multiple computational tools, for fully automated protein structure prediction. The pipeline consists of tools for (i) preprocessing of protein sequences, which includes signal peptide prediction, protein type prediction (membrane or soluble) and protein domain partition, (ii) secondary structure prediction, (iii) fold recognition and (iv) atomic structural model generation. The centerpiece of the pipeline is our threading-based program PROSPECT. The pipeline is implemented using SOAP (Simple Object Access Protocol), which makes it easier to share our tools and resources. The pipeline has an easy-to-use user interface and is implemented on a 64-node dual processor Linux cluster. It can be used for genome-scale protein structure prediction. The pipeline is accessible at
PMCID: PMC441552  PMID: 15215441
5.  A Role for the Lumenal Domain in Golgi Localization of the Saccharomyces cerevisiae Guanosine Diphosphatase 
Molecular Biology of the Cell  1998;9(6):1351-1365.
Integral membrane proteins (IMPs) contain localization signals necessary for targeting to their resident subcellular compartments. To define signals that mediate localization to the Golgi complex, we have analyzed a resident IMP of the Saccharomyces cerevisiae Golgi complex, guanosine diphosphatase (GDPase). GDPase, which is necessary for Golgi-specific glycosylation reactions, is a type II IMP with a short amino-terminal cytoplasmic domain, a single transmembrane domain (TMD), and a large catalytic lumenal domain. Regions specifying Golgi localization were identified by analyzing recombinant proteins either lacking GDPase domains or containing corresponding domains from type II vacuolar IMPs. Neither deletion nor substitution of the GDPase cytoplasmic domain perturbed Golgi localization. Exchanging the GDPase TMD with vacuolar protein TMDs only marginally affected Golgi localization. Replacement of the lumenal domain resulted in mislocalization of the chimeric protein from the Golgi to the vacuole, but a similar substitution leaving 34 amino acids of the GDPase lumenal domain intact was properly localized. These results identify a major Golgi localization determinant in the membrane-adjacent lumenal region (stem) of GDPase. Although necessary, the stem domain is not sufficient to mediate localization; in addition, a membrane-anchoring domain and either the cytoplasmic or full-length lumenal domain must be present to maintain Golgi residence. The importance of lumenal domain sequences in GDPase Golgi localization and the requirement for multiple hydrophilic protein domains support a model for Golgi localization invoking protein–protein interactions rather than interactions between the TMD and the lipid bilayer.
PMCID: PMC25355  PMID: 9614179
6.  A Comprehensive Comparison of Transmembrane Domains Reveals Organelle-Specific Properties 
Cell  2010;142(1):158-169.
The various membranes of eukaryotic cells differ in composition, but it is at present unclear if this results in differences in physical properties. The sequences of transmembrane domains (TMDs) of integral membrane proteins should reflect the physical properties of the bilayers in which they reside. We used large datasets from both fungi and vertebrates to perform a comprehensive comparison of the TMDs of proteins from different organelles. We find that TMDs are not generic but have organelle-specific properties with a dichotomy in TMD length between the early and late parts of the secretory pathway. In addition, TMDs from post-ER organelles show striking asymmetries in amino acid compositions across the bilayer that is linked to residue size and varies between organelles. The pervasive presence of organelle-specific features among the TMDs of a particular organelle has implications for TMD prediction, regulation of protein activity by location, and sorting of proteins and lipids in the secretory pathway.
Graphical Abstract
► Transmembrane domains (TMDs) vary in length and residue composition between organelles ► TMD lengths differ pre- versus post-Golgi but not between apical and basolateral surfaces ► The differences between TMDs are large enough to have value in predicting location ► Pervasive differences mean TMDs could collectively contribute to membrane properties
PMCID: PMC2928124  PMID: 20603021
7.  CINPER: An Interactive Web System for Pathway Prediction for Prokaryotes 
PLoS ONE  2012;7(12):e51252.
We present a web-based network-construction system, CINPER (CSBL INteractive Pathway BuildER), to help a user build a user-specified gene network for a prokaryotic organism in an intuitive manner. CINPER builds a network model based on different types of information provided by the user and stored in the system. CINPER’s prediction process has four steps: (i) collection of template networks based on (partially) known pathways of related organism(s) from the SEED or BioCyc database and the published literature; (ii) construction of an initial network model based on the template networks using the P-Map program; (iii) expansion of the initial model, based on the association information derived from operons, protein-protein interactions, co-expression modules and phylogenetic profiles; and (iv) computational validation of the predicted models based on gene expression data. To facilitate easy applications, CINPER provides an interactive visualization environment for a user to enter, search and edit relevant data and for the system to display (partial) results and prompt for additional data. Evaluation of CINPER on 17 well-studied pathways in the MetaCyc database shows that the program achieves an average recall rate of 76% and an average precision rate of 90% on the initial models, and a higher average recall rate of 87% and an average precision rate of 28% on the final models. The reduced precision rate in the final models versus the initial models reflects the reality that the final models have large numbers of novel genes that have no experimental evidence and hence are not yet collected in the MetaCyc database. To demonstrate the usefulness of this server, we have predicted an iron homeostasis gene network of Synechocystis sp. PCC6803 using the server. The predicted models along with the server can be accessed at
PMCID: PMC3517448  PMID: 23236458
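The recall and precision rates quoted in the CINPER abstract follow the standard set-based definitions, which can be sketched as below. The function name `precision_recall` and the gene identifiers are hypothetical and are not taken from CINPER or MetaCyc; the example only shows why adding genes that are missing from the reference set lowers precision even when recall improves.

```python
def precision_recall(predicted, reference):
    """Compare a predicted gene set against a reference gene set.

    precision = |predicted ∩ reference| / |predicted|
    recall    = |predicted ∩ reference| / |reference|
    """
    predicted, reference = set(predicted), set(reference)
    true_positives = len(predicted & reference)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    return precision, recall


# Hypothetical gene identifiers, for illustration only.
reference = {"g1", "g2", "g3", "g4"}
initial_model = {"g1", "g2", "g3"}
expanded_model = {"g1", "g2", "g3", "g4", "n1", "n2", "n3", "n4"}

print(precision_recall(initial_model, reference))   # (1.0, 0.75)
print(precision_recall(expanded_model, reference))  # (0.5, 1.0)
```

In this toy case the expanded model recovers every reference gene, but half of its members are absent from the reference, so they count against precision whether or not they are genuinely part of the pathway.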
8.  Mapping the Golgi Targeting and Retention Signal of Bunyamwera Virus Glycoproteins 
Journal of Virology  2004;78(19):10793-10802.
The membrane glycoproteins (Gn and Gc) of Bunyamwera virus (BUN; family Bunyaviridae) accumulate in the Golgi complex, where virion maturation occurs. The Golgi targeting and retention signal has previously been shown to reside within the Gn protein. A series of truncated Gn and glycoprotein precursor cDNAs were constructed by progressively deleting the coding region of the transmembrane domain (TMD) and the cytoplasmic tail. We also constructed chimeric proteins of BUN Gc, enhanced green fluorescent protein (EGFP), and human respiratory syncytial virus (HRSV) fusion (F) protein that contain the Gn TMD with various lengths of its adjacent cytoplasmic tails. The subcellular localization of mutated BUN glycoproteins and chimeric proteins was investigated by double-staining immunofluorescence with antibodies against BUN glycoproteins or the HRSV F protein and with antibodies specific for the Golgi complex. The results revealed that Gn and all truncated Gn proteins that contained the intact TMD (residues 206 to 224) were able to translocate to the Golgi complex and also rescued the Gc protein, which is retained in the endoplasmic reticulum when expressed alone, to this organelle. The rescued Gc proteins acquired endo-β-N-acetylglucosaminidase H resistance. The Gn TMD could also target chimeric EGFP to the Golgi and retain the F protein, which is characteristically expressed on the surface of HRSV-infected cells, in the Golgi. However, chimeric BUN Gc did not translocate to the Golgi, suggesting that an interaction with Gn is involved in Golgi retention of the Gc protein. Collectively, these data demonstrate that the Golgi targeting and retention signal of BUN glycoproteins resides in the TMD of the Gn protein.
PMCID: PMC516397  PMID: 15367646
9.  DMINDA: an integrated web server for DNA motif identification and analyses 
Nucleic Acids Research  2014;42(Web Server issue):W12-W19.
DMINDA (DNA motif identification and analyses) is an integrated web server for DNA motif identification and analyses, which is accessible at This web site is freely available to all users and there is no login requirement. This server provides a suite of cis-regulatory motif analysis functions on DNA sequences, which are important to elucidation of the mechanisms of transcriptional regulation: (i) de novo motif finding for a given set of promoter sequences along with statistical scores for the predicted motifs derived based on information extracted from a control set, (ii) scanning motif instances of a query motif in provided genomic sequences, (iii) motif comparison and clustering of identified motifs, and (iv) co-occurrence analyses of query motifs in given promoter sequences. The server is powered by a backend computer cluster with over 150 computing nodes, and is particularly useful for motif prediction and analyses in prokaryotic genomes. We believe that DMINDA, as a new and comprehensive web server for cis-regulatory motif finding and analyses, will benefit the genomic research community in general and prokaryotic genome researchers in particular.
PMCID: PMC4086085  PMID: 24753419
10.  The Transmembrane Domain of the Severe Acute Respiratory Syndrome Coronavirus ORF7b Protein Is Necessary and Sufficient for Its Retention in the Golgi Complex 
Journal of Virology  2008;82(19):9477-9491.
The severe acute respiratory syndrome coronavirus (SARS-CoV) ORF7b (also called 7b) protein is an integral membrane protein that is translated from a bicistronic open reading frame encoded within subgenomic RNA 7. When expressed independently or during virus infection, ORF7b accumulates in the Golgi compartment, colocalizing with both cis- and trans-Golgi markers. To identify the domains of this protein that are responsible for Golgi localization, we have generated a set of mutant proteins and analyzed their subcellular localizations by indirect immunofluorescence confocal microscopy. The N- and C-terminal sequences are dispensable, but the ORF7b transmembrane domain (TMD) is essential for Golgi compartment localization. When the TMD of human CD4 was replaced with the ORF7b TMD, the resulting chimeric protein localized to the Golgi complex. Scanning alanine mutagenesis identified two regions in the carboxy-terminal portion of the TMD that eliminated the Golgi complex localization of the chimeric CD4 proteins or ORF7b protein. Collectively, these data demonstrate that the Golgi complex retention signal of the ORF7b protein resides solely within the TMD.
PMCID: PMC2546951  PMID: 18632859
11.  A retention signal necessary and sufficient for Golgi localization maps to the cytoplasmic tail of a Bunyaviridae (Uukuniemi virus) membrane glycoprotein. 
Journal of Virology  1997;71(6):4717-4727.
Members of the Bunyaviridae family mature by a budding process in the Golgi complex. The site of maturation is thought to be largely determined by the accumulation of the two spike glycoproteins, G1 and G2, in this organelle. Here we show that the signal for localizing the Uukuniemi virus (a phlebovirus) spike protein complex to the Golgi complex resides in the cytoplasmic tail of G1. We constructed chimeric proteins in which the ectodomain, transmembrane domain (TMD), and cytoplasmic tail (CT) of Uukuniemi virus G1 were exchanged with the corresponding domains of either vesicular stomatitis virus G protein (VSV G), chicken lysozyme, or CD4, all proteins readily transported to the plasma membrane. The chimeras were expressed in HeLa or BHK-21 cells by using either the T7 RNA polymerase-driven vaccinia virus system or the Semliki Forest virus system. The fate of the chimeric proteins was monitored by indirect immunofluorescence, and their localizations were compared by double labeling with markers specific for the Golgi complex. The results showed that the ectodomain and TMD (including the 10 flanking residues on either side of the membrane) of G1 played no apparent role in targeting chimeric proteins to the Golgi complex. Instead, all chimeras containing the CT of G1 were efficiently targeted to the Golgi complex and colocalized with mannosidase II, a Golgi-specific enzyme. Conversely, replacing the CT of G1 with that from VSV G resulted in the efficient transport of the chimeric protein to the cell surface. Progressive deletions of the G1 tail suggested that the Golgi retention signal maps to a region encompassing approximately residues 10 to 50, counting from the proposed border between the TMD and the tail. Both G1 and G2 were found to be acylated, as shown by incorporation of [3H]palmitate into the viral proteins. By mutational analyses of CD4-G1 chimeras, the sites for palmitylation were mapped to two closely spaced cysteine residues in the G1 tail. 
Changing either or both of these cysteines to alanine had no effect on the targeting of the chimeric protein to the Golgi complex.
PMCID: PMC191693  PMID: 9151865
12.  dbCAN: a web resource for automated carbohydrate-active enzyme annotation 
Nucleic Acids Research  2012;40(Web Server issue):W445-W451.
Carbohydrate-active enzymes (CAZymes) are very important to the biotech industry, particularly the emerging biofuel industry because CAZymes are responsible for the synthesis, degradation and modification of all the carbohydrates on Earth. We have developed a web resource, dbCAN (, to provide a capability for automated CAZyme signature domain-based annotation for any given protein data set (e.g. proteins from a newly sequenced genome) submitted to our server. To accomplish this, we have explicitly defined a signature domain for every CAZyme family, derived based on the CDD (conserved domain database) search and literature curation. We have also constructed a hidden Markov model to represent the signature domain of each CAZyme family. These CAZyme family-specific HMMs are our key contribution and the foundation for the automated CAZyme annotation.
PMCID: PMC3394287  PMID: 22645317
13.  DOOR 2.0: presenting operons and their functions through dynamic and integrated views 
Nucleic Acids Research  2013;42(Database issue):D654-D659.
We have recently developed a new version of the DOOR operon database, DOOR 2.0, which is available online at and will be updated on a regular basis. DOOR 2.0 contains genome-scale operons for 2072 prokaryotes with complete genomes, three times the number of genomes covered in the previous version published in 2009. DOOR 2.0 has a number of new features, compared with its previous version, including (i) more than 250 000 transcription units, experimentally validated or computationally predicted based on RNA-seq data, providing a dynamic functional view of the underlying operons; (ii) an integrated operon-centric data resource that provides not only operons for each covered genome but also their functional and regulatory information such as their cis-regulatory binding sites for transcription initiation and termination, gene expression levels estimated based on RNA-seq data and conservation information across multiple genomes; (iii) a high-performance web service for online operon prediction on user-provided genomic sequences; (iv) an intuitive genome browser to support visualization of user-selected data; and (v) a keyword-based Google-like search engine for finding the needed information intuitively and rapidly in this database.
PMCID: PMC3965076  PMID: 24214966
14.  Sequence and overexpression of GPP130/GIMPc: evidence for saturable pH-sensitive targeting of a type II early Golgi membrane protein. 
Molecular Biology of the Cell  1997;8(6):1073-1087.
It is thought that residents of the Golgi stack are localized by a retention mechanism that prevents their forward progress. Nevertheless, some early Golgi proteins acquire late Golgi modifications. Herein, we describe GPP130 (Golgi phosphoprotein of 130 kDa), a 130-kDa phosphorylated and glycosylated integral membrane protein localized to the cis/medial Golgi. GPP130 appears to be the human counterpart of rat Golgi integral membrane protein, cis (GIMPc), a previously identified early Golgi antigen that acquires late Golgi carbohydrate modifications. The sequence of cDNAs encoding GPP130 indicates that it is a type II membrane protein with a predicted molecular weight of 81,880 and an unusually acidic lumenal domain. On the basis of the alignment with several rod-shaped proteins and the presence of multiple predicted coiled-coil regions, GPP130 may form a flexible rod in the Golgi lumen. In contrast to the behavior of previously studied type II Golgi proteins, overexpression of GPP130 led to a pronounced accumulation in endocytotic vesicles, and endogenous GPP130 reversibly redistributed to endocytotic vesicles after chloroquine treatment. Thus, localization of GPP130 to the early Golgi involves steps that are saturable and sensitive to lumenal pH, and GPP130 contains targeting information that specifies its return to the Golgi after chloroquine washout. Given that GIMPc acquires late Golgi modifications in untreated cells, it seems likely that GPP130/GIMPc continuously cycles between the early Golgi and distal compartments and that an unidentified retrieval mechanism is important for its targeting.
PMCID: PMC305715  PMID: 9201717
15.  An investigation of the effect of membrane curvature on transmembrane-domain dependent protein sorting in lipid bilayers 
Cellular Logistics  2014;4:e29087.
Sorting of membrane proteins within the secretory pathway of eukaryotic cells is a complex process involving discrete sorting signals as well as physico-chemical properties of the transmembrane domain (TMD). Previous work demonstrated that tail-anchored (TA) protein sorting at the interface between the Endoplasmic Reticulum (ER) and the Golgi complex is exquisitely dependent on the length and hydrophobicity of the transmembrane domain, and suggested that an imbalance between TMD length and bilayer thickness (hydrophobic mismatch) could drive long TMD-containing proteins into curved membrane domains, including ER exit sites, with consequent export of the mismatched protein out of the ER. Here, we tested a possible role of curvature in TMD-dependent sorting in a model system consisting of Giant Unilamellar Vesicles (GUVs) from which narrow membrane tubes were pulled by micromanipulation. Fluorescent TA proteins differing in TMD length were incorporated into GUVs of uniform lipid composition or made of total ER lipids, and TMD-dependent sorting and diffusion, as well as the bending rigidity of bilayers made of microsomal lipids, were investigated. Long and short TMD-containing constructs were inserted with similar orientation, diffused equally rapidly in GUVs and in tubes pulled from GUVs, and no difference in their final distribution between planar and curved regions was detected. These results indicate that curvature alone is not sufficient to drive TMD-dependent sorting at the ER-Golgi interface, and set the basis for the investigation of the additional factors that must be required.
PMCID: PMC4156485  PMID: 25210649
endoplasmic reticulum; giant unilamellar vesicles; hydrophobic mismatch; nanotubes; tail-anchored proteins; optical tweezers; bending rigidity
16.  A novel di-acidic motif facilitates ER export of the syntaxin SYP31 
Journal of Experimental Botany  2009;60(11):3157-3165.
It is generally accepted that ER protein export is largely influenced by the transmembrane domain (TMD). The situation is unclear for membrane-anchored proteins such as SNAREs, which are anchored to the membrane by their TMD at the C-terminus. For example, in plants, Sec22 and SYP31 (a yeast Sed5 homologue) have a 17 aa TMD but different locations (ER/Golgi and Golgi), indicating that TMD length alone is not sufficient to explain their targeting. To establish the identity of factors that influence SNARE targeting, mutagenesis and live cell imaging experiments were performed on SYP31. It was found that deletion of the entire N-terminus domain of SYP31 blocked the protein in the ER. Several deletion mutants of different parts of this N-terminus domain indicated that a region between the SNARE helices Hb and Hc is required for Golgi targeting. In this region, replacement of the aa sequence MELAD by GAGAG or MALAG retained the protein in the ER, suggesting that MELAD may function as a di-acidic ER export motif EXXD. This suggestion was further verified by replacing the established di-acidic ER export motif DLE of a type II Golgi protein AtCASP and a membrane-anchored type I chimaera, TMcCCASP, by MELAD or GAGAG. The MELAD motif allowed the proteins to reach the Golgi, whereas the motif GAGAG was found to be insufficient to facilitate ER protein export. Our analyses indicate that we have identified a novel and transplantable di-acidic motif that facilitates ER export of SYP31 and may function for type I and type II proteins in plants.
PMCID: PMC2718219  PMID: 19516076
Di-acidic motif; ER export; ER–Golgi interface; SNARE; syntaxin
17.  Integration of sequence-similarity and functional association information can overcome intrinsic problems in orthology mapping across bacterial genomes 
Nucleic Acids Research  2011;39(22):e150.
Existing methods for orthologous gene mapping suffer from two general problems: (i) they are computationally too slow and their results are difficult to interpret for automated large-scale applications when based on phylogenetic analyses; or (ii) they are too prone to making mistakes in dealing with complex situations involving horizontal gene transfers and gene fusion due to the lack of a sound basis when based on sequence similarity information. We present a novel algorithm, Global Optimization Strategy (GOST), for orthologous gene mapping through combining sequence similarity and contextual (working partners) information, using a combinatorial optimization framework. Genome-scale applications of GOST show substantial improvements over the predictions by three popular sequence similarity-based orthology mapping programs. Our analysis indicates that our algorithm overcomes the intrinsic issues faced by sequence similarity-based methods, when orthology mapping involves gene fusions and horizontal gene transfers. Our program runs as efficiently as the most efficient sequence similarity-based algorithm in the public domain. GOST is freely downloadable at
PMCID: PMC3239196  PMID: 21965536
18.  AST: An Automated Sequence-Sampling Method for Improving the Taxonomic Diversity of Gene Phylogenetic Trees 
PLoS ONE  2014;9(6):e98844.
A challenge in phylogenetic inference of gene trees is how to properly sample a large pool of homologous sequences to derive a good representative subset of sequences. Such a need arises in various applications, e.g. when (1) accuracy-oriented phylogenetic reconstruction methods may not be able to deal with a large pool of sequences due to their high demand in computing resources; (2) applications analyzing a collection of gene trees may prefer to use trees with fewer operational taxonomic units (OTUs), for instance for the detection of horizontal gene transfer events by identifying phylogenetic conflicts; and (3) the pool of available sequences is biased towards extensively studied species. In the past, the creation of subsamples often relied on manual selection. Here we present an Automated sequence-Sampling method for improving the Taxonomic diversity of gene phylogenetic trees, AST, to obtain representative sequences that maximize the taxonomic diversity of the sampled sequences. To demonstrate the effectiveness of AST, we have applied it to four problems, namely, inference of the evolutionary histories of the small ribosomal subunit protein S5 of E. coli, 16S ribosomal RNAs and glycosyl-transferase gene family 8, and a study of ancient horizontal gene transfers from bacteria to plants. Our results show that the resolution of our computational results is almost as good as that of manual inference by domain experts, hence making the tool generally useful to phylogenetic studies by non-phylogeny specialists. The program is available at
PMCID: PMC4044049  PMID: 24892935
19.  Barcode Server: A Visualization-Based Genome Analysis System 
PLoS ONE  2013;8(2):e56726.
We have previously developed a computational method for representing a genome as a barcode image, which makes various genomic features visually apparent. We have demonstrated that this visual capability has made some challenging genome analysis problems relatively easy to solve. We have applied this capability to a number of challenging problems, including (a) identification of horizontally transferred genes, (b) identification of genomic islands with special properties and (c) binning of metagenomic sequences, and achieved highly encouraging results. These application results inspired us to develop this barcode-based genome analysis server for public service, which supports the following capabilities: (a) calculation of the k-mer based barcode image for a provided DNA sequence; (b) detection of sequence fragments in a given genome with distinct barcodes from those of the majority of the genome, (c) clustering of provided DNA sequences into groups having similar barcodes; and (d) homology-based search using Blast against a genome database for any selected genomic regions deemed to have interesting barcodes. The barcode server provides a job management capability, allowing processing of a large number of analysis jobs for barcode-based comparative genome analyses. The barcode server is accessible at
PMCID: PMC3574017  PMID: 23457606
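As an illustration of capability (a) in entry 19, the following is a minimal sketch of a k-mer frequency matrix computed over non-overlapping windows of a DNA sequence, which is the general idea behind a barcode-style image. It is not the Barcode Server's implementation: the function name `kmer_barcode`, the default window size and k, and the choice to count each k-mer separately from its reverse complement are all assumptions made for this example.

```python
from collections import Counter
from itertools import product


def kmer_barcode(seq, k=4, window=1000):
    """Return (kmers, rows) for a DNA sequence.

    kmers: all 4**k k-mers over A/C/G/T in lexicographic order (the columns).
    rows:  one list of k-mer frequencies per non-overlapping window.
    Hypothetical helper; window size and k are illustrative defaults.
    """
    seq = seq.upper()
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    rows = []
    # A trailing fragment shorter than `window` is dropped.
    for start in range(0, len(seq) - window + 1, window):
        frag = seq[start:start + window]
        counts = Counter(frag[i:i + k] for i in range(len(frag) - k + 1))
        # Only k-mers made purely of A/C/G/T contribute to the total,
        # so windows containing ambiguous bases (e.g. N) still normalize.
        total = sum(counts[m] for m in kmers)
        rows.append([counts[m] / total if total else 0.0 for m in kmers])
    return kmers, rows


if __name__ == "__main__":
    kmers, rows = kmer_barcode("ACGT" * 5, k=2, window=10)
    print(len(rows), "windows x", len(kmers), "k-mers")
```

Each row can be rendered as one line of grey-level pixels, so windows whose composition differs from the rest of the genome stand out as visually distinct stripes.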
20.  GO-PROMTO Illuminates Protein Membrane Topologies of Glycan Biosynthetic Enzymes in the Golgi Apparatus of Living Tissues 
PLoS ONE  2012;7(2):e31324.
The Golgi apparatus is the main site of glycan biosynthesis in eukaryotes. Better understanding of the membrane topology of the proteins and enzymes involved can impart new mechanistic insights into these processes. Publicly available bioinformatic tools provide highly variable predictions of membrane topologies for given proteins. Therefore we devised a non-invasive experimental method by which the membrane topologies of Golgi-resident proteins can be determined in the Golgi apparatus in living tissues. A Golgi marker was used to construct a series of reporters based on the principle of bimolecular fluorescence complementation. The reporters and proteins of interest were recombinantly fused to split halves of yellow fluorescent protein (YFP) and transiently co-expressed with the reporters in the Nicotiana benthamiana leaf tissue. Output signals were binary, showing either the presence or absence of fluorescence with signal morphologies characteristic of the Golgi apparatus and endoplasmic reticulum (ER). The method allows prompt and robust determinations of membrane topologies of Golgi-resident proteins and is termed GO-PROMTO (for GOlgi PROtein Membrane TOpology). We applied GO-PROMTO to examine the topologies of proteins involved in the biosynthesis of plant cell wall polysaccharides including xyloglucan and arabinan. The results suggest the existence of novel biosynthetic mechanisms involving transport of intermediates across Golgi membranes.
PMCID: PMC3283625  PMID: 22363620
21.  DOOR: a database for prokaryotic operons 
Nucleic Acids Research  2008;37(Database issue):D459-D463.
We present a database DOOR (Database for prOkaryotic OpeRons) containing computationally predicted operons of all the sequenced prokaryotic genomes. All the operons in DOOR are predicted using our own prediction program, which was ranked to be the best among 14 operon prediction programs by a recent independent review. Currently, the DOOR database contains operons for 675 prokaryotic genomes, and supports a number of search capabilities to facilitate easy access and utilization of the information stored in it. Querying the database: the database provides a search capability for a user to find desired operons and associated information through multiple querying methods. Searching for similar operons: the database provides a search capability for a user to find operons that have similar composition and structure to a query operon. Prediction of cis-regulatory motifs: the database provides a capability for motif identification in the promoter regions of a user-specified group of possibly coregulated operons, using motif-finding tools. Operons for RNA genes: the database includes operons for RNA genes. OperonWiki: the database provides a wiki page (OperonWiki) to facilitate interactions between users and the developer of the database. We believe that DOOR provides a useful resource to many biologists working on bacteria and archaea, which can be accessed at
PMCID: PMC2686520  PMID: 18988623
22.  Six-transmembrane Topology for Golgi Anti-apoptotic Protein (GAAP) and Bax Inhibitor 1 (BI-1) Provides Model for the Transmembrane Bax Inhibitor-containing Motif (TMBIM) Family* 
The Journal of Biological Chemistry  2012;287(19):15896-15905.
Background: Golgi anti-apoptotic protein (GAAP) is a regulator of intracellular Ca2+ fluxes and apoptosis.
Results: The transmembrane topology of viral GAAP is conserved in human GAAP and BI-1.
Conclusion: GAAPs and BI-1 have a six membrane-spanning topology with cytosolic N and C termini and a C-terminal reentrant loop.
Significance: The topology of the TMBIM family provides valuable structural information on these proteins.
The Golgi anti-apoptotic protein (GAAP) is a hydrophobic Golgi protein that regulates intracellular calcium fluxes and apoptosis. GAAP is highly conserved throughout eukaryotes and some strains of vaccinia virus (VACV) and camelpox virus. Based on sequence, phylogeny, and hydrophobicity, GAAPs were classified within the transmembrane Bax inhibitor-containing motif (TMBIM) family. TMBIM members are anti-apoptotic and were predicted to have seven transmembrane domains (TMDs). However, topology prediction programs are inconsistent, predicting either six or seven TMDs for GAAP and other TMBIM members. To address this discrepancy, we mapped the transmembrane topology of viral GAAP (vGAAP) and human GAAP (hGAAP), as well as Bax inhibitor 1 (BI-1). Data presented show a six-, not seven-, transmembrane topology for vGAAP with a putative reentrant loop at the C terminus and both termini located in the cytosol. We find that this topology is also conserved in hGAAP and BI-1. This places the charged C terminus in the cytosol, and mutation of these charged residues in hGAAP ablated its anti-apoptotic function. Given the highly conserved hydrophobicity profile within the TMBIM family and recent phylogenetic data indicating that a GAAP-like protein may have been the ancestral progenitor of a subset of the TMBIM family, we propose that this vGAAP topology may be used as a model for the remainder of the TMBIM family of proteins. The topology described provides valuable information on the structure and function of an important but poorly understood family of proteins.
PMCID: PMC3346125  PMID: 22418439
Animal Viruses; Apoptosis; Calcium; Golgi; Membrane Proteins; Protein Domains
23.  The Golgi Localization of GOLPH2 (GP73/GOLM1) Is Determined by the Transmembrane and Cytoplamic Sequences 
PLoS ONE  2011;6(11):e28207.
Golgi phosphoprotein 2 (GOLPH2) is a resident Golgi type-II membrane protein upregulated in liver disease. Given that GOLPH2 traffics through endosomes and can be secreted into the circulation, it is a promising serum marker for liver diseases. The structure of GOLPH2 and the functions of its different protein domains are not known. In the current study, we investigated the structural determinants for Golgi localization using a panel of GOLPH2 truncation mutants. The Golgi localization of GOLPH2 was not affected by the deletion of the C-terminal part of the protein. A truncated mutant containing the N-terminal portion (the cytoplasmic tail and transmembrane domain (TMD)) localized to the Golgi. Sequential deletion analysis of the N-terminal portion indicated that the TMD together with a positively charged residue in the cytoplasmic N-terminal tail was sufficient to support Golgi localization. We also showed that both endogenous and secreted GOLPH2 exist as a disulfide-bonded dimer, and that the coiled-coil domain was sufficient for dimerization. This structural knowledge is important for understanding the pathogenic role of GOLPH2 in liver diseases, and for the development of GOLPH2-based hepatocellular cancer diagnostic methods.
PMCID: PMC3226628  PMID: 22140547
24.  In-silico prediction of blood-secretory human proteins using a ranking algorithm 
BMC Bioinformatics  2010;11:250.
Computational identification of blood-secretory proteins, especially proteins with differentially expressed genes in diseased tissues, can provide highly useful information in linking transcriptomic data to proteomic studies for targeted disease biomarker discovery in serum.
A new algorithm for prediction of blood-secretory proteins is presented using an information-retrieval technique called manifold ranking. On a dataset containing 305 known blood-secretory human proteins and a large number of other proteins that are either not blood-secretory or unknown, the new method performs better than the previously published method, measured in terms of the area under the recall-precision curve (AUC). A key advantage of the presented method is that it does not explicitly require a negative training set, which could often be noisy or difficult to derive for most biological problems, hence making our method more applicable than classification-based data mining methods in general biological studies.
We believe that our program will prove to be very useful to biomedical researchers who are interested in finding serum markers, especially when they have candidate proteins derived through transcriptomic or proteomic analyses of diseased tissues. A computer program is developed for prediction of blood-secretory proteins based on manifold ranking, which is accessible at our website
PMCID: PMC2877692  PMID: 20465853
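To make the manifold ranking idea in entry 24 concrete, here is a minimal sketch of the generic manifold ranking iteration: build an affinity graph over all items, normalize it symmetrically, and repeatedly spread scores from the known positives (the queries) to their neighbours. It is not the authors' program. The Gaussian kernel, the parameters `alpha`, `sigma` and `iters`, the function name `manifold_rank`, and the toy 2-D feature vectors are all assumptions chosen for illustration; no negative examples are needed, which matches the advantage the abstract describes.

```python
import math


def manifold_rank(points, query_idx, alpha=0.9, sigma=1.0, iters=200):
    """Score every point by its relevance to the query points.

    points:    list of equal-length numeric feature vectors (hypothetical features)
    query_idx: set of indices of known positives
    Iterates f <- alpha * S f + (1 - alpha) * y, with S = D^-1/2 W D^-1/2.
    """
    n = len(points)
    # Gaussian affinity matrix W with a zero diagonal (assumed kernel).
    W = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                d2 = sum((a - b) ** 2 for a, b in zip(points[i], points[j]))
                W[i][j] = math.exp(-d2 / (2 * sigma ** 2))
    deg = [sum(row) for row in W]
    # Symmetric normalization; entries with zero affinity stay zero.
    S = [[W[i][j] / math.sqrt(deg[i] * deg[j]) if W[i][j] else 0.0
          for j in range(n)] for i in range(n)]
    y = [1.0 if i in query_idx else 0.0 for i in range(n)]
    f = y[:]
    for _ in range(iters):
        f = [alpha * sum(S[i][j] * f[j] for j in range(n)) + (1 - alpha) * y[i]
             for i in range(n)]
    return f


if __name__ == "__main__":
    # Two toy clusters; index 0 is the only labelled positive.
    pts = [(0.0, 0.0), (0.1, 0.0), (0.2, 0.1), (5.0, 5.0), (5.1, 5.0)]
    for i, s in enumerate(manifold_rank(pts, {0})):
        print(i, round(s, 4))
```

Unlabelled items in the same cluster as the positive receive higher scores than items in the distant cluster, so sorting by score yields a ranked candidate list rather than a binary classification.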
25.  Genome-scale identification of cell-wall related genes in Arabidopsis based on co-expression network analysis 
BMC Plant Biology  2012;12:138.
Identification of novel genes relevant to plant cell-wall (PCW) synthesis represents a highly important and challenging problem. Although substantial efforts have been invested into studying this problem, the vast majority of the PCW related genes remain unknown.
Here we present a computational study focused on identification of novel PCW genes in Arabidopsis based on co-expression analyses of transcriptomic data collected under 351 conditions, using a bi-clustering technique. Our analysis identified 217 gene clusters (modules) that are highly co-expressed under some experimental conditions, each containing at least one gene annotated as PCW related according to the Purdue Cell Wall Gene Families database. These co-expression modules cover 349 known/annotated PCW genes and 2,438 new candidates. For each candidate gene, we annotated the specific PCW synthesis stages in which it is involved and predicted the detailed function. In addition, for the co-expressed genes in each module, we predicted and analyzed their cis regulatory motifs in the promoters using our motif discovery pipeline, providing strong evidence that the genes in each co-expression module are transcriptionally co-regulated. From all the co-expression modules, we infer that 108 modules are related to four major PCW synthesis components, using three complementary methods.
We believe our approach and data presented here will be useful for further identification and characterization of PCW genes. All the predicted PCW genes, co-expression modules, motifs and their annotations are available at a web-based database:
PMCID: PMC3463447  PMID: 22877077
Plant cell wall; Arabidopsis; Co-expression network analysis; Bi-clustering; Cis regulatory motifs
