1.  EGFR-Mediated Phosphorylation of Beclin 1 in Autophagy Suppression, Tumor Progression and Tumor Chemoresistance 
Cell  2013;154(6):1269-1284.
Cell surface growth factor receptors couple environmental cues to the regulation of cytoplasmic homeostatic processes including autophagy, and aberrant activation of such receptors is a common feature of human malignancies. Here, we defined the molecular basis by which the epidermal growth factor receptor (EGFR) tyrosine kinase regulates autophagy. Active EGFR binds to the autophagy protein Beclin 1, leading to its multisite tyrosine phosphorylation, enhanced binding to inhibitors, and decreased Beclin 1-associated Class III phosphatidylinositol-3 kinase activity. EGFR tyrosine kinase inhibitor (TKI) therapy disrupts Beclin 1 tyrosine phosphorylation and binding to its inhibitors, and restores autophagy in non-small cell lung carcinoma (NSCLC) cells with a TKI-sensitive EGFR mutation. In NSCLC tumor xenografts, the expression of a tyrosine phosphomimetic Beclin 1 mutant leads to reduced autophagy, enhanced tumor growth, tumor dedifferentiation, and resistance to TKI therapy. Thus, oncogenic receptor tyrosine kinases directly regulate the core autophagy machinery, which may contribute to tumor progression and chemoresistance.
PMCID: PMC3917713  PMID: 24034250
2.  A new Hermeuptychia (Lepidoptera, Nymphalidae, Satyrinae) is sympatric and synchronic with H. sosybius in southeast US coastal plains, while another new Hermeuptychia species – not hermes – inhabits south Texas and northeast Mexico 
ZooKeys  2014;43-91.
Hermeuptychia intricata Grishin, sp. n. is described from the Brazos Bend State Park in Texas, United States, where it flies synchronously with Hermeuptychia sosybius (Fabricius, 1793). The two species differ strongly in both male and female genitalia and exhibit 3.5% difference in the COI barcode sequence of mitochondrial DNA. Setting such significant genitalic and genotypic differences aside, we were not able to find reliable wing pattern characters to tell the two species apart. This superficial similarity may explain why H. intricata, only distantly related to H. sosybius, has remained unnoticed until now, despite being widely distributed in the coastal plains from South Carolina to Texas, USA (and possibly to Costa Rica). Obscuring the presence of a cryptic species even further, wing patterns are variable in both butterflies and ventral eyespots vary from large to almost absent. To avoid confusion with the new species, a neotype for Papilio sosybius Fabricius, 1793, a common butterfly that occurs across northeast US, is designated from Savannah, Georgia, USA. It secures the universally accepted traditional usage of this name. Furthermore, we find that DNA barcodes of Hermeuptychia specimens from the US, even those from extreme south Texas, are at least 4% different from those of H. hermes (Fabricius, 1775; type locality Brazil: Rio de Janeiro) and suggest that the name H. hermes should not be used for USA populations, but rather reserved for the South American species. This conclusion is further supported by comparison of male genitalia. However, facies, genitalia and 2.1% different DNA barcodes set Hermeuptychia populations in the lower Rio Grande Valley of Texas apart from H. sosybius. These southern populations, also found in northeastern Mexico, are described here as Hermeuptychia hermybius Grishin, sp. n. (type locality Texas: Cameron County). While being phylogenetically closer to H. sosybius than to any other Hermeuptychia species, H. hermybius can usually be recognized by wing patterns, such as the size of eyespots and the shape of brown lines on hindwing. “Intricate Satyr” and “South Texas Satyr” are proposed as the English names for H. intricata and H. hermybius, respectively.
PMCID: PMC3935228  PMID: 24574857
Biodiversity; cryptic species; DNA barcodes; neotropical; satyr; Hermeuptychia gisella; Hermeuptychia cucullina; Hermeuptychia sosybius kappeli; female genitalia
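The barcode differences quoted in this abstract (3.5%, 4%, 2.1%) are percent sequence differences between aligned COI fragments. The sketch below shows one simple way such a figure can be computed, assuming an uncorrected pairwise distance over aligned positions where both sequences have an unambiguous base. The abstract does not state which distance measure was used, and the function name and example sequences are hypothetical.

```python
def percent_difference(seq_a: str, seq_b: str) -> float:
    """Uncorrected percent difference between two pre-aligned DNA sequences.

    Assumption: positions with gaps or ambiguous bases in either
    sequence are skipped rather than counted as differences.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    # Keep only columns where both sequences carry a plain A/C/G/T.
    pairs = [
        (a, b)
        for a, b in zip(seq_a.upper(), seq_b.upper())
        if a in "ACGT" and b in "ACGT"
    ]
    if not pairs:
        raise ValueError("no comparable positions")
    mismatches = sum(a != b for a, b in pairs)
    return 100.0 * mismatches / len(pairs)


if __name__ == "__main__":
    # Hypothetical 10-base fragments differing at one position.
    print(percent_difference("ACGTACGTAC", "ACGTACGTAT"))
```

Published barcode analyses often apply a substitution model instead of a raw mismatch count, so values from this sketch would only approximate reported figures.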
3.  Defining and predicting structurally conserved regions in protein superfamilies 
Bioinformatics  2012;29(2):175-181.
Motivation: The structures of homologous proteins are generally better conserved than their sequences. This phenomenon is demonstrated by the prevalence of structurally conserved regions (SCRs) even in highly divergent protein families. Defining SCRs requires the comparison of two or more homologous structures and is affected by their availability and divergence, and our ability to deduce structurally equivalent positions among them. In the absence of multiple homologous structures, it is necessary to predict SCRs of a protein using information from only a set of homologous sequences and (if available) a single structure. Accurate SCR predictions can benefit homology modelling and sequence alignment.
Results: Using pairwise DaliLite alignments among a set of homologous structures, we devised a simple measure of structural conservation, termed structural conservation index (SCI). SCI was used to distinguish SCRs from non-SCRs. A database of SCRs was compiled from 386 SCOP superfamilies containing 6489 protein domains. Artificial neural networks were then trained to predict SCRs with various features deduced from a single structure and homologous sequences. Assessment of the predictions via a 5-fold cross-validation method revealed that predictions based on features derived from a single structure perform similarly to ones based on homologous sequences, while combining sequence and structural features was optimal in terms of accuracy (0.755) and Matthews correlation coefficient (0.476). These results suggest that even without information from multiple structures, it is still possible to effectively predict SCRs for a protein. Finally, inspection of the structures with the worst predictions pinpoints difficulties in SCR definitions.
Availability: The SCR database and the prediction server can be found at
Supplementary information: Supplementary data are available at Bioinformatics Online
PMCID: PMC3546793  PMID: 23193223
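The abstract above reports prediction quality as accuracy and Matthews correlation coefficient (MCC). As a reminder of how these two measures relate to a binary confusion matrix (SCR versus non-SCR), here is a minimal sketch. The function name and the example counts are hypothetical and are not taken from the paper.

```python
import math


def accuracy_and_mcc(tp: int, tn: int, fp: int, fn: int) -> tuple[float, float]:
    """Return (accuracy, MCC) for a binary confusion matrix.

    tp/tn/fp/fn are counts of true positives, true negatives,
    false positives and false negatives.
    """
    total = tp + tn + fp + fn
    if total == 0:
        raise ValueError("confusion matrix is empty")
    accuracy = (tp + tn) / total
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention MCC is set to 0 when any marginal sum is zero.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return accuracy, mcc


if __name__ == "__main__":
    # Hypothetical counts for illustration only.
    print(accuracy_and_mcc(tp=40, tn=35, fp=15, fn=10))
```

Unlike accuracy, MCC stays near zero for a predictor that merely guesses the majority class, which is one reason the two are often reported together.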
4.  cor, a Novel Carbon Monoxide Resistance Gene, Is Essential for Mycobacterium tuberculosis Pathogenesis 
mBio  2013;4(6):e00721-13.
Tuberculosis, caused by Mycobacterium tuberculosis, remains a devastating human infectious disease, causing two million deaths annually. We previously demonstrated that M. tuberculosis induces an enzyme, heme oxygenase (HO1), that produces carbon monoxide (CO) gas and that M. tuberculosis adapts its transcriptome during CO exposure. We now demonstrate that M. tuberculosis carries a novel resistance gene to combat CO toxicity. We screened an M. tuberculosis transposon library for CO-susceptible mutants and found that disruption of Rv1829 (carbon monoxide resistance, Cor) leads to marked CO sensitivity. Heterologous expression of Cor in Escherichia coli rescued it from CO toxicity. Importantly, the virulence of the cor mutant is attenuated in a mouse model of tuberculosis. Thus, Cor is necessary and sufficient to protect bacteria from host-derived CO. Taken together, this represents the first report of a role for HO1-derived CO in controlling infection of an intracellular pathogen and the first identification of a CO resistance gene in a pathogenic organism.
Macrophages produce a variety of antimicrobial molecules, including nitric oxide (NO), hydrogen peroxide (H2O2), and acid (H+), that serve to kill engulfed bacteria. In addition to these molecules, human and mouse macrophages also produce carbon monoxide (CO) gas by the heme oxygenase (HO1) enzyme. We observed that, in contrast to other bacteria, mycobacteria are resistant to CO, suggesting that this might be an evolutionary adaptation of mycobacteria for survival within macrophages. We screened a panel of ~2,500 M. tuberculosis mutants to determine which genes are required for survival of M. tuberculosis in the presence of CO. Within this panel, we identified one such gene, cor, that specifically confers CO resistance. Importantly, we found that the ability of M. tuberculosis cells carrying a mutated copy of this gene to cause tuberculosis in a mouse disease model is significantly attenuated. This indicates that CO resistance is essential for mycobacterial survival in vivo.
PMCID: PMC3870250  PMID: 24255121
5.  A New Family of Predicted Krüppel-Like Factor Genes and Pseudogenes in Placental Mammals 
PLoS ONE  2013;8(11):e81109.
Krüppel-like factors (KLF) and specificity proteins (SP) constitute a family of zinc-finger-containing transcription factors that play important roles in a wide range of processes including differentiation and development of various tissues. The human genome possesses 17 KLF genes (KLF1–KLF17) and nine SP genes (SP1–SP9) with diverse functions. We used sequence similarity searches and gene synteny analysis to identify a new putative KLF gene/pseudogene named KLF18 that is present in most of the placental mammals with sequenced genomes. KLF18 is a chromosomal neighbor of the KLF17 gene and is likely a product of its duplication. Phylogenetic analyses revealed that mammalian predicted KLF18 proteins and KLF17 proteins experienced elevated rates of evolution and are grouped with KLF1/KLF2/KLF4 and non-mammalian KLF17. Predicted KLF18 proteins maintain conserved features in the zinc fingers of the SP/KLF family, while possessing repeats of a unique sequence motif in their N-terminal regions. No expression data have been reported for KLF18, suggesting that it either has highly restricted expression patterns and specialized functions, or could have become a pseudogene in extant placental mammals. Besides KLF18 genes/pseudogenes, we identified several KLF18-like genes such as Zfp352, Zfp352-like, and Zfp353 in the genomes of mouse and rat. These KLF18-like genes do not possess introns inside their coding regions, and gene expression data indicate that some of them may function in early embryonic development. They represent further expansions of KLF members in the murine lineage, most likely resulting from several events of retrotransposition and local gene duplication starting from an ancient spliced mRNA of KLF18.
PMCID: PMC3820594  PMID: 24244731
6.  The ABC transporters in Candidatus Liberibacter asiaticus 
Proteins  2012;80(11):2614-2628.
Candidatus Liberibacter asiaticus (Ca. L. asiaticus) is a Gram-negative bacterium and the pathogen of Citrus Greening disease (Huanglongbing, HLB). As a parasitic bacterium, Ca. L. asiaticus harbors ABC transporters that play important roles in exchanging chemical compounds between Ca. L. asiaticus and its host. Here we analyzed all the ABC transporter-related proteins in Ca. L. asiaticus. We identified 14 ABC transporter systems and predicted their structures and substrate specificities. In-depth sequence and structure analysis including multiple sequence alignment, phylogenetic tree reconstruction and structure comparison further support their function predictions. Our study shows that this bacterium could utilize these ABC transporters to import metabolites (amino acids and phosphates) and enzyme cofactors (choline, thiamine, iron, manganese and zinc), resist organic solvents, heavy metals and lipid-like drugs, construct and maintain the composition of the outer membrane, and secrete virulence factors. While the features of most ABC systems could be deduced from the abundant experimental data on their orthologs, we report several novel observations within ABC system proteins. Moreover, we identified seven non-transport ABC systems that are likely involved in virulence gene expression regulation, transposon excision regulation and DNA repair. Our analysis reveals several candidates for further studies to understand and control the disease, including the type I virulence factor secretion system and its substrate that are likely related to Ca. L. asiaticus pathogenicity, and the ABC transporter systems responsible for bacterial outer membrane biosynthesis that are good drug targets.
PMCID: PMC3688454  PMID: 22807026
Genomic annotation; function prediction; ATPase; transmembrane protein; multiple sequence alignment; phylogenetic tree; protein homology; structure comparison
7.  SURVEY AND SUMMARY: Structural classification of zinc fingers 
Nucleic Acids Research  2003;31(2):532-550.
Zinc fingers are small protein domains in which zinc plays a structural role contributing to the stability of the domain. Zinc fingers are structurally diverse and are present among proteins that perform a broad range of functions in various cellular processes, such as replication and repair, transcription and translation, metabolism and signaling, cell proliferation and apoptosis. Zinc fingers typically function as interaction modules and bind to a wide variety of compounds, such as nucleic acids, proteins and small molecules. Here we present a comprehensive classification of zinc finger spatial structures. We find that each available zinc finger structure can be placed into one of eight fold groups that we define based on the structural properties in the vicinity of the zinc-binding site. Three of these fold groups comprise the majority of zinc fingers, namely, C2H2-like finger, treble clef finger and the zinc ribbon. Evolutionary relatedness of proteins within fold groups is not implied, but each group is divided into families of potential homologs. We compare our classification to existing groupings of zinc fingers and find that we define more encompassing fold groups, which bring together proteins whose similarities have previously remained unappreciated. We analyze functional properties of different zinc fingers and overlay them onto our classification. The classification helps in understanding the relationship between the structure, function and evolutionary history of these domains. The results are available as an online database of zinc finger structures.
PMCID: PMC140525  PMID: 12527760
8.  Secreted Kinase Phosphorylates Extracellular Proteins That Regulate Biomineralization 
Science (New York, N.Y.)  2012;336(6085):1150-1153.
Protein phosphorylation is a fundamental mechanism regulating nearly every aspect of cellular life. Several secreted proteins are phosphorylated, but the kinases responsible are unknown. We identified a family of atypical protein kinases that localize within the Golgi apparatus and are secreted. Fam20C appears to be the Golgi casein kinase that phosphorylates secretory pathway proteins within S-x-E motifs. Fam20C phosphorylates the caseins and several secreted proteins implicated in biomineralization, including the small integrin-binding ligand, N-linked glycoproteins (SIBLINGs). Consequently, mutations in Fam20C cause an osteosclerotic bone dysplasia in humans known as Raine syndrome. Fam20C is thus a protein kinase dedicated to the phosphorylation of extracellular proteins.
PMCID: PMC3754843  PMID: 22582013
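The abstract above states that Fam20C phosphorylates secretory pathway proteins within S-x-E motifs. A minimal sketch of scanning a protein sequence for candidate S-x-E sites follows, assuming "x" means any single residue and that overlapping matches should all be reported. The function name and the example sequence are hypothetical, and a motif match alone does not imply that a site is actually phosphorylated.

```python
import re

# Lookahead so that overlapping motifs (e.g. "SSEE") are all found.
# Assumption: "x" in S-x-E stands for any one-letter residue code.
SXE_PATTERN = re.compile(r"(?=(S[A-Z]E))")


def find_sxe_sites(sequence: str) -> list[tuple[int, str]]:
    """Return (1-based position of the serine, matched tripeptide) pairs."""
    return [
        (match.start() + 1, match.group(1))
        for match in SXE_PATTERN.finditer(sequence.upper())
    ]


if __name__ == "__main__":
    # Hypothetical peptide for illustration only.
    for position, motif in find_sxe_sites("MSAEKSSEES"):
        print(position, motif)
```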
10.  Membrane Protein Structure Predictions for Exploration 
Cell  2012;149(7):1424-1425.
A daring experiment is performed. Using sequence alignments to predict contacts between residues in protein spatial structures, Hopf et al. (2012) are publishing untested de novo structure models for 11 transmembrane protein families. Will their models stand the test of time and hold up to experimentation? The prospects are excellent.
PMCID: PMC3688449  PMID: 22726429
11.  Discrete - Continuous Duality of Protein Structure Space 
Recently, the nature of protein structure space has been widely discussed in the literature. The traditional discrete view of protein universe as a set of separate folds has been criticized in the light of growing evidence that almost any arrangement of secondary structures is possible and the whole protein space can be traversed through a path of similar structures. Here we argue that the discrete and continuous descriptions are not mutually exclusive, but complementary: the space is largely discrete in evolutionary sense, but continuous geometrically when purely structural similarities are quantified. Evolutionary connections are mainly confined to separate structural prototypes corresponding to folds as islands of structural stability, with few remaining traceable links between the islands. However, for a geometric similarity measure, it is usually possible to find a reasonable cutoff that yields paths connecting any two structures through intermediates.
PMCID: PMC3688466  PMID: 19482467
12.  Structure prediction for CASP8 with all-atom refinement using Rosetta 
Proteins  2009;77(Suppl 9):89-99.
We describe predictions made using the Rosetta structure prediction methodology for the Eighth Critical Assessment of Techniques for Protein Structure Prediction. Aggressive sampling and all-atom refinement were carried out for nearly all targets. A combination of alignment methodologies was used to generate starting models from a range of templates, and the models were then subjected to Rosetta all atom refinement. For 50 targets with readily identified templates, the best submitted model was better than the best alignment to the best template in the Protein Data Bank for 24 domains, and improved over the best starting model for 43 domains. For 13 targets where only very distant sequence relationships to proteins of known structure were detected, models were generated using the Rosetta de novo structure prediction methodology followed by all-atom refinement; in several cases the submitted models were better than those based on the available templates. Of the 12 refinement challenges, the best submitted model improved on the starting model in 7 cases. These improvements over the starting template-based models and refinement tests demonstrate the power of Rosetta structure refinement in improving model accuracy.
PMCID: PMC3688471  PMID: 19701941
13.  The Rho GTPase inactivation domain in Vibrio cholerae MARTX toxin has a circularly permuted papain-like thiol protease fold 
Proteins  2009;77(2):413-419.
A Rho GTPase inactivation domain (RID) has been discovered in the multifunctional, autoprocessing RTX toxin RtxA from Vibrio cholerae. The RID domain causes actin depolymerization and rounding of host cells through inactivation of the small Rho GTPases Rho, Rac and Cdc42. With only a few toxin proteins containing RID domains in the current sequence database, the structure and molecular mechanisms of this domain are unknown. Using comparative sequence and structural analyses, we report homology inference, fold recognition, and active site prediction for RID domains. Remote homologs of RID domains were identified in two other experimentally characterized bacterial virulence factors: IcsB of Shigella flexneri and BopA of Burkholderia pseudomallei, as well as in a group of uncharacterized bacterial membrane proteins. IcsB plays an important role in helping Shigella to evade the host autophagy defense system. RID domain homologs share a conserved dyad of cysteine and histidine residues, and are predicted to adopt a circularly permuted papain-like thiol protease fold. RID domains from MARTX toxins and virulence factors IcsB and BopA thus could function as proteases or acyltransferases acting on host molecules. Our results provide structural and mechanistic insights into several important proteins functioning in bacterial pathogenesis.
PMCID: PMC3688474  PMID: 19434753
Rho GTPase inactivation; cysteine protease domain; papain-like fold; multifunctional; autoprocessing RTX toxins; Shigella virulence factor IcsB; structure prediction; homology inference
14.  An E3 Ligase Possessing an Iron-Responsive Hemerythrin Domain Is a Regulator of Iron Homeostasis 
Science (New York, N.Y.)  2009;326(5953):722-726.
Cellular iron homeostasis is maintained by the coordinate posttranscriptional regulation of genes responsible for iron uptake, release, use, and storage through the actions of the iron regulatory proteins IRP1 and IRP2. However, the manner in which iron levels are sensed to affect IRP2 activity is poorly understood. We found that an E3 ubiquitin ligase complex containing the FBXL5 protein targets IRP2 for proteasomal degradation. The stability of FBXL5 itself was regulated, accumulating under iron- and oxygen-replete conditions and degraded upon iron depletion. FBXL5 contains an iron- and oxygen-binding hemerythrin domain that acted as a ligand-dependent regulatory switch mediating FBXL5's differential stability. These observations suggest a mechanistic link between iron sensing via the FBXL5 hemerythrin domain, IRP2 regulation, and cellular responses to maintain mammalian iron homeostasis.
PMCID: PMC3582197  PMID: 19762597
15.  Seq2Ref: a web server to facilitate functional interpretation 
BMC Bioinformatics  2013;14:30.
The size of the protein sequence database has been exponentially increasing due to advances in genome sequencing. However, experimentally characterized proteins only constitute a small portion of the database, such that the majority of sequences have been annotated by computational approaches. Current automatic annotation pipelines inevitably introduce errors, making the annotations unreliable. Instead of such error-prone automatic annotations, functional interpretation should rely on annotations of ‘reference proteins’ that have been experimentally characterized or manually curated.
The Seq2Ref server uses BLAST to detect proteins homologous to a query sequence and identifies the reference proteins among them. Seq2Ref then reports publications with experimental characterizations of the identified reference proteins that might be relevant to the query. Furthermore, a plurality-based rating system is developed to evaluate the homologous relationships and rank the reference proteins by their relevance to the query.
The reference proteins detected by our server will lend insight into proteins of unknown function and provide extensive information to develop in-depth understanding of uncharacterized proteins. Seq2Ref is available at:
PMCID: PMC3573977  PMID: 23356573
Web server; Functional interpretation; Sequence homology; Reference protein; PubMed literature
16.  An automatic method for CASP9 free modeling structure prediction assessment 
Bioinformatics  2011;27(24):3371-3378.
Motivation: Manual inspection has been applied to and is well accepted for assessing critical assessment of protein structure prediction (CASP) free modeling (FM) category predictions over the years. Such manual assessment requires expertise and significant time investment, yet has the problems of being subjective and unable to differentiate models of similar quality. It is beneficial to incorporate the ideas behind manual inspection into an automatic scoring system, which could provide objective and reproducible assessment of structure models.
Results: Inspired by our experience in CASP9 FM category assessment, we developed an automatic superimposition independent method named Quality Control Score (QCS) for structure prediction assessment. QCS captures both global and local structural features, with emphasis on global topology. We applied this method to all FM targets from CASP9, and overall the results showed the best agreement with Manual Inspection Scores among automatic prediction assessment methods previously applied in CASPs, such as Global Distance Test Total Score (GDT_TS) and Contact Score (CS). As one of the important components to guide our assessment of CASP9 FM category predictions, this method correlates well with other scoring methods and yet is able to reveal good-quality models that are missed by GDT_TS.
Availability: The script for QCS calculation is available at
Supplementary Information: Supplementary data are available at Bioinformatics online.
PMCID: PMC3232368  PMID: 21994223
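The abstract above compares QCS against the Global Distance Test Total Score (GDT_TS). For orientation, GDT_TS averages the percentage of residues whose Cα atoms lie within 1, 2, 4 and 8 Å of their positions in the target structure. The sketch below is a simplification: it assumes the per-residue distances come from one fixed superposition, whereas the full GDT procedure searches over many superpositions to maximize each percentage. The function name and example distances are hypothetical, and this is not the QCS method itself.

```python
from typing import Optional, Sequence

GDT_TS_CUTOFFS = (1.0, 2.0, 4.0, 8.0)  # distance cutoffs in angstroms


def gdt_ts(distances: Sequence[float], n_target: Optional[int] = None) -> float:
    """Approximate GDT_TS (0-100) from per-residue C-alpha distances.

    distances: model-to-target distances for the aligned residues,
               measured after a single superposition (simplifying assumption).
    n_target:  number of residues in the target; defaults to
               len(distances) when every target residue was modeled.
    """
    n = len(distances) if n_target is None else n_target
    if n <= 0:
        raise ValueError("target must contain at least one residue")
    fractions = [
        sum(d <= cutoff for d in distances) / n for cutoff in GDT_TS_CUTOFFS
    ]
    return 100.0 * sum(fractions) / len(GDT_TS_CUTOFFS)


if __name__ == "__main__":
    # Hypothetical distances for a four-residue toy example.
    print(gdt_ts([0.5, 1.5, 3.0, 9.0]))
```

Because the score depends on a superposition, two models with the same topology can receive different values, which is part of the motivation for superimposition-independent measures such as the one described in this entry.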
17.  NESdb: a database of NES-containing CRM1 cargoes 
Molecular Biology of the Cell  2012;23(18):3673-3676.
There are 221 experimentally validated, leucine-rich nuclear export signal (NES)–containing CRM1 cargoes in a database named NESdb. Entries in NESdb are annotated with sequence and structural information on both NES and cargo proteins, as well as with experimental evidence on NES-mapping and CRM1-mediated nuclear export.
The leucine-rich nuclear export signal (NES) is the only known class of targeting signal that directs macromolecules out of the cell nucleus. NESs are short stretches of 8–15 amino acids with regularly spaced hydrophobic residues that bind the export karyopherin CRM1. NES-containing proteins are involved in numerous cellular and disease processes. We compiled a database named NESdb that contains 221 NES-containing CRM1 cargoes that were manually curated from the published literature. Each NESdb entry is annotated with information about sequence and structure of both the NES and the cargo protein, as well as information about experimental evidence of NES-mapping and CRM1-mediated nuclear export. NESdb will be updated regularly and will serve as an important resource for nuclear export signals. NESdb is freely available to nonprofit organizations at
PMCID: PMC3442414  PMID: 22833564
18.  Sequence and structural analyses of nuclear export signals in the NESdb database 
Molecular Biology of the Cell  2012;23(18):3677-3693.
A 221-entry NESdb database produces data sets of true- and false-positive nuclear export signals (NES). Analysis of these data sets leads to identification of a set of sequence and structural properties that distinguishes true NESs from peptides without export capability that merely conform to the NES consensus sequences.
We compiled >200 nuclear export signal (NES)–containing CRM1 cargoes in a database named NESdb. We analyzed the sequences and three-dimensional structures of natural, experimentally identified NESs and of false-positive NESs that were generated from the database in order to identify properties that might distinguish the two groups of sequences. Analyses of amino acid frequencies, sequence logos, and agreement with existing NES consensus sequences revealed strong preferences for the Φ1-X3-Φ2-X2-Φ3-X-Φ4 pattern and for negatively charged amino acids in the nonhydrophobic positions of experimentally identified NESs but not of false positives. Strong preferences against certain hydrophobic amino acids in the hydrophobic positions were also revealed. These findings led to a new and more precise NES consensus. More important, three-dimensional structures are now available for 68 NESs within 56 different cargo proteins. Analyses of these structures showed that experimentally identified NESs are more likely than the false positives to adopt α-helical conformations that transition to loops at their C-termini and more likely to be surface accessible within their protein domains or be present in disordered or unobserved parts of the structures. Such distinguishing features for real NESs might be useful in future NES prediction efforts. Finally, we also tested CRM1-binding of 40 NESs that were found in the 56 structures. We found that 16 of the NES peptides did not bind CRM1, hence illustrating how NESs are easily misidentified.
PMCID: PMC3442415  PMID: 22833565
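The Φ1-X3-Φ2-X2-Φ3-X-Φ4 pattern mentioned in the abstract above can be expressed as a regular expression. The sketch below assumes that Φ is one of L, I, V, F or M and that X is any residue. This residue set is an assumption for illustration and is not the refined consensus derived in the paper. The function name and test sequence are hypothetical.

```python
import re

# Assumption: hydrophobic positions accept L, I, V, F or M.
PHI = "[LIVFM]"

# Phi1-X3-Phi2-X2-Phi3-X-Phi4, wrapped in a lookahead so that
# overlapping candidates are all reported.
NES_PATTERN = re.compile(rf"(?=({PHI}.{{3}}{PHI}.{{2}}{PHI}.{PHI}))")


def find_nes_candidates(sequence: str) -> list[tuple[int, str]]:
    """Return (1-based start, matched 10-residue segment) pairs."""
    return [
        (match.start() + 1, match.group(1))
        for match in NES_PATTERN.finditer(sequence.upper())
    ]


if __name__ == "__main__":
    # Hypothetical sequence for illustration only.
    print(find_nes_candidates("NSNELALKLAGLDINK"))
```

As the abstract notes, many peptides that match such a consensus do not bind CRM1, so hits from a pattern scan are candidates to be filtered by structural context, not confirmed signals.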
19.  Succination of Keap1 and activation of Nrf2-dependent antioxidant pathways in FH-deficient papillary renal cell carcinoma type-2 
Cancer cell  2011;20(4):418-420.
Fumarate hydratase (FH) is a tumor suppressor, but how it acts is unclear. Two reports in this issue of Cancer Cell reveal that FH-deficiency leads to succination of Keap1, stabilization of Nrf2, and induction of stress-response genes including HMOX1, which is important for the survival of FH-deficient cells.
PMCID: PMC3226726  PMID: 22014567
20.  CASP9 Assessment of Free Modeling Target Predictions 
Proteins  2011;79(Suppl 10):59-73.
We present an overview of the ninth round of Critical Assessment of Protein Structure Prediction (CASP9) ‘Template free modeling’ category (FM). Prediction models were evaluated using a combination of established structural and sequence comparison measures and a novel automated method designed to mimic manual inspection by capturing both global and local structural features. These scores were compared to those assigned manually over a diverse subset of target domains. Scores were combined to compare overall performance of participating groups and to estimate rank significance. Moreover, we discuss a few examples of free modeling targets to highlight the progress and bottlenecks of current prediction methods. Notably, a server prediction model for a single target (T0581) improved significantly over the closest structure template (44% GDT increase). This accomplishment represents the ‘winner’ of the CASP9 FM category. A number of human expert groups submitted slight variations of this model, highlighting a trend for human experts to act as “meta predictors” by correctly selecting among models produced by the top-performing automated servers. The details of evaluation are available at
PMCID: PMC3226891  PMID: 21997521
protein fold prediction; structure comparison; alignment quality; ab-initio; domain structure; CASP9
21.  CASP9 Target Classification 
Proteins  2011;79(Suppl 10):21-36.
The Critical Assessment of Protein Structure Prediction round 9 (CASP9) aimed to evaluate predictions for 129 experimentally determined protein structures. To assess tertiary structure predictions, these target structures were divided into domain-based evaluation units that were then classified into two assessment categories: template based modeling (TBM) and template free modeling (FM). CASP9 targets were split into domains of structurally compact evolutionary modules. For the targets with more than one defined domain, the decision to split structures into domains for evaluation was based on server performance. Target domains were categorized based on their evolutionary relatedness to existing templates as well as their difficulty levels indicated by server performance. Those target domains with sequence-related templates and high server prediction performance were classified as TBM, while those targets without identifiable templates and low server performance were classified as FM. However, using these generalizations for classification resulted in a blurred boundary between CASP9 assessment categories. Thus, the FM category included those domains without sequence detectable templates (25 target domains) as well as some domains with difficult to detect templates whose predictions were as poor as those without templates (5 target domains). Several interesting examples are discussed, including targets with sequence related templates that exhibit unusual structural differences, targets with homologous or analogous structure templates that are not detectable by sequence, and targets with new folds.
PMCID: PMC3226894  PMID: 21997778
Protein Structure; CASP9; Classification; Fold space; sequence homologs; structure analogs; free modeling; template based modeling; structure prediction
22.  MESSA: MEta-Server for protein Sequence Analysis 
BMC Biology  2012;10:82.
Computational sequence analysis, that is, prediction of local sequence properties, homologs, spatial structure and function from the sequence of a protein, offers an efficient way to obtain needed information about proteins under study. Since reliable prediction is usually based on the consensus of many computer programs, meta-servers have been developed to fit such needs. Most meta-servers focus on one aspect of sequence analysis, while others incorporate more information, such as PredictProtein for local sequence feature predictions, SMART for domain architecture and sequence motif annotation, and GeneSilico for secondary and spatial structure prediction. However, as predictions of local sequence properties, three-dimensional structure and function are usually intertwined, it is beneficial to address them together.
We developed a MEta-Server for protein Sequence Analysis (MESSA) to facilitate comprehensive protein sequence analysis and gather structural and functional predictions for a protein of interest. For an input sequence, the server exploits a number of select tools to predict local sequence properties, such as secondary structure, structurally disordered regions, coiled coils, signal peptides and transmembrane helices; detect homologous proteins and assign the query to a protein family; identify three-dimensional structure templates and generate structure models; and provide predictive statements about the protein's function, including functional annotations, Gene Ontology terms, enzyme classification and possible functionally associated proteins. We tested MESSA on the proteome of Candidatus Liberibacter asiaticus. Manual curation shows that three-dimensional structure models generated by MESSA covered around 75% of all the residues in this proteome and the function of 80% of all proteins could be predicted.
MESSA is free for non-commercial use.
PMCID: PMC3519821  PMID: 23031578
23.  Unexpected diversity in Shisa-like proteins suggests the importance of their roles as transmembrane adaptors 
Cellular signalling  2011;24(3):758-769.
The Shisa family of single-transmembrane proteins is characterized by an N-terminal cysteine-rich domain and a proline-rich C-terminal region. Its founding member, Xenopus Shisa, promotes head development by antagonizing Wnt and FGF signaling. Recently, a mouse brain-specific Shisa protein CKAMP44 (Shisa9) was shown to play an important role in AMPA receptor desensitization. We used sequence similarity searches against protein, genome and EST databases to study the evolutionary origin and phylogenetic distribution of Shisa homologs. In addition to nine Shisa subfamilies in vertebrates, we detected distantly related Shisa homologs that possess an N-terminal domain with six conserved cysteines. These Shisa-like proteins include FAM159 and KIAA1644 mainly from vertebrates, and members from various bilaterian invertebrates and Porifera, suggesting their presence in the last common ancestor of Metazoa. Shisa-like genes have undergone large expansions in Branchiostoma floridae and Saccoglossus kowalevskii, and appear to have been lost in certain insects. Pattern-based searches against eukaryotic proteomes also uncovered several other families of predicted single-transmembrane proteins with a similar cysteine-rich domain. We refer to these proteins (Shisa/Shisa-like, WBP1/VOPP1, CX, DUF2650, TMEM92, and CYYR1) as STMC6 proteins (single-transmembrane proteins with conserved 6 cysteines). STMC6 genes are widespread in Metazoa, with the human genome containing 17 members. Frequent occurrences of PY motifs in STMC6 proteins suggest that most of them could interact with WW-domain-containing proteins, such as the NEDD4 family E3 ubiquitin ligases, and could play critical roles in protein degradation and sorting. STMC6 proteins are likely transmembrane adaptors that regulate membrane proteins such as cell surface receptors.
PMCID: PMC3295595  PMID: 22120523
Shisa-like proteins; WBP1/VOPP1; CX and DUF2650; TMEM92; CYYR1; transmembrane adaptors
24.  Self consistency grouping: a stringent clustering method 
BMC Bioinformatics  2012;13(Suppl 13):S3.
Numerous types of clustering, such as single linkage and K-means, have been widely studied and applied to a variety of scientific problems. However, the existing methods are not readily applicable to problems that demand high stringency.
Our method, self consistency grouping (SCG), yields clusters whose members are closer in rank to each other than to any member outside the cluster. We do not define a distance metric; we use the best known distance metric and presume that it measures the correct distance. SCG does not impose any restriction on the size or the number of the clusters that it finds. The boundaries of clusters are determined by the inconsistencies in the ranks. In addition to the direct implementation that finds the complete structure of the (sub)clusters, we implemented two faster versions. The fastest version is guaranteed to find only the clusters that are not subclusters of any other clusters, and the other version yields the same output as the direct implementation but does so more efficiently.
Our tests have demonstrated that SCG yields very few false positives. This was accomplished by introducing errors in the distance measurement. Clustering of protein domain representatives by structural similarity showed that SCG could recover homologous groups with high precision.
SCG has potential for finding biological relationships under stringent conditions.
PMCID: PMC3426801  PMID: 23320864
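The self-consistency criterion in the abstract above can be illustrated with a small brute-force sketch. This is one reading of the stated definition, not the authors' implementation: a set of k items is treated as a cluster when, for every member, the other k-1 members are exactly its k-1 nearest neighbors by rank. The function name `self_consistent_groups`, the distance-matrix input, and the assumption that each row has no tied distances are all hypothetical choices made for illustration.

```python
def self_consistent_groups(dist):
    """Brute-force sketch of a rank-based self-consistency check.

    dist is a symmetric n x n matrix of pairwise distances (assumed to
    have no ties within a row). Returns every group of size 2..n-1 in
    which each member ranks all other members ahead of every outsider.
    """
    n = len(dist)
    # order[i]: all items sorted by distance from i, with i itself first.
    order = [
        sorted(range(n), key=lambda j, i=i: (j != i, dist[i][j], j))
        for i in range(n)
    ]
    found = set()
    for i in range(n):
        for k in range(2, n):
            # Candidate group: item i together with its k-1 nearest neighbors.
            candidate = frozenset(order[i][:k])
            if candidate in found:
                continue
            # Accept only if every member sees the same k-item neighborhood.
            if all(frozenset(order[j][:k]) == candidate for j in candidate):
                found.add(candidate)
    # Report smaller groups first, so subclusters precede their parents.
    return sorted((sorted(c) for c in found), key=lambda c: (len(c), c))


# Hypothetical example: five points on a line.
points = [0, 1, 10, 11, 30]
dist = [[abs(a - b) for b in points] for a in points]
print(self_consistent_groups(dist))
```

In this toy example the two tight pairs are reported along with the larger group that contains both, while the isolated point at 30 joins no group, which mirrors the abstract's point that no cluster size or count is imposed in advance.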
25.  Predictive Sequence Analysis of the Candidatus Liberibacter asiaticus Proteome 
PLoS ONE  2012;7(7):e41071.
Candidatus Liberibacter asiaticus (Ca. L. asiaticus) is a parasitic Gram-negative bacterium that is closely associated with Huanglongbing (HLB), a worldwide citrus disease. Given the difficulty in culturing the bacterium and thus in its experimental characterization, computational analyses of the whole Ca. L. asiaticus proteome can provide much needed insights into the mechanisms of the disease and guide the development of treatment strategies. In this study, we applied state-of-the-art sequence analysis tools to every Ca. L. asiaticus protein. Our results are available as a public website. In particular, we manually curated the results to predict the subcellular localization, spatial structure and function of all Ca. L. asiaticus proteins. This extensive information should facilitate the study of Ca. L. asiaticus proteome function and its relationship to disease. Pilot studies based on the information from our website have revealed several potential virulence factors, discussed herein.
PMCID: PMC3399792  PMID: 22815919

