Cellular iron homeostasis is maintained by the coordinate posttranscriptional regulation of genes responsible for iron uptake, release, use, and storage through the actions of the iron regulatory proteins IRP1 and IRP2. However, the manner in which iron levels are sensed to affect IRP2 activity is poorly understood. We found that an E3 ubiquitin ligase complex containing the FBXL5 protein targets IRP2 for proteasomal degradation. The stability of FBXL5 itself was regulated, accumulating under iron- and oxygen-replete conditions and degraded upon iron depletion. FBXL5 contains an iron- and oxygen-binding hemerythrin domain that acted as a ligand-dependent regulatory switch mediating FBXL5's differential stability. These observations suggest a mechanistic link between iron sensing via the FBXL5 hemerythrin domain, IRP2 regulation, and cellular responses to maintain mammalian iron homeostasis.
The size of the protein sequence database has been exponentially increasing due to advances in genome sequencing. However, experimentally characterized proteins only constitute a small portion of the database, such that the majority of sequences have been annotated by computational approaches. Current automatic annotation pipelines inevitably introduce errors, making the annotations unreliable. Instead of such error-prone automatic annotations, functional interpretation should rely on annotations of ‘reference proteins’ that have been experimentally characterized or manually curated.
The Seq2Ref server uses BLAST to detect proteins homologous to a query sequence and identifies the reference proteins among them. Seq2Ref then reports publications with experimental characterizations of the identified reference proteins that might be relevant to the query. Furthermore, a plurality-based rating system is developed to evaluate the homologous relationships and rank the reference proteins by their relevance to the query.
The reference proteins detected by our server will lend insight into proteins of unknown function and provide extensive information to develop in-depth understanding of uncharacterized proteins. Seq2Ref is available at: http://prodata.swmed.edu/seq2ref.
Web server; Functional interpretation; Sequence homology; Reference protein; PubMed literature
Motivation: Manual inspection has long been applied to, and is well accepted for, assessing predictions in the free modeling (FM) category of the critical assessment of protein structure prediction (CASP). Such manual assessment requires expertise and significant time investment, yet it is subjective and cannot differentiate models of similar quality. It is beneficial to incorporate the ideas behind manual inspection into an automatic scoring system, which could provide objective and reproducible assessment of structure models.
Results: Inspired by our experience in CASP9 FM category assessment, we developed an automatic superimposition independent method named Quality Control Score (QCS) for structure prediction assessment. QCS captures both global and local structural features, with emphasis on global topology. We applied this method to all FM targets from CASP9, and overall the results showed the best agreement with Manual Inspection Scores among automatic prediction assessment methods previously applied in CASPs, such as Global Distance Test Total Score (GDT_TS) and Contact Score (CS). As one of the important components to guide our assessment of CASP9 FM category predictions, this method correlates well with other scoring methods and yet is able to reveal good-quality models that are missed by GDT_TS.
Availability: The script for QCS calculation is available at http://prodata.swmed.edu/QCS/.
Supplementary Information: Supplementary data are available at Bioinformatics online.
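For readers unfamiliar with GDT_TS, the baseline measure that QCS is compared against above, the following is a minimal sketch of its commonly cited form: the average, over distance cutoffs of 1, 2, 4 and 8 Å, of the percentage of residues whose Cα atom lies within the cutoff of its position in the target. The function name and the input format are assumptions made for illustration. The sketch takes per-residue deviations from a single, fixed superposition, whereas the assessment programs used in CASP search over many superpositions for each cutoff, so this simplified version can only underestimate the real score. It is not the QCS method, whose details are not given here.

```python
def gdt_ts(distances):
    """Simplified GDT_TS from per-residue C-alpha deviations (in angstroms).

    `distances` is assumed to hold one deviation per target residue,
    measured under a single fixed superposition of model onto target.
    """
    if not distances:
        raise ValueError("at least one residue distance is required")
    cutoffs = (1.0, 2.0, 4.0, 8.0)
    # Fraction of residues within each cutoff.
    fractions = [
        sum(1 for d in distances if d <= cutoff) / len(distances)
        for cutoff in cutoffs
    ]
    # Average the four fractions and express the result as a percentage.
    return 100.0 * sum(fractions) / len(cutoffs)


# Hypothetical deviations for a five-residue model.
example_score = gdt_ts([0.5, 1.5, 3.0, 6.0, 10.0])
```

Because every residue contributes equally at each cutoff, a model with the correct overall topology but imprecise local geometry can score poorly, which is consistent with the abstract's remark that some good-quality models are missed by GDT_TS.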
There are 221 experimentally validated, leucine-rich nuclear export signal (NES)–containing CRM1 cargoes in a database named NESdb. Entries in NESdb are annotated with sequence and structural information on both NES and cargo proteins, as well as with experimental evidence on NES-mapping and CRM1-mediated nuclear export.
The leucine-rich nuclear export signal (NES) is the only known class of targeting signal that directs macromolecules out of the cell nucleus. NESs are short stretches of 8–15 amino acids with regularly spaced hydrophobic residues that bind the export karyopherin CRM1. NES-containing proteins are involved in numerous cellular and disease processes. We compiled a database named NESdb that contains 221 NES-containing CRM1 cargoes that were manually curated from the published literature. Each NESdb entry is annotated with information about sequence and structure of both the NES and the cargo protein, as well as information about experimental evidence of NES-mapping and CRM1-mediated nuclear export. NESdb will be updated regularly and will serve as an important resource for nuclear export signals. NESdb is freely available to nonprofit organizations at http://prodata.swmed.edu/LRNes.
A 221-entry NESdb database produces data sets of true- and false-positive nuclear export signals (NES). Analysis of these data sets leads to identification of a set of sequence and structural properties that distinguishes true NESs from peptides without export capability that merely conform to the NES consensus sequences.
We compiled >200 nuclear export signal (NES)–containing CRM1 cargoes in a database named NESdb. We analyzed the sequences and three-dimensional structures of natural, experimentally identified NESs and of false-positive NESs that were generated from the database in order to identify properties that might distinguish the two groups of sequences. Analyses of amino acid frequencies, sequence logos, and agreement with existing NES consensus sequences revealed strong preferences for the Φ1-X3-Φ2-X2-Φ3-X-Φ4 pattern and for negatively charged amino acids in the nonhydrophobic positions of experimentally identified NESs but not of false positives. Strong preferences against certain hydrophobic amino acids in the hydrophobic positions were also revealed. These findings led to a new and more precise NES consensus. More important, three-dimensional structures are now available for 68 NESs within 56 different cargo proteins. Analyses of these structures showed that experimentally identified NESs are more likely than the false positives to adopt α-helical conformations that transition to loops at their C-termini and more likely to be surface accessible within their protein domains or be present in disordered or unobserved parts of the structures. Such distinguishing features for real NESs might be useful in future NES prediction efforts. Finally, we also tested CRM1-binding of 40 NESs that were found in the 56 structures. We found that 16 of the NES peptides did not bind CRM1, hence illustrating how NESs are easily misidentified.
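The Φ1-X3-Φ2-X2-Φ3-X-Φ4 pattern discussed above can be expressed as a regular expression. The sketch below is only an illustration: the set of residues accepted at the Φ positions (L, I, V, F, M) is an assumption, the function name is hypothetical, and the peptide used in the example is a made-up test string rather than an NESdb entry. A lookahead is used so that overlapping candidates are all reported.

```python
import re

# Assumed set of hydrophobic residues allowed at the four Phi positions.
HYDROPHOBIC = "LIVFM"

# Phi1-X3-Phi2-X2-Phi3-X-Phi4, wrapped in a lookahead so overlapping hits are found.
NES_PATTERN = re.compile(
    rf"(?=([{HYDROPHOBIC}].{{3}}[{HYDROPHOBIC}].{{2}}[{HYDROPHOBIC}].[{HYDROPHOBIC}]))"
)


def find_nes_candidates(sequence):
    """Return (start_index, matched_segment) for every consensus match."""
    return [
        (match.start(1), match.group(1))
        for match in NES_PATTERN.finditer(sequence.upper())
    ]


# Illustrative peptide only; indices are 0-based.
candidates = find_nes_candidates("NSNELALKLAGLDINK")
```

As the abstract stresses, a match to the consensus alone does not establish export activity: such a scan yields candidates that still have to be filtered by properties such as charge preferences, helical propensity and surface accessibility, and ultimately tested for CRM1 binding.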
Fumarate hydratase (FH) is a tumor suppressor, but how it acts is unclear. Two reports in this issue of Cancer Cell reveal that FH-deficiency leads to succination of Keap1, stabilization of Nrf2, and induction of stress-response genes including HMOX1, which is important for the survival of FH-deficient cells.
We present an overview of the ninth round of Critical Assessment of Protein Structure Prediction (CASP9) ‘Template free modeling’ category (FM). Prediction models were evaluated using a combination of established structural and sequence comparison measures and a novel automated method designed to mimic manual inspection by capturing both global and local structural features. These scores were compared to those assigned manually over a diverse subset of target domains. Scores were combined to compare overall performance of participating groups and to estimate rank significance. Moreover, we discuss a few examples of free modeling targets to highlight the progress and bottlenecks of current prediction methods. Notably, a server prediction model for a single target (T0581) improved significantly over the closest structure template (44% GDT increase). This accomplishment represents the ‘winner’ of the CASP9 FM category. A number of human expert groups submitted slight variations of this model, highlighting a trend for human experts to act as “meta predictors” by correctly selecting among models produced by the top-performing automated servers. The details of evaluation are available at http://prodata.swmed.edu/CASP9/
protein fold prediction; structure comparison; alignment quality; ab-initio; domain structure; CASP9
The Critical Assessment of Protein Structure Prediction round 9 (CASP9) aimed to evaluate predictions for 129 experimentally determined protein structures. To assess tertiary structure predictions, these target structures were divided into domain-based evaluation units that were then classified into two assessment categories: template based modeling (TBM) and template free modeling (FM). CASP9 targets were split into domains of structurally compact evolutionary modules. For the targets with more than one defined domain, the decision to split structures into domains for evaluation was based on server performance. Target domains were categorized based on their evolutionary relatedness to existing templates as well as their difficulty levels indicated by server performance. Those target domains with sequence-related templates and high server prediction performance were classified as TBM, while those targets without identifiable templates and low server performance were classified as FM. However, using these generalizations for classification resulted in a blurred boundary between CASP9 assessment categories. Thus, the FM category included those domains without sequence detectable templates (25 target domains) as well as some domains with difficult to detect templates whose predictions were as poor as those without templates (5 target domains). Several interesting examples are discussed, including targets with sequence related templates that exhibit unusual structural differences, targets with homologous or analogous structure templates that are not detectable by sequence, and targets with new folds.
Protein Structure; CASP9; Classification; Fold space; sequence homologs; structure analogs; free modeling; template based modeling; structure prediction
Zinc fingers are small protein domains in which zinc plays a structural role contributing to the stability of the domain. Zinc fingers are structurally diverse and are present among proteins that perform a broad range of functions in various cellular processes, such as replication and repair, transcription and translation, metabolism and signaling, cell proliferation and apoptosis. Zinc fingers typically function as interaction modules and bind to a wide variety of compounds, such as nucleic acids, proteins and small molecules. Here we present a comprehensive classification of zinc finger spatial structures. We find that each available zinc finger structure can be placed into one of eight fold groups that we define based on the structural properties in the vicinity of the zinc-binding site. Three of these fold groups comprise the majority of zinc fingers, namely, C2H2-like finger, treble clef finger and the zinc ribbon. Evolutionary relatedness of proteins within fold groups is not implied, but each group is divided into families of potential homologs. We compare our classification to existing groupings of zinc fingers and find that we define more encompassing fold groups, which bring together proteins whose similarities have previously remained unappreciated. We analyze functional properties of different zinc fingers and overlay them onto our classification. The classification helps in understanding the relationship between the structure, function and evolutionary history of these domains. The results are available as an online database of zinc finger structures.
Computational sequence analysis, that is, prediction of local sequence properties, homologs, spatial structure and function from the sequence of a protein, offers an efficient way to obtain needed information about proteins under study. Since reliable prediction is usually based on the consensus of many computer programs, meta-servers have been developed to fit such needs. Most meta-servers focus on one aspect of sequence analysis, while others incorporate more information, such as PredictProtein for local sequence feature predictions, SMART for domain architecture and sequence motif annotation, and GeneSilico for secondary and spatial structure prediction. However, as predictions of local sequence properties, three-dimensional structure and function are usually intertwined, it is beneficial to address them together.
We developed a MEta-Server for protein Sequence Analysis (MESSA) to facilitate comprehensive protein sequence analysis and gather structural and functional predictions for a protein of interest. For an input sequence, the server exploits a number of select tools to predict local sequence properties, such as secondary structure, structurally disordered regions, coiled coils, signal peptides and transmembrane helices; detect homologous proteins and assign the query to a protein family; identify three-dimensional structure templates and generate structure models; and provide predictive statements about the protein's function, including functional annotations, Gene Ontology terms, enzyme classification and possible functionally associated proteins. We tested MESSA on the proteome of Candidatus Liberibacter asiaticus. Manual curation shows that three-dimensional structure models generated by MESSA covered around 75% of all the residues in this proteome and the function of 80% of all proteins could be predicted.
MESSA is free for non-commercial use at http://prodata.swmed.edu/MESSA/
The Shisa family of single-transmembrane proteins is characterized by an N-terminal cysteine-rich domain and a proline-rich C-terminal region. Its founding member, Xenopus Shisa, promotes head development by antagonizing Wnt and FGF signaling. Recently, a mouse brain-specific Shisa protein CKAMP44 (Shisa9) was shown to play an important role in AMPA receptor desensitization. We used sequence similarity searches against protein, genome and EST databases to study the evolutionary origin and phylogenetic distribution of Shisa homologs. In addition to nine Shisa subfamilies in vertebrates, we detected distantly related Shisa homologs that possess an N-terminal domain with six conserved cysteines. These Shisa-like proteins include FAM159 and KIAA1644 mainly from vertebrates, and members from various bilaterian invertebrates and Porifera, suggesting their presence in the last common ancestor of Metazoa. Shisa-like genes have undergone large expansions in Branchiostoma floridae and Saccoglossus kowalevskii, and appear to have been lost in certain insects. Pattern-based searches against eukaryotic proteomes also uncovered several other families of predicted single-transmembrane proteins with a similar cysteine-rich domain. We refer to these proteins (Shisa/Shisa-like, WBP1/VOPP1, CX, DUF2650, TMEM92, and CYYR1) as STMC6 proteins (single-transmembrane proteins with conserved 6 cysteines). STMC6 genes are widespread in Metazoa, with the human genome containing 17 members. Frequent occurrences of PY motifs in STMC6 proteins suggest that most of them could interact with WW-domain-containing proteins, such as the NEDD4 family E3 ubiquitin ligases, and could play critical roles in protein degradation and sorting. STMC6 proteins are likely transmembrane adaptors that regulate membrane proteins such as cell surface receptors.
Shisa-like proteins; WBP1/VOPP1; CX and DUF2650; TMEM92; CYYR1; transmembrane adaptors
Numerous types of clustering like single linkage and K-means have been widely studied and applied to a variety of scientific problems. However, the existing methods are not readily applicable for the problems that demand high stringency.
Our method, self consistency grouping, i.e. SCG, yields clusters whose members are closer in rank to each other than to any member outside the cluster. We do not define a distance metric; we use the best known distance metric and presume that it measures the correct distance. SCG does not impose any restriction on the size or the number of the clusters that it finds. The boundaries of clusters are determined by the inconsistencies in the ranks. In addition to the direct implementation that finds the complete structure of the (sub)clusters we implemented two faster versions. The fastest version is guaranteed to find only the clusters that are not subclusters of any other clusters and the other version yields the same output as the direct implementation but does so more efficiently.
Our tests have demonstrated that SCG yields very few false positives. This was accomplished by introducing errors in the distance measurement. Clustering of protein domain representatives by structural similarity showed that SCG could recover homologous groups with high precision.
SCG has potential for finding biological relationships under stringent conditions.
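The self-consistency criterion described above (members closer in rank to each other than to any outsider) can be illustrated with a brute-force sketch. This is one possible reading of the criterion and not the authors' direct or accelerated implementations: a candidate group is taken to be a point plus its k nearest neighbours, and it is accepted only if every member of the group produces exactly the same set from its own ranking. The function name, the tie-breaking by index, and the example points are all assumptions made for illustration.

```python
def self_consistent_groups(dist):
    """Find groups whose members all rank each other ahead of every outsider.

    `dist` is a symmetric matrix (list of lists) from any chosen distance
    metric. Ties are broken by index order, which is an arbitrary choice.
    """
    n = len(dist)
    # For each point, all other points ordered from nearest to farthest.
    ranked = [
        sorted((j for j in range(n) if j != i), key=lambda j: dist[i][j])
        for i in range(n)
    ]
    groups = set()
    for i in range(n):
        # k neighbours give a group of size k + 1; the all-points group is skipped.
        for k in range(1, n - 1):
            candidate = frozenset([i] + ranked[i][:k])
            # Accept only if every member sees the same group from its own ranks.
            if all(frozenset([j] + ranked[j][:k]) == candidate for j in candidate):
                groups.add(candidate)
    return sorted((sorted(g) for g in groups), key=lambda g: (len(g), g))


# Hypothetical one-dimensional points; distance is the absolute difference.
points = [0, 1, 10, 11, 13]
matrix = [[abs(a - b) for b in points] for a in points]
found = self_consistent_groups(matrix)
```

On these points the sketch reports the groups {0, 1}, {2, 3} and {2, 3, 4}, showing how a subcluster can nest inside a larger cluster without any limit on cluster size or number being set in advance. Only ranks enter the test, so rescaling the distances monotonically leaves the result unchanged.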
Candidatus Liberibacter asiaticus (Ca. L. asiaticus) is a parasitic Gram-negative bacterium that is closely associated with Huanglongbing (HLB), a worldwide citrus disease. Given the difficulty in culturing the bacterium and thus in its experimental characterization, computational analyses of the whole Ca. L. asiaticus proteome can provide much needed insights into the mechanisms of the disease and guide the development of treatment strategies. In this study, we applied state-of-the-art sequence analysis tools to every Ca. L. asiaticus protein. Our results are available as a public website at http://prodata.swmed.edu/liberibacter_asiaticus/. In particular, we manually curated the results to predict the subcellular localization, spatial structure and function of all Ca. L. asiaticus proteins (http://prodata.swmed.edu/liberibacter_asiaticus/curated/). This extensive information should facilitate the study of Ca. L. asiaticus proteome function and its relationship to disease. Pilot studies based on the information from our website have revealed several potential virulence factors, discussed herein.
Evolutionary theory suggests that the force of natural selection decreases with age. To explore the extent to which this prediction directly affects protein structure and function, we used multiple regression to find longevity-selected positions, defined as the columns of a sequence alignment conserved in long-lived but not short-lived mammal species. We analyzed 7,590 orthologous protein families in 33 mammalian species, accounting for body mass, phylogeny, and species-specific mutation rate. Overall, we found that the number of longevity-selected positions in the mammalian proteome is much higher than would be expected by chance. Further, these positions are enriched in domains of several proteins that interact with one another in inflammation and other aging-related processes, as well as in organismal development. We present as an example the kinase domain of anti-Müllerian hormone type-2 receptor (AMHR2). AMHR2 inhibits ovarian follicle recruitment and growth, and a homology model of the kinase domain shows that its longevity-selected positions cluster near a SNP associated with delayed human menopause. Distinct from its canonical role in development, this region of AMHR2 may function to regulate the protein’s activity in a lifespan-specific manner.
Intramembrane proteases are responsible for a number of regulated proteolysis events occurring within or near the plasma and intracellular membranes. Members of one large and diverse family of putative intramembrane metalloproteases are widely distributed in all domains of life, including the type II CAAX prenyl proteases and their prokaryotic homologs with putative bacteriocin-related functions. We used sensitive sequence similarity searches to expand this large CPBP (CAAX Proteases and Bacteriocin-Processing enzymes) family to include more than 5,800 members, and infer its homologous relationships to several other protein families, including the PrsW proteases, the DUF2324 family and the γ-secretase subunit APH-1 proteins. They share four predicted core transmembrane segments and possess similar, yet distinct sets of sequence motifs. Remote similarity between APH-1 and membrane proteases sheds light on APH-1’s evolutionary origin and raises the possibility that APH-1 may possess proteolytic activity in the current or ancestral form of γ-secretase.
type II CAAX protease; APH-1; γ-secretase; PrsW; DUF2324; intramembrane protease
Most core components of the neurotransmitter release machinery have homologues in other types of intracellular membrane traffic, likely underlying a universal mechanism of intracellular membrane fusion. However, no clear similarity between Munc13s and protein families generally involved in membrane traffic has been reported, despite the essential nature of Munc13s for neurotransmitter release. This crucial function was ascribed to a minimal Munc13 region called the MUN domain, which likely participates in SNARE complex assembly and is also found in CAPS. We have now used comparative sequence and structural analyses to study the structure and evolutionary origin of the MUN domain. We found weak, yet significant sequence similarities between the MUN domain and a set of protein subunits from several related vesicle tethering complexes, such as Sec6 from the exocyst complex and Vps53 from the GARP complex. Such an evolutionary relationship allows structure prediction of the MUN domain and suggests functional similarities between MUN domain-containing proteins and multisubunit tethering complexes such as exocyst, COG, GARP and Dsl1p. These findings further unify the mechanism of neurotransmitter release with those of other types of intracellular membrane traffic, and in turn support a role for tethering complexes in SNARE complex assembly.
Munc13; CAPS; MUN domain; multisubunit tethering complexes exocyst, COG, GARP and Dsl1p complex; homology inference and structure prediction
A number of membrane-spanning proteins possess enzymatic activity and catalyze important reactions involving proteins, lipids or other substrates located within or near lipid bilayers. Alkaline ceramidases are seven-transmembrane proteins that hydrolyze the amide bond in ceramide to form sphingosine. Recently, a group of putative transmembrane receptors called progestin and adipoQ receptors (PAQRs) were found to be distantly related to alkaline ceramidases, raising the possibility that they may also function as membrane enzymes.
Using sensitive similarity search methods, we identified statistically significant sequence similarities among several transmembrane protein families including alkaline ceramidases and PAQRs. They were unified into a large and diverse superfamily of putative membrane-bound hydrolases called CREST (alkaline ceramidase, PAQR receptor, Per1, SID-1 and TMEM8). The CREST superfamily embraces a plethora of cellular functions and biochemical activities, including putative lipid-modifying enzymes such as ceramidases and the Per1 family of putative phospholipases involved in lipid remodeling of GPI-anchored proteins, putative hormone receptors, bacterial hemolysins, the TMEM8 family of putative tumor suppressors, and the SID-1 family of putative double-stranded RNA transporters involved in RNA interference. Extensive similarity searches and clustering analysis also revealed several groups of proteins with unknown function in the CREST superfamily. Members of the CREST superfamily share seven predicted core transmembrane segments with several conserved sequence motifs.
Universal conservation of a set of histidine and aspartate residues across all groups in the CREST superfamily, coupled with independent discoveries of hydrolase activities in alkaline ceramidases and the Per1 family as well as results from previous mutational studies of Per1, suggests that the majority of CREST members are metal-dependent hydrolases.
This article was reviewed by Kira S. Makarova, Igor B. Zhulin and Rob Knight.
RfaH, a paralog of the general transcription factor NusG, is recruited to elongating RNA polymerase at specific regulatory sites. The X-ray structure of Escherichia coli RfaH reported here reveals two domains. The N-terminal domain displays high similarity to that of NusG. In contrast, the α-helical coiled-coil C domain, while retaining sequence similarity, is strikingly different from the β barrel of NusG. To our knowledge, such an all-β to all-α transition of the entire domain is the most extreme example of protein fold evolution known to date. Both N domains possess a vast hydrophobic cavity that is buried by the C domain in RfaH but is exposed in NusG. We propose that this cavity constitutes the RNA polymerase-binding site, which becomes unmasked in RfaH only upon sequence-specific binding to the nontemplate DNA strand that triggers domain dissociation. Finally, we argue that RfaH binds to the β′ subunit coiled coil, the major target site for the initiation σ factors.
REDD1 is a conserved stress-response protein that regulates mTORC1, a critical regulator of cell growth and proliferation that is implicated in cancer. REDD1 is induced by hypoxia and REDD1 overexpression is sufficient to inhibit mTORC1. mTORC1 is regulated by the small GTPase Rheb, which in turn is regulated by the GTPase-activating protein complex, TSC1/TSC2. REDD1-induced mTORC1 inhibition requires the TSC1/TSC2 complex, and REDD1 has been proposed to act by directly binding to and sequestering 14-3-3 proteins away from TSC2, leading to TSC2-dependent inhibition of mTORC1. Structure/function analyses have led us to identify two segments in REDD1 that are essential for function, which act in an interdependent manner. We have determined a crystal structure of REDD1 at 2.0 Å resolution, which shows that these two segments fold together to form an intact domain with a novel fold. This domain is characterized by an α/β sandwich consisting of two antiparallel α-helices and a mixed β-sheet encompassing an uncommon psi-loop motif. Structure-based docking and functional analyses suggest that REDD1 does not directly bind to 14-3-3 proteins. Sequence conservation mapping to the surface of the structure and mutagenesis studies demarcated a hotspot likely to interact with effector proteins that is essential for REDD1-mediated mTORC1 inhibition.
REDD1; DDIT4; 14-3-3; Hypoxia; mTOR; TSC2
Flavin mononucleotide adenylyltransferase (FMNAT) catalyzes the formation of the essential flavocoenzyme FAD and plays an important role in flavocoenzyme homeostasis regulation. By sequence comparison, bacterial and eukaryotic FMNAT enzymes belong to two different protein superfamilies and apparently utilize different sets of active site residues to accomplish the same chemistry. Here we report the first structural characterization of a eukaryotic FMNAT from a pathogenic yeast Candida glabrata (CgFMNAT). Four crystal structures of CgFMNAT in different complexed forms were determined at 1.20–1.95 Å resolutions, capturing the enzyme active site states prior to and after catalysis. These structures reveal a novel flavin-binding mode and a unique enzyme-bound FAD conformation. Comparison of the bacterial and eukaryotic FMNATs provides a structural basis for understanding the convergent evolution of the same FMNAT activity from different protein ancestors. Structure-based investigation of the kinetic properties of FMNAT should offer insights into the regulatory mechanisms of FAD homeostasis by FMNAT in eukaryotic organisms.
flavocoenzymes; FAD biosynthesis; adenylyltransferase; Rossmann-like fold; convergent evolution
Summary: Profile-based similarity search is an essential step in structure-function studies of proteins. However, inclusion of non-homologous sequence segments into a profile causes its corruption and results in false positives. Profile corruption is common in multidomain proteins, and single domains with long insertions are a significant source of errors. We developed a procedure (HangOut) that, for a single domain with specified insertion position, cleans erroneously extended PSI-BLAST alignments to generate better profiles.
Availability: HangOut is implemented in Python 2.3 and runs on all Unix-compatible platforms. The source code is available under the GNU GPL license at http://prodata.swmed.edu/HangOut/
Contact: email@example.com; firstname.lastname@example.org
Supplementary information: Supplementary data are available at Bioinformatics online.
The catalytic engine of RNAi is the RNA-induced silencing complex (RISC), wherein the endoribonuclease Argonaute and single-stranded siRNA direct target mRNA cleavage. Here we have reconstituted long dsRNA- and duplex siRNA-initiated RISC activities using recombinant Drosophila Dicer-2, R2D2 and Ago2 proteins. We employ this core reconstitution system to purify an RNAi regulator, C3PO (component 3 promoter of RISC), which is a complex of Translin and Trax. C3PO is a Mg2+-dependent endoribonuclease that promotes RISC activation by removing siRNA passenger strand cleavage products. These studies establish an in vitro RNAi reconstitution system and identify C3PO as a key activator of the core RNAi machinery.
RNAi; RISC; Dcr-2/R2D2; Ago2; C3PO; endoribonuclease
Obesity and insulin resistance are associated with deposition of triglycerides in tissues other than adipose tissue. Previously, we showed that a missense mutation (I148M) in PNPLA3 (patatin-like phospholipase domain-containing 3 protein) is associated with increased hepatic triglyceride content in humans. Here we examined the effect of the I148M substitution on the enzymatic activity and cellular location of PNPLA3. Structural modeling predicted that the substitution of methionine for isoleucine at residue 148 would restrict access of substrate to the catalytic serine at residue 47. In vitro assays using recombinant PNPLA3 partially purified from Sf9 cells confirmed that the wild type enzyme hydrolyzes emulsified triglyceride and that the I148M substitution abolishes this activity. Expression of PNPLA3-I148M, but not wild type PNPLA3, in cultured hepatocytes or in the livers of mice increased cellular triglyceride content. Cell fractionation studies revealed that ∼90% of wild type PNPLA3 partitioned between membranes and lipid droplets; substitution of methionine for isoleucine at position 148 did not alter the subcellular distribution of the protein. These data are consistent with PNPLA3-I148M promoting triglyceride accumulation by limiting triglyceride hydrolysis.
Cell/Hepatocyte; Lipid; Lipid/Lipase; Lipid/Triacylglycerol; Membrane/Lipids; Metabolism/Fatty Acid; Metabolism/Lipid; Metabolism/Lipogenesis; Lipase; Lipolysis
Profile-based comparison of multiple sequence alignments is a powerful methodology for the detection of remote protein sequence similarity, which is essential for the inference and analysis of protein structure, function, and evolution. Accurate estimation of statistical significance of detected profile similarities is essential for further development of this methodology. Here we analyze a novel approach to estimate the statistical significance of profile similarity: the explicit consideration of background score distributions for each database template (subject).
Using a simple scheme to combine and analytically approximate query- and subject-based distributions, we show that (i) inclusion of background distributions for the subjects increases the quality of homology detection; (ii) this increase is higher when the distributions are based on the scores to all known non-homologs of the subject rather than a small calibration subset of the database representatives; and (iii) these all known non-homolog distributions of scores for the subject make the dominant contribution to the improved performance: adding the calibration distribution of the query has a negligible additional effect.
The construction of distributions based on the complete sets of non-homologs for each subject is particularly relevant in the setting of structure prediction where the database consists of proteins with solved 3D structure (PDB, SCOP, CATH, etc.) and therefore structural relationships between proteins are known. These results point to a potential new direction in the development of more powerful methods for remote homology detection.
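The idea of judging a profile similarity score against a background distribution specific to each subject can be illustrated with an empirical p-value. The sketch below is a simplified stand-in and not the analytical approximation or the combination scheme used in the study, whose details are not given here. The function name and all scores are hypothetical, and the add-one correction is a common convention chosen for this example so that the estimate is never exactly zero.

```python
def empirical_p_value(score, background_scores):
    """Fraction of background scores at least as high as `score`.

    `background_scores` is assumed to hold the scores of one subject
    against its known non-homologs. One is added to the numerator and
    the denominator so the estimate never reaches zero.
    """
    at_least_as_high = sum(1 for b in background_scores if b >= score)
    return (at_least_as_high + 1) / (len(background_scores) + 1)


# Hypothetical background scores for two database subjects.
background_a = [10, 12, 15, 20, 30]   # subject that rarely scores high by chance
background_b = [40, 45, 50, 55, 60]   # subject that often scores high by chance

# The same raw score is judged very differently against each background.
p_a = empirical_p_value(35, background_a)
p_b = empirical_p_value(35, background_b)
```

The contrast between the two subjects shows why a subject-specific background can matter: a raw score that is exceptional for one template may be routine for another, for example a template with compositional bias that scores well against many unrelated queries.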
The crystal structure of the NGO1945 gene product from N. gonorrhoeae (UniProt Q5F5I0) reveals that the N-terminal domain assigned as a domain of unknown function (DUF2063) is likely to bind DNA and that the protein may be involved in transcriptional regulation.
Proteins with the DUF2063 domain constitute a new Pfam family, PF09836. The crystal structure of a member of this family, NGO1945 from Neisseria gonorrhoeae, has been determined and reveals that the N-terminal DUF2063 domain is likely to be a DNA-binding domain. In conjunction with the rest of the protein, NGO1945 is likely to be involved in transcriptional regulation, which is consistent with genomic neighborhood analysis. Of the 216 currently known proteins that contain a DUF2063 domain, the most significant sequence homologs of NGO1945 (∼40–99% sequence identity) are from various Neisseria and Haemophilus species. As these are important human pathogens, NGO1945 represents an interesting candidate for further exploration via biochemical studies and possible therapeutic intervention.
NGO1945; PF09836; DUF2063; putative DNA-binding proteins; putative transcription regulators; structural genomics