Alternative mRNA splicing is the main reason vast mammalian proteomic complexity can be achieved with a limited number of genes. Splicing is physically and functionally coupled to transcription, and is greatly affected by the rate of transcript elongation [1,2,3]. As the nascent pre-mRNA emerges from transcribing RNA polymerase II (RNAPII), it is assembled into a messenger ribonucleoprotein (mRNP) particle, which is its functional form and determines the fate of the mature transcript [4]. However, factors that connect the transcribing polymerase with the mRNP particle and help integrate transcript elongation with mRNA splicing remain obscure. Here, we characterized the interactome of chromatin-associated mRNP particles. This led to the identification of Deleted in Breast Cancer 1 (DBC1) and a protein we named ZIRD as subunits of a novel protein complex, named DBIRD, which binds directly to RNAPII. DBIRD regulates alternative splicing of a large set of exons embedded in A/T-rich DNA, and is present at the affected exons. RNAi-mediated DBIRD depletion results in region-specific decreases in transcript elongation, particularly across areas encompassing affected exons. Together, these data indicate that the DBIRD complex acts at the interface between mRNP particles and RNAPII, integrating transcript elongation with the regulation of alternative splicing.
The discovery of regulatory motifs enriched in sets of DNA or RNA sequences is fundamental to the analysis of a great variety of functional genomics experiments. These motifs usually represent binding sites of proteins or non-coding RNAs, which are best described by position weight matrices (PWMs). We have recently developed XXmotif, a de novo motif discovery method that is able to directly optimize the statistical significance of PWMs. XXmotif can also score conservation and positional clustering of motifs. The XXmotif server provides (i) a list of significantly overrepresented motif PWMs with web logos and E-values; (ii) a graph with color-coded boxes indicating the positions of selected motifs in the input sequences; (iii) a histogram of the overall positional distribution for selected motifs and (iv) a page for each motif with all significant motif occurrences, their P-values for enrichment, conservation and localization, their sequence contexts and coordinates. Free access: http://xxmotif.genzentrum.lmu.de.
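To make the notion of a position weight matrix concrete, the following minimal sketch builds a log-odds PWM from a few aligned sites and scans a sequence for the best-scoring window. It is only an illustration of what a PWM is and how it scores a site; it does not reproduce XXmotif's significance optimization. The function names, the toy sites, the uniform background of 0.25 and the pseudocount of 1 are all assumptions made for this example.

```python
import math

BASES = "ACGT"  # assumption: unambiguous DNA only

def build_pwm(sites, pseudocount=1.0, background=0.25):
    """Build a log2-odds PWM from equal-length aligned binding sites."""
    length = len(sites[0])
    if any(len(site) != length for site in sites):
        raise ValueError("all sites must have the same length")
    total = len(sites) + pseudocount * len(BASES)
    pwm = []
    for pos in range(length):
        column = [site[pos] for site in sites]
        pwm.append({
            base: math.log2(((column.count(base) + pseudocount) / total) / background)
            for base in BASES
        })
    return pwm

def score_window(pwm, window):
    """Sum the per-position log-odds scores of one window."""
    return sum(col[base] for col, base in zip(pwm, window))

def best_hit(pwm, sequence):
    """Return (start, score) of the highest-scoring window in the sequence."""
    width = len(pwm)
    scored = [(score_window(pwm, sequence[i:i + width]), i)
              for i in range(len(sequence) - width + 1)]
    score, start = max(scored)
    return start, score

# Toy data (hypothetical): three TATA-like sites and a short sequence to scan.
toy_sites = ["TATAAA", "TATATA", "TATAAT"]
toy_pwm = build_pwm(toy_sites)
start, score = best_hit(toy_pwm, "GGCTATAAAGG")
print(start, round(score, 2))
```

A positive log-odds score means the window resembles the motif more than the assumed background; a real analysis would still need a significance estimate, which is the part the sketch leaves out.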
The MR (Mre11 nuclease and Rad50 ABC ATPase) complex is an evolutionarily conserved sensor for DNA double-strand breaks, highly genotoxic lesions linked to cancer development. MR can recognize and process DNA ends even if they are blocked and misfolded. To reveal its mechanism, we determined the crystal structure of the catalytic head of Thermotoga maritima MR and analyzed ATP-dependent conformational changes. MR adopts an open form with a central Mre11 nuclease dimer and two peripheral Rad50 molecules, a form suited for sensing obstructed breaks. The Mre11 C-terminal helix-loop-helix domain binds Rad50 and attaches flexibly to the nuclease domain, enabling large conformational changes. ATP binding to the two Rad50 subunits induces a rotation of the Mre11 helix-loop-helix and Rad50 coiled-coil domains, creating a clamp conformation with increased DNA binding activity. The results suggest that MR is an ATP-controlled transient molecular clamp at DNA double-strand breaks.
Rad50; Mre11; DNA double-strand break repair; X-ray crystallography; protein complex; homologous recombination; ABC ATPases
Initiation of RNA polymerase (Pol) II transcription requires assembly of the pre-initiation complex (PIC) at the promoter. In the classical view, PIC assembly starts with binding of the TATA box-binding protein (TBP) to the TATA box. However, a TATA box occurs in only 15% of promoters in the yeast Saccharomyces cerevisiae, raising the question of how most yeast promoters nucleate PIC assembly. Here we show that one third of all yeast promoters contain a novel conserved DNA element, the GA element (GAE), that generally does not co-occur with the TATA box. The distance of the GAE to the transcription start site (TSS) resembles the distance of the TATA box to the TSS. The TATA-less TMT1 core promoter contains a GAE, recruits TBP, and supports formation of a TBP-TFIIB-DNA complex. Mutation of the promoter region surrounding the GAE abolishes transcription in vivo and in vitro. A 32-nucleotide promoter region containing the GAE can functionally substitute for the TATA box in a TATA-containing promoter. This identifies the GAE as a conserved promoter element in TATA-less promoters.
The MOF (males absent on the first)-containing NSL (non-specific lethal) complex binds to a subset of active promoters in Drosophila melanogaster and is thought to contribute to proper gene expression. The determinants that target NSL to specific promoters and the circumstances in which the complex engages in regulating transcription are currently unknown. Here, we show that the NSL complex primarily targets active promoters and in particular housekeeping genes, at which it colocalizes with the chromatin remodeler NURF (nucleosome remodeling factor) and the histone methyltransferase Trithorax. However, only a subset of housekeeping genes associated with NSL are actually activated by it. Our analyses reveal that these NSL-activated promoters are depleted of certain insulator binding proteins and are enriched for the core promoter motif ‘Ohler 5’. Based on these results, it is possible to predict whether the NSL complex is likely to regulate a particular promoter. We conclude that the regulatory capacity of the NSL complex is highly context-dependent. Activation by the NSL complex requires a particular promoter architecture defined by combinations of chromatin regulators and core promoter motifs.
Fast, scalable generation of high-quality protein multiple sequence alignments using Clustal Omega
Multiple sequence alignments are fundamental to many sequence analysis methods. The new program Clustal Omega can align virtually any number of protein sequences quickly and has powerful features for adding sequences to existing precomputed alignments.
Multiple sequence alignments are fundamental to many sequence analysis methods. Most alignments are computed using the progressive alignment heuristic. These methods are starting to become a bottleneck in some analysis pipelines when faced with data sets of the size of many thousands of sequences. Some methods allow computation of larger data sets while sacrificing quality, and others produce high-quality alignments, but scale badly with the number of sequences. In this paper, we describe a new program called Clustal Omega, which can align virtually any number of protein sequences quickly and that delivers accurate alignments. The accuracy of the package on smaller test cases is similar to that of the high-quality aligners. On larger data sets, Clustal Omega outperforms other packages in terms of execution time and quality. Clustal Omega also has powerful features for adding sequences to and exploiting information in existing alignments, making use of the vast amount of precomputed information in public databases like Pfam.
bioinformatics; hidden Markov models; multiple sequence alignment
Several mammalian proteins involved in chromatin and DNA modification contain CXXC zinc finger domains. We compared the structure and function of the CXXC domains in the DNA methyltransferase Dnmt1 and the methylcytosine dioxygenase Tet1. Sequence alignment showed that both CXXC domains have a very similar framework but differ in the central tip region. Based on the known structure of a similar MLL1 domain, we developed homology models and designed expression constructs for the isolated CXXC domains of Dnmt1 and Tet1 accordingly. We show that the CXXC domain of Tet1 has no DNA binding activity and is dispensable for catalytic activity in vivo. In contrast, the CXXC domain of Dnmt1 selectively binds DNA substrates containing unmethylated CpG sites. Surprisingly, a Dnmt1 mutant construct lacking the CXXC domain formed covalent complexes with cytosine bases both in vitro and in vivo and rescued DNA methylation patterns in dnmt1−/− embryonic stem cells (ESCs) just as efficiently as wild-type Dnmt1. Interestingly, neither wild-type nor ΔCXXC Dnmt1 re-methylated imprinted CpG sites of the H19a promoter in dnmt1−/− ESCs, arguing against a role of the CXXC domain in restraining Dnmt1 methyltransferase activity on unmethylated CpG sites.
Pur-α is a nucleic acid-binding protein involved in cell cycle control, transcription, and neuronal function. Initially, no prediction of the three-dimensional structure of Pur-α was possible. However, we recently solved the X-ray structure of Pur-α from the fruit fly Drosophila melanogaster and showed that it contains a so-called PUR domain. Here we explain how we exploited bioinformatics tools in combination with X-ray structure determination of a bacterial homolog to obtain diffracting crystals and the high-resolution structure of Drosophila Pur-α. First, we used sensitive methods for remote-homology detection to find three repetitive regions in Pur-α. We realized that our lack of understanding of how these repeats interact to form a globular domain was a major problem for crystallization and structure determination. With our information on the repeat motifs, we then identified a distant bacterial homolog that contains only one repeat. We determined the bacterial crystal structure and found that two of the repeats interact to form a globular domain. Based on this bacterial structure, we calculated a computational model of the eukaryotic protein. The model allowed us to design a crystallizable fragment and to determine the structure of Drosophila Pur-α. Key to success was the fact that single repeats of the bacterial protein self-assembled into a globular domain, instructing us on the number and boundaries of repeats to be included for crystallization trials with the eukaryotic protein. This study demonstrates that the simpler structural domain arrangement of a distant prokaryotic protein can guide the design of eukaryotic crystallization constructs. Since many eukaryotic proteins contain multiple repeats or repeating domains, this approach might be instructive for structural studies of a range of proteins.
Outer membrane proteins (OMPs) are the transmembrane proteins found in the outer membranes of Gram-negative bacteria, mitochondria and plastids. Most prediction methods have focused on analogous features, such as alternating hydrophobicity patterns. Here, we start from the observation that almost all β-barrel OMPs are related by common ancestry. We identify proteins as OMPs by detecting their homologous relationships to known OMPs using sequence similarity. Given an input sequence, HHomp builds a profile hidden Markov model (HMM) and compares it with an OMP database by pairwise HMM comparison, integrating OMP predictions by PROFtmb. A crucial ingredient is the OMP database, which contains profile HMMs for over 20 000 putative OMP sequences. These were collected with the exhaustive, transitive homology detection method HHsenser, starting from 23 representative OMPs in the PDB database. In a benchmark on TransportDB, HHomp detects 63.5% of the true positives before including the first false positive. This is 70% more than PROFtmb, four times more than BOMP and 10 times more than TMB-Hunt. In Escherichia coli, HHomp identifies 57 out of 59 known OMPs and correctly assigns them to their functional subgroups. HHomp can be accessed at http://toolkit.tuebingen.mpg.de/hhomp.
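The benchmark measure quoted above, the share of true positives detected before the first false positive, can be computed from any ranked hit list. The sketch below is a minimal illustration under stated assumptions: the function name is hypothetical, hits are assumed to be sorted by decreasing confidence, and each hit is reduced to a boolean label (True for a genuine OMP). It is not the evaluation code used for the HHomp benchmark.

```python
def tp_fraction_before_first_fp(ranked_labels, total_positives=None):
    """Fraction of all true positives ranked above the first false positive.

    ranked_labels: booleans ordered from most to least confident hit;
                   True marks a true positive, False a false positive.
    total_positives: number of positives in the benchmark; defaults to the
                     number of True labels in the list.
    """
    labels = list(ranked_labels)
    if total_positives is None:
        total_positives = sum(labels)
    if total_positives == 0:
        raise ValueError("benchmark contains no true positives")
    found = 0
    for is_positive in labels:
        if not is_positive:
            break  # stop at the first false positive
        found += 1
    return found / total_positives

# Toy ranking (hypothetical): three correct hits, one error, one more correct hit.
toy_ranking = [True, True, True, False, True]
print(tp_fraction_before_first_fp(toy_ranking))  # 0.75
```

Because the measure stops at the first error, it rewards methods that keep their top-ranked predictions clean, which is the situation a user faces when inspecting only the best hits.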
In recent years, methods for remote homology detection have become increasingly sensitive and reliable. Automatic structure prediction servers relying on these methods can generate useful 3D models even below 20% sequence identity between the protein of interest and the known structure (template). When no homologs can be found in the protein structure database (PDB), the user would need to rerun the same search at regular intervals in order to make timely use of a template once it becomes available.
PDBalert is a web-based automatic system that sends an email alert as soon as a structure with homology to a protein in the user's watch list is released to the PDB database or appears among the sequences on hold. The mail contains links to the search results and to an automatically generated 3D homology model. The sequence search is performed with the same software as used by the very sensitive and reliable remote homology detection server HHpred, which is based on pairwise comparison of Hidden Markov models.
PDBalert will accelerate the information flow from the PDB database to all those who can profit from the newly released protein structures for predicting the 3D structure or function of their proteins of interest.
Motivation: Phospholipid scramblases (PLSCRs) constitute a family of cytoplasmic membrane-associated proteins that were identified based upon their capacity to mediate a Ca²⁺-dependent bidirectional movement of phospholipids across membrane bilayers, thereby collapsing the normally asymmetric distribution of such lipids in cell membranes. The exact function and mechanism(s) of these proteins nevertheless remain obscure: data from several laboratories now suggest that in addition to their putative role in mediating transbilayer flip/flop of membrane lipids, the PLSCRs may also function to regulate diverse processes including signaling, apoptosis, cell proliferation and transcription. A major impediment to deducing the molecular details underlying the seemingly disparate biology of these proteins is the current absence of any representative molecular structures to provide guidance to the experimental investigation of their function.
Results: Here, we show that the enigmatic PLSCR family of proteins is directly related to another family of cellular proteins with a known structure. The Arabidopsis protein At5g01750 from the DUF567 family was solved by X-ray crystallography and provides the first structural model for this family. This model reveals that the presumed C-terminal transmembrane helix is buried within the core of the PLSCR structure, suggesting that palmitoylation may represent the principal membrane anchorage for these proteins. The fold of the PLSCR family is also shared by Tubby-like proteins. A search of the PDB with the HHpred server suggests a common evolutionary ancestry. Common functional features also suggest that tubby and PLSCR share a functional origin as membrane-tethered transcription factors with capacity to modulate phosphoinositide-based signaling.
The outer membrane protein OmpW from E. coli was overexpressed in inclusion bodies and refolded with the help of detergent. The protein has been crystallized and the crystals diffract to 3.5 Å resolution.
OmpW is an eight-stranded 21 kDa molecular-weight β-barrel protein from the outer membrane of Gram-negative bacteria. It is a major antigen in bacterial infections and has implications in antibiotic resistance and in the oxidative degradation of organic compounds. OmpW from Escherichia coli was cloned and the protein was expressed in inclusion bodies. A method for refolding and purification was developed which yields properly folded protein according to circular-dichroism measurements. The protein has been crystallized and crystals were obtained that diffracted to a resolution limit of 3.5 Å. The crystals belong to space group P422, with unit-cell parameters a = 122.5, c = 105.7 Å. A homology model of OmpW is presented based on known structures of eight-stranded β-barrels, intended for use in molecular-replacement trials.
OmpW; membrane proteins; outer membrane; homology modelling
Histones organize the genomic DNA of eukaryotes into chromatin. The four core histone subunits consist of two consecutive helix-strand-helix motifs and are interleaved into heterodimers with a unique fold. We have searched for the evolutionary origin of this fold using sequence and structure comparisons, based on the hypothesis that folded proteins evolved by combination of an ancestral set of peptides, the antecedent domain segments.
Our results suggest that an antecedent domain segment, corresponding to one helix-strand-helix motif, gave rise divergently to the N-terminal substrate recognition domain of Clp/Hsp100 proteins and to the helical part of the extended ATPase domain found in AAA+ proteins. The histone fold arose subsequently from the latter through a 3D domain-swapping event. To our knowledge, this is the first example of a genetically fixed 3D domain swap that led to the emergence of a protein family with novel properties, establishing domain swapping as a mechanism for protein evolution.
The helix-strand-helix motif common to these three folds provides support for our theory of an 'ancient peptide world' by demonstrating how an ancestral fragment can give rise to three different folds.
Solenoid repeat proteins of the Tetratrico Peptide Repeat (TPR) family are involved as scaffolds in a broad range of protein-protein interactions. Several resources are available for the prediction of TPRs; however, they often fail to detect divergent repeat units.
We have developed TPRpred, a profile-based method which uses a P-value-dependent score offset to include divergent repeat units and which exploits the tendency of repeats to occur in tandem. TPRpred detects not only TPR-like repeats, but also the related Pentatrico Peptide Repeats (PPRs) and SEL1-like repeats. The corresponding profiles were generated through iterative searches, by varying the threshold parameters for inclusion of repeat units into the profiles, and the best profiles were selected based on their performance on proteins of known structure. We benchmarked the performance of TPRpred in detecting TPR-containing proteins and in delineating the individual repeats therein, against currently available resources.
TPRpred performs significantly better in detecting divergent repeats in TPR-containing proteins, and finds more individual repeats than the existing methods. The web server is available at , and the C++ and Perl sources of TPRpred along with the profiles can be downloaded from .
HHrep is a web server for the de novo identification of repeats in protein sequences, which is based on the pairwise comparison of profile hidden Markov models (HMMs). Its main strength is its sensitivity, allowing it to detect highly divergent repeat units in protein sequences whose repeats could as yet only be detected from their structures. Examples include sequences with β-propeller fold, ferredoxin-like fold, double psi barrels or (βα)₈ (TIM) barrels. We illustrate this with proteins from four superfamilies of TIM barrels by revealing a clear 4- and 8-fold symmetry, which we detect solely from their sequences. This symmetry might be the trace of an ancient origin through duplication of a βαβα or βα unit. HHrep can be accessed at .
The MPI Bioinformatics Toolkit is an interactive web service which offers access to a great variety of public and in-house bioinformatics tools. They are grouped into different sections that support sequence searches, multiple alignment, secondary and tertiary structure prediction and classification. Several public tools are offered in customized versions that extend their functionality. For example, PSI-BLAST can be run against regularly updated standard databases, customized user databases or selectable sets of genomes. Another tool, Quick2D, integrates the results of various secondary structure, transmembrane and disorder prediction programs into one view. The Toolkit provides a friendly and intuitive user interface with an online help facility. As a key feature, various tools are interconnected so that the results of one tool can be forwarded to other tools. For example, one could run PSI-BLAST, parse out a multiple alignment of selected hits and send the results to a cluster analysis tool. The Toolkit framework and the tools developed in-house will be packaged and made freely available under the GNU Lesser General Public Licence (LGPL). The Toolkit can be accessed at .
HHsenser is the first server to offer exhaustive intermediate profile searches, which it combines with pairwise comparison of hidden Markov models. Starting from a single protein sequence or a multiple alignment, it can iteratively explore whole superfamilies, producing few or no false positives. The output is a multiple alignment of all detected homologs. HHsenser's sensitivity should make it a useful tool for evolutionary studies. It may also aid applications that rely on diverse multiple sequence alignments as input, such as homology-based structure and function prediction, or the determination of functional residues by conservation scoring and functional subtyping.
HHsenser can be accessed at . It has also been integrated into our structure and function prediction server HHpred () to improve predictions for near-singleton sequences.
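The idea of an exhaustive intermediate (transitive) search can be sketched as a breadth-first exploration in which every newly found homolog becomes a query in turn. The code below is a generic illustration, not HHsenser's implementation: `transitive_search`, the `find_homologs` callback and the toy relation table are hypothetical, and the real server works with sequence profiles and acceptance criteria that are not modeled here.

```python
from collections import deque

def transitive_search(seed, find_homologs, max_rounds=None):
    """Collect everything reachable from `seed` through chains of homology hits.

    find_homologs: callable taking one query and returning an iterable of hits
                   (a stand-in for a single search round).
    max_rounds:    optional limit on the number of intermediate steps.
    """
    found = {seed}
    queue = deque([(seed, 0)])
    while queue:
        query, depth = queue.popleft()
        if max_rounds is not None and depth >= max_rounds:
            continue  # do not expand beyond the allowed number of rounds
        for hit in find_homologs(query):
            if hit not in found:
                found.add(hit)
                queue.append((hit, depth + 1))
    return found

# Toy relation table (hypothetical): A finds B, B finds C, D is never reached from A.
toy_hits = {"A": {"B"}, "B": {"C"}, "C": set(), "D": {"A"}}
superfamily = transitive_search("A", lambda q: toy_hits.get(q, set()))
print(sorted(superfamily))  # ['A', 'B', 'C']
```

In this toy case C is never hit by A directly and is reached only through the intermediate B, which is the gain a transitive search offers; in practice each accepted hit must be reliable, because one false positive can pull in an unrelated family.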
HHpred is a fast server for remote protein homology detection and structure prediction and is the first to implement pairwise comparison of profile hidden Markov models (HMMs). It can search a wide choice of databases, such as the PDB, SCOP, Pfam, SMART, COGs and CDD. It accepts a single query sequence or a multiple alignment as input. Within only a few minutes it returns the search results in a user-friendly format similar to that of PSI-BLAST. Search options include local or global alignment and scoring secondary structure similarity. HHpred can produce pairwise query-template alignments, multiple alignments of the query with a set of templates selected from the search results, as well as 3D structural models that are calculated by the MODELLER software from these alignments. A detailed help facility is available. As a demonstration, we analyze the sequence of SpoVT, a transcriptional regulator from Bacillus subtilis. HHpred can be accessed at .
REPPER (REPeats and their PERiodicities) is an integrated server that detects and analyzes regions with short gapless repeats in protein sequences or alignments. It finds periodicities by Fourier Transform (FTwin) and internal similarity analysis (REPwin). FTwin assigns numerical values to amino acids that reflect certain properties, for instance hydrophobicity, and gives information on corresponding periodicities. REPwin uses self-alignments and displays repeats that reveal significant internal similarities. Both programs use a sliding window to ensure that different periodic regions within the same protein are detected independently. FTwin and REPwin are complemented by secondary structure prediction (PSIPRED) and coiled coil prediction (COILS), making the server a versatile analysis tool for sequences of fibrous proteins. REPPER is available at .
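The Fourier-based periodicity analysis described above can be illustrated with a small sketch: residues are mapped to numerical values, and the spectral power of the resulting profile is evaluated at a range of candidate periods, either for the whole sequence or per sliding window. This is an assumption-laden illustration rather than the FTwin code. The function names are hypothetical, the Kyte-Doolittle hydropathy scale is chosen here only as one example of a property scale, and the period grid and window settings are arbitrary.

```python
import cmath
import math

# Kyte-Doolittle hydropathy values (assumed scale for this example).
KD = {"A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5, "Q": -3.5,
      "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5, "L": 3.8, "K": -3.9,
      "M": 1.9, "F": 2.8, "P": -1.6, "S": -0.8, "T": -0.7, "W": -0.9,
      "Y": -1.3, "V": 4.2}

def power_at_period(values, period):
    """Spectral power of a mean-centered numeric profile at one period."""
    n = len(values)
    mean = sum(values) / n
    total = sum((v - mean) * cmath.exp(-2j * math.pi * k / period)
                for k, v in enumerate(values))
    return abs(total) ** 2 / n

def dominant_period(sequence, periods):
    """Candidate period with the highest power for one sequence segment."""
    values = [KD[aa] for aa in sequence]
    return max(periods, key=lambda p: power_at_period(values, p))

def windowed_periods(sequence, window, step, periods):
    """Dominant period for each sliding window, as (start, period) pairs."""
    return [(start, dominant_period(sequence[start:start + window], periods))
            for start in range(0, len(sequence) - window + 1, step)]

# Toy input (hypothetical): a heptad-like repeat with hydrophobic residues
# at the first and fourth positions of each seven-residue unit.
toy_sequence = "LEELKKK" * 6
candidate_periods = [2.0 + 0.1 * i for i in range(81)]  # 2.0 to 10.0 residues
toy_period = dominant_period(toy_sequence, candidate_periods)
toy_windows = windowed_periods(toy_sequence, window=28, step=7,
                               periods=candidate_periods)
print(round(toy_period, 1), len(toy_windows))
```

For this toy heptad pattern the strongest signal lies near 3.5 residues, the spacing of hydrophobic positions within the seven-residue unit. Evaluating each window separately is what allows regions with different periodicities in one protein to be reported independently.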