Summary: Co-crystallization experiments of proteins with nucleic acids do not guarantee that both components are present in the crystal. We have previously developed DIBER to predict crystal content when protein and DNA are present in the crystallization mix. Here, we present RIBER, which should be used when protein and RNA are in the crystallization drop. The combined RIBER/DIBER suite builds on machine learning techniques to make reliable, quantitative predictions of crystal content for non-expert users and high-throughput crystallography.
Availability: The program source code, Linux binaries and a web server are available at http://diber.iimcb.gov.pl/. RIBER/DIBER requires diffraction data to at least 3.0 Å resolution in MTZ or CIF (web server only) format. The RIBER/DIBER code is subject to the GNU Public License.
Supplementary information: Supplementary data are available at Bioinformatics online.
Protein crystallographers who work on protein–nucleic acid complexes routinely face the problem that crystals that grow in a co-crystallization experiment do not necessarily contain all components present in the solution. The crystal content can be clarified by spectroscopic methods, but the equipment for such measurements is not commonly available. Alternatively, crystals can be washed, dissolved and analyzed by gel electrophoresis with appropriate staining, but this method is labor intensive, destructive and does not always provide a clear-cut answer.
In this work, we present the new program RIBER for detecting the presence of RNA stems in macromolecular crystals based on diffraction data alone. RIBER complements the previously developed program DIBER (Chojnowski and Bochtler, 2010), which was designed to detect double-stranded B-DNA rather than double-stranded A-RNA (Table 1). The two programs are implemented as a stand-alone software suite and a web server, RIBER/DIBER, which provide an easy way to judge the nucleic acid content of a crystal from a diffraction dataset before the crystal structure is solved. The method may help to avoid a laborious phasing procedure when the component or the complex of interest is not present in the crystal.
Crystal structures solved at 3.0 Å or better were downloaded from the Protein Data Bank (Bernstein et al., 1977) together with the corresponding experimental diffraction data. All reported calculations are based on experimental diffraction data. Structural information was only used to select and classify datasets according to their macromolecular content. Detailed information about the curation of the datasets used for training the RIBER classifier and for the DIBER benchmarks is available in the Supplementary Material.
The RIBER classification performance was estimated using a repeated subsampling validation procedure. In each cycle, the classifier was trained with equal numbers of randomly selected diffraction datasets from each class (50% of the instances in the least numerous class, the RNA-only crystals). The remaining structures were used for testing. The average classification performance from 100 training and testing cycles was used as an estimate of the true classification performance.
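The validation procedure described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the RIBER code: the function names, the dictionary layout of the datasets and the nearest-centroid stand-in classifier are all hypothetical (RIBER itself uses an SVM), and the feature values are toy numbers.

```python
import random
from statistics import mean

def balanced_split(datasets_by_class, train_fraction=0.5, rng=random):
    """Draw the same number of training items from every class.

    The per-class training size is train_fraction of the smallest class;
    everything not drawn for training goes to the test set.
    """
    n_train = int(train_fraction * min(len(v) for v in datasets_by_class.values()))
    train, test = [], []
    for label, items in datasets_by_class.items():
        shuffled = list(items)
        rng.shuffle(shuffled)
        train += [(x, label) for x in shuffled[:n_train]]
        test += [(x, label) for x in shuffled[n_train:]]
    return train, test

def fit_centroids(train):
    """Hypothetical stand-in classifier (NOT an SVM): per-class mean of 2D features."""
    sums = {}
    for (a, b), label in train:
        s = sums.setdefault(label, [0.0, 0.0, 0])
        s[0] += a
        s[1] += b
        s[2] += 1
    return {label: (s[0] / s[2], s[1] / s[2]) for label, s in sums.items()}

def predict_centroid(model, x):
    """Assign x to the class with the nearest centroid."""
    return min(model, key=lambda lab: (x[0] - model[lab][0]) ** 2
                                      + (x[1] - model[lab][1]) ** 2)

def repeated_subsampling(datasets_by_class, fit, predict, cycles=100, seed=0):
    """Average test accuracy over repeated balanced train/test splits."""
    rng = random.Random(seed)
    scores = []
    for _ in range(cycles):
        train, test = balanced_split(datasets_by_class, 0.5, rng)
        model = fit(train)
        correct = sum(predict(model, x) == label for x, label in test)
        scores.append(correct / len(test))
    return mean(scores)

# Toy, well-separated feature pairs; class sizes are deliberately unequal.
toy = {
    "rna": [(0.0 + 0.01 * i, 0.0) for i in range(10)],
    "protein_rna": [(10.0 + 0.01 * i, 10.0) for i in range(30)],
    "protein": [(20.0 + 0.01 * i, 0.0) for i in range(50)],
}
accuracy = repeated_subsampling(toy, fit_centroids, predict_centroid)
```

Because the training size is tied to the smallest class, the test sets remain unbalanced, so per-class rates are more informative than a single pooled accuracy when the classes differ greatly in size.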
Both RIBER and DIBER extract two parameters from the dataset. The first is a measure of the unit cell size and is primarily used to distinguish nucleic acid crystals from all others. The second is a measure of the largest local average of reflection intensities. A large value for this parameter indicates the presence of very characteristic diffraction signals related to the regular stacking of A-RNA or B-DNA base pairs. A support vector machine (SVM) is used to make a prediction, using either only the two parameters described above or optionally also a third score (combined mode, available for DIBER only), which is calculated with the help of the molecular replacement program PHASER (McCoy et al., 2007) for those users who hold a license (free for academic users). DIBER and RIBER use similar parameterizations of the diffraction data and an SVM to classify crystal content.
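To make the second parameter more concrete, the following sketch shows one possible reading of a "largest local average of reflection intensities". It is an assumption-laden illustration, not the DIBER/RIBER definition (which is given in Chojnowski and Bochtler, 2010): the function name, the spherical neighborhood, the `radius` parameter and the input layout are all hypothetical.

```python
import math

def largest_local_average(reflections, radius):
    """Return the largest mean intensity found in any local neighborhood.

    reflections: list of ((x, y, z), intensity) pairs, with positions given
                 in reciprocal-space Cartesian coordinates (assumed layout).
    radius:      neighborhood radius in the same units (hypothetical parameter).
    """
    best = float("-inf")
    for center, _ in reflections:
        # Every neighborhood contains at least its own central reflection.
        nearby = [inten for pos, inten in reflections
                  if math.dist(pos, center) <= radius]
        best = max(best, sum(nearby) / len(nearby))
    return best

# Toy data: two weak reflections near the origin, two strong ones close together.
toy_reflections = [
    ((0.00, 0.0, 0.0), 1.0),
    ((0.10, 0.0, 0.0), 1.0),
    ((1.00, 0.0, 0.0), 10.0),
    ((1.05, 0.0, 0.0), 14.0),
]
score = largest_local_average(toy_reflections, radius=0.2)
```

The quadratic loop is adequate for a toy example only; a real dataset would call for a spatial index or a gridded intensity map.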
The DIBER program has been benchmarked with structures that have appeared in the PDB since the stand-alone version of the program was developed in 2009 (as described in the Supplementary Material). Within the error limits, the benchmark results obtained for protein-only and DNA-only crystals agree with the previously published performance estimates (Chojnowski and Bochtler, 2010; Table 1 and Supplementary Table S2). Surprisingly, the currently observed correct classification rate for protein–DNA complexes is higher than previously reported. The discrepancy is due to a change in the composition of the test set. The new sample contains more protein–DNA complexes with long helices, which produce strong diffraction signals that are easy to detect (75% versus 23% of molecules with more than 10 Watson–Crick base pairs).
Unlike DNA, naturally occurring RNA molecules rarely display long, regular double-stranded helices (Saenger, 1984). More often, they form large, complex structures with short double-stranded stems connected by single-stranded loops. However, RNA and DNA crystals share common features. First, like DNA-only crystals, crystals that contain only RNA tend to have smaller unit cells than crystals with both RNA and protein (Supplementary Fig. S1a). Second, the base pairs forming RNA stems are often regularly stacked and produce characteristic diffraction signals analogous to the ones observed for double-stranded B-form DNA helices (Supplementary Fig. S1b).
Therefore, the RIBER classifier is based on the parameterization originally used in DIBER to judge the DNA content of a crystal. However, the program and SVM parameters were optimized for classification performance between RNA-only, protein–RNA and protein-only crystals. Benchmarks of the resulting RIBER classifier are presented in Table 1.
The RIBER performance has also been tested on a set of structures containing single-stranded, but not double-stranded RNA (both alone and in complex with proteins) that were originally rejected from the training set. As could be expected from the paucity of regularly stacked bases, most of these were misclassified as pure proteins (Supplementary Table S1).
In this article, we confirm earlier estimates of the very high performance of DIBER in discriminating between crystals formed by protein alone, double-stranded B-DNA alone and protein–B-DNA complexes (Chojnowski and Bochtler, 2010). We also show that DIBER performs poorly for RNA, i.e. it fails to confidently discriminate between protein alone, protein–RNA complexes and RNA alone (it performs well for protein and protein–RNA complexes at the expense of RNA-only crystals, Table 1). RIBER, in contrast, performs much better than DIBER for double-stranded RNA. Hence, RIBER complements DIBER for analyses of crystal content in crystallization trials of protein–nucleic acid complexes. The overall performance of RIBER is noticeably weaker than that of DIBER, which is not surprising: RNA and proteins are alike in terms of structural complexity, which makes their crystals difficult to distinguish based on the limited information provided by the diffraction data. Nonetheless, the discriminative power of RIBER is significant, and we believe it will be a useful tool, in particular in situations where selecting the most promising crystals for diffraction and structure solution is crucial to maximizing the success of structure determination.
Funding: Foundation for Polish Science (TEAM/2009-4/2 to J.M.B., START fellowship to G.Ch.); European Commission (Health-Prot, contract number 229676); National Science Center grants (N N302 654640 and N N301 425038 to M.B.).
Conflict of interest: none declared.