Magnetic resonance imaging (MRI) offers a non-radioactive alternative for the non-invasive detection of tumours. Low molecular weight MRI contrast agents currently in clinical use suffer either from a lack of specificity for tumour tissue or from low relaxivity and thus low contrast amplification. In this study, we present the newly designed two domain fusion protein Zarvin, which is able to bind to therapeutic IgG antibodies suitable for targeting, while facilitating contrast enhancement through high affinity binding sites for Gd3+. We show that the Zarvin fold is stable under serum conditions, specifically targets a cancer cell-line when bound to the Cetuximab IgG, and allows for imaging with high relaxivity, a property that would be advantageous for the detection of small tumours and metastases at 1.5 or 3 T.
The epidemiology of HIV-1 in China has unique features that may have led to unique viral strains. We therefore tested the hypothesis that it is possible to find distinctive patterns in HIV-1 genomes sampled in China. Using a rule inference algorithm we could indeed extract from sequences of the third variable loop (V3) of HIV-1 gp120 a set of 14 signature patterns that with 89% accuracy distinguished Chinese from non-Chinese sequences. These patterns were found to be specific to HIV-1 subtype, i.e. sequences complying with pattern 1 were of subtype B, pattern 2 almost exclusively covered sequences of subtype 01_AE, etc. We then analyzed the first of these signature patterns in depth, namely that L and W at two V3 positions occur specifically in Chinese sequences of subtype B/B' (3% false positives). This pattern was found to be in agreement with the phylogeny of HIV-1 of subtype B inside and outside of China. We could neither reject nor convincingly confirm that the pattern is stabilized by immune escape. For further interpretation of the signature pattern we used the recently developed measure of Direct Information, and in this way discovered evidence for physical interactions between V2 and V3. We conclude with a discussion of the limitations of signature patterns, and of the applicability of the approach to other genomic regions and other countries.
Background: Humans have two enzyme isoforms to produce the universal sulfate donor 3′-phosphoadenosine 5′-phosphosulfate (PAPS).
Results: The main difference between the two PAPS synthases is their stability, which is modulated by nucleotides.
Conclusion: Protein stability is a major contributing factor for PAPS availability.
Significance: Naturally occurring changes in APS concentrations may be sensed by the labile PAPS synthase 2 that might act as a novel biosensor.
Activated sulfate in the form of 3′-phosphoadenosine 5′-phosphosulfate (PAPS) is needed for all sulfation reactions in eukaryotes with implications for the build-up of extracellular matrices, retroviral infection, protein modification, and steroid metabolism. In metazoans, PAPS is produced by bifunctional PAPS synthases (PAPSS). A major question in the field is why two human protein isoforms, PAPSS1 and -S2, are required that cannot complement each other. We provide evidence that these two proteins differ markedly in their stability as observed by unfolding monitored by intrinsic tryptophan fluorescence as well as circular dichroism spectroscopy. At 37 °C, the half-life for unfolding of PAPSS2 is in the range of minutes, whereas PAPSS1 remains structurally intact. In the presence of their natural ligand, the nucleotide adenosine 5′-phosphosulfate (APS), PAPS synthase proteins are stabilized. Invertebrates only possess one PAPS synthase enzyme that we classified as PAPSS2-type by sequence-based machine learning techniques. To test this prediction, we cloned and expressed the PPS-1 protein from the roundworm Caenorhabditis elegans and also subjected this protein to thermal unfolding. With respect to thermal unfolding and the stabilization by APS, PPS-1 behaved like the unstable human PAPSS2 protein, suggesting that the less stable protein is evolutionarily older. Finally, APS binding more than doubled the half-life for unfolding of PAPSS2 at physiological temperatures and effectively prevented its aggregation on a time scale of days. We propose that protein stability is a major contributing factor for PAPS availability that has not yet been considered. Moreover, naturally occurring changes in APS concentrations may be sensed by changes in the conformation of PAPSS2.
Biosensors; Heparan Sulfate; Nucleotide; Protein Stability; Sulfotransferase; PAPS Synthase; Phase II Biotransformation; Sulfation
Maturation inhibitors such as Bevirimat are a new class of antiretroviral drugs that hamper the cleavage of HIV-1 proteins into their functionally active forms. They bind to these preproteins and inhibit their cleavage by the HIV-1 protease, resulting in non-functional virus particles. Nevertheless, there exist mutations in this region leading to resistance against Bevirimat. Highly specific and accurate tools to predict resistance to maturation inhibitors can help to identify patients who might benefit from the usage of these new drugs.
We tested several methods to improve Bevirimat resistance prediction in HIV-1. It turned out that combining structural and sequence-based information in classifier ensembles led to accurate and reliable predictions. Moreover, we were able to identify the most crucial regions for Bevirimat resistance computationally, which are in line with experimental results from other studies.
Our analysis demonstrated the use of machine learning techniques to predict HIV-1 resistance against maturation inhibitors such as Bevirimat. New maturation inhibitors are already under development and might enlarge the arsenal of antiretroviral drugs in the future. Thus, accurate prediction tools are very useful to enable a personalized therapy.
Peptidyl-prolyl cis/trans isomerases (PPIases) are enzymes assisting protein folding and protein quality control in organisms of all kingdoms of life. In contrast to the other sub-classes of PPIases, the cyclophilins and the FK-506 binding proteins, little was formerly known about the parvulin type of PPIase in Archaea. Recently, the first solution structure of an archaeal parvulin, the PinA protein from Cenarchaeum symbiosum, was reported. Investigation of occurrence and frequency of PPIase sequences in numerous archaeal genomes now revealed a strong tendency for thermophilic microorganisms to reduce the number of PPIases. Single-domain parvulins were mostly found in the genomes of recently proposed deep-branching archaeal subgroups, the Thaumarchaeota and the ARMANs (archaeal Richmond Mine acidophilic nanoorganisms). Hence, we used the parvulin sequence to reclassify available archaeal metagenomic contigs, thereby, adding new members to these subgroups. A combination of genomic background analysis and phylogenetic approaches of parvulin sequences suggested that the assigned sequences belong to at least two distinct groups of Thaumarchaeota. Finally, machine learning approaches were applied to identify amino acid residues that separate archaeal and bacterial parvulin proteins from each other. When mapped onto the recent PinA solution structure, most of these positions form a cluster at one site of the protein possibly indicating a different functionality of the two groups of parvulin proteins.
archaeal protein; Pin1; PPIase; single-domain parvulin; Thaumarchaeota
Computational design of novel proteins with well-defined functions is an ongoing topic in computational biology. In this work, we generated and optimized a new synthetic fusion protein using an evolutionary approach. The optimization was guided by directed evolution based on hydrophobicity scores, molecular weight, and secondary structure predictions. Several methods were used to refine the models built from the resulting sequences. We have successfully combined two unrelated naturally occurring binding sites, the immunoglobulin Fc-binding site of the Z domain and the DNA-binding motif of MyoD bHLH, into a novel stable protein.
Most machine learning techniques currently applied in the literature need a fixed dimensionality of input data. However, this requirement is frequently violated by real input data, such as DNA and protein sequences, that often differ in length due to insertions and deletions. It is also notable that performance in classification and regression is often improved by numerical encoding of amino acids, compared to the commonly used sparse encoding.
The software "Interpol" encodes amino acid sequences as numerical descriptor vectors using a database of currently 532 descriptors (mainly from AAindex), and normalizes sequences to uniform length with one of five linear or non-linear interpolation algorithms. Interpol is distributed as an open-source, platform-independent R package. It is typically used for preprocessing of amino acid sequences for classification or regression.
The functionality of Interpol widens the spectrum of machine learning methods that can be applied to biological sequences, and it will in many cases improve their performance in classification and regression.
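The two preprocessing steps described above, numerical encoding followed by length normalization, can be illustrated with a minimal Python sketch. This is not the Interpol API (Interpol is an R package): the function names `encode`, `linear_resample` and `normalize` are hypothetical, the Kyte-Doolittle hydropathy scale stands in for a single descriptor, and only the linear interpolation variant is shown.

```python
# Kyte-Doolittle hydropathy scale, used here as one illustrative descriptor.
# Assumption: any per-residue numerical descriptor could be substituted.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def encode(sequence):
    """Map each amino acid letter to its descriptor value."""
    return [KYTE_DOOLITTLE[aa] for aa in sequence.upper()]


def linear_resample(values, target_length):
    """Stretch or squeeze a numeric vector to target_length points
    by linear interpolation between neighbouring values."""
    if not values:
        raise ValueError("cannot resample an empty vector")
    if target_length < 1:
        raise ValueError("target_length must be at least 1")
    if len(values) == 1 or target_length == 1:
        return [values[0]] * target_length
    # Spacing of the new sample points on the original index axis.
    step = (len(values) - 1) / (target_length - 1)
    resampled = []
    for i in range(target_length):
        pos = i * step
        lo = min(int(pos), len(values) - 2)  # left neighbour index
        frac = pos - lo                      # distance from left neighbour
        resampled.append(values[lo] * (1 - frac) + values[lo + 1] * frac)
    return resampled


def normalize(sequence, target_length):
    """Encode a sequence and bring it to a fixed length."""
    return linear_resample(encode(sequence), target_length)


if __name__ == "__main__":
    # Two sequences of different length yield vectors of equal length.
    print(normalize("CTRPNNNTRKSI", 8))
    print(normalize("CTRPGNNTRKSIHIG", 8))
```

Vectors produced this way all have the same dimensionality, so they can be passed to learners that require fixed-size input.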
Machine learning techniques have been widely applied to biological sequences, e.g. to predict drug resistance in HIV-1 from sequences of drug target proteins and protein functional classes. As deletions and insertions are frequent in biological sequences, a major limitation of current methods is the inability to handle varying sequence lengths.
We propose to normalize sequences to uniform length. To this end, we tested one linear and four different non-linear interpolation methods for the normalization of sequence lengths of 19 classification datasets. Classification tasks included prediction of HIV-1 drug resistance from drug target sequences and sequence-based prediction of protein function. We applied random forests to the classification of sequences into "positive" and "negative" samples. Statistical tests showed that the linear interpolation outperforms the non-linear interpolation methods in most of the analyzed datasets, while in a few cases non-linear methods had a small but significant advantage. Compared to other published methods, our prediction scheme leads to an improvement in prediction accuracy by up to 14%.
We found that machine learning on sequences normalized by simple linear interpolation gave better or at least competitive results compared to state-of-the-art procedures, and thus, is a promising alternative to existing methods, especially for protein sequences of variable length.
Deep sequencing is able to generate a complete picture of the retroviral quasi-species in a patient. We demonstrate that the unprecedented power of deep sequencing in conjunction with computational data analysis has great potential for clinical diagnostics and basic research. Specifically, we analyzed longitudinal deep sequencing data from patients in a study with Vicriviroc, a drug that blocks the HIV-1 co-receptor CCR5. Sequences covered the V3-loop of gp120, known to be the main determinant of co-receptor tropism. First, we evaluated this data with a computational model for the interpretation of V3-sequences with respect to tropism, and we found complete agreement with results from phenotypic assays. Thus, the method could be applied in cases where phenotypic assays fail. Second, computational analysis led to the discovery of a characteristic pattern in the quasi-species that foreshadows switches of co-receptor tropism. This analysis could help to unravel the mechanism of tropism switches, and to predict these switches weeks to months before they can be detected by a phenotypic assay.
In this study we used a Random Forest-based approach for an assignment of small guanosine triphosphate proteins (GTPases) to specific subgroups. Small GTPases represent an important functional group of proteins that serve as molecular switches in a wide range of fundamental cellular processes, including intracellular transport, movement and signaling events. These proteins have further gained a special emphasis in cancer research, because within the last decades a huge variety of small GTPases from different subgroups could be related to the development of all types of tumors. Using a random forest approach, we were able to identify the most important amino acid positions for the classification process within the small GTPases superfamily and its subgroups. These positions are in line with the results of earlier studies and have been shown to be the essential elements for the different functionalities of the GTPase families. Furthermore, we provide an accurate and reliable software tool (GTPasePred) to identify potential novel GTPases and demonstrate its application to genome sequences.
cancer; machine learning; classification; Random Forests; proteins
Human Immunodeficiency Virus 1 uses a receptor (CD4) and one of two co-receptors (CCR5 or CXCR4) to enter host cells. Recently, a new class of antiretroviral drugs has entered clinical practice that specifically binds to the co-receptor CCR5, and thus inhibits virus entry. Accurate prediction of the co-receptor used by the virus in the patient is important as it allows for personalized selection of effective drugs and prognosis of disease progression. We have investigated whether it is possible to predict co-receptor usage accurately by analyzing the amino acid sequence of the main determinant of co-receptor usage, i.e., the third variable loop V3 of the gp120 protein. We developed a two-level machine learning approach that in the first level considers two different properties important for protein-protein binding derived from structural models of V3 and V3 sequences. The second level combines the two predictions of the first level. The two-level method predicts usage of CXCR4 co-receptor for new V3 sequences within seconds, with an area under the ROC curve of 0.937±0.004. Moreover, it is relatively robust against insertions and deletions, which frequently occur in V3. The approach could help clinicians to find optimal personalized treatments, and it offers new insights into the molecular basis of co-receptor usage. For instance, it quantifies the importance for co-receptor usage of a pocket that probably is responsible for binding sulfated tyrosine.
Human Immunodeficiency Virus is the pathogen causing the disease AIDS. A precondition for virus entry into human cells is the contact of its glycoprotein gp120 with two cellular proteins, a receptor and a co-receptor. Depending on the viral strain, one specific co-receptor is used. The type of co-receptor used is crucial for the aggressiveness of the viral strain and the available treatment options. Hence, it is important to identify which co-receptor is used by the virus in an individual patient. Since the genome of the virus in the patient can be readily sequenced, and thus the composition of the viral proteins be determined, it could be possible to predict co-receptor usage from the viral genome sequences. To this end, we developed a method that is motivated by the insight that physical properties of gp120 will determine its specificity for a co-receptor. The method learns a computational model from structures and sequences of a crucial part of gp120, and the corresponding experimentally measured co-receptor usage. It then employs the model to predict co-receptor usage for new sequences. The high accuracy of the method could make it helpful for diagnosis and suggests that the model captures the determinants of co-receptor usage.
Maturation inhibitors are a new class of antiretroviral drugs. Bevirimat (BVM) was the first substance in this class of inhibitors entering clinical trials. While the inhibitory function of BVM is well established, the molecular mechanisms of action and resistance are not well understood. It is known that mutations in the regions CS p24/p2 and p2 can cause phenotypic resistance to BVM. We have investigated a set of p24/p2 sequences of HIV-1 of known phenotypic resistance to BVM to test whether BVM resistance can be predicted from sequence, and to identify possible molecular mechanisms of BVM resistance in HIV-1.
We used artificial neural networks and random forests with different descriptors for the prediction of BVM resistance. Random forests with hydrophobicity as descriptor performed best and classified the sequences with an area under the Receiver Operating Characteristics (ROC) curve of 0.93 ± 0.001. For the collected data we find that p2 sequence positions 369 to 376 have the highest impact on resistance, with positions 370 and 372 being particularly important. These findings are in partial agreement with other recent studies. Apart from the complex machine learning models we derived a number of simple rules that predict BVM resistance from sequence with surprising accuracy. According to computational predictions based on the data set used, cleavage sites are usually not shifted by resistance mutations. However, we found that resistance mutations could shorten and weaken the α-helix in p2, which hints at a possible resistance mechanism.
We found that BVM resistance of HIV-1 can be predicted well from the sequence of the p2 peptide, which may prove useful for personalized therapy if maturation inhibitors reach clinical practice. Results of secondary structure analysis are compatible with a possible route to BVM resistance in which mutations weaken a six-helix bundle discovered in recent experiments, and thus ease Gag cleavage by the retroviral protease.
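The area under the ROC curve used above to report classifier performance can be computed directly from labels and prediction scores. The following minimal sketch uses the rank-based definition (the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one, with ties counted as one half). The function name `roc_auc` and the toy data are illustrative assumptions and not taken from the study.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison.

    labels: iterable of 1 (positive, e.g. resistant) or 0 (negative).
    scores: classifier outputs, higher meaning "more likely positive".
    """
    positives = [s for l, s in zip(labels, scores) if l == 1]
    negatives = [s for l, s in zip(labels, scores) if l == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # tie counts as half
    return wins / (len(positives) * len(negatives))


if __name__ == "__main__":
    # Toy example: one positive sample is ranked below one negative sample.
    labels = [1, 1, 0, 0]
    scores = [0.9, 0.4, 0.5, 0.1]
    print(roc_auc(labels, scores))  # 3 of 4 pairs ordered correctly
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect separation of the two classes.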
The default-mode network (DMN) is a functional network with increasing relevance for psychiatric research, characterized by increased activation at rest and decreased activation during task performance. The degree of DMN deactivation during a cognitively demanding task depends on its difficulty. However, the relation of hemodynamic responses in the resting phase after a preceding cognitive challenge remains relatively unexplored. We test the hypothesis that the degree of activation of the DMN following cognitive challenge is influenced by the cognitive load of a preceding working-memory task.
Twenty-five healthy subjects were investigated with functional MRI at 3 Tesla while performing a working-memory task with embedded short resting phases. Data were decomposed into statistically independent spatio-temporal components using Tensor Independent Component Analysis (TICA). The DMN was selected using a template-matching procedure. The spatial map contained rest-related activations in the medial frontal cortex, ventral anterior and posterior cingulate cortex. The time course of the DMN revealed increased activation at rest after 1-back and 2-back blocks compared to the activation after a 0-back block.
We present evidence that a cognitively challenging working-memory task is followed by greater activation of the DMN than a simple letter-matching task. This might be interpreted as a functional correlate of self-evaluation and reflection of the preceding task or as relocation of cerebral resources representing recovery from high cognitive demands. This finding is highly relevant for neuroimaging studies which include resting phases in cognitive tasks as stable baseline conditions. Further studies investigating the DMN should take possible interactions of tasks and subsequent resting phases into account.
DNA watermarks can be applied to identify the unauthorized use of genetically modified organisms. It has been shown that coding regions can be used to encrypt information into living organisms by using the DNA-Crypt algorithm. Yet, if the sequence of interest is a non-coding DNA sequence, the watermark could affect either the function of the resulting RNA molecule or that of a regulatory sequence, such as a promoter. For our studies we used the small cytoplasmic RNA 1 in yeast and the lac promoter region of Escherichia coli.
The lac promoter was deactivated by the integrated watermark. In addition, the RNA molecules displayed altered configurations after introducing a watermark, but surprisingly were functionally intact, which has been verified by analyzing the growth characteristics of both wild type and watermarked scR1 transformed yeast cells. In a third approach we introduced a second overlapping watermark into the lac promoter, which did not affect the promoter activity.
Even though the watermarked RNA and one of the watermarked promoters did not show any significant differences compared to the wild type RNA and wild type promoter region, respectively, it cannot be generalized that other RNA molecules or regulatory sequences behave accordingly. Therefore, we do not recommend integrating watermark sequences into regulatory regions.
DNA-based watermarks are helpful tools to identify the unauthorized use of genetically modified organisms (GMOs) protected by patents. In silico analyses showed that in coding regions synonymous codons can be used to insert encrypted information into the genome of living organisms by using the DNA-Crypt algorithm.
We integrated an authenticating watermark in the Vam7 sequence. For our investigations we used a mutant Saccharomyces cerevisiae strain, called CG783, which has an amber mutation within the Vam7 sequence. The CG783 cells are unable to sporulate and in addition display an abnormal vacuolar morphology. Transformation of CG783 with pRS314 Vam7 leads to a phenotype very similar to the wildtype yeast strain CG781. The integrated watermark did not influence the function of Vam7 and the resulting phenotype of the CG783 cells transformed with pRS314 Vam7-TB shows no significant differences compared to the CG783 cells transformed with pRS314 Vam7.
From our experiments we conclude that the DNA watermarks produced by DNA-Crypt do not influence the translation from mRNA into protein. By analyzing the vacuolar morphology, growth rate and ability to sporulate we confirmed that the resulting Vam7 protein was functionally active.
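The idea of hiding information in synonymous codons without changing the encoded protein can be illustrated with a deliberately simplified Python sketch. It is not the DNA-Crypt algorithm: the scheme below is an assumption made for illustration, in which two bits are written into the third base of codons from fourfold-degenerate families of the standard genetic code, and all function names are hypothetical.

```python
# Standard genetic code, bases ordered T, C, A, G at each codon position.
_BASES = "TCAG"
_AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: _AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(_BASES)
    for j, b in enumerate(_BASES)
    for k, c in enumerate(_BASES)
}

# Codon prefixes whose third base is free (any base gives the same amino acid).
FOURFOLD_PREFIXES = {"GC", "GG", "CC", "AC", "GT", "CT", "TC", "CG"}
BIT_BASES = "ACGT"  # assumed mapping: A=00, C=01, G=10, T=11


def translate(cds):
    """Translate a coding sequence into a one-letter protein string."""
    return "".join(CODON_TABLE[cds[i:i + 3]] for i in range(0, len(cds), 3))


def embed(cds, bits):
    """Write a bit string (even length) into fourfold-degenerate codons."""
    if len(bits) % 2:
        raise ValueError("bit string must have even length")
    out, pos = [], 0
    for i in range(0, len(cds), 3):
        codon = cds[i:i + 3]
        if codon[:2] in FOURFOLD_PREFIXES and pos < len(bits):
            codon = codon[:2] + BIT_BASES[int(bits[pos:pos + 2], 2)]
            pos += 2
        out.append(codon)
    if pos < len(bits):
        raise ValueError("sequence has too few usable codons")
    return "".join(out)


def extract(cds, n_bits):
    """Read n_bits back from the third base of usable codons."""
    bits = []
    for i in range(0, len(cds), 3):
        codon = cds[i:i + 3]
        if codon[:2] in FOURFOLD_PREFIXES and 2 * len(bits) < n_bits:
            bits.append(format(BIT_BASES.index(codon[2]), "02b"))
    return "".join(bits)


if __name__ == "__main__":
    original = "ATGGCTGGAACCTTT"           # encodes M A G T F
    marked = embed(original, "011011")
    print(marked, translate(marked) == translate(original))
    print(extract(marked, 6))
```

Because only synonymous third-base substitutions are made, the translated protein is identical before and after embedding; effects on codon usage or mRNA structure are not modelled in this sketch.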
The aim of this paper is to demonstrate the application of watermarks based on DNA sequences to identify the unauthorized use of genetically modified organisms (GMOs) protected by patents. Predicted mutations in the genome can be corrected by the DNA-Crypt program, leaving the encrypted information intact. Existing DNA cryptographic and steganographic algorithms use synthetic DNA sequences to store binary information; however, although these sequences can be used for authentication, they may change the target DNA sequence when introduced into living organisms.
The DNA-Crypt algorithm and image steganography are based on the same watermark-hiding principle, namely using the least significant base in the case of DNA-Crypt and the least significant bit in the case of image steganography. DNA-Crypt can be combined with binary encryption algorithms like AES, RSA or Blowfish. It is able to correct mutations in the target DNA with several mutation correction codes such as the Hamming-code or the WDH-code. Mutations, although infrequent, may destroy the encrypted information; therefore, an integrated fuzzy controller evaluates a set of heuristics based on three input dimensions and recommends whether or not to use a correction code. These three input dimensions are the length of the sequence, the individual mutation rate and the stability over time, which is represented by the number of generations. In silico experiments using Ypt7 in Saccharomyces cerevisiae show that the DNA watermarks produced by DNA-Crypt do not alter the translation of mRNA into protein.
The program is able to store watermarks in living organisms and can maintain the original information by correcting mutations itself. Pairwise or multiple sequence alignments show that DNA-Crypt produces few mismatches between the sequences, as is the case for all steganographic algorithms.
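How a Hamming code can repair a single altered symbol in stored information can be shown with a short Python sketch. The abstract does not state which Hamming parameters DNA-Crypt uses or how bits are mapped to bases, so the generic Hamming(7,4) code on bits is assumed here, and the function names are hypothetical.

```python
def hamming74_encode(data):
    """Encode 4 data bits into a 7-bit codeword.

    Layout (1-based positions): p1 p2 d1 p3 d2 d3 d4.
    """
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4   # covers positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4   # covers positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4   # covers positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]


def hamming74_decode(codeword):
    """Correct up to one flipped bit and return the 4 data bits."""
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    # The syndrome is the 1-based position of the error, or 0 if none.
    syndrome = s1 + 2 * s2 + 4 * s3
    if syndrome:
        c[syndrome - 1] ^= 1
    return [c[2], c[4], c[5], c[6]]


if __name__ == "__main__":
    message = [1, 0, 1, 1]
    codeword = hamming74_encode(message)
    # Simulate a single "mutation" at every position in turn.
    for position in range(7):
        mutated = list(codeword)
        mutated[position] ^= 1
        assert hamming74_decode(mutated) == message
    print(codeword, "survives any single-bit error")
```

The price of this protection is redundancy: seven stored bits carry four bits of information, which is one reason the choice of whether to apply a correction code depends on sequence length and expected mutation rate.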