Biomarker discovery methods are essential to identify a minimal subset of features (e.g., serum markers in predictive medicine) that are relevant to develop prediction models with high accuracy. By now, there exist diverse feature selection methods, which are either embedded in, combined with, or independent of predictive learning algorithms. Many preceding studies showed the shortcomings of results obtained from single feature selection methods, which make it difficult for professionals in a variety of fields (e.g., medical practitioners) to analyze and interpret the obtained feature subsets. Whereas each of these methods is highly biased, an ensemble feature selection has the advantage of alleviating and compensating for such biases. Concerning the reliability, validity, and reproducibility of these methods, we examined eight different feature selection methods for binary classification datasets and developed an ensemble feature selection system.
By using an ensemble of feature selection methods, a quantification of the importance of the features could be obtained. The prediction models that have been trained on the selected features showed improved prediction performance.
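One way to picture how an ensemble quantifies feature importance is sketched below. This is a minimal illustration, assuming that each method's scores are min-max normalized and then averaged; the combination rule, the method names, and the scores are hypothetical and are not taken from the study.

```python
def normalize(scores):
    """Min-max scale a {feature: score} mapping to the range [0, 1]."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        # All features scored identically: no method-specific preference.
        return {feature: 0.0 for feature in scores}
    return {feature: (s - lo) / (hi - lo) for feature, s in scores.items()}


def ensemble_importance(per_method_scores):
    """Average the normalized importances over all methods (assumed rule)."""
    normalized = {m: normalize(s) for m, s in per_method_scores.items()}
    features = sorted({f for s in per_method_scores.values() for f in s})
    n_methods = len(normalized)
    return {
        f: sum(normalized[m].get(f, 0.0) for m in normalized) / n_methods
        for f in features
    }


# Hypothetical raw scores from two selection methods on different scales.
raw = {
    "method_a": {"marker_1": 9, "marker_2": 5, "marker_3": 1},
    "method_b": {"marker_1": 10, "marker_2": 30, "marker_3": 20},
}
combined = ensemble_importance(raw)
ranking = sorted(combined, key=combined.get, reverse=True)
print(ranking)
```

Normalizing before averaging keeps a method with a large numeric range from dominating the combined ranking.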
Electronic supplementary material
The online version of this article (doi:10.1186/s13040-016-0114-4) contains supplementary material, which is available to authorized users.
Machine learning; Feature selection; Ensemble learning; Biomarker discovery; Random forest
Protists are perhaps the most lineage-rich of microbial lifeforms, but remain largely unknown. High-throughput sequencing technologies provide opportunities to screen whole habitats in depth and enable detailed comparisons of different habitats to measure, compare and map protistan diversity. Such comparisons are often limited by low sample numbers within single studies and a lack of standardisation between studies. Here, we analysed 232 samples from 10 sampling campaigns using a standardised PCR protocol and bioinformatics pipeline. We show that protistan community patterns are highly consistent within habitat types and geographic regions, provided that sample processing is standardised. Community profiles are only weakly affected by fluctuations of the abundances of the most abundant taxa and, therefore, provide a sound basis for habitat comparison beyond random short-term fluctuations in the community composition. Further, we provide evidence that distribution patterns are not solely resulting from random processes. Distinct habitat types and distinct taxonomic groups are dominated by taxa with distinct distribution patterns that reflect their ecology with respect to dispersal and habitat colonisation. However, there is no systematic shift of the distribution pattern with taxon abundance.
Drug resistance testing is mandatory in antiretroviral therapy of human immunodeficiency virus (HIV) infected patients for successful treatment. The emergence of resistance against antiretroviral agents remains the major obstacle to inhibiting viral replication and thus to controlling infection. Due to its high mutation rate, the virus is able to adapt rapidly under drug pressure, leading to the evolution of resistant variants and finally to therapy failure.
We developed SHIVA, a web service for drug resistance prediction of commonly used drugs in antiretroviral therapy, i.e., protease inhibitors (PIs), reverse transcriptase inhibitors (NRTIs and NNRTIs), and integrase inhibitors (INIs), but also for the novel drug class of maturation inhibitors. Furthermore, co-receptor tropism (CCR5 or CXCR4) can be predicted as well, which is essential for treatment with entry inhibitors, such as Maraviroc. Currently, SHIVA provides 24 prediction models for several drug classes. SHIVA can be used with single RNA/DNA or amino acid sequences, but also with large amounts of next-generation sequencing data, and allows prediction for a user-specified selection of drugs simultaneously. Prediction results are provided as clinical reports which are sent via email to the user.
SHIVA represents a novel high-performing alternative to previously developed drug resistance testing approaches and is able to process data derived from next-generation sequencing technologies. SHIVA is publicly available via a user-friendly web interface.
Infectious diseases; Machine learning; Retrovirus; HIV therapy
Antiretroviral treatment of Human Immunodeficiency Virus type-1 (HIV-1) infections with CCR5-antagonists requires the co-receptor usage prediction of viral strains. Currently available tools are mostly designed based on subtype B strains and thus are in general not applicable to non-B subtypes. However, HIV-1 infections caused by subtype B only account for approximately 11% of infections worldwide. We evaluated the performance of several sequence-based algorithms for co-receptor usage prediction employed on subtype A V3 sequences including circulating recombinant forms (CRFs) and subtype C strains. We further analysed sequence profiles of gp120 regions of subtype A, B and C to explore functional relationships to entry phenotypes. Our analyses clearly demonstrate that state-of-the-art algorithms are not useful for predicting co-receptor tropism of subtype A and its CRFs. Sequence profile analysis of gp120 revealed molecular variability in subtype A viruses. Especially, the V2 loop region could be associated with co-receptor tropism, which might indicate a unique pattern that determines co-receptor tropism in subtype A strains compared to subtype B and C strains. Thus, our study demonstrates that there is a need for the development of novel algorithms facilitating tropism prediction of HIV-1 subtype A to improve effective antiretroviral treatment in patients.
Antiretroviral therapy is essential for human immunodeficiency virus (HIV) infected patients to inhibit viral replication and thereby to slow progression of disease and prolong a patient’s life. However, the high mutation rate of HIV can lead to a fast adaptation of the virus under drug pressure and thereby to the evolution of resistant variants. In turn, these variants will lead to the failure of antiretroviral treatment. Moreover, these mutations can lead not only to resistance against single drugs, but also to cross-resistance, i.e., resistance against drugs that have not yet been applied.
662 protease sequences and 715 reverse transcriptase sequences with complete resistance profiles were analyzed using machine learning techniques, namely binary relevance classifiers, classifier chains, and ensembles of classifier chains.
In our study, we applied multi-label classification models incorporating cross-resistance information to predict drug resistance for two of the major drug classes used in antiretroviral therapy for HIV-1, namely protease inhibitors (PIs) and non-nucleoside reverse transcriptase inhibitors (NNRTIs). By means of multi-label learning, namely classifier chains (CCs) and ensembles of classifier chains (ECCs), we were able to improve overall prediction accuracy for all drugs compared to hitherto applied binary classification models.
The development of fast and precise models to predict drug resistance in HIV-1 is highly important to enable a highly effective personalized therapy. Cross-resistance information can be exploited to improve prediction accuracy of computational drug resistance models.
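The idea behind classifier chains can be sketched in a few lines: the model for each drug sees the sequence features plus the resistance labels of the drugs earlier in the chain, which is how cross-resistance information enters the prediction. The sketch below assumes a nearest-centroid base learner as a simple stand-in; the base classifiers, chain order, and data of the study are not reproduced, and the toy numbers are hypothetical.

```python
def fit_centroid(X, y):
    """Stand-in base learner: one mean vector per class label."""
    centroids = {}
    for label in set(y):
        rows = [x for x, target in zip(X, y) if target == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
    return centroids


def predict_centroid(centroids, x):
    """Return the label whose centroid is closest to x (squared distance)."""
    def dist(label):
        return sum((a - b) ** 2 for a, b in zip(x, centroids[label]))
    return min(sorted(centroids), key=dist)


def fit_chain(X, Y):
    """Train one model per label; model j also sees the true labels 0..j-1."""
    models = []
    for j in range(len(Y[0])):
        X_aug = [list(x) + list(y[:j]) for x, y in zip(X, Y)]
        models.append(fit_centroid(X_aug, [y[j] for y in Y]))
    return models


def predict_chain(models, x):
    """Predict labels in chain order, feeding earlier predictions forward."""
    labels = []
    for model in models:
        labels.append(predict_centroid(model, list(x) + labels))
    return labels


# Toy data: one numeric feature, two drugs with correlated resistance.
X = [[0.0], [0.1], [0.9], [1.0]]
Y = [[0, 0], [0, 0], [1, 1], [1, 1]]
chain = fit_chain(X, Y)
print(predict_chain(chain, [0.8]), predict_chain(chain, [0.2]))
```

An ensemble of classifier chains would train several such chains with different label orders and combine their votes; a binary relevance model corresponds to dropping the label augmentation entirely.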
Electronic supplementary material
The online version of this article (doi:10.1186/s13040-016-0089-1) contains supplementary material, which is available to authorized users.
Infectious diseases; Machine learning; Retrovirus; HIV therapy
The enzyme subclass of glycosyltransferases (GTs; EC 2.4) currently comprises 97 families as specified by CAZy classification. One of their important roles is in the biosynthesis of disaccharides, oligosaccharides, and polysaccharides by catalyzing the transfer of sugar moieties from activated donor molecules to other sugar molecules. In addition, GTs also catalyze the transfer of sugar moieties onto aglycons, which is of great relevance for the synthesis of many high-value natural products. Bacterial GTs show a higher sequence similarity in comparison to mammalian ones. Even though most GTs are poorly explored, state-of-the-art technologies, such as protein engineering, domain swapping or computational analysis, strongly enhance our understanding and utilization of these very promising classes of proteins. This perspective article will focus on bacterial GTs, especially on classification, screening and engineering strategies to alter substrate specificity. The future development in these fields as well as obstacles and challenges will be highlighted and discussed.
screening; bacterial glycosyltransferases; categorization of glycosyltransferases; substrate specificity; docking experiments; polysaccharide glycosyltransferases
Detection of high-risk subjects in acute myocardial infarction (AMI) by noninvasive means would reduce the need for intracardiac catheterization and associated complications. Liver enzymes are associated with cardiovascular disease risk. A potential predictive value for liver serum markers for the severity of stenosis in AMI was analyzed.
Patients with AMI undergoing percutaneous coronary intervention (PCI; n = 437) were retrospectively evaluated. Minimal lumen diameter (MLD) and percent stenosis diameter (SD) were determined from quantitative coronary angiography. Patients were classified according to the severity of stenosis (SD ≥ 50%, n = 357; SD < 50%, n = 80). Routine heart and liver parameters were associated with SD using random forests (RF). A prediction model (M10) was developed based on parameter importance analysis in RF.
Age, alkaline phosphatase (AP), aspartate aminotransferase (AST), and MLD differed significantly between SD ≥ 50 and SD < 50. Age, AST, alanine aminotransferase (ALT), and troponin correlated significantly with SD, whereas MLD correlated inversely with SD. M10 (age, BMI, AP, AST, ALT, gamma-glutamyltransferase, creatinine, troponin) reached an AUC of 69.7% (CI 63.8–75.5%, P < 0.0001).
Routine liver parameters are associated with SD in AMI. A small set of noninvasively determined parameters can identify SD in AMI, and might avoid unnecessary coronary angiography in patients with low risk. The model can be accessed via http://stenosis.heiderlab.de.
High throughput sequencing (HTSeq) of small ribosomal subunit amplicons has the potential for a comprehensive characterization of microbial community compositions, down to rare species. However, the error-prone nature of the multi-step experimental process requires that the resulting raw sequences are subjected to quality control procedures. These procedures often involve an abundance cutoff for rare sequences or clustering of sequences, both of which limit genetic resolution. Here we propose a simple experimental protocol that retains the high genetic resolution granted by HTSeq methods while effectively removing many low abundance sequences that are likely due to PCR and sequencing errors. According to this protocol, we split samples and submit both halves to independent PCR and sequencing runs. The resulting sequence data is graphically and quantitatively characterized by the discordance between the two experimental branches, allowing for a quick identification of problematic samples. Further, we discard sequences that are not found in both branches (“AmpliconDuo filter”). We show that the majority of sequences removed in this way, mostly low abundance but also some higher abundance sequences, show features expected from random modifications of true sequences as introduced by PCR and sequencing errors. On the other hand, the filter retains many low abundance sequences observed in both branches and thus provides a more reliable census of the rare biosphere. We find that the AmpliconDuo filter increases biological resolution as it increases apparent community similarity between biologically similar communities, while it does not affect apparent community similarities between biologically dissimilar communities. The filter does not distort overall apparent community compositions. Finally, we quantitatively explain the effect of the AmpliconDuo filter by a simple mathematical model.
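The filtering step described above can be sketched as follows. This is a minimal illustration, assuming each branch is given as a list of reads; the function names and the toy reads are hypothetical, and the fraction of discarded reads is shown only as a simple summary, not as the discordance measure of the study.

```python
from collections import Counter


def amplicon_duo_filter(branch_a, branch_b):
    """Keep sequences seen in both branches; map each to (count_a, count_b)."""
    counts_a, counts_b = Counter(branch_a), Counter(branch_b)
    shared = counts_a.keys() & counts_b.keys()
    return {seq: (counts_a[seq], counts_b[seq]) for seq in sorted(shared)}


def discarded_read_fraction(branch_a, branch_b):
    """Fraction of all reads whose sequence occurs in only one branch."""
    kept = amplicon_duo_filter(branch_a, branch_b)
    kept_reads = sum(a + b for a, b in kept.values())
    total_reads = len(branch_a) + len(branch_b)
    return (total_reads - kept_reads) / total_reads


# Toy reads from two independent PCR and sequencing runs of a split sample.
branch_a = ["ACGT"] * 5 + ["ACGA"]
branch_b = ["ACGT"] * 4 + ["TCGT"]
print(amplicon_duo_filter(branch_a, branch_b))
print(discarded_read_fraction(branch_a, branch_b))
```

Note that the filter does not use an abundance cutoff: a sequence with a single read in each branch is retained, while a singleton found in only one branch is dropped.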
Key features of the metabolic syndrome are insulin resistance and diabetes. The liver, as the central metabolic organ, is not only affected by the metabolic syndrome in the form of non-alcoholic fatty liver disease (NAFLD), but may also contribute to insulin resistance and metabolic alterations. We aimed to identify potential associations between liver injury markers and diabetes in the population-based Heinz Nixdorf RECALL Study. Demographic and laboratory data were analyzed in participants (n = 4814, age 45 to 75 years). ALT and AST values were significantly higher in males than in females. Mean BMI was 27.9 kg/m² and type-2-diabetes (known and unknown) was present in 656 participants (13.7%). Adiponectin and vitamin D both correlated inversely with BMI. ALT, AST, and GGT correlated with BMI, CRP and HbA1c and inversely correlated with adiponectin levels. Logistic regression models using HbA1c and adiponectin or HbA1c and BMI were able to predict diabetes with high accuracy. Transaminase levels within normal ranges were closely associated with the BMI and diabetes risk. Transaminase levels and adiponectin were inversely associated. Re-assessment of current normal range limits should be considered, to provide a more exact indicator for chronic metabolic liver injury, in particular to reflect the situation in diabetic or obese individuals.
We analysed the impact of different parameters on genotypic tropism testing related to clinical outcome prediction in 108 patients on maraviroc (MVC) treatment.
87 RNA and 60 DNA samples were used. The viral tropism was predicted using the geno2pheno[coreceptor] and T-CUP tools with FPR cut-offs ranging from 1%-20%. Additionally, 27 RNA and 28 DNA samples were analysed in triplicate, 43 samples with the ESTA assay and 45 with next-generation sequencing. The influence of the genotypic susceptibility score (GSS) and 16 MVC-resistance mutations on clinical outcome was also studied.
Concordance of single-amplification testing with ESTA and with NGS was on the order of 80%. Concordance with NGS was higher at lower FPR cut-offs. Detection of baseline R5 viruses in RNA and DNA samples by all methods significantly correlated with treatment success, even with FPR cut-offs of 3.75%-7.5%. Triple amplification did not improve the prediction value but reduced the number of patients eligible for MVC. Adherence to treatment influenced the clinical outcome, whereas no influence of the GSS or of MVC-resistance mutations was detected.
Proviral DNA is valid to select candidates for MVC treatment. FPR cut-offs of 5%-7.5% and single amplification from RNA or DNA would assure a safe administration of MVC without excluding many patients who could benefit from this drug. In addition, the new prediction system T-CUP produced reliable results.
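The trade-off between FPR cut-off and patient eligibility can be illustrated with a small sketch. It assumes the usual convention that a sample whose false-positive rate lies below the cut-off is called X4 (and is therefore not eligible for MVC), that exactly equal values count as R5, and that the FPR values are hypothetical; none of these numbers come from the study.

```python
def call_tropism(fpr_percent, cutoff_percent):
    """Call 'X4' when the FPR is below the cut-off, otherwise 'R5'.

    Treating values equal to the cut-off as R5 is an assumption of this sketch.
    """
    return "X4" if fpr_percent < cutoff_percent else "R5"


def eligible_count(fpr_values, cutoff_percent):
    """Number of samples called R5, i.e. candidates for a CCR5 antagonist."""
    return sum(call_tropism(f, cutoff_percent) == "R5" for f in fpr_values)


# Hypothetical FPR values (in percent) for three baseline samples.
fpr_values = [2.0, 6.0, 15.0]
for cutoff in (5.0, 7.5, 20.0):
    print(cutoff, eligible_count(fpr_values, cutoff))
```

Raising the cut-off makes the X4 call more sensitive but excludes more patients from treatment, which is the balance the recommended 5%-7.5% range addresses.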
Human Immunodeficiency Virus 1 enters host cells through interaction of its V3 loop (which is part of the gp120 protein) with the host cell receptor CD4 and one of two co-receptors, namely CCR5 or CXCR4. Entry inhibitors binding the CCR5 co-receptor can prevent viral entry. As these drugs are only available for CCR5-using viruses, accurate prediction of this so-called co-receptor tropism is important in order to ensure an effective personalized therapy. With the development of next-generation sequencing technologies, it is now possible to sequence representative subpopulations of the viral quasispecies.
Here we present T-CUP 2.0, a model for predicting co-receptor tropism. Based on our recently published T-CUP model, we developed a more accurate and even faster solution. Similarly to its predecessor, T-CUP 2.0 models co-receptor tropism using information of the electrostatic potential and hydrophobicity of V3-loops. However, extracting this information from a simplified structural vacuum-model leads to more accurate and faster predictions. The area-under-the-ROC-curve (AUC) achieved with T-CUP 2.0 on the training set is 0.968±0.005 in a leave-one-patient-out cross-validation. When applied to an independent dataset, T-CUP 2.0 has an improved prediction accuracy of around 3% when compared to the original T-CUP.
We found that it is possible to model co-receptor tropism in HIV-1 based on a simplified structure-based model of the V3 loop. In this way, genotypic prediction of co-receptor tropism is very accurate, fast and can be applied to large datasets derived from next-generation sequencing technologies. The reduced complexity of the electrostatic modeling makes T-CUP 2.0 independent from third-party software, making it easy to install and use.
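Leave-one-patient-out cross-validation, as used for the AUC estimate above, holds out all sequences of one patient at a time so that sequences from the same patient never appear in both training and test data. The following is a minimal sketch of the splitting step only; the function name and patient identifiers are hypothetical, and model training is left out.

```python
def leave_one_patient_out(patient_ids):
    """Yield (held_out_patient, train_indices, test_indices) per patient."""
    for patient in sorted(set(patient_ids)):
        test = [i for i, p in enumerate(patient_ids) if p == patient]
        train = [i for i, p in enumerate(patient_ids) if p != patient]
        yield patient, train, test


# Hypothetical patient label for each of six sequences.
patient_ids = ["p1", "p1", "p2", "p3", "p3", "p3"]
folds = list(leave_one_patient_out(patient_ids))
for patient, train, test in folds:
    print(patient, train, test)
```

Splitting by patient rather than by sequence avoids overly optimistic estimates, since sequences from one viral population tend to be highly similar.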
Background & Objective
Currently, a major clinical challenge is to distinguish chronic liver disease caused by metabolic syndrome (non-alcoholic fatty liver disease, NAFLD) from that caused by long-term or excessive alcohol consumption (ALD). The etiology of severe liver disease affects treatment options and priorities for liver transplantation and organ allocation. Thus we compared physiologically similar NAFLD and ALD patients to detect biochemical differences for improved separation of these mechanistically overlapping etiologies.
In a cohort of 31 NAFLD patients with BMI below 30 and a cohort of ALD patients with (ALDC n = 51) or without cirrhosis (ALDNC n = 51), serum transaminases, cell death markers and (adipo-)cytokines were assessed. Groups were compared with one-way ANOVA and Tukey's correction. Predictive models were built by machine learning techniques.
NAFLD, ALDNC or ALDC patients did not differ in demographic parameters. The ratio of alanine aminotransferase/aspartate aminotransferase - common serum parameters for liver damage - was significantly higher in the NAFLD group compared to both ALD groups (each p<0.0001). Adiponectin and tumor necrosis factor(TNF)-alpha were significantly lower in NAFLD than in ALDNC (p<0.05) or ALDC patients (p<0.0001). Significantly higher serum concentrations of cell death markers, hyaluronic acid, adiponectin, and TNF-alpha (each p<0.0001) were found in ALDC compared to ALDNC. Using machine learning techniques we were able to discern NAFLD and ALDNC (up to an AUC of 0.9118±0.0056) or ALDC and ALDNC (up to an AUC of 0.9846±0.0018), respectively.
Machine learning techniques relying on ALT/AST ratio, adipokines and cytokines distinguish NAFLD and ALD. In addition, severity of ALD may be non-invasively diagnosed via serum cytokine concentrations.
Magnetic resonance imaging (MRI) offers a non-radioactive alternative for the non-invasive detection of tumours. Low molecular weight MRI contrast agents currently in clinical use suffer either from a lack of specificity for tumour tissue or from low relaxivity and thus low contrast amplification. In this study, we present the newly designed two domain fusion protein Zarvin, which is able to bind to therapeutic IgG antibodies suitable for targeting, while facilitating contrast enhancement through high affinity binding sites for Gd3+. We show that the Zarvin fold is stable under serum conditions, specifically targets a cancer cell-line when bound to the Cetuximab IgG, and allows for imaging with high relaxivity, a property that would be advantageous for the detection of small tumours and metastases at 1.5 or 3 T.
The epidemiology of HIV-1 in China has unique features that may have led to unique viral strains. We therefore tested the hypothesis that it is possible to find distinctive patterns in HIV-1 genomes sampled in China. Using a rule inference algorithm we could indeed extract from sequences of the third variable loop (V3) of HIV-1 gp120 a set of 14 signature patterns that with 89% accuracy distinguished Chinese from non-Chinese sequences. These patterns were found to be specific to HIV-1 subtype, i.e. sequences complying with pattern 1 were of subtype B, pattern 2 almost exclusively covered sequences of subtype 01_AE, etc. We then analyzed the first of these signature patterns in depth, namely that L and W at two V3 positions are specifically occurring in Chinese sequences of subtype B/B' (3% false positives). This pattern was found to be in agreement with the phylogeny of HIV-1 of subtype B inside and outside of China. We could neither reject nor convincingly confirm that the pattern is stabilized by immune escape. For further interpretation of the signature pattern we used the recently developed measure of Direct Information, and in this way discovered evidence for physical interactions between V2 and V3. We conclude by a discussion of limitations of signature patterns, and the applicability of the approach to other genomic regions and other countries.
Background: Humans have two enzyme isoforms to produce the universal sulfate donor 3′-phosphoadenosine 5′-phosphosulfate (PAPS).
Results: The main difference between the two PAPS synthases is their stability, which is modulated by nucleotides.
Conclusion: Protein stability is a major contributing factor for PAPS availability.
Significance: Naturally occurring changes in APS concentrations may be sensed by the labile PAPS synthase 2 that might act as a novel biosensor.
Activated sulfate in the form of 3′-phosphoadenosine 5′-phosphosulfate (PAPS) is needed for all sulfation reactions in eukaryotes with implications for the build-up of extracellular matrices, retroviral infection, protein modification, and steroid metabolism. In metazoans, PAPS is produced by bifunctional PAPS synthases (PAPSS). A major question in the field is why two human protein isoforms, PAPSS1 and -S2, are required that cannot complement for each other. We provide evidence that these two proteins differ markedly in their stability as observed by unfolding monitored by intrinsic tryptophan fluorescence as well as circular dichroism spectroscopy. At 37 °C, the half-life for unfolding of PAPSS2 is in the range of minutes, whereas PAPSS1 remains structurally intact. In the presence of their natural ligand, the nucleotide adenosine 5′-phosphosulfate (APS), PAPS synthase proteins are stabilized. Invertebrates only possess one PAPS synthase enzyme that we classified as PAPSS2-type by sequence-based machine learning techniques. To test this prediction, we cloned and expressed the PPS-1 protein from the roundworm Caenorhabditis elegans and also subjected this protein to thermal unfolding. With respect to thermal unfolding and the stabilization by APS, PPS-1 behaved like the unstable human PAPSS2 protein suggesting that the less stable protein is evolutionarily older. Finally, APS binding more than doubled the half-life for unfolding of PAPSS2 at physiological temperatures and effectively prevented its aggregation on a time scale of days. We propose that protein stability is a major contributing factor for PAPS availability that has not as yet been considered. Moreover, naturally occurring changes in APS concentrations may be sensed by changes in the conformation of PAPSS2.
Biosensors; Heparan Sulfate; Nucleotide; Protein Stability; Sulfotransferase; PAPS Synthase; Phase II Biotransformation; Sulfation
Maturation inhibitors such as Bevirimat are a new class of antiretroviral drugs that hamper the cleavage of HIV-1 proteins into their functionally active forms. They bind to these preproteins and inhibit their cleavage by the HIV-1 protease, resulting in non-functional virus particles. Nevertheless, there exist mutations in this region leading to resistance against Bevirimat. Highly specific and accurate tools to predict resistance to maturation inhibitors can help to identify patients who might benefit from the usage of these new drugs.
We tested several methods to improve Bevirimat resistance prediction in HIV-1. It turned out that combining structural and sequence-based information in classifier ensembles led to accurate and reliable predictions. Moreover, we were able to identify the most crucial regions for Bevirimat resistance computationally, which are in line with experimental results from other studies.
Our analysis demonstrated the use of machine learning techniques to predict HIV-1 resistance against maturation inhibitors such as Bevirimat. New maturation inhibitors are already under development and might enlarge the arsenal of antiretroviral drugs in the future. Thus, accurate prediction tools are very useful to enable a personalized therapy.
Peptidyl-prolyl cis/trans isomerases (PPIases) are enzymes assisting protein folding and protein quality control in organisms of all kingdoms of life. In contrast to the other sub-classes of PPIases, the cyclophilins and the FK-506 binding proteins, little was formerly known about the parvulin type of PPIase in Archaea. Recently, the first solution structure of an archaeal parvulin, the PinA protein from Cenarchaeum symbiosum, was reported. Investigation of occurrence and frequency of PPIase sequences in numerous archaeal genomes now revealed a strong tendency for thermophilic microorganisms to reduce the number of PPIases. Single-domain parvulins were mostly found in the genomes of recently proposed deep-branching archaeal subgroups, the Thaumarchaeota and the ARMANs (archaeal Richmond Mine acidophilic nanoorganisms). Hence, we used the parvulin sequence to reclassify available archaeal metagenomic contigs, thereby, adding new members to these subgroups. A combination of genomic background analysis and phylogenetic approaches of parvulin sequences suggested that the assigned sequences belong to at least two distinct groups of Thaumarchaeota. Finally, machine learning approaches were applied to identify amino acid residues that separate archaeal and bacterial parvulin proteins from each other. When mapped onto the recent PinA solution structure, most of these positions form a cluster at one site of the protein possibly indicating a different functionality of the two groups of parvulin proteins.
archaeal protein; Pin1; PPIase; single-domain parvulin; Thaumarchaeota
Computational design of novel proteins with well-defined functions is an ongoing topic in computational biology. In this work, we generated and optimized a new synthetic fusion protein using an evolutionary approach. The optimization was guided by directed evolution based on hydrophobicity scores, molecular weight, and secondary structure predictions. Several methods were used to refine the models built from the resulting sequences. We have successfully combined two unrelated naturally occurring binding sites, the immunoglobulin Fc-binding site of the Z domain and the DNA-binding motif of MyoD bHLH, into a novel stable protein.
Most machine learning techniques currently applied in the literature need a fixed dimensionality of input data. However, this requirement is frequently violated by real input data, such as DNA and protein sequences, that often differ in length due to insertions and deletions. It is also notable that performance in classification and regression is often improved by numerical encoding of amino acids, compared to the commonly used sparse encoding.
The software "Interpol" encodes amino acid sequences as numerical descriptor vectors using a database of currently 532 descriptors (mainly from AAindex), and normalizes sequences to uniform length with one of five linear or non-linear interpolation algorithms. Interpol is distributed as an open-source, platform-independent R package. It is typically used for preprocessing of amino acid sequences for classification or regression.
The functionality of Interpol widens the spectrum of machine learning methods that can be applied to biological sequences, and it will in many cases improve their performance in classification and regression.
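The two preprocessing steps described above, numerical encoding and normalization to a uniform length, can be sketched as follows. This is an illustration in Python rather than a reproduction of the R package: the function names are hypothetical, the Kyte-Doolittle hydropathy scale is used only as one example descriptor, and only linear interpolation is shown.

```python
# Kyte-Doolittle hydropathy values, used here as one example descriptor.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def encode(sequence, descriptor=KYTE_DOOLITTLE):
    """Map each amino acid of the sequence to its descriptor value."""
    return [descriptor[aa] for aa in sequence.upper()]


def interpolate_linear(values, target_length):
    """Resample a numeric vector to target_length by linear interpolation."""
    n = len(values)
    if n == 0 or target_length < 2:
        raise ValueError("need a non-empty vector and target_length >= 2")
    if n == 1:
        return [values[0]] * target_length
    result = []
    for k in range(target_length):
        # Position of output point k on the index axis of the input vector.
        pos = k * (n - 1) / (target_length - 1)
        left = int(pos)
        right = min(left + 1, n - 1)
        frac = pos - left
        result.append(values[left] * (1 - frac) + values[right] * frac)
    return result


# Two peptides of different length mapped to vectors of the same length.
short_vec = interpolate_linear(encode("AIL"), 5)
long_vec = interpolate_linear(encode("AILKMFPST"), 5)
print(short_vec, long_vec)
```

After this step every sequence is represented by a vector of identical dimensionality, so methods that require fixed-size input can be applied directly.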
Machine learning techniques have been widely applied to biological sequences, e.g. to predict drug resistance in HIV-1 from sequences of drug target proteins and protein functional classes. As deletions and insertions are frequent in biological sequences, a major limitation of current methods is the inability to handle varying sequence lengths.
We propose to normalize sequences to uniform length. To this end, we tested one linear and four different non-linear interpolation methods for the normalization of sequence lengths of 19 classification datasets. Classification tasks included prediction of HIV-1 drug resistance from drug target sequences and sequence-based prediction of protein function. We applied random forests to the classification of sequences into "positive" and "negative" samples. Statistical tests showed that the linear interpolation outperforms the non-linear interpolation methods in most of the analyzed datasets, while in a few cases non-linear methods had a small but significant advantage. Compared to other published methods, our prediction scheme leads to an improvement in prediction accuracy by up to 14%.
We found that machine learning on sequences normalized by simple linear interpolation gave better or at least competitive results compared to state-of-the-art procedures, and thus, is a promising alternative to existing methods, especially for protein sequences of variable length.
Deep sequencing is able to generate a complete picture of the retroviral quasi-species in a patient. We demonstrate that the unprecedented power of deep sequencing in conjunction with computational data analysis has great potential for clinical diagnostics and basic research. Specifically, we analyzed longitudinal deep sequencing data from patients in a study with Vicriviroc, a drug that blocks the HIV-1 co-receptor CCR5. Sequences covered the V3-loop of gp120, known to be the main determinant of co-receptor tropism. First, we evaluated this data with a computational model for the interpretation of V3-sequences with respect to tropism, and we found complete agreement with results from phenotypic assays. Thus, the method could be applied in cases where phenotypic assays fail. Second, computational analysis led to the discovery of a characteristic pattern in the quasi-species that foreshadows switches of co-receptor tropism. This analysis could help to unravel the mechanism of tropism switches, and to predict these switches weeks to months before they can be detected by a phenotypic assay.
In this study we used a Random Forest-based approach for an assignment of small guanosine triphosphate proteins (GTPases) to specific subgroups. Small GTPases represent an important functional group of proteins that serve as molecular switches in a wide range of fundamental cellular processes, including intracellular transport, movement and signaling events. These proteins have further gained a special emphasis in cancer research, because within the last decades a huge variety of small GTPases from different subgroups could be related to the development of all types of tumors. Using a random forest approach, we were able to identify the most important amino acid positions for the classification process within the small GTPases superfamily and its subgroups. These positions are in line with the results of earlier studies and have been shown to be the essential elements for the different functionalities of the GTPase families. Furthermore, we provide an accurate and reliable software tool (GTPasePred) to identify potential novel GTPases and demonstrate its application to genome sequences.
cancer; machine learning; classification; Random Forests; proteins
For entry into host cells, Human Immunodeficiency Virus 1 uses a receptor (CD4) and one of two co-receptors (CCR5 or CXCR4). Recently, a new class of antiretroviral drugs has entered clinical practice that specifically binds to the co-receptor CCR5 and thus inhibits virus entry. Accurate prediction of the co-receptor used by the virus in the patient is important as it allows for personalized selection of effective drugs and prognosis of disease progression. We have investigated whether it is possible to predict co-receptor usage accurately by analyzing the amino acid sequence of the main determinant of co-receptor usage, i.e., the third variable loop V3 of the gp120 protein. We developed a two-level machine learning approach that in the first level considers two different properties important for protein-protein binding derived from structural models of V3 and V3 sequences. The second level combines the two predictions of the first level. The two-level method predicts usage of CXCR4 co-receptor for new V3 sequences within seconds, with an area under the ROC curve of 0.937±0.004. Moreover, it is relatively robust against insertions and deletions, which frequently occur in V3. The approach could help clinicians to find optimal personalized treatments, and it offers new insights into the molecular basis of co-receptor usage. For instance, it quantifies the importance for co-receptor usage of a pocket that probably is responsible for binding sulfated tyrosine.
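The two-level structure and its evaluation by the area under the ROC curve can be sketched as follows. The sketch assumes a plain average as a stand-in for the second-level learner, and the labels and first-level scores are hypothetical toy values chosen for illustration; the actual second-level model of the study is not reproduced here.

```python
def roc_auc(labels, scores):
    """AUC as the probability that a positive outscores a negative.

    Ties between a positive and a negative count as one half.
    """
    pos = [s for label, s in zip(labels, scores) if label == 1]
    neg = [s for label, s in zip(labels, scores) if label == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def combine_scores(scores_a, scores_b):
    """Stand-in second level: average the two first-level scores."""
    return [(a + b) / 2 for a, b in zip(scores_a, scores_b)]


# Toy labels (1 = CXCR4-using) and scores of two first-level models.
labels = [1, 1, 0, 0]
scores_a = [0.9, 0.4, 0.6, 0.1]
scores_b = [0.5, 0.7, 0.2, 0.6]
combined = combine_scores(scores_a, scores_b)
print(roc_auc(labels, scores_a), roc_auc(labels, scores_b),
      roc_auc(labels, combined))
```

In this toy case each first-level model misranks a different pair of sequences, so the combined score separates the classes better than either model alone; real data need not behave this way.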
Human Immunodeficiency Virus is the pathogen that causes AIDS. A precondition for virus entry into human cells is the contact of its glycoprotein gp120 with two cellular proteins, a receptor and a co-receptor. Depending on the viral strain, one specific co-receptor is used. The type of co-receptor used is crucial for the aggressiveness of the viral strain and the available treatment options. Hence, it is important to identify which co-receptor is used by the virus in an individual patient. Since the genome of the virus in the patient can be readily sequenced, and the composition of the viral proteins thus determined, it could be possible to predict co-receptor usage from the viral genome sequences. To this end, we developed a method that is motivated by the insight that physical properties of gp120 determine its specificity for a co-receptor. The method learns a computational model from structures and sequences of a crucial part of gp120, and the corresponding experimentally measured co-receptor usage. It then employs the model to predict co-receptor usage for new sequences. The high accuracy of the method could make it helpful for diagnosis and suggests that the model captures the determinants of co-receptor usage.
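The two-level scheme and its evaluation by the area under the ROC curve can be sketched in a few lines. The example below is a minimal illustration under stated assumptions, not the published method: the second level is taken to be a plain weighted mean of two first-level scores (the text above does not specify how the predictions are combined), and all scores, labels and function names are made up for illustration. The AUC is computed as the probability that a positive example receives a higher score than a negative one, with ties counted as one half.

```python
def roc_auc(scores, labels):
    """AUC as the fraction of positive/negative pairs ranked correctly.

    Labels are 1 (positive) or 0 (negative); tied scores count as 0.5.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


def combine(first_level_a, first_level_b, weight=0.5):
    """Hypothetical second level: weighted mean of two first-level scores."""
    return [
        weight * a + (1.0 - weight) * b
        for a, b in zip(first_level_a, first_level_b)
    ]


# Made-up scores from two hypothetical first-level models; 1 = CXCR4-using.
labels = [1, 1, 1, 0, 0, 0]
model_a = [0.9, 0.4, 0.8, 0.5, 0.2, 0.1]
model_b = [0.6, 0.9, 0.3, 0.2, 0.4, 0.1]
combined = combine(model_a, model_b)

auc_a = roc_auc(model_a, labels)
auc_b = roc_auc(model_b, labels)
auc_combined = roc_auc(combined, labels)
```

On this toy data each first-level model misranks one of the nine positive/negative pairs, while the combined score ranks all of them correctly. Real data would rarely behave this cleanly, and a learned second-level model would normally replace the fixed weight.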
Maturation inhibitors are a new class of antiretroviral drugs. Bevirimat (BVM) was the first substance in this class of inhibitors to enter clinical trials. While the inhibitory function of BVM is well established, the molecular mechanisms of action and resistance are not well understood. It is known that mutations in the regions CS p24/p2 and p2 can cause phenotypic resistance to BVM. We have investigated a set of p24/p2 sequences of HIV-1 with known phenotypic resistance to BVM to test whether BVM resistance can be predicted from sequence, and to identify possible molecular mechanisms of BVM resistance in HIV-1.
We used artificial neural networks and random forests with different descriptors for the prediction of BVM resistance. Random forests with hydrophobicity as the descriptor performed best and classified the sequences with an area under the receiver operating characteristic (ROC) curve of 0.93 ± 0.001. For the collected data, we find that p2 sequence positions 369 to 376 have the highest impact on resistance, with positions 370 and 372 being particularly important. These findings are in partial agreement with other recent studies. Apart from the complex machine learning models, we derived a number of simple rules that predict BVM resistance from sequence with surprising accuracy. According to computational predictions based on the data set used, cleavage sites are usually not shifted by resistance mutations. However, we found that resistance mutations could shorten and weaken the α-helix in p2, which hints at a possible resistance mechanism.
We found that BVM resistance of HIV-1 can be predicted well from the sequence of the p2 peptide, which may prove useful for personalized therapy if maturation inhibitors reach clinical practice. Results of secondary structure analysis are compatible with a possible route to BVM resistance in which mutations weaken a six-helix bundle discovered in recent experiments, and thus ease Gag cleavage by the retroviral protease.
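A hydrophobicity descriptor of the kind mentioned above can be sketched as a per-residue lookup that turns a peptide sequence into a numeric vector suitable as classifier input. The sketch below assumes the Kyte-Doolittle hydropathy scale, which is one common choice; the text does not state which scale was used. The function names and the toy peptides are hypothetical and are not taken from the data set.

```python
# Kyte-Doolittle hydropathy index. Using this particular scale is an
# assumption; the source does not name the hydrophobicity scale.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydrophobicity_descriptor(sequence, unknown=0.0):
    """Encode a peptide as one hydropathy value per residue.

    Symbols outside the 20 standard amino acids map to `unknown`.
    """
    return [KYTE_DOOLITTLE.get(residue, unknown) for residue in sequence.upper()]


def descriptor_shift(reference, variant):
    """Per-position hydropathy change of a variant relative to a reference.

    Assumes both sequences have the same length (substitutions only).
    """
    if len(reference) != len(variant):
        raise ValueError("sequences must have equal length")
    ref = hydrophobicity_descriptor(reference)
    var = hydrophobicity_descriptor(variant)
    return [v - r for r, v in zip(ref, var)]


# Made-up four-residue peptides, for illustration only.
reference_vector = hydrophobicity_descriptor("AIVK")
shift = descriptor_shift("AIVK", "AIAK")
```

Each sequence becomes a fixed-length feature vector when all sequences share the same length, so the vectors can be passed directly to a classifier. Insertions and deletions would need an alignment or padding step that this sketch does not cover.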