In order to study the stoichiometry of monoclonal antibody (MAb) neutralization of T-cell line-adapted human immunodeficiency virus type 1 (HIV-1) in antibody excess and under equilibrium conditions, we exploited the ability of HIV-1 to generate mixed oligomers when different env genes are coexpressed. By the coexpression of Env glycoproteins that either can or cannot bind a neutralizing MAb in an env transcomplementation assay, virions were generated in which the proportion of MAb binding sites could be regulated. As the proportion of MAb binding sites in Env chimeric virus increased, MAb neutralization gradually increased. Virus neutralization by virion aggregation was minimal, as MAb binding to HIV-1 Env did not interfere with an AMLV Env-mediated infection by HIV-1(AMLV/HIV-1) pseudotypes of CD4− HEK293 cells. MAb neutralization of chimeric virions could be described as a third-order function of the proportion of Env antigen refractory to MAb binding. This scenario is consistent with the Env oligomer constituting the minimal functional unit and neutralization occurring incrementally as each Env oligomer binds MAb. Alternatively, the data could be fit to a sigmoid function. Thus, these data could not exclude the existence of a threshold for neutralization. However, results from MAb neutralization of chimeric virus containing wild-type Env and Env defective in CD4 binding were readily explained by a model of incremental MAb neutralization. In summary, the data indicate that MAb neutralization of T-cell line-adapted HIV-1 is incremental rather than all or none and that each MAb binding an Env oligomer reduces the likelihood of infection.
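The third-order relationship described in the abstract above can be illustrated with a minimal sketch. It rests on assumptions that are not stated in the abstract: that Env oligomers are trimers, that MAb-binding and refractory subunits assort randomly, and that one bound MAb is enough to inactivate an oligomer. The function names are hypothetical.

```python
from math import comb


def fraction_unbound_oligomers(refractory_fraction: float, subunits: int = 3) -> float:
    """Fraction of oligomers in which every subunit is refractory to the MAb.

    Assumption (illustrative only): subunits assort randomly, so the
    probability that all `subunits` positions are refractory is f**subunits.
    """
    return refractory_fraction ** subunits


def oligomer_distribution(refractory_fraction: float, subunits: int = 3) -> list:
    """Binomial distribution of the number of MAb-binding subunits per oligomer."""
    p_bind = 1.0 - refractory_fraction
    return [
        comb(subunits, k) * p_bind ** k * refractory_fraction ** (subunits - k)
        for k in range(subunits + 1)
    ]


if __name__ == "__main__":
    for f in (0.0, 0.25, 0.5, 0.75, 1.0):
        # Under the stated assumptions, residual infectivity scales with f**3.
        print(f, fraction_unbound_oligomers(f))
```

Under these assumptions, a virus population in which half of the Env is refractory would retain one eighth of its oligomers free of MAb, which is the cubic dependence the abstract refers to; a threshold model would predict a different curve shape.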
Human populations worldwide are increasingly confronted with infectious diseases and antimicrobial resistance spreading faster and appearing more frequently. Knowledge regarding their occurrence and worldwide transmission is important to control outbreaks and prevent epidemics. Here, we performed shotgun sequencing of toilet waste from 18 international airplanes arriving in Copenhagen, Denmark, from nine cities in three world regions. An average of 18.6 Gb (14.8 to 25.7 Gb) of raw Illumina paired-end sequence data was generated, cleaned, trimmed and mapped against reference sequence databases for bacteria and antimicrobial resistance genes. An average of 106,839 (0.06%) reads were assigned to resistance genes, with genes encoding resistance to tetracycline, macrolides and beta-lactams as the most abundant in all samples. We found significantly higher abundance and diversity of genes encoding antimicrobial resistance, including critically important resistance (e.g. blaCTX-M), carried on airplanes from South Asia compared to North America. Salmonella enterica and norovirus were also detected in higher amounts in samples from South Asia, whereas Clostridium difficile was most abundant in samples from North America. Our study provides a first step towards a potential novel strategy for global surveillance enabling simultaneous detection of multiple human health threatening genetic elements, infectious agents and resistance genes.
Since the first two complete bacterial genome sequences were published in 1995, the science of bacteria has dramatically changed. Using third-generation DNA sequencing, it is possible to completely sequence a bacterial genome in a few hours and identify some types of methylation sites along the genome as well. Sequencing of bacterial genomes is now a standard procedure, and the information from tens of thousands of bacterial genomes has had a major impact on our views of the bacterial world. In this review, we explore a series of questions to highlight some insights that comparative genomics has produced. To date, there are genome sequences available from 50 different bacterial phyla and 11 different archaeal phyla. However, the distribution is quite skewed towards a few phyla that contain model organisms. But the breadth is continuing to improve, with projects dedicated to filling in less characterized taxonomic groups. The clustered regularly interspaced short palindromic repeats (CRISPR)-Cas system provides bacteria with immunity against viruses, which outnumber bacteria by tenfold. How fast can we go? Second-generation sequencing has produced a large number of draft genomes (close to 90 % of bacterial genomes in GenBank are currently not complete); third-generation sequencing can potentially produce a finished genome in a few hours, and at the same time provide methylation sites along the entire chromosome. The diversity of bacterial communities is extensive, as is evident from the genome sequences available from 50 different bacterial phyla and 11 different archaeal phyla. Genome sequencing can help in classifying an organism, and in the case where multiple genomes of the same species are available, it is possible to calculate the pan- and core genomes; comparison of more than 2000 Escherichia coli genomes finds an E. coli core genome of about 3100 gene families and a total of about 89,000 different gene families.
Why do we care about bacterial genome sequencing? There are many practical applications, such as genome-scale metabolic modeling, biosurveillance, bioforensics, and infectious disease epidemiology. In the near future, high-throughput sequencing of patient metagenomic samples could revolutionize medicine in terms of speed and accuracy of finding pathogens and knowing how to treat them.
Bacteria; Comparative genomics; Bacterial genomes; Metagenomics; Core-genome; Pan-genome; Next-generation sequencing
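The pan- and core-genome calculation mentioned in the review abstract above reduces, in its simplest form, to a union and an intersection over gene-family sets. The following is a minimal sketch, assuming each genome has already been reduced to a set of gene-family identifiers; the genome names, family labels, and the function name `pan_and_core` are hypothetical, and real analyses need a clustering step to define the families first.

```python
from __future__ import annotations


def pan_and_core(genomes: dict[str, set[str]]) -> tuple[set[str], set[str]]:
    """Return (pan_genome, core_genome) for a mapping genome -> gene families.

    The pan-genome is the union of all gene families; the core genome is
    the set of families present in every genome.
    """
    family_sets = list(genomes.values())
    if not family_sets:
        return set(), set()
    pan = set().union(*family_sets)
    core = set.intersection(*family_sets)
    return pan, core


if __name__ == "__main__":
    # Toy input with made-up gene-family labels.
    toy = {
        "strain_1": {"famA", "famB", "famC"},
        "strain_2": {"famA", "famB", "famD"},
        "strain_3": {"famA", "famB", "famE", "famF"},
    }
    pan, core = pan_and_core(toy)
    print(len(pan), "families in the pan-genome;", len(core), "in the core genome")
```

As more genomes are added, the core can only shrink or stay the same while the pan-genome can only grow, which is why a large collection can show a small core alongside a very large total number of families.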
MAIT cells can discriminate between pathogen-derived ligands in a clonotype-dependent manner, and the TCR repertoire is distinct within individuals, indicating that the MAIT cell repertoire is shaped by prior microbial exposure.
Mucosal-associated invariant T (MAIT) cells express a semi-invariant T cell receptor (TCR) that detects microbial metabolites presented by the nonpolymorphic major histocompatibility complex (MHC)–like molecule MR1. The highly conserved nature of MR1 in conjunction with biased MAIT TCRα chain usage is widely thought to indicate limited ligand presentation and discrimination within a pattern-like recognition system. Here, we evaluated the TCR repertoire of MAIT cells responsive to three classes of microbes. Substantial diversity and heterogeneity were apparent across the functional MAIT cell repertoire as a whole, especially for TCRβ chain sequences. Moreover, different pathogen-specific responses were characterized by distinct TCR usage, both between and within individuals, suggesting that MAIT cell adaptation was a direct consequence of exposure to various exogenous MR1-restricted epitopes. In line with this interpretation, MAIT cell clones with distinct TCRs responded differentially to a riboflavin metabolite. These results suggest that MAIT cells can discriminate between pathogen-derived ligands in a clonotype-dependent manner, providing a basis for adaptive memory via recruitment of specific repertoires shaped by microbial exposure.
Building a population-specific catalogue of single nucleotide variants (SNVs), indels and structural variants (SVs) with frequencies, termed a national pan-genome, is critical for further advancing clinical and public health genetics in large cohorts. Here we report a Danish pan-genome obtained from sequencing 10 trios to high depth (50×). We report 536k novel SNVs and 283k novel short indels from mapping approaches and develop a population-wide de novo assembly approach to identify 132k novel indels larger than 10 nucleotides with low false discovery rates. We identify a higher proportion of indels and SVs than previous efforts, showing the merits of high coverage and de novo assembly approaches. In addition, we use trio information to identify de novo mutations and use a probabilistic method to provide direct estimates of 1.27e−8 and 1.5e−9 per nucleotide per generation for SNVs and indels, respectively.
The generation of a national pan-genome, a population-specific catalogue of genetic variation, may advance the impact of clinical genetics studies. Here, Besenbacher et al. carry out deep sequencing and de novo assembly of 10 parent–child trios to generate a Danish pan-genome that provides insight into structural variation, de novo mutation rates and variant calling.
Major histocompatibility complex class II (MHCII) molecules play an important role in cell-mediated immunity. They present specific peptides derived from endosomal proteins for recognition by T helper cells. The identification of peptides that bind to MHCII molecules is therefore of great importance for understanding the nature of immune responses and identifying T cell epitopes for the design of new vaccines and immunotherapies. Given the large number of MHC variants, and the costly experimental procedures needed to evaluate individual peptide–MHC interactions, computational predictions have become particularly attractive as first-line methods in epitope discovery. However, only a few so-called pan-specific prediction methods capable of predicting binding to any MHC molecule with known protein sequence are currently available, and all of them are limited to HLA-DR. Here, we present the first pan-specific method capable of predicting peptide binding to any HLA class II molecule with a defined protein sequence. The method employs a strategy common for HLA-DR, HLA-DP and HLA-DQ molecules to define the peptide-binding MHC environment in terms of a pseudo sequence. This strategy allows the inclusion of new molecules even from other species. The method was evaluated in several benchmarks and demonstrates a significant improvement over molecule-specific methods as well as the ability to predict peptide binding of previously uncharacterised MHCII molecules. To the best of our knowledge, the NetMHCIIpan-3.0 method is the first pan-specific predictor covering all HLA class II molecules with known sequences including HLA-DR, HLA-DP, and HLA-DQ. The NetMHCIIpan-3.0 method is available at http://www.cbs.dtu.dk/services/NetMHCIIpan-3.0.
MHC class II; T cell epitope; MHC binding specificity; Peptide–MHC binding; Human leukocyte antigens; Artificial neural networks
The identification of peptides binding to major histocompatibility complexes (MHC) is a critical step in the understanding of T cell immune responses. The human MHC genomic region (HLA) is extremely polymorphic, comprising several thousand alleles, many encoding a distinct molecule. The potentially unique specificities remain experimentally uncharacterized for the vast majority of HLA molecules. Likewise, for nonhuman species, only a minor fraction of the known MHC molecules have been characterized. Here, we describe a tool, MHCcluster, to functionally cluster MHC molecules based on their predicted binding specificity. The method has a flexible web interface that allows the user to include any MHC of interest in the analysis. The output consists of a static heat map and graphical tree-based visualizations of the functional relationship between MHC variants and a dynamic TreeViewer interface where both the functional relationship and the individual binding specificities of MHC molecules are visualized. We demonstrate that conventional sequence-based clustering will fail to identify the functional relationship between molecules when applied to the MHC system, and that only through the use of the predicted binding specificity can a correct clustering be found. Clustering of prevalent HLA-A and HLA-B alleles using MHCcluster confirms the presence of 12 major specificity groups (supertypes), some, however, with highly divergent specificities. Importantly, some HLA molecules are shown not to fit any supertype classification. Also, we use MHCcluster to show that chimpanzee MHC class I molecules have a reduced functional diversity compared to that of HLA class I molecules. MHCcluster is available at www.cbs.dtu.dk/services/MHCcluster-2.0.
MHC; HLA; Binding motif; Functional clustering; MHC specificity; Supertypes
Whole genome sequencing (WGS) shows great potential for real-time monitoring and identification of infectious disease outbreaks. However, rapid and reliable comparison of data generated in multiple laboratories and using multiple technologies is essential. So far, studies have focused on using one technology because each technology has a systematic bias, making integration of data generated from different platforms difficult. We developed two different procedures for identifying variable sites and inferring phylogenies in WGS data across multiple platforms. The methods were evaluated on three bacterial data sets sequenced on three different platforms (Illumina, 454, Ion Torrent). We show that the methods are able to overcome the systematic biases caused by the sequencers and infer the expected phylogenies. We conclude that the success of these new procedures is due to validation of all informative sites that are included in the analysis. The procedures are available as web tools.
C-function MADS-box transcription factors belong to the AGAMOUS (AG) lineage and specify both stamen and carpel identity and floral meristem determinacy. In core eudicots, the AG lineage is further divided into two branches, the euAG and PLE lineages. Functional analyses across flowering plants strongly support the idea that duplicated AG lineage genes have different degrees of subfunctionalization of the C-function. The legume Medicago truncatula contains three C-lineage genes in its genome: two euAG genes (MtAGa and MtAGb) and one PLENA-like gene (MtSHP). This species is therefore a good experimental system to study the effects of gene duplication within the AG subfamily. We have studied the respective functions of each euAG gene in M. truncatula employing expression analyses and reverse genetic approaches. Our results show that the M. truncatula euAG- and PLENA-like genes are an example of subfunctionalization as a result of a change in expression pattern. MtAGa and MtAGb are the only genes showing a full C-function activity, concomitant with their ancestral expression profile, early in the floral meristem, and in the third and fourth floral whorls during floral development. In contrast, MtSHP expression appears late during floral development, suggesting it does not contribute significantly to the C-function. Furthermore, the redundant MtAGa and MtAGb paralogs have been retained, which provides the overall dosage required to specify the C-function in M. truncatula.
CD8+ T cell exhaustion represents a major hallmark of chronic HIV infection. Two key transcription factors governing CD8+ T cell differentiation, T-bet and Eomesodermin (Eomes), have previously been shown in mice to differentially regulate T cell exhaustion in part through direct modulation of PD-1. Here, we examined the relationship between these transcription factors and the expression of several inhibitory receptors (PD-1, CD160, and 2B4), functional characteristics and memory differentiation of CD8+ T cells in chronic and treated HIV infection. The expression of PD-1, CD160, and 2B4 on total CD8+ T cells was elevated in chronically infected individuals and highly associated with a T-betdimEomeshi expressional profile. Interestingly, both resting and activated HIV-specific CD8+ T cells in chronic infection were almost exclusively T-betdimEomeshi cells, while CMV-specific CD8+ T cells displayed a balanced expression pattern of T-bet and Eomes. The T-betdimEomeshi virus-specific CD8+ T cells did not show features of terminal differentiation, but rather a transitional memory phenotype with poor polyfunctional (effector) characteristics. The transitional and exhausted phenotype of HIV-specific CD8+ T cells was longitudinally related to persistent Eomes expression after antiretroviral therapy (ART) initiation. Strikingly, these characteristics remained stable up to 10 years after ART initiation. This study supports the concept that poor human viral-specific CD8+ T cell functionality is due to an inverse expression balance between T-bet and Eomes, which is not reversed despite long-term viral control through ART. These results help to explain the inability of HIV-specific CD8+ T cells to control viral replication after ART cessation.
CD8+ T cells display numerous traits of severe dysfunction in both treated and untreated HIV infection. Previous studies have demonstrated that HIV-specific CD8+ T cells in most individuals possess poor polyfunctionality and an immature/skewed maturation phenotype. However, it remains unclear which transcriptional programming governs the regulation of CD8+ T cell differentiation and exhaustion in HIV infection. T-bet and Eomes represent two key transcription factors for CD8+ T cell differentiation and function, but surprisingly little is known about their influence on effector immunity following chronic viral infections in humans. In this study, we demonstrate that HIV-specific CD8+ T cells possess highly elevated levels of Eomes, but low T-bet expression. This differential relationship is linked to the up-regulation of several inhibitory receptors, impaired functional characteristics and a transitional memory differentiation phenotype for virus-specific CD8+ T cells. Importantly, these characteristics of HIV-specific CD8+ T cells remained stable despite suppressive ART for many years. These results imply that reinvigoration of these cells might fail to elicit efficient responses to eradicate the viral reservoir.
In the work presented here, we designed and developed two easy-to-use Web tools for in silico detection and characterization of whole-genome sequence (WGS) and whole-plasmid sequence data from members of the family Enterobacteriaceae. These tools will facilitate bacterial typing based on draft genomes of multidrug-resistant Enterobacteriaceae species by the rapid detection of known plasmid types. Replicon sequences from 559 fully sequenced plasmids associated with the family Enterobacteriaceae in the NCBI nucleotide database were collected to build a consensus database for integration into a Web tool called PlasmidFinder that can be used for replicon sequence analysis of raw, contig group, or completely assembled and closed plasmid sequencing data. The PlasmidFinder database currently consists of 116 replicon sequences that match, with at least 80% nucleotide identity, all replicon sequences identified in the 559 fully sequenced plasmids. For plasmid multilocus sequence typing (pMLST) analysis, a database that is updated weekly was generated from www.pubmlst.org and integrated into a Web tool called pMLST. Both databases were evaluated using draft genomes from a collection of Salmonella enterica serovar Typhimurium isolates. PlasmidFinder identified a total of 103 replicons and between zero and five different plasmid replicons within each of 49 S. Typhimurium draft genomes tested. The pMLST Web tool was able to subtype genomic sequencing data of plasmids, revealing both known plasmid sequence types (STs) and new alleles and ST variants. In conclusion, testing of the two Web tools using both fully assembled plasmid sequences and WGS-generated draft genomes showed them to be able to detect a broad variety of plasmids that are often associated with antimicrobial resistance in clinically relevant bacterial pathogens.
One of the first issues that emerges when a prokaryotic organism of interest is encountered is the question of what it is, that is, which species it is. The 16S rRNA gene formed the basis of the first method for sequence-based taxonomy and has had a tremendous impact on the field of microbiology. Nevertheless, the method has been found to have a number of shortcomings. In the current study, we trained and benchmarked five methods for whole-genome sequence-based prokaryotic species identification on a common data set of complete genomes: (i) SpeciesFinder, which is based on the complete 16S rRNA gene; (ii) Reads2Type, which searches for species-specific 50-mers in either the 16S rRNA gene or the gyrB gene (for the Enterobacteriaceae family); (iii) the ribosomal multilocus sequence typing (rMLST) method, which samples up to 53 ribosomal genes; (iv) TaxonomyFinder, which is based on species-specific functional protein domain profiles; and finally (v) KmerFinder, which examines the number of cooccurring k-mers (substrings of k nucleotides in DNA sequence data). The performances of the methods were subsequently evaluated on three data sets of short sequence reads or draft genomes from public databases. In total, the evaluation sets constituted sequence data from more than 11,000 isolates covering 159 genera and 243 species. Our results indicate that methods that sample only chromosomal, core genes have difficulties in distinguishing closely related species which only recently diverged. The KmerFinder method had the overall highest accuracy and correctly identified from 93% to 97% of the isolates in the evaluation sets.
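The idea of counting co-occurring k-mers, as described for KmerFinder in the abstract above, can be sketched in a few lines. This is a toy illustration only, not the published scoring scheme: it ranks references by the raw number of distinct k-mers shared with the query. The function names, the reference names, and the short sequences are all hypothetical, and the small k is chosen only so that the toy sequences share k-mers.

```python
def kmers(sequence: str, k: int) -> set:
    """Return the set of distinct substrings of length k in `sequence`."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def rank_references(query: str, references: dict, k: int = 4) -> list:
    """Rank references by the number of distinct k-mers shared with the query.

    Returns (reference_name, shared_kmer_count) pairs, best match first.
    """
    query_kmers = kmers(query, k)
    scores = {
        name: len(query_kmers & kmers(seq, k))
        for name, seq in references.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Made-up sequences standing in for reference genomes.
    refs = {
        "species_A": "ACGTACGTGGCCTTAA",
        "species_B": "TTTTGGGGCCCCAAAA",
    }
    print(rank_references("ACGTACGTGGCC", refs))
```

Because the comparison uses every k-mer in the data rather than a fixed set of marker genes, an approach of this kind can draw on differences anywhere in the genome, which fits the abstract's observation that methods restricted to core genes struggle with recently diverged species.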
Fast and accurate identification and typing of pathogens are essential for effective surveillance and outbreak detection. The current routine procedure is based on a variety of techniques, making the procedure laborious, time-consuming, and expensive. With whole-genome sequencing (WGS) becoming cheaper, it has huge potential in both diagnostics and routine surveillance. The aim of this study was to perform a real-time evaluation of WGS for routine typing and surveillance of verocytotoxin-producing Escherichia coli (VTEC). In Denmark, the Statens Serum Institut (SSI) routinely receives all suspected VTEC isolates. During a 7-week period in the fall of 2012, all incoming isolates were concurrently subjected to WGS using IonTorrent PGM. Real-time bioinformatics analysis was performed using web-tools (www.genomicepidemiology.org) for species determination, multilocus sequence type (MLST) typing, and determination of phylogenetic relationship, and a specific VirulenceFinder for detection of E. coli virulence genes was developed as part of this study. In total, 46 suspected VTEC isolates were characterized in parallel during the study. VirulenceFinder proved successful in detecting virulence genes included in routine typing, explicitly verocytotoxin 1 (vtx1), verocytotoxin 2 (vtx2), and intimin (eae), and also detected additional virulence genes. VirulenceFinder is also a robust method for assigning verocytotoxin (vtx) subtypes. A real-time clustering of isolates in agreement with the epidemiology was established from WGS, enabling discrimination between sporadic and outbreak isolates. Overall, WGS typing produced results faster and at a lower cost than the current routine. Therefore, WGS typing is a superior alternative to conventional typing strategies. This approach may also be applied to typing and surveillance of other pathogens.
We present an Aboriginal Australian genomic sequence obtained from a 100-year-old lock of hair donated by an Aboriginal man from southern Western Australia in the early 20th century. We detect no evidence of European admixture and estimate contamination levels to be below 0.5%. We show that Aboriginal Australians are descendants of an early human dispersal into eastern Asia, possibly 62,000 to 75,000 years ago. This dispersal is separate from the one that gave rise to modern Asians 25,000 to 38,000 years ago. We also find evidence of gene flow between populations of the two dispersal waves prior to the divergence of Native Americans from modern Asian ancestors. Our findings support the hypothesis that present-day Aboriginal Australians descend from the earliest humans to occupy Australia, likely representing one of the oldest continuous populations outside Africa.
We aimed to investigate whether the character of the immunodominant HIV-Gag peptide (variable or conserved) targeted by CD8+ T cells in early HIV infection would influence the quality and quantity of T cell responses, and whether this would affect the rate of disease progression. Treatment-naive HIV-infected study subjects within the OPTIONS cohort at the University of California, San Francisco, were monitored from an estimated 44 days postinfection for up to 6 years. CD8+ T cell responses targeting HLA-matched HIV-Gag-epitopes were identified and characterized by multicolor flow cytometry. The autologous HIV gag sequences were obtained. We demonstrate that patients targeting a conserved HIV-Gag-epitope in early infection maintained their epitope-specific CD8+ T cell response throughout the study period. Patients targeting a variable epitope showed decreased immune responses over time, although there was no limitation of the functional profile, and they were likely to target additional variable epitopes. Maintained immune responses to conserved epitopes were associated with no or limited sequence evolution within the targeted epitope. Patients with immune responses targeting conserved epitopes had a significantly lower median viral load over time compared to patients with responses targeting a variable epitope (0.63 log10 difference). Furthermore, the rate of CD4+ T cell decline was slower for subjects targeting a conserved epitope (0.85% per month) compared to subjects targeting a variable epitope (1.85% per month). Previous studies have shown that targeting of antigens based on specific HLA types is associated with a better disease course. In this study we show that categorizing epitopes based on their variability is associated with clinical outcome.
Salmonella enterica is a common cause of small and large foodborne outbreaks. To achieve successful and nearly ‘real-time’ monitoring and identification of outbreaks, reliable sub-typing is essential. Whole genome sequencing (WGS) shows great promise as a routine epidemiological typing tool. Here we evaluate WGS for typing of S. Typhimurium, including different approaches for analyzing and comparing the data. A collection of 34 S. Typhimurium isolates was sequenced. This consisted of 18 isolates from six outbreaks and 16 epidemiologically unrelated background strains. In addition, 8 S. Enteritidis and 5 S. Derby isolates were sequenced and used for comparison. A number of different bioinformatics approaches were applied to the data, including pan-genome tree, k-mer tree, nucleotide difference tree and SNP tree. The outcome of each approach was evaluated in relation to the association of the isolates with specific outbreaks. The pan-genome tree clustered 65% of the S. Typhimurium isolates according to the pre-defined epidemiology, the k-mer tree 88%, the nucleotide difference tree 100% and the SNP tree 100% of the strains within S. Typhimurium. The outcomes of the four phylogenetic analyses were also compared to PFGE, revealing that WGS typing achieved greater performance than the traditional method. In conclusion, for S. Typhimurium, SNP analysis and the nucleotide difference approach applied to WGS data seem to be the superior methods for epidemiological typing compared to other phylogenetic analytic approaches that may be used on WGS data. These approaches were also superior to the more classical typing method, PFGE. Our study also indicates that WGS alone is insufficient to determine whether strains are related or unrelated to outbreaks. This still requires the combination of epidemiological data and whole genome sequencing results.
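The abstract above does not describe how the nucleotide difference tree is computed, so the following is only a generic sketch of the underlying idea: count the positions at which two aligned sequences differ and collect the counts for every pair of isolates. It assumes the sequences are already aligned to equal length and treats `N` as an uncalled base to be skipped; the isolate names, sequences, and function names are hypothetical.

```python
from itertools import combinations


def nucleotide_differences(seq_a: str, seq_b: str) -> int:
    """Count aligned positions where two sequences differ, skipping 'N' calls."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(
        1
        for a, b in zip(seq_a, seq_b)
        if a != b and a != "N" and b != "N"
    )


def pairwise_differences(isolates: dict) -> dict:
    """Return {(name_1, name_2): difference_count} for every pair of isolates."""
    return {
        (name_1, name_2): nucleotide_differences(seq_1, seq_2)
        for (name_1, seq_1), (name_2, seq_2) in combinations(sorted(isolates.items()), 2)
    }


if __name__ == "__main__":
    # Made-up aligned fragments; real input would be genome-wide alignments.
    toy = {
        "outbreak_1a": "ACGTACGT",
        "outbreak_1b": "ACGTACGA",
        "background": "TCGAACTT",
    }
    for pair, count in pairwise_differences(toy).items():
        print(pair, count)
```

A matrix of this kind could then be passed to a standard tree-building method. Isolates from the same outbreak would be expected to differ at few positions, although, as the abstract notes, such counts still need to be interpreted together with epidemiological data.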
The binding of antigens to antibodies is one of the key events in an immune response against foreign molecules and is a critical element of several biomedical applications including vaccines and immunotherapeutics. For development of such applications, the identification of antibody binding sites (B-cell epitopes) is essential. However, experimental epitope mapping is highly cost-intensive and computer-aided methods in general have moderate performance. One major reason for this moderate performance is an incomplete understanding of what characterizes an epitope. To fill this gap, we developed a novel framework for comparing and superimposing B-cell epitopes and applied it to a dataset of 107 non-similar antigen:antibody structures extracted from the PDB database. With the presented framework, we were able to describe the general B-cell epitope as a flat, oblong, oval-shaped volume consisting of predominantly hydrophobic amino acids in the center flanked by charged residues. The average epitope was found to be made up of ~15 residues with one linear stretch of 5 or more residues constituting more than half of the epitope size. Furthermore, the epitope area is predominantly constrained to a plane above the antibody tip, in which the epitope is oriented at a −30 to 60 degree angle relative to the light to heavy chain antibody direction. Contrary to previous findings, we did not find a significant deviation between the amino acid composition in epitopes and the composition of equally exposed parts of the antigen surface. Our results, in combination with previous findings, give a detailed picture of the B-cell epitope that may be used in development of improved B-cell prediction methods.
Antibody; Antigen; Epitope; Structure; Amino acid distribution
Whole-genome sequencing (WGS) is becoming available as a routine tool for clinical microbiology. If applied directly on clinical samples, this could further reduce diagnostic times and thereby improve control and treatment. A major bottleneck is the availability of fast and reliable bioinformatic tools. This study was conducted to evaluate the applicability of WGS directly on clinical samples and to develop easy-to-use bioinformatic tools for the analysis of sequencing data. Thirty-five random urine samples from patients with suspected urinary tract infections were examined using conventional microbiology, WGS of isolated bacteria, and direct sequencing on pellets from the urine samples. A rapid method for analyzing the sequence data was developed. Bacteria were cultivated from 19 samples but in pure cultures from only 17 samples. WGS improved the identification of the cultivated bacteria, and almost complete agreement was observed between phenotypic and predicted antimicrobial susceptibilities. Complete agreement was observed between species identification, multilocus sequence typing, and phylogenetic relationships for Escherichia coli and Enterococcus faecalis isolates when the results of WGS of cultured isolates and urine samples were directly compared. Sequencing directly from the urine enabled bacterial identification in polymicrobial samples. Additional putative pathogenic strains were observed in some culture-negative samples. WGS directly on clinical samples can provide clinically relevant information and drastically reduce diagnostic times. This may prove very useful, but the need for data analysis is still a hurdle to clinical implementation. To overcome this problem, a publicly available bioinformatic tool was developed in this study.
Cheap DNA sequencing may soon become routine not only for human genomes but also for practically anything requiring the identification of living organisms from their DNA: tracking of infectious agents, control of food products, bioreactors, or environmental samples. We propose a novel general approach to the analysis of sequencing data where a reference genome does not have to be specified. Using a distributed architecture we are able to query a remote server for hints about what the reference might be, transferring a relatively small amount of data. Our system consists of a server with known reference DNA indexed, and a client with raw sequencing reads. The client sends a sample of unidentified reads, and in return receives a list of matching references. Sequences for the references can be retrieved and used for exhaustive computation on the reads, such as alignment. To demonstrate this approach we have implemented a web server, indexing tens of thousands of publicly available genomes and genomic regions from various organisms and returning lists of matching hits from query sequencing reads. We have also implemented two clients: one running in a web browser, and one as a Python script. Both are able to handle a large number of sequencing reads and, from portable devices (the browser-based client running on a tablet), perform their task within seconds and consume an amount of bandwidth compatible with mobile broadband networks. Such client-server approaches could develop in the future, allowing fully automated processing of sequencing data and routine instant quality checks of sequencing runs from desktop sequencers. Web access is available at http://tapir.cbs.dtu.dk. The source code for a Python command-line client, a server, and supplementary data are available at http://bit.ly/1aURxkc.
Although the majority of bacteria are harmless or even beneficial to their host, others are highly virulent and can cause serious diseases, and even death. Due to the constantly decreasing cost of high-throughput sequencing, there are now many completely sequenced genomes available from both human pathogenic and innocuous strains. The data can be used to identify gene families that correlate with pathogenicity and to develop tools to predict the pathogenicity of newly sequenced strains, investigations that previously were mainly done by means of more expensive and time-consuming experimental approaches. We describe PathogenFinder (http://cge.cbs.dtu.dk/services/PathogenFinder/), a web-server for the prediction of bacterial pathogenicity by analysing the input proteome, genome, or raw reads provided by the user. The method relies on groups of proteins, created without regard to their annotated function or known involvement in pathogenicity. The method has been built to work with all taxonomic groups of bacteria and, using the entire training set, achieved an accuracy of 88.6% on an independent test set, correctly classifying 398 out of 449 completely sequenced bacteria. The approach proposed here is not biased toward sets of genes known to be associated with pathogenicity and could therefore aid the discovery of novel pathogenicity factors. Furthermore, the pathogenicity prediction web-server could be used to isolate the potential pathogenic features of both known and unknown strains.
Identifying which mutation(s) within a given genotype is responsible for an observable phenotype is important in many aspects of molecular biology. Here, we present SigniSite, an online application for subgroup-free residue-level genotype–phenotype correlation. In contrast to similar methods, SigniSite does not require any pre-definition of subgroups or binary classification. Input is a set of protein sequences where each sequence has an associated real number, quantifying a given phenotype. SigniSite will then identify which amino acid residues are significantly associated with the data set phenotype. As output, SigniSite displays a sequence logo, depicting the strength of the phenotype association of each residue and a heat-map identifying ‘hot’ or ‘cold’ regions. SigniSite was benchmarked against SPEER, a state-of-the-art method for the prediction of specificity determining positions (SDP) using a set of human immunodeficiency virus protease-inhibitor genotype–phenotype data and corresponding resistance mutation scores from the Stanford University HIV Drug Resistance Database, and a data set of protein families with experimentally annotated SDPs. For both data sets, SigniSite was found to outperform SPEER. SigniSite is available at: http://www.cbs.dtu.dk/services/SigniSite/.
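To make the idea of subgroup-free, residue-level association concrete, the following Python sketch scores every (position, residue) pair in a set of aligned sequences by a rank-sum z-score of the phenotype values. This is an illustrative assumption, not necessarily the statistic SigniSite computes: the abstract does not name its test, and the function names `average_ranks` and `residue_z_scores` are hypothetical. Sequences are assumed to be pre-aligned and of equal length, and no multiple-testing correction is applied.

```python
import math
from collections import defaultdict


def average_ranks(values):
    """Return 1-based ranks of values, giving tied values their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for pos in range(i, j + 1):
            ranks[order[pos]] = mean_rank
        i = j + 1
    return ranks


def residue_z_scores(sequences, phenotypes):
    """Rank-sum z-score for every (position, residue) in aligned sequences."""
    n = len(sequences)
    ranks = average_ranks(phenotypes)
    rank_sums = defaultdict(float)
    counts = defaultdict(int)
    for seq, rank in zip(sequences, ranks):
        for pos, aa in enumerate(seq):
            rank_sums[(pos, aa)] += rank
            counts[(pos, aa)] += 1
    scores = {}
    for key, n1 in counts.items():
        n2 = n - n1
        if n2 == 0:
            continue  # residue present in every sequence: nothing to compare
        expected = n1 * (n + 1) / 2
        sd = math.sqrt(n1 * n2 * (n + 1) / 12)  # normal approximation, no tie correction
        scores[key] = (rank_sums[key] - expected) / sd
    return scores


if __name__ == "__main__":
    # Toy data: sequences carrying 'A' at position 0 have the higher phenotype values.
    seqs = ["AK", "AK", "GK", "GK"]
    pheno = [4.0, 3.0, 2.0, 1.0]
    for key, z in sorted(residue_z_scores(seqs, pheno).items()):
        print(key, round(z, 3))
```

A positive score means the residue tends to occur in sequences with high phenotype values and a negative score the opposite; because the phenotype enters only through its ranks, no cut-off separating "resistant" from "susceptible" sequences is needed.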
The interaction between antibodies and antigens is one of the most important immune system mechanisms for clearing infectious organisms from the host. Antibodies bind to antigens at sites referred to as B-cell epitopes. Identification of the exact location of B-cell epitopes is essential in several biomedical applications such as rational vaccine design and the development of disease diagnostics and immunotherapeutics. However, experimental mapping of epitopes is resource intensive, making in silico methods an appealing complementary approach. To date, the reported performance of methods for in silico mapping of B-cell epitopes has been moderate. Several issues regarding the evaluation data sets may, however, have led to the performance values being underestimated: rarely have all potential epitopes been mapped on an antigen, and antibodies are generally raised against the antigen in a given biological context, not against the antigen monomer. Improper handling of these aspects leads to many artificial false positive predictions and hence to incorrectly low performance values. To demonstrate the impact of proper benchmark definitions, we here present an updated version of the DiscoTope method incorporating a novel spatial neighborhood definition and half-sphere exposure as surface measure. Compared to other state-of-the-art prediction methods, DiscoTope-2.0 displayed improved performance both in cross-validation and in independent evaluations. Using DiscoTope-2.0, we assessed the impact on performance when using proper benchmark definitions. For 13 proteins in the training data set where sufficient biological information was available to make a proper benchmark redefinition, the average AUC performance was improved from 0.791 to 0.824. Similarly, the average AUC performance on an independent evaluation data set improved from 0.712 to 0.727.
Our results thus demonstrate that, given proper benchmark definitions, B-cell epitope prediction methods achieve highly significant predictive performances, suggesting that these tools are a powerful asset in rational epitope discovery. The updated version of DiscoTope is available at www.cbs.dtu.dk/services/DiscoTope-2.0.
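The effect of benchmark definition on AUC can be shown with a small Python sketch. AUC is computed here in its standard pairwise form: the fraction of (epitope, non-epitope) residue pairs in which the epitope residue receives the higher score. The scores and labels below are invented toy values, not data from the study, and the function name `auc` is only illustrative. The example relabels one high-scoring residue from non-epitope to epitope, mimicking an epitope that was missing from the original annotation.

```python
def auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


# Invented per-residue prediction scores for a five-residue toy antigen.
scores = [0.9, 0.8, 0.7, 0.3, 0.2]
# Original benchmark: the residue scoring 0.8 is annotated as non-epitope.
original = [1, 0, 1, 0, 0]
# Redefined benchmark: the same residue is now annotated as part of an epitope.
redefined = [1, 1, 1, 0, 0]

if __name__ == "__main__":
    print(auc(scores, original), auc(scores, redefined))
```

Under the original labels the high-scoring residue counts as a false positive and lowers the AUC; once it is annotated as an epitope residue the same predictions score higher, although nothing about the predictor has changed.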
The human immune system has an incredible ability to fight pathogens (bacterial, fungal and viral infections). One of the most important immune system events involved in clearing infectious organisms is the interaction between antibodies and antigens (molecules such as proteins from the pathogenic organism). Antibodies bind to antigens at sites known as B-cell epitopes. Hence, identification of the areas on the surface of antigens that are capable of binding to antibodies may aid the development of various immune-related applications (e.g. vaccines and immunotherapeutics). However, experimental identification of B-cell epitopes is a resource-intensive task, making computer-aided methods an appealing complementary approach. Previously reported performances of methods for B-cell epitope prediction have been moderate. Here, we present an updated version of the B-cell epitope prediction method DiscoTope, which, on the basis of a protein structure and epitope propensity scores, predicts residues likely to be involved in B-cell epitopes. We demonstrate that the low performances can to some extent be explained by poorly defined benchmarks, and that inclusion of additional biological information greatly enhances the predictive performance. This suggests that, given proper benchmark definitions, state-of-the-art B-cell epitope prediction methods perform significantly better than generally assumed.