Infectious diseases remain a major public health problem worldwide. Several intervention and control strategies have been devised over the years to manage these complex diseases. In this scenario, immunodiagnostics have been, and still are, essential tools for demonstrating infection, for follow-up studies (clinical management, prognosis of a disease), for monitoring the success of control strategies, and for supporting infection surveillance campaigns. Particularly in the case of intracellular pathogens, the most straightforward strategies for immunodetection usually rely on the detection of antibodies that bind to whole-parasite extracts or to some fraction of a parasite, e.g. a flagellar fraction. These methods, however, suffer from specificity problems: cross-reactive antibodies are common, confounding the diagnosis and often requiring additional (and perhaps more complex) diagnostic tests.
Development of new diagnostics is partly limited by the availability of well-characterized antigens. Peptide scanning is a widely used technique for mapping linear epitopes in a protein antigen. The recent availability of peptide microarray platforms allows rapid and inexpensive high-throughput serological screenings. This, coupled with the increasing number of complete pathogen genomes, means that it is now theoretically possible to identify immunodominant linear epitopes by scanning all predicted protein sequences using a similar approach. For pathogens with small genomes (e.g. viruses and small bacteria) it is therefore straightforward to synthesize thousands of individually addressable peptides that together represent the whole proteome, and to test for the presence of antibodies directed against them. However, this approach cannot be applied directly to larger bacterial or eukaryotic genomes, given their larger proteomes. Computational methods are therefore required to filter down the list of candidate peptides to be tested, while at the same time enriching it in potentially reactive epitopes.
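To illustrate why exhaustive scanning scales poorly with proteome size, the following minimal sketch tiles protein sequences into overlapping peptides and counts the candidates. The function names, the 15-residue peptide length, the offset and the toy sequences are assumptions chosen for illustration only; they are not parameters taken from this work.

```python
from typing import Dict, Iterator, Tuple


def tile_protein(sequence: str, length: int = 15, step: int = 1) -> Iterator[Tuple[int, str]]:
    """Yield (start, peptide) pairs for overlapping windows of a fixed length.

    Sequences shorter than `length` yield nothing. With step > 1 the final
    residues of a protein may not be covered by any window.
    """
    last_start = len(sequence) - length
    for start in range(0, last_start + 1, step):
        yield start, sequence[start:start + length]


def count_peptides(proteome: Dict[str, str], length: int = 15, step: int = 1) -> int:
    """Count the peptides needed to scan every protein in the proteome."""
    return sum(
        sum(1 for _ in tile_protein(seq, length, step))
        for seq in proteome.values()
    )


# Toy proteome with made-up sequences (hypothetical identifiers).
toy_proteome = {
    "protein_1": "MKTAYIAKQRQISFVKSHFSRQ",
    "protein_2": "MSLEQKKGADIISKILQIQNSIGKTTSPSTLKTKLSE",
}

if __name__ == "__main__":
    for offset in (1, 5):
        total = count_peptides(toy_proteome, length=15, step=offset)
        print(f"offset={offset}: {total} peptides")
```

With an offset of one residue, the number of peptides is roughly the total number of residues in the proteome, which is why a proteome of thousands of proteins calls for a prior computational selection.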
The challenge for this bioinformatic exercise is thus to identify, within a given proteome, those peptides that could be good targets for a B-cell response. The problem of B-cell epitope prediction refers to the identification of regions in an antigen that are recognized by the corresponding binding site (“paratope”) of antibodies. Over time, a number of algorithms have been developed for the computational prediction of B-cell epitopes.
However, perhaps with the exception of immunodominant epitopes, the set of epitopes recognized by polyclonal sera is not independent of the method of immunization (e.g. artificial immunization vs. natural infection), the immunized species, the use of adjuvants, etc. As a consequence, prediction of diagnostic epitopes in the context of a particular disease or infection is a more complex problem, where many additional constraints apply, such as the mechanism of entry of the infectious agent and the expression pattern of parasite proteins (when, where, abundance), amongst others. All these additional variables affect the outcome of the immune response, and may explain the variability in responses observed, for example, against the same protein in different species.
A number of successful antigen discovery efforts have been published recently, in which a computational strategy guided the selection of candidates for experimental validation. In Trypanosoma cruzi (a unicellular protozoan), Goto et al. identified and experimentally validated 8 antigens by searching for proteins bearing large tandem repeats; Cooley and coworkers performed a high-throughput serological screening of T. cruzi proteins, prioritizing their candidates by known expression in relevant lifecycle stages, proteomic evidence, and likelihood of secretion or surface exposure. In this latter study, starting from 400 proteins expressed in a heterologous system, the authors identified 39 promising antigens for further testing, and selected 16 for a multi-bead assay. In Echinococcus (a metazoan), List et al. described a bioinformatic filtering strategy in which they targeted alpha-helical coiled-coils and intrinsically unstructured regions in secreted or surface-exposed parasite proteins. Starting from 11 proteins from two Echinococcus species, they identified 45 candidate peptides between 24 and 30 amino acids in length that were then screened using peptide microarrays. These papers provide a proof of principle for the discovery of diagnostically relevant large peptides using computational selection.
However, we argue that many additional criteria can be integrated and exploited in a computational strategy to further guide the process of diagnostic peptide discovery. Firstly, we consider that there are significant advantages in prioritizing at the peptide level, as opposed to selecting proteins first and then selecting peptides within them. Furthermore, we propose a feature weighting approach, in contrast to a strict filtering strategy that excludes targets or peptides that do not match the specified criteria.
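The difference between strict filtering and feature weighting can be sketched as follows. This is a minimal illustration under stated assumptions, not the scoring scheme used in this work: the feature names, the weights and the peptide identifiers are all hypothetical.

```python
from typing import Dict, List

# Hypothetical binary features and weights, chosen for illustration only.
WEIGHTS: Dict[str, float] = {
    "surface_exposed": 2.0,
    "in_tandem_repeat": 1.5,
    "disordered": 1.0,
}


def passes_strict_filter(features: Dict[str, bool], required: List[str]) -> bool:
    """Strict filtering: a peptide is kept only if it meets every criterion."""
    return all(features.get(name, False) for name in required)


def weighted_score(features: Dict[str, bool], weights: Dict[str, float]) -> float:
    """Feature weighting: each satisfied criterion adds its weight to the score."""
    return sum(w for name, w in weights.items() if features.get(name, False))


def rank_peptides(peptides: Dict[str, Dict[str, bool]], weights: Dict[str, float]) -> List[str]:
    """Return peptide identifiers sorted from highest to lowest weighted score."""
    return sorted(peptides, key=lambda p: weighted_score(peptides[p], weights), reverse=True)


# Made-up peptides annotated with the hypothetical features above.
example_peptides = {
    "pepA": {"surface_exposed": True, "in_tandem_repeat": True, "disordered": True},
    "pepB": {"surface_exposed": True, "in_tandem_repeat": True, "disordered": False},
    "pepC": {"surface_exposed": False, "in_tandem_repeat": False, "disordered": False},
}

if __name__ == "__main__":
    kept = [p for p, f in example_peptides.items() if passes_strict_filter(f, list(WEIGHTS))]
    print("strict filter keeps:", kept)
    print("weighted ranking:", rank_peptides(example_peptides, WEIGHTS))
```

In this toy example the strict filter discards pepB for failing a single criterion, whereas the weighted ranking keeps it near the top of the list, so a peptide that misses one criterion can still be selected for testing.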
For this exercise, we chose to use the genome of the protozoan parasite Trypanosoma cruzi, the causative agent of Chagas disease, for a number of reasons. Firstly, the T. cruzi genome is large and complex for a protozoan parasite. Furthermore, this is an interesting biological model for the application of a diagnostic peptide discovery strategy, not only because of its high health impact and the need for novel diagnostics, but also because many antigens have already been described, which can be used either to identify predictive features or to assess the efficacy of our method.
Chagas disease is endemic in 18 countries in Central and South America, affecting up to 8 million individuals. Vectorial transmission of the disease occurs in endemic countries through the bite of some hematophagous insects, or by consumption of food exposed to secretions from infected insects. However, in non-endemic countries transmission also occurs from mother to child, and through blood transfusion and organ transplantation. Diagnosis of the disease is challenging, because human T. cruzi infection evolves into a chronic stage in which circulating parasites or their products are difficult to detect. In addition, serological diagnostic tests can be misleading due to cross-reactivity with other related protozoan pathogens that overlap geographically, such as Leishmania spp. (causative agents of leishmaniasis) and T. rangeli (a South American trypanosome that does not cause disease). Currently, a “conclusive” diagnosis of T. cruzi infection is reached only after multiple serological tests, and there is an urgent need for new diagnostics that can be used for the early detection of congenital infections, for screening blood banks, and for monitoring drug treatments in clinical studies.
In this work we present a comprehensive computational strategy for the discovery of diagnostically relevant peptides that can be applied to large genomes. We demonstrate the utility of our method by predicting candidate diagnostic epitopes starting from a complete eukaryotic genome.