Among the annotations that can be attached to a protein, domains occupy a key position. Protein domains are sequential and structural motifs that are found independently in different proteins, in different combinations. As such, domains seem to be functional subunits of proteins above the raw amino acid sequence level [1
]. Several approaches have been developed to define and identify protein domains. Some are based on a structural classification scheme [2
], while others are inferred by clustering conserved sub-sequences [3
]. One of the most widely used domain schemata is the Pfam database [4
]. In Pfam, each domain family is defined using a set of distinct representative protein sequences which are manually selected and aligned, and used to learn a Hidden Markov Model (HMM) [5
] of the domain. HMMs are probabilistic models that use match states to model the conserved positions of the multiple sequence alignment, and handle the gaps with specific (insert and delete) states.
The Pfam database (version 23.0) offers a collection of 10 340 HMMs/domains, which cover over 73% of all proteins in the UniProt database [6
]. The InterPro consortium [3
] has functionally annotated a subset of Pfam HMMs using the Gene Ontology (GO) [7
]. According to the InterPro annotation policy, a domain is annotated with a given GO term if all proteins known to contain this domain share that GO term. Because of this stringent rule, when a domain is newly detected in a protein, its annotations can be transferred to that protein. Enhancing domain detection is thus a fundamental step for improving structural and functional annotations of proteins.
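The annotation rule and the resulting transfer can be illustrated with a minimal sketch. The protein, domain and GO identifiers below are hypothetical placeholders, and the two function names are introduced here for illustration only; they are not part of InterPro or Pfam.

```python
def domain_go_terms(domain, protein_domains, protein_go):
    """Return the GO terms shared by every protein known to contain `domain`."""
    carriers = [p for p, doms in protein_domains.items() if domain in doms]
    if not carriers:
        return set()
    shared = set(protein_go.get(carriers[0], set()))
    for protein in carriers[1:]:
        shared &= protein_go.get(protein, set())
    return shared


def transfer_annotations(detected_domains, domain_go):
    """Union of the GO terms carried by the domains detected in a new protein."""
    terms = set()
    for domain in detected_domains:
        terms |= domain_go.get(domain, set())
    return terms


# Hypothetical toy data: which domains each protein contains, and its GO terms.
protein_domains = {"P1": {"domA"}, "P2": {"domA", "domB"}, "P3": {"domB"}}
protein_go = {
    "P1": {"GO:t1", "GO:t2"},
    "P2": {"GO:t1", "GO:t3"},
    "P3": {"GO:t3"},
}

domain_go = {d: domain_go_terms(d, protein_domains, protein_go)
             for d in ("domA", "domB")}
new_protein_terms = transfer_annotations(["domA"], domain_go)
print(domain_go, new_protein_terms)
```

In this toy case only the term shared by both carriers of `domA` is attached to the domain, so a single protein with an extra annotation cannot propagate it to newly annotated proteins.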
When analyzing a new protein sequence, each Pfam HMM is used to compute a score that measures the similarity between the sequence and the domain. If the score is above a given threshold provided by Pfam (thresholds differ between HMMs), then the presence of the domain is asserted in the protein. This threshold is referred to as the gathering threshold and is manually curated to ensure few false positives among detected domains [4]. However, when applied to highly divergent proteins, this strategy may miss numerous domains. This is the case with Plasmodium falciparum
, the main causal agent of human malaria, which kills nearly 800 000 people each year across the 106 malaria-endemic countries [8
]. No Pfam domains are detected in nearly 50% of P. falciparum
proteins, while many domain types seem to be missing from its repertoire. Although this situation may be explained by the existence of genes that are unique to this organism, it is further exacerbated by the high evolutionary distance between P. falciparum
and the classical model organisms that were used to build the HMMs. Accurately estimating the number of Pfam domains that remain to be discovered in P. falciparum
is challenging. In classical model eukaryotes, the number of Pfam occurrences per protein is above 0.8 (for example, the coverage of S. cerevisiae
and C. elegans
is 0.9 and 0.86, respectively). Assuming a coverage of 0.8, a total of ~4500 Pfam occurrences should be present in the proteome of P. falciparum
. Subtracting the number of currently annotated domains from the expected 4500 would suggest that around 1 000 domains are yet to be detected. These “missing” occurrences might be explained by the highly atypical genome of P. falciparum
, which is more than 80% A+T and contains long low-complexity insertions of unknown function believed to form non-globular domains [9
]. This strongly biases the amino-acid composition of P. falciparum
proteins, in which six amino acids account for more than 50% of the protein composition [10
]. In this context, fitting the HMM library to the specificities of the target proteome may help identify additional domains not detected by the standard library.
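The estimate of missing occurrences given above rests on simple arithmetic, which the following sketch makes explicit. Only the coverage (0.8), the expected total (~4500) and the gap (~1 000) come from the text; the proteome size and the current number of annotated occurrences are back-calculated from those figures and should be read as implied approximations, not as reported values.

```python
def expected_occurrences(n_proteins, coverage):
    """Expected number of Pfam occurrences for a proteome at a given coverage."""
    return n_proteins * coverage


COVERAGE = 0.8    # occurrences per protein, assumed in the text
EXPECTED = 4500   # expected occurrences, quoted in the text (approximate)
MISSING = 1000    # occurrences yet to be detected, quoted in the text (approximate)

# Back-calculated quantities (implied by the figures above, not stated in the text).
implied_proteins = round(EXPECTED / COVERAGE)
implied_annotated = EXPECTED - MISSING
implied_current_coverage = implied_annotated / implied_proteins

print(f"implied proteome size: {implied_proteins}")
print(f"implied annotated occurrences: {implied_annotated}")
print(f"implied current coverage: {implied_current_coverage:.2f}")
```

Under these assumptions, the current coverage of P. falciparum would sit well below the 0.8 observed in classical model organisms.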
To the best of our knowledge, two studies address this problem. First, an a posteriori
correction of domain scores has been introduced by Coin et al.
]. This correction takes the prior probability of each domain family in the target species into account. Prior probabilities are estimated using asserted domain occurrences in the closest relatives of the species. A second approach is to build taxon-specific models by integrating known domain occurrences from the nearest species into the multiple sequence alignment. For example, this method has been successfully applied to fungi by Alam et al.
] thanks to the availability of 30 fungal genomes. However, both approaches have an obvious drawback: they can only discover new occurrences of domain families already asserted in the target or its closest relatives.
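To make the idea of a prior-based score correction concrete, the sketch below estimates family priors from domain counts in related species and shifts a bit score by the log-odds of that prior against a uniform prior. This is a generic Bayesian adjustment chosen for illustration, not the formula used in the cited study; the function names, the pseudocount and the toy counts are all assumptions.

```python
import math


def family_priors(relative_counts, all_families, pseudocount=1.0):
    """Prior probability of each family, from counts observed in related species."""
    total = sum(relative_counts.get(f, 0) for f in all_families)
    total += pseudocount * len(all_families)
    return {f: (relative_counts.get(f, 0) + pseudocount) / total
            for f in all_families}


def corrected_score(bit_score, family, priors):
    """Shift a bit score by log2(prior / uniform prior) for the family (assumed form)."""
    uniform = 1.0 / len(priors)
    return bit_score + math.log2(priors[family] / uniform)


# Hypothetical counts of asserted occurrences in the closest relatives.
families = ["famA", "famB", "famC"]
counts_in_relatives = {"famA": 8, "famB": 1}   # famC was never observed

priors = family_priors(counts_in_relatives, families)
for fam in families:
    print(fam, round(corrected_score(10.0, fam, priors), 2))
```

The toy output shows the drawback noted above: a family never asserted in the relatives (`famC`) receives only the pseudocount and is penalized, so such a correction cannot favour families that are new to the clade.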
Here, we propose two new approaches to circumvent this limitation by correcting the entire HMM library. The principle of these approaches is to learn overall correction rules which are applied to the emission probabilities of the match states of all HMMs. In the first approach, an amino-acid substitution matrix dedicated to the target organism is estimated and applied to the emission probabilities of the match states to mimic the evolution toward the amino-acid composition of the target species. Our second approach involves partitioning all match states of the Pfam library into clusters with similar amino-acid emission probabilities, and using the known domain occurrences in the target species to learn specific correction rules for each cluster of match states.
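The two corrections can be sketched as follows on a reduced three-letter alphabet. The sketch assumes that the substitution matrix is row-stochastic (entry `subst[a][b]` is the probability that `a` is replaced by `b`) and, purely for illustration, that the cluster-specific rules take the same matrix form; how the matrices and clusters are actually learned is not shown, and all names and numbers are hypothetical.

```python
def apply_substitution(emission, subst):
    """Push a match-state emission distribution through a substitution matrix."""
    alphabet = list(emission)
    corrected = {b: sum(emission[a] * subst[a][b] for a in alphabet)
                 for b in alphabet}
    total = sum(corrected.values())
    return {b: p / total for b, p in corrected.items()}   # renormalize


def nearest_cluster(emission, centroids):
    """Index of the centroid closest (squared Euclidean distance) to the emission."""
    def dist(c):
        return sum((emission[a] - c[a]) ** 2 for a in emission)
    return min(range(len(centroids)), key=lambda i: dist(centroids[i]))


def correct_match_state(emission, centroids, rules):
    """Second approach: apply the rule attached to the state's cluster."""
    return apply_substitution(emission, rules[nearest_cluster(emission, centroids)])


# Toy matrix drifting toward K and N, as might suit an A+T-rich genome.
subst = {"G": {"G": 0.8, "K": 0.1, "N": 0.1},
         "K": {"G": 0.0, "K": 0.9, "N": 0.1},
         "N": {"G": 0.0, "K": 0.1, "N": 0.9}}
identity = {a: {b: float(a == b) for b in "GKN"} for a in "GKN"}

g_rich = {"G": 0.7, "K": 0.2, "N": 0.1}
k_rich = {"G": 0.1, "K": 0.6, "N": 0.3}
centroids = [{"G": 0.8, "K": 0.1, "N": 0.1}, {"G": 0.1, "K": 0.5, "N": 0.4}]
rules = [subst, identity]          # one rule per cluster

first = apply_substitution(g_rich, subst)
second = correct_match_state(k_rich, centroids, rules)
print(first, second)
```

The first call applies one global matrix to every match state, whereas the second lets states that already resemble the target composition keep their emissions while others are shifted.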
Once a new HMM library has been built, it is used to detect new domain occurrences with low E-values. As explained above, the original Pfam library provides, with each HMM, a manually curated threshold which ensures very low false positive rates among the detected domains. However, after HMM correction, these thresholds can no longer be safely used. We propose a simple approach to estimate the False Discovery Rate (FDR) of the newly discovered domains of each corrected library. This procedure enables us to compare the results achieved by each correction method at equivalent FDR.
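As an illustration of how an FDR can be attached to an E-value cutoff, the sketch below uses a decoy-based estimate: the library is assumed to have been run both on the real proteome and on an equally sized set of decoy (for example shuffled) sequences, and hits on decoys are counted as false discoveries. This decoy scheme is an assumption made for the example and is not necessarily the procedure proposed here; the function names and E-values are hypothetical.

```python
def estimate_fdr(real_evalues, decoy_evalues, threshold):
    """Estimated FDR among real hits with E-value <= threshold (decoy-based)."""
    n_real = sum(e <= threshold for e in real_evalues)
    n_decoy = sum(e <= threshold for e in decoy_evalues)
    if n_real == 0:
        return 0.0
    return min(1.0, n_decoy / n_real)


def loosest_threshold(real_evalues, decoy_evalues, target_fdr):
    """Largest observed E-value cutoff whose estimated FDR stays within the target."""
    best = None
    for t in sorted(set(real_evalues)):
        if estimate_fdr(real_evalues, decoy_evalues, t) <= target_fdr:
            best = t
    return best


# Hypothetical E-values of new hits from one corrected library.
real_hits = [1e-6, 1e-4, 1e-3, 0.01, 0.1, 0.5]
decoy_hits = [0.05, 0.3, 0.4]

cutoff = loosest_threshold(real_hits, decoy_hits, target_fdr=0.2)
n_kept = sum(e <= cutoff for e in real_hits)
print(cutoff, n_kept)
```

Computing such a cutoff separately for each corrected library and then counting the retained hits is one way to compare correction methods at an equivalent FDR.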
In the following, we first review the previously described approaches to fit an HMM library to a target species. We then describe our own approaches, and present the statistical procedure for FDR estimation. The four correction methods are used to detect new domain occurrences in the P. falciparum
genome. In these experiments, we distinguish two cases depending on whether genomes close to the target organism are available or not. Finally, we use the corrected libraries to find additional domains with the Co-Occurrence Domain Discovery (CODD) procedure we have recently proposed [13
]. This procedure identifies divergent domain occurrences on the basis of co-occurrence properties, and uses its own procedure to estimate FDRs associated with the results. All predictions achieved with the corrected libraries have been integrated into a dedicated website and can be browsed at http://www.lirmm.fr/~terrapon/HMMfit/
. A program implementing the two proposed approaches is available at the same address.