In this paper we present the CORP algorithm, which was designed to tackle an important dilemma in many functional genetic studies: is a gene with an intact ORF necessarily functional? The answer is clearly negative, as mutations in promoters or other regulatory regions, as well as changes in crucial protein residues, may impair the gene's activity without any obvious sequence disruption. This issue is particularly relevant for human OR genes, a majority of which lost their function in recent human history [11]. Using CORP, we evaluated the probability that an OR gene encodes an active protein by examining its deviation from a consensus of functionally crucial OR sequence positions. It is important to note that CORP does not consider the functional consequence of each amino acid substitution in isolation, but rather the overall number and conservation level of the positions with intolerant mutations. Thus, the resulting pseudogene likelihood score (ψL) reflects the evolutionary status of the relevant OR gene, and is demonstrated here to be a very good predictor of functional status. Since both the conservation core and the SIFT matrix were characterized probabilistically, some false-positive signals may accrue. Such inaccuracies become less significant through the use of logistic regression analysis, which takes many other variables into account. An exception could be an OR that deviates from the conservation core by accumulating so-called intolerant mutations in order to acquire another function not yet identified.
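As an illustration of how a logistic model can combine many positions into one score, the following minimal sketch computes a ψL-like value from per-position indicators of intolerant substitutions. The function name, the weights and the intercept are hypothetical and chosen only for illustration; they are not the fitted CORP coefficients, and the actual model may include further variables.

```python
import math


def pseudogene_likelihood(intolerant_flags, weights, intercept):
    """Logistic score in [0, 1]; higher values mean more pseudogene-like.

    intolerant_flags[i] is True when the query carries an intolerant
    substitution at conserved position i; weights[i] is the regression
    coefficient assumed for that position.
    """
    if len(intolerant_flags) != len(weights):
        raise ValueError("one weight is needed per conserved position")
    # Sum the coefficients of the positions that carry an intolerant mutation.
    linear = intercept + sum(w for w, hit in zip(weights, intolerant_flags) if hit)
    return 1.0 / (1.0 + math.exp(-linear))


# Hypothetical values for illustration only (not the CORP coefficients).
weights = [2.0, 1.5, 0.5]
intercept = -3.0

conserved = pseudogene_likelihood([False, False, False], weights, intercept)
mutated = pseudogene_likelihood([True, True, False], weights, intercept)
print(f"no intolerant mutations: {conserved:.3f}")
print(f"two intolerant mutations: {mutated:.3f}")
```

Because the score depends on the weighted sum rather than on any single position, a sequence with several mildly weighted deviations can score as high as one with a single heavily weighted deviation.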
A key parameter in this algorithm is the accurate characterization of potentially deleterious amino acid substitutions at highly constrained positions along the OR protein sequence. Conserved sequence motifs of OR genes were previously characterized in various studies [9]. However, the delineation of these conservation elements was based on human OR genes, many of which have evolved under minimal selective constraints [12]. Therefore, these sequence motifs may not accurately reflect the functionally crucial positions. Another study, which compared two genome assemblies of the mouse to characterize conserved motifs in OR genes [26], is also inadequate for the present purpose, since its invariable motifs are likely masked by species-specific conservation. In contrast, we constructed an OR gene conservation profile by comparing both OR orthologs and paralogs of mouse and dog. These two species still rely on their sense of smell for survival, which increases the likelihood of positional conservation. Moreover, these two mammals are sufficiently divergent (~100 Mya) to allow better distinction between conserved and variable residues. It is therefore likely that the resulting conservation core is a good reflection of the functionally crucial mammalian OR positions.
The CORP algorithm is better at correctly identifying functional genes (95% success) than at predicting the inactivation of frame-disrupted pseudogenes (65% success). The failure to identify ~1/3 of the pseudogenes as non-functional can be rationalized by the observation that the large majority (>95%) of the misclassified pseudogenes had ≤ 3 frame disruptions in their sequences, suggesting that they are recently formed pseudogenes (Fig. 5). Such recently formed pseudogenes may not have had time to deviate sufficiently from their conservation core.
Figure 5 Frame disruption counts of human OR pseudogenes. The cumulative frequencies of OR pseudogenes with respect to their number of coding frame disruptions. Continuous line, ORs that are annotated by CORP as 'functional'; broken line, ORs that are annotated ...
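The two success rates quoted above are per-class rates: the fraction of intact genes called functional and the fraction of pseudogenes called non-functional. A minimal sketch of that calculation follows; the cutoff of 0.5 and the toy scores are assumptions for illustration only, as no cutoff is stated in this section.

```python
def class_success_rates(scores, is_pseudogene, threshold=0.5):
    """Return (intact genes called functional, pseudogenes called non-functional).

    A gene is called non-functional when its score is at or above the
    threshold. The default threshold is a hypothetical value.
    """
    intact = [s for s, pg in zip(scores, is_pseudogene) if not pg]
    pseudo = [s for s, pg in zip(scores, is_pseudogene) if pg]
    intact_ok = sum(s < threshold for s in intact) / len(intact)
    pseudo_ok = sum(s >= threshold for s in pseudo) / len(pseudo)
    return intact_ok, pseudo_ok


# Toy data for illustration only: three intact genes, three pseudogenes.
scores = [0.05, 0.10, 0.60, 0.20, 0.70, 0.90]
is_pseudogene = [False, False, False, True, True, True]
intact_rate, pseudo_rate = class_success_rates(scores, is_pseudogene)
print(f"intact genes called functional: {intact_rate:.2f}")
print(f"pseudogenes called non-functional: {pseudo_rate:.2f}")
```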
A previous study [2] assessed the conservation level of a gene via the Ka/Ks ratio, according to its divergence from its inferred ancestral sequence. The sequence in question is compared to its two closest homologs (one ortholog and one paralog). A low Ka/Ks value is taken as indicative of Darwinian purifying selection, and hence of functional importance. Applying this method to our training dataset revealed that it correctly identified 77% of the human pseudogenes and 74% of the mouse intact genes. While this method performed slightly better in detecting true pseudogenes (67% in our method), it was significantly worse in identifying intact genes (95% in our method). Furthermore, the receiver operating characteristic (ROC) curves of the two methods were compared (Fig. 6), indicating a significant advantage for our method.
Figure 6 Receiver operating characteristic (ROC) curves for CORP and Ka/Ks. The OR pseudogene classification efficiency is indicated by the false positive/true positive ratio. The larger area under the continuous line (93.7% vs. 84.4%) suggests that our method performs better than ...
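The area under a ROC curve can be read as the probability that a randomly chosen pseudogene receives a higher score than a randomly chosen intact gene. The sketch below computes it with this pairwise (rank-based) formulation, which is equivalent to integrating the ROC curve; the function name and the toy scores are hypothetical and are not taken from the study's data.

```python
def roc_auc(positive_scores, negative_scores):
    """Area under the ROC curve via pairwise comparison.

    Counts the fraction of (positive, negative) pairs in which the positive
    example scores higher; ties count as half a win.
    """
    if not positive_scores or not negative_scores:
        raise ValueError("both classes need at least one score")
    wins = 0.0
    for p in positive_scores:
        for n in negative_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positive_scores) * len(negative_scores))


# Toy scores for illustration only.
pseudogene_scores = [0.9, 0.8, 0.4]
intact_scores = [0.1, 0.3, 0.5]
auc = roc_auc(pseudogene_scores, intact_scores)
print(f"AUC = {auc:.3f}")
```

An area of 1.0 corresponds to perfect separation of the two classes and 0.5 to a classifier that does no better than chance, which is the scale on which the two methods are compared.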
Another method [27] compares the query sequence to a consensus motif from the Pfam database [28] and calculates whether the deviations from the motif are consistent with a neutral drift model. This algorithm (PSILC), like ours, is based on sequence conservation signals. However, because it utilizes a specific Pfam domain (7TM1) from which ORs deviate considerably, it classifies a large majority of intact ORs as pseudogenes. This situation could potentially be improved by a future definition of an OR-specific 7TM Pfam domain. Another potential problem with the application of PSILC to OR sequences is that OR genes are subject to positive selection [25], which may lead to the misclassification of functional genes as pseudogenes [27]. The new version of PSILC, which addresses this issue (R. Durbin, private communication), could alleviate this problem. In summary, we have demonstrated that the CORP algorithm is an effective means for in silico OR pseudogene identification. It is likely that the same procedure will be applicable to other gene families with similar evolutionary features (e.g. taste or vomeronasal receptor genes). In contrast, for small gene families or single genes it might be preferable to use one of the other existing pseudogene annotation methods.
The ultimate validation for CORP would be experimental examination of the activity of putatively active and inactive OR genes by expression methodologies. Recently, Gaillard et al. [30] demonstrated that individual amino acid substitutions can abolish the function of a particular OR gene. In this experiment they examined the activity of OR 912–93 from several species (OR5G1P in human) and found that it is inactive in orangutan and human despite an intact open reading frame (in human, they corrected the existing single in-frame stop codon). Applying CORP to the OR sequences of these two species revealed that both were predicted to be non-functional, with ψL = 0.76 and ψL = 0.67 for human and orangutan, respectively. In contrast, the sequences of the active ORs in the other 6 species from this study received very low pseudogene likelihood scores by our method (an average of ψL = 0.06), suggesting that they are functional. Interestingly, the function of the two inactive receptors was restored by reinstating the highly conserved arginine of the DRY motif (located at the interface of TM3 and the 2nd intracellular loop), which is common to many GPCRs and is one of the 60 conserved residues in our conservation matrix. When we introduced the same His->Arg (orangutan) and Cys->Arg (human) corrections into the OR sequences of these species, they were predicted as functional by our algorithm, with pseudogene likelihood scores of ψL = 0.15 and ψL = 0.10 for human and orangutan, respectively. This demonstrates the ability of CORP to distinguish between functional and non-functional ORs even if they differ by only one amino acid residue, and provides a limited experimental validation. Despite this supporting evidence for the validity of our algorithm, further studies would help to assess and improve its prediction efficacy.
The validation of functional activity could be based on a number of roles ascribed to OR proteins. The most widely used assay measures odorant responses [15], but other functional roles include plasma membrane targeting [31], the protein-mediated negative feedback mechanism that underlies clonal exclusion of OR expression [32], and axonal guidance in olfactory bulb glomerular targeting [34]. Obviously, an OR may become inactivated by mutations at sites related to one or more of the above functions; in other words, inactive ORs may still show undisturbed odorant binding. An advantage of the presently proposed sequence-based functional classification is that it is global, namely that it will show a high value of ψL irrespective of the site or mode of inactivation.
Another major benefit of CORP is its ability to differentiate between functional and non-functional alleles. Here we used this capacity to predict the potential dichotomous functional status of 30 OR allele pairs in the human population. These more than double the known count of OR segregating pseudogenes (SPGs) in the human genome [16], providing additional ground for future genetic studies [35]. Interestingly, 15 of these segregating OR loci included a polymorphism in the conserved Arg130. This residue is part of the highly conserved MAYDRY motif [36]. The relatively high number of polymorphisms in Arg130 has previously been attributed to its being less functionally important than its neighboring conserved residues (e.g. A129) and hence less constrained by evolutionary selection [26]. However, in our logistic regression analysis this residue received the highest coefficient weight in the comparison of functional and non-functional OR genes, suggesting that other biological mechanisms are responsible for the highly polymorphic nature of this residue.