The increasing amount of data from genome-wide experiments (genome sequencing, transcriptomic, proteomic, and interactomic data), together with the parallel development of novel bioinformatics tools, has led to a remarkable improvement in our knowledge of the biology of malaria parasites. In particular, "in silico" approaches allow the annotation of previously uncharacterised proteins [41], the identification of possible transcription start sites [43], as well as candidates for transcription factor binding sites [44]. In addition, efforts have been devoted to structural genomics experiments (http://www.thesgc.org) with the aim of identifying novel targets for drug and/or vaccine development.
In this framework we carried out an "in silico" study on RIFINs, the most abundant multigene family in the P. falciparum genome, whose products are potentially involved in host-parasite interactions. In 3D7 the family numbers about 159 members, and this number varies in other clones. In fact, the subtelomeric regions where these genes are located are subject to frequent recombination events, leading to high variability between genomes. Recently, additional sequencing data have become available for other P. falciparum clones such as HB3 and Dd2, which differ in geographical origin and phenotypic characters.
In this work we exploited genome sequence data to carry out a comparative analysis of RIFIN repertoires between 3D7, HB3 and Dd2. Comparisons were carried out by means of MDS on coding as well as upstream and downstream regions. We found that corresponding sequences have a clear cluster structure which is maintained in the three examined clones. Furthermore, when we compared the observed occurrences of all 12 possible combinations ups-cds-dwn with those expected on the basis of a simple probabilistic model, we found very similar distributions of subsets of such combinations in all three genomes. In addition, despite the high recombination rate of subtelomeric regions and hence the high expected sequence variability, the majority of genes are conserved between clones (i.e. it is possible to identify pairs or triples of orthologs), as is their cluster organisation. Our results confirm recent studies by Wang et al. [47], who recognised diverse groups of sequences and demonstrated that subsets of genes are highly conserved across genomes. In addition, in the case of upstream and coding sequences we identified several outliers, while few or none were found for downstream sequences. Since outliers may be interpreted as novel sequence variants generated by genetic drift, their high numbers in 5' upstream and coding sequences, compared to the low number in 3' downstream sequences, indicate that the different portions of the genes diverge to different extents.
All these data may be interpreted as a consequence of a balance between drift and homogenisation mechanisms acting on these subtelomeric genes. On the one hand, this balance guarantees the emergence of novel gene variants; on the other hand, it preserves the functionality of the diverse parts of genes (including gene products) and the overall organisation of the entire repertoire.
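The comparison between observed and expected ups-cds-dwn combinations can be illustrated with a minimal sketch. It assumes that the "simple probabilistic model" treats the cluster memberships of the three regions as independent, so that the expected count of a combination is the product of the marginal frequencies multiplied by the number of genes; this independence assumption, the cluster labels (U1, C1, D1, ...) and the toy counts are all hypothetical and are not taken from the analysis itself.

```python
from collections import Counter
from itertools import product


def expected_vs_observed(genes):
    """Return {(ups, cds, dwn): (observed, expected)} for every combination.

    genes: list of (ups, cds, dwn) cluster labels, one tuple per gene.
    Expected counts assume the three labels are independent (an assumption).
    """
    n = len(genes)
    # Marginal counts of each cluster label, one Counter per region.
    ups = Counter(g[0] for g in genes)
    cds = Counter(g[1] for g in genes)
    dwn = Counter(g[2] for g in genes)
    observed = Counter(genes)
    table = {}
    for u, c, d in product(sorted(ups), sorted(cds), sorted(dwn)):
        # Under independence, P(u, c, d) = P(u) * P(c) * P(d).
        expected = n * (ups[u] / n) * (cds[c] / n) * (dwn[d] / n)
        table[(u, c, d)] = (observed[(u, c, d)], expected)
    return table


# Hypothetical toy repertoire: 2 upstream x 3 coding x 2 downstream clusters,
# which gives 12 possible combinations.
genes = (
    [("U1", "C1", "D1")] * 4
    + [("U1", "C2", "D1")] * 2
    + [("U2", "C3", "D2")] * 3
    + [("U2", "C1", "D2")] * 1
)
table = expected_vs_observed(genes)
for combo, (obs, exp) in table.items():
    print(combo, obs, round(exp, 2))
```

Combinations whose observed count is well above the expected one are over-represented relative to independent assortment; a formal comparison would require a statistical test, which this sketch does not include.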
In the second part of the work we examined the amino acid sequences of RIFINs in the 3D7 clone. It is already known [14] that RIFINs can be grouped into two subfamilies: RIF_As and RIF_Bs. The main difference between their members is an insert of 25 aa which is present only in RIF_As. In the last few years it has been proposed that, despite these differences, RIF_As and RIF_Bs share a similar architectural organisation: a signal peptide at the N-terminus; a PEXEL motif; and two transmembrane domains, the second of which is C-terminally located. In this work, we carried out a detailed analysis of all the 159 bona fide amino acid sequences of RIFINs in 3D7 and submitted every sequence to signal peptide and transmembrane domain predictors [31]. Interestingly, while the structural organisation of RIF_Bs corresponds to that proposed previously, for the majority of RIF_As no signal peptide and only one transmembrane domain at the C-terminus were predicted, and hence we proposed different structural architectures for members of the two subfamilies. This is in accordance with Petter et al. [17], who demonstrated that RIF_As and RIF_Bs have different sub-cellular localisations. During the intraerythrocytic stages of the life cycle of P. falciparum, only RIF_As are exported outside the parasite cell, while RIF_Bs remain confined within the PV. In addition, our results suggest that a canonical signal peptide is necessary to target RIFINs to the PVM or to other sub-cellular compartments, whereas alternative signals are required for translocation outside the parasite cell, as demonstrated at least by the other two antigenic proteins PfEMP1 and PfEMP2 [35].
Since RIF_As are those likely to be involved in host-parasite protein interactions, we constructed a 3D model for the portion of the protein between the putative PEXEL cleavage site [36] and the N-terminus of the C-terminal TM. To do this we applied an ab initio procedure starting from the output of the I-TASSER algorithm [20]. Taking advantage of the high number of RIFIN sequences, we developed a strategy to determine the most reliable 3D model for RIF_As using a subset of 53 non-redundant RIF_A sequences. Five 3D models were constructed by I-TASSER for each of the 53 sequences. When all 265 models were then compared using MDS, we observed that they clustered into three groups, the main one of which contains 177/265 predicted structures, with at least one structure predicted for each RIF_A family member.
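To make the MDS step concrete, the following is a minimal sketch of classical (Torgerson) MDS applied to a precomputed matrix of pairwise dissimilarities. It is not the implementation used in this work: the choice of the classical variant, the function name `classical_mds`, and the toy input are assumptions made for illustration only. For structural models the input would typically be a matrix of pairwise structural distances between the predicted models.

```python
import numpy as np


def classical_mds(dist, n_components=2):
    """Embed points in n_components dimensions from pairwise distances."""
    d = np.asarray(dist, dtype=float)
    n = d.shape[0]
    # Double-centre the squared distances: B = -1/2 * J * D^2 * J.
    j = np.eye(n) - np.ones((n, n)) / n
    b = -0.5 * j @ (d ** 2) @ j
    # eigh returns eigenvalues in ascending order; keep the largest ones.
    vals, vecs = np.linalg.eigh(b)
    order = np.argsort(vals)[::-1][:n_components]
    # Negative eigenvalues can arise for non-Euclidean input; clip them to 0.
    top_vals = np.clip(vals[order], 0.0, None)
    return vecs[:, order] * np.sqrt(top_vals)


# Hypothetical toy input: the four corners of a unit square.
pts = np.array([[0.0, 0.0], [1.0, 0.0], [1.0, 1.0], [0.0, 1.0]])
dist = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)

coords = classical_mds(dist)
# Distances in the embedding should reproduce the input distances.
recovered = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
print(np.allclose(recovered, dist))
```

The embedding is defined only up to rotation and reflection, so cluster membership and inter-point distances are meaningful in the MDS plane, while absolute coordinates are not.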
In order to establish the most reliable structures for RIF_As, we selected 24 models within a radius of 0.015 from the centroid of the main cluster in the MDS plane. These structures were submitted to standard methods (PROCHECK, PROSA) for assessing model quality. The best models, PFF0015c_3 and PFL2660w_5, were chosen as representatives and then analysed to try to gain insights into RIF_A function. We found that both structures strongly resemble the "Armadillo-like" fold [40]. This fold is characterised by an arrangement of alpha-helices which form a wide cleft with an extensive solvent-accessible surface and is particularly suited to binding large substrates. In fact, this fold has been found in a wide range of proteins involved in very diverse cellular processes in which protein-protein interactions play an essential role. In particular, the structure matched by PFF0015c_3 is the Tog domain from C. elegans gene Zyg9 (2of3). These domains are found in members of the XMAP215/Dis1 family of microtubule-associated proteins (MAPs), which are essential for microtubule growth and probably bind tubulin dimers and promote microtubule polymerization [48]. The structure matched by PFL2660w_5 (2f31) is the N-terminal regulatory domain of Diaphanous-related formins (DRFs), which regulate the nucleation and polymerisation of unbranched actin filaments [49].
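The selection of models lying close to the centroid of the main cluster in the MDS plane, described at the start of this paragraph, can be sketched as follows. Only the radius of 0.015 comes from the text; the function name `select_near_centroid`, the model identifiers and the coordinates are hypothetical, and the centroid is assumed to be the arithmetic mean of the cluster members' coordinates.

```python
from math import dist  # available from Python 3.8


def select_near_centroid(points, radius):
    """Return (centroid, names) for points within `radius` of the centroid.

    points: dict mapping a model name to its (x, y) MDS coordinates.
    """
    n = len(points)
    dims = len(next(iter(points.values())))
    # Centroid taken as the mean coordinate of all cluster members.
    centroid = tuple(sum(p[i] for p in points.values()) / n for i in range(dims))
    selected = [name for name, p in points.items() if dist(p, centroid) <= radius]
    return centroid, selected


# Hypothetical MDS coordinates for six models of one cluster.
cluster = {
    "m1": (0.01, 0.0),
    "m2": (-0.01, 0.0),
    "m3": (0.0, 0.01),
    "m4": (0.0, -0.01),
    "m5": (0.04, 0.0),
    "m6": (-0.04, 0.0),
}
centroid, selected = select_near_centroid(cluster, radius=0.015)
print(centroid, selected)
```

Models retained in this way would then be passed to quality-assessment tools; that step is outside the scope of this sketch.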
To our knowledge, these data represent the first attempt to propose a structural model for the RIF_A proteins of P. falciparum based on an ab initio approach applied to the entire gene family and complemented by an MDS-based assessment of the similarities amongst the resulting 3D predictions. Importantly, these results predict a protein fold which suggests that RIF_As may participate in protein-protein interactions. Further work will be needed to establish the cell compartments where this domain is accessible for such interactions, and to identify the host and/or parasite partners involved.