Genetic recombination describes the generation of new combinations of alleles that occurs at each generation in diploid organisms. It is an important biological process and results from a physical exchange of chromosomal material (1). As a main driving force of evolution, recombination provides new combinations of genetic variations and accelerates the evolution of sexually reproducing organisms. A schematic illustration of the meiotic recombination pathways is given in Figure 1.
Figure 1. A schematic drawing to show the meiotic recombination pathways in a DNA system. Recombination is initiated by a double-strand break (DSB) catalysed by the Spo11 protein (green ball), a relative of archaeal topoisomerase VI. After DSBs are formed, Spo11 ...
As recombination is crucial to genome evolution, the identification and characterization of recombination spots are of substantial importance. In the past decades, several global mapping studies have been performed to map double-strand break sites on yeast chromosomes and thereby determine the distribution pattern of recombination regions across the genome (3–5). These studies found that meiotic recombination events generally concentrate in kilobase-scale regions and do not occur randomly across the genome. Regions that exhibit elevated rates of recombination relative to a neutral expectation are called recombination hotspots, whereas those with low rates of recombination are called recombination coldspots. The same studies also found that recombination regions do not share a consensus sequence. With the rapidly increasing number of sequenced genomes, it is highly desirable to develop reliable automated methods for the timely identification of recombination spots.
Although considerable progress has been made in this regard, the accuracy of computational prediction of recombination spots still needs further improvement. The existing computational algorithm for recombination spot prediction was based on nucleotide sequence content (6), in which little sequence-order effect was taken into account. To improve the prediction quality, it is necessary to take this kind of effect into account. However, the number of possible patterns for DNA sequences is extremely large, and their lengths vary widely, making it difficult to incorporate sequence-order information into a statistical predictor. Facing such a difficulty, how can we take the sequence-order effect into account to improve the prediction quality? If it is not feasible to capture all the sequence-order information, can we find an approximate way to take it into account partially? Similar problems were also encountered in computational proteomics. To cope with this kind of problem, the concept of pseudo amino acid composition (PseAAC) was proposed by Chou (7). Since then, the concept of PseAAC has been applied in almost all fields of computational proteomics, such as predicting protein submitochondrial localization (8), predicting protein structural class (9), predicting DNA-binding proteins (10), identifying bacterial virulent proteins (11), predicting metalloproteinase family (12), predicting protein folding rate (13), predicting GABA(A) receptor proteins (14), predicting protein supersecondary structure (15), predicting cyclin proteins (16), classifying amino acids (17), predicting enzyme family class (18), identifying risk type of human papillomaviruses (19), predicting allergenic proteins (20), identifying G protein-coupled receptors and their types (21) and discriminating outer membrane proteins (22), among many others [see the long list of references cited in a review (23)]. Because of its wide and increasing usage, a powerful software package called PseAAC-Builder (http://www.pseb.sf.net) was established in 2012 for generating various special modes of PseAAC, in addition to the earlier web-server PseAAC (http://www.csbio.sjtu.edu.cn/bioinf/PseAAC) built in 2008.
Encouraged by the successes of introducing the PseAAC approach (7) into computational proteomics, we initiated the present study to propose a novel feature vector, called ‘pseudo dinucleotide composition’ (PseDNC), that represents DNA sequence samples by incorporating more sequence-order effects so as to improve the quality of predicting recombination spots.
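To make the notion of a sequence-order effect concrete, the following minimal Python sketch compares two toy DNA sequences that have identical nucleotide composition but different dinucleotide composition. It is an illustration only and is not the PseDNC formulation itself; the function names and the two example sequences are hypothetical and chosen purely for demonstration.

```python
from collections import Counter
from itertools import product

NUCLEOTIDES = "ACGT"


def nucleotide_composition(seq):
    """Normalized frequency of each nucleotide, in A, C, G, T order."""
    counts = Counter(seq)
    return [counts[n] / len(seq) for n in NUCLEOTIDES]


def dinucleotide_composition(seq):
    """Normalized frequency of the 16 overlapping dinucleotides (AA, AC, ..., TT)."""
    pairs = [seq[i:i + 2] for i in range(len(seq) - 1)]
    counts = Counter(pairs)
    return [counts[a + b] / len(pairs) for a, b in product(NUCLEOTIDES, repeat=2)]


# Hypothetical toy sequences: same bases, different order.
seq_a = "AACCGGTT"
seq_b = "ACGTACGT"

# Composition of single nucleotides cannot tell the two sequences apart...
same_mono = nucleotide_composition(seq_a) == nucleotide_composition(seq_b)
# ...whereas counting adjacent pairs captures part of the local order.
same_di = dinucleotide_composition(seq_a) == dinucleotide_composition(seq_b)

print(same_mono, same_di)  # prints: True False
```

Both sequences contain each nucleotide twice, so a predictor based only on nucleotide composition would treat them as the same sample, whereas the 16-dimensional dinucleotide vector already separates them by capturing the order of adjacent bases.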
As summarized in a review (23) and demonstrated by a series of recent publications [see, e.g. (27–29)], to establish a truly useful statistical predictor for a biological system, we need to consider the following procedures: (i) construct or select a valid benchmark data set to train and test the predictor; (ii) formulate the biological samples with an effective mathematical expression that can truly reflect their intrinsic correlation with the target to be predicted; (iii) introduce or develop a powerful algorithm (or engine) to operate the prediction; (iv) properly perform cross-validation tests to objectively evaluate the anticipated accuracy of the predictor; and (v) establish a user-friendly web-server for the predictor that is accessible to the public. Below, we describe how to deal with these procedures one by one.