Development of the SynOrths tool to identify gene pairs with syntenic relationships
A tool named SynOrths was developed to identify syntenic genes based on the protein sequences of B. rapa and other related species (http://brassicadb.org/brad/tools/SynOrths/). As shown in Figure 1, SynOrths determines two genes to be syntenic orthologs based on both their own sequence similarity and the homology of their flanking genes.
Figure 1 The principles of syntenic gene identification in SynOrths. When determining whether two genes are syntenic, both the sequence homology of the two genes themselves and that of their flanking genes are considered. (A) Syntenic genes in the same direction ...
There are four main steps in SynOrths: (1) finding orthologous gene pairs; (2) removing redundant tandem genes; (3) locating potential syntenic orthologs with the support of flanking genes; and (4) determining the final syntenic gene pairs.
In the first step, SynOrths runs Blastp to obtain basic protein sequence homology information between the two genomes. Gene pairs that are best hits or that have Blastp E-values <1E-20 are selected for further analysis.
In the second step, tandem duplicated genes, which would add complexity to syntenic gene finding, are reduced: SynOrths keeps one gene from each tandem gene array as a representative. We identify all tandem gene arrays in the genomes being compared. Each tandem array is composed of consecutively located homologous genes (Blastp E-value <1E-20) and must not be interrupted by more than one non-homologous gene. The genes of each tandem array are then replaced by the first gene in that array.
The revised homologous gene pairs are then passed to the third step, which computes the supporting strength of the flanking genes. Here, we set a threshold to check whether the gene pair in question is supported by its flanking genes and is thus potentially syntenic. The genes located in both flanking regions of the two genes are selected as the flanking gene set. We then count the number of best-hit genes between the two genomes within the flanking gene set. If the ratio of best-hit genes is higher than the threshold, the homologous gene pair is considered potentially syntenic.
In the fourth step, we screen for the best syntenic gene pairs. After the first three steps, a gene might have more than one potential syntenic partner, so the candidate syntenic gene pairs are compared based on their ratio of flanking gene support and their sequence homology. The gene pair with the highest supporting ratio of flanking genes and comparably higher sequence homology is finally determined as the best syntenic pair.
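Steps 2 and 3 can be illustrated with a minimal Python sketch. It is not the SynOrths source code: the function names, the data layout (gene IDs in chromosomal order, a dictionary of best hits), and the choice of the query flanking genes as the denominator of the support ratio are all assumptions made for illustration, since the text does not specify them.

```python
from typing import Callable, Dict, List


def collapse_tandem_arrays(
    genes: List[str],
    is_homolog: Callable[[str, str], bool],
    max_gap: int = 1,
) -> Dict[str, str]:
    """Map each gene to the first gene of its tandem array (step 2).

    `genes` lists one chromosome in positional order. A gene joins an
    array when it is homologous to a gene at most `max_gap` intervening
    genes upstream, so one non-homologous interruption is tolerated.
    """
    representative: Dict[str, str] = {}
    for i, gene in enumerate(genes):
        representative[gene] = gene  # default: the gene stands alone
        lowest = max(i - 1 - max_gap, 0)
        for j in range(i - 1, lowest - 1, -1):
            if is_homolog(gene, genes[j]):
                representative[gene] = representative[genes[j]]
                break
    return representative


def flank(order: List[str], gene: str, n: int) -> List[str]:
    """Return up to `n` genes on each side of `gene` in `order`."""
    i = order.index(gene)
    return order[max(i - n, 0):i] + order[i + 1:i + 1 + n]


def flank_support(
    order_q: List[str],
    order_r: List[str],
    gene_q: str,
    gene_r: str,
    best_hit: Dict[str, str],
    num_q: int,
    num_r: int,
) -> float:
    """Fraction of query flanking genes whose best hit lies in the
    reference flanking set (step 3). The denominator is an assumption."""
    flank_q = flank(order_q, gene_q, num_q)
    flank_r = set(flank(order_r, gene_r, num_r))
    if not flank_q:
        return 0.0
    supported = sum(1 for g in flank_q if best_hit.get(g) in flank_r)
    return supported / len(flank_q)


# Toy data: three of the four query flanking genes have a best hit nearby.
order_q = ["q1", "q2", "q3", "q4", "q5"]
order_r = ["r1", "r2", "r3", "r4", "r5"]
best_hit = {"q1": "r1", "q2": "r2", "q4": "r4"}
ratio = flank_support(order_q, order_r, "q3", "r3", best_hit, num_q=2, num_r=2)
is_candidate = ratio > 0.2  # compare against a RatioQR-style threshold
```

In step 4, candidates for the same gene could then be ranked by this ratio, with sequence homology used to break ties; how the two criteria are weighted is not stated in the text, so it is left out of the sketch.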
There are three main parameters that should be considered when using SynOrths: NumQ, the number of flanking genes on each side of the query gene; NumR, the number of flanking genes on each side of the reference gene; and RatioQR, the ratio of best-hit pairs among these flanking genes. Because B. rapa experienced whole genome triplication and subsequent intensive gene loss, the parameters should be carefully selected when using SynOrths to determine its syntenic genes with other species. Here, we chose a set of values for each of the three parameters (NumQ [5, 20, 60, 100], NumR [10, 40, 100, 150], and RatioQR [0.1, 0.2, 0.4, 0.8]) to perform SynOrths analysis between B. rapa and A. thaliana. As shown in Figure 2, the parameters NumQ = 20, NumR = 100, RatioQR = 0.2 gave a considerably better result. This parameter set was then chosen to identify syntenic gene pairs between B. rapa and A. thaliana, A. lyrata, and T. parvula.
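The parameter scan can be sketched as a simple grid enumeration in Python. The text does not state whether every combination of the three value sets was run, so the full 4 x 4 x 4 grid below is an assumption, and `select_best` with its `score` callback is a hypothetical helper: the criterion behind "considerably better" is not given in the text.

```python
from itertools import product
from typing import Callable, Dict, List

# Value sets taken from the text.
NUM_Q = [5, 20, 60, 100]
NUM_R = [10, 40, 100, 150]
RATIO_QR = [0.1, 0.2, 0.4, 0.8]


def parameter_grid() -> List[Dict[str, float]]:
    """Enumerate every (NumQ, NumR, RatioQR) combination (assumed full grid)."""
    return [
        {"NumQ": q, "NumR": r, "RatioQR": t}
        for q, r, t in product(NUM_Q, NUM_R, RATIO_QR)
    ]


def select_best(
    grid: List[Dict[str, float]],
    score: Callable[[Dict[str, float]], float],
) -> Dict[str, float]:
    """Return the parameter set with the highest score.

    `score` is hypothetical: it stands for whatever quality measure is
    computed from one SynOrths run with the given parameters.
    """
    return max(grid, key=score)


grid = parameter_grid()
chosen = {"NumQ": 20, "NumR": 100, "RatioQR": 0.2}  # set reported in the text
```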
Figure 2 Parameter estimation in SynOrths. The number of query (B. rapa) flanking genes [5, 20, 60, 100], the number of reference (A. thaliana) flanking genes [10, 40, 100, 150], and the threshold of the flanking genes' support ratio [0.1, 0.2, 0.4, 0.8] were ...
Syntenic gene determination between B. rapa and each of A. thaliana, A. lyrata, and T. parvula
Syntenic gene pairs between B. rapa and A. thaliana, B. rapa and A. lyrata, and B. rapa and T. parvula were identified using SynOrths. There were a total of 41,174, 27,379, 33,410, and 28,910 annotated proteins for B. rapa, A. thaliana, A. lyrata, and T. parvula, respectively. After removing the redundancy of tandem duplicated genes (keeping one gene from each tandem array), 38,161, 24,939, 30,773, and 27,344 genes were left for syntenic gene determination (Table 1). In B. rapa, 30,615 genes were syntenic to 18,410 genes in A. thaliana; 30,250 genes were syntenic to 18,125 A. lyrata genes; and 29,473 genes were syntenic to 17,303 T. parvula genes. A. thaliana had the highest syntenic gene ratio (80.1%), compared with A. lyrata (79.6%) and T. parvula (77.5%).
Table 1 The homologous relationships of genes between B. rapa and A. thaliana, A. lyrata, or T. parvula.
The genome triplication event in B. rapa was well supported (Figure 3), because many genes that were evenly distributed across the genomes of A. thaliana (14.3%), A. lyrata (14.6%), and T. parvula (16.1%) had three syntenic copies in B. rapa. Additionally, among the 7,546 B. rapa genes that had no syntenic orthologs in A. thaliana, 849 were syntenic to genes in A. lyrata, and 1,416 were syntenic to T. parvula genes. In total, there were 32,310 B. rapa genes with at least one syntenic ortholog in A. thaliana, A. lyrata, or T. parvula, and only 5,851 B. rapa genes that had no syntenic counterparts in any of the other species.
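The counts in this paragraph can be cross-checked with a few lines of arithmetic. The sketch below assumes that all of the counts refer to the same set of 38,161 tandem-reduced B. rapa genes; the overlap of 570 genes is not reported in the text and is derived here by inclusion-exclusion under that assumption.

```python
# Counts reported in the text (tandem-reduced B. rapa gene set).
total_genes = 38_161
syntenic_to_at = 30_615      # syntenic to A. thaliana
syntenic_to_any = 32_310     # syntenic to at least one of the three species
rescued_by_al = 849          # no A. thaliana ortholog, but syntenic to A. lyrata
rescued_by_tp = 1_416        # no A. thaliana ortholog, but syntenic to T. parvula

# Genes without a syntenic ortholog in A. thaliana.
not_in_at = total_genes - syntenic_to_at

# Genes without a syntenic counterpart in any of the three species.
not_in_any = total_genes - syntenic_to_any

# Genes gained by adding A. lyrata and T. parvula to A. thaliana.
rescued_total = syntenic_to_any - syntenic_to_at

# Inclusion-exclusion: genes counted in both rescued groups (derived value).
rescued_by_both = rescued_by_al + rescued_by_tp - rescued_total
```

Under the stated assumption, `not_in_at` and `not_in_any` reproduce the 7,546 and 5,851 genes given in the text.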
Figure 3 Syntenic genes identified by SynOrths between B. rapa and A. thaliana, A. lyrata, or T. parvula. For each segment in A. thaliana, A. lyrata, or T. parvula, there were three syntenic copies observed in B. rapa, which clearly reflected the genome triplication ...
Genes that were non-syntenic between B. rapa and the other species were considered non-syntenic orthologs if they shared sequence identity >70% and the coverage of each of the two genes was >60%. B. rapa had 1,391, 1,226, and 1,605 non-syntenic orthologs corresponding to 2,561 genes in A. thaliana, 1,877 in A. lyrata, and 3,909 in T. parvula, respectively. However, of the 5,851 genes that had no syntenic orthologs in any of the three species, only 808 were non-syntenic orthologs of at least one gene in the other three species. These 808 genes could have been generated by gene transposition in B. rapa after its divergence from A. thaliana, A. lyrata, and T. parvula.
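The identity and coverage criteria can be written as a small filter. This is a sketch under stated assumptions: the `Hit` record and its field names are hypothetical, and coverage is taken to be the aligned length divided by the full protein length, which the text does not define explicitly.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """One pairwise protein alignment (hypothetical record layout)."""
    query: str
    subject: str
    identity: float      # percent identity of the alignment
    aligned_query: int   # aligned residues on the query protein
    aligned_subject: int # aligned residues on the subject protein
    query_length: int    # full length of the query protein
    subject_length: int  # full length of the subject protein


def is_nonsyntenic_ortholog(
    hit: Hit,
    min_identity: float = 70.0,
    min_coverage: float = 0.60,
) -> bool:
    """Apply the thresholds from the text: identity >70% and
    coverage >60% on each of the two genes."""
    cov_query = hit.aligned_query / hit.query_length
    cov_subject = hit.aligned_subject / hit.subject_length
    return (
        hit.identity > min_identity
        and cov_query > min_coverage
        and cov_subject > min_coverage
    )


# Toy hits with made-up gene IDs.
good = Hit("geneA", "geneB", 85.0, 300, 310, 400, 420)
short = Hit("geneC", "geneD", 92.0, 100, 100, 400, 120)   # query coverage 25%
weak = Hit("geneE", "geneF", 65.0, 390, 395, 400, 400)    # identity too low
kept = [h.query for h in (good, short, weak) if is_nonsyntenic_ortholog(h)]
```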
Most of the tandem arrays in B. rapa showed a syntenic relationship to A. thaliana, A. lyrata, or T. parvula (Table 2). Of the 2,137 tandem arrays in B. rapa, 1,649 (77.16%) were syntenic to A. thaliana, 1,751 (81.94%) were syntenic to A. lyrata, and 1,689 (79.04%) were syntenic to T. parvula. In total, 1,864 (87.23%) tandem arrays in B. rapa had syntenic counterparts in at least one of the other species.
Table 2 Syntenic tandem genes between B. rapa and A. thaliana, A. lyrata, or T. parvula.
The dataset of syntenic gene relationships among the above four species has been integrated into BRAD (Brassica Database, http://brassicadb.org/brad/searchBrMultiSynteny.php) (Cheng et al., 2011). This resource builds bridges between the model plant A. thaliana and other Brassicaceae species, so that information from gene function studies in A. thaliana is linked to the newly sequenced and annotated Brassicaceae genomes. For crop species such as B. rapa, B. oleracea, and B. napus, the resource of syntenic relationships allows knowledge from A. thaliana to be transferred rapidly to breeding research and application, and further to crop production.