1.  Mutation rate analysis via parent–progeny sequencing of the perennial peach. I. A low rate in woody perennials and a higher mutagenicity in hybrids 
Mutation rates vary between species, between strains within species and between regions within a genome. What are the determinants of these forms of variation? Here, via parent–offspring sequencing of the peach, we ask whether (i) woody perennials tend to have lower per-unit-time mutation rates than annuals, and (ii) hybrid strains have high mutation rates. From a leaf of a low-heterozygosity individual, derived from an intraspecific cross, to a leaf of its selfed progeny, the mutation rate is 7.77 × 10^−9 point mutations per bp per generation, similar to that of Arabidopsis thaliana (7.0–7.4 × 10^−9 point mutations per bp per generation). This suggests a low per-unit-time mutation rate, as the generation time is much longer in peach. This is supported by our estimate of 9.48 × 10^−9 point mutations per bp per generation from a 200-year-old low-heterozygosity peach to its progeny. From a more highly heterozygous individual derived from an interspecific cross to its selfed progeny, the mutation rate is 1.38 × 10^−8 mutations per site per generation, consistent with raised rates in hybrids. Our data thus suggest that (i) peach has an approximately order-of-magnitude lower mutation rate per unit time than Arabidopsis, consistent with reports of low evolutionary rates in woody perennials, and (ii) hybridization may, indeed, be associated with increased mutation rates, as considered over a century ago.
PMCID: PMC5095371  PMID: 27798292
peach; mutation rate; generation time; heterozygosity
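The per-unit-time argument in the abstract above amounts to dividing a per-generation rate by a generation time. The following is a minimal sketch of that arithmetic. The per-generation rates are the ones quoted in the abstract; the generation times are purely illustrative assumptions (the abstract gives none), and the function name `per_year_rate` is hypothetical.

```python
def per_year_rate(per_generation_rate, generation_time_years):
    """Convert a per-generation mutation rate into a per-year rate."""
    if generation_time_years <= 0:
        raise ValueError("generation time must be positive")
    return per_generation_rate / generation_time_years


# Per-generation point mutation rates (per bp) quoted in the abstract.
PEACH_PER_GENERATION = 7.77e-9
ARABIDOPSIS_PER_GENERATION = (7.0e-9, 7.4e-9)  # reported range

# ASSUMPTION: illustrative generation times only, not values from the abstract.
ASSUMED_PEACH_GENERATION_YEARS = 10.0
ASSUMED_ARABIDOPSIS_GENERATION_YEARS = 1.0

peach_per_year = per_year_rate(PEACH_PER_GENERATION, ASSUMED_PEACH_GENERATION_YEARS)
arabidopsis_per_year = [
    per_year_rate(rate, ASSUMED_ARABIDOPSIS_GENERATION_YEARS)
    for rate in ARABIDOPSIS_PER_GENERATION
]

# Report how many times higher the annual's per-year rate is under these assumptions.
for rate in arabidopsis_per_year:
    print(f"Arabidopsis / peach per-year ratio: {rate / peach_per_year:.1f}")
```

Because the two per-generation rates are similar, the per-year ratio is driven almost entirely by the ratio of the assumed generation times, which is why the choice of those values matters.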
2.  Mutation rate analysis via parent–progeny sequencing of the perennial peach. II. No evidence for recombination-associated mutation 
Mutation rates and recombination rates vary between species and between regions within a genome. What are the determinants of these forms of variation? Prior evidence has suggested that recombination might be mutagenic, with an excess of new mutations in the vicinity of recombination breakpoints. As it is conjectured that domesticated taxa have higher recombination rates than wild ones, we expect domesticated taxa to have raised mutation rates. Here, we use parent–offspring sequencing in domesticated and wild peach to ask (i) whether recombination is mutagenic, and (ii) whether domesticated peach has a higher recombination rate than wild peach. We find no evidence that domesticated peach has an increased recombination rate, nor an increased mutation rate near recombination events. If recombination is mutagenic in this taxon, the effect is too weak to be detected by our analysis. While an absence of recombination-associated mutation might explain an absence of a recombination–heterozygosity correlation in peach, we caution against such an interpretation.
PMCID: PMC5095386  PMID: 27798307
peach; mutation rate; crossover rate; domestication
3.  A genome-wide survey reveals abundant rice blast R-genes in resistant cultivars 
Plant resistance genes (R-genes) harbor tremendous allelic diversity, constituting a robust immune system effective against microbial pathogens. Nevertheless, few functional R-genes have been identified for even the best-studied pathosystems. Does this limited repertoire reflect specificity, with most R-genes having been defeated by former pests, or do plants harbor a rich diversity of functional R-genes whose composite behavior is yet to be characterized? Here, we survey 332 NBS-LRR genes cloned from 5 resistant rice cultivars for their ability to confer recognition of 12 rice blast isolates when transformed into susceptible cultivars. Our survey reveals that 48.5% of the 132 NBS-LRR loci tested contain functional rice blast R-genes, with most R-genes deriving from multi-copy clades containing especially diversified loci. Each R-gene recognized, on average, 2.42 of the 12 isolates screened. The abundant R-genes identified in resistant genomes provide extraordinary redundancy in the ability of host genotypes to recognize particular isolates. If the same is true for other pathogens, many extant NBS-LRR genes retain functionality. Our success at identifying rice blast R-genes also validates a highly efficient cloning and screening strategy.
PMCID: PMC4591205  PMID: 26248689
plant resistance genes; Oryza sativa; Magnaporthe oryzae; rice blast; genome-wide survey
4.  Insertions/Deletions-Associated Nucleotide Polymorphism in Arabidopsis thaliana 
Although high levels of within-species variation are commonly observed, a general mechanism for the origin of such variation is still lacking. Insertions and deletions (indels) are a widespread feature of genomes and we hypothesize that there might be an association between indels and patterns of nucleotide polymorphism. Here, we investigate flanking sequences around 18 indels (>100 bp) among a large number of accessions of the plant, Arabidopsis thaliana. We found two distinct haplotypes, i.e., a nucleotide dimorphism, present around each of these indels and dimorphic haplotypes always corresponded to the indel-present/-absent patterns. In addition, the peaks of nucleotide diversity between the two divergent alleles were closely associated with these indels. Thus, there exists a close association between indels and dimorphisms. Further analysis suggests that indel-associated substitutions could be an important component of genetic variation shaping nucleotide polymorphism in Arabidopsis. Finally, we suggest a mechanism by which indels might generate these highly divergent haplotypes. This study provides evidence that nucleotide dimorphisms, which are frequently regarded as evidence of frequency-dependent selection, could be explained simply by structural variation in the genome.
PMCID: PMC5127803  PMID: 27965694
structural variation; insertion; deletion; nucleotide polymorphism; nucleotide dimorphism
5.  GC-Content of Synonymous Codons Profoundly Influences Amino Acid Usage 
G3: Genes|Genomes|Genetics  2015;5(10):2027-2036.
Amino acids are typically encoded by multiple synonymous codons that are not used with the same frequency. Codon usage bias has drawn considerable attention, and several explanations have been offered, including variation in GC-content between species. Focusing on a simple parameter, the combined GC proportion of all the synonymous codons for a particular amino acid (termed GCsyn), we try to deepen our understanding of the relationship between GC-content and amino acid/codon usage in more detail. We analyzed 65 widely distributed representative species and found a close association between GCsyn, GC-content, and amino acid usage. The overall usages of the four amino acids with the greatest GCsyn and the five amino acids with the lowest GCsyn both vary with the regional GC-content, whereas the usage of the remaining 11 amino acids with intermediate GCsyn is less variable. More interestingly, we discovered that codon usage frequencies are nearly constant in regions with similar GC-content. We further quantified the effects of regional GC-content variation (low to high) on amino acid usage and found that GC-content determines the usage variation of amino acids, especially those with extremely high GCsyn, which accounts for 76.7% of the changed GC-content for those regions. Our results suggest that GCsyn correlates with GC-content and has an impact on codon/amino acid usage. These findings suggest a novel approach to understanding the role of codon and amino acid usage in shaping genomic architecture and evolutionary patterns of organisms.
PMCID: PMC4592985  PMID: 26248983
GC-content; synonymous codon; amino acid usage; codon usage
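The GCsyn parameter described in the abstract above can be illustrated with a short sketch. It assumes one plausible reading of "combined GC proportion of all the synonymous codons": the number of G or C bases across all codons for an amino acid in the standard genetic code, divided by the total number of bases in those codons, with every codon weighted equally. The paper's exact definition may differ, and the function name `gcsyn_by_amino_acid` is hypothetical.

```python
from fractions import Fraction
from itertools import product

# Standard genetic code in the conventional TCAG ordering (stop codons as "*").
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def gcsyn_by_amino_acid():
    """Return {amino acid: GC fraction over all of its synonymous codons}.

    ASSUMPTION: each synonymous codon is weighted equally, regardless of
    how often it is actually used in any genome.
    """
    gc_bases = {}
    total_bases = {}
    for codon, aa in CODON_TABLE.items():
        if aa == "*":  # skip stop codons
            continue
        gc_bases[aa] = gc_bases.get(aa, 0) + sum(base in "GC" for base in codon)
        total_bases[aa] = total_bases.get(aa, 0) + len(codon)
    return {aa: Fraction(gc_bases[aa], total_bases[aa]) for aa in gc_bases}


gcsyn = gcsyn_by_amino_acid()
# List amino acids from highest to lowest GCsyn under this reading.
for aa, value in sorted(gcsyn.items(), key=lambda item: item[1], reverse=True):
    print(aa, float(value))
```

Using `Fraction` keeps the values exact, so ties between amino acids are not obscured by floating-point rounding when the list is ranked.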
6.  Causes and consequences of crossing-over evidenced via a high-resolution recombinational landscape of the honey bee 
Genome Biology  2015;16(1):15.
Social hymenoptera, the honey bee (Apis mellifera) in particular, have ultra-high crossover rates and a large degree of intra-genomic variation in crossover rates. Aligned with haploid genomics of males, this makes them a potential model for examining the causes and consequences of crossing over. To address why social insects have such high crossing-over rates and the consequences of this, we constructed a high-resolution recombination atlas by sequencing 55 individuals from three colonies with an average marker density of 314 bp/marker.
We find crossing over to be especially high in proximity to genes upregulated in worker brains, but see no evidence for a coupling with immune-related functioning. We detect only a low rate of non-crossover gene conversion, contrary to current evidence. This is in striking contrast to the ultrahigh crossing-over rate, almost double that previously estimated from lower resolution data. We robustly recover the predicted intragenomic correlations between crossing over and both population level diversity and GC content, which could be best explained as indirect and direct consequences of crossing over, respectively.
Our data are consistent with the view that diversification of worker behavior, but not immune function, is a driver of the high crossing-over rate in bees. While we see both high diversity and high GC content associated with high crossing-over rates, our estimate of the low non-crossover rate demonstrates that high non-crossover rates are not a necessary consequence of high recombination rates.
Electronic supplementary material
The online version of this article (doi:10.1186/s13059-014-0566-0) contains supplementary material, which is available to authorized users.
PMCID: PMC4305242  PMID: 25651211
7.  Functional requirements driving the gene duplication in 12 Drosophila species 
BMC Genomics  2013;14:555.
Gene duplication supplies the raw materials for novel gene functions, and many gene families that arose by duplication have experienced adaptive evolution. Most studies of young duplicates have focused on mammals, especially humans, whereas reports describing their genome-wide evolutionary patterns across the closely related Drosophila species are rare. The 12 sequenced Drosophila genomes provide the opportunity to address this issue.
In our study, 3,647 young duplicate gene families were identified across the 12 Drosophila species and three types of expansions, species-specific, lineage-specific and complex expansions, were detected in these gene families. Our data showed that the species-specific young duplicate genes predominated (86.6%) over the other two types. Interestingly, many independent species-specific expansions in the same gene family have been observed in many species, even including 11 or 12 Drosophila species. Our data also showed that the functional bias observed in these young duplicate genes was mainly related to responses to environmental stimuli and biotic stresses.
This study reveals the evolutionary patterns of young duplicates across 12 Drosophila species on a genomic scale. Our results suggest that convergent evolution acts on young duplicate genes after species differentiation and that adaptive evolution may play an important role in duplicate genes for adaptation to ecological factors and environmental changes in Drosophila.
PMCID: PMC3751352  PMID: 23945147
Young duplication; Environmental factor; Convergent evolution; Adaptive evolution
8.  Genome-Wide Survey of Pseudogenes in 80 Fully Re-sequenced Arabidopsis thaliana Accessions 
PLoS ONE  2012;7(12):e51769.
Pseudogenes (Ψs), including processed and non-processed Ψs, are ubiquitous genetic elements derived from originally functional genes in all studied genomes within the three kingdoms of life. However, systematic surveys of non-processed Ψs utilizing genomic information from multiple samples within a species are still rare. Here a systematic comparative analysis was conducted of Ψs within 80 fully re-sequenced Arabidopsis thaliana accessions, and 7546 genes, representing ∼28% of the genomic annotated open reading frames (ORFs), were found with disruptive mutations in at least one accession. The distribution of these Ψs on chromosomes showed a significantly negative correlation between Ψs/ORFs and their local gene densities, suggesting a higher proportion of Ψs in gene desert regions, e.g. near centromeres. On the other hand, compared with the non-Ψ loci, even the intact coding sequences (CDSs) in the Ψ loci were found to be shorter and to have fewer exons and lower GC content. In addition, a significant functional bias was detected: the Ψs were mainly involved in responses to environmental stimuli and biotic stress, as previously reported, suggesting that pseudogenization, by accumulating successive mutations, is likely important for adaptive evolution in rapidly changing environments.
PMCID: PMC3521719  PMID: 23272162
9.  Variation of presence/absence genes among Arabidopsis populations 
Gene presence/absence (P/A) polymorphisms are commonly observed in plants and are important in individual adaptation and species differentiation. Detecting their abundance, distribution and variation among individuals would help to understand the role played by these polymorphisms in a given species. The 80 recently sequenced Arabidopsis genomes provide an opportunity to address these questions.
By systematically investigating these accessions, we identified 2,407 P/A genes (or 8.9%) absent in one or more genomes, averaging 444 absent genes per accession. 50.6% of P/A genes belonged to multi-copy gene families, or 31.0% to clustered genes. However, the highest proportion of P/A genes, outnumbered in singleton genes, was observed in the regions near centromeres. In addition, a significant correlation was observed between the P/A gene frequency among the 80 accessions and the diversity level at P/A loci. Furthermore, the proportion of P/A genes was different among functional gene categories. Finally, a P/A gene tree showed a diversified population structure in the worldwide Arabidopsis accessions.
An estimate of P/A genes and their frequency distribution in the worldwide Arabidopsis accessions was obtained. Our results suggest that there are diverse mechanisms to generate or maintain P/A genes, by which individuals and functionally different genes can selectively maintain P/A polymorphisms for a specific adaptation.
PMCID: PMC3433342  PMID: 22697058
10.  Genome-wide investigation reveals high evolutionary rates in annual model plants 
BMC Plant Biology  2010;10:242.
Rates of molecular evolution vary widely among species. While significant deviations from the molecular clock have been found in many taxa, the effects of life histories on molecular evolution are not fully understood. In plants, annual/perennial life history traits have long been suspected to influence evolutionary rates at the molecular level. To date, however, the number of genes investigated on this subject is limited and the conclusions are mixed. To evaluate the possible heterogeneity in evolutionary rates between annual and perennial plants at the genomic level, we investigated 85 nuclear housekeeping genes, 10 non-housekeeping families, and 34 chloroplast genes using the genomic data from model plants including Arabidopsis thaliana and Medicago truncatula for annuals and grape (Vitis vinifera) and poplar (Populus trichocarpa) for perennials.
According to the cross-comparisons among the four species, 74-82% of the nuclear genes and 71-97% of the chloroplast genes suggested higher rates of molecular evolution in the two annuals than those in the two perennials. The significant heterogeneity in evolutionary rate between annuals and perennials was consistently found both in nonsynonymous sites and synonymous sites. While a linear correlation of evolutionary rates in orthologous genes between species was observed in nonsynonymous sites, the correlation was weak or invisible in synonymous sites. This tendency was clearer in nuclear genes than in chloroplast genes, in which the overall evolutionary rate was small. The slope of the regression line was consistently lower than unity, further confirming the higher evolutionary rate in annuals at the genomic level.
The higher evolutionary rate in annuals than in perennials appears to be a universal phenomenon both in nuclear and chloroplast genomes in the four dicot model plants we investigated. Therefore, such heterogeneity in evolutionary rate should result from factors that have genome-wide influence, most likely those associated with annual/perennial life history. Although we acknowledge current limitations of this kind of study, mainly due to a small sample size available and a distant taxonomic relationship of the model organisms, our results indicate that the genome-wide survey is a promising approach toward further understanding of the mechanism determining the molecular evolutionary rate at the genomic level.
PMCID: PMC3095324  PMID: 21062446
11.  Important role of indels in somatic mutations of human cancer genes 
BMC Medical Genetics  2010;11:128.
Cancer is clonal proliferation that arises owing to mutations in a subset of genes that confer growth advantage. More and more cancer related genes are found to have accumulated somatic mutations. However, little has been reported about mutational patterns of insertions/deletions (indels) in these genes.
We analyzed indels' abundance and distribution, the relative ratio between indels and somatic base substitutions and the association between those two forms of mutations in a large number of somatic mutations in the Catalogue of Somatic Mutations in Cancer database. We found a strong correlation between indels and base substitutions in cancer-related genes and showed that they tend to concentrate at the same locus in the coding sequences within the same samples. More importantly, a much higher proportion of indels were observed in somatic mutations, as compared to meiotic ones. Furthermore, our analysis demonstrated a great diversity of indels at some loci of cancer-related genes. Particularly in the genes with abundant mutations, the proportion of 3n indels in oncogenes is 7.9 times higher than that in tumor suppressor genes.
There are three distinct patterns of indel distribution in somatic mutations: high proportion, great abundance and non-random distribution. Because of the great influence of indels on gene function (e.g., the effect of frameshift mutation), these patterns indicate that indels are frequently under positive selection and can often be the 'driver mutations' in oncogenesis. Such driving forces can better explain why far fewer frameshift mutations occur in oncogenes and far more in tumor suppressor genes, given their different functions in oncogenesis. These findings contribute to our understanding of mutational patterns and the relationship between indels and cancer.
PMCID: PMC2940769  PMID: 20807447
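The distinction between "3n indels" and frameshift indels in the abstract above rests on whether the indel length is a multiple of three, since only such indels preserve the reading frame of a coding sequence. The following is a minimal sketch of that classification; the function names and the example lengths are hypothetical and do not come from the study's data.

```python
def is_in_frame(indel_length):
    """Return True if a coding-sequence indel of this length keeps the reading frame."""
    if indel_length <= 0:
        raise ValueError("indel length must be a positive number of bases")
    return indel_length % 3 == 0


def proportion_3n(indel_lengths):
    """Fraction of indels whose length is a multiple of three (in-frame)."""
    lengths = list(indel_lengths)
    if not lengths:
        raise ValueError("at least one indel length is required")
    return sum(is_in_frame(n) for n in lengths) / len(lengths)


# Hypothetical indel lengths (in bp), for illustration only.
example_lengths = [1, 2, 3, 6, 4, 9]
print(proportion_3n(example_lengths))
```

Computing this proportion separately for two gene sets and taking the ratio would give a comparison of the kind the abstract reports for oncogenes versus tumor suppressor genes.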
12.  Patterns of exon-intron architecture variation of genes in eukaryotic genomes 
BMC Genomics  2009;10:47.
The origin and importance of exon-intron architecture are among the remaining mysteries of gene evolution. Several studies have investigated the variations of intron length, GC content, ordinal position in a gene and divergence. However, few studies have addressed the structural variation of exons and introns together.
We investigated the length, GC content, ordinal position and divergence in both exons and introns of 13 eukaryotic genomes, representing plants and animals. Our analyses revealed that three basic patterns of exon-intron variation were present in nearly all analyzed genomes (P < 0.001 in most cases): an ordinal reduction of length and divergence in both exons and introns, a co-variation between an exon and its flanking introns in their length, GC content and divergence, and a decrease of average exon (or intron) length, GC content and divergence as the total exon number of a gene increased. In addition, we observed that the shorter introns had either low or high GC content, and the GC content of long introns was intermediate.
Although the factors contributing to these patterns have not been identified, our results provide three important clues: common factor(s) exist and may shape both exons and introns; the ordinal reduction patterns may reflect a time-orderly evolution; and the larger first and last exons may be splicing-required. These clues provide a framework for elucidating mechanisms involved in the organization of eukaryotic genomes and particularly in building exon-intron structures.
PMCID: PMC2636830  PMID: 19166620
13.  Highly asymmetric rice genomes 
BMC Genomics  2007;8:154.
Individuals in the same species are assumed to share the same genomic set. However, it is not unusual to find an orthologous gene only in a small subset of the species, and recent genomic studies suggest that structural rearrangements are very frequent between genomes in the same species. Two recently sequenced rice genomes, Oryza sativa L. var. Nipponbare and O. sativa L. var. 93-11, provide an opportunity to systematically investigate the extent of the gene repertoire polymorphism, even though the genomic data of 93-11, derived from whole-genome shotgun sequencing, are not yet as complete as those of Nipponbare.
We compared gene contents and genomic locations between the two rice genomes. Our conservative estimates suggest that at least 10% of the genes in the genomes were either under presence/absence polymorphism (5.2%) or asymmetrically located between genomes (4.7%). The proportion of these "asymmetric genes" varied widely among gene groups: disease resistance (R) genes and the RLK kinase gene group had 11.6 and 7.8 times higher proportions of asymmetric genes, respectively, than housekeeping genes (Myb and MADS). The significant difference in the proportion of asymmetric genes among gene groups suggests that natural selection is responsible for maintaining genomic asymmetry. On the other hand, the nucleotide diversity in 17 R genes under presence/absence polymorphism was generally low (average nucleotide diversity = 0.0051).
The genomic symmetry was disrupted by 10% of asymmetric genes, which could cause genetic variation through more unequal crossing over, because these genes had no allelic counterparts to pair and then they were free to pair with homologues at non-allelic loci, during meiosis in heterozygotes. It might be a consequence of diversifying selection that increased the structural divergence among genomes, and of purifying selection that decreased nucleotide divergence in each R gene locus.
PMCID: PMC1914357  PMID: 17555605

