Peanut (Arachis hypogaea L.) causes one of the most serious food allergies. Peanut seed proteins, Arah1, Arah2, and Arah3, are considered to be among the most important peanut allergens. To gain insights into genome organization and evolution of allergen-encoding genes, approximately 617 kb from the genome of cultivated peanut and 215 kb from a wild relative were sequenced including three Arah1, one Arah2, eight Arah3, and two Arah6 gene family members. To assign polarity to differences between homoeologous regions in peanut, we used as outgroups the single orthologous regions in Medicago, Lotus, common bean, chickpea, and pigeonpea, which diverged from peanut about 50 Ma and have not undergone subsequent polyploidy. These regions were also compared with orthologs in many additional dicot plant species to help clarify the timing of evolutionary events. The lack of conservation of allergenic epitopes between species, and the fact that many different proteins can be allergenic, makes the identification of allergens across species by comparative studies difficult. The peanut allergen genes are interspersed with low-copy genes and transposable elements. Phylogenetic analyses revealed lineage-specific expansion and loss of low-copy genes between species and homoeologs. Arah1 syntenic regions are conserved in soybean, pigeonpea, tomato, grape, Lotus, and Arabidopsis, whereas Arah3 syntenic regions show genome rearrangements. We infer that tandem and segmental duplications led to the establishment of the Arah3 gene family. Our analysis indicates differences in conserved motifs in allergen proteins and in the promoter regions of the allergen-encoding genes. Phylogenetic analysis and genomic organization studies provide new insights into the evolution of the major peanut allergen-encoding genes.
Arachis hypogaea L.; allergens; gene synteny; genome organization; homologs; evolution
Brassica oleracea is a valuable vegetable species that has contributed to human health and nutrition for hundreds of years and comprises multiple distinct cultivar groups with diverse morphological and phytochemical attributes. In addition to this phenotypic wealth, B. oleracea offers unique insights into polyploid evolution, as it results from multiple ancestral polyploidy events and a final Brassiceae-specific triplication event. Further, B. oleracea represents one of the diploid genomes that formed the economically important allopolyploid oilseed, Brassica napus. A deeper understanding of B. oleracea genome architecture provides a foundation for crop improvement strategies throughout the Brassica genus.
We generate an assembly representing 75% of the predicted B. oleracea genome using a hybrid Illumina/Roche 454 approach. Two dense genetic maps are generated to anchor almost 92% of the assembled scaffolds to nine pseudo-chromosomes. Over 50,000 genes are annotated and 40% of the genome predicted to be repetitive, thus contributing to the increased genome size of B. oleracea compared to its close relative B. rapa. A snapshot of both the leaf transcriptome and methylome allows comparisons to be made across the triplicated sub-genomes, which resulted from the most recent Brassiceae-specific polyploidy event.
Differential expression of the triplicated syntelogs and cytosine methylation levels across the sub-genomes suggest residual marks of the genome dominance that led to the current genome architecture. Although cytosine methylation does not correlate with individual gene dominance, the independent methylation patterns of triplicated copies suggest epigenetic mechanisms play a role in the functional diversification of duplicate genes.
Phylogenetic trees are widely used for genetic and evolutionary studies in various organisms. Advanced sequencing technology has dramatically enriched data available for constructing phylogenetic trees based on single nucleotide polymorphisms (SNPs). However, massive SNP data makes it difficult to perform reliable analysis, and there has been no ready-to-use pipeline to generate phylogenetic trees from these data.
We developed a new pipeline, SNPhylo, to construct phylogenetic trees based on large SNP datasets. The pipeline may enable users to construct a phylogenetic tree from three representative SNP data file formats. In addition, in order to increase reliability of a tree, the pipeline has steps such as removing low quality data and considering linkage disequilibrium. A maximum likelihood method for the inference of phylogeny is also adopted in generation of a tree in our pipeline.
Using SNPhylo, users can easily produce a reliable phylogenetic tree from a large SNP data file. Thus, this pipeline can help a researcher focus more on interpretation of the results of analysis of voluminous data sets, rather than manipulations necessary to accomplish the analysis.
Polymorphisms; Linkage disequilibrium; Maximum likelihood
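The filtering steps described above (removing low-quality data and accounting for linkage disequilibrium before tree building) can be illustrated with a minimal sketch. It is not SNPhylo's implementation: the function names, the 0/1/2 genotype coding, the thresholds, and the simplification of comparing each SNP only with the previously kept one are all assumptions made for illustration.

```python
# Hypothetical sketch of SNP filtering before phylogeny inference.
# Genotypes are alternate-allele counts (0, 1, 2); None marks a missing call.

def missing_rate(genotypes):
    """Fraction of samples with no genotype call."""
    return sum(g is None for g in genotypes) / len(genotypes)

def minor_allele_freq(genotypes):
    """Minor allele frequency computed from called genotypes only."""
    called = [g for g in genotypes if g is not None]
    if not called:
        return 0.0
    p = sum(called) / (2 * len(called))
    return min(p, 1 - p)

def r_squared(a, b):
    """Squared Pearson correlation of two genotype vectors (a simple LD proxy)."""
    pairs = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    n = len(pairs)
    if n < 2:
        return 0.0
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    syy = sum((y - mean_y) ** 2 for _, y in pairs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy * sxy / (sxx * syy)

def filter_snps(snps, max_missing=0.1, min_maf=0.1, max_r2=0.4):
    """Keep SNPs that pass quality filters and are not in strong LD
    with the previously kept SNP. Thresholds are illustrative assumptions."""
    kept = []
    for name, genotypes in snps:
        if missing_rate(genotypes) > max_missing:
            continue
        if minor_allele_freq(genotypes) < min_maf:
            continue
        if kept and r_squared(kept[-1][1], genotypes) > max_r2:
            continue
        kept.append((name, genotypes))
    return kept

example = [
    ("s1", [0, 1, 2, 0, 1, 2]),
    ("s2", [0, 1, 2, 0, 1, 2]),        # identical to s1, so in complete LD
    ("s3", [0, 0, 0, 0, 0, 0]),        # monomorphic
    ("s4", [None, None, 1, 2, 0, 1]),  # too many missing calls
    ("s5", [2, 0, 1, 1, 2, 0]),
]
print([name for name, _ in filter_snps(example)])
```

A sliding-window comparison against several neighbouring SNPs would be closer to common LD-pruning practice; the adjacent-only comparison is used here to keep the sketch short.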
Whole genome duplication (WGD) is widespread in flowering plants and is a driving force in angiosperm diversification. The redundancy introduced by WGD allows the evolution of novel gene interactions and functions, although the patterns and processes of diversification are poorly understood. We identified ∼2,000 pairs of paralogous genes in Gossypium raimondii (cotton) resulting from an approximately 60 My old 5- to 6-fold ploidy increase. Gene expression analyses revealed that, in G. raimondii, 99.4% of the gene pairs exhibit differential expression in at least one of the three tissues (petal, leaf, and seed), with 93% to 94% exhibiting differential expression on a per-tissue basis. For 1,666 (85%) pairs, differential expression was observed in all tissues. These observations were mirrored in a time series of G. raimondii seed, and separately in leaf, petal, and seed of G. arboreum, indicating expression level diversification before species divergence. A generalized linear model revealed 92.4% of the paralog pairs exhibited expression divergence, with most exhibiting significant gene and tissue interactions indicating complementary expression patterns in different tissues. These data indicate massive, near-complete expression level neo- and/or subfunctionalization among ancient gene duplicates, suggesting these processes are essential in their maintenance over ∼60 Ma.
We constructed a very-high-density, whole-genome marker map (WGMM) for cotton by using 18,597 DNA markers corresponding to 48,958 loci that were aligned to both a consensus genetic map and a reference genome sequence. The WGMM has a density of one locus per 15.6 kb, or an average of 1.3 loci per gene. The WGMM was anchored by the use of colinear markers to a detailed genetic map, providing recombinational information. Mapped markers occurred at relatively greater physical densities in distal chromosomal regions and lower physical densities in the central regions, with all 1 Mb bins having at least nine markers. Hotspots for quantitative trait loci and resistance gene analog clusters were aligned to the map and DNA markers identified for targeting of these regions of high practical importance. Based on the cotton D genome reference sequence, the locations of chromosome structural rearrangements plotted on the map facilitate its translation to other Gossypium genome types. The WGMM is a versatile genetic map for marker assisted breeding, fine mapping and cloning of genes and quantitative trait loci, developing new genetic markers and maps, genome-wide association mapping, and genome evolution studies.
quantitative trait loci; resistance gene analog; simple sequence repeat; restriction fragment length polymorphism; inversions
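As a rough consistency check on the density figures quoted above, the short calculation below derives the physical span and the gene count that those figures imply. The variable names are illustrative, and the derived values are back-of-envelope estimates rather than numbers reported in the abstract.

```python
# Figures taken from the abstract above.
loci = 48_958
kb_per_locus = 15.6   # one locus per 15.6 kb
loci_per_gene = 1.3   # average of 1.3 loci per gene

# Derived estimates (assumptions: uniform averaging over the whole map).
implied_span_mb = loci * kb_per_locus / 1000
implied_gene_count = loci / loci_per_gene

print(f"Implied physical span: {implied_span_mb:.1f} Mb")
print(f"Implied gene count: {implied_gene_count:.0f}")
```

Under these assumptions the map covers roughly 764 Mb and corresponds to about 37,700 genes.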
Acute myeloid leukemia (AML) is the most common malignant myeloid disorder of progenitor cells in myeloid hematopoiesis and exemplifies a genetically heterogeneous disease. Patients with AML also show a heterogeneous response to therapy. Although all-trans retinoic acid (ATRA) has been successfully introduced to treat acute promyelocytic leukemia (APL), it is rather ineffective in non-APL AML. In the present study, 1200 off-patent marketed drugs and natural compounds that have been approved by the Food and Drug Administration (FDA) were screened for anti-leukemia activity using the retrovirus transduction/transformation assay (RTTA). Furazolidone (FZD) was shown to inhibit bone marrow transformation mediated by several leukemia fusion proteins, including AML1-ETO. Furazolidone has been used in the treatment of certain bacterial and protozoan infections in humans and animals for more than sixty years. We investigated the anti-leukemic activity of FZD in a series of AML cells. FZD displayed potent antiproliferative properties at submicromolar concentrations and induced apoptosis in AML cell lines. Importantly, FZD treatment of certain AML cells induced myeloid cell differentiation, as assessed by morphology and by flow cytometry for CD11b expression. Furthermore, FZD treatment resulted in increased stability of tumor suppressor p53 protein in AML cells. Our in vitro results suggest furazolidone as a novel therapeutic strategy in AML patients.
Genome duplication (GD) has permanently shaped the architecture and function of many higher eukaryotic genomes. The angiosperms (flowering plants) are outstanding models in which to elucidate consequences of GD for higher eukaryotes, owing to their propensity for chromosomal duplication or even triplication in a few cases. Duplicated genome structures often require both intra- and inter-genome alignments to unravel their evolutionary history, also providing the means to deduce both obvious and otherwise-cryptic orthology, paralogy and other relationships among genes. The burgeoning sets of angiosperm genome sequences provide the foundation for a host of investigations into the functional and evolutionary consequences of gene and GD. To provide genome alignments from a single resource based on uniform standards that have been validated by empirical studies, we built the Plant Genome Duplication Database (PGDD; freely available at http://chibba.agtec.uga.edu/duplication/), a web service providing synteny information in terms of colinearity between chromosomes. At present, PGDD contains data for 26 plants including bryophytes and chlorophyta, as well as angiosperms with draft genome sequences. In addition to the inclusion of new genomes as they become available, we are preparing new functions to enhance PGDD.
MCScan is an algorithm able to scan multiple genomes or subgenomes in order to identify putative homologous chromosomal regions, and align these regions using genes as anchors. The MCScanX toolkit implements an adjusted MCScan algorithm for detection of synteny and collinearity that extends the original software by incorporating 14 utility programs for visualization of results and additional downstream analyses. Applications of MCScanX to several sequenced plant genomes and gene families are shown as examples. MCScanX can be used to effectively analyze chromosome structural changes, and reveal the history of gene family expansions that might contribute to the adaptation of lineages and taxa. An integrated view of various modes of gene duplication can supplement the traditional gene tree analysis in specific families. The source code and documentation of MCScanX are freely available at http://chibba.pgml.uga.edu/mcscan2/.
Both single gene and whole genome duplications (WGD) have recurred in angiosperm evolution. However, the evolutionary effects of different modes of gene duplication, especially regarding their contributions to genetic novelty or redundancy, have been inadequately explored.
In Arabidopsis thaliana and Oryza sativa (rice), species that deeply sample botanical diversity and for which expression data are available from a wide range of tissues and physiological conditions, we have compared expression divergence between genes duplicated by six different mechanisms (WGD, tandem, proximal, DNA based transposed, retrotransposed and dispersed), and between positional orthologs. Both neo-functionalization and genetic redundancy appear to contribute to retention of duplicate genes. Genes resulting from WGD and tandem duplications diverge slowest in both coding sequences and gene expression, and contribute most to genetic redundancy, while other duplication modes contribute more to evolutionary novelty. WGD duplicates may more frequently be retained due to dosage amplification, while inferred transposon mediated gene duplications tend to reduce gene expression levels. The extent of expression divergence between duplicates is discernibly related to duplication modes, different WGD events, amino acid divergence, and putatively neutral divergence (time), but the contribution of each factor is heterogeneous among duplication modes. Gene loss may retard inter-species expression divergence. Members of different gene families may have non-random patterns of origin that are similar in Arabidopsis and rice, suggesting the action of pan-taxon principles of molecular evolution.
Gene duplication modes differ in contribution to genetic novelty and redundancy, but show some parallels in taxa separated by hundreds of millions of years of evolution.
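To make the duplication modes compared above more concrete, here is a minimal sketch of how a pair of duplicated genes might be assigned to a mode from positional information alone. It is not the authors' procedure: the function name, the data layout, and the proximal window of 10 genes are assumptions, and the transposed and retrotransposed classes are omitted because they need outgroup and gene-structure evidence that this sketch does not model.

```python
# Hypothetical positional classifier for duplicate gene pairs.
# positions maps gene -> (chromosome, rank of the gene along that chromosome).
# collinear_pairs holds pairs found inside collinear (syntenic) blocks.

def classify_pair(gene_a, gene_b, positions, collinear_pairs, proximal_window=10):
    """Return a duplication-mode label for one gene pair."""
    if frozenset((gene_a, gene_b)) in collinear_pairs:
        return "WGD/segmental"
    chrom_a, rank_a = positions[gene_a]
    chrom_b, rank_b = positions[gene_b]
    if chrom_a == chrom_b:
        gap = abs(rank_a - rank_b)
        if gap == 1:
            return "tandem"      # immediate neighbours
        if gap <= proximal_window:
            return "proximal"    # nearby, but separated by other genes
    return "dispersed"

positions = {
    "g1": ("chr1", 10), "g2": ("chr1", 11), "g3": ("chr1", 15),
    "g4": ("chr3", 200), "g5": ("chr1", 500),
}
collinear_pairs = {frozenset(("g1", "g4"))}

for a, b in [("g1", "g4"), ("g1", "g2"), ("g1", "g3"), ("g1", "g5"), ("g2", "g4")]:
    print(a, b, classify_pair(a, b, positions, collinear_pairs))
```

Checking collinearity first reflects one possible priority order; a different order would change how pairs that satisfy more than one criterion are labelled.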
The highly repetitive Alu retroelements are regarded as methylation centres in the genome. Methylation in the gene promoters could be spreading from them. Promoter methylation of MLH1 is frequently detected in cancers, but the underlying mechanism is unclear. The aim of this study is to understand whether the methylation in the Alu elements is associated with promoter methylation in the MLH1 gene. Bisulfite genomic sequencing was used to analyse the CpG sites of the 5′ end (promoter, exon 1 and Alu-containing intron 1) of the MLH1 gene in colorectal cancer cells and tissues, and gastric cancer tissues. Hypomethylation in the Alu elements and hypermethylation in the promoters and the regions between the promoters and the Alu elements were detected in two cancer cell lines and seven cancer tissues. However, demethylation or hypomethylation of the MLH1 promoter and regions between promoter and the Alu elements, and hypermethylation in the Alu elements, were identified in the normal tissues. MLH1 promoter methylation may spread from Alu elements that are located in intron 1 of the MLH1 gene. The trans-acting elements binding to the mutation sites could play a role in the methylation spreading.
Evolution of the Brassica species has been recursively affected by polyploidy events, and comparison to their relative, Arabidopsis thaliana, provides means to explore their genomic complexity.
A genome-wide physical map of a rapid-cycling strain of B. oleracea was constructed by integrating high-information-content fingerprinting (HICF) of Bacterial Artificial Chromosome (BAC) clones with hybridization to sequence-tagged probes. Using 2907 contigs of two or more BACs, we performed several lines of comparative genomic analysis. Interspecific DNA synteny is much better preserved in euchromatin than heterochromatin, showing the qualitative difference in evolution of these respective genomic domains. About 67% of contigs can be aligned to the Arabidopsis genome, with 96.5% corresponding to euchromatic regions, and 3.5% (shown to contain repetitive sequences) to pericentromeric regions. Overgo probe hybridization data showed that contigs aligned to Arabidopsis euchromatin contain ~80% of low-copy-number genes, while genes with high copy number are much more frequently associated with pericentromeric regions. We identified 39 interchromosomal breakpoints during the diversification of B. oleracea and Arabidopsis thaliana, a relatively high level of genomic change since their divergence. Comparison of the B. oleracea physical map with Arabidopsis and other available eudicot genomes showed appreciable 'shadowing' produced by more ancient polyploidies, resulting in a web of relatedness among contigs which increased genomic complexity.
A high-resolution genetically-anchored physical map sheds light on Brassica genome organization and advances positional cloning of specific genes, and may help to validate genome sequence assembly and alignment to chromosomes.
All the physical mapping data are freely shared at a WebFPC site (http://lulu.pgml.uga.edu/fpc/WebAGCoL/brassica/WebFPC/; temporarily password-protected: account: pgml; password: 123qwe123).
Comparative genomics; polyploidy; Arabidopsis thaliana
Genetically anchored physical maps of large eukaryotic genomes have proven useful both for their intrinsic merit and as an adjunct to genome sequencing. Cultivated tetraploid cottons, Gossypium hirsutum and G. barbadense, share a common ancestor formed by a merger of the A and D genomes about 1-2 million years ago. Toward the long-term goal of characterizing the spectrum of diversity among cotton genomes, the worldwide cotton community has prioritized the D genome progenitor Gossypium raimondii for complete sequencing.
A whole genome physical map of G. raimondii, the putative D genome ancestral species of tetraploid cottons, was assembled, integrating genetically-anchored overgo hybridization probes, agarose based fingerprints and 'high information content fingerprinting' (HICF). A total of 13,662 BAC-end sequences and 2,828 DNA probes were used in genetically anchoring 1585 contigs to a cotton consensus genetic map, and 370 and 438 contigs, respectively, to Arabidopsis thaliana (AT) and Vitis vinifera (VV) whole genome sequences.
Several lines of evidence suggest that the G. raimondii genome is comprised of two qualitatively different components. Much of the gene rich component is aligned to the Arabidopsis and Vitis vinifera genomes and shows promise for utilizing translational genomic approaches in understanding this important genome and its resident genes. The integrated genetic-physical map is of value both in assembling and validating a planned reference sequence.
Papaya, a fruit crop cultivated in tropical and subtropical regions, is known for its nutritional benefits and medicinal applications. Here we report a 3× draft genome sequence of ‘SunUp’ papaya, the first commercial virus-resistant transgenic fruit tree [1] to be sequenced. The papaya genome is three times the size of the Arabidopsis genome, but contains fewer genes, including significantly fewer disease-resistance gene analogues. Comparison of the five sequenced genomes suggests a minimal angiosperm gene set of 13,311. A lack of recent genome duplication, atypical of other angiosperm genomes sequenced so far [2–5], may account for the smaller papaya gene number in most functional groups. Nonetheless, striking amplifications in gene number within particular functional groups suggest roles in the evolution of tree-like habit, deposition and remobilization of starch reserves, attraction of seed dispersal agents, and adaptation to tropical daylengths. Transgenesis at three locations is closely associated with chloroplast insertions into the nuclear genome, and with topoisomerase I recognition sites. Papaya offers numerous advantages as a system for fruit-tree functional genomics, and this draft genome sequence provides the foundation for revealing the basis of Carica's distinguishing morpho-physiological, medicinal and nutritional properties.
Comparison of the sorghum, maize and rice genomes shows that gene duplication and functional innovation is common to evolution of most but not all genes in the C4 photosynthetic pathway
Sorghum is the first C4 plant and the second grass with a full genome sequence available. This makes it possible to perform a whole-genome-level exploration of C4 pathway evolution by comparing key photosynthetic enzyme genes in sorghum, maize (C4) and rice (C3), and to investigate a long-standing hypothesis that a reservoir of duplicated genes is a prerequisite for the evolution of C4 photosynthesis from a C3 progenitor.
We show that both whole-genome and individual gene duplication have contributed to the evolution of C4 photosynthesis. The C4 gene isoforms show differential duplicability, with some C4 genes being recruited from whole genome duplication duplicates by multiple modes of functional innovation. The sorghum and maize carbonic anhydrase genes display a novel mode of new gene formation, with recursive tandem duplication and gene fusion accompanied by adaptive evolution to produce C4 genes with one to three functional units. Other C4 enzymes in sorghum and maize also show evidence of adaptive evolution, though differing in level and mode. Intriguingly, a phosphoenolpyruvate carboxylase gene in the C3 plant rice has also been evolving rapidly and shows evidence of adaptive evolution, although lacking key mutations that are characteristic of C4 metabolism. We also found evidence that both gene redundancy and alternative splicing may have sheltered the evolution of new function.
Gene duplication followed by functional innovation is common to evolution of most but not all C4 genes. The apparently long time-lag between the availability of duplicates for recruitment into C4 and the appearance of C4 grasses, together with the heterogeneity of origins of C4 genes, suggests that there may have been a long transition process before the establishment of C4 photosynthesis.
The identification of chromosomal homology will shed light on such mysteries of genome evolution as DNA duplication, rearrangement and loss. Several approaches have been developed to detect chromosomal homology based on gene synteny or colinearity. However, the previously reported implementations lack statistical inferences which are essential to reveal actual homologies.
In this study, we present a statistical approach to detect homologous chromosomal segments based on gene colinearity. We implement this approach in a software package ColinearScan to detect putative colinear regions using a dynamic programming algorithm. Statistical models are proposed to estimate proper parameter values and evaluate the significance of putative homologous regions. Statistical inference, high computational efficiency and flexibility of input data type are three key features of our approach.
We apply ColinearScan to the Arabidopsis and rice genomes to detect duplicated regions within each species and homologous fragments between these two species. We find many more homologous chromosomal segments in the rice genome than previously reported. We also find many small colinear segments between rice and Arabidopsis genomes.
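The idea of finding colinear segments by dynamic programming can be illustrated with a minimal sketch. It is not the ColinearScan algorithm itself: the function name, the unit score per anchor, and the `max_gap` parameter are assumptions, only same-orientation chains are considered, and the statistical significance testing described above is not included.

```python
# Hypothetical dynamic-programming chaining of homologous gene pairs.
# Each anchor is (pos_a, pos_b): gene-order indices of a homologous pair
# on two chromosomes (or two regions of the same genome).

def best_colinear_chain(anchors, max_gap=5):
    """Return the longest chain of anchors that increases on both axes,
    with consecutive anchors at most max_gap genes apart on each axis."""
    anchors = sorted(set(anchors))
    n = len(anchors)
    if n == 0:
        return []
    score = [1] * n   # best chain length ending at each anchor
    prev = [-1] * n   # back-pointer to the preceding anchor in that chain
    for k in range(n):
        a_k, b_k = anchors[k]
        for m in range(k):
            a_m, b_m = anchors[m]
            step_a, step_b = a_k - a_m, b_k - b_m
            if 0 < step_a <= max_gap and 0 < step_b <= max_gap:
                if score[m] + 1 > score[k]:
                    score[k] = score[m] + 1
                    prev[k] = m
    end = max(range(n), key=score.__getitem__)
    chain = []
    while end != -1:
        chain.append(anchors[end])
        end = prev[end]
    return chain[::-1]

anchors = [(1, 10), (2, 11), (4, 13), (5, 3), (6, 14), (20, 40)]
print(best_colinear_chain(anchors))
```

In this example the anchor (5, 3) breaks gene order on the second axis and (20, 40) lies beyond the gap limit, so neither joins the chain. The quadratic loop is adequate for a sketch; a production tool would need a faster search and a model for judging whether a chain of a given length could arise by chance.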
Human utilization of the mulberry–silkworm interaction started at least 5,000 years ago and greatly influenced world history through the Silk Road. Complementing the silkworm genome sequence, here we describe the genome of a mulberry species Morus notabilis. In the 330-Mb genome assembly, we identify 128 Mb of repetitive sequences and 29,338 genes, 60.8% of which are supported by transcriptome sequencing. Mulberry gene sequences appear to evolve ~3 times faster than other Rosales, perhaps facilitating the species’ spread worldwide. The mulberry tree is among a few eudicots but several Rosales that have not preserved genome duplications in more than 100 million years; however, a neopolyploid series found in the mulberry tree and several others suggest that new duplications may confer benefits. Five predicted mulberry miRNAs are found in the haemolymph and silk glands of the silkworm, suggesting interactions at molecular levels in the plant–herbivore relationship. The identification and analyses of mulberry genes involved in diversifying selection, resistance and protease inhibitor expressed in the laticifers will accelerate the improvement of mulberry plants.
Mulberry trees are the primary food source for silkworms, which are reared for the production of silk. In this study, He et al. present the draft genome sequence of Morus notabilis and find that it evolved significantly faster than other plants in the Rosales order.
Polyploidization has provided much genetic variation for plant adaptive evolution, but the mechanisms by which the molecular evolution of polyploid genomes establishes genetic architecture underlying species differentiation are unclear. Brassica is an ideal model to increase knowledge of polyploid evolution. Here we describe a draft genome sequence of Brassica oleracea, comparing it with that of its sister species B. rapa to reveal numerous chromosome rearrangements and asymmetrical gene loss in duplicated genomic blocks, asymmetrical amplification of transposable elements, differential gene co-retention for specific pathways and variation in gene expression, including alternative splicing, among a large number of paralogous and orthologous genes. Genes related to the production of anticancer phytochemicals and morphological variations illustrate consequences of genome duplication and gene divergence, imparting biochemical and morphological variation to B. oleracea. This study provides insights into Brassica genome evolution and will underpin research into the many important crops in this
Brassica oleracea is a plant species comprising economically important vegetable crops. Here, the authors report the draft genome sequence of B. oleracea and, through a comparative analysis with the closely related B. rapa, reveal insights into Brassica evolution and divergence of interspecific genomes and intraspecific
We report improved whole-genome shotgun sequences for the genomes of indica and japonica rice, both with multimegabase contiguity, or almost 1,000-fold improvement over the drafts of 2002. Tested against a nonredundant collection of 19,079 full-length cDNAs, 97.7% of the genes are aligned, without fragmentation, to the mapped super-scaffolds of one or the other genome. We introduce a gene identification procedure for plants that does not rely on similarity to known genes to remove erroneous predictions resulting from transposable elements. Using the available EST data to adjust for residual errors in the predictions, the estimated gene count is at least 38,000–40,000. Only 2%–3% of the genes are unique to any one subspecies, comparable to the amount of sequence that might still be missing. Despite this lack of variation in gene content, there is enormous variation in the intergenic regions. At least a quarter of the two sequences could not be aligned, and where they could be aligned, single nucleotide polymorphism (SNP) rates varied from as little as 3.0 SNP/kb in the coding regions to 27.6 SNP/kb in the transposable elements. A more inclusive new approach for analyzing duplication history is introduced here. It reveals an ancient whole-genome duplication, a recent segmental duplication on Chromosomes 11 and 12, and massive ongoing individual gene duplications. We find 18 distinct pairs of duplicated segments that cover 65.7% of the genome; 17 of these pairs date back to a common time before the divergence of the grasses. More important, ongoing individual gene duplications provide a never-ending source of raw material for gene genesis and are major contributors to the differences between members of the grass family.
Comparative genome sequencing of indica and japonica rice reveals that duplication of genes and genomic regions has played a major part in the evolution of grass genomes