|Home | About | Journals | Submit | Contact Us | Français|
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
The globe artichoke (Cynara cardunculus var. scolymus L.) is a significant crop in the Mediterranean basin. Despite its commercial importance and its both dietary and pharmaceutical value, knowledge of its genetics and genomics remains scant. Microsatellite markers have become a key tool in genetic and genomic analysis, and we have exploited recently acquired EST (expressed sequence tag) sequence data (Composite Genome Project - CGP) to develop an extensive set of microsatellite markers.
A unigene assembly was created from over 36,000 globe artichoke EST sequences, containing 6,621 contigs and 12,434 singletons. Over 12,000 of these unigenes were functionally assigned on the basis of homology with Arabidopsis thaliana reference proteins. A total of 4,219 perfect repeats, located within 3,308 unigenes was identified and the gene ontology (GO) analysis highlighted some GO term's enrichments among different classes of microsatellites with respect to their position. Sufficient flanking sequence was available to enable the design of primers to amplify 2,311 of these microsatellites, and a set of 300 was tested against a DNA panel derived from 28 C. cardunculus genotypes. Consistent amplification and polymorphism was obtained from 236 of these assays. Their polymorphic information content (PIC) ranged from 0.04 to 0.90 (mean 0.66). Between 176 and 198 of the assays were informative in at least one of the three available mapping populations.
EST-based microsatellites have provided a large set of de novo genetic markers, which show significant amounts of polymorphism both between and within the three taxa of C. cardunculus. They are thus well suited as assays for phylogenetic analysis, the construction of genetic maps, marker-assisted breeding, transcript mapping and other genomic applications in the species.
The globe artichoke Cynara cardunculus is a member of the Asteraceae (Compositae) family, and originates from the Mediterranean basin . The species is subdivided into three taxa - the globe artichoke (var. scolymus L.), the cultivated cardoon (var. altilis DC), and their progenitor, the wild cardoon [var. sylvestris (Lamk) Fiori]. The edible part of the globe artichoke plant is provided by its immature inflorescence, referred as a capitulum or head , and represents a significant component of the Mediterranean diet. The cultivated cardoon is grown for its fleshy stems, which are used in traditional cuisine. Leaf extracts of the species contain a number of bioactive compounds (e.g., quercetin, rutin, luteolin, gallic acid, di-caffeoylchinic acid, and sesquiterpene lactones) which have been shown to have anti-oxidative and anti-carcinogenic activity, to inhibit cholesterol biosynthesis, and to enhance lipid metabolism [3-8]. The antioxidant content per serving of globe artichoke ranks very highly among vegetables , while the roots provide a source of inulin, a proven enhancer of human intestinal flora [10,11]. In spite of its economic importance, however, little breeding effort has been applied to date in the globe artichoke.
Progress has been made in the development of DNA marker based genetic maps in globe artichoke [12-14]. The most saturated map has been recently developed from a set of F1 progeny of a cross between a globe artichoke and a cultivated cardoon genotypes . This map consisted of 20 linkage groups comprising 326 loci and spanned ~1500 cM with a mean inter-marker distance of ~4.5 cM. A set of loci common to this map and a previously developed one  allowed for map alignment and the definition of 17 homologous linkage groups, corresponding to the haploid chromosome number of the species.
It was long assumed that SSRs were primarily associated with non-coding DNA, but it has now become clear that they are more frequent in transcribed than non-transcribed sequences and equally frequent in the transcriptomes of plants with dramatically different nuclear DNA contents . EST databases therefore represent a convenient resource for the identification of microsatellites, some of which, as a result of their presence within coding DNA, have the potential to deliver informative within gene markers, exploitable as COS (conserved orthologous set) for genomic comparative analysis.
Here, we report: i) the unigene assembly based on the globe artichoke EST database deposited in GenBank by the Compositae Genome Project (CGP), ii) the identification of a wide set of EST-based microsatellite markers and iii) the evaluation of the informativeness of a subset of these markers using a panel of C. cardunculus genotypes. Furthermore, we performed a comprehensive functional annotation, inferred from sequence alignment (ISA), as well as a gene ontology categorisation inferred from sequence orthology (ISO) of the SSR-containing unigenes. At last we assessed whether motif type and relative position within CDSs (Coding DNA Sequences)/UTRs (Untraslated Regions) may be preferentially associated with a particular gene ontology term.
Globe artichoke ESTs were trimmed, assembled, and annotated using a customized bioinformatic pipeline (Figure (Figure1)1) into 19,055 unigenes (6,621 contigs and 12,434 singletons) spanning 15 Mbp. The transcript assembly and unigene consensus sequences are supplied as electronic supplementary materials (See Additional file 1, 2, 3: EST assembly, 19,055 unigenes, ACE assembly file). The unigenes had a mean length of 786 ± 1.7 bp, with a mean GC content of 43.9 ± 0.03% (range 24.7 - 67.9%, Figure Figure2)2) and a mean ambiguity code ratio of 0.51 ± 0.01. Within the unigene set, 3,308 contained 4,219 uninterrupted tracts of (perfect) di-, tri-, tetra-, penta-, and hexa-nucleotide repeats, delivering a mean microsatellite density of one per 3.6 kb. Comparable microsatellite frequencies and densities have been discovered in the transcriptomes of other Compositae species [16-18]. Only perfect repeats were selected, as these appear to be the more prone to strand slippage and, consequently, tend to be more polymorphic than interrupted ones . Sufficient flanking sequence (in terms of both length and read quality) was present in 1,974 of the unigenes, containing 2,311 perfect microsatellites. The resulting PCR primers designed for these loci are given in Additional File 4 (primer pairs designed).
A subset of 300 microsatellites (ranging in length from 10 to 84 bp, and representative of a broad spectrum of the repeat types) was surveyed for their informativeness. The targeted amplicon length ranged from 98 to 456 bp and the set was selected to optimize the possibility of multiplexing on the capillary genotyping platform employed. The test germplasm panel consisted of twelve genotypes of globe artichoke, nine of cultivated cardoon, and seven of wild cardoon (Table (Table1).1). In all, 238 (79.3%) of the assays were successful; of these, 236 were informative among the taxa, while 215, 216 and 223 were polymorphic within, respectively, globe artichoke, cultivated cardoon and wild cardoon (Table (Table22 and Additional file 5: full statistics on 300 SSR-containing loci). A total of 1,546 alleles was generated from the 238 successful assays, giving a mean of 3.8 (range 1-15) alleles per locus. The largest range in amplicon length observed was 196 - 252 bp, observed for a TCAn microsatellite (CyEM-171). In 85% of the loci, the assay generated the predicted length of amplicon, in 12.2% the amplicon was longer than expected, and in 2.8% it was shorter. The allelic range (in terms of amplicon length) was greater for the wild cardoon (17.1 ± 1.0 bp) than for globe artichoke (13.6 ± 0.8 bp) or cultivated cardoon (13.7 ± 0.7 bp).
Allelic diversity was, as expected given the breeding history of these taxa [2,20,21], greater in the wild than in the two cultivated forms (Figure (Figure3A).3A). The frequency of taxon-specific alleles was two fold more in the wild cardoon, and the polymorphic information content (PIC) was higher in the wild cardoon (0.576 ± 0.015) than in either the globe artichoke (0.484 ± 0.013) or the cultivated cardoon (0.466 ± 0.015). The observed heterozygosity level (Ho) was significantly less in the cultivated cardoon than in globe artichoke, presumably because the globe artichoke is primarily a vegetatively propagated plant, and thus able to maintain a high level of heterozygosity over time [20,21]; whereas cultivated cardoon is seed propagated and has been subjected to purifying selection aimed at increasing genetic uniformity for stabilizing its production. We previously developed three mapping population for the development of C. cardunculus genetic maps by crossing one globe artichoke non spiny genotype (common female parent) with a spiny genotype of globe artichoke or cultivated cardoon or wild cardoon. When the parents of the three mapping populations were tested with the set of microsatellites, 214 were informative in at least one combination, while 153 across all the three combinations, thus supplying landmarks for comparative mapping of phenotypic and quantitative trait loci (QTLs). As expected, the most polymorphic combination (198 microsatellites) was the one involving the cross between the most genetically divergent taxa: globe artichoke and wild cardoon (Figure (Figure3B3B).
The informativeness of the newly developed EST microsatellites was comparable with that described for microsatellite markers in globe artichoke [13,22], sunflower [23,24] and lettuce . A set of five EST microsatellites was sufficient to discriminate between each of the 28 members of the germplasm panel (e.g. CyEM-10, -37, -54, -105, -254). The inferred genetic relationships were in good concordance with those derived from AFLP profiling [20,21,26,27]. Thus, the globe artichoke accessions clustered with one another (Figure (Figure4A,4A, cluster A), but two sub-clusters, corresponding to the contrasting capitulum types (i.e. non spiny vs. violet, spiny types), could be recognised. The clade most closely related to the globe artichokes contained the cultivated cardoons (Figure (Figure4A,4A, cluster B), and among these, the most well differentiated accession was 'Bianco Pieno Migliorato', as previously observed . The Spanish cultivated cardoon accessions were genetically very similar to one another. The wild cardoon accessions formed a discrete, but rather loose group (Figure (Figure4A,4A, cluster C). A principal co-ordinate analysis further illustrated the genetic relationships between members of the germplasm panel (Figure (Figure4B).4B). Axes 1 and 2 accounted for ~ 73% of the genetic variation, the former contributing ~ 47%, and the latter ~ 26%. Axis 1 distinguished the globe artichokes from the cultivated and wild cardoons, while Axis 2 separated the two cardoon taxa. As expected, F1 progenies mapped to intermediate positions with respect to those of their parents (Figure (Figure4B4B).
Of the 4,219 microsatellites, trinucleotide motifs accounted for 49%, dinucleotides for 33%, hexanucleotides for 13%, tetranucleotides for 3% and pentanucleotides for 2% (Figure (Figure5).5). Only ESTs (2,498 of the 3,308) having a non-ambiguous ortholog in Arabidopsis thaliana were taken forward for the purpose of annotation. The position of the microsatellite tract (5'-UTR or 3'-UTR or CDS) was derived from the BlastX result in conjunction with the ORF (Open Reading Frame)-Predictor algorithm . About 33% of the microsatellites were present in the 5'-UTRs, ~ 20% in the 3'-UTRs and ~ 47% in the CDSs (Figure (Figure5),5), similar to the relative frequencies in both the A. thaliana and rice genomes . Most of the CDS microsatellites consisted of trinucleotides, while dinucleotides were the most abundant in the 5'-UTRs, and di- and trinucleotides were equally represented in the 3'-UTRs. Tetra- and pentanucleotide motifs were more frequent in the 3'-UTRs than in either the CDSs or the 5'-UTRs (data not shown). Trinucleotide (and hexanucleotide) motifs are expected to predominate in the population of CDS microsatellites, as variation in their repeat number is not associated with a frame shift event . The most abundant dinucleotide repeat was AG/CT, followed by AC/GT, although AT/TA predominated in the 3'-UTRs. Among the trinucleotides, the most frequent was AAG/CTT, followed by ATC/GAT and CAC/GTG (Figure (Figure6).6). This distribution is consistent with the situation in A. thaliana and Brassica spp. orthologs, in which a preference for AG/CT and AAG/CTT motifs has been identified in the 5'-UTRs, thought to be associated with the cis-acting regulation of transcription . In the globe artichoke 5' UTRs, dinucleotide motifs were over-represented, with AG/CT being the most abundant (Figure (Figure6),6), similar to the situation in the 5'-UTRs of many plant (both mono- and dicotyledonous species) genes [31,32], which has been reported to play a role in post-transcriptional gene regulation at the RNA level [33,34]. Dinucleotide motifs were also frequent in the 3'-UTRs, possibly because AT-rich elements are able to act as cis mediators of mRNA turnover . Overall, present data confirm that homopurine/homopyrimidine repeats contribute markedly in 5'-UTR and CDS, as previously reported by Morgante et al .
Microsatellites within coding sequences can have a major effect on gene activity, since the expansion/contraction of triplets within the coding sequence alters the gene product, thereby sometimes causing a significant phenotypic change. In humans, the effects on phenotype due to the presence of SSRs in coding regions of genes playing key roles in neuronal disorders and cancer have been extensively studied . Among the microsatellites in the globe artichoke transcriptome, the six most frequent amino acid stretches present in the CDS were poly-serine (94 unigenes), poly-aspartic acid (75 unigenes), poly-glutamic acid (57 unigenes), poly-lysine (46 unigenes), poly-glycine (45 unigenes) and poly-threonine (35 unigenes). It has been reported that particular amino acid repeats tend to be associated with specific classes of proteins . Acidic and polar amino acid repeats have generally been associated with transcription factors and protein kinases, whereas serine repeats are common within membrane transporter proteins . In the globe artichoke, poly-serine and poly-glycine stretches are particularly frequent in the CDS. Poly-serine linkers are common in eukaryotic genomes, and are thought to provide a flexible interdomain. They are frequently associated with modular proteins, and are involved in complex carbohydrate degradation  and the binding of proteins with extracellular matrix components, such as the laminin binding protein. Poly-glycine (also poly-asparagine and poly-proline) microsatellites may provide a domain for DNA binding or protein-protein interactions, and has been found to be necessary for chloroplast envelope targeting. Poly-glutamic and poly-aspartic acid tracts feature in many NLS (nuclear localisation signal) proteins , and it has been suggested that both basic karyophilic and acidic clusters can enhance their selective binding to transport machinery components . Poly-glutamic acid stretches have also been implicated in transcription activation/de-activation [41-45], and microsatellite allelic variants of these genes have been identified as the genetic basis of a number of human diseases .
To support the occurrence of certain repeated motifs in the assembled unigenes we have exploited sequence alignment and gene ontology in order to annotate their functions and assess whether their motif type and position are preferentially associated with a particular gene ontology term.
In preparation, the set of globe artichoke unigenes was filtered to include only those with a BlastX E-value of < 1e-29 when matched with the A. thaliana reference protein set. In all, 12,783 queries satisfied this criterion (Additional file 6: 12,783 globe artichoke unigenes annotation). The A. thaliana gene accession numbers were used to categorize the unigenes using TAIR gene ontology (data not shown). The GoStat2 web interface was then used to identify gene ontology categories which were over-represented. By comparing either the set of microsatellite-containing unigenes, or subsets of it (e.g.: genes including di- or trinucleotide motifs in their CDS or UTRs) with the complete set of annotated unigenes, it was possible to identify over-representation in gene ontology (GO) categories (Table (Table3).3). Microsatellites appeared to be over-represented in loci involved in certain biological processes and functions, while no significant association was found with GO cell components (data not shown).
Most of the unigenes containing trinucleotide motifs in their CDS were associated with nucleic acid metabolic processes (GO:0006139), transcription (GO:0006350) and the regulation of transcription (GO:0006355), consistent with the encoding by the GAT trinucleotide of aspartic acid, since stretches of this residue are characteristic of 'karyophilic' acidic clusters in NLS (nuclear localization signal) proteins. Similarly, the AAG/TTC motif, which occurred frequently in the CDS, encodes poly-glutamate, which is thought to be involved in both protein-DNA complex assembly (GO:0065004) and heterocyclic metabolic processes (GO:0046483). Unigenes carrying dinucleotide motifs in their CDS were found to be specifically associated with the response to stimulus (GO:0050896). The AG/CT repeats in the CDSs were over-represented among genes responding to stress (GO:0006950), involved in DNA repair (GO:0006281) and in nucleic acid binding (GO:0003676). This is consistent with the presence of domains involved in protein-RNA/protein-protein sticky interactions.
The commonest microsatellite motifs occurring in 5'-UTR of unigenes were dinucleotide repeats (mostly AG/CT). These unigenes were associated with nucleic acid metabolism (GO:0006139), the regulation of gene expression (GO:0010468), transcription (GO:0006350) and the regulation of transcription (GO:0006355). AG/CT repeats were also over-represented in genes involved in the response to ABA (GO:0009737 and GO:0009738). Moreover, trans-acting elements (GO:0003700: "transcription factor activity"), which show an over-representation of trinucleotidic (ATC/GAT) in their CDSs, were also frequently enriched in their 5'UTRs by AG/CT motifs, suggesting a cascade of signal transmission. Trinucleotide motifs were not common in the 5'-UTRs, except in genes involved in the response to auxin stimulus (GO:0009733).
We have demonstrated here the utility of a set of de novo globe artichoke EST-based microsatellite markers for the definition of genetic diversity, phylogeny and genetic mapping. Since EST microsatellites lie within expressed sequences, they have the potential to represent perfect markers for genes underlying phenotypic variation. Most of these assays are fully transferable to other C. cardunculus taxa, providing anchor points for the integration of taxon-specific genetic maps. The functional annotation of these EST sequences increases their utility as a source of gene-based markers for the study of synteny and other applications.
A collection of 36,321 EST, generated from the 'Green Globe' variety of C. cardunculus var. scolymus, as part of the output of the Compositae Genome Project http://compgenomics.ucdavis.edu, was downloaded from the NCBI database http://www.ncbi.nlm.nih.gov. To generate a set of unique assemblies, the sequences were first trimmed to remove any remaining vector fragments and polyA tails, using the perl script SeqCleaner, and assembled adopting a second perl script, TGICL, employing the following parameters: p = 95 (identity percentage), l = 40 (minimum overlap length), v = 10 (maximum length of unmatched overhangs); the maximum mismatch overhang was set to 10 bp, since the sequences had already been purged of vector stretches and polyA tails. The two scripts are available at http://compbio.dfci.harvard.edu/tgi/software. The unigene set was then searched for perfect microsatellite sequences using a modified SSRIT perl script , with the minimum number of dinucleotides set as five, of tri-, tetra- and penta-nucleotides set as four, and of hexanucleotides as three. A sample of 300 non redundant microsatellite-containing sequences, selected to include the longer microsatellite motifs, was taken forward for PCR screening. Primer design was carried out using BatchPrimer3 http://probes.pw.usda.gov/batchprimer with an optimal GC content of 50%, a maximum melting temperature difference of 3°C, variable amplicon size (to allow multiplexing), and all other parameters set to default values. The de novo microsatellite markers were prefixed with 'CyEM' (Cynara Expressed Microsatellite) and numbered sequentially.
DNA was extracted from young leaves following a modified CTAB method . The primers were used to amplify genomic DNA template extracted from a germplasm panel consisting of twelve globe artichoke genotypes, representative of crops grown in the Mediterranean Basin ; nine cultivated cardoon genotypes, representative of both the Spanish and Italian gene pools ; and seven wild cardoon genotypes sampled from both Sicily and Sardinia . Full genotypes details are reported in Table Table1.1. The set also included DNA of the four parents of three established mapping populations, i.e. two globe artichoke accessions ['Romanesco C3' (C3) and 'Spinoso di Palermo' (SP)], one cultivated cardoon ('Altilis 41') and one wild cardoon ('Creta 4'); furthermore six F1 individuals from each of the segregating populations (C3 × SP, C3 × Altilis 41 and C3 × Creta 4) were included in the analyses.
Primer pairs CyEM-001 to CyEM-300 (Additional file 4: primer pairs designed) were tested for their informativeness on the germplasm panel. Amplification was carried out in 10 μl reactions containing 7 ng template DNA, 1× PCR Buffer (Qiagen Inc., Venlo, Netherlands), 1.0 mM MgCl2, 0.5 U Taq DNA polymerase (Qiagen), 40 nM 5'-labelled (FAM, HEX or TAMRA) forward primer, 40 nM unlabelled reverse primer and 0.2 mM dNTPs. A touchdown cycling regime was applied, consisting of 1 cycle at 94°C for 150 sec, 9 cycles at 94°C for 30 sec, 63°C for 30 sec (-0,7°C/cycle) and 72°C for 60 sec, then 30 cycles at 94°C for 30 sec, 57°C for 30 sec and 72°C for 60 sec, followed by a final extension at 72°C for 5 min.
Weakly amplified reactions were re-run using 1.5 mM MgCl2 and applying a final annealing temperature of 55°C. Amplicons were separated on an ABI3730 capillary DNA sequencer (Applied Biosystem Inc., Foster City, CA, USA). Internal ROX-labelled GS500 size standards were included in each capillary. Fragment data were analysed using GeneMapper v3.5 software (Applied Biosystems). The genotypic data were analysed using the GenAlex Excel package . Genetic diversity was calculated separately for the globe artichoke, cultivated cardoon and wild cardoon genotypes on the basis of (1) the mean number of alleles observed per locus (no), (2) the effective number of alleles per locus (ne) as predicted by 1/Σpi2 where pi is the frequency of the ith allele at the locus, (3) the mean observed heterozygosity (Ho), and (4) the polymorphic information content (PIC), estimated following . A co-phenetic distance matrix for co-dominant markers was generated as described by Smouse and Peakall  and used to construct a UPGMA-based dendrogram  by means of NTSYS software package v2.10 . Principal co-ordinate analysis was based on the distance matrix, with data standardization provided by the GenAlex package.
The unigene set was aligned by a BlastX  search against the A. thaliana reference proteins database (NCBI), applying an E-value threshold of e-29. The location within the gene sequence of the microsatellite (5'-UTR, CDS or 3'-UTR) was inferred from this alignment, while ORF-Predictor  was used to predict the position of the stop codon. The frequencies of peptide repeat tracts within the CDS were used to identify any over-representation of particular triplets. For this purpose, the unigenes were divided into ten subgroups on the basis of the identity of the trinucleotide motif present in the CDS. Each subgroup was then subjected to an analysis based on the ORF-Predictor algorithm, considering only the positive reading frames (+1, +2, +3), since the sequenced transcripts were originally directionally cloned. All the unigenes were assigned a function based on the GeneOntology tool http://www.arabidopsis.org, using the A. thaliana orthologs (from BlastX output) as input (using AGI codes from TAIR). Enrichment within the GO hierarchical levels by mean of different subset of unigenes was estimated using the GoStat2 interface http://gostat.wehi.edu.au/cgi-bin/goStat2.pl based on Fisher's exact test , adopting a threshold p-value of 0.01 and considering terms starting from the 3rd hierarchical level of the DAG (directed acyclic graph; Table Table33).
SL and SK planned and supervised the work. DS were responsible for the in silico analysis of the EST sequence data, primer design and amplification; AA and EP selected the constitution of the C. cardunculus gerplasm panel; DS, AA, EP and CT analysed the data. All the authors contributed to the final version of the manuscript.
EST assembly. The data provided represent the list of 6,621 assembled globe artichoke contigs derived from 23,871 ESTs.
19,055 unigenes. The data provided represent the fasta sequences of the assembled unigenes (contigs and singletons).
ACE assembly file (EagleView available at http://bioinformatics.bc.edu/marthlab/EagleView. Contig representation file parsed from TGICL output file by a customised PERL script.
primer pairs designed. The data provided represent the list of the 2,311 designed primer pairs and loci/SSRs description.
full statistics on 300 SSR-containing loci. The data provided represent the list of 300 selected SSR-containing loci, their allele statistics, polymorphism information, repetitive motif position and gene annotation.
12,783 globe artichoke unigenes annotation. The data provided represent the list of 12,783 globe artichoke unigenes annotation using BlastX (e-value threshold < 1e-29) on Arabidopsis reference protein database (June, 2008).
Globe artichoke ESTs were produced by the Compositae Genome Project http://compgenomics.ucdavis.edu/. This research was supported by grants from: (i) the National Science Foundation Plant Genome Research Program (No. 0421630), (ii) the Georgia Research Alliance, (iii) the University of Georgia Research Foundation, and (iv) by MIPAAF (Ministero delle Politiche Agricole, Alimentari e Forestali - Italy) through the CAR-VARVI ("Valorizzazione di germoplasma di carciofo attraverso la costituzione varietale ed il risanamento da virus") project.
We are grateful to Prof. Giovanni Mauromicale and Dr. Rosario Mauro (D.A.C.P.A. Agronomical Sciences, University of Catania) for providing the C. cardunculus parental accessions and the F1 individuals.