The success of Genome Wide Association Studies in the discovery of sequence variation linked to complex traits in humans has increased interest in high throughput SNP genotyping assays in livestock species. Primary goals are QTL detection and genomic selection. The purpose here was design of a 50–60,000 SNP chip for goats. The success of a moderate density SNP assay depends on reliable bioinformatic SNP detection procedures, the technological success rate of the SNP design, even spacing of SNPs on the genome and selection of Minor Allele Frequencies (MAF) suitable to use in diverse breeds. Through the federation of three SNP discovery projects consolidated as the International Goat Genome Consortium, we have identified approximately twelve million high quality SNP variants in the goat genome stored in a database together with their biological and technical characteristics. These SNPs were identified within and between six breeds (meat, milk and mixed): Alpine, Boer, Creole, Katjang, Saanen and Savanna, comprising a total of 97 animals. Whole genome and Reduced Representation Library sequences were aligned on >10 kb scaffolds of the de novo goat genome assembly. The 60,000 selected SNPs, evenly spaced on the goat genome, were submitted for oligo manufacturing (Illumina, Inc) and published in dbSNP along with flanking sequences and map position on goat assemblies (i.e. scaffolds and pseudo-chromosomes), sheep genome V2 and cattle UMD3.1 assembly. Ten breeds were then used to validate the SNP content and 52,295 loci could be successfully genotyped and used to generate a final cluster file. The combined strategy of using mainly whole genome Next Generation Sequencing and mapping on a contig genome assembly, complemented with Illumina design tools proved to be efficient in producing this GoatSNP50 chip. Advances in use of molecular markers are expected to accelerate goat genomic studies in coming years.
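The two selection criteria named above, even spacing and a suitable Minor Allele Frequency, can be illustrated with a minimal sketch. It is not the consortium's procedure (which relied on Illumina design tools and additional technical scores): it simply assumes fixed-size windows along one scaffold and keeps the highest-MAF SNP per window. The function name, window size and MAF threshold are hypothetical.

```python
def select_snps(snps, window, min_maf=0.05):
    """Keep at most one SNP per window of `window` bp on a single scaffold.

    snps: iterable of (position, maf) tuples.
    SNPs below `min_maf` are discarded; within each window the SNP with
    the highest MAF is retained (an assumed tie-breaking rule).
    """
    best = {}
    for pos, maf in snps:
        if maf < min_maf:
            continue
        bin_id = pos // window  # index of the window containing this SNP
        if bin_id not in best or maf > best[bin_id][1]:
            best[bin_id] = (pos, maf)
    # Return the retained SNPs ordered along the scaffold
    return [best[b] for b in sorted(best)]


# Illustrative (made-up) candidate SNPs on one scaffold
candidates = [(100, 0.10), (200, 0.30), (45000, 0.01), (52000, 0.20), (99000, 0.40)]
selected = select_snps(candidates, window=50000)
```

A real design would also need to handle scaffold boundaries and assay design scores, which this sketch ignores.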
For decades, French guinea fowl have been affected by fulminating enteritis of unclear origin. By using metagenomics, we identified a novel avian gammacoronavirus associated with this disease that is distantly related to turkey coronaviruses. Fatal respiratory diseases in humans have recently been caused by coronaviruses of animal origin.
coronavirus; guinea fowl; metagenomics; next-generation sequencing; viruses; France; avian coronavirus; zoonoses; fulminating disease
The regular decrease of female fertility over time is a major concern in modern dairy cattle industry. Only half of this decrease is explained by indirect response to selection on milk production, suggesting the existence of other factors such as embryonic lethal genetic defects. Genomic regions harboring recessive deleterious mutations were detected in three dairy cattle breeds by identifying frequent haplotypes (>1%) showing a deficit in homozygotes among Illumina Bovine 50k Beadchip haplotyping data from the French genomic selection database (47,878 Holstein, 16,833 Montbéliarde, and 11,466 Normande animals). Thirty-four candidate haplotypes (p<10−4) including previously reported regions associated with Brachyspina, CVM, HH1, and HH3 in Holstein breed were identified. Haplotype length varied from 1 to 4.8 Mb and frequencies from 1.7 up to 9%. A significant negative effect on calving rate, consistent in heifers and in lactating cows, was observed for 9 of these haplotypes in matings between carrier bulls and daughters of carrier sires, confirming their association with embryonic lethal mutations. Eight regions were further investigated using whole genome sequencing data from heterozygous bull carriers and control animals (45 animals in total). Six strong candidate causative mutations including polymorphisms previously reported in FANCI (Brachyspina), SLC35A3 (CVM), APAF1 (HH1) and three novel mutations with very damaging effect on the protein structure, according to SIFT and Polyphen-2, were detected in GART, SHBG and SLC37A2 genes. In conclusion, this study reveals a yet hidden consequence of the important inbreeding rate observed in intensively selected and specialized cattle breeds. Counter-selection of these mutations and management of matings will have positive consequences on female fertility in dairy cattle.
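The homozygote-deficit scan described above can be sketched as follows. This is a simplified illustration, not the study's method: it assumes random mating, so that the expected number of homozygotes among N haplotyped animals is N·f² for a haplotype of frequency f, and it treats the observed count as Poisson distributed. The study may have derived its expectations differently (for example from the carrier status of sires and maternal grandsires). Function names and the 2% frequency are hypothetical.

```python
import math


def expected_homozygotes(n_animals, hap_freq):
    """Expected homozygote count under random mating (assumption): N * f^2."""
    return n_animals * hap_freq ** 2


def poisson_cdf(k, mean):
    """P(X <= k) for X ~ Poisson(mean), accumulated term by term."""
    term = math.exp(-mean)
    total = term
    for i in range(1, k + 1):
        term *= mean / i
        total += term
    return total


def homozygote_deficit_pvalue(n_animals, hap_freq, observed):
    """Probability of observing `observed` or fewer homozygotes by chance."""
    return poisson_cdf(observed, expected_homozygotes(n_animals, hap_freq))


# 47,878 animals is the Holstein sample size quoted in the text;
# the 2% haplotype frequency and zero observed homozygotes are illustrative.
expected = expected_homozygotes(47878, 0.02)
p_value = homozygote_deficit_pvalue(47878, 0.02, 0)
```

With roughly 19 homozygotes expected and none observed, the resulting probability falls far below the 10−4 threshold quoted above, which is the kind of signal the scan looks for.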
Despite massive research efforts, the molecular etiology of bovine polledness and the developmental pathways involved in horn ontogenesis are still poorly understood. In a recent article, we provided evidence for the existence of at least two different alleles at the Polled locus and identified candidate mutations for each of them. None of these mutations was located in known coding or regulatory regions, thus adding to the complexity of understanding the molecular basis of polledness. We confirm previous results here and exhaustively identify the causative mutation for the Celtic allele (PC) and four candidate mutations for the Friesian allele (PF). We describe a previously unreported eyelash-and-eyelid phenotype associated with regular polledness, and present unique histological and gene expression data on bovine horn bud differentiation in fetuses affected by three different horn defect syndromes, as well as in wild-type controls. We propose the ectopic expression of a lincRNA in PC/p horn buds as a probable cause of horn bud agenesis. In addition, we provide evidence for an involvement of OLIG2, FOXL2 and RXFP2 in horn bud differentiation, and draw a first link between bovine, ovine and caprine Polled loci. Our results represent a first and important step in understanding the genetic pathways and key processes involved in horn bud differentiation in Bovidae.
Genetic information based on molecular markers has increasingly been used in cattle breeding improvement programmes, as a means to improve on conventional phenotypic selection. Advances in molecular genetics have led to the identification of several genetic markers associated with genes affecting economic traits. Until recently, the identification of the causative genetic variants involved in the phenotypes of interest has remained a difficult task. The advent of novel sequencing technologies now offers a new opportunity for the identification of such variants. Despite sequencing costs plummeting, sequencing whole genomes or large targeted regions is still too expensive for most laboratories. A transcriptomic-based sequencing approach offers a cheaper alternative to identify a large number of polymorphisms and possibly to discover causative variants. In the present study, we performed a gene-based single nucleotide polymorphism (SNP) discovery analysis in bovine Longissimus thoracis, using RNA-Seq. To our knowledge, this represents the first such study done in bovine muscle.
Messenger RNAs from Longissimus thoracis from three Limousin bull calves were subjected to high-throughput sequencing. Approximately 36–46 million paired-end reads were obtained per library. A total of 19,752 transcripts were identified and 34,376 different SNPs were detected. Fifty-five percent of the SNPs were found in coding regions and ~22% resulted in an amino acid change. Applying a very stringent SNP quality threshold, we detected 8,407 different high-confidence SNPs, 18% of which are non synonymous coding SNPs. To analyse the accuracy of RNA-Seq technology for SNP detection, 48 SNPs were selected for validation by genotyping. No discrepancies were observed when using the highest SNP probability threshold. To test the usefulness of the identified SNPs, the 48 selected SNPs were assessed by genotyping 93 bovine samples, representing mostly the nine major breeds used in France. Principal component analysis indicates a clear separation between the nine populations.
The RNA-Seq data and the collection of newly discovered coding SNPs improve the genomic resources available for cattle, especially for beef breeds. The large amount of variation present in genes expressed in Limousin Longissimus thoracis, especially the large number of non synonymous coding SNPs, may prove useful to study the mechanisms underlying the genetic variability of meat quality traits.
Single Nucleotide Polymorphism; Cattle; Muscle; RNA-Seq; Beef; Non synonymous coding variants
The availability of a large expressed sequence tags (EST) resource and recent advances in high-throughput genotyping technology have made it possible to develop highly multiplexed SNP arrays for multi-objective genetic applications, including the construction of meiotic maps. Such approaches are particularly useful in species with a large genome size, precluding the use of whole-genome shotgun assembly with current technologies.
In this study, a 12 k-SNP genotyping array was developed for maritime pine from an extensive EST resource assembled into a unigene set. The offspring of three-generation outbred and inbred mapping pedigrees were then genotyped. The inbred pedigree consisted of a classical F2 population resulting from the selfing of a single inter-provenance (Landes x Corsica) hybrid tree, whereas the outbred pedigree (G2) resulted from a controlled cross of two intra-provenance (Landes x Landes) hybrid trees. This resulted in the generation of three linkage maps based on SNP markers: one from the parental genotype of the F2 population (1,131 markers in 1,708 centimorgan (cM)), and one for each parent of the G2 population (1,015 and 1,110 markers in 1,447 and 1,425 cM for the female and male parents, respectively). A comparison of segregation patterns in the progeny obtained from the two types of mating (inbreeding and outbreeding) led to the identification of a chromosomal region carrying an embryo viability locus with a semi-lethal allele. Following selfing and segregation, zygote mortality resulted in a deficit of Corsican homozygous genotypes in the F2 population. This dataset was also used to study the extent and distribution of meiotic recombination along the length of the chromosomes and the effect of sex and/or genetic background on recombination. The genetic background of trees in which meiotic recombination occurred was found to have a significant effect on the frequency of recombination. Furthermore, only a small proportion of the recombination hot- and cold-spots were common to all three genotypes, suggesting that the spatial pattern of recombination was genetically variable.
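The segregation distortion mentioned above (a deficit of one homozygous class in the F2) can be illustrated with a minimal sketch. It assumes a standard chi-square goodness-of-fit test of genotype counts against the 1:2:1 Mendelian ratio; the abstract does not state which test was actually applied, and the genotype counts below are invented for illustration.

```python
import math


def f2_segregation_test(n_aa, n_ab, n_bb):
    """Chi-square goodness-of-fit of F2 genotype counts to a 1:2:1 ratio.

    Returns (chi2, p_value). With 2 degrees of freedom the chi-square
    survival function has the closed form exp(-x / 2).
    """
    total = n_aa + n_ab + n_bb
    expected = (total / 4, total / 2, total / 4)
    observed = (n_aa, n_ab, n_bb)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    p_value = math.exp(-chi2 / 2)
    return chi2, p_value


# Hypothetical marker: strong deficit of the second homozygous class
chi2, p_value = f2_segregation_test(40, 50, 10)
```

Markers with a small p-value that cluster in one chromosomal region would point to a viability locus of the kind described above.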
This study led to the development of classical genomic tools for this ecologically and economically important species. It also identified a chromosomal region bearing a semi-lethal recessive allele and demonstrated the genetic variability of recombination rate over the genome.
Unigene; SNP array; Linkage mapping; Segregation distortion; Recombination; Maritime pine; Pinus pinaster
Adaptation of avian influenza viruses (AIVs) from waterfowl to domestic poultry with a deletion in the neuraminidase (NA) stalk has already been reported. The way the virus undergoes this evolution, however, is thus far unclear. We address this question using pyrosequencing of duck and turkey low-pathogenicity AIVs. Ducks and turkeys were sampled at the very beginning of an H6N1 outbreak, and turkeys were swabbed again 8 days later. NA stalk deletions were evidenced in turkeys by Sanger sequencing. To further investigate viral evolution, 454 pyrosequencing was performed: for each set of samples, up to 41,500 reads of ca. 400 bp were generated and aligned. Genetic polymorphisms between duck and turkey viruses were tracked on the whole genome. NA deletion was detected in less than 2% of reads in duck feces but in 100% of reads in turkey tracheal specimens collected at the same time. Further variations in length were observed in NA from turkeys 8 days later. Similarly, minority mutants emerged on the hemagglutinin (HA) gene, with substitutions mostly in the receptor binding site on the globular head. These critical changes suggest a strong evolutionary pressure in turkeys. The increasing performances of next-generation sequencing technologies should enable us to monitor the genomic diversity of avian influenza viruses and early emergence of potentially pathogenic variants within bird flocks. The present study, based on 454 pyrosequencing, suggests that NA deletion, an example of AIV adaptation from waterfowl to domestic poultry, occurs by selection rather than de novo emergence of viral mutants.
Polled and Multisystemic Syndrome (PMS) is a novel developmental disorder occurring in the progeny of a single bull. Its clinical spectrum includes polledness (complete agenesis of horns), facial dysmorphism, growth delay, chronic diarrhea, premature ovarian failure, and variable neurological and cardiac anomalies. PMS is also characterized by a deviation of the sex-ratio, suggesting male lethality during pregnancy. Using Mendelian error mapping and whole-genome sequencing, we identified a 3.7 Mb deletion on the paternal bovine chromosome 2 encompassing ARHGAP15, GTDC1 and ZEB2 genes. We then produced control and affected 90-day old fetuses to characterize this syndrome by histological and expression analyses. Compared to wild type individuals, affected animals showed a decreased expression of the three deleted genes. Based on a comparison with human Mowat-Wilson syndrome, we suggest that deletion of ZEB2 is responsible for most of the effects of the mutation. Finally, sperm-FISH, embryo genotyping and analysis of reproduction records confirmed somatic mosaicism in the founder bull and male-specific lethality during the first third of gestation. In conclusion, we identified a novel locus involved in bovid horn ontogenesis and suggest that epithelial-to-mesenchymal transition plays a critical role in horn bud differentiation. We also provide new insights into the pathogenicity of ZEB2 loss of heterozygosity in bovine and humans and describe the first case of male-specific lethality associated with an autosomal locus in a non-murine mammalian species. This result sets PMS as a unique model to study sex-specific gene expression/regulation.
As for other non-model species, genetic analyses in quail will benefit greatly from a higher marker density, now attainable thanks to the evolution of sequencing and genotyping technologies. Our objective was to obtain the first genome wide panel of Japanese quail SNP (Single Nucleotide Polymorphism) and to use it for the fine mapping of a QTL for a fear-related behaviour, namely tonic immobility, previously localized on Coturnix japonica chromosome 1. To this aim, two reduced representations of the genome were analysed through high-throughput 454 sequencing: AFLP (Amplified Fragment Length Polymorphism) fragments as representatives of genomic DNA, and EST (Expressed Sequence Tag) as representatives of the transcriptome.
The sequencing runs produced 399,189 and 1,106,762 sequence reads from cDNA and genomic fragments, respectively. They covered over 434 Mb of sequence in total and allowed us to detect 17,433 putative SNP. Among them, 384 were used to genotype two Advanced Intercross Lines (AIL) obtained from three quail lines differing for duration of tonic immobility. Despite the absence of genotyping for founder individuals in the analysis, the previously identified candidate region on chromosome 1 was refined and led to the identification of a candidate gene.
These data confirm the efficiency of transcript and AFLP-sequencing for SNP discovery in a non-model species, and its application to the fine mapping of a complex trait. Our results reveal a significant association of duration of tonic immobility with a genomic region comprising the DMD (dystrophin) gene. Further characterization of this candidate gene is needed to decipher its putative role in tonic immobility in Coturnix.
Quail; Tonic immobility; Sequencing; AFLP; Transcripts; SNP; AIL
Next generation sequencing platforms are now well established in sequencing centres and some laboratories. Upcoming smaller scale machines such as the 454 junior from Roche or the MiSeq from Illumina will increase the number of laboratories hosting a sequencer. In such a context, it is important to provide these teams with an easily manageable environment to store and process the produced reads.
We describe a user-friendly information system able to manage large sets of sequencing data. It includes, on one hand, a workflow environment already containing pipelines adapted to different input formats (sff, fasta, fastq and qseq), different sequencers (Roche 454, Illumina HiSeq) and various analyses (quality control, assembly, alignment, diversity studies,…) and, on the other hand, a secured web site giving access to the results. The connected user will be able to download raw and processed data and browse through the analysis result statistics. The provided workflows can easily be modified or extended and new ones can be added. Ergatis is used as a workflow building, running and monitoring system. The analyses can be run locally or in a cluster environment using Sun Grid Engine.
NG6 is a complete information system designed to answer the needs of a sequencing platform. It provides a user-friendly interface to process, store and download high-throughput sequencing data.
Leuconostoc citreum is a key microorganism in fermented foods of plant origin. Here we report the draft genome sequence for three strains of Leuconostoc citreum, LBAE C10, LBAE C11, and LBAE E16, which have been isolated from traditional French wheat sourdoughs.
Weissella confusa is a rod-shaped heterofermentative lactic acid bacterium from the family of Leuconostocaceae. Here we report the draft genome sequence of the strain W. confusa LBAE C39-2 isolated from a traditional French wheat sourdough.
The gut microbiota, which is considered a causal factor in metabolic diseases as shown best in animals, is under the dual influence of the host genome and nutritional environment. This study investigated whether the gut microbiota per se, aside from changes in genetic background and diet, could be a signature of different metabolic phenotypes in mice.
The unique animal model of metabolic adaptation was used, whereby C57Bl/6 male mice fed a high-fat carbohydrate-free diet (HFD) became either diabetic (HFD diabetic, HFD-D) or resisted diabetes (HFD diabetes-resistant, HFD-DR). Pyrosequencing of the gut microbiota was carried out to profile the gut microbial community of different metabolic phenotypes. Inflammation, gut permeability, features of white adipose tissue, liver and skeletal muscle were studied. Furthermore, to modify the gut microbiota directly, an additional group of mice was given a gluco-oligosaccharide (GOS)-supplemented HFD (HFD+GOS).
Despite the mice having the same genetic background and nutritional status, a gut microbial profile specific to each metabolic phenotype was identified. The HFD-D gut microbial profile was associated with increased gut permeability linked to increased endotoxaemia and to a dramatic increase in cell number in the stroma vascular fraction from visceral white adipose tissue. Most of the physiological characteristics of the HFD-fed mice were modulated when gut microbiota was intentionally modified by GOS dietary fibres.
The gut microbiota is a signature of the metabolic phenotypes independent of differences in host genetic background and diet.
Gut microbes pyrosequencing; metabolic heterogeneity; high-fat diet responsiveness; type 2 diabetes; bacterial translocation; intestinal barrier function; intestinal bacteria; bone marrow transplantation; diabetes mellitus; gastrointestinal physiology
In a context of climate change, phenotypic plasticity provides long-lived species, such as trees, with the means to adapt to environmental variations occurring within a single generation. In eucalyptus plantations, water availability is a key factor limiting productivity. However, the molecular mechanisms underlying the adaptation of eucalyptus to water shortage remain unclear. In this study, we compared the molecular responses of two commercial eucalyptus hybrids during the dry season. Both hybrids differ in productivity when grown under water deficit.
Pyrosequencing of RNA extracted from shoot apices provided extensive transcriptome coverage - a catalog of 129,993 unigenes (49,748 contigs and 80,245 singletons) was generated from 398 million base pairs, or 1.14 million reads. The pyrosequencing data enriched considerably existing Eucalyptus EST collections, adding 36,985 unigenes not previously represented. Digital analysis of read abundance in 14,460 contigs identified 1,280 that were differentially expressed between the two genotypes, 155 contigs showing differential expression between treatments (irrigated vs. non irrigated conditions during the dry season), and 274 contigs with significant genotype-by-treatment interaction. The more productive genotype displayed a larger set of genes responding to water stress. Moreover, stress signal transduction seemed to involve different pathways in the two genotypes, suggesting that water shortage induces distinct cellular stress cascades. Similarly, the response of functional proteins also varied widely between genotypes: the most productive genotype decreased expression of genes related to photosystem, transport and secondary metabolism, whereas genes related to primary metabolism and cell organisation were over-expressed.
For the most productive genotype, the ability to express a broader set of genes in response to water availability appears to be a key characteristic in the maintenance of biomass growth during the dry season. Its strategy may involve a decrease of photosynthetic activity during the dry season associated with resources reallocation through major changes in the expression of primary metabolism associated genes. Further efforts will be needed to assess the adaptive nature of the genes highlighted in this study.
Expression microarrays are commonly used to study transcriptomes. Most of the arrays are now based on oligo-nucleotide probes. Probe design being a tedious task, it often takes place once at the beginning of the project. The oligo set is then used for several years. During this time period, the knowledge gathered by the community on the genome and the transcriptome increases and gets more precise. Therefore re-annotating the set is essential to supply the biologists with up-to-date annotations. SigReannot-mart is a query environment populated with regularly updated annotations for different oligo sets. It stores the results of the SigReannot pipeline that has mainly been used on farm and aquaculture species. It permits easy extraction in different formats using filters. It is used to compare probe sets on different criteria, to choose the set for a given experiment, or to mix probe sets in order to create a new one.
Database URL: http://sigreannot-mart.toulouse.inra.fr/
The Roche 454 pyrosequencing platform is often considered the most versatile of the Next Generation Sequencing technology platforms, permitting the sequencing of large genomes, the analysis of variations or the study of transcriptomes. A recently reported bias leads to the production of multiple reads for a unique DNA fragment in a random manner within a run. This bias has a direct impact on the quality of the measurement of the representation of the fragments using the reads. Other cleaning steps are usually performed on the reads before assembly or alignment.
PyroCleaner is a software module intended to clean 454 pyrosequencing reads in order to ease the assembly process. This program is free software and is distributed under the terms of the GNU General Public License as published by the Free Software Foundation. It implements several filters using criteria such as read duplication, length, complexity, base-pair quality and number of undetermined bases. It also allows cleaning of flowgram files (.sff) of paired-end sequences, generating on the one hand a file of validated paired-ends and on the other hand a file of single reads.
Read cleaning has always been an important step in sequence analysis. The pyrocleaner python module is a Swiss Army knife dedicated to 454 read cleaning. It includes commonly used filters as well as specialised ones such as duplicated read removal and paired-end read verification.
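The filter criteria listed above can be illustrated with a minimal sketch. This is not PyroCleaner's code or interface: the function name, parameters and default thresholds are hypothetical, reads are assumed to arrive as (name, sequence, qualities) tuples, duplicates are detected only as exact sequence matches, and the complexity filter is omitted for brevity.

```python
def clean_reads(reads, min_len=100, max_len=600,
                max_n_fraction=0.02, min_mean_quality=20):
    """Filter reads on length, undetermined bases, mean quality and duplication.

    reads: iterable of (name, sequence, qualities), where `qualities`
    is a list of per-base scores with the same length as `sequence`.
    Returns the list of reads that pass every filter, in input order.
    """
    seen = set()
    kept = []
    for name, seq, quals in reads:
        # Length filter (also protects the divisions below from empty reads)
        if not seq or not (min_len <= len(seq) <= max_len):
            continue
        # Undetermined-base filter
        if seq.upper().count("N") / len(seq) > max_n_fraction:
            continue
        # Mean base-quality filter
        if sum(quals) / len(quals) < min_mean_quality:
            continue
        # Duplicate filter: keep only the first copy of each sequence
        if seq in seen:
            continue
        seen.add(seq)
        kept.append((name, seq, quals))
    return kept


# Made-up reads exercising each filter
example_reads = [
    ("r1", "ACGT" * 30, [30] * 120),                # passes
    ("r2", "ACGT" * 30, [30] * 120),                # exact duplicate of r1
    ("r3", "ACGT" * 10, [30] * 40),                 # too short
    ("r4", "N" * 10 + "ACGT" * 30, [30] * 130),     # too many undetermined bases
    ("r5", "TGCA" * 30, [10] * 120),                # low mean quality
]
kept_names = [name for name, _, _ in clean_reads(example_reads)]
```

Applying the cheap per-read filters before the duplicate check keeps low-quality reads out of the set of seen sequences; a real tool may order or define these steps differently.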
Gene expression profiling studies of mastitis in ruminants have provided key but fragmented knowledge for the understanding of the disease. A systematic combination of different expression profiling studies via meta-analysis techniques has the potential to test the extensibility of conclusions based on single studies. Using the program Pointillist, we performed meta-analysis of transcription-profiling data from six independent studies of infections with mammary gland pathogens, including samples from cattle challenged in vivo with S. aureus, E. coli, and S. uberis, samples from goats challenged in vivo with S. aureus, as well as cattle macrophages and ovine dendritic cells infected in vitro with S. aureus. We combined different time points from those studies, testing different responses to mastitis infection: overall (common signature), early stage, late stage, and cattle-specific.
Ingenuity Pathway Analysis of affected genes showed that the four meta-analysis combinations share biological functions and pathways (e.g. protein ubiquitination and polyamine regulation) which are intrinsic to the general disease response. In the overall response, pathways related to immune response and inflammation, as well as biological functions related to lipid metabolism were altered. This latter observation is consistent with the milk fat content depression commonly observed during mastitis infection. Complementarities between early and late stage responses were found, with a prominence of metabolic and stress signals in the early stage and of the immune response related to the lipid metabolism in the late stage; both mechanisms apparently modulated by few genes, including XBP1 and SREBF1.
The cattle-specific response was characterized by alteration of the immune response and by modification of lipid metabolism. Comparison of E. coli and S. aureus infections in cattle in vivo revealed that affected genes showing opposite regulation had the same altered biological functions and provided evidence that E. coli caused a stronger host response.
This meta-analysis approach reinforces previous findings but also reveals several novel themes, including the involvement of genes, biological functions, and pathways that were not identified in individual studies. As such, it provides an interesting proof of principle for future studies combining information from diverse heterogeneous sources.
Meta-analysis; microarray analysis; mastitis infection; lipid metabolism; immune response
The Fagaceae family comprises about 1,000 woody species worldwide. About half belong to the genus Quercus. These oaks are often a source of raw material for biomass wood and fiber. Pedunculate and sessile oaks are among the most important deciduous forest tree species in Europe. Despite their ecological and economical importance, very few genomic resources have yet been generated for these species. Here, we describe the development of an EST catalogue that will support ecosystem genomics studies, where geneticists, ecophysiologists, molecular biologists and ecologists join their efforts for understanding, monitoring and predicting functional genetic diversity.
We generated 145,827 sequence reads from 20 cDNA libraries using the Sanger method. Unexploitable chromatograms and quality checking led us to eliminate 19,941 sequences. Finally a total of 125,925 ESTs were retained from 111,361 cDNA clones. Pyrosequencing was also conducted for 14 libraries, generating 1,948,579 reads, from which 370,566 sequences (19.0%) were eliminated, resulting in 1,578,192 sequences. Following clustering and assembly using the TGICL pipeline, 1,704,117 EST sequences collapsed into 69,154 tentative contigs and 153,517 singletons, providing 222,671 non-redundant sequences (including alternative transcripts). We also assembled the sequences using MIRA and PartiGene software and compared the three unigene sets. Gene ontology annotation was then assigned to 29,303 unigene elements. Blast search against the SWISS-PROT database revealed putative homologs for 32,810 (14.7%) unigene elements, but more extensive search with Pfam, Refseq_protein, Refseq_RNA and eight gene indices revealed homology for 67.4% of them. The EST catalogue was examined for putative homologs of candidate genes involved in bud phenology, cuticle formation, phenylpropanoids biosynthesis and cell wall formation. Our results suggest a good coverage of genes involved in these traits. Comparative orthologous sequences (COS) with other plant gene models were identified and allowed us to unravel the oak paleo-history. Simple sequence repeats (SSRs) and single nucleotide polymorphisms (SNPs) were searched, resulting in 52,834 SSRs and 36,411 SNPs. All of these are available through the Oak Contig Browser http://genotoul-contigbrowser.toulouse.inra.fr:9092/Quercus_robur/index.html.
This genomic resource provides a unique tool to discover genes of interest, study the oak transcriptome, and develop new markers to investigate functional diversity in natural populations.
SNP (Single Nucleotide Polymorphism) discovery is now routinely performed using high-throughput sequencing of reduced representation libraries. Our objective was to adapt 454 GS FLX based sequencing methodologies in order to obtain the largest possible dataset from two reduced representation libraries, produced by AFLP (Amplified Fragment Length Polymorphism) for genomic DNA, and EST (Expressed Sequence Tag) for the transcribed fraction of the genome.
The expressed fraction was obtained by preparing cDNA libraries without PCR amplification from quail embryo and brain. To optimize the information content for SNP analyses, libraries were prepared from individuals selected in three quail lines and each individual in the AFLP library was tagged. Sequencing runs produced 399,189 sequence reads from cDNA and 373,484 from genomic fragments, covering close to 250 Mb of sequence in total.
Both methods used to obtain reduced representations for high-throughput sequencing were successful after several improvements.
The protocols may be used for several sequencing applications, such as de novo sequencing, tagged PCR fragments or long fragment sequencing of cDNA.
Although bivalves are among the most-studied marine organisms because of their ecological role and economic importance, very little information is available on the genome sequences of oyster species. This report documents three large-scale cDNA sequencing projects for the Pacific oyster Crassostrea gigas initiated to provide a large number of expressed sequence tags that were subsequently compiled in a publicly accessible database. This resource allowed for the identification of a large number of transcripts and provides valuable information for ongoing investigations of tissue-specific and stimulus-dependent gene expression patterns. These data are crucial for constructing comprehensive DNA microarrays, identifying single nucleotide polymorphisms and microsatellites in coding regions, and for identifying genes when the entire genome sequence of C. gigas becomes available.
In the present paper, we report the production of 40,845 high-quality ESTs that identify 29,745 unique transcribed sequences consisting of 7,940 contigs and 21,805 singletons. All of these new sequences, together with existing public sequence data, have been compiled into a publicly-available Website http://public-contigbrowser.sigenae.org:9090/Crassostrea_gigas/index.html. Approximately 43% of the unique ESTs had significant matches against the SwissProt database and 27% were annotated using Gene Ontology terms. In addition, we identified a total of 208 in silico microsatellites from the ESTs, with 173 having sufficient flanking sequence for primer design. We also identified a total of 7,530 putative in silico, single-nucleotide polymorphisms using existing and newly-generated EST resources for the Pacific oyster.
A publicly-available database has been populated with 29,745 unique sequences for the Pacific oyster Crassostrea gigas. The database provides many tools to search cleaned and assembled ESTs. The user may input and submit several filters, such as protein or nucleotide hits, to select and download relevant elements. This database constitutes one of the most developed genomic resources accessible among Lophotrochozoans, an orphan clade of bilateral animals. These data will accelerate the development of both genomics and genetics in a commercially-important species with the highest annual, commercial production of any aquatic organism.
Microarrays are a powerful technology for monitoring tens of thousands of genes in a single experiment. Most microarrays now use oligo-sets. Designing the oligo-nucleotides is time-consuming and error-prone. Genome-wide microarray oligo-sets are designed using as large a set of transcripts as possible in order to monitor as many genes as possible. Depending on the state of genome sequencing and assembly, knowledge of the existing transcripts can vary considerably. This knowledge evolves with successive genome builds and gene builds. Once the design is done, the microarrays are often used for several years. The biologists working in EADGENE expressed the need for up-to-date annotation files for the oligo-sets they share, including information about the orthologous genes of model species, the Gene Ontology, the corresponding pathways and the chromosomal location.
The results of SigReannot on a chicken microarray used in the EADGENE project, compared to the initial annotations, show that 23% of the oligo-nucleotide gene annotations were not confirmed, 2% were modified and 1% were added. The value of this up-to-date annotation procedure is demonstrated through the analysis of previously published real data.
SigReannot uses the oligo-nucleotide design procedure criteria to validate the probe-gene link and the Ensembl transcripts as reference for annotation. It therefore produces a high quality annotation based on reference gene sets.
Reliable annotation linking oligonucleotide probes to target genes is essential for functional biological analysis of microarray experiments. We used the IMAD, OligoRAP and sigReannot pipelines to update the annotation for the ARK-Genomics Chicken 20 K array as part of a joint EADGENE/SABRE workshop. In this manuscript we compare their annotation strategies and results. Furthermore, we analyse the effect of differences in updated annotation on functional analysis for an experiment involving Eimeria-infected chickens, and finally we propose guidelines for optimal annotation strategies.
IMAD, OligoRAP and sigReannot update both annotation and estimated target specificity. The 3 pipelines can assign oligos to target specificity categories although with varying degrees of resolution. Target specificity is judged based on the amount and type of oligo versus target-gene alignments (hits), which are determined by filter thresholds that users can adjust based on their experimental conditions. Linking oligos to annotation on the other hand is based on rigid rules, which differ between pipelines.
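The idea of assigning an oligo to a target specificity category from its filtered hits can be sketched as follows. This is an illustrative example only, not the logic of IMAD, OligoRAP or sigReannot: the `Hit` fields, the category names and the default thresholds (`min_identity`, `min_len`) are assumptions made for the sketch.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """One oligo versus target-gene alignment (hypothetical fields)."""
    gene_id: str
    identity: float    # percent identity of the alignment
    aligned_len: int   # number of aligned bases


def classify_oligo(hits, min_identity=90.0, min_len=50):
    """Assign an oligo to a specificity category from its filtered hits.

    The thresholds are user-adjustable, as in the pipelines described
    above; the default values here are placeholders, not published ones.
    """
    genes = {
        h.gene_id
        for h in hits
        if h.identity >= min_identity and h.aligned_len >= min_len
    }
    if not genes:
        return "no_target", genes
    if len(genes) == 1:
        return "specific", genes
    return "cross_hybridising", genes
```

Tightening `min_identity` or `min_len` moves oligos from the cross-hybridising category toward the specific or no-target categories, which is one way differing thresholds could lead to differing annotations between tools.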
For 52.7% of the oligos from a subset selected for in-depth comparison, all pipelines linked to one or more Ensembl genes, with consensus on 44.0%. In 31.0% of the cases none of the pipelines could assign an Ensembl gene to an oligo, and for the remaining 16.3% the coverage differed between pipelines. Differences in updated annotation were mainly due to different thresholds for hybridisation potential filtering of oligo versus target-gene alignments and different policies for expanding annotation using indirect links. The differences in updated annotation packages had a significant effect on GO term enrichment analysis, with consensus on only 67.2% of the enriched terms.
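A comparison of this kind can be tallied with a few lines of Python. The sketch below is a hedged illustration rather than the workshop's actual analysis: the function name `compare_annotations`, the input layout (one dictionary of oligo-to-gene-set mappings per pipeline) and the choice to express consensus as a percentage of all oligos are assumptions.

```python
def compare_annotations(annotations):
    """Summarise agreement between annotation pipelines.

    annotations: dict mapping pipeline name -> dict mapping oligo id ->
    set of gene ids. Returns percentages over all oligos seen by any
    pipeline (an assumption; the source does not state the denominator
    used for its consensus figure).
    """
    oligos = set().union(*(a.keys() for a in annotations.values()))
    counts = {"all_linked": 0, "consensus": 0, "none_linked": 0, "partial": 0}
    for oligo in oligos:
        gene_sets = [frozenset(a.get(oligo, ())) for a in annotations.values()]
        linked = [bool(g) for g in gene_sets]
        if all(linked):
            counts["all_linked"] += 1
            # Consensus: every pipeline reports exactly the same gene set.
            if len(set(gene_sets)) == 1:
                counts["consensus"] += 1
        elif not any(linked):
            counts["none_linked"] += 1
        else:
            counts["partial"] += 1
    total = len(oligos)
    if total == 0:
        return {key: 0.0 for key in counts}
    return {key: 100.0 * value / total for key, value in counts.items()}
```

In this layout, the `all_linked`, `none_linked` and `partial` percentages sum to 100, matching the three-way split reported above, while `consensus` is a subset of `all_linked`.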
In addition to flexible thresholds to determine target specificity, annotation tools should provide metadata describing the relationships between oligos and the annotation assigned to them. These relationships can then be used to judge the varying degrees of reliability allowing users to fine-tune the balance between reliability and coverage. This is important as it can have a significant effect on functional microarray analysis as exemplified by the lack of consensus on almost one third of the terms found with GO term enrichment analysis based on updated IMAD, OligoRAP or sigReannot annotation.