Twelve cDNA libraries from two species of catfish have been sequenced, resulting in the generation of nearly 500,000 ESTs.
Through the Community Sequencing Program, a catfish EST sequencing project was carried out through a collaboration between the catfish research community and the Department of Energy's Joint Genome Institute. Prior to this project, only a limited EST resource from catfish was available for the purpose of SNP identification.
A total of 438,321 quality ESTs were generated from 8 channel catfish (Ictalurus punctatus) and 4 blue catfish (Ictalurus furcatus) libraries, bringing the number of catfish ESTs to nearly 500,000. Assembly of all catfish ESTs resulted in 45,306 contigs and 66,272 singletons. Over 35% of the unique sequences had significant similarities to known genes, allowing the identification of 14,776 unique genes in catfish. Over 300,000 putative SNPs have been identified, of which approximately 48,000 are high-quality SNPs identified from contigs with at least four sequences and the minor allele presence of at least two sequences in the contig. The EST resource should be valuable for identification of microsatellites, genome annotation, large-scale expression analysis, and comparative genome analysis.
This project generated a large EST resource for catfish that captured the majority of the catfish transcriptome. The parallel analysis of ESTs from two closely related Ictalurid catfishes should also provide powerful means for the evaluation of ancient and recent gene duplications, and for the development of high-density microarrays in catfish. The inter- and intra-specific SNPs identified from all catfish EST dataset assembly will greatly benefit the catfish introgression breeding program and whole genome association studies.
The fish gill, as one of the mucosal barriers, plays an important role in mucosal immune response. The fish swimbladder functions for regulating buoyancy. The fish swimbladder has long been postulated as a homologous organ of the tetrapod lung, but the molecular evidence is scarce. In order to provide new information that is complementary to gill immune genes, initiate new research directions concerning the genetic basis of the gill immune response and understand the molecular function of swimbladder as well as its relationship with lungs, transcriptome analysis of the fugu Takifugu rubripes gill and swimbladder was carried out by RNA-Seq. Approximately 55,061,524 and 44,736,850 raw sequence reads from gill and swimbladder were generated, respectively. Gene ontology (GO) and KEGG pathway analysis revealed diverse biological functions and processes. Transcriptome comparison between gill and swimbladder resulted in 3,790 differentially expressed genes, of which 1,520 were up-regulated in the swimbladder while 2,270 were down-regulated. In addition, 406 up regulated isoforms and 296 down regulated isoforms were observed in swimbladder in comparison to gill. By the gene enrichment analysis, the three immune-related pathways and 32 immune-related genes in gill were identified. In swimbladder, five pathways including 43 swimbladder-enriched genes were identified. This work should set the foundation for studying immune-related genes for the mucosal immunity and provide genomic resources to study the relatedness of the fish swimbladder and mammalian lung.
Comparative mapping is a powerful tool to study evolution of genomes. It allows transfer of genome information from the well-studied model species to non-model species. Catfish is an economically important aquaculture species in United States. A large amount of genome resources have been developed from catfish including genetic linkage maps, physical maps, BAC end sequences (BES), integrated linkage and physical maps using BES-derived markers, physical map contig-specific sequences, and draft genome sequences. Application of such genome resources should allow comparative analysis at the genome scale with several other model fish species.
In this study, we conducted whole genome comparative analysis between channel catfish and four model fish species with fully sequenced genomes, zebrafish, medaka, stickleback and Tetraodon. A total of 517 Mb draft genome sequences of catfish were anchored to its genetic linkage map, which accounted for 62% of the total draft genome sequences. Based on the location of homologous genes, homologous chromosomes were determined among catfish and the four model fish species. A large number of conserved syntenic blocks were identified. Analysis of the syntenic relationships between catfish and the four model fishes supported that the catfish genome is most similar to the genome of zebrafish.
The organization of the catfish genome is similar to that of the four teleost species, zebrafish, medaka, stickleback, and Tetraodon such that homologous chromosomes can be identified. Within each chromosome, extended syntenic blocks were evident, but the conserved syntenies at the chromosome level involve extensive inter-chromosomal and intra-chromosomal rearrangements. This whole genome comparative map should facilitate the whole genome assembly and annotation in catfish, and will be useful for genomic studies of various other fish species.
Catfish; Genome; Comparative mapping; Linkage mapping; Conserved synteny
Calpains, a superfamily of intracellular calcium-dependent cysteine proteases, are involved in the cytoskeletal remodeling and wasting of skeletal muscle. Calpains are generated as inactive proenzymes which are activated by N-terminal autolysis induced by calcium-ions.
In this study, we characterized the full-length cDNA sequences of three calpain genes, clpn1, clpn2, and clpn3 in channel catfish, and assessed the effect of nutrient restriction and subsequent re-feeding on the expression of these genes in skeletal muscle. The clpn1 cDNA sequence encodes a protein of 704 amino acids, Clpn2 of 696 amino acids, and Clpn3 of 741 amino acids. Phylogenetic analysis of deduced amino acid sequences indicate that catfish Clpn1 and Clpn2 share a sequence similarity of 61%; catfish Clpn1 and Clpn3 of 48%, and Clpn2 and Clpn3 of only 45%. The domain structure architectures of all three calpain genes in channel catfish are similar to those of other vertebrates, further supported by strong bootstrap values during phylogenetic analyses. Starvation of channel catfish (average weight, 15–20 g) for 35 days influenced the expression of clpn1 (2.3-fold decrease, P<0.05), clpn2 (1.3-fold increase, P<0.05), and clpn3 (13.0-fold decrease, P<0.05), whereas the subsequent refeeding did not change the expression of these genes as measured by quantitative real-time PCR analysis. Calpain catalytic activity in channel catfish skeletal muscle showed significant differences only during the starvation period, with a 1.2- and 1.4- fold increase (P<0.01) after 17 and 35 days of starvation, respectively.
We have assessed that fasting and refeeding may provide a suitable experimental model to provide us insight into the role of calpains during fish muscle atrophy and how they respond to changes in nutrient supply.
Construction of high-density genetic linkage maps is crucially important for quantitative trait loci (QTL) studies, and they are more useful when integrated with physical maps. Such integrated maps are valuable genome resources for fine mapping of QTL, comparative genomics, and accurate and efficient whole-genome assembly. Previously, we established both linkage maps and a physical map for channel catfish, Ictalurus punctatus, the dominant aquaculture species in the United States. Here we added 2030 BAC end sequence (BES)-derived microsatellites from 1481 physical map contigs, as well as markers from singleton BES, ESTs, anonymous microsatellites, and SNPs, to construct a second-generation linkage map. Average marker density across the 29 linkage groups reached 1.4 cM/marker. The increased marker density highlighted variations in recombination rates within and among catfish chromosomes. This work effectively anchored 44.8% of the catfish BAC physical map contigs, covering ∼52.8% of the genome. The genome size was estimated to be 2546 cM on the linkage map, and the calculated physical distance per centimorgan was 393 Kb. This integrated map should enable comparative studies with teleost model species as well as provide a framework for ordering and assembling whole-genome scaffolds.
catfish; linkage map; physical map; genome; map integration
Single nucleotide polymorphisms (SNPs) have become the marker of choice for genome-wide association studies. In order to provide the best genome coverage for the analysis of performance and production traits, a large number of relatively evenly distributed SNPs are needed. Gene-associated SNPs may fulfill these requirements of large numbers and genome wide distribution. In addition, gene-associated SNPs could themselves be causative SNPs for traits. The objective of this project was to identify large numbers of gene-associated SNPs using high-throughput next generation sequencing.
Transcriptome sequencing was conducted for channel catfish and blue catfish using Illumina next generation sequencing technology. Approximately 220 million reads (15.6 Gb) for channel catfish and 280 million reads (19.6 Gb) for blue catfish were obtained by sequencing gene transcripts derived from various tissues of multiple individuals from a diverse genetic background. A total of over 35 billion base pairs of expressed short read sequences were generated. Over two million putative SNPs were identified from channel catfish and almost 2.5 million putative SNPs were identified from blue catfish. Of these putative SNPs, a set of filtered SNPs were identified including 342,104 intra-specific SNPs for channel catfish, 366,269 intra-specific SNPs for blue catfish, and 420,727 inter-specific SNPs between channel catfish and blue catfish. These filtered SNPs are distributed within 16,562 unique genes in channel catfish and 17,423 unique genes in blue catfish.
For aquaculture species, transcriptome analysis of pooled RNA samples from multiple individuals using Illumina sequencing technology is both technically efficient and cost-effective for generating expressed sequences. Such an approach is most effective when coupled to existing EST resources generated using traditional sequencing approaches because the reference ESTs facilitate effective assembly of the expressed short reads. When multiple individuals with different genetic backgrounds are used, RNA-Seq is very effective for the identification of SNPs. The SNPs identified in this report will provide a much needed resource for genetic studies in catfish and will contribute to the development of a high-density SNP array. Validation and testing of these SNPs using SNP arrays will form the material basis for genome association studies and whole genome-based selection in catfish.
Genome annotation projects, gene functional studies, and phylogenetic analyses for a given organism all greatly benefit from access to a validated full-length cDNA resource. While increasingly common in model species, full-length cDNA resources in aquaculture species are scarce.
Methodology and Principal Findings
Through in silico analysis of catfish (Ictalurus spp.) ESTs, a total of 10,037 channel catfish and 7,382 blue catfish cDNA clones were identified as potentially encoding full-length cDNAs. Of this set, a total of 1,169 channel catfish and 933 blue catfish full-length cDNA clones were selected for re-sequencing to provide additional coverage and ensure sequence accuracy. A total of 1,745 unique gene transcripts were identified from the full-length cDNA set, including 1,064 gene transcripts from channel catfish and 681gene transcripts from blue catfish, with 416 transcripts shared between the two closely related species. Full-length sequence characteristics (ortholog conservation, UTR length, Kozak sequence, and conserved motifs) of the channel and blue catfish were examined in detail. Comparison of gene ontology composition between full-length cDNAs and all catfish ESTs revealed that the full-length cDNA set is representative of the gene diversity encoded in the catfish transcriptome.
This study describes the first catfish full-length cDNA set constructed from several cDNA libraries. The catfish full-length cDNA sequences, and data gleaned from sequence characteristics analysis, will be a valuable resource for ongoing catfish whole-genome sequencing and future gene-based studies of function and evolution in teleost fishes.
Heat shock proteins (HSPs) consist of a large group of chaperones whose expression is induced by high temperature, hypoxia, infection and a number of other stresses. Among all the HSPs, Hsp40 is the largest HSP family, which bind to Hsp70 ATPase domain in assisting protein folding. In this study, we identified 57 hsp40s in channel catfish (Ictalurus punctatus) through in silico analysis using RNA-Seq and genome databases. These genes can be classified into three different types, Type I, II and III, based on their structural similarities. Phylogenetic and syntenic analyses provided strong evidence in supporting the orthologies of these HSPs. Meta-analyses of RNA-Seq datasets were conducted to analyze expression profile of Hsp40s following bacterial infection. Twenty seven hsp40s were found to be significantly up- or down-regulated in the liver after infection with E. ictaluri; 19 hsp40s were found to be significantly regulated in the intestine after infection with E. ictaluri; and 19 hsp40s were found to be significantly regulated in the gill following infection with F. columnare. Altogether, a total of 42 Hsp40 genes were regulated under disease situations involving three tissues and two bacterial infections. The significant regulated expression of Hsp40 genes after bacterial infection suggested their involvement in disease defenses in catfish.
MicroRNAs (miRNAs) are a class of endogenous non-coding small RNA with average length of 22 nucleotides, participating in the post-transcriptional regulation of gene expression. In this study, we report the identification and characterization of miRNAs in the tube foot of sea cucumber (Apostichopus japonicus) by next generation sequencing with Illumina HiSeq 2000 platform. Through the bioinformatic analysis, we identified 260 conserved miRNAs and six novel miRNAs from the tube foot small RNA transcriptome. Quantitative realtime PCR (qRT-PCR) was performed to characterize the specific expression in the tube foot. The results indicated that four miRNAs, including miR-29a, miR-29b, miR-2005 and miR-278-3p, were significantly up-regulated in the tube foot. The target genes of the four specifically expressed miRNAs were predicted in silico and validated by performing qRT-PCR. Gene ontology (GO) and KEGG pathway analyses with the target genes of these four miRNAs were conducted to further understand the regulatory function in the tube foot. This is the first study to profile the miRNA transcriptome of the tube foot in sea cucumber. This work will provide valuable genomic resources to understand the mechanisms of gene regulation in the tube foot, and will be useful to assist the molecular breeding in sea cucumber.
Domestication and selection for important performance traits can impact the genome, which is most often reflected by reduced heterozygosity in and surrounding genes related to traits affected by selection. In this study, analysis of the genomic impact caused by domestication and artificial selection was conducted by investigating the signatures of selection using single nucleotide polymorphisms (SNPs) in channel catfish (Ictalurus punctatus). A total of 8.4 million candidate SNPs were identified by using next generation sequencing. On average, the channel catfish genome harbors one SNP per 116 bp. Approximately 6.6 million, 5.3 million, 4.9 million, 7.1 million and 6.7 million SNPs were detected in the Marion, Thompson, USDA103, Hatchery strain, and wild population, respectively. The allele frequencies of 407,861 SNPs differed significantly between the domestic and wild populations. With these SNPs, 23 genomic regions with putative selective sweeps were identified that included 11 genes. Although the function for the majority of the genes remain unknown in catfish, several genes with known function related to aquaculture performance traits were included in the regions with selective sweeps. These included hypoxia-inducible factor 1β· HIFιβ ¨ and the transporter gene ATP-binding cassette sub-family B member 5 (ABCB5). HIF1β· is important for response to hypoxia and tolerance to low oxygen levels is a critical aquaculture trait. The large numbers of SNPs identified from this study are valuable for the development of high-density SNP arrays for genetic and genomic studies of performance traits in catfish.
Single nucleotide polymorphisms (SNPs) have become the marker of choice for genome-wide association studies in many species. High-throughput sequencing of RNA was developed primarily to analyze global gene expression, while it is an efficient way to discover SNPs from the expressed genes. In this study, we conducted transcriptome sequencing of the swimbladder of Takifugu rubripes using Illumina HiSeq2000 platform to identify gene-associated SNPs in the swimbladder. A total of 30,312,181 unique-mapped-reads were obtained from 44,736,850 raw reads. A total of 62,270 putative SNPs were discovered, which were located in 11,306 expressed genes and 2,246 scaffolds. The average minor allele frequency (MAF) of the SNPs was 0.26. GO and KEGG pathway analysis were conducted to analyze the genes containing SNPs. Validation of selected SNPs revealed that 54% of SNPs (26/48) were true SNPs. The results suggest that RNA-Seq is an efficient and cost-effective approach to discover gene-associated SNPs. In this study, a large number of SNPs were identified and these data will be useful resources for population genetic study, evolution analysis, resource assessment, genetic linkage analysis and genome-wide association studies.
Quantitative traits, such as disease resistance, are most often controlled by a set of genes involving a complex array of regulation. The dissection of genetic basis of quantitative traits requires large numbers of genetic markers with good genome coverage. The application of next-generation sequencing technologies has allowed discovery of over eight million SNPs in catfish, but the challenge remains as to how to efficiently and economically use such SNP resources for genetic analysis.
In this work, we developed a catfish 250K SNP array using Affymetrix Axiom genotyping technology. The SNPs were obtained from multiple sources including gene-associated SNPs, anonymous genomic SNPs, and inter-specific SNPs. A set of 640K high-quality SNPs obtained following specific requirements of array design were submitted. A panel of 250,113 SNPs was finalized for inclusion on the array. The performance evaluated by genotyping individuals from wild populations and backcross families suggested the good utility of the catfish 250K SNP array.
This is the first high-density SNP array for catfish. The array should be a valuable resource for genome-wide association studies (GWAS), fine QTL mapping, high-density linkage map construction, haplotype analysis, and whole genome-based selection.
Catfish; Fish; Genome; SNP array; GWAS; Genotyping
The application of RNA-seq has accelerated gene expression profiling and identification of gene-associated SNPs in many species. However, the integrated studies of gene expression along with SNP mapping have been lacking. Coupling of RNA-seq with bulked segregant analysis (BSA) should allow correlation of expression patterns and associated SNPs with the phenotypes.
In this study, we demonstrated the use of bulked segregant RNA-seq (BSR-Seq) for the analysis of differentially expressed genes and associated SNPs with disease resistance against enteric septicemia of catfish (ESC). A total of 1,255 differentially expressed genes were found between resistant and susceptible fish. In addition, 56,419 SNPs residing on 4,304 unique genes were identified as significant SNPs between susceptible and resistant fish. Detailed analysis of these significant SNPs allowed differentiation of significant SNPs caused by genetic segregation and those caused by allele-specific expression. Mapping of the significant SNPs, along with analysis of differentially expressed genes, allowed identification of candidate genes underlining disease resistance against ESC disease.
This study demonstrated the use of BSR-Seq for the identification of genes involved in disease resistance against ESC through expression profiling and mapping of significantly associated SNPs. BSR-Seq is applicable to analysis of genes underlining various performance and production traits without significant investment in the development of large genotyping platforms such as SNP arrays.
Bulk segregant analysis; RNA-seq; Disease resistance; Catfish; Allele-specific expression
Along with the rapid advances of the nextgen sequencing technologies, more and more species are added to the list of organisms whose whole genomes are sequenced. However, the assembled draft genome of many organisms consists of numerous small contigs, due to the short length of the reads generated by nextgen sequencing platforms. In order to improve the assembly and bring the genome contigs together, more genome resources are needed. In this study, we developed a strategy to generate a valuable genome resource, physical map contig-specific sequences, which are randomly distributed genome sequences in each physical contig. Two-dimensional tagging method was used to create specific tags for 1,824 physical contigs, in which the cost was dramatically reduced. A total of 94,111,841 100-bp reads and 315,277 assembled contigs are identified containing physical map contig-specific tags. The physical map contig-specific sequences along with the currently available BAC end sequences were then used to anchor the catfish draft genome contigs. A total of 156,457 genome contigs (~79% of whole genome sequencing assembly) were anchored and grouped into 1,824 pools, in which 16,680 unique genes were annotated. The physical map contig-specific sequences are valuable resources to link physical map, genetic linkage map and draft whole genome sequences, consequently have the capability to improve the whole genome sequences assembly and scaffolding, and improve the genome-wide comparative analysis as well. The strategy developed in this study could also be adopted in other species whose whole genome assembly is still facing a challenge.
Catfish has a male-heterogametic (XY) sex determination system, but genes involved in gonadogenesis, spermatogenesis, testicular determination, and sex determination are poorly understood. As a first step of understanding the transcriptome of the testis, here, we conducted RNA-Seq analysis using high throughput Illumina sequencing.
A total of 269.6 million high quality reads were assembled into 193,462 contigs with a N50 length of 806 bp. Of these contigs, 67,923 contigs had hits to a set of 25,307 unigenes, including 167 unique genes that had not been previously identified in catfish. A meta-analysis of expressed genes in the testis and in the gynogen (double haploid female) allowed the identification of 5,450 genes that are preferentially expressed in the testis, providing a pool of putative male-biased genes. Gene ontology and annotation analysis suggested that many of these male-biased genes were involved in gonadogenesis, spermatogenesis, testicular determination, gametogenesis, gonad differentiation, and possibly sex determination.
We provide the first transcriptome-level analysis of the catfish testis. Our analysis would lay the basis for sequential follow-up studies of genes involved in sex determination and differentiation in catfish.
Comparative genomics is a powerful tool to transfer genomic information from model species to related non-model species. Channel catfish (Ictalurus punctatus) is the primary aquaculture species in the United States. Its existing genome resources such as genomic sequences generated from next generation sequencing, BAC end sequences (BES), physical maps, linkage maps, and integrated linkage and physical maps using BES-associated markers provide a platform for comparative genomic analysis between catfish and other model teleost fish species. This study aimed to gain understanding of genome organizations and similarities among catfish and several sequenced teleost genomes using linkage group 8 (LG8) as a pilot study.
With existing genome resources, 287 unique genes were identified in LG8. Comparative genome analysis indicated that most of these 287 genes on catfish LG8 are located on two homologous chromosomes of zebrafish, medaka, stickleback, and three chromosomes of green-spotted pufferfish. Large numbers of conserved syntenies were identified. Detailed analysis of the conserved syntenies in relation to chromosome level similarities revealed extensive inter-chromosomal and intra-chromosomal rearrangements during evolution. Of the 287 genes, 35 genes were found to be duplicated in the catfish genome, with the vast majority of the duplications being interchromosomal.
Comparative genome analysis is a powerful tool even in the absence of a well-assembled whole genome sequence. In spite of sequence stacking due to low resolution of the linkage and physical maps, conserved syntenies can be identified although the exact gene order and orientation are unknown at present. Through chromosome-level comparative analysis, homologous chromosomes among teleosts can be identified. Syntenic analysis should facilitate annotation of the catfish genome, which in turn, should facilitate functional inference of genes based on their orthology.
Comparative mapping; Synteny; Genome; Chromosome; Linkage map; Physical map; Catfish; Fish
Although a large set of full-length transcripts was recently assembled in catfish, annotation of large gene families, especially those with duplications, is still a great challenge. Most often, complexities in annotation cause mis-identification and thereby much confusion in the scientific literature. As such, detailed phylogenetic analysis and/or orthology analysis are required for annotation of genes involved in gene families. The ATP-binding cassette (ABC) transporter gene superfamily is a large gene family that encodes membrane proteins that transport a diverse set of substrates across membranes, playing important roles in protecting organisms from diverse environment.
In this work, we identified a set of 50 ABC transporters in catfish genome. Phylogenetic analysis allowed their identification and annotation into seven subfamilies, including 9 ABCA genes, 12 ABCB genes, 12 ABCC genes, 5 ABCD genes, 2 ABCE genes, 4 ABCF genes and 6 ABCG genes. Most ABC transporters are conserved among vertebrates, though cases of recent gene duplications and gene losses do exist. Gene duplications in catfish were found for ABCA1, ABCB3, ABCB6, ABCC5, ABCD3, ABCE1, ABCF2 and ABCG2.
The whole set of catfish ABC transporters provide the essential genomic resources for future biochemical, toxicological and physiological studies of ABC drug efflux transporters. The establishment of orthologies should allow functional inferences with the information from model species, though the function of lineage-specific genes can be distinct because of specific living environment with different selection pressure.
Upon the completion of whole genome sequencing, thorough genome annotation that associates genome sequences with biological meanings is essential. Genome annotation depends on the availability of transcript information as well as orthology information. In teleost fish, genome annotation is seriously hindered by genome duplication. Because of gene duplications, one cannot establish orthologies simply by homology comparisons. Rather intense phylogenetic analysis or structural analysis of orthologies is required for the identification of genes. To conduct phylogenetic analysis and orthology analysis, full-length transcripts are essential. Generation of large numbers of full-length transcripts using traditional transcript sequencing is very difficult and extremely costly.
In this work, we took advantage of a doubled haploid catfish, which has two sets of identical chromosomes and in theory there should be no allelic variations. As such, transcript sequences generated from next-generation sequencing can be favorably assembled into full-length transcripts. Deep sequencing of the doubled haploid channel catfish transcriptome was performed using Illumina HiSeq 2000 platform, yielding over 300 million high-quality trimmed reads totaling 27 Gbp. Assembly of these reads generated 370,798 non-redundant transcript-derived contigs. Functional annotation of the assembly allowed identification of 25,144 unique protein-encoding genes. A total of 2,659 unique genes were identified as putative duplicated genes in the catfish genome because the assembly of the corresponding transcripts harbored PSVs or MSVs (in the form of pseudo-SNPs in the assembly). Of the 25,144 contigs with unique protein hits, around 20,000 contigs matched 50% length of reference proteins, and over 14,000 transcripts were identified as full-length with complete open reading frames. The characterization of consensus sequences surrounding start codon and the stop codon confirmed the correct assembly of the full-length transcripts.
The large set of transcripts assembled in this study is the most comprehensive set of genome resources ever developed from catfish, which will provide the much needed resources for functional genome research in catfish, serving as a reference transcriptome for genome annotation, analysis of gene duplication, gene family structures, and digital gene expression analysis. The putative set of duplicated genes provide a starting point for genome scale analysis of gene duplication in the catfish genome, and should be a valuable resource for comparative genome analysis, genome evolution, and genome function studies.
Recent advances in next-generation sequencing technologies have drastically increased throughput and significantly reduced sequencing costs. However, the average read lengths in next-generation sequencing technologies are short as compared with that of traditional Sanger sequencing. The short sequence reads pose great challenges for de novo sequence assembly. As a pilot project for whole genome sequencing of the catfish genome, here we attempt to determine the proper sequence coverage, the proper software for assembly, and various parameters used for the assembly of a BAC physical map contig spanning approximately a million of base pairs.
A combination of low sequence coverage of 454 and Illumina sequencing appeared to provide effective assembly as reflected by a high N50 value. Using 454 sequencing alone, a sequencing depth of 18 X was sufficient to obtain the good quality assembly, whereas a 70 X Illumina appeared to be sufficient for a good quality assembly. Additional sequencing coverage after 18 X of 454 or after 70 X of Illumina sequencing does not provide significant improvement of the assembly. Considering the cost of sequencing, a 2 X 454 sequencing, when coupled to 70 X Illumina sequencing, provided an assembly of reasonably good quality. With several software tested, Newbler with a seed length of 16 and ABySS with a K-value of 60 appear to be appropriate for the assembly of 454 reads alone and Illumina paired-end reads alone, respectively. Using both 454 and Illumina paired-end reads, a hybrid assembly strategy using Newbler for initial 454 sequence assembly, Velvet for initial Illumina sequence assembly, followed by a second step assembly using MIRA provided the best assembly of the physical map contig, resulting in 193 contigs with a N50 value of 13,123 bp.
A hybrid sequencing strategy using low sequencing depth of 454 and high sequencing depth of Illumina provided the good quality assembly with high N50 value and relatively low cost. A combination of Newbler, Velvet, and MIRA can be used to assemble the 454 sequence reads and the Illumina reads effectively. The assembled sequence can serve as a resource for comparative genome analysis. Additional long reads using the third generation sequencing platforms are needed to sequence through repetitive genome regions that should further enhance the sequence assembly.
Comparative mapping is a powerful tool to transfer genomic information from sequenced genomes to closely related species for which whole genome sequence data are not yet available. However, such an approach is still very limited in catfish, the most important aquaculture species in the United States. This project was initiated to generate additional BAC end sequences and demonstrate their applications in comparative mapping in catfish.
We reported the generation of 43,000 BAC end sequences and their applications for comparative genome analysis in catfish. Using these and the additional 20,000 existing BAC end sequences as a resource along with linkage mapping and existing physical map, conserved syntenic regions were identified between the catfish and zebrafish genomes. A total of 10,943 catfish BAC end sequences (17.3%) had significant BLAST hits to the zebrafish genome (cutoff value ≤ e-5), of which 3,221 were unique gene hits, providing a platform for comparative mapping based on locations of these genes in catfish and zebrafish. Genetic linkage mapping of microsatellites associated with contigs allowed identification of large conserved genomic segments and construction of super scaffolds.
BAC end sequences and their associated polymorphic markers are great resources for comparative genome analysis in catfish. Highly conserved chromosomal regions were identified to exist between catfish and zebrafish. However, it appears that the level of conservation at local genomic regions are high while a high level of chromosomal shuffling and rearrangements exist between catfish and zebrafish genomes. Orthologous regions established through comparative analysis should facilitate both structural and functional genome analysis in catfish.
SNPs are abundant, codominantly inherited, and sequence-tagged markers. They are highly adaptable to large-scale automated genotyping, and therefore, are most suitable for association studies and applicable to comparative genome analysis. However, discovery of SNPs requires genome sequencing efforts through whole genome sequencing or deep sequencing of reduced representation libraries. Such genome resources are not yet available for many species including catfish. A large resource of ESTs is to become available in catfish allowing identification of large number of SNPs, but reliability of EST-derived SNPs are relatively low because of sequencing errors. This project was designed to answer some of the questions relevant to quality assessment of EST-derived SNPs.
wo factors were found to be most significant for validation of EST-derived SNPs: the contig size (number of sequences in the contig) and the minor allele sequence frequency. The larger the contigs were, the greater the validation rate although the validation rate was reasonably high when the contigs contain four or more EST sequences with the minor allele sequence being represented at least twice in the contigs. Sequence quality surrounding the SNP under test is also crucially important. PCR extension appeared to be limited to a very short distance, prohibiting successful genotyping when an intron was present, a surprising finding.
Stringent quality assessment measures should be used when working with EST-derived SNPs. In particular, contigs containing four or more ESTs should be used and the minor allele sequence should be represented at least twice. Genotyping primers should be designed from a single exon, completely avoiding introns. Application of such quality assessment measures, along with large resources of ESTs, should provide effective means for SNP identification in species where genome sequence resources are lacking.
EST sequencing is one of the most efficient means for gene discovery and molecular marker development, and can be additionally utilized in both comparative genome analysis and evaluation of gene duplications. While much progress has been made in catfish genomics, large-scale EST resources have been lacking. The objectives of this project were to construct primary cDNA libraries, to conduct initial EST sequencing to generate catfish EST resources, and to obtain baseline information about highly expressed genes in various catfish organs to provide a guide for the production of normalized and subtracted cDNA libraries for large-scale transcriptome analysis in catfish.
A total of 17 cDNA libraries were constructed including 12 from channel catfish (Ictalurus punctatus) and 5 from blue catfish (I. furcatus). A total of 31,215 ESTs, with average length of 778 bp, were generated including 20,451 from the channel catfish and 10,764 from blue catfish. Cluster analysis indicated that 73% of channel catfish and 67% of blue catfish ESTs were unique within the project. Over 53% and 50% of the channel catfish and blue catfish ESTs, respectively, had significant similarities to known genes. All ESTs have been deposited in GenBank. Evaluation of the catfish EST resources demonstrated their potential for molecular marker development, comparative genome analysis, and evaluation of ancient and recent gene duplications. Subtraction of abundantly expressed genes in a variety of catfish tissues, identified here, will allow the production of low-redundancy libraries for in-depth sequencing.
The sequencing of 31,215 ESTs from channel catfish and blue catfish has significantly increased the EST resources in catfish. The EST resources should provide the potential for microarray development, polymorphic marker identification, mapping, and comparative genome analysis.
Rapid advances of the next-generation sequencing technologies have allowed whole genome sequencing of many species. However, with the current sequencing technologies, the whole genome sequence assemblies often fall in short in one of the four quality measurements: accuracy, contiguity, connectivity, and completeness. In particular, small-sized contigs and scaffolds limit the applicability of whole genome sequences for genetic analysis. To enhance the quality of whole genome sequence assemblies, particularly the scaffolding capabilities, additional genomic resources are required. Among these, sequences derived from known physical locations offer great powers for scaffolding. In this mini-review, we will describe the principles, procedures and applications of physical-map-derived sequences, with the focus on physical map contig-specific sequences.
physical map contig-specific sequences; BAC end sequences; whole genome sequencing; assembly; scaffolding
The razor clam Sinonovacula constricta is a benthic intertidal bivalve species with important commercial value. Despite its economic importance, knowledge of its transcriptome is scarce. Next generation sequencing technologies offer rapid and efficient tools for generating large numbers of sequences, which can be used to characterize the transcriptome, to develop effective molecular markers and to identify genes associated with growth, a key breeding trait.
Total RNA was isolated from the mantle, gill, liver, siphon, gonad and muscular foot tissues. High-throughput deep sequencing of S. constricta using 454 pyrosequencing technology yielded 859,313 high-quality reads with an average read length of 489 bp. Clustering and assembly of these reads produced 16,323 contigs and 131,346 singletons with average lengths of 1,376 bp and 458 bp, respectively. Based on transcriptome sequencing, 14,615 sequences had significant matches with known genes encoding 147,669 predicted proteins. Subsequently, previously unknown growth-related genes were identified. A total of 13,563 microsatellites (SSRs) and 13,634 high-confidence single nucleotide polymorphism loci (SNPs) were discovered, of which almost half were validated.
De novo sequencing of the razor clam S. constricta transcriptome on the 454 GS FLX platform generated a large number of ESTs. Candidate growth factors and a large number of SSRs and SNPs were identified. These results will impact genetic studies of S. constricta.
Gene duplication has had a major impact on genome evolution. Localized (or tandem) duplication resulting from unequal crossing over and whole genome duplication are believed to be the two dominant mechanisms contributing to vertebrate genome evolution. While much scrutiny has been directed toward discerning patterns indicative of whole-genome duplication events in teleost species, less attention has been paid to the continuous nature of gene duplications and their impact on the size, gene content, functional diversity, and overall architecture of teleost genomes.
Here, using a Markov clustering algorithm directed approach we catalogue and analyze patterns of gene duplication in the four model teleost species with chromosomal coordinates: zebrafish, medaka, stickleback, and Tetraodon. Our analyses based on set size, duplication type, synonymous substitution rate (Ks), and gene ontology emphasize shared and lineage-specific patterns of genome evolution via gene duplication. Most strikingly, our analyses highlight the extraordinary duplication and retention rate of recent duplicates in zebrafish and their likely role in the structural and functional expansion of the zebrafish genome. We find that the zebrafish genome is remarkable in its large number of duplicated genes, small duplicate set size, biased Ks distribution toward minimal mutational divergence, and proportion of tandem and intra-chromosomal duplicates when compared with the other teleost model genomes. The observed gene duplication patterns have played significant roles in shaping the architecture of teleost genomes and appear to have contributed to the recent functional diversification and divergence of important physiological processes in zebrafish.
We have analyzed gene duplication patterns and duplication types among the available teleost genomes and found that a large number of genes were tandemly and intrachromosomally duplicated, suggesting their origin of independent and continuous duplication. This is particularly true for the zebrafish genome. Further analysis of the duplicated gene sets indicated that a significant portion of duplicated genes in the zebrafish genome were of recent, lineage-specific duplication events. Most strikingly, a subset of duplicated genes is enriched among the recently duplicated genes involved in immune or sensory response pathways. Such findings demonstrated the significance of continuous gene duplication as well as that of whole genome duplication in the course of genome evolution.
Gene duplication; Whole genome duplication; Teleost species; Tandem duplication