Atlantic salmon (Salmo salar) is an important aquaculture species, and there is also considerable commercial harvesting of wild salmon. As a consequence of the economic interest in salmon, various genomic resources have been developed to identify genes and genomic mechanisms responsible for commercially important traits. These resources include a BAC library, the corresponding physical map and several linkage maps [1]. In addition, several cDNA libraries have been constructed [6], and at present 494,094 ESTs have been submitted to GenBank [11].
The salmon genome is complex due to a relatively recent genome duplication believed to have occurred between 25 and 120 million years ago in the common salmonid ancestor [12]. Analysis of segregation ratios in salmonids has revealed disomic inheritance in females, while males show a mixture of disomic and tetrasomic inheritance [12]. Divergence of loci duplicated via tetraploidy depends on complete re-establishment of disomic inheritance, and present-day salmonids appear to have retained more than 50% of loci as duplicates [14]. This suggests that salmonids are still in the process of re-establishing disomic inheritance. Studies of the salmon genome might therefore contribute important biological knowledge on the evolution of ohnologs (duplicate sequences that originate from a whole genome duplication) [15].
Transcript sequences that represent the coding regions of genes may be predicted from consensus assemblies of overlapping ESTs, such as the gene indices in the TIGR database [16]. These clusters of tentative consensus sequences (TCs) serve as a valuable resource for putative gene products. However, such reconstructions are prone to errors caused by the low quality of single-pass sequences, alternative splice forms, expressed pseudogenes and sequence similarities between transcripts within gene families. One would expect the large number of almost identical ohnologous sequences to make such gene transcript predictions particularly challenging in salmonid species. In agreement with this, studies using salmon EST data as a source for SNP discovery show that clustering EST sequences into consensus sequences yields a high frequency of "SNPs" with heterozygote excess. This indicates that a large number of the ESTs in such clusters are derived from different loci [5].
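The heterozygote-excess argument above can be made concrete with a small calculation. The following is a minimal sketch, not the procedure used in the cited SNP studies: it compares the observed number of heterozygotes at a putative SNP with the number expected under Hardy-Weinberg proportions. The function names and the ratio threshold of 1.5 are hypothetical choices made for illustration only. When a "SNP" is really a fixed difference between two duplicated loci, every individual appears heterozygous, so the observed count approaches twice the expected count.

```python
def heterozygote_excess(n_aa, n_ab, n_bb):
    """Return (observed, expected) heterozygote counts for one biallelic site.

    The expected count assumes Hardy-Weinberg proportions at a single locus.
    """
    n = n_aa + n_ab + n_bb
    if n == 0:
        raise ValueError("no genotypes supplied")
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    q = 1.0 - p                      # frequency of allele B
    expected = 2 * p * q * n
    return n_ab, expected


def looks_like_paralogous_variant(n_aa, n_ab, n_bb, ratio_threshold=1.5):
    """Flag a site whose heterozygote count far exceeds the expectation.

    ratio_threshold is an illustrative assumption, not a published cut-off.
    """
    observed, expected = heterozygote_excess(n_aa, n_ab, n_bb)
    if expected == 0:
        return False  # monomorphic site, nothing to test
    return observed / expected >= ratio_threshold


if __name__ == "__main__":
    # All 20 individuals scored as heterozygous: typical of two merged loci.
    print(looks_like_paralogous_variant(0, 20, 0))
    # Genotype counts in Hardy-Weinberg proportions: a plausible true SNP.
    print(looks_like_paralogous_variant(5, 10, 5))
```

In practice a formal test (for example an exact test of Hardy-Weinberg proportions) would replace the simple ratio, but the ratio is enough to show why merged ohnologs stand out.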
The most useful transcript sequences are derived from high-quality full-length sequencing of inserts from cDNA clones (FLIcs) that contain the complete protein coding sequence (cCDS). By determining the cCDS from a single clone, the errors caused by incorrect clustering of non-allelic sequences are avoided. High-quality sequences based on multi-pass reads of the CDS from FLIcs are therefore the most reliable source for transcript prediction. In addition to being the most suitable means of predicting protein sequences, the data from FLIcs might also be used to identify splice variants and to distinguish between closely related paralogs. Complete CDS FLIcs are also important in genome clustering and annotation. Genomic sequencing of Atlantic salmon is being organised by an international consortium, and because of the recent duplication of the salmon genome, clustering and annotation of the sequence might prove difficult. A large number of high-quality cCDS FLIcs would therefore be of great value in a salmon genome sequencing project. Finally, in full-length insert sequences, where the boundaries of the coding sequences are defined, the additional transcript sequences provide sequence information from the 5' and 3' UTRs. Within these non-coding mRNA segments there are sequence motifs that are important in the regulation of gene expression. Access to reliable sequence information from UTRs is a prerequisite for identifying such functional motifs. Together, these uses of cCDS FLIcs and their source cDNA clones have led to large-scale sequencing of full-length inserts in several species [18]. Despite the apparent usefulness of cCDS FLIcs, few salmon FLIcs were available in public databases at the time this study was initiated.
The aim of this study has been to determine the full sequence of the inserts in a set of selected Atlantic salmon cDNA clones, in order to provide a larger number of high-quality sequenced transcripts with complete CDSs from a single tissue and developmental stage. Clones were selected from a white-muscle-specific library from the pre-smolt developmental stage, since our research at the time this study was initiated also focused on discovering genes that might be important for the ability to deposit dietary carotenoid pigments in muscle tissue. The results from the annotation of full-length sequenced inserts and the identification of cDNA transcripts likely to contain complete CDSs are presented. We also describe some general characteristics of Salmo salar transcripts, such as the Kozak consensus sequence and polyadenylation signal variation, and we identify conserved, putatively functional elements in the UTRs.
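To illustrate what identifying a complete CDS and a polyadenylation signal in a full-length insert involves, the sketch below scans the forward strand of a cDNA sequence for the longest open reading frame that has both an ATG start codon and an in-frame stop codon, and then looks for the canonical AATAAA hexamer in the 3' UTR. This is a simplified illustration and not the annotation pipeline used in this study; the function names and the toy sequence are hypothetical, and real annotation would also rely on similarity to known proteins.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def longest_complete_orf(seq):
    """Return (start, end) of the longest ATG..stop ORF on the forward strand.

    Coordinates are 0-based and end-exclusive, and the stop codon is included.
    Returns None when no ORF has both a start and an in-frame stop codon.
    """
    seq = seq.upper()
    best = None
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first ATG since the previous stop in this frame
            elif start is not None and codon in STOP_CODONS:
                end = i + 3
                if best is None or end - start > best[1] - best[0]:
                    best = (start, end)
                start = None  # look for the next ORF in this frame
    return best


def find_polya_signal(seq, cds_end, signal="AATAAA"):
    """Return the position of the last signal hexamer in the 3' UTR, or -1."""
    utr = seq.upper()[cds_end:]
    pos = utr.rfind(signal)
    return -1 if pos < 0 else cds_end + pos


if __name__ == "__main__":
    # Toy transcript: 5' UTR + short CDS + 3' UTR carrying AATAAA.
    cdna = "GGCACC" + "ATGGCTAAATGA" + "CCTTAATAAAGGTTC"
    orf = longest_complete_orf(cdna)
    print("CDS coordinates:", orf)
    if orf is not None:
        print("poly(A) signal at:", find_polya_signal(cdna, orf[1]))
```

The sequence context around the reported start position (for example `cdna[start - 6:start + 4]`, when enough 5' UTR is present) is the region from which a Kozak consensus would be compiled across many transcripts.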