The main goal of the Bovine Genome Project, the identification of the whole bovine genome sequence, has almost been achieved [33], but several regions of the bovine genome are still not sufficiently annotated and characterized. Genome assemblies also rely on the existence of transcript sequences to merge contigs, verify the assembly of whole-genome shotgun reads, and annotate genes. Further analysis of QTL regions of interest may include physical and transcription mapping, identification of positional and functional candidate genes, isolation of the corresponding full-length cDNA, and association studies on the selected genes.
This study represents a further step in the ongoing molecular and genetic analysis of complex traits and annotation of genes and transcripts localized in the region on BTA6 containing QTL for milk and meat production, health and conformation traits.
As an initial step towards systematic analysis of transcripts and genes in this region, we carried out exon trapping using selected bovine BAC clones previously mapped to QTL intervals on BTA6. Mining a genomic interval of about 1 Mb for transcribed sequences with this technique, we identified a total of 92 unique exon trapping sequences. Genome similarity searches revealed matches to sequence scaffolds on BTA6 for most unique ETS (91%). Two ETS, identified on sequence scaffolds not yet assigned to the current NCBI sequence assembly of the bovine genome, allowed gaps to be closed at the genome sequence level. About 2% of ETS could not be unambiguously assigned to the bovine genome sequence because they matched multiple chromosomes owing to repetitive sequence motifs. Only 5% of ETS had no hits to known sequences in the bovine sequence databases. These ETS can provide additional sequence information to complete the current genome sequence assembly. This result further indicates that targeted deep sequencing within the corresponding genomic regions would be required to improve the accuracy of the BTA6 sequence assembly.
Comparative sequence similarity searches against human and mouse genome sequences revealed that 11% of the isolated ETS displayed high similarity to genomic sequences located on the syntenic chromosomes HSA4 and MMU5 of the human and mouse reference genome assemblies, pointing to highly conserved genome regions in these species. Almost a third of the ETS matched equivalent sequences in genomic sequence scaffolds from the alternative Celera-based sequence assembly of the human genome. The remaining 62% of ETS, which lack comparative genomic sequence counterparts in human and mouse, presumably correspond to species-specific genomic regions in the bovine genome.
Screening the gene, EST and protein databases at NCBI detected only a few known transcribed sequences showing identity to the ETS isolated by exon trapping in our study (17%, Figure ).
Whereas 6% of all ETS matched known bovine transcripts, a further 16% of ETS matched bovine gene models predicted ab initio.
In evaluating the relatively low proportion of ETS identifying known transcripts, we have to consider that the BAC clones subjected to exon trapping in our study had been selected from regions on BTA6 poorly covered with protein-coding genes, a sparseness still evident in the current sequence assembly Btau4.0 (Figure ). Eleven BACs were assigned to gene desert regions [9], which have been found to be conserved in mammals and birds [34,35] and were thought to be transcriptionally silent. Hence, we could not expect a priori that many sequences would have been annotated as known genes. It should also be considered that the current annotation of the bovine genome is still limited and, consequently, the set of functional elements has not been completely identified to date. Additionally, transcripts from the most lowly expressed genes, or from genes specifically expressed in important but relatively minor cell types, are very likely under-represented in the EST databases, which were predominantly established by large-scale EST projects.
At the beginning of our experiments, only one BAC clone (BBI_750F0243) was known to be located near a human gene (KCNIP4) in the syntenic region on HSA4, based on in silico comparative mapping. Thus, this BAC clone could serve as a test of the efficacy of the exon trapping procedure. Indeed, we identified ETS in this BAC clone corresponding to four exons of the bovine KCNIP4 gene, which validated the usefulness of the exon trapping method for targeted mining of transcribed sequences in defined chromosomal regions based on genomic DNA from BAC clones. In addition, the identification of two additional exons of the bovine KCNIP4 gene that are not present in the current bovine genome assembly Btau4.0 demonstrated by example that this experimental approach is a useful complement for the annotation of the bovine genome sequence.
Because in silico sequence comparison detected no identity to known genes and ESTs for the majority of ETS identified in our study, these sequences are assumed to be novel and could be predicted to originate from unknown bovine transcripts. Expression analysis was performed to test this hypothesis. Examination of a subset of the trapped putative transcripts showed that, in a lactating cow, numerous ETS displayed a divergent, tissue-specific expression pattern (Figure , Figure ). Most expression signals were observed in liver, thyroid gland, small intestine, kidney, and pituitary gland. Tissue-dependent expression of the ETS may indicate specific functions in the corresponding tissues of the lactating cow. Some of the ETS were expressed in all tissues examined, indicating a ubiquitous expression pattern and suggesting that they are probably part of housekeeping genes or conserved structural genes, provided that similarity to repetitive sequences can be excluded. As shown in Figure , 16% of the analyzed ETS did not display expression signals in the multi-tissue panel. This is likely because these transcripts were not expressed in the tissues of the lactating cow included in the panel. In this context, it should be mentioned that an advantage of the exon trapping approach is its independence of spatio-temporal expression patterns, because transcripts are identified solely on the basis of intrinsic characteristics of the genomic sequence. However, non-expressed ETS could also represent false-positive sequences isolated from regions of the genomic DNA that resemble the splice site consensus sequences (e.g., splice donor/acceptor, branch point region) on which the exon trapping technique is based.
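The false-positive mechanism mentioned above can be illustrated with a minimal sketch. It is not the procedure used in this study: it assumes, as a strong simplification, that a candidate exon is any stretch of genomic sequence immediately preceded by the canonical AG acceptor dinucleotide and immediately followed by the canonical GT donor dinucleotide. Real splice site recognition also depends on the extended consensus, the polypyrimidine tract and the branch point. The function name `candidate_exons` and its length limits are hypothetical.

```python
def candidate_exons(seq, min_len=3, max_len=300):
    """Return (start, end) half-open intervals of stretches flanked by
    an upstream 'AG' (acceptor) and a downstream 'GT' (donor).

    Simplified illustration only: the length limits are arbitrary assumptions.
    """
    seq = seq.upper()
    # A candidate exon starts right after an AG dinucleotide.
    starts = [i + 2 for i in range(len(seq) - 1) if seq[i:i + 2] == "AG"]
    # A candidate exon ends right before a GT dinucleotide.
    ends = [j for j in range(len(seq) - 1) if seq[j:j + 2] == "GT"]
    return [(s, e) for s in starts for e in ends if min_len <= e - s <= max_len]


toy = "CCCAGATGCCGTAAA"  # invented toy sequence, not real bovine DNA
hits = candidate_exons(toy)
print(hits)                              # [(5, 10)]
print([toy[s:e] for s, e in hits])       # ['ATGCC']
```

Because AG and GT dinucleotides occur frequently by chance, such minimal motifs flank many non-exonic stretches in any long genomic sequence, which is why trapped sequences without expression evidence need to be interpreted with caution.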
From the results of the expression analysis, it can be inferred that the ETS showing expression in the bovine multi-tissue panel may represent bona fide transcribed sequences. However, the ETS have to be characterized in further studies with regard to their functional significance. Based on the presented data, we cannot exclude that some of the identified ETS belong to the class of pseudogenes or to non-functional RNA. Pseudogene transcription has been observed in small-scale gene-centred studies and in genome-scale unbiased mapping of transcriptionally active regions in the human and mouse genomes. Surveys by Gerstein and Zheng [36,37] have revealed that, for example, 5–20% of human pseudogenes can be transcriptionally active. However, considering the relatively high percentage of ETS (84%) for which expression was demonstrated in our study, it can be assumed that a number of them belong to another category of transcribed sequence elements, for example noncoding RNA.
Continued submission of ESTs and other sequence information in a variety of species points to the existence of transcripts that do not map to currently annotated genes [12,38-40]. These transcripts may correspond to novel protein-coding genes, genes encoding small unknown peptides, pseudogenes or noncoding RNA. Evidence of transcription has increasingly been found in unannotated intergenic regions of the human genome that were thought to be transcriptionally silent (e.g., [41-46]). The ENCODE consortium reported that a vast amount of DNA not annotated as known genes is transcribed into RNA. While the majority of the genome appears to be transcribed at the level of primary transcripts, only about half of the processed transcripts map to currently annotated genes [36,42,46]. In particular, a high proportion of new transcriptionally active regions (more than 50%) were detected in non-annotated intergenic regions. These studies indicated that genomic regions previously considered "junk" encode multiple polyadenylated and non-polyadenylated transcripts of unknown function. According to Gerstein et al. [36], the ENCODE project provided evidence that there is much activity between annotated genes and intergenic space in the human genome, contributed by transcribed non-protein-coding RNAs and transcribed pseudogenes. The authors highlighted that a number of these transcribed pseudogenes and noncoding RNA genes are located even within introns of protein-coding genes and assumed that these components may influence the expression of their host genes. It is also possible that these transcripts themselves do not have a direct function, but rather are important for a particular process (e.g., chromatin accessibility for transcription factor binding). Numerous noncoding RNA sequences are increasingly being recognized in the transcriptomes of different eukaryotes as having important regulatory functions in controlling various levels of gene expression in physiological and developmental processes and diseases of complex organisms (e.g., [44,47-49]). The detailed investigation of the functional relevance of the numerous unknown transcripts was postulated as a prospective task in the post-ENCODE era.
The findings of our study provide experimental support for transcripts lacking EST or other cDNA evidence in the targeted regions of BTA6. The majority of unknown ETS presumably represent novel noncoding transcripts located in intergenic regions of the chromosome. However, further studies should be performed to characterize the transcripts with regard to their putative functional significance. In this respect, it remains to be determined whether these transcripts belong to non-functional RNA or whether they have a specific regulatory function in the bovine genome. Currently, there is scarce information on the function of bovine noncoding RNA genes compared with the state of knowledge in mouse and human. The prevalence of bovine noncoding RNAs, their regulatory impact on gene expression and their physiological effects have not yet been examined in detail.
While mammalian genomes contain very similar repertoires of protein-coding gene sequences, comprising only about 1.5% of the whole genome, the majority of the mammalian genome is evidently transcribed. Our results support the increasingly accepted concept that the physiological complexity and the unique phenotypes of species-specific or individual genomes might evolve from combinatorial features contributed by the entire genome sequence, including previously neglected genome regions [33,42,45,46,50]. Consequently, variation in noncoding sequences might be an important effector of phenotypic variation in complex traits and diseases (see reviews [51,52]) in livestock.
The results of our study on BTA6 demonstrate that the exon trapping method based on region-specific BAC clones is applicable to targeted screening for novel transcripts located within a defined chromosomal region sparsely covered with annotated genes. The novel transcript sequences obtained will contribute to establishing a detailed transcription map for targeted subchromosomal BTA6 regions. Our results show that computational prediction and identification of genes and transcripts, together with manual inspection, are not sufficient on their own to annotate the final bovine genome in the absence of experimentally derived data. Experience from genome studies in other species has shown that genome annotation is never complete or final (e.g., [39,40,42,45,53,54]). Therefore, correcting and refining the genome annotation is an iterative, ongoing task that depends on experimental data for final validation, especially for the identification of rare transcripts and alternative splice variants. Compared with high-throughput sequencing technologies such as the transcriptome sequencing currently being initiated, the method of exon trapping has some advantages. Detection of transcripts by high-throughput sequencing requires knowledge about the spatio-temporal expression pattern of the targeted group of unknown transcripts. Frequently, the time point of expression is difficult to predict, e.g., for transcripts of high relevance for developmental regulation. Furthermore, high-throughput technologies have their limits regarding detection of rare transcripts. Therefore, a targeted approach that is independent of the amount, time and site of expression, such as exon trapping, will complement high-throughput technologies for the analysis of defined chromosomal intervals, for example to trace transcripts in fine-mapped QTL regions.