Table shows the amount of high-quality sequence obtained from 5 pig breeds (NCBI Trace repository under center name "SDJVP", and project name "Sino-Danish Pig Genome Project"). The average trimmed length of the ~3.84 million sequences was 543 base pairs, yielding a total of 2.1 billion base pairs, equivalent to a redundancy of 0.66X coverage of the 3.15 billion base pair pig genome. It is expected that $1-(1-543/(3.15 \times 10^{9}))^{3.84 \times 10^{6}} \approx 48\%$ of the pig genome sequence has been hit at least once by this sequencing project. The low coverage prevents making a real assembly of the pig sequences and, thus, the contig coverage is not estimated. The analyses are therefore based on a very large number of short alignments. Repeatmasking (supplementary Table ) masked 36% of all base pairs. The distribution of repeat types is overall very similar to what is observed in human, except for the expected absence of Alu-elements (Additional file 1
). Overall, 38% of the coding fraction of the human-mouse alignment, 38% of the 5' UTR, 33% of the 3' UTR, 23% of the intron region and 24% of the intergenic region could be expanded to a three-species alignment with the addition of the pig reads. This coverage of the human-mouse alignment by the pig genome sequences was close to our prior expectation. Since only 48% of the base pairs in the pig genome are expected to have been hit, we would expect to hit at most 48% of the human-mouse alignment, even assuming perfect conservation. In practice, BLAST loses some power because of the fragmented nature of the pig shotgun reads (which are fragmented even further by the repeatmasking), and we expect that some of the human-mouse alignment no longer has an orthologous region in the pig genome. For the non-coding regions, the coverage of the human-mouse alignment by the pig genome sequences is lower than for the coding regions, but this may be explained by lower selective constraints and a much higher rate of insertions and deletions in these regions.
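The redundancy and the expected fraction of the genome hit at least once can be reproduced with a few lines of Python. This is a minimal sketch, assuming reads are placed uniformly and independently along the genome; the function names are illustrative and not part of the original analysis. The Poisson form is a standard approximation to the same quantity and is included only as a cross-check.

```python
import math

# Values quoted in the text above.
GENOME_SIZE = 3.15e9  # pig genome size in base pairs
N_READS = 3.84e6      # number of trimmed reads
READ_LEN = 543        # average trimmed read length in base pairs


def redundancy(n_reads, read_len, genome_size):
    """Total sequenced bases divided by genome size (the 'X' coverage)."""
    return n_reads * read_len / genome_size


def fraction_hit(n_reads, read_len, genome_size):
    """Probability that a given base is covered by at least one read.

    Assumes (hypothetically) uniform, independent read placement.
    """
    p_missed_by_one_read = 1.0 - read_len / genome_size
    return 1.0 - p_missed_by_one_read ** n_reads


def fraction_hit_poisson(n_reads, read_len, genome_size):
    """Poisson approximation of the same quantity: 1 - exp(-coverage)."""
    return 1.0 - math.exp(-redundancy(n_reads, read_len, genome_size))


if __name__ == "__main__":
    print(f"redundancy: {redundancy(N_READS, READ_LEN, GENOME_SIZE):.2f}X")
    print(f"fraction hit: {fraction_hit(N_READS, READ_LEN, GENOME_SIZE):.1%}")
```

With the numbers above, both forms give a value close to the 48% quoted in the text, because the per-read hit probability is tiny and the number of reads is large.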
Overview of the number of raw reads generated from each breed.
The alignments were used to generate the phylogenetic trees in Figure . As the pig, mouse and human lineages are believed to have diverged at approximately the same time, the trees allow for separate studies of evolution on the human and mouse branches since the divergence of the two species (the root). Due to a generally lower rate of nucleotide substitutions in the pig and human lineages, the porcine sequences are more similar to the human than to the mouse sequences. Overall, the exonic sequences show the slowest evolution, followed by 5' UTR, 3' UTR, intergenic and intronic regions, reflecting different levels of selective constraint on these domains.
Evolutionary distances between mouse, pig and human for conserved sequences divided into functional classes using the annotation of the human genome. Branch lengths are estimated using the HKY substitution model with gamma correction.
By aligning the set of ultra-conserved regions against the pig genome reads using BLAST, we were able to find 239 of the 481 known regions reported in Bejerano et al. (2004) with a significant hit of at least 150 bp. Only 12 of these regions were less than 98% conserved (85–97% identity). This result agrees very well with the expected 48% of the pig genome being covered and the assumption that these regions are very well conserved within Mammalia.
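The comparison between the observed and expected number of ultra-conserved regions can be checked directly. The sketch below assumes that each region is recovered with a probability equal to the expected genome fraction hit (48%); it ignores the 150 bp minimum hit length, so it is only a rough consistency check. The variable names are illustrative.

```python
# Values quoted in the text above.
TOTAL_REGIONS = 481       # ultra-conserved regions from Bejerano et al. (2004)
FOUND = 239               # regions with a significant hit of at least 150 bp
LOW_IDENTITY = 12         # found regions with less than 98% identity
EXPECTED_FRACTION = 0.48  # expected fraction of the pig genome hit

# Expected number of regions recovered under the simplifying assumption.
expected_found = EXPECTED_FRACTION * TOTAL_REGIONS

# Observed recovery rate and share of recovered regions below 98% identity.
observed_fraction = FOUND / TOTAL_REGIONS
low_identity_fraction = LOW_IDENTITY / FOUND

if __name__ == "__main__":
    print(f"expected regions found: {expected_found:.0f}")
    print(f"observed fraction found: {observed_fraction:.1%}")
    print(f"found regions below 98% identity: {low_identity_fraction:.1%}")
```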
By aligning the pig shotgun data against all human transcripts (NCBI build 34) we found 758 completely conserved sequences exceeding 200 bp in length. Of these, 41 were also found to be completely conserved in the mouse genome, while 590 were less conserved (more than 95% identity over at least 80% of the length). BLASTing human transcripts vs. the fully assembled mouse genome (NCBI build 32), we found 2709 ultra-conserved regions. When aligning this set of sequences against the artificially fragmented mouse genomic dataset using BLAST it was only possible to classify 664 (24.5%) as ultra-conserved – less than the 758 elements found in the human-pig comparison.
The set of pig miRNAs (Additional file 2) was compared to human and mouse and it was possible to obtain 50 three-way alignments. The evolutionary tree in Figure was constructed using the HKY+gamma model from these alignments with gap positions removed. By construction, the miRNAs are more conserved than even the protein coding sequences, but with pig and human being phylogenetically closest. For the 50 triple-alignments, we obtained 25 cases where pig is closer to human than to mouse, 2 cases where pig is closer to mouse than to human, and 23 cases where pig is equally distant to human and mouse.
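One simple way to classify a three-way alignment as "pig closer to human", "pig closer to mouse" or "equally distant" is to count mismatches over gap-free columns. The sketch below is an assumption about how such a tally could be made, not a description of the method actually used; it uses raw mismatch counts rather than the HKY+gamma model, and the function names are hypothetical.

```python
def gap_free_columns(*seqs):
    """Return alignment columns in which no sequence has a gap ('-')."""
    if len({len(s) for s in seqs}) != 1:
        raise ValueError("aligned sequences must have equal length")
    return [col for col in zip(*seqs) if "-" not in col]


def closer_to(pig, human, mouse):
    """Classify one pig/human/mouse alignment by raw mismatch counts.

    Returns "human", "mouse" or "equal". Sequences are expected to be
    aligned, upper-case strings using '-' for gaps.
    """
    cols = gap_free_columns(pig, human, mouse)
    d_human = sum(p != h for p, h, _ in cols)
    d_mouse = sum(p != m for p, _, m in cols)
    if d_human < d_mouse:
        return "human"
    if d_mouse < d_human:
        return "mouse"
    return "equal"


if __name__ == "__main__":
    # Toy alignment (not real miRNA data): pig matches human exactly.
    print(closer_to("ACGTACGT", "ACGTACGT", "ACGTACGA"))
```

Applying such a function to each of the 50 alignments and counting the three outcomes would give a tally of the same form as the 25/2/23 split reported above.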
The intra-genomic variation in GC content among the individual alignments reflects the isochore structure of the genome. Thus, from the three species alignments, we calculated the GC content for each functional sequence class for each aligned fragment. For a given type of sequence, only alignments having more than 40 nucleotides of the specific type were used. Table shows that the mean GC content is similar among the three species. The variance among alignments in GC content is generally lower in mouse than in pig and human, but mostly so for coding sequences, followed by the UTR and intron regions (Table ). Figure shows the distribution of GC% for the coding alignments. The reduced variability in GC content in mouse compared to human has been shown previously, e.g. Figure 8a in [4]. The results presented here suggest a very similar pattern in human and pig.
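The GC summary described above (per-species mean, and variance standardized to human) can be sketched as follows. This is a minimal illustration under stated assumptions: alignments are represented as dictionaries mapping a species name to its aligned sequence for one functional class, the 40-nucleotide filter is applied to every species in the alignment, and the sample variance is used. The data layout and function names are hypothetical.

```python
from statistics import mean, variance

MIN_NUCLEOTIDES = 40  # alignments must exceed this length (from the text)


def nucleotides(seq):
    """Return the A/C/G/T characters of a sequence, dropping gaps and Ns."""
    return [b for b in seq.upper() if b in "ACGT"]


def gc_content(seq):
    """Fraction of G and C among the unambiguous nucleotides of seq."""
    bases = nucleotides(seq)
    if not bases:
        raise ValueError("sequence contains no unambiguous nucleotides")
    return sum(b in "GC" for b in bases) / len(bases)


def gc_summary(alignments, reference="human", min_nt=MIN_NUCLEOTIDES):
    """Mean GC and variance per species, with variance relative to reference.

    Needs at least two retained alignments and a non-zero reference variance.
    """
    kept = [a for a in alignments
            if all(len(nucleotides(s)) > min_nt for s in a.values())]
    per_species = {sp: [gc_content(a[sp]) for a in kept] for sp in kept[0]}
    ref_var = variance(per_species[reference])
    return {
        sp: {
            "mean": mean(values),
            "variance": variance(values),
            "relative_variance": variance(values) / ref_var,
        }
        for sp, values in per_species.items()
    }
```

Dividing each species' variance by the human variance gives human a value of 1 by construction, which matches the standardization described in the table caption.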
Average GC content and the variance among alignments exceeding 40 bp for each species and each functional category. Variance is standardized to the variance observed in the human sequence.
The distribution of GC content in exons for human, pig and mouse. Only alignments with more than 40 base pairs of exon sequence were used.