Chromosomal positions of transcripts, and of the exons within those transcripts, were downloaded from the UCSC Genes table [9] through the Table Browser portal of the UCSC Genome Browser [8]. Chromosomal positions of repeat elements were taken from the RepeatMasker table of the UCSC Genome Browser [10]. SNP data for this analysis were taken from the NCBI repository for SNPs, dbSNP version 129 (dbSNPv.129), which holds 12,483,371 true SNPs. Of these SNPs, 6,726,791 currently hold validated status, and 6,406,772 of these lie within the autosomal chromosomes [11]. Coordinates of pairwise alignments to the human genome were taken from the ECR Browser through the ECRBase portal [12]. The species aligned to human were Pan troglodytes (Pt), Macaca mulatta (Rm; rhesus macaque), Canis familiaris (Cf), Mus musculus (Mm), Rattus norvegicus (Rn), Monodelphis domestica (Md), Gallus gallus (Gg), Xenopus tropicalis (Xt) and Danio rerio (Dr).
These data were held in a MySQL database implemented on a 56-node High Performance Cluster (HPC) IBM blade array running Microsoft Compute Cluster Server 2003. All programs were written in Visual Basic .NET on the Microsoft .NET 3.5 framework, in a parallel design that used the database to pass messages and data to the worker nodes. The database was designed to optimize both the queries run during the analysis and the subsequent analysis of the results. Using set theory, each chromosome was treated as a set whose members are its base pairs. Each chromosome was considered separately, as an entity of DNA with an independent evolutionary path. The bases of each chromosome were categorized according to their position with respect to the different annotation information gathered.
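The set-based view of a chromosome can be illustrated with a minimal Python sketch. It is not the original Visual Basic .NET implementation: it assumes each annotation is a list of half-open `(start, end)` intervals on one chromosome, and the function names (`merge`, `intersect`, `subtract`, `total_bases`) and the toy coordinates are hypothetical.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # half-open [start, end); an assumed convention


def merge(intervals: List[Interval]) -> List[Interval]:
    """Union of possibly overlapping intervals, returned sorted and disjoint."""
    merged: List[Interval] = []
    for start, end in sorted(intervals):
        if start >= end:
            continue  # ignore empty intervals
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def intersect(a: List[Interval], b: List[Interval]) -> List[Interval]:
    """Bases that belong to both a and b."""
    a, b = merge(a), merge(b)
    out: List[Interval] = []
    i = j = 0
    while i < len(a) and j < len(b):
        lo, hi = max(a[i][0], b[j][0]), min(a[i][1], b[j][1])
        if lo < hi:
            out.append((lo, hi))
        # advance whichever interval finishes first
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return out


def subtract(a: List[Interval], b: List[Interval]) -> List[Interval]:
    """Bases that belong to a but not to b."""
    b = merge(b)
    out: List[Interval] = []
    for start, end in merge(a):
        cursor = start
        for b_start, b_end in b:
            if b_end <= cursor:
                continue  # b interval lies entirely before the cursor
            if b_start >= end:
                break  # b interval lies entirely after this a interval
            if b_start > cursor:
                out.append((cursor, b_start))
            cursor = max(cursor, b_end)
            if cursor >= end:
                break
        if cursor < end:
            out.append((cursor, end))
    return out


def total_bases(intervals: List[Interval]) -> int:
    """Number of bases (set cardinality) covered by the intervals."""
    return sum(end - start for start, end in merge(intervals))


# Toy coordinates, invented purely for illustration
chromosome = [(0, 1000)]
transcripts = [(100, 400), (600, 900)]
exons = [(100, 150), (300, 400), (600, 700)]
repeats = [(120, 140), (450, 550)]

intergenic = subtract(chromosome, transcripts)
introns = subtract(transcripts, exons)
non_repeat_exons = subtract(exons, repeats)
```

Storing intervals rather than individual bases keeps the set operations linear in the number of annotated features, which matters at the scale of a whole chromosome.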
All autosomal chromosomes were analysed (2,867,732,772 bases). The X and Y chromosomes were excluded because they are under different selective pressures and are represented differently within the population. The mitochondrial genome was also excluded. We also removed repetitive regions (1,288,883,792 bases), as the repetition, frequency and random nature of these repeat regions present problems for pairwise alignment analyses. Un-sequenced regions of the genome, such as the centromere region of each chromosome, were also removed (185,443,999 bases), as no alignments or SNPs can be mapped to these regions. From the starting genomic annotations, set algebra was used to define subsets for further investigation, as described in Table . Using set theory in this manner exploited the data currently available for polymorphisms (SNPs) and for the intronic, exonic and intergenic regions of the genome. The total number of bases within each region type was calculated. Using the chromosomal coordinates of the SNPs, the number of SNPs within each region type was also calculated. This allowed a basic SNP density for each region type to be calculated as the number of SNPs within the region type divided by the total number of bases within that region type, expressed per kilobase (kb).
Set theory algebra of genomic regions from annotation data
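The SNP density calculation described above can be sketched as follows. This is an illustrative Python sketch, not the original implementation: it assumes regions are disjoint half-open `(start, end)` intervals and that density is reported per kilobase; the function names and the toy positions are hypothetical.

```python
from bisect import bisect_right
from typing import List, Tuple

Interval = Tuple[int, int]  # half-open [start, end); an assumed convention


def count_snps_in_regions(snp_positions: List[int], regions: List[Interval]) -> int:
    """Count SNP positions that fall inside any of the disjoint regions."""
    regions = sorted(regions)
    starts = [start for start, _ in regions]
    count = 0
    for pos in snp_positions:
        # index of the last region starting at or before this position
        idx = bisect_right(starts, pos) - 1
        if idx >= 0 and pos < regions[idx][1]:
            count += 1
    return count


def snp_density_per_kb(snp_count: int, base_count: int) -> float:
    """SNPs per 1,000 bases of a region type."""
    if base_count <= 0:
        raise ValueError("region type contains no bases")
    return 1000.0 * snp_count / base_count


# Toy data, invented purely for illustration
introns = [(150, 300), (700, 900)]
snps = [10, 160, 299, 300, 750, 899, 950]

n_snps = count_snps_in_regions(snps, introns)
n_bases = sum(end - start for start, end in introns)
density = snp_density_per_kb(n_snps, n_bases)
```

The binary search keeps each SNP lookup logarithmic in the number of regions, so the same approach scales to millions of SNP coordinates per chromosome.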
The analysis was carried out on each chromosome using the pairwise alignments of each of the species listed above at three selective "stringencies" of 70%, 80% and 90% over 100 base pairs. However, at the higher stringencies the conserved portion of the genome at large evolutionary depth was small, and the number of SNPs within it fell to a very small number. Therefore, to keep the analysis statistically valid across all species, the 70% data were used for the majority of the analysis in this paper, although the 80% and 90% data showed a similar trend. Statistical analyses of the results were carried out in MATLAB version 7.1 (MathWorks) and Microsoft Excel 2003. Normality was tested using the Jarque-Bera test (JB test) on the mean SNP density counts for all regions as described in Table for each chromosome, and the null hypothesis of normality could not be rejected [13]. Thus, the average chromosomal SNP density is a fair way of representing the data across the chromosomes, and allows the ANOVA statistical test to be used to compare the subsets at a 95% confidence level. The average validated SNP density per kilobase (kb) of the total genomic sequence, based on the most current dbSNP database (dbSNPv.129), is approximately 2.6.
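For readers unfamiliar with the Jarque-Bera test, the statistic can be sketched in a few lines of Python. This is not the MATLAB routine used in the analysis: it assumes the standard definition of the statistic and the asymptotic chi-square approximation with two degrees of freedom for the p-value, which can be inaccurate for samples as small as one value per autosome. The function name and the input values are hypothetical.

```python
import math
from typing import List, Tuple


def jarque_bera(values: List[float]) -> Tuple[float, float]:
    """Return the Jarque-Bera statistic and its asymptotic p-value."""
    n = len(values)
    if n < 3:
        raise ValueError("need at least three observations")
    mean = sum(values) / n
    # central moments (population form, dividing by n)
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    if m2 == 0:
        raise ValueError("all observations are identical")
    skewness = m3 / m2 ** 1.5
    kurtosis = m4 / m2 ** 2
    jb = n / 6.0 * (skewness ** 2 + (kurtosis - 3.0) ** 2 / 4.0)
    # chi-square survival function with 2 degrees of freedom is exp(-x / 2)
    p_value = math.exp(-jb / 2.0)
    return jb, p_value


# Invented values standing in for per-chromosome mean SNP densities
example_densities = [1.0, 2.0, 3.0, 4.0, 5.0]
jb_stat, p = jarque_bera(example_densities)
normality_rejected = p < 0.05  # 95% confidence level, as in the text
```

A p-value above 0.05 means that normality is not rejected, which is the condition the text relies on before applying ANOVA to the chromosomal averages.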