The African buffalo has become a species of interest in recent years due to its role as a wildlife maintenance host for a variety of infectious and zoonotic diseases, such as corridor disease, foot-and-mouth disease and bovine tuberculosis. There is no African buffalo reference genome available for use in disease association studies, and much benefit would be gained by the generation of sequence data and large-scale SNP discovery in this species. The cost efficiency, large data output and fast turnaround time of the next-generation sequencing technologies have greatly facilitated the generation of novel sequence data as well as large-scale SNP discovery in non-model organisms. ABI SOLiD technology was used in this study to generate over 400 million 50 bp reads of African buffalo genome sequence, and for the preliminary identification of approximately a quarter of a million novel SNPs within the buffalo genome. When investigating a species for which a complete genome sequence is not available, the reference genome of a related species can be used for mapping and SNP discovery. The success of this approach was recently shown by mapping sequence reads of the great tit, Parus major, to the reference genome of the zebra finch, Taeniopygia guttata. Using this method, the authors were able to identify 20,000 novel SNPs. A similar approach was used in mapping sequence reads of the turkey, Meleagris gallopavo, to the sequenced genome of the chicken, and approximately 8000 SNPs were identified in the turkey genome.
In this study, the Cape buffalo sequence reads were mapped to the reference genome of the domestic cow, Bos taurus. The most recent common ancestor of the domestic cow and the African buffalo is estimated to have existed approximately 5–10 million years ago (MYA), at the time of the divergence of the subtribes Bubalina, which includes the genus Syncerus, and Bovina, which includes the genus Bos. Despite the relatively recent split of these two genera, only 19% to 23% of the buffalo short reads mapped to the reference cow genome using BWA and Bowtie. These percentages are comparable to those achieved by van Bers et al., where 26% and 32% of the great tit sequence reads generated in two pools were mapped to the zebra finch genome using data generated by the Illumina Genome Analyser. Similarly, Kerstens et al. mapped approximately 30% of the raw sequence reads generated in the turkey to the chicken genome. The low percentage mapping resulted in a depth of coverage that was substantially below a level that would be ideal for SNP discovery. It would appear that 2–4 times more raw data is required when using a distant relative as a reference sequence, although the longer reads that are now available from the Illumina sequencers (100–150 bp) might improve the percentage mapping. Nevertheless, the extra data required when using a distant relative is still substantially less than would be required for a de novo assembly (>80× coverage), and mapping to a related species dramatically simplifies SNP discovery and annotation compared to that required for a de novo assembly.
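The relationship between mapping percentage and depth of coverage described above can be illustrated with a minimal sketch. The function names `mean_depth` and `reads_needed` are hypothetical, and the numbers in the usage lines are toy values chosen for easy arithmetic, not figures from this study. The calculation assumes reads are spread evenly over the reference and ignores duplicates and unevenly covered regions, so it gives only a rough expectation.

```python
import math


def mean_depth(n_reads, read_length, mapped_fraction, genome_size):
    """Expected mean depth of coverage contributed by reads that map.

    Assumes mapped reads are spread evenly over the reference.
    """
    if not 0 < mapped_fraction <= 1:
        raise ValueError("mapped_fraction must be in (0, 1]")
    if genome_size <= 0 or read_length <= 0:
        raise ValueError("genome_size and read_length must be positive")
    return n_reads * read_length * mapped_fraction / genome_size


def reads_needed(target_depth, read_length, mapped_fraction, genome_size):
    """Raw reads required to reach target_depth at a given mapping rate."""
    if not 0 < mapped_fraction <= 1:
        raise ValueError("mapped_fraction must be in (0, 1]")
    if genome_size <= 0 or read_length <= 0:
        raise ValueError("genome_size and read_length must be positive")
    return math.ceil(target_depth * genome_size / (read_length * mapped_fraction))


# Toy values (illustrative only): 50 bp reads, 10 kb reference.
toy_depth = mean_depth(1000, 50, 0.25, 10_000)
full_mapping = reads_needed(10, 50, 1.0, 10_000)
quarter_mapping = reads_needed(10, 50, 0.25, 10_000)
print(toy_depth, full_mapping, quarter_mapping)
```

In this toy case, dropping the mapped fraction from 1.0 to 0.25 quadruples the number of raw reads needed for the same depth, which is the general reason a distant reference demands more sequencing.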
Significantly different results were found in this study when using different mapping and SNP calling methods. From the combined BWA pileup and Bowtie pileup SNP pool, 173 SNPs from within the Cape buffalo population were selected for validation. The selection was based on gene function and SNP consequence, in order to use the validation data in a subsequent case-control association study. Validation required SNPs to pass two tests: (1) that the loci amplified and (2) that the loci were polymorphic in Cape buffalo. The number of SNPs detected by each method that were validated ranged from 45–75, and the percentage ranged from 43–58%. BWA and GATK had the highest percentage validated (57%), although this is probably not a statistically significant improvement over the Bowtie/mpileup combination (54% validated), which yielded more SNPs. The BWA/GATK SNPs were used for the functional analyses discussed below. It is not possible to accurately estimate false positive rates for any method except Bowtie and pileup, since all SNPs assayed were originally identified with this method. The identification of false positives in SNP discovery may be a result of sequencing errors, alignment errors or the occurrence of paralogous sequence variants. The error rate of the ABI SOLiD is estimated at 0.0006%, which would lead to about 50,000 erroneous base calls or 10% of the BWA and 0.5% of the Bowtie SNPs; in either case this is insufficient to explain the observed error rate, particularly since the Ti/Tv ratio was 2.18, as would be expected for a mammalian genome, rather than a ratio nearer 0.5, which would be expected from random SNP calls. Therefore it seems more likely that the errors resulted from the low coverage (2.7×) and problems with aligning to the genome of a different species. Since selected SNPs from this study will be used in future candidate gene association studies, it was desirable to identify as many SNPs as possible in order to obtain the largest number of candidates, and therefore false positives were considered less problematic than false negatives. SNPs within genes were annotated using the Ensembl SNP annotation API. Although this provided consequences for 57,054 SNPs, no general conclusions could be drawn from the data. There were no pathways overrepresented amongst the genes associated with the 27 SNPs that modified stop codons. It should be noted that the 1000 Genomes Project found 250–300 loss-of-function variants within genes in each individual, so the discovery of 27 in a pool of DNA from nine individuals probably does not have significant biological implications.
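The Ti/Tv argument above rests on a simple count: of the twelve possible single-base substitutions, four are transitions (purine to purine, or pyrimidine to pyrimidine) and eight are transversions, so calls made at random would give a ratio near 0.5. A minimal sketch of that calculation follows. The helper names `is_transition` and `ti_tv_ratio` are hypothetical and are not part of the pipeline used in this study.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def is_transition(ref, alt):
    """Return True for A<->G or C<->T changes, False for transversions."""
    ref, alt = ref.upper(), alt.upper()
    valid = PURINES | PYRIMIDINES
    if ref not in valid or alt not in valid:
        raise ValueError("alleles must be one of A, C, G, T")
    if ref == alt:
        raise ValueError("ref and alt alleles must differ")
    pair = {ref, alt}
    return pair <= PURINES or pair <= PYRIMIDINES


def ti_tv_ratio(snps):
    """Transition/transversion ratio for an iterable of (ref, alt) pairs."""
    snps = list(snps)
    transitions = sum(1 for ref, alt in snps if is_transition(ref, alt))
    transversions = len(snps) - transitions
    if transversions == 0:
        raise ValueError("no transversions; ratio is undefined")
    return transitions / transversions


# Every possible substitution once: 4 transitions, 8 transversions.
all_substitutions = [(r, a) for r in "ACGT" for a in "ACGT" if r != a]
print(ti_tv_ratio(all_substitutions))
```

Applied to a real call set, the same function would take the reference and alternate alleles of each biallelic SNP; a value well above 0.5 suggests that the calls are not dominated by random errors.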
The non-synonymous/synonymous substitution ratios provided very little evidence for genes being under positive selection, although a large number appeared to be under purifying selection. This may be because this approach was developed for comparing distantly related taxa, and the relatively small number of SNPs between the Cape buffalo and the cow (mean 4.2 per gene) means that the power to detect positive selection is very limited. The large number of genes that are apparently under purifying selection in the buffalo may contain many that are too slowly evolving to exhibit a detectable signal between these two species. Housekeeping genes appeared to be overrepresented in this list but this may be a consequence of their tendency to be relatively slowly evolving rather than evidence of purifying selection.
This investigation of the Cape buffalo, a species without an assembled genome, has yielded a wealth of data that will provide useful tools for further study of this species, including the planned candidate gene association studies.