Next generation sequencing (NGS) technologies are revolutionizing the study of variation among individuals in a population. The ability of sequencing platforms such as AB SOLiD and Illumina (Solexa) to sequence one billion basepairs (gigabase) or more in a few days has enabled the cheap re-sequencing of human genomes, with the genomes of a Chinese individual 
, a Yoruban individual 
, and matching tumor and healthy samples from a female individual 
sequenced in the last few months. These resequencing efforts have been enabled by the development of extremely efficient mapping tools, capable of aligning millions of short (25–70 bp) reads to the human genome 
. In order to accelerate the computation, most of these methods allow for only a fixed number of mismatches (usually two or three) between the reference genome and the read, and usually do not allow for the matching of reads with insertion/deletion (indel) polymorphisms. These methods are extremely effective for mapping reads to the human genome, most of which has a low polymorphism rate, and so the likelihood that a single read spans multiple SNPs is small. While matching with up to a few differences (allowing for a SNP and 1–2 errors) is sufficient in these regions, these methods fail when the polymorphism level is high.
NGS technologies are also opening the door to the study of population genomics of non-model individuals in other species. Various organisms have a wide range of polymorphism rates - from 0.1% in humans to 4.5% in the marine ascidian Ciona savignyi
. The polymorphisms present in a particular species can be used to discern its evolutionary history and understand the selective pressures in various genomic loci. For example, the large amount of variation in C. savignyi
(two individuals' genomes are as different as Human and Macaque) was found to be due to a large effective population size 
. The re-sequencing of species like C. savignyi
(and regions of the human genome with high variability) requires methods for short read mapping that allow for a combination of several SNPs, indels, and sequencing errors within a single (short) read. Furthermore, due to larger-scale “structural” variation, only a fraction of the read may match to the genome, necessitating the use of local, rather than global, alignment methods.
Previous short read mapping tools typically allow for a fixed number of mismatches by separating a read into several sections and requiring some number of these to match perfectly, while others are allowed to vary 
. An alternative approach generates a set of subsequences from the read (often represented as spaced seeds 
), again in such a manner that if a read were to match at a particular location with some number of mismatches, at least one of the subsequences would match the genome 
. While these methods are extremely fast, they were developed for genomes with relatively low levels of polymorphism, and typically cannot handle a highly polymorphic, non-model genome.
This becomes especially apparent when working with data from Applied Biosystem's SOLiD sequencing platform (AB SOLiD). AB SOLiD uses a di-base sequencing chemistry that generates one of four possible calls (colors) for each pair of nucleotides. While a sequencing error is a change of one color-call to another, a single SNP will change two adjacent color positions. Hence a read with two (non-adjacent) SNPs and a sequencing error will differ from the reference genome in five different positions. Simultaneously, the nature of the di-base sequencing code allows for the identification (and correction) of sequencing errors, so by carefully analyzing the exact sequence of matches and mismatches within a read, it is possible to determine that the read and the genome differ by two SNPs. While efficient mappers for color-space sequences have been developed 
, they translate the genome to color-space, and directly compare to the color-space read. The complexity of the color-space representation makes the identification of complex variation such as adjacent SNPs and short indels challenging or impossible with these tools.
In this paper we develop algorithms for the mapping of short reads to highly polymorphic genomes and methods for the analysis of the mappings. We demonstrate an algorithm for mapping short reads in the presence of a large amount of polymorphism. By employing a fast k-mer hashing step and a simple, very efficient implementation of the Smith-Waterman algorithm, our method conducts a full alignment of each read to all areas of the genome that are potentially homologous. Secondly, we introduce a novel, specialized algorithm for mapping di-base (color-space) reads, which allows for an accurate, non-heuristic alignment of AB SOLiD reads to a reference genome. Finally, we introduce methodology for evaluating the accuracy of discovered alignments. Because a read may match the genome in several locations with variable amounts of polymorphism, we develop a statistical method for scoring the hits, allowing for the selection of the most probable variants, and filtering of false positives.
Our methods are implemented as part of SHRiMP: the SHort Read Mapping Package. To demonstrate the usefulness of SHRiMP we re-sequenced a Japanese Ciona savignyi genome on the SOLiD platform. Preliminary estimates obtained in the course of sequencing the reference genome indicate that the SNP heterozygosity is 4.5%, whereas indel heterozygosity is 16.6%. This species represents the most challenging known test case for the detection of polymorphisms with short read technologies. We aligned the SOLiD reads of the Japanese individual to the C. savignyi reference genome using both SHRiMP and AB's read mapper. SHRiMP is able to identify 5-fold more SNPs than AB's mapper, while also capturing 70,000 indel variants.