These three publications are the current published state of the art in RADSeq analysis. In the busy months since these studies were performed, the RADSeq technology has evolved significantly, most importantly with the improvement in Illumina read lengths (now up to 150 bases) and in the use of paired-end sequencing (generating 300 bases per sequenced fragment). RADSeq yields two kinds of markers: presence–absence markers resulting from polymorphism in the restriction enzyme cut site, and substitutional (SNP, indel) markers in the tag sequences. In populations with a hypothetical expected level of between-haploid variation of one difference in 1000 bases, short reads (such as the 36 base reads used by Baird et al. [5]) would be predicted to identify one SNP per 28 RAD tags. Illumina 150 base reads would yield one SNP per seven RAD tags.
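The arithmetic behind these figures can be reproduced with a short sketch. It assumes that differences are spread uniformly along the genome and that the whole read is available for SNP discovery; the function name `tags_per_snp` is illustrative and is not taken from any RADSeq software.

```python
def tags_per_snp(read_length: int, diversity: float = 1 / 1000) -> float:
    """Expected number of RAD tags screened per SNP found.

    Assumes differences are spread uniformly, so a read of `read_length`
    bases carries on average `read_length * diversity` differences.
    """
    return 1.0 / (read_length * diversity)


# Read lengths discussed in the text: 36 bases and 150 bases.
for length in (36, 150):
    print(f"{length} base reads: about one SNP per "
          f"{round(tags_per_snp(length))} RAD tags")
```

With a diversity of one difference in 1000 bases, this gives approximately 28 tags per SNP for 36 base reads and approximately seven for 150 base reads.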
Paired-end sequencing can also be performed from RADSeq libraries. Because fragments are randomly sheared, the paired sequences associated with each RAD tag will begin at different positions downstream of the restriction site. These RADSeq pair tags can be assembled to produce extended (200–300 base) contigs linked to each RAD site (F). The ‘target’ for identification of SNPs and indels is thus four times the length of the RAD tag. The paired contigs can also be used for the development of PCR-based assays for higher throughput analyses.
RADSeq has several advantages compared to other SNP genotyping approaches such as AFLPs and oligonucleotide SNP chips. First, rather than just detecting changes in 16-base (AFLP) or 20-base (oligonucleotide hybridization) targets, the full sequences of the RAD tag and its paired contig can be screened for SNPs and indels. There is no variant discovery and assay design step, and there are no hybridization optimization issues. For example, acquiring additional sets of markers to increase density simply requires the use of different restriction enzymes, rather than an extended discovery and design process. The initial analysis of RAD tag sequences is also arguably simpler than the interpretation of AFLP gels or oligoarray images. As in all genetic association experiments, the resolution of RADSeq in identifying the loci underpinning traits of interest depends on the number of independent markers assayed, their levels of variability, and the number of crossovers that have occurred in the mapping population. By increasing the number of markers (resampling the same genomes with additional restriction enzymes) and the number of individual cross progeny or population representatives, the accuracy of RADSeq mapping can be improved.
The simple bioinformatics steps outlined above provide a framework for RADSeq analysis, but this is just the start [14]. Simple approaches to SNP calling and error correction may yield thousands of markers, but will leave much of the data unmined. Genomes differ by complex patterns of substitution and indels, and the error rate of the Illumina platform may obscure some true alleles. RAD tags that appear to be ‘repetitive’ paralogue clouds when just the RAD tag is considered may be revealed as sets of alleles when paired contigs are analysed. Much more comprehensive approaches are possible, particularly for de novo RAD, where no reference genome is available.
Hohenlohe et al. [6] established a maximum likelihood framework for calculating the likelihood of each homozygote or heterozygote genotype at each locus given the bases called at the locus and the sequencing error at the locus. The error model used treats the error rate as varying across a single read, as opposed to being identical at all bases across all reads. More complex error models could be used, taking into account issues such as chimeras and copying errors induced during PCR and the GC bias in Illumina sequencing data [15]. The GC bias issue makes it difficult to separate heterozygotes, homozygotes, repeats and paralogues by read count alone. Further normalization of raw count data will be required to take full advantage of de novo RAD sequencing for population genetics studies.
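A likelihood calculation of this general kind can be sketched as follows. This is a minimal illustration under a simple multinomial model, not a reproduction of the published implementation: it assumes a diploid site, a single error rate supplied by the caller (`error_rate`, which must lie strictly between 0 and 1), and errors that are equally likely to produce any base. The function names are hypothetical.

```python
from itertools import combinations_with_replacement
from math import lgamma, log

BASES = "ACGT"


def genotype_log_likelihoods(counts: dict, error_rate: float) -> dict:
    """Log-likelihood of each diploid genotype given base counts at one site.

    Assumed model: a homozygote yields its own base with probability
    1 - 3e/4; a heterozygote yields each of its two bases with probability
    1/2 - e/4; every other base appears with probability e/4.
    """
    observed = [counts.get(base, 0) for base in BASES]
    # Multinomial coefficient, identical for every genotype at this site.
    coeff = lgamma(sum(observed) + 1) - sum(lgamma(n + 1) for n in observed)
    result = {}
    for a, b in combinations_with_replacement(BASES, 2):
        log_lik = coeff
        for base, n in zip(BASES, observed):
            if a == b:
                p = 1 - 3 * error_rate / 4 if base == a else error_rate / 4
            else:
                p = 0.5 - error_rate / 4 if base in (a, b) else error_rate / 4
            log_lik += n * log(p)
        result[a + b] = log_lik
    return result


def call_genotype(counts: dict, error_rate: float = 0.01) -> str:
    """Return the genotype with the highest likelihood under the model."""
    log_liks = genotype_log_likelihoods(counts, error_rate)
    return max(log_liks, key=log_liks.get)


print(call_genotype({"A": 20, "G": 1}))   # a single G read is treated as error
print(call_genotype({"A": 10, "G": 11}))  # balanced counts favour a heterozygote
```

In a fuller treatment the error rate would itself be estimated rather than fixed, and a likelihood ratio test between the two best genotypes could be used to decline a call when the evidence is weak.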
In addition, the availability of tens of thousands of markers across hundreds of individuals will considerably deepen the possibilities for population genomic analyses. It will be possible to detect subtle effects in multiple markers across the genome that were previously unobservable, not least because it will be possible to calculate a high confidence, genome-wide average for any chosen statistic simultaneously with the scoring of outliers, enabling simple identification of divergence from neutrality wherever it occurs. Signatures of positive, balancing, divergent and background selection can be identified separately within and across populations.
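The idea of computing a genome-wide average alongside outlier scoring can be illustrated with a deliberately simple sketch. It assumes that a per-marker statistic has already been calculated, and flags markers lying more than a chosen number of standard deviations from the genome-wide mean. The marker names, the values and the threshold are hypothetical, and a real analysis would probably use smoothing along the genome and a resampling-based significance test in place of a fixed cutoff.

```python
from statistics import mean, stdev


def flag_outliers(values: dict, threshold: float = 3.0):
    """Return the genome-wide mean and the markers that deviate strongly.

    `values` maps a marker name to its per-marker statistic. A marker is
    flagged when it lies more than `threshold` sample standard deviations
    from the mean across all markers.
    """
    genome_mean = mean(values.values())
    spread = stdev(values.values())
    if spread == 0:
        return genome_mean, []
    outliers = [name for name, value in values.items()
                if abs(value - genome_mean) > threshold * spread]
    return genome_mean, outliers


# Hypothetical data: 20 markers near a low background value, one divergent.
markers = {f"m{i}": (0.04 if i % 2 == 0 else 0.06) for i in range(20)}
markers["m20"] = 0.9

genome_mean, outliers = flag_outliers(markers)
print(f"genome-wide mean = {genome_mean:.3f}, outliers = {outliers}")
```

Because the mean and standard deviation are themselves inflated by the outliers, a robust alternative such as the median and median absolute deviation may be preferable when many loci are under selection.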