The genome serves simultaneously as a basic blueprint, encoding information for proper cellular and developmental processes necessary to produce an organism, and as a historical record of the demographic processes and selective forces acting in a given lineage. Exploration of mechanistic details through biochemistry, genetics, and development has lead to a deeper understanding of how genotype leads to phenotype, while exploitation of the historical record has enabled the fields of systematics, population genetics, and molecular ecology to elucidate the pressures and processes that shape diversity in populations and divergence between species. Studies of genetic information both encoded and recorded in genomes work with the same currency–comparison of homologous sequences across individuals–but these approaches employ very different modes of inference, and as such the details of a particular experiment dictate optimal marker resolution (). To address the need for flexibility in marker number, we describe a next-generation sequencing-based method for determining individual sequence genotypes that can be tuned to sample a large range (from hundreds to hundreds of thousands) of randomly distributed regions genome-wide.
A flexible genotyping method can be used to optimize the number of genetic markers for a specific experimental approach in a given biological system.
The plummeting cost and skyrocketing throughput of DNA sequencing has begun to enable sequencing of entire genomes of study populations of some focal species 
; however, even in traditional model species (e.g., humans, laboratory mice, and Drosophila
) resources for complete genome resequencing of large numbers of individuals by single investigators are still limited. As nearly all population and comparative analyses depend on an increasing number of individuals or samples for statistical power, several methods have emerged to increase the number of individuals sampled for the same resource investment by reducing the fraction of each individual genome sequenced. The crucial hurdle that must be overcome in reducing sampling for each individual is ensuring that the same (homologous) regions are examined between individuals. An early solution to this problem took advantage of sequence specificity of restriction endonucleases to construct a “reduced representation” sequencing library for polymorphism discovery 
. While initially limited to SNP discovery rather than individual genotype determination by the cost and throughput of Sanger sequencing, later studies using a similar approach capitalized on high-throughput massively parallel sequencing such as 454 (454 Life Sciences, Branford, CT) and Genome Analyzer (Illumina, Inc., San Diego, CA) and reported both reliable SNP discovery and genotyping 
. Second-generation sequencing of DNA libraries comprised only of regions adjacent to restriction sites was later dubbed Restriction Associated DNA sequencing (RADseq; ; 
and developed further in 
). More refined methods have since emerged (e.g., Multiplexed Shotgun Genotyping [MSG]; 
), but rely on having a complete reference sequence available. Subsequently, studies extended RADseq to species that lack a reference genome sequence, but have restricted variant discovery to only those regions that contained at most one or two polymorphic sites 
Double digest RAD sequencing improves efficiency and robustness while minimizing cost.
While these approaches permit genotyping of multiple individuals with substantially reduced sequencing investment, they are limited in their ability to allow researchers to tune the fraction of genome sampled (i.e., to genotype only the number of markers needed for a given experiment). Furthermore, while the RADseq method is suitable for systems that lack a sequenced reference genome, the existing computational tools for analyzing resulting data perform with relatively poor efficiency. In published examples of RADseq data analyzed without a reference genome 
, approximately half of the sequence data in each case was discarded because the analysis was not robust to error in sequence reads, and an additional ~30–50% of loci were discarded due to the presence of more than 1–3 variable sites in each region. In addition to inefficiency, including only reads below a set number of nucleotide differences between haplotypes at a locus introduces bias in these data, removing rapidly diverging regions and complicating analyses such as phylogenetic rate and coalescence time estimation 
. Thus, the ability to optimize the number of loci sequenced and maximize the number of sequence reads incorporated in the analysis, and to take advantage of multiple sites per locus would improve both the efficiency and utility of this approach.
To increase the breadth of RADseq applications, we have elaborated on the method described by Baird et al. 
by eliminating random shearing and explicitly using size selection to recover a tunable number of regions, which are distributed randomly throughout the genome. Moreover, to maximize our ability to multiplex (i.e., increase the number of samples per sequencing lane), we also have developed a two-index combinatorial tagging approach (e.g., n * m
individuals using n+m
indices) and an accompanying computational analysis toolkit and lightweight data management component to facilitate high-order multiplexing of many hundreds of individuals. We also developed a graph clustering-based pipeline to maximize sequence read inclusion in analysis and permit detection of orthologous haplotypes regardless of divergence (i.e., without arbitrary similarity requirements), thereby improving analysis sensitivity and efficiency. Our software pipeline utilizes a novel approach for filtering resulting loci independent of coverage depth and converts the resulting haplotype multiple alignments into standard SAM/BAM format for downstream analysis, such as variant detection using the Genome Analysis Toolkit 
or samtools 
. This method has proven inexpensive (i.e., fractions of a penny per individual per site), rapid (i.e., approximately 8 hours of hands-on time), requires little starting material (i.e., 100 ng of DNA), and is suitable for high-throughput applications (all steps can be carried out in microtiter plates). In addition, this method can be employed, and its results efficiently analyzed, with no prior knowledge of genome sequence.