The introduction of massively parallel sequencing platforms has opened up entirely new possibilities in all fields of functional genomics. However, the capacity of these platforms greatly exceeds the needs in most studies involving targeted profiling and typing. In an attempt to better exploit the capacity of these systems and reduce the cost per sequenced sample, indexing procedures has been introduced, allowing multiplex identification and sorting of over 100 samples 
. However, increasing the sample size to thousands necessitates additional tagged adaptors leading to preparation of thousands of libraries, which is time consuming and expensive. This problem has been addressed by Erlich et al. 
by implementing a sample-pooling pattern combined with tagging of the pools and Galan et al. 
by combination of tags in the forward and reverse PCR primers. Although these studies demonstrated that the number of unique tags does not necessarily need to match the number of samples, they considerably increase the complexity of sample handling by employing experimentally complicated sample or primer mixing procedures. In this study we present a two tagging strategy that employs a combination of two tags, added in two steps, which not only enables accurate multiplex analysis of thousands of samples in parallel, using a reduced number of tags, but also reduces the complexity of sample handling.
The principle of the two tagging strategy is depicted in . The tags are added in two steps. First, the target is amplified by tagged PCR primers giving amplicons that are tagged in both ends (denoted position-tags). By using 96 position-tags and strictly handle them in a 96-well PCR-plate, 96 samples are uniquely tagged in a single round. The PCR products from a plate are then pooled and a second tag (denoted plate-tag), which is incorporated next to the sequencing primer, is ligated to the pooled samples. The degree of sample multiplexity is thus substantially increased by adding up indexed plates. The number of required tags can be described as tags=(N/96)+96
, where N is the total number of samples (so that N/96 expresses the number of required plate-tags while the constant 96 is the number of position tags in the PCR plate). However, if more than one sequencing lane is utilized, the number of tags is reduced to tags=(N/L*96)+96
where L is the number of lanes used for sequencing the sample set. In this study, N was set to 4992 and L to 2, giving a total of 122 (26+96) tags.
To prove the concept of the two tagging approach, the 2nd
exon of the DLA-DRB1 gene in dogs and wolves was amplified using the 96 position-tags in 52 plates. Out of the 4992 wells, samples were added to 4708, while 284 wells were used for negative PCR controls. After gel electrophoresis, the number of successful PCR products was estimated to 3700 (a success rate of about 79%). The success rate of the PCR was largely dependent on the quality of the samples. For example, 59% of the DNA samples originated from hair and 85% from FTA cards resulted in detectable products. The rate of successfully amplified samples would however increase if the quality of all samples were checked prior to amplification. In addition, two of the position-tagged primers (D11 and D12 (supplementary table S1
)) did not give any products at all and position-tagged primers C3 and G9 performed poorly. This is probably due to poor primer synthesis and since the position-tags exist in all 52 plates, over 150 of the failed PCR reactions could be related to these primers.
After initial PCR amplification, the library pooling strategy was deployed, where each library was pooled prior to ssDNA isolation, enabling time and cost effective sample processing (). Worth mentioning, to save time and reagents, no concentration measurement of the 4992 PCR products was done. The 52 libraries could be processed in two days using an automated setup, and using third party reagents substantially reduced the cost. Uneven read distribution between indexed libraries is a common problem with increasing degree of multiplexing, raising high requirements on concentration determination. By using an automated electrophoresis station it was possible to distinguish the ligated amplicon peak from the non-ligated peak (supplementary figure S1
), allowing for correct quantification of ligated amplicons and volume adjustment and thereby enabling accurate library pooling. With this set-up an even read distribution was obtained ().
Distribution of reads across the MIDs.
The 52 libraries were sequenced on the 454 GS FLX Titanium Chemistry using three out of four lanes, generating a total of 700,000 reads. As mentioned above, the project was designed to employ 2 lanes (L
2) but as we managed to access three lanes, the libraries were divided into three emPCR and sequencing lanes instead of two (see and supplementary table S2
). For 285,000 reads, the read-lengths were long enough to identify the plate-tag and both position-tags. The low rate of useable sequence reads could be explained by the use of a strict criterion not allowing sequence errors in the tags and the fact that the target region harbors a high number of homopolymeric sequences (a well known problem with Pyrosequencing chemistry). 10% of these sequence-reads showed different position-tags in the ends, indicating chimeric formation. These chimers are most probably formed from different single-stranded PCR-products (from different samples) that are recombined when amplicons within each plate are pooled and extended to completion during the end-polish reaction. To minimize this effect PCRs with fewer cycles could be performed (reducing the amount of single stranded amplicons) or the end-polish reaction could be skipped (by employing a polymerase that lacks terminal transferase activity in the PCR). However, the fact that the inner position-tags are incorporated in both ends of the sequencing templates makes it possible to detect chimeric sequences from different samples and exclude them from the data analysis. This demonstrates that our straightforward approach to detect chimeric sequences is needed for obtaining correct results during multiplex sequencing and sorting of polymorphic genes. These erroneous constructs would have gone undetected or at least difficult to detect employing previously mentioned tagging methods 
. To further investigate the extent of multi-sample chimeric sequence formations, a library was created in which the plate-tags were incorporated by a second PCR (using the C and D handles, see primer sequences in supplementary table S1
) after sample pooling (instead of ligation). This means that there were 96 samples in each plate-tagging PCR and thus the extent of chimeric products was expected to increase. After sequencing, chimeric sequences were observed in 97% of the reads where all tags could be identified (data not shown). We therefore strongly recommend that samples should not be pooled before a PCR reaction when sample sorting and individual genotyping is the aim of the study. Allelotyping of pooled samples 
is not included in this recommendation since the aim of these studies is to estimate and compare allele frequencies in different cohorts.
A ten-fold difference between the most and the least sequenced position-tags was observed (data not shown). However, since each position-tag is incorporated in 52 samples, there should be an even spread of sample quality across the position-tags, hence less difference would be expected. The difference in sequence depth within the position-tags is therefore more likely an effect of primer qualities in combination with the fact that different tag sequences could affect the primer annealing by forming secondary structures. Still, the majority of the PCR products were correctly genotyped. The minimum requirement for correct genotyping was set to 10 reads per allele resulting in 20 reads per individual 
. Out of the estimated 3700 PCR-products, 3465 generated more than 20 reads (). To investigate how the polymorphisms are distributed across the sequenced exon, all generated consensus sequences were aligned and the fraction of polymorphisms for each position was plotted (). This resulted in a number of hot spot positions where the polymorphisms are concentrated.
Distribution of reads across the samples.
To conclude, we here demonstrate a simple, robust and reliable method for sequencing thousands of samples in parallel in one single sequencing run. The method is robust enough to omit the time and reagent consuming step of equimolar pooling at the individual level by performing equimolar pooling of 96 samples at a time after introduction of the second plate specific tag. The need for making unique primer combinations for each sample is replaced by having the same 96 position specific primer pairs for all PCR plates. By strictly handling the PCR tagging procedure in 96-wells format and by automating the ligation of plate-tags and library preparation, it is possible to increase the number of samples without affecting the sample handling complexity. This higher control over pipetting errors combined with automated library preparation is one of the main strengths of the presented method since this approach does not require less number of unique primers compared to the competing methods. We have shown that it is possible to obtain sufficient sequence depth for 94% of the successfully amplified samples when running at a multiplexing level of 4992 tag combinations on the 454 sequencing system. Furthermore, incorporation of the same position-tags at both ends of the fragments allows detecting chimeric sequences from different samples that is a prerequisite for accurate identification of samples. We believe that the reliability of the method combined with scalability makes it suitable for sequencing targeted enriched DNA or RNA of even greater sample sizes on platforms such as HiSeq 2000 and SOLiD where the number of reads per run is in the magnitude of 2000 times greater than the 454 system.