|Home | About | Journals | Submit | Contact Us | Français|
Variation within individual genomes ranges from single nucleotide polymorphisms (SNPs) to kilobase, and even megabase, sized structural variants (SVs), such as deletions, insertions, inversions, and more complex rearrangements. Although much is known about the extent of SVs in humans and mice, species in which they exert significant effects on phenotypes, very little is known about the extent of SVs in the 2.5-times smaller and less repetitive genome of the chicken.
We identified hundreds of shared and divergent SVs in four commercial chicken lines relative to the reference chicken genome. The majority of SVs were found in intronic and intergenic regions, and we also found SVs in the coding regions. To identify the SVs, we combined high-throughput short read paired-end sequencing of genomic reduced representation libraries (RRLs) of pooled samples from 25 individuals and computational mapping of DNA sequences from a reference genome.
We provide a first glimpse of the high abundance of small structural genomic variations in the chicken. Extrapolating our results, we estimate that there are thousands of rearrangements in the chicken genome, the majority of which are located in non-coding regions. We observed that structural variation contributes to genetic differentiation among current domesticated chicken breeds and the Red Jungle Fowl. We expect that, because of their high abundance, SVs might explain phenotypic differences and play a role in the evolution of the chicken genome. Finally, our study exemplifies an efficient and cost-effective approach for identifying structural variation in sequenced genomes.
Structural variation within the genome, including insertions, duplications, deletions, and inversions of up to multiple kilobase pairs, have recently been described in a variety of species, including humans [1-3], mice , rats , silkworms  drosophila , and dogs . These genomic variations were recently found to be widespread, encompassing 5% of the human genome , and are thought to be involved in (co)determining complex phenotypes [10,11].
The contribution of structural variants (SVs) to complex phenotypes has been measured by association analyses of variance in gene expression levels (traits) and the presence of SVs. SNPs and SVs have been shown to account for 83.6% and 17.7%, respectively, of the total detected genetic variation in gene expression, with only a limited overlap . The effect that SVs have on gene expression is likely underestimated given the much less completeness and accuracy with which SVs could be queried at that time. In humans, SVs have been associated with sporadic and Mendelian diseases, such as Williams-Beuren syndrome, mental retardation, and red-green color blindness. SVs have also been associated with complex human traits, such as autism, schizophrenia, Crohn's disease, and susceptibility to HIV infection . Because of their association with human diseases, the importance of SVs has become increasingly apparent [9,14,15]. For most other species, including the major farm animals, chickens, cattle, and pigs, the extent and biological consequences of SVs have remained largely unknown due to the lack of a cost-effective approach for detecting SVs.
Until recently, comparative genomic hybridization (array-CGH) was the most commonly used method for detecting SVs . Fosmid paired-end sequencing, which is a more laborious technique, has been used to detect SVs larger than 8 kb [17,18]. The inability to resolve smaller SVs using array-CGH results in the over-representation of larger SVs in current databases of structural variation (e.g., http://projects.tcag.ca/variation/). The resolution of array-CGH, though extremely costly, can be improved by using high-resolution whole-genome tiling arrays. Most of these SVs have been identified by methods that do not resolve SV end points at the base pair level. In addition, methods like array-CGH are based on a reference genome that currently does not encompass all SVs within the population and, thus, is limited in scope. Genomic regions that are the result of deletions not present in the reference genome are not captured by the array and not analyzed for SVs.
Next generation sequencing (NGS) technology was recently shown to be a powerful alternative to array-CGH for identifying genomic structural variation [1,7,19]. Using paired-end sequencing, SVs can be identified with single base pair resolution. Moreover paired-end sequencing allows for the detection of balanced rearrangements in which there is no gain or loss of a genomic region, such as inversions and translocations, which cannot be identified by array-CGH. Paired-end sequencing and mapping (PEM) involves sequencing the paired ends of fragments of known insert size from a genomic DNA library and computationally mapping DNA reads to a reference genome.
Here, we used PEM on reduced representation libraries (RRLs) of pooled chicken DNA samples. In the chicken genome, only 43 (larger) SVs have been described thus far . These SVs encompass 16 chicken-turkey inter-specific copy number variants (CNV) and 32 chicken-duck inter-specific CNVs, of which five CNVs overlap with inter-specific chicken-turkey CNVs . In chicken, some phenotypes have already been linked to structural variation, including the pea-comb  and late feathering  phenotypes. With PEM of an RRL, we provide a cost-effective approach for exploring the presence of SVs at high resolution within four chicken breeds.
To identify genomic rearrangements in the chicken genome, we applied massively parallel sequencing using the Illumina Genome Analyzer platform to sequence both ends of the genomic DNA fragments derived from the RRLs. We used pooled samples from 25 individuals to construct AluI RRLs for a white egg layer line, brown egg layer line, and two different broiler lines. For the white and brown egg layer lines, the 150-200 bp AluI fragments were used for creating the RRL; for the two broiler lines, 125-200 bp AluI fragments were used. From the brown and white egg layer RRLs, we obtained 31.61 million and 29.70 million raw reads, respectively, and from broiler 1 and broiler 2 we obtained a total of 34.8 million and 32.4 million raw reads, respectively. Reads were filtered for the presence of the restriction enzyme tag and trimmed to 32 bases. We required a phred quality score  of at least 20 (Table (Table1)1) for each base in the 32-bp read. The fraction of read pairs for which both reads mapped back to the reference chicken genome (Red Jungle Fowl built WASHUC2) was 78% for broiler 1 and 77% for broiler 2 (Table (Table1).1). In the layers, the fraction was 76% (brown egg layer) and 73% (white egg layer). In all breeds the were approximately hundred thousand paired reads (0.5-0.6%)of which only one read mapped back to the reference genome, whereas up to 26% of the read pairs had no end uniquely mapping back to the reference genome.
To calculate the sequence coverage of the RRL, we estimated the number of fragments in the RRL by performing an in silico AluI digest of the chicken genome build WASHUC2, which resulted in 583,826 fragments of 150-200 bp, whereas 947,538 fragments of 125-200 bp were obtained. We calculated RRL sequence coverage based on the paired-end reads that passed our sequence quality filters. Coverage of the RRLs ranged from 11-13X in broiler lines to 18-20X in the layer lines, indicating that we analyzed, on average, 22-40% of the haplotypes of the 25 individuals used for constructing the RRL (Table (Table22).
The real sequence coverage of the RRL was estimated by clustering identical paired reads and plotting the distribution of clusters according to the numbers of reads per cluster (Figure (Figure1).1). The majority of the fragments in the RRL was covered by 10 paired reads.
For each breed, we calculated insert sizes for paired ends that mapped in the correct orientation (Figure (Figure2).2). The results show a peak at ~185 bp and a shoulder of smaller fragments, indicating that the insert sizes were not equally distributed. The upper limit of fragment size was clearly demarcated at ~210 bp, which corresponded well to the size range of the excised fragments. Based on these results, the lower limit was estimated to be ~135 bp in the layer lines and ~110 bp in the broiler lines, which is consistent with the applied size selection. To eliminate false positives, we established size thresholds of 100 and 220 bp and considered mapping paired reads within this range as consistent with the reference genome.
In each breed, roughly 0.1% of the mapping read pairs had no concordant alignment in the reference genome, referred to as discordant paired-end reads [2,17], indicating a potential SV. Discordantly mapping read pairs are those whose distance apart is less or greater than expected from the RRL size range or in another relative orientation than expected based on the reference genome (Table (Table1).1). Paired reads that mapped to two different chromosomes (up to 0.12%) were excluded from further analysis. Discordantly mapping read pairs of the larger chicken chromosomes (1-15,20 and Z) with similar mapping coordinates and predicting a similar putative SV were clustered in 10,559 clusters. Clusters were classified as having an insert size that was too large (deletions, n = 5135), too small (insertions, n = 5241), or an incorrect orientation of ends (inversion breakpoints, n = 183) with respect to the chicken genome sequence.
Because of the high number, not all of the clusters are presumed to represent a true genomic rearrangement, but some are incorrectly mapped reads caused by sequencing errors that result in low quality mapping. Therefore, the average mapping quality of discordantly mapping read pairs was evaluated per chromosome compared to the average mapping quality scores of read pairs that mapped consistently within the reference genome. However, the average mapping quality of discordantly mapping reads was similar to the mapping quality of concordantly mapping read pairs (Table (Table3).3). We also observed that the average coverage by paired reads differed up to two-fold between chromosomes, but the number of fragments per chromosome in the RRL correlated well with chromosome size.
To be considered as a true putative SV cluster, we required both ends to have an average mapping quality similar to concordantly mapping reads, which was ~60. In total, 7,789 clusters consisting of 3794 deletions, 3931 insertions, and 64 inversion breakpoints met this criterion. SV clusters predicting a deletion or insertion were further prioritized for confirmation screening on the basis of parameters listed in the Methods section. To validate our approach for identifying SVs, we initially evaluated 15 (SV13-28) predicted SVs (Table (Table4)4) using PCR to genotype pooled samples from the four chicken breeds with primers spanning predicted breakpoint junctions. A total of eight SVs yielded a clear PCR product of the expected size (Figure (Figure3A).3A). For these SVs, PCR was performed on eight individuals from breeds in which the SV was confirmed to be present by the SV-specific PCR product (Figure (Figure3B).3B). Individual SV-specific PCR products typed homozygous for the SV were sequenced to disentangle the rearrangement at the base-pair level. The sequence analysis results for these eight identified rearrangements were all consistent with our SV predictions.
The results suggest that the presence of concordantly mapping reads partly overlapping the predicted SV region did not correlate with the quality of SV prediction, whereas reference errors in the predicted SV region correlated negatively. Furthermore, the results indicate that putative SVs predicted by a single or a few discordantly mapping read pairs that mapped a slightly different distance than expected were false positives, whereas the majority of putative SVs with greatly deviating mapping distances were confirmed as being true SVs. With this limited number of observations, we formulated a simple but fitting rule to determine SV clusters with a high likelihood of representing a genomic rearrangement from false positives.
We hypothesize that the size range of targeted DNA fragments isolated from the gel might contain a very small fraction of fragments outside the established size thresholds (Figure (Figure2).2). This lack of proper separation is likely caused by migration artifacts caused by secondary DNA structures. To compensate for this bias we required that SVs, predicted based on discordantly mapping read pairs that mapped to the reference between 220 and 720 bp apart, meet a representation constraint. In our proposed validation rule, we assumed an inverse relationship between the span-size deviation of a predicted SV and the number of discordantly mapping read pairs (n) required to predict a true SV. We hypothetically state that SVs meeting the abundance constraint (span-size deviation) × n >500 can be validated as true deletions. We assumed that this empirical rule is also applicable to insertions predicted by read pairs that map (too short) a distance of 32-100 base pairs. To test our empirical rule, we applied it to the subset of deletion (n = 3794) and insertion (n = 3931) clusters used in the previous validation study, obtaining 186 candidate putative deletions and two insertions. Both insertion candidates (SV50 and SV51) and a total of eight deletions (SV52-SV59), four of which narrowly met the rule constraints (Figure (Figure4),4), were selected for confirmation. PCR-based genotyping analysis showed that all selected candidates were confirmed in the pooled samples (Figure (Figure3A).3A). We also observed that the PCR-based SV genotyping results for pools correlated well with the predicted presence of a particular SV in the breeds based on the sequence dataset (Table (Table44).
Genotyping results suggested that the presence or absence of SVs in a particular breed is fairly well predicted by the sequencing data. Therefore, we further analyzed 186 rearrangements (deletions) validated by our rule for breed specificity. We also analyzed breed specificity for 280 putative deletions that resulted from applying a less stringent read mapping quality constraint, which was also applied in previous SV detection studies [19,25]. The results were compared by plotting both data subsets in weighted Venn diagrams (Figure (Figure5).5). In the validated dataset of 186 deletions, we detected the most SVs in broilers, 114 in broiler 1 and 109 in broiler 2, whereas fewer SVs were detected in the layer lines, 60 in white egg layers and 85 in brown egg layers. Ten percent of the rearrangements were present in all four breeds. SVs detected in white egg layers were 23% breed-specific, and the other 77% were evenly shared with the other breeds. The brown egg layers had the fewest breed-specific SVs (18%) and shared a remarkably high percentage (65%) with broiler 1. Broiler 1 and broiler 2 showed similar percentages of breed-specific SVs, and 36% of the SVs in broiler 2 were shared with broiler 1. Applying a less stringent mapping quality constraint resulted in a 50% increase in SVs, whereas the distribution of SVs over the four chicken breeds remained approximately the same.
The majority of detected SVs were small (Figure (Figure6);6); approximately 85% of all SVs were <1 kb whereas 60% were <500 bp. However, we also predicted and validated SVs spanning multiple kilobases. Predicted SVs validated by our rule were mapped to the chicken genome, and we observed an even distribution on the chromosomes (Figure (Figure7).7). Sequence annotations of the regions overlapping the identified SVs were extracted from Ensembl ; 44% of the SV read pairs mapped within genes. The read pairs for a minor fraction of the SVs (~2%) spanned predicted exons; these SVs were analyzed for their effects on gene annotations or gene models (Table (Table5).5). The majority of all predicted SVs represented a putative deletion of low complexity and repetitive sequence motifs in intronic or intergenic regions (Table (Table6).6). An exception is SV52, representing a deletion within gene ENSGALG00000010719, which has been annotated as DNA glycosylase FPG2.
All PCR-validated SVs were characterized by traditional sequence analysis to reveal their exact breakpoint locations, from which the chromosomal position and deletion/insertion sizes were derived (Table (Table4).4). Sequence losses were annotated using Ensembl . For rearrangements in SV52, we analyzed the effect on the in silico transcript to which it was mapped. The majority of intronic deletions resulted in a loss of a variety of known repetitive motifs (Table (Table7).7). In contrast, we could not find annotations in Ensembl  for most losses in intergenic regions or known repeats using RepeatMasker (Smith and Green unpublished). DNA sequences at the SV breakpoints were analyzed for signatures indicating the mechanism by which the SVs formed. We identified microhomology in three sequenced SVs (Figure (Figure8).8). Finally, the SV we observed in a coding region involved a deletion in the end of the last exon (ENSGALE00000116074) of transcript ENSGALT00000038211.
By sampling a portion of the genome from four chicken lines using stringent SV detection constraints, we detected 188 SVs encompassing ~130 kb. Assuming considerable limitation in the detection of classes of SVs by our method, the chicken genome may differ in SVs to a greater extent than in SNPs. Therefore, we counted the total number of nucleotides involved. The majority of SVs identified by our method were small deletions, most of which resulted in a loss of repetitive motifs in intronic regions or a loss of unannotated sequences in intergenic regions. Both insertions mapped to intergenic regions as sequences of a few tens of base pairs and low complexity. We also predicted rearrangements in coding regions, revealed the exact breakpoints on the reference genome for 16 SVs, and confirmed our predictions. To what extent SVs in intronic and intergenic regions contribute to the evolution of the chicken genome or chicken phenotypes remains unclear, especially because the functions of these genomic regions are largely unknown . To date, studies involving the detection and exploitation of genetic variation in chicken encompass large SVs by means of CNVs but do not include smaller SVs. Our study reveals that, given their high frequency, these smaller SVs will need to be incorporated in genotyping because they might explain phenotypic differences. In addition, our data suggest that structural variation has contributed to genetic differentiation among current domesticated chicken breeds and the Red Jungle Fowl, and might have played a role in chicken genome evolution.
Currently, sequence-based genome-wide surveys of SVs involve the preparation of whole genome fragment libraries in combination with paired-end sequencing. Such approaches require relatively large investments, particularly if multiple individuals from multiple breeds have to be screened. This study demonstrated the potential of massive parallel paired-end sequencing of RRLs constructed from the pooled DNA of multiple individuals. SVs were predicted based on the read pair information from the paired-end sequenced small insert RRL, which was purposely created for SNP detection. The small RRL size allowed for PCR-based confirmation and characterization of the SV at the base pair level of acquired deletions and small insertions with minimal sequencing efforts. Revealing inversion and translocation breakpoints is much more laborious due to the limited information RRL approaches provide. We showed that read pair analysis of a paired-end sequenced RRL is already sufficient for obtaining a first glimpse of SVs in a particular sequenced species. This RRL based strategy put constraints on the quality of the reference genome because assembly errors will result in false positive SV predictions in reference based detection approaches. Uncertainty about the quality of assembly of some of the smaller micro-chromosomes together with computational limits at the time of this study were the reasons why we did not analyze the whole genome for SVs. An enhanced assembly of the chicken reference genome and the increasing computational power allow for improvement in the detection of SVs using our approach. Furthermore the use of multiple RRLs including large and small fragments pools, that are separately tagged and paired-end sequenced together in bulk, will considerably improve SV detection at small increase of cost. More demanding is PEM of a randomly sheared and size-selected whole genome library providing a more complete catalog of rearrangements characterized between a sample and a reference [1,19]. An even more complete picture including SVs of a larger size and more complex rearrangements will require paired-end sequencing of several libraries of different insert sizes . The detection of all structural variation, which requires whole genome sequencing and de novo assembly, is extremely demanding. However, the identification of (small) deletions and insertions with comparable or shorter length than the standard deviation of paired-end insert sizes requires de novo assemblies, because such SVs cannot reliably be identified by mapping approaches. Moreover, reference-based approaches, included mapping approaches, are biased to the completeness of the reference and, thus, ignore variants in regions that are missing from the reference genome due to structural variation. Finally, de novo assembly has the advantage of resolving SVs to a single base pair level, and inserted sequences can be obtained .
We used a NGS approach to identify genomic rearrangements within four commercial chicken breeds by comparing their genomes to the sequenced chicken genome (Red Jungle Fowl). We excluded several classes of sequence reads from further analysis, including reads that did not show the restriction enzyme tag and those that showed more than one mismatch in the alignment. The first constraint was applied to eliminate false positive insertion predictions due to a breakdown of the RRL resulting in shorter spans of paired-end reads, whereas the second constraint was applied to reduce the number of false predictions due to sequencing errors. However, we realize that by taking these measures we also discard many read pairs because of true nucleotide variation, which occur in one of every 200 bp in the chicken . The inclusion of read pairs with more than one mismatch in the alignment can be considered but has a risk of falsely predicted SVs due to mapping errors, requiring a revalidation of our proposed SV size deviation versus the observed frequency rule (Figure (Figure4).4). On the other hand, reducing the mapping constraints might reveal additional true SVs potentially hidden in the considerable fraction of read pairs with only one end or no end mapped to the reference when using our mapping constraints. However, this fraction of read pairs with mapping problems might also largely represent sequences of gaps in the genome (estimated to encompass ~100 Mb in total) and, thus, cannot be mapped.
Theoretically, our approach for identifying SVs allows the prediction of SVs and insight into how a predicted SV is distributed across breeds. We showed that the observed distribution of SVs is a good predictor for the actual distribution of the SV in breeds. Even with limited sampling, predicted SV distributions correlated with the PCR-based genotyping results of pooled samples (Table (Table3).3). In general, PCR-based genotyping revealed that predicted SVs are more widely shared in breeds than predicted by our sequencing-based estimation. This situation is caused by limited sampling, and the reduction of target sequence complexity by creating RRLs might have contributed to this difference. Our sampling regimen required enzyme recognition sequences flanking a SV within the size range for the RRL to include a particular SV in the RRL. Breed-specific SNPs in AluI sites may have caused one or both SV alleles to not be sampled and are, thus, not predicted to be present in that breed, consequently affecting our sequencing-based estimation of SV distribution across breeds. Conversely, our PCR-based genotyping approach with pooled samples was not affected by sampling limitation or AluI SNPs and revealed the presence of SVs in a breed even at allele frequencies of 0.1 (data not shown).
Because of the difference in the predicted presence of a SV in a breed and the genotyping results, we realize that the 186 SVs with which we estimated breed specificity might not be fully representative. The use of different RRL sizes (150-200 bp in layers and 125-200 in broilers) is reflected in a 1.5-2-fold difference in the SVs detected in broilers and layers. The fairly large percentage of SVs shared in broilers can be interpreted as being due to the effects of selection during line development by commercial companies and is consistent with the results of recent SNP genotyping , but it might be over-estimated in our study due to the difference in RRL construction. The percentage of predicted SVs shared by brown egg layers and broiler 1, however, is an indication that these breeds are more genetically related compared to the other breeds. Recent SNP genotyping results for brown and white egg layers and three broiler lines also indicated that the brown egg layer breed is more closely related to broiler lines than to white egg layers , which is in agreement with our conclusion based on SV distribution.
The reduction in the percentage of the genome covered by sequencing a RRL instead of randomly sampling the whole genome placed high constraints on the detection of SVs. The actual amount of SVs is likely much higher because we only sampled those that are flanked by restriction sites, and such that the intermediate sequence length of the variant was in the size range of the RRL. Large insertions were not expected to be detected because our RRL approach only allows for the detection of up to about 170 bp, the size between the maximum RRL fragment size (~200 bp) minus the mapping size of two completely overlapping reads (32 bp)
Although the larger SVs are most likely under-represented in our data due to the constraints of the applied detection method, we can conclude that the majority of SVs in the chicken genome are smaller than 1 kb (Figure (Figure6).6). This finding is consistent with human studies  in which SV abundance inversely correlated with SV size. We observed that 99% of the predicted SVs were located in intronic (43%) and intergenic regions (56%), which together comprise ~90% of the chicken genome. As expected, SVs were less abundant in coding regions because, like SNPs, they are more likely to have negative impacts and be eliminated by purifying selection. Moreover the observed lower abundance of SVs in coding region is consistent with the idea that the most common rearrangement mechanism requires substrates, such as microhomology, low copy repeats, and segmental duplications, which are more abundant in non-coding regions [10,32,33]. In 3 of 15 sequenced SV breakpoints, we were able to identify signatures in the DNA sequence indicating the mechanism by which SVs are formed. All identified signatures involved microhomology at the breakpoint junction that resulted from either nonhomologous end-joining or replication fork stalling and template switching events . Other SVs did not show a clear sequence signature.
We provided a first glimpse of the abundance and genomic locations of structural variation in the chicken genome by identifying 188, mostly small, rearrangements, some of which were in coding regions, though a majority was located in non-coding regions. Based on the present data, we expect to find thousands of small (<1 kb) and hundreds of larger rearrangements in the whole chicken genome, encompassing more nucleotides than SNPs, and that are putatively involved in phenotypic variation. We observed that structural variation has contributed to genetic differentiation among current domesticated chicken breeds and the Red Jungle Fowl. Finally, we showed that little sequencing effort on a reduced representation of a genome is sufficient for the detection and base pair level annotation of a variety of SVs in a sequenced genome.
Individual DNA samples were pooled according to breed and the genome complexity reduced by isolating a fraction of a complete genome digest. The isolated reduced representation library (RRL) was paired-end sequenced using Illumina genome Analyzer technology. The paired-end reads were aligned to the reference chicken genome WASHUC2 build and SVs are identified as significant differences between the mapping distances identified by the paired-end reads and the size range used for constructing the RRLs. Deletions relative to the reference genome were identified by paired ends spanning a genomic region in the reference genome longer than the size in the RRL, whereas insertions were identified by paired ends spanning a shorter genomic region in the reference sequence than expected based on the RRL. Inversion breakpoints were detected by paired ends that mapped in a different relative orientation compared to the reference genome.
Genomic DNA was extracted from 30 μl of blood from 25 unrelated F0 individuals from brown and white egg layer lines and two broiler lines consisting of 13 males and 12 females (Broiler 1) and 25 males (Broiler 2) using a Puregene DNA isolation kit (D-70KA; Gentra Systems, Inc., USA).
The RRLs were prepared by digesting 25 μg of pooled DNA using 1,000 units of the restriction enzyme AluI in a total volume of 240 μl. The selection of the restriction enzyme was based on the 10-fold reduction of genome complexity in the optimum size range (100-200 bp) of the sequencing technology platform (Genome Analyzer, Illumina). The digested DNA sample was fractionated on a 10% precast polyacrylamide gel (Biorad) at 100 V for 3 h and stained with ethidium bromide. The size fractions were sliced out of the gel and the DNA was mechanically sheared and and eluted over night in 300 μl recovery buffer (8 mM Tris pH 8.0, 0.08 mM EDTA, 1.25 M ammonium acetate. After a 15-min incubation at 65°C, the eluent was purified using a Montage DNA Gel Extraction Device (Millipore Corporation, Bedford, MA) and precipitated with isopropanol. The DNA was washed with ethanol and re-suspended in DNA hydration solution (Gentra Systems, Inc., USA).
We prepared the Genome Analyzer paired-end flow cell according to the manufacturer's protocol.
Five picomole aliquots of the RRLs were processed using the Illumina Cluster Generation Station (Illumina, Inc., USA) following the manufacturer's recommendations. The Illumina GAII Genome Analyzer (Illumina, Inc., USA) was programmed to produce a theoretical fixed read length of 36 bp.
Images from the instrument were processed using the manufacturer's software to generate FASTQ sequence files. Paired reads that had both the RRL restriction tag and a per base phred (Ewing and Green, 1998) quality score of at least 20 were selected using custom Perl scripts and aligned to the chicken genome (WASHUC2) using the MAQ  algorithm v0.7.1 with parameters -1 32 -2 32 -a 220.
Alignment results were analysed according to the MAQ  documentation by using custom perl and bash scripts. Paired reads in which one or both ends were mapped with more than one mismatch or mapped ambiguously on the reference sequence were excluded from analysis, as these would not reliably detect SVs. Discordantly mapping read pairs in which the two ends mapped >220 bp apart were classified as deletions and subsequently clustered based on overlapping mapping positions. SVs longer than 100 kb disrupted clustering and were excluded. Read pairs that mapped within 100 bp of each other were classified as insertions, whereas read pairs that mapped with one of the two ends in the incorrect orientation were classified as inversions. Both insertions and inversions were also clustered based on mapping positions by applying custom made Perl scripts.
For each SV cluster, we recorded the number of reads spanning the rearrangement, regardless of whether a normally mapping pair was observed or whether a sequence gap in the WASHUC2 build was present within the genomic range in which the deletion was predicted. SV clusters were prioritized for validation as follows: (i) an alternative mapping quality score of at least 60, (ii) both reads of a discordantly mapping pair mapped within a single predicted Ensembl exon or gene , and (iii) the genomic sequence flanking the SV allows primer design (Primer3Plus ) within 200 bp. We applied these criteria for selecting candidates distributed over the 220 bp-20 Kbp (deletions) and 32 bp-100 bp (insertions) size ranges. If these criteria yielded more than one candidate, the candidate with the highest alternative mapping quality score was selected.
Primers were designed to span the possible breakpoint by locating them 40-200 bp outside the mapping location of discordantly mapping read pairs. The minimum and maximum aberrant PCR product size was expected to be the sum of the minimum/maximum fragment size in the RLL and required flanking genomic region for primer development. PCR reactions were initially performed on DNA of the Red Jungle Fowl reference animal UCD001 and the pooled samples of all four breeds. For breeds in which the rearrangements were detected, individual samples were genotyped by PCR. The PCR products of homozygous individuals, or samples in which only the aberrantly sized product resulted, were sequenced on a conventional Sanger capillary sequencer and the results compared to the reference sequence using megablast with parameter -F F to identify breakpoints. Both ends of the PCR product on the reference (Red Jungle Fowl) were sequenced and mapped to the reference to ensure that it originated from the expected genomic position.
Confirmed SVs were defined as those for which PCR reactions resulted in a distinct band in the expected size range in at least the breed for which the rearrangement was predicted and with no matching band in the UCD001 reference animal. The PCR results had to be supported by unambiguous sequencing data mapping confirming the rearrangement.
The data from this paper have been submitted to the NCBI Short Read Archive (http://www.ncbi.nlm.nih.gov/Traces/sra/sra.cgi) under accession no. SRA026771.
The SVs identified in this study that have not been confirmed and annotated at the base pair level are available upon request, awaiting a central repository of structural variation in genomes.
HHDK designed and developed the SV prediction method and wrote the manuscript. BWD and RPMAC prepared the samples and performed the initial validation and genotyping analysis. AV and RO selected the animals to be sequenced and collected the samples. MAMG and RPMAC coordinated and supervised experiment implementation and assisted in the preparation of the manuscript. All authors read and approved the final manuscript.
We thank Mari Smits and Hendrik-Jan Megens for critically reading the manuscript and their helpful comments. This study was funded by European Union grant FOOD-CT-2004-506416 (Eadgene). Sequencing of the RRLs was funded by Cobb-Vantress Inc, USA and Hendrix Genetics, The Netherlands.